RuntimeError: CUDA error: device-side assert triggered #44

azamatolegen · 2019-11-14T04:26:31Z

My code:

from simpletransformers.classification import ClassificationModel
import pandas as pd
train_df = pd.read_csv('data/train.csv', header=None)
eval_df = pd.read_csv('data/test.csv', header=None)
train_df[0] = (train_df[0] == 2).astype(int)
eval_df[0] = (eval_df[0] == 2).astype(int)
train_df = pd.DataFrame({
    'text': train_df[1].replace(r'\n', ' ', regex=True),
    'label': train_df[0]
})
eval_df = pd.DataFrame({
    'text': eval_df[1].replace(r'\n', ' ', regex=True),
    'label': eval_df[0]
})
model = ClassificationModel('xlm', 'model/', args=({'fp16': False}))
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

Error:

Features loaded from cache at cache_dir/cached_train_xlm_128_binary
Epoch: 0%| | 0/1 [00:00<?, ?it/s/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "run1.py", line 24, in <module>
model.train_model(train_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 162, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 235, in train
print("\rRunning loss: %f" % loss, end="")
RuntimeError: CUDA error: device-side assert triggered

Could you please help me figure out how to fix this? Thank you!


ThilinaRajapakse · 2019-11-14T04:44:39Z

Try reprocessing the data, as it's being loaded from the cache here.

model = ClassificationModel('xlm', 'model/', args=({'fp16': False, 'reprocess_input_data': True}))

Also, make sure that the model you are loading from model/ was built with the same num_classes as your data.
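
The assertion in the log (t >= 0 && t < n_classes) fires when a target label falls outside the range the model's output layer was built for. A minimal sketch of a pre-training sanity check, assuming integer labels; the helper name out_of_range_labels is hypothetical and not part of simpletransformers:

```python
def out_of_range_labels(labels, num_classes):
    """Return the sorted distinct label values outside [0, num_classes).

    `labels` can be any iterable of integer-like values, for example
    a pandas column such as train_df['label'].
    """
    return sorted({int(t) for t in labels if not 0 <= int(t) < num_classes})


# Illustrative data only: a label of 2 is invalid for a 2-class model.
print(out_of_range_labels([0, 1, 1, 2, 0], num_classes=2))
```

Running a check like this on the label column before calling train_model should surface bad labels on the CPU, where the failure is much easier to read than a device-side assert.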

azamatolegen · 2019-11-14T07:50:20Z

It helped, thanks!
I am facing another error now:

Converting to features started.
100%|###################################| 560000/560000 [03:49<00:00, 2441.14it/s]
Running loss: 0.724930Traceback (most recent call last): | 0/1 [00:00<?, ?it/s]
File "run1.py", line 24, in <module> | 29999/70000 [1:47:18<2:23:03, 4.66it/s]
model.train_model(train_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 162, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 277, in train
model_to_save.save_pretrained(output_dir_current)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_utils.py", line 204, in save_pretrained
torch.save(model_to_save.state_dict(), output_model_file)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 260, in save
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 185, in _with_file_like
return body(f)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 260, in <lambda>
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 338, in _save
serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: write(): fd 29 failed with No space left on device

I am using an RTX 2080 with 8 GB. How can I deal with that? Thank you very much for your time and help!

ThilinaRajapakse · 2019-11-14T08:01:20Z

You are running out of storage (HDD/SSD). Clearing some space on your drive should do the trick.
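
The traceback above fails inside save_pretrained, that is, while a checkpoint is being written to disk, so GPU memory is not involved. A minimal sketch for checking free disk space before a long run, using only the standard library; the helper names and the 5 GiB threshold are assumptions for illustration, not values from simpletransformers:

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path=".", needed_gib=5.0):
    """True if at least `needed_gib` GiB are free (the threshold is an assumption)."""
    return free_gib(path) >= needed_gib


# Check the directory where checkpoints will be written before training.
if not has_room(".", needed_gib=5.0):
    print("Low disk space: %.1f GiB free" % free_gib("."))
```

Pointing path at the model's output directory, rather than the working directory, gives the relevant number when the two live on different drives.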

azamatolegen closed this as completed Nov 16, 2019
