
RuntimeError: CUDA error: device-side assert triggered #44

Closed
azamatolegen opened this issue Nov 14, 2019 · 3 comments

Comments

@azamatolegen

azamatolegen commented Nov 14, 2019

My code:

from simpletransformers.classification import ClassificationModel
import pandas as pd

# The CSV files have no header row: column 0 is the label, column 1 is the text.
train_df = pd.read_csv('data/train.csv', header=None)
eval_df = pd.read_csv('data/test.csv', header=None)

# Binarize the labels: 2 becomes 1, every other value becomes 0.
train_df[0] = (train_df[0] == 2).astype(int)
eval_df[0] = (eval_df[0] == 2).astype(int)

train_df = pd.DataFrame({
    'text': train_df[1].replace(r'\n', ' ', regex=True),
    'label': train_df[0]
})
eval_df = pd.DataFrame({
    'text': eval_df[1].replace(r'\n', ' ', regex=True),
    'label': eval_df[0]
})

model = ClassificationModel('xlm', 'model/', args={'fp16': False})
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)

Error:

Features loaded from cache at cache_dir/cached_train_xlm_128_binary
Epoch: 0%| | 0/1 [00:00<?, ?it/s
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [5,0,0] Assertion t >= 0 && t < n_classes failed.
/opt/conda/conda-bld/pytorch_1570710853631/work/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "run1.py", line 24, in
model.train_model(train_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 162, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 235, in train
print("\rRunning loss: %f" % loss, end="")
RuntimeError: CUDA error: device-side assert triggered

Could you please help me figure out how to fix that? Thank you!

@ThilinaRajapakse
Owner

ThilinaRajapakse commented Nov 14, 2019

Try reprocessing the data, as it's being loaded from the cache here.

model = ClassificationModel('xlm', 'model/', args={'fp16': False, 'reprocess_input_data': True})

Also, make sure that the model you are loading from model/ was built with the same num_classes as your data.
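The assertion in the log above (t >= 0 && t < n_classes) fails when a target label lies outside the range the classification head expects. As a minimal sketch, assuming integer labels, the data can be checked before training. `find_out_of_range_labels` is a hypothetical helper written for illustration, not part of simpletransformers:

```python
from collections import Counter


def find_out_of_range_labels(labels, num_classes):
    """Count label values that fall outside the valid range [0, num_classes)."""
    bad = Counter()
    for label in labels:
        value = int(label)
        if value < 0 or value >= num_classes:
            bad[value] += 1
    return bad


if __name__ == "__main__":
    # Made-up labels for illustration; 2 and -1 are invalid for a 2-class model.
    sample_labels = [0, 1, 1, 2, 0, -1]
    offenders = find_out_of_range_labels(sample_labels, num_classes=2)
    if offenders:
        print("Out-of-range labels and their counts:", dict(offenders))
    else:
        print("All labels are within range.")
```

With the DataFrame from the original post, the same check could be run as find_out_of_range_labels(train_df['label'], 2), since the helper accepts any iterable of integer-like values. An empty result suggests the labels themselves are fine and that a stale cache or a mismatched saved model is the more likely cause.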

@azamatolegen
Author

It helped, thanks!
I am facing another error now:

Converting to features started.
100%|###################################| 560000/560000 [03:49<00:00, 2441.14it/s]
Running loss: 0.724930
| 0/1 [00:00<?, ?it/s]
| 29999/70000 [1:47:18<2:23:03, 4.66it/s]
Traceback (most recent call last):
File "run1.py", line 24, in
model.train_model(train_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 162, in train_model
global_step, tr_loss = self.train(train_dataset, output_dir, show_running_loss=show_running_loss, eval_df=eval_df)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/simpletransformers/classification/classification_model.py", line 277, in train
model_to_save.save_pretrained(output_dir_current)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/modeling_utils.py", line 204, in save_pretrained
torch.save(model_to_save.state_dict(), output_model_file)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 260, in save
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 185, in _with_file_like
return body(f)
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 260, in
return _with_file_like(f, "wb", lambda f: _save(obj, f, pickle_module, pickle_protocol))
File "/home/data/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 338, in _save
serialized_storages[key]._write_file(f, _should_read_directly(f))
RuntimeError: write(): fd 29 failed with No space left on device

I am using an RTX 2080 with 8 GB. How can I deal with that? Thank you very much for your time and help!

@ThilinaRajapakse
Owner

You are running out of storage (HDD/SSD). Clearing some space on your drive should do the trick.
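The traceback above fails inside torch.save while a checkpoint is being written, so the limit is disk space rather than GPU memory. A minimal sketch for checking free space before a long run, using only the standard library; the helper names and the threshold are assumptions chosen for illustration:

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024 ** 3


def has_room_for(path, required_gib):
    """Return True if at least `required_gib` GiB are free at `path`."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # "." stands in for whatever directory the checkpoints are written to.
    print(f"Free space: {free_gib('.'):.1f} GiB")
    if not has_room_for(".", required_gib=10.0):  # hypothetical threshold
        print("Warning: little disk space left for checkpoints.")
```

Pointing the check at the directory where checkpoints are saved, and comparing the result against the size of one saved model times the number of checkpoints expected, gives a rough idea of whether a run will fit.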
