
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/ #10

Closed
blueardour opened this issue Feb 14, 2019 · 8 comments

Comments

@blueardour

hi, @ycszen

Sorry to disturb you again. After struggling with the code for a while, I got stuck at the Criterion part. It gave RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

I set CUDA_LAUNCH_BLOCKING=1 before running the script to get a more accurate message:

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [766,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [767,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [800,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=128 error=59 : device-side assert triggered
Traceback (most recent call last):
    loss = model(imgs, gts, cgts)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenp/workspace/git/TorchSeg/model/dfn/voc.dfn.R101_v1c/network.py", line 137, in forward
    loss0 = self.criterion(pred_out[0], label)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 1792, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

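For reference, a minimal sketch of how the flag can be set from inside the script (the exact training command is not shown here; setting it on the command line, e.g. CUDA_LAUNCH_BLOCKING=1 python train.py, works the same way):

    # Hypothetical: force synchronous kernel launches so the failing op is
    # reported at its actual call site; must run before CUDA is initialized.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # import torch only after the variable is set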

Do you have any experience with or advice on this?

@ycszen
Owner

ycszen commented Feb 15, 2019

Hi, in my experience this is mainly because a label value is outside the range [0, class_num). You should check whether there are negative values or values greater than or equal to class_num by printing the label values.
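A minimal sketch of such a check (the names gts, num_classes and the ignore value 255 are assumptions, not part of the repository):

    # Hypothetical sanity check on the label tensor before it reaches the loss.
    import torch

    def check_labels(gts, num_classes, ignore_index=255):
        vals = torch.unique(gts)
        print("label values:", vals.tolist())
        bad = vals[(vals != ignore_index) & ((vals < 0) | (vals >= num_classes))]
        if bad.numel() > 0:
            raise ValueError("labels out of range [0, %d): %s" % (num_classes, bad.tolist()))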

@hubutui

hubutui commented Feb 26, 2019

@ycszen Hi, I tried the Cityscapes dataset, and the label values are indeed outside the range 0-class_num. It seems the code doesn't map label_id to train index for the Cityscapes dataset? I remapped the labels to the range 0-class_num, but I still get this error:

26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
[00:00<?,?it/s]Traceback (most recent call last):
  File "train.py", line 123, in <module>
    loss = model(imgs, gts)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/parallel/distributed.py", line 459, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/Projects/TorchSeg-BiSeNet/model/bisenet/cityscapes.bisenet.R18/network.py", line 105, in forward
    aux_loss0 = self.ohem_criterion(self.heads[0](pred_out[0]), label)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/Projects/TorchSeg-BiSeNet/furnace/seg_opr/loss_opr.py", line 84, in forward
    index = mask_prob.argsort()
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/tensor.py", line 248, in argsort
    return torch.argsort(self, dim, descending)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/functional.py", line 648, in argsort
    return torch.sort(input, -1, descending)[1]
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 157, in <module>
    config.log_dir_link)
  File "/home/USER/Projects/TorchSeg-BiSeNet/furnace/engine/engine.py", line 154, in __exit__
    torch.cuda.empty_cache()
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 374, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

Exception ignored in: <bound method Event.__del__ of <torch.cuda.Event 0x34cf360>>
Traceback (most recent call last):
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/streams.py", line 167, in __del__
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 208, in check_error
torch.cuda.CudaError: device-side assert triggered (59)
terminate called without an active exception

Any suggestions?

@ycszen
Owner

ycszen commented Mar 5, 2019

@hubutui Could you show more details? Did you do the remap correctly?

@hubutui

hubutui commented Mar 5, 2019

@ycszen Hi, I use skimage.segmentation.relabel_sequential to remap the labels:

diff --git a/furnace/datasets/BaseDataset.py b/furnace/datasets/BaseDataset.py
index 7d8f6ef..bfcaa6b 100755
--- a/furnace/datasets/BaseDataset.py
+++ b/furnace/datasets/BaseDataset.py
@@ -10,6 +10,7 @@ import time
 import cv2
 import torch
 import numpy as np
+from skimage.segmentation import relabel_sequential
 
 import torch.utils.data as data
 
@@ -62,6 +63,15 @@ class BaseDataset(data.Dataset):
         if self.preprocess is not None and extra_dict is not None:
             output_dict.update(**extra_dict)
 
+        gt = gt.cpu().numpy()
+        trans_labels = [7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27,
+                        28, 31, 32, 33, 0]
+        trans_labels = np.array(trans_labels)
+        _, fw, inv = relabel_sequential(trans_labels)
+        gt[gt==255] = 0
+        gt = fw[gt]
+        gt = torch.from_numpy(np.ascontiguousarray(gt)).long()
+        output_dict['label'] = gt
         return output_dict
 
     def _fetch_data(self, img_path, gt_path, dtype=None):

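For what it's worth, a small stand-alone sketch of what the forward map from relabel_sequential produces for this label list (assuming skimage's default offset of 1, so 0 stays 0 and the remaining IDs are renumbered starting from 1):

    # Hypothetical check of the remap used above, run outside the dataset class.
    import numpy as np
    from skimage.segmentation import relabel_sequential

    trans_labels = np.array([7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24,
                             25, 26, 27, 28, 31, 32, 33, 0])
    _, fw, inv = relabel_sequential(trans_labels)
    # The non-zero IDs are renumbered in sorted order, e.g. 7 -> 1, 8 -> 2, ..., 33 -> 19.
    print(fw[trans_labels])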
@hanamizukigakki

hanamizukigakki commented Mar 14, 2019

Hi, have you solved this problem? I have the same issue as above; the error is as follows:

File "train.py", line 137, in
loss = model(imgs, gts)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/PyCharm_Projs/TorchSeg/model/bisenet/cityscapes.bisenet.R18.speed/network.py", line 109, in forward
aux_loss0 = self.ohem_criterion(self.heads0, label)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/PyCharm_Projs/TorchSeg/furnace/seg_opr/loss_opr.py", line 84, in forward
index = mask_prob.argsort()
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/tensor.py", line 248, in argsort
return torch.argsort(self, dim, descending)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/functional.py", line 648, in argsort
return torch.sort(input, -1, descending)[1]
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
], thread: [83,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Hoping for suggestions. Thanks.

@hanamizukigakki

I have solved the problem by mapping the labels with cityscapesscripts. Thanks a lot.

@jhch1995

@blueardour Hi, I'm hitting the same problem as you. Could you tell me how you solved it? Thanks a lot.

@tkingcer

@jhch1995 It might be caused by a mismatch between label IDs. You can try createTrainIdLabelImgs.py to generate the ***_gtFine_labelTrainIds.png files and use those files as training labels.
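For reference, a minimal sketch of the same id-to-trainId conversion done directly in Python with the cityscapesscripts label table (the array name gt_ids and its toy values are assumptions; createTrainIdLabelImgs.py applies this conversion to whole label images):

    # Hypothetical example: convert Cityscapes label IDs to train IDs.
    import numpy as np
    from cityscapesscripts.helpers.labels import labels  # list of Label namedtuples

    # Build a lookup table: original id -> trainId (255 marks ignored classes).
    id_to_trainid = np.full(256, 255, dtype=np.uint8)
    for label in labels:
        if 0 <= label.id < 256:
            id_to_trainid[label.id] = label.trainId

    gt_ids = np.array([[7, 8, 33], [0, 255, 26]], dtype=np.uint8)  # toy ground truth
    gt_trainids = id_to_trainid[gt_ids]  # e.g. 7 -> 0, 8 -> 1, 33 -> 18
    print(gt_trainids)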

@ycszen ycszen closed this as completed Aug 1, 2019