
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/ #10

Closed
blueardour opened this issue Feb 14, 2019 · 8 comments

Comments

@blueardour

hi, @ycszen

Sorry to disturb you again. After struggling with the code for a while, I got stuck at the Criterion part. It gave RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

I set CUDA_LAUNCH_BLOCKING=1 before running the script to get a more accurate message:

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [766,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [767,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [11,0,0], thread: [800,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu line=128 error=59 : device-side assert triggered
Traceback (most recent call last):
    loss = model(imgs, gts, cgts)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenp/workspace/git/TorchSeg/model/dfn/voc.dfn.R101_v1c/network.py", line 137, in forward
    loss0 = self.criterion(pred_out[0], label)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 904, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 1970, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/chenp/.pyenv/versions/3.6.8/lib/python3.6/site-packages/torch/nn/functional.py", line 1792, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THCUNN/generic/SpatialClassNLLCriterion.cu:128

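For reference, a minimal sketch of how the flag can be set from inside the script (the exact training command is not shown here; setting it on the command line, e.g. CUDA_LAUNCH_BLOCKING=1 python train.py, works the same way):

    # Hypothetical: force synchronous kernel launches so the failing op is
    # reported at its actual call site; must run before CUDA is initialized.
    import os
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # import torch only after the variable is set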

Do you have any experience with or advice on this?

@ycszen
Owner

ycszen commented Feb 15, 2019

Hi, in my experience this is mainly because a label value is outside the range [0, class_num). You should check whether there are negative values or values greater than or equal to class_num by printing the label values.
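A minimal sketch of such a check (the names gts, num_classes and the ignore value 255 are assumptions, not part of the repository):

    # Hypothetical sanity check on the label tensor before it reaches the loss.
    import torch

    def check_labels(gts, num_classes, ignore_index=255):
        vals = torch.unique(gts)
        print("label values:", vals.tolist())
        bad = vals[(vals != ignore_index) & ((vals < 0) | (vals >= num_classes))]
        if bad.numel() > 0:
            raise ValueError("labels out of range [0, %d): %s" % (num_classes, bad.tolist()))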

@hubutui

hubutui commented Feb 26, 2019

@ycszen Hi, I tried the Cityscapes dataset, and the label values are indeed outside the range 0-class_num. It seems the code doesn't map label_id to train index for the Cityscapes dataset? I remapped the labels to the range 0-class_num, but I still get this error:

26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
26 21:25:43 PyTorch Version 1.0.1.post2, Furnace Version 0.1.1
[00:00<?,?it/s]Traceback (most recent call last):
  File "train.py", line 123, in <module>
    loss = model(imgs, gts)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/apex-0.1-py3.5-linux-x86_64.egg/apex/parallel/distributed.py", line 459, in forward
    result = self.module(*inputs, **kwargs)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/Projects/TorchSeg-BiSeNet/model/bisenet/cityscapes.bisenet.R18/network.py", line 105, in forward
    aux_loss0 = self.ohem_criterion(self.heads[0](pred_out[0]), label)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/USER/Projects/TorchSeg-BiSeNet/furnace/seg_opr/loss_opr.py", line 84, in forward
    index = mask_prob.argsort()
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/tensor.py", line 248, in argsort
    return torch.argsort(self, dim, descending)
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/functional.py", line 648, in argsort
    return torch.sort(input, -1, descending)[1]
RuntimeError: CUDA error: device-side assert triggered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 157, in <module>
    config.log_dir_link)
  File "/home/USER/Projects/TorchSeg-BiSeNet/furnace/engine/engine.py", line 154, in __exit__
    torch.cuda.empty_cache()
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 374, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered

Exception ignored in: <bound method Event.__del__ of <torch.cuda.Event 0x34cf360>>
Traceback (most recent call last):
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/streams.py", line 167, in __del__
  File "/home/USER/BiSeNet-official-env/lib/python3.5/site-packages/torch/cuda/__init__.py", line 208, in check_error
torch.cuda.CudaError: device-side assert triggered (59)
terminate called without an active exception

Any suggestions?

@ycszen
Owner

ycszen commented Mar 5, 2019

@hubutui Could you show more details? Did you do the remap correctly?

@hubutui

hubutui commented Mar 5, 2019

@ycszen Hi, I use skimage.segmentation.relabel_sequential to remap the labels:

diff --git a/furnace/datasets/BaseDataset.py b/furnace/datasets/BaseDataset.py
index 7d8f6ef..bfcaa6b 100755
--- a/furnace/datasets/BaseDataset.py
+++ b/furnace/datasets/BaseDataset.py
@@ -10,6 +10,7 @@ import time
 import cv2
 import torch
 import numpy as np
+from skimage.segmentation import relabel_sequential
 
 import torch.utils.data as data
 
@@ -62,6 +63,15 @@ class BaseDataset(data.Dataset):
         if self.preprocess is not None and extra_dict is not None:
             output_dict.update(**extra_dict)
 
+        gt = gt.cpu().numpy()
+        trans_labels = [7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27,
+                        28, 31, 32, 33, 0]
+        trans_labels = np.array(trans_labels)
+        _, fw, inv = relabel_sequential(trans_labels)
+        gt[gt==255] = 0
+        gt = fw[gt]
+        gt = torch.from_numpy(np.ascontiguousarray(gt)).long()
+        output_dict['label'] = gt
         return output_dict
 
     def _fetch_data(self, img_path, gt_path, dtype=None):

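For what it's worth, a small stand-alone sketch of what the forward map from relabel_sequential produces for this label list (assuming skimage's default offset of 1, so 0 stays 0 and the remaining IDs are renumbered starting from 1):

    # Hypothetical check of the remap used above, run outside the dataset class.
    import numpy as np
    from skimage.segmentation import relabel_sequential

    trans_labels = np.array([7, 8, 11, 12, 13, 17, 19, 20, 21, 22, 23, 24,
                             25, 26, 27, 28, 31, 32, 33, 0])
    _, fw, inv = relabel_sequential(trans_labels)
    # The non-zero IDs are renumbered in sorted order, e.g. 7 -> 1, 8 -> 2, ..., 33 -> 19.
    print(fw[trans_labels])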
@hanamizukigakki

hanamizukigakki commented Mar 14, 2019

Hi, have you solved this problem? I have the same issue as above; the error is as follows:

File "train.py", line 137, in
loss = model(imgs, gts)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/PyCharm_Projs/TorchSeg/model/bisenet/cityscapes.bisenet.R18.speed/network.py", line 109, in forward
aux_loss0 = self.ohem_criterion(self.heads0, label)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/root/PyCharm_Projs/TorchSeg/furnace/seg_opr/loss_opr.py", line 84, in forward
index = mask_prob.argsort()
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/tensor.py", line 248, in argsort
return torch.argsort(self, dim, descending)
File "/root/conda/envs/conda_py36/lib/python3.6/site-packages/torch/functional.py", line 648, in argsort
return torch.sort(input, -1, descending)[1]
RuntimeError: merge_sort: failed to synchronize: device-side assert triggered
], thread: [83,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Hoping for suggestions. Thanks.

@hanamizukigakki

I have solved the problem by mapping the labels with cityscapesscripts. Thanks a lot.

@jhch1995

@blueardour Hi, I'm hitting the same problem as you. Could you tell me how you solved it? Thanks a lot.

@tkingcer

@jhch1995 It might be caused by a mismatch between label IDs. You can try createTrainIdLabelImgs.py to generate the ***_gtFine_labelTrainIds.png files and use those files as training labels.
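For reference, a minimal sketch of the same id-to-trainId conversion done directly in Python with the cityscapesscripts label table (the array name gt_ids and its toy values are assumptions; createTrainIdLabelImgs.py applies this conversion to whole label images):

    # Hypothetical example: convert Cityscapes label IDs to train IDs.
    import numpy as np
    from cityscapesscripts.helpers.labels import labels  # list of Label namedtuples

    # Build a lookup table: original id -> trainId (255 marks ignored classes).
    id_to_trainid = np.full(256, 255, dtype=np.uint8)
    for label in labels:
        if 0 <= label.id < 256:
            id_to_trainid[label.id] = label.trainId

    gt_ids = np.array([[7, 8, 33], [0, 255, 26]], dtype=np.uint8)  # toy ground truth
    gt_trainids = id_to_trainid[gt_ids]  # e.g. 7 -> 0, 8 -> 1, 33 -> 18
    print(gt_trainids)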

@ycszen ycszen closed this as completed Aug 1, 2019