
GPU goes away after an error occurs #1010

Closed
jihunchoi opened this issue Mar 16, 2017 · 11 comments

jihunchoi (Contributor) commented Mar 16, 2017

After upgrading to the latest source build, I get strange error messages whenever my code contains a bug.
For example, when I run the CPU-only code below, the error message is clear:

import torch
import torch.nn as nn
from torch.autograd import Variable

emb = nn.Embedding(10, 10)
inds = Variable(torch.LongTensor([1, -1]))
emb(inds)
RuntimeError: index out of range at /home/my_name/pytorch/torch/lib/TH/generic/THTensorMath.c:273

However, when I run the same code on the GPU, the error message becomes:

emb.cuda()
emb(inds.cuda())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/my_name/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

which is much less comprehensible than the CPU version.

After some googling, I found a related issue on the cutorch repo: torch/cutorch#708.

The first problem is manageable on its own, since I can still debug and fix the code by setting some environment variables.
However, when I terminate the Python process where the error occurred, the OS stops recognizing the GPU, and the only way I know of to bring it back is to reboot the entire system.
Specifically, when I run nvidia-smi after terminating the process, the OS cannot find the device:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/my_name/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

In [10]: !nvidia-smi
Thu Mar 16 11:27:20 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0     Off |                  N/A |
|  0%   46C    P2    45W / 166W |    237MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1802    C   /home/my_name/anaconda3/bin/python              235MiB |
+-----------------------------------------------------------------------------+

In [11]:
Do you really want to exit ([y]/n)? y


~ ⌚ 11:27:32
$ nvidia-smi
No devices were found
apaszke (Contributor) commented Mar 18, 2017

@ngimel can triggering an assert cause the GPU to go down? Does one have to reset it afterwards?

apaszke added the awaiting response (this tag is deprecated) label Mar 18, 2017
ngimel (Collaborator) commented Mar 18, 2017

After triggering an assert, cudaDeviceReset has to be called (see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion). But the device should be reset when the process terminates anyway, so it's strange that a reboot is needed. Perhaps Python is not cleaning something up.
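PyTorch does not expose that call directly; below is a minimal sketch of invoking it by hand from Python via ctypes, assuming libcudart.so can be found by the dynamic loader (the exact soname varies per install), purely for illustration:

import ctypes

# Load the CUDA runtime and call cudaDeviceReset(), which tears down the
# context left in the "sticky" assert state.
_cudart = ctypes.CDLL("libcudart.so")

def cuda_device_reset():
    err = _cudart.cudaDeviceReset()  # returns a cudaError_t; 0 means cudaSuccess
    if err != 0:
        raise RuntimeError("cudaDeviceReset failed with CUDA error %d" % err)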

Maratyszcza (Contributor) commented
@jihunchoi Do you see any errors in the kernel log (dmesg)?

jihunchoi (Contributor, Author) commented
Quite strange -- I saw the same problem through several rounds of running and rebooting, but then it suddenly resolved itself.
The only weird thing is that GPU utilization stays at 100% even after the process terminates.
However, I can still reuse the GPU, and utilization drops back to 0% once I do.

In dmesg I can't see any GPU-related logs. (I grepped the dmesg output for the keywords NVIDIA and GPU, but nothing unusual shows up.)
I seem to be the only person who has experienced this kind of problem, so I think this issue can be closed. If the same problem occurs again, or if I figure out what caused it, I will ping again.

chengyangfu commented
I have the same problem.

THCudaCheck FAIL file=/net/bvisionserver1/playpen2/chengyangfu/pytorch/torch/lib/THC/generic/THCTensorCopy.c line=18 error=59 : device-side assert triggered RuntimeError: RuntimeE...y.c:18',)

This error happens on this line:

loss_l = Variable(torch.FloatTensor([0])).cuda()

It started after I updated my source build (3~5) days ago.

soumith (Member) commented Mar 26, 2017

@chengyangfu The device-side assert happens elsewhere, but you see the stack trace at that line. To get the exact location of the device-side assert, run your program with CUDA_LAUNCH_BLOCKING=1 python main.py (or whatever replaces main.py in your case).
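If editing the launch command is awkward (for example in a notebook), a minimal alternative sketch is to set the variable from Python before anything initializes CUDA:

import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and the traceback points at the line that actually triggered the assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable on purpose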

chengyangfu commented
@soumith Thanks for the advice. It turns out my code accidentally uses -1 as a label in the cross-entropy loss function, which triggers the runtime error.
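For anyone hitting the same thing, here is a small sketch of that failure mode and one possible fix; the shapes are made up, and ignore_index is a standard nn.CrossEntropyLoss option for skipping a sentinel label:

import torch
import torch.nn as nn
from torch.autograd import Variable

logits = Variable(torch.randn(4, 10))               # batch of 4, 10 classes
labels = Variable(torch.LongTensor([3, 7, -1, 2]))  # -1 is out of range

# nn.CrossEntropyLoss()(logits, labels) fails: a readable error on CPU,
# "device-side assert triggered" on GPU.

# One fix: treat -1 as a padding label that the loss should skip.
criterion = nn.CrossEntropyLoss(ignore_index=-1)
loss = criterion(logits, labels)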

ankitmishra262 commented
Another common cause of this error is when the number of classes in the labels does not match the number of units in the final (softmaxed) Linear layer of a classification model. I recently ran into this by accident.
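A minimal sketch of that mismatch, with made-up sizes:

import torch
import torch.nn as nn
from torch.autograd import Variable

num_classes = 10
fc = nn.Linear(32, 5)                           # bug: only 5 output units
x = Variable(torch.randn(4, 32))
y = Variable(torch.LongTensor([1, 9, 3, 7]))    # labels 9 and 7 index past the 5 outputs

# nn.CrossEntropyLoss()(fc(x), y) raises the device-side assert on the GPU.
# Fix: make the final layer as wide as the label set.
fc = nn.Linear(32, num_classes)
loss = nn.CrossEntropyLoss()(fc(x), y)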

ShoufaChen commented
Common reasons I have seen:

  1. The input to the loss function may be out of range, e.g. BCELoss() requires both input and target to satisfy 0 <= x <= 1, which means we may need a sigmoid or softmax after the fc layer. It is also usually a better choice to use BCEWithLogitsLoss() (see the sketch after this list).
  2. As mentioned above, the number of label classes may be inconsistent with the number of network output classes.
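A short sketch of point 1, with made-up shapes:

import torch
import torch.nn as nn
from torch.autograd import Variable

fc = nn.Linear(16, 1)
x = Variable(torch.randn(8, 16))
target = Variable(torch.rand(8, 1))   # targets already in [0, 1]

# BCELoss expects probabilities, so squash the fc output first:
prob = torch.sigmoid(fc(x))
loss_a = nn.BCELoss()(prob, target)

# Or skip the explicit sigmoid: BCEWithLogitsLoss applies it internally
# and is more numerically stable.
loss_b = nn.BCEWithLogitsLoss()(fc(x), target)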

drcege commented Jan 21, 2018

@ShoufaChen Recently I often get Assertion input >= 0. && input <= 1. failed when using BCELoss. I apply some custom operations before feeding into BCELoss, actually the product of two softmaxes, so in theory the product can never be greater than one.
Based on my debugging, when the error occurs the input prints as 1, but I cannot see the exact value. I think this may be related to floating-point precision?
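If it really is a precision artifact, one possible workaround (a sketch, not a confirmed fix) is to clamp the product back into range before the loss:

import torch
import torch.nn.functional as F

a = F.softmax(torch.randn(4, 3), dim=1)
b = F.softmax(torch.randn(4, 3), dim=1)
p = a * b                             # mathematically within [0, 1]

# Guard against values drifting just past the bounds in float arithmetic
# before handing p to BCELoss:
p = p.clamp(min=0.0, max=1.0)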

nabihach commented Sep 9, 2018

I often encounter this error when the initialization of a layer is bad. For instance, yesterday I was getting this error because I initialized an attention layer using nn.init.uniform_(tensor, 0, 1). When I changed it to nn.init.uniform(tensor, -0.5, 0.5), the issue went away.
