
GPU goes away after an error occurs #1010

Closed
jihunchoi opened this issue Mar 16, 2017 · 11 comments

jihunchoi (Contributor) commented Mar 16, 2017

After upgrading to the latest source build, I get strange error messages whenever my code contains a bug.
For example, when I run the CPU-only code below, the error message is clear:

import torch
import torch.nn as nn
from torch.autograd import Variable

emb = nn.Embedding(10, 10)
inds = Variable(torch.LongTensor([1, -1]))
emb(inds)
RuntimeError: index out of range at /home/my_name/pytorch/torch/lib/TH/generic/THTensorMath.c:273

However, when I run the same code on the GPU, the error message becomes:

emb.cuda()
emb(inds.cuda())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/my_name/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

which is much less comprehensible than the CPU version.

After some googling, I found a related issue on the cutorch repo: torch/cutorch#708.

The first problem is manageable on its own, since I can still debug and fix the code by setting some environment variables.
However, when I terminate the Python process where the error occurred, the OS stops recognizing the GPU, and the only way I know of to bring it back is to reboot the entire system.
Specifically, when I run nvidia-smi after terminating the process, the OS cannot find the device:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/my_name/pytorch/torch/lib/THC/generic/THCTensorCopy.c:65

In [10]: !nvidia-smi
Thu Mar 16 11:27:20 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.44                 Driver Version: 367.44                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0     Off |                  N/A |
|  0%   46C    P2    45W / 166W |    237MiB /  8113MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1802    C   /home/my_name/anaconda3/bin/python              235MiB |
+-----------------------------------------------------------------------------+

In [11]:
Do you really want to exit ([y]/n)? y


~ ⌚ 11:27:32
$ nvidia-smi
No devices were found
apaszke (Contributor) commented Mar 18, 2017

@ngimel can triggering an assert cause the GPU to go down? Does one have to reset it afterwards?

apaszke added the awaiting response (this tag is deprecated) label Mar 18, 2017
ngimel (Collaborator) commented Mar 18, 2017

After triggering an assert, cudaDeviceReset has to be called (see http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion). But the device should be reset when the process terminates anyway, so it's strange that a reboot is needed. Perhaps Python is not cleaning something up.
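PyTorch does not expose that call directly; below is a minimal sketch of invoking it by hand from Python via ctypes, assuming libcudart.so can be found by the dynamic loader (the exact soname varies per install), purely for illustration:

import ctypes

# Load the CUDA runtime and call cudaDeviceReset(), which tears down the
# context left in the "sticky" assert state.
_cudart = ctypes.CDLL("libcudart.so")

def cuda_device_reset():
    err = _cudart.cudaDeviceReset()  # returns a cudaError_t; 0 means cudaSuccess
    if err != 0:
        raise RuntimeError("cudaDeviceReset failed with CUDA error %d" % err)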

Maratyszcza (Contributor) commented
@jihunchoi Do you see any errors in the kernel log (dmesg)?

jihunchoi (Contributor, Author) commented
Quite strange -- I saw the same problem through several rounds of running and rebooting, but then it suddenly resolved itself.
The only weird thing is that GPU utilization stays at 100% even after the process terminates.
However, I can still reuse the GPU, and utilization drops back to 0% once I do.

In dmesg I can't see any GPU-related logs. (I grepped the dmesg output for the keywords NVIDIA and GPU, but nothing unusual shows up.)
I seem to be the only person who has experienced this kind of problem, so I think this issue can be closed. If the same problem occurs again, or if I figure out what caused it, I will ping again.

chengyangfu commented
I have the same problem.

THCudaCheck FAIL file=/net/bvisionserver1/playpen2/chengyangfu/pytorch/torch/lib/THC/generic/THCTensorCopy.c line=18 error=59 : device-side assert triggered RuntimeError: RuntimeE...y.c:18',)

This error happens on this line:

loss_l = Variable(torch.FloatTensor([0])).cuda()

It started after I updated my source build (3~5) days ago.

soumith (Member) commented Mar 26, 2017

@chengyangfu The device-side assert happens elsewhere, but you see the stack trace at that line. To get the exact location of the device-side assert, run your program with CUDA_LAUNCH_BLOCKING=1 python main.py (or whatever replaces main.py in your case).
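If editing the launch command is awkward (for example in a notebook), a minimal alternative sketch is to set the variable from Python before anything initializes CUDA:

import os

# Must be set before the first CUDA call so kernel launches run synchronously
# and the traceback points at the line that actually triggered the assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the variable on purpose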

chengyangfu commented
@soumith Thanks for the advice. It turns out my code accidentally uses -1 as a label in the cross-entropy loss function, which triggers the runtime error.
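For anyone hitting the same thing, here is a small sketch of that failure mode and one possible fix; the shapes are made up, and ignore_index is a standard nn.CrossEntropyLoss option for skipping a sentinel label:

import torch
import torch.nn as nn
from torch.autograd import Variable

logits = Variable(torch.randn(4, 10))               # batch of 4, 10 classes
labels = Variable(torch.LongTensor([3, 7, -1, 2]))  # -1 is out of range

# nn.CrossEntropyLoss()(logits, labels) fails: a readable error on CPU,
# "device-side assert triggered" on GPU.

# One fix: treat -1 as a padding label that the loss should skip.
criterion = nn.CrossEntropyLoss(ignore_index=-1)
loss = criterion(logits, labels)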

ankitmishra262 commented
Another common cause of this error is when the number of classes in the labels does not match the number of units in the final (softmaxed) Linear layer of a classification model. I recently ran into this by accident.
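A minimal sketch of that mismatch, with made-up sizes:

import torch
import torch.nn as nn
from torch.autograd import Variable

num_classes = 10
fc = nn.Linear(32, 5)                           # bug: only 5 output units
x = Variable(torch.randn(4, 32))
y = Variable(torch.LongTensor([1, 9, 3, 7]))    # labels 9 and 7 index past the 5 outputs

# nn.CrossEntropyLoss()(fc(x), y) raises the device-side assert on the GPU.
# Fix: make the final layer as wide as the label set.
fc = nn.Linear(32, num_classes)
loss = nn.CrossEntropyLoss()(fc(x), y)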

ShoufaChen commented
Common reasons I have seen:

  1. The input to the loss function may be out of range, e.g. BCELoss() requires both input and target to satisfy 0 <= x <= 1, which means we may need a sigmoid or softmax after the fc layer. It is also usually a better choice to use BCEWithLogitsLoss() (see the sketch after this list).
  2. As mentioned above, the number of label classes may be inconsistent with the number of network output classes.
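A short sketch of point 1, with made-up shapes:

import torch
import torch.nn as nn
from torch.autograd import Variable

fc = nn.Linear(16, 1)
x = Variable(torch.randn(8, 16))
target = Variable(torch.rand(8, 1))   # targets already in [0, 1]

# BCELoss expects probabilities, so squash the fc output first:
prob = torch.sigmoid(fc(x))
loss_a = nn.BCELoss()(prob, target)

# Or skip the explicit sigmoid: BCEWithLogitsLoss applies it internally
# and is more numerically stable.
loss_b = nn.BCEWithLogitsLoss()(fc(x), target)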

drcege commented Jan 21, 2018

@ShoufaChen Recently I often get Assertion input >= 0. && input <= 1. failed when using BCELoss. I apply some custom operations before feeding into BCELoss, actually the product of two softmaxes, so in theory the product can never be greater than one.
Based on my debugging, when the error occurs the input prints as 1, but I cannot see the exact value. I think this may be related to floating-point precision?
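If it really is a precision artifact, one possible workaround (a sketch, not a confirmed fix) is to clamp the product back into range before the loss:

import torch
import torch.nn.functional as F

a = F.softmax(torch.randn(4, 3), dim=1)
b = F.softmax(torch.randn(4, 3), dim=1)
p = a * b                             # mathematically within [0, 1]

# Guard against values drifting just past the bounds in float arithmetic
# before handing p to BCELoss:
p = p.clamp(min=0.0, max=1.0)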

nabihach commented Sep 9, 2018

I often encounter this error when the initialization of a layer is bad. For instance, yesterday I was getting this error because I initialized an attention layer using nn.init.uniform_(tensor, 0, 1). When I changed it to nn.init.uniform(tensor, -0.5, 0.5), the issue went away.
