GPU goes away after an error occurs #1010
Comments
@ngimel can triggering an assert cause the GPU to go down? Does one have to reset it afterwards?
After triggering an assert, cudaDeviceReset has to be called: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#assertion. But the device should be reset when the process terminates anyway, so it's strange that a reboot is needed. Perhaps Python is not cleaning something up.
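For reference, a minimal sketch of the mechanism the CUDA docs above describe, calling cudaDeviceReset from Python via ctypes (assuming `libcudart.so` resolves on your system; this is an illustration only, not something PyTorch exposes directly):

```python
import ctypes

# After a device-side assert the CUDA context is unusable until
# cudaDeviceReset() runs; normally this happens implicitly at process exit.
libcudart = ctypes.CDLL("libcudart.so")
status = libcudart.cudaDeviceReset()
print("cudaDeviceReset returned", status)  # 0 == cudaSuccess
```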
@jihunchoi Do you see any errors in the kernel log (dmesg)?
Quite strange -- I observed the same problem over several runs and reboots, but then it suddenly went away. In dmesg, I can't see any logs related to the GPU. (I grepped the dmesg output for the keywords NVIDIA and GPU, but nothing special shows up.)
I have the same problem. The error happens at this line. It also started happening when I updated to the source-code version from 3-5 days ago.
@chengyangfu the device-side assert happens elsewhere, but you see the stack trace over there. To get the exact location of the device-side assert, run your program with CUDA_LAUNCH_BLOCKING=1.
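As a sketch, the variable can also be set from inside the script, provided it happens before CUDA is initialized (the tensor at the end is just illustrative):

```python
import os

# Must be set before the first CUDA call; with blocking launches the
# device-side assert is reported at the exact line that triggered it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the variable is set

x = torch.randn(4, device="cuda")  # kernels now launch synchronously
```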
@soumith Thanks for the advice. It turns out my code accidentally uses -1 as a label in the cross-entropy loss function, which triggers the runtime error.
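A minimal sketch of that failure mode and one way to make such labels legal (shapes and the use of ignore_index are illustrative, not from the original code):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)          # 3 samples, 5 classes
targets = torch.tensor([0, 4, -1])  # -1 is out of range for 5 classes

# On CUDA an out-of-range target fires an opaque device-side assert;
# on CPU it raises a readable error. If -1 means "no label", tell the
# loss to skip those entries instead:
loss = F.cross_entropy(logits, targets, ignore_index=-1)
print(loss.item())
```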
Another common cause of this error is that the number of classes in the labels does not match the number of units in the final (softmaxed) Linear layer of a classification model. I ran into this accidentally just recently.
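A sketch of that mismatch (the dimensions are made up, and the snippet intentionally reproduces the bug):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10
model = nn.Linear(32, num_classes - 1)  # bug: only 9 logits for 10 classes

x = torch.randn(4, 32)
labels = torch.tensor([0, 3, 7, 9])     # label 9 has no matching logit

# CPU: a readable out-of-bounds error. GPU: "device-side assert triggered",
# often reported at a later, unrelated line.
# The fix is simply nn.Linear(32, num_classes).
loss = F.cross_entropy(model(x), labels)
```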
Common reasons I have seen: labels that fall outside the valid range (e.g. -1, or a value >= the number of classes), and a final layer whose output size does not match the number of classes.
@ShoufaChen Recently, I often get the same error.
I often encounter this error when the initialization of a layer is bad. For instance, yesterday I was getting it because of how I had initialized an attention layer.
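A small debugging sketch along these lines (the layer, std value, and shapes are hypothetical): checking activations for NaN/inf right after a suspect layer localizes a bad initialization before it can surface downstream as an opaque device-side assert.

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512)
nn.init.normal_(layer.weight, std=100.0)  # hypothetical, overly large std

x = torch.randn(8, 512)
scores = layer(x)

# Catch non-finite activations at the source instead of letting them
# propagate into ops that only fail later on the device.
if not torch.isfinite(scores).all():
    raise RuntimeError("non-finite activations right after initialization")
```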
After upgrading to the latest source-code version, weird error messages occur whenever my code contains a mistake.
For example, when I run buggy code on the CPU only, I get a neat, informative error message. However, when I run the same code on the GPU, the error message becomes much less comprehensible than the CPU version.
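The original snippets are not preserved here, but a minimal sketch of the difference, using an out-of-range label (the cause identified in the comments above; the details are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)
targets = torch.tensor([1, 3, 9, -1])  # -1 is an invalid class index

# CPU: fails immediately with a readable error naming the bad target.
try:
    F.cross_entropy(logits, targets)
except Exception as e:
    print("CPU:", e)

# GPU: the same bug surfaces asynchronously as
# "CUDA error: device-side assert triggered", with no hint of the cause.
if torch.cuda.is_available():
    try:
        loss = F.cross_entropy(logits.cuda(), targets.cuda())
        torch.cuda.synchronize()  # force the asynchronous error to appear
    except RuntimeError as e:
        print("GPU:", e)
```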
After some googling, I found a related issue on the cutorch repo: torch/cutorch#708.
The first problem is tolerable, since I can still debug and fix the code by setting the appropriate environment variables.
However, when I terminate the Python process in which the error occurred, the OS stops recognizing the GPU, and the only way I know of to revive it is to reboot the entire system.
Specifically, when I run

```
nvidia-smi
```

after terminating the process, the OS cannot find the device.