Anomaly detection: Error detected in CudnnRnnBackward0 #65301
Comments
Moving from CUDA to CPU, I now get this error:
Running the same code against pip3… Notice that none of these failures occur at the beginning of training; they occur a bit into an epoch.
Torch 1.10 is likely still affected, but 1.9 might be suffering from #35666. I started getting these errors around the time I moved from BatchNorm to LayerNorm and 16-bit floats.
Interestingly, the error only occurs when… Do I need to do something special to reset the GPU state before retrying the operation with a smaller batch size?
Here is a minimal testcase for this bug:
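A hypothetical sketch of the kind of setup described in this thread (a GRU fed by LayerNorm, trained under 16-bit autocast with anomaly detection enabled) is shown below; the shapes, sizes, and names are illustrative guesses, not the reporter's original values.

```python
# Hypothetical reconstruction of the reported setup, not the original code:
# LayerNorm -> GRU, trained with AMP (16-bit) and anomaly detection enabled.
import torch
import torch.nn as nn

torch.autograd.set_detect_anomaly(True)

class Model(nn.Module):
    def __init__(self, hidden=1024):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        output = self.norm(x)
        output, _ = self.gru(output)  # backward node is CudnnRnnBackward0 on GPU
        return self.out(output)

device = "cuda"
model = Model().to(device)
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.cuda.amp.GradScaler()

# Illustrative sizes only; the values that actually trigger the crash on a
# 10 GB card are not reproduced here.
x = torch.randn(64, 512, 1024, device=device)
target = torch.randn(64, 512, 1, device=device)

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Since the crash seems tied to memory pressure (see the batch-size observations below), the tensor sizes would likely need to be increased until the card is nearly full.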
My output is:
My (updated) environment is:
I am looking for answers to the following questions:
It seems you are running into multiple errors, so could you please try to fix them in order?
I would ignore the original post. That error was based on a nightly build of torch and project code which has since changed (I don't remember the details). I'd rather focus on the above testcase. As you can see, the testcase never modifies any variable in place, so I assume whatever bugs we are seeing are in PyTorch.
What does "after the initial warmup steps" refer to? Anyway, I think this is a red herring because I've already got anomaly detection enabled at the top of the testcase. If I disable it altogether I get… If I decrease the batch size, the error goes away. This leads me to believe that the error is actually triggered by an out-of-memory condition, but I'm happy to investigate this further with your guidance to make sure this is correct.
How do we track this down to the source of the problem? Thank you.
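For reference, the anomaly-detection switch mentioned above is a single call made before the training loop; it makes a backward error report the traceback of the forward op that created the failing node (CudnnRnnBackward0 here) and adds NaN checks, at the cost of noticeably slower execution:

```python
import torch

# Ask autograd to record forward tracebacks and check gradients for NaNs,
# so a failure in backward points back at the forward op that produced it.
torch.autograd.set_detect_anomaly(True)
```

Note that CUDA errors such as an illegal memory access are reported asynchronously, so the op named by anomaly detection may not be the one that actually corrupted memory.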
@ptrblck By the way, I just checked and removing… If we can figure out how to avoid getting an error with…
@ptrblck I ran cuda-memcheck. Here is the full output for your review: cuda-memcheck.zip. The only modification I made to the testcase is adding… Please let me know what you think.
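The exact modification isn't spelled out above; a common setup for this kind of debugging, assumed here rather than confirmed by the reporter, is to force synchronous kernel launches and then run the script under cuda-memcheck:

```python
# Hypothetical debugging setup: CUDA_LAUNCH_BLOCKING=1 forces synchronous
# kernel launches so the Python stack trace points at the kernel that failed.
# It has to be set before the first CUDA call is made.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

# The modified script would then be run under NVIDIA's memory checker, e.g.:
#   cuda-memcheck python testcase.py > memcheck.log 2>&1
```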
I stand corrected: in my proprietary project I get this bug even with AMP/precision=16 disabled. Does this bug need to be directed to the PyTorch team or NVIDIA? Thanks.
Thanks for the updates and sorry for the late reply.
I reproduced the illegal memory access using the code snippet unchanged. To run memcheck, I added… Keep in mind that this issue might be specific to my hardware, in the sense that my video card (RTX 3080) has 10 GB of RAM. I don't think you'll be able to reproduce the problem as easily if you have a different amount of GPU memory. If you don't have the same amount of GPU memory available, try setting… Let me know if you're able to reproduce the problem on your end. Thanks.
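The suggested setting is also cut off above; since the crash seems tied to how full the 10 GB card is, one hedged way to adapt the repro to a different card is to query the device's total memory and scale a hypothetical batch size from the 10 GB baseline:

```python
import torch

# Total memory of the visible GPU, in GiB.
total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3

# Hypothetical scaling rule: assume a batch size of 64 roughly fills a
# 10 GiB RTX 3080 and scale it linearly for cards with more or less memory.
batch_size = max(1, int(64 * total_gib / 10))
print(f"{total_gib:.1f} GiB available -> trying batch size {batch_size}")
```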
Perfect, thanks a lot for the code snippet: I can reproduce it on a 3080.
Phew, that's a relief :) Okay, please let me know what the next steps are and CC me on any new tickets. Thanks.
Update: just verified the fix in the upcoming cuDNN 8.3.0 release. |
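To check which cuDNN build a given PyTorch install is linked against once 8.3.0 ships, the version can be queried directly:

```python
import torch

# cuDNN version PyTorch was built against, reported as an integer
# (for example 8300 for 8.3.0).
print(torch.backends.cudnn.version())
```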
@ptrblck Excellent news. Is cuDNN open-source? Is there a ticket I could subscribe to? |
No, cuDNN is not open source and this issue is the public ticket to track the bug and fix. |
Sounds good. Thank you for your help.
🐛 Bug
To Reproduce
When I run
output, _ = self.gru(output)
I get the following traceback:
Unfortunately, I don't have a minimal testcase to share with you, but feel free to ask me for any more information you need.
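Since the failure surfaces in CudnnRnnBackward0, one way to narrow it down, offered here only as a diagnostic suggestion rather than something the reporter tried, is to rerun the failing step with cuDNN disabled so the GRU falls back to PyTorch's native kernels:

```python
import torch

# Diagnostic only: disable cuDNN so nn.GRU uses the native (non-cuDNN) path.
# If the illegal memory access disappears, the cuDNN RNN kernels are implicated.
torch.backends.cudnn.enabled = False
```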
Environment
cc @csarofeen @ptrblck @xwang233 @zou3519 @ngimel