improved assert message in the case of "CUDA error: device-side assert triggered" #17425
Labels: module: bootcamp, module: cuda, module: molly-guard, triaged
When PyTorch spits out
CUDA error: device-side assert triggered
would it be possible to add a brief note, such as: "It's not possible to recover. Program/kernel restart is required to continue"? That way users won't waste time trying to recover or re-run their code when we know ahead of time that they won't be able to.
A bonus feature would be to add a recommendation to re-run with the
CUDA_LAUNCH_BLOCKING=1
env var set, to help debug the issue. Thank you.
I have just added in my code (in addition to the normal exception):
but ideally it should be part of the PyTorch exception in the first place.
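For illustration, a user-side workaround of this kind could look roughly like the sketch below. This is not the snippet referred to above and not anything PyTorch ships; the names `cuda_assert_hint`, `ASSERT_MARKER`, and `HINT` are hypothetical, and it assumes the error reaches Python as a `RuntimeError` (or a subclass) whose message contains the text quoted in this issue.

```python
import contextlib

# Hypothetical names for illustration; the hint text paraphrases the note
# proposed in this issue.
ASSERT_MARKER = "device-side assert triggered"
HINT = (
    "It's not possible to recover from this error. "
    "A program/kernel restart is required to continue. "
    "Re-run with CUDA_LAUNCH_BLOCKING=1 to debug the issue."
)


@contextlib.contextmanager
def cuda_assert_hint():
    """Re-raise device-side assert errors with an extra explanatory note."""
    try:
        yield
    except RuntimeError as err:
        # Assumption: the failure surfaces as a RuntimeError carrying this text.
        if ASSERT_MARKER in str(err):
            # Keep the original message and chain the original exception.
            raise RuntimeError(f"{err}\n{HINT}") from err
        # Any other RuntimeError passes through unchanged.
        raise
```

Wrapping the failing call, e.g. `with cuda_assert_hint(): ...` around a training step, would then append the note to the normal exception text while leaving unrelated errors untouched.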
cc @ngimel