improved assert message in the case of "CUDA error: device-side assert triggered" #17425

Open
stas00 opened this issue Feb 23, 2019 · 1 comment
Assignees
Labels
module: bootcamp (We plan to do a full writeup on the issue, and then get someone to do it for onboarding)
module: cuda (Related to torch.cuda, and CUDA support in general)
module: molly-guard (Features which help prevent users from committing common mistakes)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@stas00
Contributor

stas00 commented Feb 23, 2019

When PyTorch spits out "CUDA error: device-side assert triggered", would it be possible to add a brief note, such as: "It's not possible to recover. Program/kernel restart is required to continue"?

That way users won't waste time trying to recover or re-run their code when we know ahead of time that they won't be able to.

A bonus feature would be to add a recommendation to re-run with the CUDA_LAUNCH_BLOCKING=1 environment variable set, to help debug the issue.
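For anyone landing here, a minimal sketch of what "re-run with CUDA_LAUNCH_BLOCKING=1" could look like from Python. The helper names `blocking_env` and `rerun_blocking` are made up for illustration and are not part of PyTorch. The variable has to be in the environment before the process initializes CUDA, which is why this sketch launches a fresh process rather than changing the current one.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    `base` defaults to the current process environment. (Hypothetical helper.)
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(argv):
    """Run `argv` in a fresh process with blocking kernel launches enabled.

    Returns the child's exit code. (Hypothetical helper.)
    """
    return subprocess.run(argv, env=blocking_env()).returncode
```

Usage would be something like `rerun_blocking([sys.executable, "train.py"])`, where `train.py` stands in for whatever script triggered the assert. From a shell, prefixing the command with `CUDA_LAUNCH_BLOCKING=1` does the same thing.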

Thank you.

I have just added in my code (in addition to the normal exception):

    # Assumes `from warnings import warn`; `e` is the exception caught by the enclosing `except` block.
    if "device-side assert triggered" in str(e):
        warn("""When 'device-side assert triggered' error happens, it's not possible to recover and you must restart the kernel to continue. Use os.environ['CUDA_LAUNCH_BLOCKING']="1" before restarting to debug""")

but ideally it should be part of the PyTorch exception in the first place.
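Until that happens, the workaround above could be packaged as a reusable context manager. This is only a sketch under some assumptions: the name `explain_device_side_assert` and the `HINT` wording are hypothetical, and it assumes the error surfaces as a `RuntimeError` whose message contains the text quoted in this issue. The original exception is always re-raised unchanged.

```python
import warnings
from contextlib import contextmanager

# Hypothetical hint text, adapted from the wording proposed in this issue.
HINT = (
    "When a 'device-side assert triggered' error happens, it's not possible "
    "to recover and you must restart the program/kernel to continue. "
    "Re-run with CUDA_LAUNCH_BLOCKING=1 to debug."
)


@contextmanager
def explain_device_side_assert():
    """Warn with a restart hint when a device-side assert error passes through."""
    try:
        yield
    except RuntimeError as e:
        if "device-side assert triggered" in str(e):
            warnings.warn(HINT, RuntimeWarning)
        raise  # never swallow the original error
```

Wrapping a training step in `with explain_device_side_assert(): ...` would then print the hint next to the usual traceback, and leave unrelated errors untouched.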

cc @ngimel

@ezyang
Contributor

ezyang commented Feb 25, 2019

Seems reasonable to me. You'd have to edit C10_CUDA_CHECK, I believe.
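To make the idea concrete, here is a rough Python model of an error check that appends the note for this one error. It is not the actual C10_CUDA_CHECK macro (that is C++ inside PyTorch) and does not reflect how it is implemented. The function `cuda_check`, its `error_message` argument, and the `NOTE` text are all hypothetical.

```python
DEVICE_ASSERT = "device-side assert triggered"

# Hypothetical note, using the wording suggested in this issue.
NOTE = (
    "It's not possible to recover. Program/kernel restart is required to "
    "continue. Re-run with CUDA_LAUNCH_BLOCKING=1 to debug."
)


def cuda_check(error_message):
    """Toy stand-in for an error check: `None` means the call succeeded."""
    if error_message is None:
        return
    text = f"CUDA error: {error_message}"
    if DEVICE_ASSERT in error_message:
        # Only this error gets the extra guidance; other messages are unchanged.
        text = f"{text}\n{NOTE}"
    raise RuntimeError(text)
```

The point of doing it at the check itself is that every caller gets the guidance for free, instead of each user adding their own `except` wrapper.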

soumith added the todo label Feb 25, 2019
soumith self-assigned this Feb 25, 2019
ezyang added the module: cuda, module: molly-guard, triaged, and module: bootcamp labels and removed the todo label Aug 30, 2021
ezyang self-assigned this Aug 30, 2021
ezyang added a commit that referenced this issue Aug 30, 2021
Fixes #17425

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this issue Aug 30, 2021
Fixes #17425

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 27421493dbc3b10332dac2c0bdb4897e929501bb
Pull Request resolved: #64184
ezyang added a commit that referenced this issue May 23, 2022
Fixes #17425

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

ghstack-source-id: 6f96f817f5ad4376beffdeab3fbde7accf7b18e0
Pull Request resolved: #64184
ezyang added a commit that referenced this issue May 23, 2022
…e that they have to restart."

Fixes #17425

Signed-off-by: Edward Z. Yang <ezyangfb.com>

[ghstack-poisoned]
ezyang added a commit that referenced this issue May 23, 2022
… to restart."

Fixes #17425

Signed-off-by: Edward Z. Yang <ezyangfb.com>

[ghstack-poisoned]
3 participants