
Reliable way to identify RuntimeErrors (CUDA) #29710

Open
c-hofer opened this issue Nov 13, 2019 · 1 comment
c-hofer commented Nov 13, 2019

🚀 Feature

A reliable way to check for CUDA out-of-memory errors (and CUDA runtime errors in general).

Motivation

Currently I see no way to reliably check for a CUDA out-of-memory error except parsing the exception message for

CUDA out of memory.

(After a quick grep of the PyTorch sources, this seems to work at the moment.)
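
For concreteness, this is roughly what the workaround looks like today (a minimal sketch; the matched string is exactly the fragile part, and `is_cuda_oom` is just an illustrative helper name):

```python
import torch

def is_cuda_oom(exc: RuntimeError) -> bool:
    # Fragile: "CUDA out of memory" is an implementation detail of the
    # error message and may change between PyTorch releases.
    return "CUDA out of memory" in str(exc)

try:
    x = torch.empty(2**40, device="cuda")  # deliberately huge allocation
except RuntimeError as e:
    if is_cuda_oom(e):
        torch.cuda.empty_cache()  # e.g. free cached blocks, then retry smaller
    else:
        raise
```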

Since this message text may change in the future, I am not comfortable with this workaround; it is practically guaranteed to break at some point.
Reliably detecting such errors in application code seems crucial to me.

If there already is a way to do this and I simply did not find it, this issue may be a good place to document it.

Pitch

A fairly standard solution would do, e.g., RuntimeError subclasses or an error code attached to the exception.
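
For illustration, a minimal sketch of what such an API could look like. `CudaError` and `CudaOutOfMemoryError` are hypothetical names, not existing PyTorch classes:

```python
# Hypothetical API sketch -- neither class exists in PyTorch today.
class CudaError(RuntimeError):
    """Base class for errors raised by the CUDA runtime."""
    def __init__(self, message: str, cuda_error_code: int):
        super().__init__(message)
        self.cuda_error_code = cuda_error_code  # raw cudaError_t value

class CudaOutOfMemoryError(CudaError):
    """Raised when a CUDA allocation fails (cudaErrorMemoryAllocation = 2)."""

# Application code could then catch the specific type instead of
# string-matching the message:
try:
    raise CudaOutOfMemoryError("CUDA out of memory.", 2)
except CudaOutOfMemoryError as e:
    print("caught OOM, CUDA error code", e.cuda_error_code)
```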

@soumith @albanD What do you folks think about this?

cc @ngimel

@mruberry added the module: cuda, triaged, and enhancement labels on Nov 13, 2019
@vadimkantorov (Contributor) commented:
Currently it seems that people are checking exception messages: https://github.com/pytorch/fairseq/blob/3655cf266e32a2272d6deac6069a594977880084/fairseq/trainer.py#L615

It would indeed be good to have a separate exception type for out-of-memory errors.
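
To illustrate the pattern used in the linked fairseq code, here is a hedged sketch (not the actual fairseq implementation) of the string-matching recovery that projects resort to today: catch the generic RuntimeError, match on the message, free cached memory, and skip the batch.

```python
import torch

def train_step(model, batch, optimizer):
    """One training step that skips the batch on CUDA OOM (sketch)."""
    try:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
    except RuntimeError as e:
        # Today's only option: match on the message text.
        if "out of memory" in str(e):
            print("| WARNING: ran out of memory, skipping batch")
            optimizer.zero_grad()
            torch.cuda.empty_cache()  # release cached blocks before continuing
        else:
            raise
```

With a dedicated exception type, the `except` clause would shrink to a single, version-stable `except CudaOutOfMemoryError:` line.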
