RuntimeError: CUDA error: unspecified launch failure at random places #39872

Closed
HumamHelfawi opened this issue Jun 11, 2020 · 5 comments
Labels
module: cuda (related to torch.cuda and CUDA support in general)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

HumamHelfawi commented Jun 11, 2020

🐛 Bug

I am getting RuntimeError: CUDA error: unspecified launch failure at random places when CUDA_LAUNCH_BLOCKING is set to 0. When it is set to 1, everything works, but performance drops significantly.

To Reproduce

Steps to reproduce the behavior:

There is no specific way to reproduce the behavior. Sometimes it happens at:

total_loss += loss.item()

Sometimes at:

for batch_idx, (x, lx, mx, y, ly, my, m) in enumerate(train_loader):

Sometimes it happens during the first epochs; other times it takes 3-4 epochs.
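
CUDA kernels are launched asynchronously, so an error raised by one kernel is usually reported at a later synchronizing call such as loss.item(). That would be consistent with the failure showing up on different lines. Below is a minimal sketch of one way to narrow it down, assuming the training loop can be split into named stages: synchronize after each stage so the error surfaces close to its origin. The helper run_stage and the stage names are hypothetical; in a real run, sync would be torch.cuda.synchronize.

```python
def run_stage(name, fn, sync):
    """Run fn(), then force a device sync so pending async errors surface here.

    name: label for the stage (hypothetical, used only in the error message).
    fn:   zero-argument callable doing the work, e.g. lambda: model(x).
    sync: zero-argument callable; pass torch.cuda.synchronize in real use.
    """
    try:
        result = fn()
        sync()  # blocks until queued kernels finish; a pending error raises here
        return result
    except RuntimeError as err:
        # Re-raise with the stage name attached, keeping the original as the cause.
        raise RuntimeError(f"failure surfaced in stage {name!r}") from err
```

For example, wrapping the forward pass, loss.backward() and optimizer.step() separately should show which stage the failure belongs to, at a smaller cost than running everything with CUDA_LAUNCH_BLOCKING=1.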

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

PyTorch version: 1.6.0.dev20200610
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
CMake version: version 3.17.3

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] torch==1.6.0.dev20200610
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect

Additional context

P.S. The Python script you provide did not report the driver version. According to the NVIDIA Control Panel, it is 446.14.

I tried the stable version of PyTorch and saw exactly the same behaviour. I tried an earlier version of the GPU driver with no luck. I tried CUDA 10.1 instead of 10.2; nothing changed. The only thing that makes a difference is CUDA_LAUNCH_BLOCKING.
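
For reference, here is a minimal sketch of enabling blocking launches from inside the script instead of the shell. The variable has to be set before CUDA is initialized, so the safest place is before importing torch. The helper name force_blocking_launches is hypothetical.

```python
import os


def force_blocking_launches(environ=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 so each kernel launch runs synchronously.

    Call this before CUDA is initialized (safest: before importing torch).
    """
    environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return environ


force_blocking_launches()
# import torch  # import only after the variable has been set
```

With blocking launches the error is raised at the call that launched the failing kernel, at the cost of the slowdown described above.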

cc @ngimel

@peterjc123
Collaborator

I guess you can get more details if you recompile PyTorch with some debugging info. I can do that for you if you want.

@mruberry added the module: cuda and triaged labels on Jun 12, 2020
@mruberry
Collaborator

Although the issue is intermittent, can you provide a short, self-contained script that at least sometimes reproduces the issue?

@peterjc123
Collaborator

peterjc123 commented Jun 13, 2020

I've compiled the binaries with debug info.
https://5833189-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp36-cp36m-win_amd64.whl
https://5833191-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp37-cp37m-win_amd64.whl
https://5833196-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp38-cp38-win_amd64.whl
You can install them and then get some more info using cuda-memcheck.

:: PythonRoot in the line below refers to the directory of your Python installation
:: e.g. C:\Python37
set _NT_ALT_SYMBOL_PATH=[PythonRoot]\Lib\site-packages\torch\lib
cuda-memcheck python your-script.py
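
The two commands above could also be driven from Python, which may be convenient if the run needs other environment setup. This is only a sketch: build_memcheck_command is a hypothetical helper, C:\Python37 is the example path from the comment above, and the command is built but not executed here.

```python
import os
from pathlib import PureWindowsPath


def build_memcheck_command(python_root, script, base_env=None):
    """Return (argv, env) for running a script under cuda-memcheck on Windows."""
    # Copy the environment so the caller's (or the process's) mapping is untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Point the symbol lookup at the debug info shipped in torch\lib.
    symbols = PureWindowsPath(python_root) / "Lib" / "site-packages" / "torch" / "lib"
    env["_NT_ALT_SYMBOL_PATH"] = str(symbols)
    argv = ["cuda-memcheck", "python", script]
    return argv, env


argv, env = build_memcheck_command(r"C:\Python37", "your-script.py")
# To actually run it: subprocess.run(argv, env=env, check=False)
```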

@peterjc123
Collaborator

Related: #27837.

@mruberry
Collaborator

Closing due to age and lack of reproduction. If this is still occurring and you can provide a script that at least sometimes reproduces the issue then please file a new issue.


3 participants