RuntimeError: CUDA error: unspecified launch failure at random places #39872

HumamHelfawi · 2020-06-11T18:35:42Z

🐛 Bug

I am getting RuntimeError: CUDA error: unspecified launch failure at random places when the CUDA_LAUNCH_BLOCKING flag is set to 0. When it is set to 1, everything works, but performance drops sharply.
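For readers who want to toggle the flag from inside the training script rather than the shell, here is a minimal sketch. The helper name configure_launch_blocking is hypothetical and not part of PyTorch; the only real interface used is the CUDA_LAUNCH_BLOCKING environment variable itself. It assumes the variable is set before CUDA is initialized, so calling it before the first import of torch is the conservative choice.

```python
import os


def configure_launch_blocking(enabled: bool) -> str:
    """Set CUDA_LAUNCH_BLOCKING for this process and return the value written.

    Hypothetical helper: call it before CUDA is initialized
    (safest: before the first `import torch` in the entry script).
    """
    value = "1" if enabled else "0"
    os.environ["CUDA_LAUNCH_BLOCKING"] = value
    return value


# "1" makes kernel launches synchronous, which is the slow but stable mode
# described in this report.
configure_launch_blocking(True)
```

With the flag at 1 each kernel launch blocks until completion, which would explain the slowdown and also means a failure is reported at the launch that caused it instead of at some later call.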

To Reproduce

Steps to reproduce the behavior:

There is no specific way to reproduce the behavior. Sometimes it happens at:

total_loss += loss.item()

Sometimes at:

for batch_idx, (x, lx, mx, y, ly, my, m) in enumerate(train_loader):

Sometimes it happens within the first epochs, and other times it takes 3-4 epochs.
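One possible reading of the shifting failure sites: CUDA kernels run asynchronously, so an error from an earlier kernel may only be reported at a later call that waits on the device (loss.item() copies a value to the host and is one such point). The sketch below narrows down the failing step without enabling CUDA_LAUNCH_BLOCKING globally by forcing a synchronization after each suspect step. The helper checked and its sync parameter are hypothetical names introduced for illustration; torch.cuda.synchronize() is the only PyTorch call relied on, and it is imported lazily.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def _default_sync() -> None:
    # Imported lazily so the helper can be defined on a machine without PyTorch.
    import torch

    if torch.cuda.is_available():
        torch.cuda.synchronize()


def checked(label: str, fn: Callable[[], T],
            sync: Callable[[], None] = _default_sync) -> T:
    """Run fn, then synchronize so a pending CUDA error is attributed to `label`."""
    result = fn()
    try:
        sync()
    except RuntimeError as exc:
        # Re-raise with the step name attached; the original error is chained.
        raise RuntimeError(
            f"CUDA error surfaced after step '{label}': {exc}"
        ) from exc
    return result
```

In a training loop this might look like loss = checked("forward", lambda: criterion(model(x), y)), where criterion, model, x and y stand in for the user's own objects. Each added synchronization costs throughput, so this is a debugging aid rather than something to leave enabled.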

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

PyTorch version: 1.6.0.dev20200610
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
CMake version: version 3.17.3

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.4
[pip3] torch==1.6.0.dev20200610
[pip3] torchvision==0.6.0+cu101
[conda] Could not collect

Additional context

P.S. The environment collection script did not report the driver version. According to the NVIDIA Control Panel, it is 446.14.

I tried the stable version of PyTorch and got exactly the same behaviour. I tried an earlier version of the GPU driver with no luck. I tried CUDA 10.1 instead of 10.2, and nothing changed. The only thing that makes a difference is CUDA_LAUNCH_BLOCKING.

cc @ngimel


peterjc123 · 2020-06-12T01:05:52Z

I guess you can get more details if you recompile PyTorch with some debugging info. I can do that for you if you want.

mruberry · 2020-06-12T04:45:51Z

Although the issue is intermittent, can you provide a short, self-contained script that at least sometimes reproduces the issue?

peterjc123 · 2020-06-13T05:41:53Z

I've compiled the binaries with debug info.
https://5833189-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp36-cp36m-win_amd64.whl
https://5833191-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp37-cp37m-win_amd64.whl
https://5833196-65600975-gh.circle-artifacts.com/0/w/final_pkgs/torch-1.6.0.dev20200613-cp38-cp38-win_amd64.whl
You can install them and then get some more info using cuda-memcheck.

:: PythonRoot in the line below refers to the directory of your Python installation
:: e.g. C:\Python37
set _NT_ALT_SYMBOL_PATH=[PythonRoot]\Lib\site-packages\torch\lib
cuda-memcheck python your-script.py
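The same invocation can be driven from Python, which may be convenient when the run needs to be repeated until the intermittent failure shows up. This is only a sketch: memcheck_invocation is a hypothetical helper, and it assumes cuda-memcheck and python are both on PATH and that the debug wheel is installed under the given Python root.

```python
import os
from pathlib import PureWindowsPath
from typing import Dict, List, Tuple


def memcheck_invocation(python_root: str,
                        script: str) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment for a cuda-memcheck run.

    python_root is the Python installation directory, e.g. C:\\Python37.
    """
    # Directory holding the torch libraries and their debug symbols.
    symbol_dir = (PureWindowsPath(python_root)
                  / "Lib" / "site-packages" / "torch" / "lib")
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["_NT_ALT_SYMBOL_PATH"] = str(symbol_dir)
    cmd = ["cuda-memcheck", "python", script]
    return cmd, env
```

On the affected Windows machine, cmd, env = memcheck_invocation(r"C:\Python37", "your-script.py") followed by subprocess.run(cmd, env=env) would start the run; the helper itself only builds the arguments and does not launch anything.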

peterjc123 · 2020-06-13T05:48:14Z

Related: #27837.

mruberry · 2021-11-10T12:08:55Z

Closing due to age and lack of reproduction. If this is still occurring and you can provide a script that at least sometimes reproduces the issue then please file a new issue.

mruberry added labels Jun 12, 2020: module: cuda (Related to torch.cuda, and CUDA support in general); triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

mszhanyi mentioned this issue Aug 11, 2020

PyTorch 1.3: random "RuntimeError: CUDA error: unspecified launch failure" #27837

Closed

mruberry closed this as completed Nov 10, 2021
