
cuDNN error when using 3d convolutions #18053

Open
vlasenkov opened this issue Mar 15, 2019 · 8 comments
Labels
module: cudnn (Related to torch.backends.cudnn and cuDNN support) · module: dependency bug (Problem is not caused by us, but caused by an upstream library we use) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

vlasenkov (Contributor) commented Mar 15, 2019

🐛 Bug

The following simple script (https://gist.github.com/vlasenkov/b3aa7c12570fe0056fca3421453470ca) crashes with this traceback:

Traceback (most recent call last):
  File "models/unet/test.py", line 126, in <module>
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

To Reproduce

Run the script above on a single GPU.

Expected behavior

The script finishes successfully.

Environment

Collecting environment information...
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla V100-DGXS-32GB
GPU 1: Tesla V100-DGXS-32GB
GPU 2: Tesla V100-DGXS-32GB
GPU 3: Tesla V100-DGXS-32GB

Nvidia driver version: 410.79
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1

Versions of relevant libraries:
[pip] Could not collect
[conda] blas                      1.0                         mkl  
[conda] magma-cuda100             2.1.0                         5    local
[conda] mkl                       2019.0                      118  
[conda] mkl-include               2019.0                      118  
[conda] mkl_fft                   1.0.6            py36h7dd41cf_0  
[conda] mkl_random                1.0.1            py36h4414c95_1  
[conda] pytorch                   1.0.1           py3.6_cuda10.0.130_cudnn7.4.2_2    pytorch
[conda] torch                     1.0.0a0                  pypi_0    pypi
[conda] torchtext                 0.3.0                    pypi_0    pypi
[conda] torchvision               0.2.1                    pypi_0    pypi
vishwakftw added the module: cudnn label Mar 15, 2019
ezyang (Contributor) commented Mar 18, 2019

This is probably not an interpolate bug, because interpolate with trilinear mode doesn't call cuDNN. This looks like a Conv3d bug. cc @mruberry

It would be really helpful to get a minimal reproduction. Would it be possible for you to remove layers from your network while still reproducing the bug? Otherwise one of us can try.

vlasenkov (Contributor, Author) commented
I have updated the gist. I removed as much as was possible. Removing more layers makes the error disappear.

@vlasenkov vlasenkov changed the title cuDNN error when using interpolations cuDNN error when using 3d convolutions Mar 19, 2019
mruberry (Collaborator) commented
We should be able to take a look at this next week (GTC is this week).

rakeshjasti commented
A workaround is to set `torch.backends.cudnn.benchmark = True`.

vlasenkov (Contributor, Author) commented Apr 5, 2019

Ha! When I set `torch.backends.cudnn.benchmark = True` I get:

RuntimeError: no deterministic convolution algorithms available in CuDNN

I have to use deterministic mode because memory consumption is too large otherwise. Guys, why is cuDNN so buggy?
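For context, the two flags in play here can be sketched as follows. This is a minimal CPU sketch with hypothetical layer shapes (the original gist is a larger U-Net); cuDNN is only exercised for CUDA tensors, so on CPU this always succeeds, while on the affected binaries the `backward()` call is where `CUDNN_STATUS_EXECUTION_FAILED` was raised:

```python
import torch
import torch.nn as nn

# Deterministic mode restricts cuDNN to reproducible algorithms;
# benchmark mode autotunes the fastest algorithm per input shape.
# In older releases requesting both at once could fail with
# "no deterministic convolution algorithms available in CuDNN".
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Hypothetical shapes, much smaller than the gist's network.
conv = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
x = torch.randn(2, 1, 8, 8, 8, requires_grad=True)

out = conv(x)          # spatial dims preserved by padding=1
out.sum().backward()   # the step that crashed on the affected GPUs
print(tuple(out.shape))
```

Note the flags are process-global toggles, so flipping them mid-run affects every subsequent cuDNN call.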

fmassa added the module: dependency bug and triaged labels Apr 5, 2019
mruberry (Collaborator) commented Apr 5, 2019

Thanks for the report. We are reviewing this issue.

ngimel (Collaborator) commented Apr 9, 2019

@vlasenkov I can repro the error with binary packages that use cuDNN 7.4.2, but I cannot repro it with source builds against cuDNN 7.5. Can you try building from source with cuDNN 7.5 and see if that fixes your error? Unfortunately, the binary packages statically link cuDNN, so you cannot drop in a replacement cuDNN version.

ezyang (Contributor) commented Apr 11, 2019

Once we update the nightlies to 7.5 and confirm it doesn't repro, we'll close the bug.
