
cuDNN error when using 3d convolutions #18053

Open
vlasenkov opened this issue Mar 15, 2019 · 8 comments
Labels
module: cudnn (Related to torch.backends.cudnn and cuDNN support) · module: dependency bug (Problem is not caused by us, but caused by an upstream library we use) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

vlasenkov (Contributor) commented Mar 15, 2019

🐛 Bug

The following simple script (https://gist.github.com/vlasenkov/b3aa7c12570fe0056fca3421453470ca) crashes with this traceback:

Traceback (most recent call last):
  File "models/unet/test.py", line 126, in <module>
    loss.backward()
  File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

To Reproduce

Run the script above on a single GPU.

Expected behavior

The script finishes successfully.

Environment

Collecting environment information...
PyTorch version: 1.0.1.post2
Is debug build: No
CUDA used to build PyTorch: 10.0.130

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla V100-DGXS-32GB
GPU 1: Tesla V100-DGXS-32GB
GPU 2: Tesla V100-DGXS-32GB
GPU 3: Tesla V100-DGXS-32GB

Nvidia driver version: 410.79
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.1

Versions of relevant libraries:
[pip] Could not collect
[conda] blas                      1.0                         mkl  
[conda] magma-cuda100             2.1.0                         5    local
[conda] mkl                       2019.0                      118  
[conda] mkl-include               2019.0                      118  
[conda] mkl_fft                   1.0.6            py36h7dd41cf_0  
[conda] mkl_random                1.0.1            py36h4414c95_1  
[conda] pytorch                   1.0.1           py3.6_cuda10.0.130_cudnn7.4.2_2    pytorch
[conda] torch                     1.0.0a0                  pypi_0    pypi
[conda] torchtext                 0.3.0                    pypi_0    pypi
[conda] torchvision               0.2.1                    pypi_0    pypi
vishwakftw added the module: cudnn label Mar 15, 2019
ezyang (Contributor) commented Mar 18, 2019

This is probably not an interpolate bug, because interpolate with trilinear mode doesn't call cuDNN. This looks like a Conv3d bug. cc @mruberry

It would be really helpful to get a minimal reproduction. Would it be possible for you to remove layers from your network while still reproducing the bug? Otherwise one of us can try.

vlasenkov (Contributor, Author) commented
I have updated the gist. I removed as much as was possible. Removing more layers makes the error disappear.

@vlasenkov vlasenkov changed the title cuDNN error when using interpolations cuDNN error when using 3d convolutions Mar 19, 2019
mruberry (Collaborator) commented
We should be able to take a look at this next week (GTC is this week).

rakeshjasti commented
A workaround is to set `torch.backends.cudnn.benchmark = True`.

vlasenkov (Contributor, Author) commented Apr 5, 2019

Ha! When I set `torch.backends.cudnn.benchmark = True` I get:

RuntimeError: no deterministic convolution algorithms available in CuDNN

I have to use deterministic mode because memory consumption is too large otherwise. Guys, why is cuDNN so buggy?
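For context, the two flags in play here can be sketched as follows. This is a minimal CPU sketch with hypothetical layer shapes (the original gist is a larger U-Net); cuDNN is only exercised for CUDA tensors, so on CPU this always succeeds, while on the affected binaries the `backward()` call is where `CUDNN_STATUS_EXECUTION_FAILED` was raised:

```python
import torch
import torch.nn as nn

# Deterministic mode restricts cuDNN to reproducible algorithms;
# benchmark mode autotunes the fastest algorithm per input shape.
# In older releases requesting both at once could fail with
# "no deterministic convolution algorithms available in CuDNN".
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Hypothetical shapes, much smaller than the gist's network.
conv = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
x = torch.randn(2, 1, 8, 8, 8, requires_grad=True)

out = conv(x)          # spatial dims preserved by padding=1
out.sum().backward()   # the step that crashed on the affected GPUs
print(tuple(out.shape))
```

Note the flags are process-global toggles, so flipping them mid-run affects every subsequent cuDNN call.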

fmassa added the module: dependency bug and triaged labels Apr 5, 2019
mruberry (Collaborator) commented Apr 5, 2019

Thanks for the report. We are reviewing this issue.

ngimel (Collaborator) commented Apr 9, 2019

@vlasenkov I can repro the error with binary packages that use cuDNN 7.4.2, but I cannot repro it with source builds against cuDNN 7.5. Can you try building from source with cuDNN 7.5 and see if that fixes your error? Unfortunately, the binary packages statically link cuDNN, so you cannot drop in a replacement cuDNN version.

ezyang (Contributor) commented Apr 11, 2019

Once we update the nightlies to 7.5 and confirm it doesn't repro, we'll close the bug.
