
TF2.0 beta on RTX 2080 throws Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #29632

Closed
srihari-humbarwadi opened this issue Jun 11, 2019 · 10 comments
Assignees
Labels
comp:gpu GPU related issues stat:awaiting response Status - Awaiting response from author TF 2.0 Issues relating to TensorFlow 2.0 type:bug Bug

Comments

@srihari-humbarwadi
Contributor

Please make sure that this is a build/installation issue. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:build_template

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/a
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 2.0.0-beta0
  • Python version: 3.6
  • Installed using virtualenv? pip? conda?: built whl and installed with virtualenv
  • Bazel version (if compiling from source): 0.25.0
  • GCC/Compiler version (if compiling from source): 7.3.0
  • CUDA/cuDNN version: 10.0 / 7.5.0
  • GPU model and memory: RTX 2080, driver 418.40.04

Describe the problem
I am trying to build the TF 2.0 beta on my RTX system. The build succeeds, but I cannot run anything once I install the whl. I keep hitting

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I also tried the prebuilt binary, and it throws the same error. Strangely, 1.13.1 works perfectly, but only if I build it from source.

Provide the exact sequence of commands / steps that you executed before running into the problem

Any other info / logs

Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.5.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import tensorflow as tf                                                                                                                                                                             

In [2]: model = tf.keras.applications.ResNet50()                                                                                                                                                            
2019-06-11 14:12:10.599679: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-06-11 14:12:10.654528: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.655435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.83
pciBusID: 0000:02:00.0
2019-06-11 14:12:10.655756: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-11 14:12:10.656708: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-11 14:12:10.672933: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-11 14:12:10.673320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-11 14:12:10.674796: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-11 14:12:10.675938: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-11 14:12:10.679002: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-11 14:12:10.679234: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.680150: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.680897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-06-11 14:12:10.681515: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.682282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.83
pciBusID: 0000:02:00.0
2019-06-11 14:12:10.682345: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-11 14:12:10.682365: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2019-06-11 14:12:10.682391: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2019-06-11 14:12:10.682412: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2019-06-11 14:12:10.682432: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2019-06-11 14:12:10.682451: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2019-06-11 14:12:10.682470: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-11 14:12:10.682548: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.683363: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.684127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-06-11 14:12:10.722429: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2019-06-11 14:12:10.842536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-11 14:12:10.842661: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-06-11 14:12:10.842716: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2019-06-11 14:12:10.867592: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.868473: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.869242: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-11 14:12:10.869958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7248 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:02:00.0, compute capability: 7.5)

In [3]: model(tf.random.normal([5, 224, 224, 3]))                                                                                                                                                           
2019-06-11 14:13:21.027798: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-06-11 14:13:22.315487: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-06-11 14:13:22.318659: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

@srihari-humbarwadi
Contributor Author

@ymodak @jvishnuvardhan

@srihari-humbarwadi
Contributor Author

Tried it on a fresh Ubuntu 18.04 installation and followed the official scripts to install CUDA and cuDNN.
Still facing the same issue.

# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver
sudo apt-get install --no-install-recommends nvidia-driver-410
# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-0 \
    libcudnn7=7.4.1.5-1+cuda10.0  \
    libcudnn7-dev=7.4.1.5-1+cuda10.0


# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get update && \
        sudo apt-get install nvinfer-runtime-trt-repo-ubuntu1804-5.0.2-ga-cuda10.0 \
        && sudo apt-get update \
        && sudo apt-get install -y --no-install-recommends libnvinfer-dev=5.0.2-1+cuda10.0
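
After installing, a quick sanity check such as the following (a minimal sketch; it only confirms that TensorFlow detects the GPU and can run a cuDNN-free op, not that cuDNN handles can be created) helps separate driver/CUDA setup problems from the cuDNN issue itself:

import tensorflow as tf

# An empty list here points to a driver or CUDA installation problem.
print('Visible GPUs:', tf.config.experimental.list_physical_devices('GPU'))

# A plain matmul does not need cuDNN, so it can succeed even when
# convolution ops later fail with CUDNN_STATUS_INTERNAL_ERROR.
with tf.device('/GPU:0'):
    print(tf.matmul(tf.random.normal([2, 2]), tf.random.normal([2, 2])))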

@srihari-humbarwadi
Contributor Author

The only workaround I could find is this

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

But is there any other way to fix this so that I don't have to use this workaround? This doesn't seem normal to me.
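
For reference, the same allow-growth behaviour can apparently also be enabled without touching the model code, via the TF_FORCE_GPU_ALLOW_GROWTH environment variable (a sketch assuming a TF 2.x build that honours this variable):

import os

# Must be set before TensorFlow creates its GPU devices (i.e. before the first GPU op).
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

model = tf.keras.applications.ResNet50()
model(tf.random.normal([5, 224, 224, 3]))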

@ymodak ymodak self-assigned this Jun 11, 2019
@ymodak ymodak added the comp:gpu GPU related issues label Jun 11, 2019
@ymodak ymodak assigned caisq and unassigned ymodak Jun 11, 2019
@ymodak ymodak added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug labels Jun 11, 2019
@ymodak ymodak assigned aaroey and unassigned caisq Aug 30, 2019
@inkcherry

The only workaround I could find is this

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

But is there any other way to fix this so that I don't have to use this workaround? This doesn't seem normal to me.

I have the same problem, and I have 3 GPUs. With your approach:
tf.config.experimental.set_memory_growth(physical_devices[0], True)
I got ValueError: Memory growth cannot differ between GPU devices

So I then set:
tf.config.experimental.set_memory_growth(physical_devices[0], True)
tf.config.experimental.set_memory_growth(physical_devices[1], True)
tf.config.experimental.set_memory_growth(physical_devices[2], True)

and that solved the problem. Thanks a lot.
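
In other words, memory growth apparently has to be set consistently across every visible GPU. A loop over list_physical_devices (a minimal sketch of the same workaround, not tied to a specific GPU count) avoids hard-coding the device indices:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
# Memory growth must match across GPUs and be set before any GPU is initialized.
for gpu in physical_devices:
    tf.config.experimental.set_memory_growth(gpu, True)

This should run right after importing TensorFlow, before the first GPU operation, since the setting cannot be changed once the devices are initialized.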

@jvishnuvardhan
Contributor

jvishnuvardhan commented Sep 26, 2019

@srihari-humbarwadi Sorry for the delay. Is this still an issue, or was it resolved? Could you check with TF 2.0 and let us know whether the issue persists with the latest TF version? Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Sep 26, 2019
@jvishnuvardhan
Contributor

Automatically closing this out since I understand it to be resolved, but please let me know if I'm mistaken. Thanks!

@tensorflow-bot

tensorflow-bot bot commented Oct 7, 2019

Are you satisfied with the resolution of your issue?

@nikste
Contributor

nikste commented Feb 11, 2020

I still experience this issue with an RTX 2080 on the latest TensorFlow version, with CUDA 10.1 and cuDNN 7.6.

@farahmand-m

farahmand-m commented Jul 24, 2020

I built TensorFlow 2.3 from source on Arch Linux, using CUDA 10.2 and cuDNN 7.6, and I get a similar error:

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I have a GeForce 1660 Ti; the latest driver is installed, and PyTorch can use the libraries without any problem.
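
A quick way to double-check that the CUDA/cuDNN stack itself is usable, independently of TensorFlow, is a PyTorch probe along these lines (assuming PyTorch is built against the same CUDA/cuDNN versions):

import torch

# These exercise the same driver/CUDA/cuDNN libraries that TensorFlow loads.
print('CUDA available:', torch.cuda.is_available())
print('cuDNN version:', torch.backends.cudnn.version())
if torch.cuda.is_available():
    print('Device:', torch.cuda.get_device_name(0))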

@majnas

majnas commented Nov 18, 2020

physical_devices = tf.config.experimental.list_physical_devices('GPU')
for pd in physical_devices:
    tf.config.experimental.set_memory_growth(pd, True)
