
could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #8879

Closed
zia-hasan opened this issue Mar 31, 2017 · 32 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:support (Support issues)

Comments

@zia-hasan

zia-hasan commented Mar 31, 2017

Hi,
I installed the TensorFlow 1.0.1 GPU version on my MacBook Pro with a GeForce GT 750M, along with CUDA 8.0.71 and cuDNN 5.1. I am running a TF script that works fine with the CPU-only TensorFlow, but with the GPU version I get this error (once in a while it works, too).

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

What is happening here? Is there a bug in TensorFlow? Please advise.

Thanks

@aselle
Contributor

aselle commented Apr 1, 2017

@gunan may have insight, but in general macOS support for NVIDIA GPUs is relatively poor, so it is difficult for us to support them.

@aselle added the stat:awaiting tensorflower and type:support labels on Apr 1, 2017
@gunan
Contributor

gunan commented Apr 2, 2017

Looks like previous instances of the same error messages were usually cuDNN version mismatches.
Maybe we have a cuDNN patch version mismatch.
What is your full cuDNN version? I will check our build machines on Monday, then we can compare.

If this is caused by the cuDNN version, you may need to build TF from source.

@zia-hasan
Author

My cuDNN version is 5.1 (macOS). I noticed that when more GPU memory is available, the command-line version of the program works, but the Jupyter notebook crashes with this error.

@gunan added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Apr 2, 2017
@gunan
Contributor

gunan commented Apr 2, 2017

5.1 is only the major and minor cuDNN version. There is also a third integer, the patch version. You can check it by looking at your cudnn.h header file.

I feel like there is more we can get from your logs.
Could you paste your full terminal output to pastebin and share its link here?
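
For reference, here is a minimal sketch of that check. The header path below is an assumption (on many installs it is /usr/local/cuda/include/cudnn.h); adjust it to wherever cuDNN lives on your system:

    # Minimal sketch: print the cuDNN version macros from cudnn.h.
    # The header location is an assumption; adjust it for your install.
    import re

    header_path = "/usr/local/cuda/include/cudnn.h"

    with open(header_path) as f:
        header = f.read()

    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + macro + r"\s+(\d+)", header)
        print(macro, match.group(1) if match else "not found")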

@zia-hasan
Author

Here it is:
#define CUDNN_MAJOR 5
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 10

Also terminal output:
https://pastebin.com/9D2983ex
Thanks so much

@aselle added the stat:awaiting tensorflower label and removed the stat:awaiting response label on Apr 2, 2017
@zenvendof

@crack00ns I just encountered this when I tried to free up the GT 750M by unplugging the external monitor, which switches the display to the Iris Pro. Running the ImageNet tutorial, it crashes as you described:

Total memory: 2.00GiB
Free memory: 1.72GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
W tensorflow/core/framework/op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

However, as soon as I enabled discrete graphics again by plugging in the external monitor, it started to work again. So I suggest you use gfxCardStatus to switch on the NVIDIA GPU before you run the code.

Given that macOS GPU support will be dropped from 1.1 onwards, I will move to a Linux desktop with a GTX 1080 Ti soon.

@zia-hasan
Author

Thank you so much. Using gfxCardStatus and forcing it to Discrete mode seems to work!

@gunan
Contributor

gunan commented Apr 4, 2017

Looks like the issue is resolved?
I will close this issue; please let me know if it is not resolved yet.

@gunan closed this as completed on Apr 4, 2017
@wangg12
Contributor

wangg12 commented Apr 8, 2017

@gunan I encountered a similar problem. How can I resolve it?

@zia-hasan
Author

@wangg12 I have a simple workaround for this. It is probably some memory-related issue. Get gfxCardStatus, force-switch to integrated graphics, and then switch back to discrete-only; this effectively resets the graphics card. Check the GPU memory with cuda-smi: the process frees up GPU memory. Run your code again and it should work. I assume you are using macOS.

@wangg12
Contributor

wangg12 commented Apr 8, 2017

@crack00ns No, I am using Ubuntu 14.04. I don't know what gfxCardStatus is.

@zia-hasan
Author

@wangg12 Ah, then try resetting the GPU with nvidia-smi. Check your GPU memory too. This is some memory-related bug/issue in my opinion.

@wangg12
Contributor

wangg12 commented Apr 8, 2017

Maybe this is related to the CUDA driver version. I tried the same code on another machine with a newer driver version and everything went fine. But I'm not 100% sure, because I can't update the driver to test it right now. Thanks anyway @crack00ns.

@zia-hasan
Author

You're welcome. It could be a driver issue too. Perhaps @gunan could provide better insight. The workaround works for me for now.

@mjp0

mjp0 commented May 1, 2017

I'm having this issue and can't figure out why, because Theano works. The interesting thing is that I have to run Theano with sudo, or pygpu can't find the cuDNN handle either. If I try to run this TF script with sudo, I crash immediately with the good old Library not loaded: @rpath/libcudnn.5.dylib error, which I can't seem to make go away no matter what tricks I try.

I'm on macOS 10.12 with a 1080 Ti, CUDA 8, and cuDNN 5.1.

The log is pretty much the same as above:

2017-05-01 13:58:31.475182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 8.84GiB
2017-05-01 13:58:31.475192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-05-01 13:58:31.475195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-05-01 13:58:31.475202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-05-01 13:59:08.504593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
---------------------------------
Run id: resnet_cifar10
Log directory: /tmp/tflearn_logs/
---------------------------------
Preprocessing... Calculating mean over all dataset (this may take long)...
Mean: [ 0.49139968  0.48215841  0.44653091] (To avoid repetitive computation, add it to argument 'mean' of `add_featurewise_zero_center`)
---------------------------------
Training samples: 50000
Validation samples: 10000
--
2017-05-01 13:59:34.352806: E tensorflow/stream_executor/cuda/cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-05-01 13:59:34.352824: E tensorflow/stream_executor/cuda/cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-05-01 13:59:34.352831: F tensorflow/core/kernels/conv_ops.cc:659] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

I tested the out-of-memory angle, but it doesn't seem to be the issue: if I set config.allow_soft_placement = True and config.gpu_options.allow_growth = True in the TF config and watch cuda-smi while it runs, I can see I have over 8 GB of memory left when it crashes.
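
For anyone trying the same settings, this is roughly what that configuration looks like in TF 1.x (a minimal sketch; the model and session usage around it are omitted):

    # Minimal sketch of the options mentioned above: fall back to the CPU when
    # a GPU kernel is unavailable, and grow GPU memory on demand instead of
    # reserving it all up front.
    import tensorflow as tf

    config = tf.ConfigProto()
    config.allow_soft_placement = True
    config.gpu_options.allow_growth = True

    sess = tf.Session(config=config)
    # ... build and run the graph with this session ...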

@jagadeesr

On Ubuntu 16.04, calling 'nvidia-smi' fixed the problem. Thanks to @crack00ns.

@SunTiecheng

@jagadeesr How did you solve the problem? I used "sudo nvidia-smi -r -i 0" but it displays:
"GPU Reset is not supported on devices running as primary GPU.
Terminating early due to previous errors."
What should I do?
Thank you very much!

@jagadeesr

jagadeesr commented Jun 13, 2017

@SunTiecheng I was able to run 'nvidia-smi' without sudo. After running this command, it worked for me. No arguments were passed to the command.

@SunTiecheng

@jagadeesr It still doesn't work for me, but thank you very much!

@archenroot

I faced this issue too, and in my case it was resolved not by resetting the device (via nvidia-smi --gpu-reset --id=0, for example) but by disabling CNMeM.

I work with Theano, where CNMeM is enabled via an entry in ~/.theanorc like the following:

    [lib]
    cnmem = 1

Removing this entry resolved this (or a very similar) issue in my case.

Unfortunately TensorFlow uses its own memory management and doesn't use the CNMeM library delivered by NVIDIA, so I don't know how to configure this here.

As far as I understand, there is no externally configurable GPU memory manager in TensorFlow; you can only tune GPU usage directly in code.

Note: my knowledge of TensorFlow is limited; I am just pointing out what I discovered and how I resolved the issue with Theano on a mobile GPU.

@delton137

delton137 commented Aug 21, 2017

I am having the same issue:

Linux Mint 18.1 Serena
CUDA 8.0
libcudnn 5.1
tensorflow-gpu 1.3.0
keras (using tf as a backend)

 E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-08-21 16:53:38.788947: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-21 16:53:38.788956: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

Per another thread, this appears to be an issue with the GPU running out of memory. I have tried using this code snippet

    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = 0.1
    set_session(tf.Session(config=config))

but no luck. I have been checking nvidia-smi. I get the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0  On |                  N/A |
| 20%   30C    P8     9W / 250W |    427MiB / 11169MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1383    G   /usr/lib/xorg/Xorg                             271MiB |
|    0      1991    G   compton                                          3MiB |
|    0      2143    G   ...el-token=3B20FA9DA27556BEE46CF45A65B73A9B   107MiB |
|    0      8279    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    42MiB |
+-----------------------------------------------------------------------------+

@SimonWalsh1000

Was this issue solved? I too am having the same error with TF 1.3.0, cuDNN 6.0, CUDA 8.0, and Keras. Altering the GPU options didn't work. Many answers suggest checking for zombie processes running in the background and taking up GPU memory (I have none).

@delton137

delton137 commented Sep 5, 2017

@SimonWalsh1000 I am not sure if this helps, but I was able to fix it by nuking the conda environment I was working in, uninstalling CUDA, and then reinstalling everything with conda (just conda install tensorflow-gpu; it installs all the CUDA dependencies automatically). There must have been a compatibility issue somewhere between the NVIDIA driver, CUDA, cuDNN, and TF, but I'm not sure where.

Also, the issue may have been associated with the latest NVIDIA driver. You may want to uninstall your NVIDIA driver as well (apt-get remove nvidia-XXX); conda is smart enough to install that too. I am using nvidia-375.

@SimonWalsh1000

I think you may have needed cuDNN 6.0 for TF 1.3.

@GPrathap

I also had this issue. To verify whether it is a memory-related issue, you can try disabling the GPU and using only the CPU:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
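
A minimal sketch of that check; note that the environment variable has to be set before TensorFlow is imported, otherwise it has no effect:

    # Minimal sketch: hide all GPUs so the code runs on the CPU only.
    # CUDA_VISIBLE_DEVICES must be set before tensorflow is imported.
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf

    with tf.Session() as sess:
        print(sess.run(tf.constant('CPU-only run')))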

@civilman628

Running this fixed the issue:

sudo rm -rf ~/.nv

@acinwinstack

acinwinstack commented Feb 27, 2018

I'm on Windows 10 and encountered this issue. Running "C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" doesn't solve it for me, and there isn't a .nv file/directory under my home directory to delete either.

Has anyone solved this in a Windows environment?

Below is the output of nvidia-smi.exe:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.54                 Driver Version: 385.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8    N/A /  N/A |     73MiB /  2048MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@cpdiku

cpdiku commented Apr 2, 2018

I had the same problem:
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

The solution for me was downgrading cuDNN from 7.1.2 to 7.0.5.

@sunriseXu

    with tf.Graph().as_default():
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
        sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,
                                                log_device_placement=False))
        with sess.as_default():
            ...  # run the model here

I lowered gpu_memory_fraction, and there is no problem.

@oukohou

oukohou commented Aug 15, 2018

Well, I had the same error; for me, simply rebooting Ubuntu 16.04 solved the problem.

@bohaohuang

A similar problem happened for me with dual GPUs, when running the code while GPU 0 was occupied.
I made only the second GPU visible by adding:

    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    os.environ['CUDA_VISIBLE_DEVICES'] = '1'

and it solved the problem in my case.

@longlimin

(Quoting @delton137's comment above.)

Thanks, that is useful.
