
could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR #8879

Closed
zia-hasan opened this issue Mar 31, 2017 · 32 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), type:support (Support issues)

Comments

@zia-hasan

zia-hasan commented Mar 31, 2017

Hi,
I installed the TensorFlow 1.0.1 GPU version on my MacBook Pro with a GeForce GT 750M, along with CUDA 8.0.71 and cuDNN 5.1. I am running a TF script that works fine with the CPU-only TensorFlow, but with the GPU version I get this error (once in a while it works, too).

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

What is happening here? Is there a bug in TensorFlow? Please advise.

Thanks

@aselle
Contributor

aselle commented Apr 1, 2017

@gunan may have insight, but in general macOS support for NVIDIA GPUs is relatively poor, so it is difficult for us to support them.

@aselle added the stat:awaiting tensorflower and type:support labels on Apr 1, 2017
@gunan
Contributor

gunan commented Apr 2, 2017

Looks like previous instances of the same error messages were usually cuDNN version mismatches.
Maybe we have a cuDNN patch version mismatch.
What is your full cuDNN version? I will check our build machines on Monday, then we can compare.

If this is caused by the cuDNN version, you may need to build TF from source.

@zia-hasan
Author

My cuDNN version is 5.1 (macOS). I noticed that when more GPU memory is available, the command-line version of the program works, but the Jupyter notebook crashes with this error.

@gunan added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Apr 2, 2017
@gunan
Contributor

gunan commented Apr 2, 2017

5.1 is only the major and minor cuDNN version. There is also a third integer, the patch version. You can check it by looking at your cudnn.h header file.

I feel like there is more we can get from your logs.
Could you paste your full terminal output to pastebin and share its link here?
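
For reference, here is a minimal sketch of that check. The header path below is an assumption (on many installs it is /usr/local/cuda/include/cudnn.h); adjust it to wherever cuDNN lives on your system:

    # Minimal sketch: print the cuDNN version macros from cudnn.h.
    # The header location is an assumption; adjust it for your install.
    import re

    header_path = "/usr/local/cuda/include/cudnn.h"

    with open(header_path) as f:
        header = f.read()

    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+" + macro + r"\s+(\d+)", header)
        print(macro, match.group(1) if match else "not found")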

@zia-hasan
Author

Here it is:
#define CUDNN_MAJOR 5
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 10

Also terminal output:
https://pastebin.com/9D2983ex
Thanks so much

@aselle added the stat:awaiting tensorflower label and removed the stat:awaiting response label on Apr 2, 2017
@zenvendof

@crack00ns I just encountered this when I tried to free up the GT 750M by unplugging the external monitor, which switches the display to the Iris Pro. Running the ImageNet tutorial, it crashes as you described:

Total memory: 2.00GiB
Free memory: 1.72GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 750M, pci bus id: 0000:01:00.0)
W tensorflow/core/framework/op_def_util.cc:332] Op BatchNormWithGlobalNormalization is deprecated. It will cease to work in GraphDef version 9. Use tf.nn.batch_normalization().
E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
E tensorflow/stream_executor/cuda/cuda_dnn.cc:364] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

However, as soon as I enabled discrete graphics again by plugging in the external monitor, it started to work again. So I suggest you use gfxCardStatus to switch on the NVIDIA GPU before you run the code.

Given that macOS GPU support will be dropped from 1.1 onwards, I will move to a Linux desktop with a GTX 1080 Ti soon.

@zia-hasan
Author

Thank you so much. Using gfxCardStatus and forcing it to Discrete mode seems to work!

@gunan
Contributor

gunan commented Apr 4, 2017

Looks like the issue is resolved?
I will close this issue; please let me know if it is not resolved yet.

@gunan closed this as completed on Apr 4, 2017
@wangg12
Contributor

wangg12 commented Apr 8, 2017

@gunan I encountered a similar problem. How can I resolve it?

@zia-hasan
Author

@wangg12 I have a simple workaround for this. It is probably some memory-related issue. Get gfxCardStatus, force-switch to integrated graphics, and then switch back to discrete-only; this effectively resets the graphics card. Check the GPU memory with cuda-smi: the process frees up GPU memory. Run your code again and it should work. I assume you are using macOS.

@wangg12
Contributor

wangg12 commented Apr 8, 2017

@crack00ns No, I am using Ubuntu 14.04. I don't know what gfxCardStatus is.

@zia-hasan
Author

@wangg12 Ah, then try resetting the GPU with nvidia-smi. Check your GPU memory too. This is some memory-related bug/issue in my opinion.

@wangg12
Contributor

wangg12 commented Apr 8, 2017

Maybe this is related to the CUDA driver version. I tried the same code on another machine with a newer driver version and everything went fine. But I'm not 100% sure, because I can't update the driver to test it right now. Thanks anyway @crack00ns.

@zia-hasan
Author

You're welcome. It could be a driver issue too. Perhaps @gunan could provide better insight. The workaround works for me for now.

@mjp0

mjp0 commented May 1, 2017

I'm having this issue and can't figure out why, because Theano works. The interesting thing is that I have to run Theano with sudo, or pygpu can't find the cuDNN handle either. If I try to run this TF script with sudo, I crash immediately with the good old Library not loaded: @rpath/libcudnn.5.dylib error, which I can't seem to make go away no matter what tricks I try.

I'm on macOS 10.12 with a 1080 Ti, CUDA 8, and cuDNN 5.1.

The log is pretty much the same as above:

2017-05-01 13:58:31.475182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:01:00.0
Total memory: 11.00GiB
Free memory: 8.84GiB
2017-05-01 13:58:31.475192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-05-01 13:58:31.475195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-05-01 13:58:31.475202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
2017-05-01 13:59:08.504593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
---------------------------------
Run id: resnet_cifar10
Log directory: /tmp/tflearn_logs/
---------------------------------
Preprocessing... Calculating mean over all dataset (this may take long)...
Mean: [ 0.49139968  0.48215841  0.44653091] (To avoid repetitive computation, add it to argument 'mean' of `add_featurewise_zero_center`)
---------------------------------
Training samples: 50000
Validation samples: 10000
--
2017-05-01 13:59:34.352806: E tensorflow/stream_executor/cuda/cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-05-01 13:59:34.352824: E tensorflow/stream_executor/cuda/cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-05-01 13:59:34.352831: F tensorflow/core/kernels/conv_ops.cc:659] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

I tested the out-of-memory angle, but it doesn't seem to be the issue: if I set config.allow_soft_placement = True and config.gpu_options.allow_growth = True in the TF config and watch cuda-smi while it runs, I can see I have over 8 GB of memory left when it crashes.
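
For anyone trying the same settings, this is roughly what that configuration looks like in TF 1.x (a minimal sketch; the model and session usage around it are omitted):

    # Minimal sketch of the options mentioned above: fall back to the CPU when
    # a GPU kernel is unavailable, and grow GPU memory on demand instead of
    # reserving it all up front.
    import tensorflow as tf

    config = tf.ConfigProto()
    config.allow_soft_placement = True
    config.gpu_options.allow_growth = True

    sess = tf.Session(config=config)
    # ... build and run the graph with this session ...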

@jagadeesr

On Ubuntu 16.04, calling 'nvidia-smi' fixed the problem. Thanks to @crack00ns.

@SunTiecheng

@jagadeesr How did you solve the problem? I used "sudo nvidia-smi -r -i 0" but it displays:
"GPU Reset is not supported on devices running as primary GPU.
Terminating early due to previous errors."
What should I do?
Thank you very much!

@jagadeesr

jagadeesr commented Jun 13, 2017

@SunTiecheng I was able to run 'nvidia-smi' without sudo. After running this command, it worked for me. No arguments were passed to the command.

@SunTiecheng

@jagadeesr It still doesn't work for me, but thank you very much!

@archenroot

I faced this issue too, and in my case it was resolved not by resetting the device (via nvidia-smi --gpu-reset --id=0, for example) but by disabling CNMeM.

I work with Theano, where CNMeM is enabled via an entry in ~/.theanorc like the following:

    [lib]
    cnmem = 1

Removing this entry resolved this (or a very similar) issue in my case.

Unfortunately TensorFlow uses its own memory management and doesn't use the CNMeM library delivered by NVIDIA, so I don't know how to configure this here.

As far as I understand, there is no externally configurable GPU memory manager in TensorFlow; you can only tune GPU usage directly in code.

Note: my knowledge of TensorFlow is limited; I am just pointing out what I discovered and how I resolved the issue with Theano on a mobile GPU.

@delton137

delton137 commented Aug 21, 2017

I am having the same issue:

Linux Mint 18.1 Serena
CUDA 8.0
libcudnn 5.1
tensorflow-gpu 1.3.0
keras (using tf as a backend)

 E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-08-21 16:53:38.788947: E tensorflow/stream_executor/cuda/cuda_dnn.cc:338] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-08-21 16:53:38.788956: F tensorflow/core/kernels/conv_ops.cc:672] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

Per another thread, this appears to be an issue with the GPU running out of memory. I have tried using this code snippet

    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = 0.1
    set_session(tf.Session(config=config))

but no luck. I have been checking nvidia-smi. I get the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0  On |                  N/A |
| 20%   30C    P8     9W / 250W |    427MiB / 11169MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1383    G   /usr/lib/xorg/Xorg                             271MiB |
|    0      1991    G   compton                                          3MiB |
|    0      2143    G   ...el-token=3B20FA9DA27556BEE46CF45A65B73A9B   107MiB |
|    0      8279    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    42MiB |
+-----------------------------------------------------------------------------+

@SimonWalsh1000

Was this issue solved? I too am having the same error with TF 1.3.0, cuDNN 6.0, CUDA 8.0, and Keras. Altering the GPU options didn't work. Many answers suggest checking for zombie processes running in the background and taking up GPU memory (I have none).

@delton137

delton137 commented Sep 5, 2017

@SimonWalsh1000 I am not sure if this helps, but I was able to fix it by nuking the conda environment I was working in, uninstalling CUDA, and then reinstalling everything with conda (just conda install tensorflow-gpu; it installs all the CUDA dependencies automatically). There must have been a compatibility issue somewhere between the NVIDIA driver, CUDA, cuDNN, and TF, but I'm not sure where.

Also, the issue may have been associated with the latest NVIDIA driver. You may want to uninstall your NVIDIA driver as well (apt-get remove nvidia-XXX); conda is smart enough to install that too. I am using nvidia-375.

@SimonWalsh1000

I think you may have needed cuDNN 6.0 for TF 1.3.

@GPrathap

I also had this issue. To verify whether it is a memory-related issue, you can try disabling the GPU and using only the CPU:
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
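
A minimal sketch of that check; note that the environment variable has to be set before TensorFlow is imported, otherwise it has no effect:

    # Minimal sketch: hide all GPUs so the code runs on the CPU only.
    # CUDA_VISIBLE_DEVICES must be set before tensorflow is imported.
    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf

    with tf.Session() as sess:
        print(sess.run(tf.constant('CPU-only run')))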

@civilman628

Running this fixed the issue:

sudo rm -rf ~/.nv

@acinwinstack

acinwinstack commented Feb 27, 2018

I'm on Windows 10 and encountered this issue. Running "C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" doesn't solve it for me, and there isn't a .nv file/directory under my home directory to delete either.

Has anyone solved this in a Windows environment?

Below is the output of nvidia-smi.exe:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.54                 Driver Version: 385.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1050   WDDM  | 00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8    N/A /  N/A |     73MiB /  2048MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

@cpdiku

cpdiku commented Apr 2, 2018

I had the same problem:
F tensorflow/core/kernels/conv_ops.cc:605] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)

The solution for me was downgrading cuDNN from 7.1.2 to 7.0.5.

@sunriseXu

    with tf.Graph().as_default():
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
        sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,
                                                log_device_placement=False))
        with sess.as_default():
            ...  # run the model here

I lowered gpu_memory_fraction, and there is no problem.

@oukohou

oukohou commented Aug 15, 2018

Well, I had the same error; for me, simply rebooting Ubuntu 16.04 solved the problem.

@bohaohuang

A similar problem happened for me with dual GPUs, when running the code while GPU 0 was occupied.
I made only the second GPU visible by adding:

    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    os.environ['CUDA_VISIBLE_DEVICES'] = '1'

and it solved the problem in my case.

@longlimin

(Quoting @delton137's comment above.)

Thanks, that is useful.
