Crash: Could not create cuDNN handle when convnets are used #6698
Comments
I met exactly the same problem as you with CUDA 8 and TF r0.12.1.
@EncodeTS I just added a minimal reproducible example to my first post. Could you check if it reproduces the problem on your machine? On my machine, one convolutional layer works but not two convolutional layers, which led me to think that the problem might be caused by some resource limitations.
I can confirm that @ymfa's minimal example fails on macOS with an NVIDIA 750, but the same example works on Linux with a Titan X.
The minimal example works on my Ubuntu machine. It looks like the issue I encountered has a very low occurrence probability on my computer.
I'm encountering the same problem. The graph runs fine when forced to the CPU, but crashes on the GPU. Environment: macOS 10.12.2
Example: The minimal example provided by @ymfa both fails and succeeds on my setup. The following are three outputs that have been produced:
fail (×2)
pass
Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!
Not so fast, I see this crash too. MacBook Pro, GeForce 650, TF v1. Running via Jupyter kernels, which I have to restart frequently. Maybe this graphics card is just too weak? Seeing as the OP uses the same card: likely.
I have the same problem with a GTX 960M, cuDNN 5.1.5 and CUDA 8.0.44.
Have the same problem with CentOS and a Titan X.
Have the same problem with Ubuntu 14.04 and a GRID K520 (AWS g2.2xlarge).
Have the same problem on Windows 10 with cuDNN 5.1, CUDA 8, GTX 1060. The program works on the CPU version of TensorFlow but gets these same errors with the GPU version.
I had the same issue with a GTX 1060, Windows 8.1, CUDA 8.0.60, cuDNN 5.0. Upgraded to the latest stable tensorflow-gpu nightly build (currently http://ci.tensorflow.org/job/nightly-win/133/) and cuDNN 5.1. Problem solved.
Same issue here. I was having this issue with the software versions listed below, except TF was version 1.0.0. I then upgraded to TF 1.0.1. I ran the same program once and it worked; I then ran it again and it didn't work, producing the same error as before. tensorflow-gpu 1.0.1
Having the same problem with a GTX 650, Ubuntu 16.04, CUDA 8.0.61, TF 1.0.0.
Having the same issue with a GTX 1080 Ti, Windows 10, CUDA 8.0.61, cuDNN 5.1, TF 1.0.1.
I was able to get a program to work by limiting GPU usage. In my case, with a 3 GB GTX 1060 on Ubuntu 16.04, it works if I set the GPU option per_process_gpu_memory_fraction to 0.7 (see the sketch below). Anything higher and I get these errors:

E tensorflow/stream_executor/cuda/cuda_dnn.cc:397] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

It could be a case of bad error reporting by TensorFlow, since the message seems completely unrelated. Maybe it is a clue to getting this resolved in a better manner?
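For reference, a minimal sketch of the workaround described above, using the TF 1.x API (the 0.7 value is what worked on this commenter's 3 GB card; tune it for yours):

```python
import tensorflow as tf

# Cap TensorFlow at a fixed fraction of GPU memory instead of letting it
# grab nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.7
sess = tf.Session(config=config)
```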
@zheng-xq is there an obvious setup issue?
Same issue too. I'm on Windows 10, GTX1070, CUDA 8.0, cuDNN 5.1.
If it helps anyone: there are sometimes zombie processes left over which prevent TF from starting again properly, and these gave me this error. Killing them works around the issue.
Here is a bit more info on how I temporarily resolved it. I believe these issues are all related to GPU memory allocation and have nothing to do with the errors being reported. There were other errors before this indicating some sort of memory allocation problem, but the program continued to progress, eventually giving the cudnn errors that everyone is getting. The reason I believe it works sometimes is that if you use the GPU for other things besides TensorFlow, such as your primary display, the available memory fluctuates. Sometimes you can allocate what you need and other times you can't.

From the API, I think the default allocation is broken in some way that causes this erratic behavior, with certain situations working and others failing. I resolved the issue by changing the default behavior of TF to allocate a minimum amount of memory and grow as needed, as detailed in the webpage (see the sketch below). I also tried the alternate way, experimentally choosing a percentage that worked; in my case it ended up being about 0.7.

config = tf.ConfigProto()

Still no word from anyone on the TF team confirming this, but it is worth a shot to see if others can confirm similar behavior.
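The snippet above is cut off after its first line. A sketch of the grow-as-needed configuration the comment describes, using the TF 1.x API:

```python
import tensorflow as tf

config = tf.ConfigProto()
# Start with a small allocation and let TensorFlow grow it on demand,
# instead of reserving most of the GPU memory at startup.
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```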
I am also getting the same error.
I am on Windows 10, CUDA 8.0, cuDNN 5.1. Can anything be done to avoid these? I was able to run some other TensorFlow tests earlier and they worked fine (including the conv op), but now it doesn't work on this new test... @serans1 What zombie processes are you referring to? Please let me know if there is a workaround for this. Thank you! EDIT: This might have been a newbie mistake, but I will just mention it here in case someone else runs into the same issue. Sorry again for my mistake! I'm just at the beginning of playing around with this :)
The same problem; is there any solution to it?

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
I have exactly the same issue.
I had the error and I 'fixed' it by closing my multiple instances of Jupyter and closing other applications. I'm new to working with TensorFlow in general, so it's likely this only fixed my problem.
E tensorflow/stream_executor/cuda/cuda_dnn.cc:353] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

I had this issue with CUDA 10.1 + cuDNN 7.5 and TF 1.11 compiled from source with CUDA. The script I was trying to use needed a few lines inserted near the top, and then a few more later on (see the sketch below). With this done, there were a lot of "GPU out of memory" errors, but detection proceeds very quickly, as I suppose it should when we're using the GPU. Thanks for sharing!
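The two inserted snippets did not survive in this thread. For TF 1.x scripts of this era, a typical pair is a config object near the top and then a session built from it; a hypothetical reconstruction:

```python
import tensorflow as tf

# Near the top of the script: build a config that does not pre-allocate
# the whole GPU.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Later: create the session (or hand it to Keras) with that config.
sess = tf.Session(config=config)
tf.keras.backend.set_session(sess)
```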
I faced the same issue, and using the line below fixed it. Check here for details.
Actually, I'm working on Ubuntu 18.04, not macOS, but it makes sense that this might be caused by some resource limitation. I also faced the same issue on a GTX 1050 Ti (4 GB), but the issue went away when I ran the same architecture on a GTX 1080 Ti (11 GB). Though the environments are not entirely the same between the two systems, I tried my best to control for that by using a Docker container.
This problem is generally related to the CUDA version or GPU memory. If the former, the easiest way is to change your CUDA version with Anaconda; if the latter, you can find ways to solve it in the other answers.
If you are still getting this issue, try the following; it worked for me on TensorFlow 2 alpha (see the sketch below).
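The referenced snippet is not shown above. On TF 2.x, besides the memory-growth approach quoted a few comments below, a hard per-GPU memory cap can also avoid the failed allocation; a sketch assuming a single GPU and a 1 GB cap (the 1024 value is illustrative):

```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Must run before any op initializes the GPU.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
```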
I have a similar issue: CUDNN_STATUS_ALLOC_FAILED. The key is to write it immediately below `import tensorflow as tf`, which I wasn't doing; I had written it after all the imports (illustrated below).
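To illustrate the placement point with the TF 2.x memory-growth call (a sketch; the commenter's exact snippet is not quoted above):

```python
import tensorflow as tf

# Configure the GPU immediately after the import, before any other TF
# call touches the device; once the GPU has been initialized, changing
# memory growth raises a RuntimeError.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)
```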
Maybe your tensorflow-gpu version has problems; you should check your versions, try again and again, uninstall and reinstall: find the matching tensorflow-gpu version number, then uninstall and reinstall it.
I am getting the same error as well.
Please help
Great reply, worked for me!!
Changing the NVIDIA driver to 396+ solved the issue for me.
It has to do with the memory fraction available to load GPU resources to create the cudnn handle, also known as per_process_gpu_memory_fraction.
Use as small a fraction as fits in your memory. (In the code, I use 0.7; you can start with 0.3 or even smaller, then increase until you get the same error; that's your limit.) This should allow your GPU to create a cudnn handle for your TensorFlow code.
I was getting the following error with TensorFlow 2.0 in my conda environment:
So I added the following code to my CNN:
My output is now:
As everyone suggested, it is due to TensorFlow using all of the GPU/GPUs' memory. My CNN trains without error now.
That solved it for me, thanks!
This also resolved the issue for me. GeForce GTX 1050, CUDA 10.0. Note: this is the only thing I can find that works in TF 2.0 for now. Thanks!
This didn't make any difference for me... TF 2.0, RTX 2060, CUDA 10.1, cuDNN 7.6. This is with 16 GB RAM, 6 GB video memory, and a basic MNIST toy model with one conv layer. No memory problems, just a stack trace. No GPU problems at all with PyTorch, as usual.
In my case, I have two machines, both with an RTX 2080 Ti, TF 2.1, CUDA 10.1, cuDNN 7.6. One works; the other one raises the aforementioned error. Both machines have the same amount of RAM, 16 GB. There are hardware differences, though, like the CPU. But the problem only occurs when using the GPU.
Same platform, same problem.
If you are using the latest TensorFlow and Keras, try this (from here); it worked for me:

```python
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
```
This worked for me. Thanks!
@Samaritan1011001 your solution works for me, thanks a lot.
@Samaritan1011001 your solution works for me, too! Thanks! xD
Excuse me, how can I change the CUDA version directly inside Anaconda? Many thanks!
Tensorflow (GPU) was imported successfully, but when running a session that involves a convolutional neural network (CNN), Python crashes with the following message:
The problem persists on any combination of CUDA toolkit 7.5/8.0 and Tensorflow installed from pip/source. Test sessions that do not use CNNs are run successfully.
What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?
The issue is similar to #6586, where I first commented. But since I experience the problem on a Mac, I was suggested to open a separate issue.
Environment info
Operating System: macOS Sierra 10.12.2
Xcode version 8.2 (8C38) (When I later tried CUDA 7.5, I installed Command Line Tools version 7.3.1, because CUDA 7.5 lacked support for the more recent compilers.)
Python 3.5.2 (anaconda)
Installed version of CUDA: tried both 8.0 (initially) and 7.5 (reported here, toolkit only -- the driver is still 8.0)
Installed version of cuDNN: 5.1 (different installations according to CUDA versions)
(please attach the output of `ls -l /path/to/cuda/lib/libcud*`):

I tried both installing from pip and from source. I first installed the binary pip package tensorflow-gpu. The output of `python -c "import tensorflow; print(tensorflow.__version__)"` was 0.12.head.
Later I installed from source (the pip package was uninstalled):
The commit hash (`git rev-parse HEAD`): d67c09d98a576e1fbf2f3609ddb842e53890f31c
The output of `bazel version`:
Build label: 0.4.3-homebrew
Build target: bazel-out/local-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Thu Dec 22 15:20:15 2016 (1482420015)
Build timestamp: 1482420015
Build timestamp as int: 1482420015
If possible, provide a minimal reproducible example
I made a minimal example by simplifying the network and reducing the training data to only twenty images and two classes for classification. issue.zip contains the Python code and the data. I wrote two convolutional layers because I found that the network with only one convolutional layer runs without problems; an illustrative sketch follows.
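(The actual code and data are in the attached issue.zip. Purely to illustrate the shape of the failing case, here is a hypothetical graph of the kind described, with made-up layer sizes, in the r0.12-era API:)

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
# A single convolutional layer reportedly runs fine...
w1 = tf.Variable(tf.truncated_normal([5, 5, 1, 16], stddev=0.1))
h = tf.nn.relu(tf.nn.conv2d(x, w1, strides=[1, 1, 1, 1], padding='SAME'))
# ...while adding a second one triggers the cuDNN handle crash on the
# affected machines.
w2 = tf.Variable(tf.truncated_normal([5, 5, 16, 32], stddev=0.1))
h = tf.nn.conv2d(h, w2, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(h, feed_dict={x: np.zeros((4, 28, 28, 1), np.float32)})
```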
Complete log using CUDA 7.5 and Tensorflow compiled from source
Complete log using CUDA 8.0 and Tensorflow installed from pip