
tensorflow-gpu pip package is not compatible with cuda9 docker image #17566

Closed · rongou opened this issue Mar 8, 2018 · 34 comments

rongou commented Mar 8, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary):
    binary (pip install tensorflow-gpu)
  • TensorFlow version (use command below):
    1.6.0
  • Python version:
    2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
    CUDA 9, cuDNN 7
  • GPU model and memory:
  • Exact command to reproduce:
    I was trying to build a horovod image, but this would affect anyone using the nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 base image:
docker build -t horovod https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
docker run -it --rm horovod python tensorflow_mnist.py

Describe the problem

When building a docker image based on nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 and doing a pip install tensorflow-gpu==1.6.0, the resulting image causes a crash because the base image contains cuDNN 7.1, while the tensorflow-gpu pip package was built against cuDNN 7.0.
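To confirm which cuDNN the base image actually ships (a quick check; it assumes the image's package database lists libcudnn7, which it does for the nvidia/cuda cudnn images):

# Prints the installed libcudnn7 package version (a 7.1.x build at the time of this issue)
docker run --rm nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 dpkg -l libcudnn7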

Source code / logs

Error messages:

2018-03-08 17:46:50.845206: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-08 17:46:50.845868: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

@flx42

flx42 (Contributor) commented Mar 8, 2018

@gunan do you remember why you have a check on major/minor cuDNN version instead of just major version?

gunan (Contributor) commented Mar 8, 2018

@zheng-xq @martinwicke do you know why we check for cuDNN minor version?

wdma commented Mar 9, 2018

Hello,

I had to rebuild my computer and am now experiencing one of the errors described in the original post (see below). Is there a recommended workaround?

Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.

flx42 (Contributor) commented Mar 9, 2018

If you use docker, you should pin the version of cuDNN you are installing. For instance:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/Dockerfile.devel-gpu#L16-L17
But that means not using the cuDNN official images, but a regular CUDA one.

If you are not using docker, you can still downgrade cuDNN with a similar command.
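For the non-docker case, a rough sketch of that downgrade (the version string is the one used elsewhere in this thread; this assumes cuDNN was installed from NVIDIA's apt repository rather than from a tarball):

# Pin cuDNN back to the 7.0 build the TF 1.6 binaries were compiled against
sudo apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0
# Optionally keep apt from upgrading it back to 7.1
sudo apt-mark hold libcudnn7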

wdma commented Mar 9, 2018

Thank you for your reply.

I had just solved it by updating TensorFlow. Type "pip install update".

adampl commented Mar 11, 2018

@wdma How could you solve it by upgrading TF? I'm getting this error on the latest (1.6) TF.

Luke035 commented Mar 11, 2018

+1

wdma commented Mar 12, 2018

@adampl Installing tensorflow per these instructions (https://www.tensorflow.org/install/) generates the above error. Typing "pip install update" fixes it. I hope this helps!

adampl commented Mar 12, 2018

I ended up doing what @flx42 advised (#17566 (comment)) though it's not a perfect solution.

rongou (Author) commented Mar 12, 2018

If you use docker, I think you have 3 options:

  • Use the cuda base image (e.g. nvidia/cuda:9.0-devel-ubuntu16.04; note this doesn't have cuDNN), and install cuDNN 7.0 yourself, as I've done for horovod (Pin Dockerfile to a specific cuDNN version. horovod/horovod#206).
  • Use the cuda+cudnn base image (e.g. nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04), but downgrade cuDNN to 7.0. You need to do apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0 (a Dockerfile sketch of this option follows below).
  • Use Tensorflow's docker image (tensorflow/tensorflow:1.6.0-gpu) as base.

If you don't use docker, just make sure your machine has cuDNN 7.0, not 7.1.
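A minimal Dockerfile sketch of the second option (not an official image; the package pins are copied from the command above, and libcudnn7-dev is pinned as well because the devel image ships it):

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# The base image comes with cuDNN 7.1; downgrade to the 7.0 build expected by the TF 1.6 pip package.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --no-install-recommends \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libcudnn7-dev=7.0.5.15-1+cuda9.0 && \
    apt-mark hold libcudnn7 libcudnn7-dev
# ...then install python-pip and tensorflow-gpu==1.6.0 as usual.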

Luke035 added a commit to Luke035/nvidia-anaconda-docker that referenced this issue Mar 13, 2018
Add downgrade option for cuDNN, workaround for tensorflow/tensorflow#17566
Luke035 commented Mar 13, 2018

@rongou I implemented your second suggestion in my Dockerfile and I've been able to run TF 1.6 along with Keras 2.1.5 on top of the nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04 base image.
The only thing I had to do was add a RUN layer to my Dockerfile that executes "apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0".

flx42 (Contributor) commented Mar 13, 2018

@zheng-xq @martinwicke let me know if there is a problem with cuDNN that warrants this strict check!

ghost commented Mar 13, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): pip3 install --upgrade tensorflow-gpu
  • TensorFlow version (use command below): 1.6.0
  • Python version: Python 3.5.2
  • Bazel version (if compiling from source): no
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
  • CUDA/cuDNN version: CUDA 9, cuDNN v7.1.1 Library for Linux
  • GPU model and memory: 1070

Hello, here is my experience installing CUDA and running into the errors below.

2018-03-13 10:19:33.118216: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-13 10:19:33.118929: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted

I installed the cuDNN v7.0.4 Library for Linux (the oldest version for CUDA 9.0) (link) as follows:
tar xzvf cudnn-9.0-linux-x64-v7.tgz
sudo cp cuda/lib64/* /usr/local/cuda-9.0/lib64/
sudo cp cuda/include/* /usr/local/cuda-9.0/include/
sudo chmod a+r /usr/local/cuda-9.0/lib64/libcudnn*
sudo chmod a+r /usr/local/cuda-9.0/include/cudnn.h

No Errors, works well.
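As a sanity check, cudnn.h reports its own version through preprocessor macros, so the installed version can be verified with:

grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda-9.0/include/cudnn.h
# Expect CUDNN_MAJOR 7, CUDNN_MINOR 0, CUDNN_PATCHLEVEL 4 after the steps above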

tfboyd (Member) commented Mar 13, 2018

@flx42 I am tracking this down tomorrow. The nightly docker is also broken due to the same issue, which means I cannot run nightly tests, so I have some direct motivation. In summary of this thread I see two issues, maybe more:

  1. Does the check need to be so strict?
  2. Why is the nightly docker broken? I have a nightly perf test and should have caught this earlier, but that is not the purpose of my perf tests and I was distracted.

My suggestion is to move cuDNN forward for the next release if possible (e.g. 1.7, unless it is too late), and at a minimum switch the nightly as well as follow up on the version check.

tfboyd self-assigned this Mar 13, 2018

gunan (Contributor) commented Mar 13, 2018

I think the biggest problem is the "latest" nvidia-docker images are cuda 9.1, cudnn 7.1.
And our builds look for cuda 9 and cudnn 7.
In our nightlies and tests, the fix is to avoid using the "latest" nvidia docker images.

Also, it is too late to change anything for 1.7. RC0 is almost out.

tfboyd (Member) commented Mar 13, 2018

The nightly docker we release is broken. I ran into the problem trying to figure out why my perf tests stopped running. We can at least pin the nightly docker image until the 1.8 nightlies. I do not think we need to move to 7.1 per se, as that would also break people, but I will ask XQ tomorrow about major/minor versions to see what he thinks, and Martin as well. I know you can, but just to help track stuff down.

That unblocks me and anyone using nightly docker. I cannot believe you were awake to see my message. :-)

gunan (Contributor) commented Mar 13, 2018

Thanks for following up on this.
It is a good idea to pin the nightly docker image bases, and going forward, I wonder if we can remove minor version checks on both CUDA and cuDNN.

flx42 (Contributor) commented Mar 13, 2018

I think the biggest problem is the "latest" nvidia-docker images are cuda 9.1, cudnn 7.1.

What do you mean? Which dockerfile is that?

gunan (Contributor) commented Mar 13, 2018

Sorry, you are right.
Not the "latest" image, but rather that all cuDNN 7.x images are tagged "cudnn7" without a minor version specification.

flx42 (Contributor) commented Mar 13, 2018

If there is a good reason to have this check on the minor version (e.g. an incompatibility despite the SONAME), I might split future images with the cuDNN minor version.

martinwicke (Member) commented:

I believe that we have run into incompatibilities between some minor versions of CUDA, but I'm not sure whether we've ever seen that in cuDNN (@jlebar are there details?)

jlebar (Contributor) commented Mar 13, 2018

@martinwicke, yeah, e.g. CUDA 9.0 and CUDA 9.1 are quite different in the respects we care about.

For cudnn, I have not seen a statement specifying their level of backwards compatibility. I would naively expect that if you build against cudnn x.y and run with cudnn x.z for z >= y, it probably will work. But to be comfortable with blessing that I'd want a statement in writing from nvidia. (Perhaps such a statement already exists.)

Whether or not it should be a fatal error in TF vs a "good luck, you're on your own" warning (like we do for known-broken ptxas versions), I don't have an opinion on.

flx42 (Contributor) commented Mar 13, 2018

CUDA toolkit libraries and cuDNN have different SONAME, so that's actually expected.

$ objdump -p  /usr/lib/x86_64-linux-gnu/libcudnn.so.7 | grep SONAME
  SONAME               libcudnn.so.7

$ objdump -p  /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcublas.so.9.1 | grep SONAME
  SONAME               libcublas.so.9.1

cliffwoolley (Contributor) commented:

For cudnn, I have not seen a statement specifying their level of backwards compatibility. I would naively expect that if you build against cudnn x.y and run with cudnn x.z for z >= y, it probably will work. But to be comfortable with blessing that I'd want a statement in writing from nvidia. (Perhaps such a statement already exists.)

Statement from NVIDIA:

Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:

  • Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z).
  • cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or earlier patch release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x).

(Note that this compatibility was not necessarily guaranteed in prior cuDNN major releases.)

--Cliff Woolley
Director, DL Frameworks Engineering, NVIDIA

tfboyd (Member) commented Mar 13, 2018

Update: I caught up with the build people on @gunan's team and the nightly Docker will be fixed. I need to talk to a few more people, but I think we should consider a warning for a minor version difference and a fatal error for a major version difference. I have some concerns that there might be feature differences in point releases. I am doing research and talking with people; it will likely take a few days. I or someone will have a final statement.

b/74600152

cliffwoolley (Contributor) commented:

We will also update the cuDNN docs to say the same as what I posted above. Thanks for pointing out the omission.

tfboyd (Member) commented Mar 13, 2018

@cliffwoolley Thank you. I am opening an internal issue and looking for someone to update the code to match your statement in cuDNN.

tfboyd (Member) commented Mar 13, 2018

Last update until this is done or a change is in progress. I found someone to make the changes to match Cliff's cuDNN version-policy update. Will post when complete; I would not expect it to take very long. Team effort; I just get the honor of updating the GitHub issue. :-)

flx42 (Contributor) commented Mar 13, 2018

@gunan Should we modify Dockerfile.gpu to make it more similar to Dockerfile.gpu-devel? That is to say, FROM nvidia/cuda:9.0-base-ubuntu16.04, select the CUDA packages you need, and pin the cuDNN version.
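Roughly what that would look like (a sketch only; the exact list of CUDA runtime packages is an assumption, not the final Dockerfile):

FROM nvidia/cuda:9.0-base-ubuntu16.04
# Install only the CUDA 9.0 runtime pieces TensorFlow needs, plus a pinned cuDNN
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        libcudnn7=7.0.5.15-1+cuda9.0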

gunan (Contributor) commented Mar 14, 2018

How much space would we save by doing that?
It may still be worth it just to fix the cuDNN issues.

flx42 (Contributor) commented Mar 14, 2018

400 MB from a quick test, 300 MB if I re-add CUPTI. So it seems worth it, and it's always better to pin the version of a key dependency like cuDNN. It's better for reproducibility.

gunan (Contributor) commented Mar 14, 2018 via email

frankchn pushed a commit that referenced this issue Mar 15, 2018
Related: #17566
Fixes: #17431

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
Kjos commented Mar 20, 2018

I'm having the same error with pip. I guess I'll try the Docker image then.

StanislawAntol pushed a commit to StanislawAntol/tensorflow that referenced this issue Mar 23, 2018
Related: tensorflow#17566
Fixes: tensorflow#17431

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
tfboyd (Member) commented Apr 2, 2018

Fixed in nightly builds. We are now checking according to Cliff's update in the nightly builds, and I am going to guess this lands in TF 1.8 (definitely not 1.7), since 1.8 has not been branched yet:

Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:

  • Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z).
  • cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or earlier patch release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x).
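In shell terms, the relaxed check amounts to something like this (a sketch of the policy above, not the actual TensorFlow code):

# compiled_* = cuDNN version TF was built against; runtime_* = version of the library actually loaded
compiled_major=7; compiled_minor=0
runtime_major=7;  runtime_minor=1
if [ "$runtime_major" -ne "$compiled_major" ]; then
    echo "FATAL: cuDNN major version mismatch"
elif [ "$runtime_minor" -lt "$compiled_minor" ]; then
    echo "FATAL: runtime cuDNN minor version is older than the build version"
else
    echo "OK: binary backward-compatible under the cuDNN >= 7 policy"    # e.g. built with 7.0, running 7.1
fi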
