
tensorflow-gpu pip package is not compatible with cuda9 docker image #17566

Closed · rongou opened this issue Mar 8, 2018 · 34 comments

rongou commented Mar 8, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary):
    binary (pip install tensorflow-gpu)
  • TensorFlow version (use command below):
    1.6.0
  • Python version:
    2.7
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
    CUDA 9, cuDNN 7
  • GPU model and memory:
  • Exact command to reproduce:
    I was trying to build a horovod image, but this would affect anyone using the nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 base image:
docker build -t horovod https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
docker run -it --rm horovod python tensorflow_mnist.py

Describe the problem

When building a docker image based on nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 and doing a pip install tensorflow-gpu==1.6.0, the resulting image causes a crash because the base image contains cuDNN 7.1, while the tensorflow-gpu pip package was built against cuDNN 7.0.
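To confirm which cuDNN the base image actually ships (a quick check; it assumes the image's package database lists libcudnn7, which it does for the nvidia/cuda cudnn images):

# Prints the installed libcudnn7 package version (a 7.1.x build at the time of this issue)
docker run --rm nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 dpkg -l libcudnn7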

Source code / logs

Error messages:

2018-03-08 17:46:50.845206: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-08 17:46:50.845868: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 

@flx42

flx42 (Contributor) commented Mar 8, 2018

@gunan do you remember why you have a check on major/minor cuDNN version instead of just major version?

gunan (Contributor) commented Mar 8, 2018

@zheng-xq @martinwicke do you know why we check for cuDNN minor version?

wdma commented Mar 9, 2018

Hello,

I had to rebuild my computer and am now experiencing one of the errors described in the original post (see below). Is there a recommended workaround?

Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.

flx42 (Contributor) commented Mar 9, 2018

If you use docker, you should pin the version of cuDNN you are installing. For instance:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/Dockerfile.devel-gpu#L16-L17
But that means not using the cuDNN official images, but a regular CUDA one.

If you are not using docker, you can still downgrade cuDNN with a similar command.
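For the non-docker case, a rough sketch of that downgrade (the version string is the one used elsewhere in this thread; this assumes cuDNN was installed from NVIDIA's apt repository rather than from a tarball):

# Pin cuDNN back to the 7.0 build the TF 1.6 binaries were compiled against
sudo apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0
# Optionally keep apt from upgrading it back to 7.1
sudo apt-mark hold libcudnn7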

wdma commented Mar 9, 2018

Thank you for your reply.

I had just solved it by updating TensorFlow. Type "pip install update".

adampl commented Mar 11, 2018

@wdma How could you solve it by upgrading TF? I'm getting this error on the latest (1.6) TF.

Luke035 commented Mar 11, 2018

+1

wdma commented Mar 12, 2018

@adampl Installing tensorflow per these instructions (https://www.tensorflow.org/install/) generates the above error. Typing "pip install update" fixes it. I hope this helps!

adampl commented Mar 12, 2018

I ended up doing what @flx42 advised (#17566 (comment)) though it's not a perfect solution.

rongou (Author) commented Mar 12, 2018

If you use docker, I think you have 3 options:

  • Use the cuda base image (e.g. nvidia/cuda:9.0-devel-ubuntu16.04; note this doesn't have cuDNN), and install cuDNN 7.0 yourself, as I've done for horovod (Pin Dockerfile to a specific cuDNN version. horovod/horovod#206).
  • Use the cuda+cudnn base image (e.g. nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04), but downgrade cuDNN to 7.0. You need to do apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0 (a Dockerfile sketch of this option follows below).
  • Use Tensorflow's docker image (tensorflow/tensorflow:1.6.0-gpu) as base.

If you don't use docker, just make sure your machine has cuDNN 7.0, not 7.1.
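A minimal Dockerfile sketch of the second option (not an official image; the package pins are copied from the command above, and libcudnn7-dev is pinned as well because the devel image ships it):

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04
# The base image comes with cuDNN 7.1; downgrade to the 7.0 build expected by the TF 1.6 pip package.
RUN apt-get update && \
    apt-get install -y --allow-downgrades --no-install-recommends \
        libcudnn7=7.0.5.15-1+cuda9.0 \
        libcudnn7-dev=7.0.5.15-1+cuda9.0 && \
    apt-mark hold libcudnn7 libcudnn7-dev
# ...then install python-pip and tensorflow-gpu==1.6.0 as usual.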

Luke035 added a commit to Luke035/nvidia-anaconda-docker that referenced this issue Mar 13, 2018
Add downgrade option for cuDNN, workaround for tensorflow/tensorflow#17566
Luke035 commented Mar 13, 2018

@rongou I implemented your second suggestion in my Dockerfile and I've been able to run TF 1.6 along with Keras 2.1.5 on top of the nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04 base image.
The only thing I had to do was add a RUN layer to my Dockerfile that executes "apt-get install --allow-downgrades libcudnn7=7.0.5.15-1+cuda9.0".

flx42 (Contributor) commented Mar 13, 2018

@zheng-xq @martinwicke let me know if there is a problem with cuDNN that warrants this strict check!

ghost commented Mar 13, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • TensorFlow installed from (source or binary): pip3 install --upgrade tensorflow-gpu
  • TensorFlow version (use command below): 1.6.0
  • Python version: Python 3.5.2
  • Bazel version (if compiling from source): no
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
  • CUDA/cuDNN version: CUDA 9, cuDNN v7.1.1 Library for Linux
  • GPU model and memory: 1070

Hello, here is my experience installing CUDA and running into the errors below.

2018-03-13 10:19:33.118216: E tensorflow/stream_executor/cuda/cuda_dnn.cc:378] Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7004 (compatibility version 7000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-13 10:19:33.118929: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
Aborted

I installed the cuDNN v7.0.4 Library for Linux (the oldest version for CUDA 9.0) (link) as follows:
tar xzvf cudnn-9.0-linux-x64-v7.tgz
sudo cp cuda/lib64/* /usr/local/cuda-9.0/lib64/
sudo cp cuda/include/* /usr/local/cuda-9.0/include/
sudo chmod a+r /usr/local/cuda-9.0/lib64/libcudnn*
sudo chmod a+r /usr/local/cuda-9.0/include/cudnn.h

No Errors, works well.
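As a sanity check, cudnn.h reports its own version through preprocessor macros, so the installed version can be verified with:

grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda-9.0/include/cudnn.h
# Expect CUDNN_MAJOR 7, CUDNN_MINOR 0, CUDNN_PATCHLEVEL 4 after the steps above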

tfboyd (Member) commented Mar 13, 2018

@flx42 I am tracking this down tomorrow. The nightly docker is also broken due to the same issue, which means I cannot run nightly tests, so I have some direct motivation. In summary of this thread I see two issues, maybe more:

  1. Does the check need to be so strict?
  2. Why is the nightly docker broken? I have a nightly perf test and should have caught this earlier, but that is not the purpose of my perf tests and I was distracted.

My suggestion is to move cuDNN forward for the next release if possible (e.g. 1.7, unless it is too late), and at a minimum switch the nightly as well as follow up on the version check.

tfboyd self-assigned this Mar 13, 2018

gunan (Contributor) commented Mar 13, 2018

I think the biggest problem is the "latest" nvidia-docker images are cuda 9.1, cudnn 7.1.
And our builds look for cuda 9 and cudnn 7.
In our nightlies and tests, the fix is to avoid using the "latest" nvidia docker images.

Also, it is too late to change anything for 1.7. RC0 is almost out.

tfboyd (Member) commented Mar 13, 2018

The nightly docker we release is broken. I ran into the problem trying to figure out why my perf tests stopped running. We can at least pin the nightly docker image until the 1.8 nightlies. I do not think we need to move to 7.1 per se, as that would also break people, but I will ask XQ tomorrow about major/minor versions to see what he thinks, and Martin as well. I know you can, but just to help track stuff down.

That unblocks me and anyone using nightly docker. I cannot believe you were awake to see my message. :-)

gunan (Contributor) commented Mar 13, 2018

Thanks for following up on this.
It is a good idea to pin the nightly docker image bases, and going forward, I wonder if we can remove minor version checks on both CUDA and cuDNN.

flx42 (Contributor) commented Mar 13, 2018

I think the biggest problem is the "latest" nvidia-docker images are cuda 9.1, cudnn 7.1.

What do you mean? Which dockerfile is that?

gunan (Contributor) commented Mar 13, 2018

Sorry, you are right.
Not the "latest" image, but rather that all cuDNN 7.x images are tagged "cudnn7" without a minor version specification.

flx42 (Contributor) commented Mar 13, 2018

If there is a good reason to have this check on the minor version (e.g. an incompatibility despite the SONAME), I might split future images with the cuDNN minor version.

martinwicke (Member) commented:

I believe that we have run into incompatibilities between some minor versions of CUDA, but I'm not sure whether we've ever seen that in cuDNN (@jlebar are there details?)

jlebar (Contributor) commented Mar 13, 2018

@martinwicke, yeah, e.g. CUDA 9.0 and CUDA 9.1 are quite different in the respects we care about.

For cudnn, I have not seen a statement specifying their level of backwards compatibility. I would naively expect that if you build against cudnn x.y and run with cudnn x.z for z >= y, it probably will work. But to be comfortable with blessing that I'd want a statement in writing from nvidia. (Perhaps such a statement already exists.)

Whether or not it should be a fatal error in TF vs a "good luck, you're on your own" warning (like we do for known-broken ptxas versions), I don't have an opinion on.

flx42 (Contributor) commented Mar 13, 2018

CUDA toolkit libraries and cuDNN have different SONAME, so that's actually expected.

$ objdump -p  /usr/lib/x86_64-linux-gnu/libcudnn.so.7 | grep SONAME
  SONAME               libcudnn.so.7

$ objdump -p  /usr/local/cuda-9.1/targets/x86_64-linux/lib/libcublas.so.9.1 | grep SONAME
  SONAME               libcublas.so.9.1

cliffwoolley (Contributor) commented:

For cudnn, I have not seen a statement specifying their level of backwards compatibility. I would naively expect that if you build against cudnn x.y and run with cudnn x.z for z >= y, it probably will work. But to be comfortable with blessing that I'd want a statement in writing from nvidia. (Perhaps such a statement already exists.)

Statement from NVIDIA:

Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:

  • Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z).
  • cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or earlier patch release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x).

(Note that this compatibility was not necessarily guaranteed in prior cuDNN major releases.)

--Cliff Woolley
Director, DL Frameworks Engineering, NVIDIA

tfboyd (Member) commented Mar 13, 2018

Update: I caught up with the build people on @gunan's team and the nightly Docker will be fixed. I need to talk to a few more people, but I think we should consider a warning for a minor version difference and a fatal error for a major version difference. I have some concerns that there might be feature differences in point releases. I am doing research and talking with people; it will likely take a few days. I or someone will have a final statement.

b/74600152

cliffwoolley (Contributor) commented:

We will also update the cuDNN docs to say the same as what I posted above. Thanks for pointing out the omission.

tfboyd (Member) commented Mar 13, 2018

@cliffwoolley Thank you. I am opening an internal issue and looking for someone to update the code to match your statement in cuDNN.

tfboyd (Member) commented Mar 13, 2018

Last update until this is done or a change is in progress. I found someone to make the changes to match Cliff's cuDNN version-policy update. Will post when complete; I would not expect it to take very long. Team effort; I just get the honor of updating the GitHub issue. :-)

flx42 (Contributor) commented Mar 13, 2018

@gunan Should we modify Dockerfile.gpu to make it more similar to Dockerfile.gpu-devel? That is to say, FROM nvidia/cuda:9.0-base-ubuntu16.04, select the CUDA packages you need, and pin the cuDNN version.
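Roughly what that would look like (a sketch only; the exact list of CUDA runtime packages is an assumption, not the final Dockerfile):

FROM nvidia/cuda:9.0-base-ubuntu16.04
# Install only the CUDA 9.0 runtime pieces TensorFlow needs, plus a pinned cuDNN
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-command-line-tools-9-0 \
        cuda-cublas-9-0 \
        cuda-cufft-9-0 \
        cuda-curand-9-0 \
        cuda-cusolver-9-0 \
        cuda-cusparse-9-0 \
        libcudnn7=7.0.5.15-1+cuda9.0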

gunan (Contributor) commented Mar 14, 2018

How much space would we save by doing that?
It may still be worth it just to fix the cuDNN issues.

flx42 (Contributor) commented Mar 14, 2018

400 MB from a quick test, 300 MB if I re-add CUPTI. So it seems worth it, and it's always better to pin the version of a key dependency like cuDNN. It's better for reproducibility.

gunan (Contributor) commented Mar 14, 2018 via email

frankchn pushed a commit that referenced this issue Mar 15, 2018
Related: #17566
Fixes: #17431

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
Kjos commented Mar 20, 2018

I'm having the same error with pip. I guess I'll try the Docker image then.

StanislawAntol pushed a commit to StanislawAntol/tensorflow that referenced this issue Mar 23, 2018
Related: tensorflow#17566
Fixes: tensorflow#17431

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
tfboyd (Member) commented Apr 2, 2018

Fixed in nightly builds. We are now checking according to Cliff's update in the nightly builds, and I am going to guess this lands in TF 1.8 (definitely not 1.7), since 1.8 has not been branched yet:

Beginning in cuDNN 7, binary compatibility of patch and minor releases is maintained as follows:

  • Any patch release x.y.z is forward- or backward-compatible with applications built against another cuDNN patch release x.y.w (i.e., of the same major and minor version number, but having w!=z).
  • cuDNN minor releases beginning with cuDNN 7 are binary backward-compatible with applications built against the same or earlier patch release (i.e., an app built against cuDNN 7.x is binary compatible with cuDNN library 7.y, where y>=x).
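In shell terms, the relaxed check amounts to something like this (a sketch of the policy above, not the actual TensorFlow code):

# compiled_* = cuDNN version TF was built against; runtime_* = version of the library actually loaded
compiled_major=7; compiled_minor=0
runtime_major=7;  runtime_minor=1
if [ "$runtime_major" -ne "$compiled_major" ]; then
    echo "FATAL: cuDNN major version mismatch"
elif [ "$runtime_minor" -lt "$compiled_minor" ]; then
    echo "FATAL: runtime cuDNN minor version is older than the build version"
else
    echo "OK: binary backward-compatible under the cuDNN >= 7 policy"    # e.g. built with 7.0, running 7.1
fi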
