
Status: CUDA driver version is insufficient for CUDA runtime version #21832

Closed
mforde84 opened this issue Aug 23, 2018 · 28 comments
Assignees
Labels
stat:awaiting response Status - Awaiting response from author type:build/install Build and install issues

Comments

@mforde84

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Kernel: 2.6.32-573.12.1.el6.x86_64
    Host: RHEL 6.7
    Container: Ubuntu 16.04.5 LTS

  • TensorFlow installed from (source or binary):
    Singularity

  • TensorFlow version (use command below):
    Tensorflow:1.10.0-devel-gpu-py3

  • Python version:
    Python 3.5.2

  • GCC/Compiler version (if compiling from source):
    GCC 5.4.0

  • CUDA/cuDNN version:
    CUDA 9.0

  • GPU model and memory:
    Singularity tensorflow:1.10.0-devel-gpu-py3:~> nvidia-smi
    Thu Aug 23 00:24:41 2018
    +------------------------------------------------------+
    | NVIDIA-SMI 352.39     Driver Version: 352.39         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
    | N/A   39C    P0    58W / 149W |     22MiB / 11519MiB |      0%   E. Process |
    +-------------------------------+----------------------+----------------------+

  • Exact command to reproduce:
    $ # install nvidia driver v352.39
    $ sudo singularity build --sandbox /path/to/sandbox docker://tensorflow/tensorflow:1.10.0-devel-gpu-py3
    $ singularity shell --nv /path/to/sandbox
    Singularity tensorflow:1.10.0-devel-gpu-py3:~> nvidia-smi
    Thu Aug 23 00:24:41 2018
    +------------------------------------------------------+
    | NVIDIA-SMI 352.39     Driver Version: 352.39         |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:84:00.0     Off |                    0 |
    | N/A   39C    P0    58W / 149W |     22MiB / 11519MiB |      0%   E. Process |
    +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Singularity tensorflow:1.10.0-devel-gpu-py3:~> python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2018-08-23 00:26:35.424225: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-23 00:26:38.208490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:84:00.0
totalMemory: 11.25GiB freeMemory: 11.16GiB
2018-08-23 00:26:38.208576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/device_lib.py", line 41, in list_local_devices
    for s in pywrap_tensorflow.list_devices(session_config=session_config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 1679, in list_devices
    return ListDevices(status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Describe the problem

I built a TensorFlow container with Singularity. I think there is a mismatch between the card driver and CUDA libraries on the host and those in the container. Since the container is built as a sandbox, I can make modifications quite easily. Is there a way to install an appropriate CUDA driver and runtime inside the container and have the container use those, instead of pulling incompatible libraries from the host? Is that the right approach, or should I instead update the CUDA driver/libraries on the host to match the container?

@tensorflowbutler tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Aug 24, 2018
@tensorflowbutler
Member

Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
Bazel version
Mobile device

@mforde84
Author

Have I written custom code
N/A
Bazel version
N/A
Mobile device
N/A

@mforde84
Author

Would https://github.com/NIH-HPC/gpu4singularity be viable for Singularity 2.6.0 with --nv flags or would I need to make additional modification to library paths?

@ppwwyyxx
Contributor

This is not a TensorFlow issue: according to https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html, your NVIDIA driver is not new enough for CUDA 9.0.

@mforde84
Author

mforde84 commented Aug 24, 2018

Sure. But the question is more on how to integrate compatible drivers into a tensorflow container. The adage about containerization is: build once, run anywhere; and not: build once, run anywhere with Nvidia drivers v485 and above plus a kernel supporting experimental filesystem overlays. Even experimental / unofficial documentation on this scenario would be extremely helpful for most HPC environments that are still running epel6. ¯\_(ツ)_/¯

@ppwwyyxx
Contributor

The world is not perfect. I'm afraid "build once, run anywhere with nvidia drivers>=384.81" is the way to go. At least that's what nvidia says: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA#requirements

Running a CUDA container requires a machine with at least one CUDA-capable GPU and a driver compatible with the CUDA toolkit version you are using.
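A quick way to verify whether a host meets that requirement is a version-aware string comparison; this is a sketch using `sort -V`, plugging in the 384.81 minimum quoted above and the 352.39 driver from the `nvidia-smi` output in this issue (the commented-out `nvidia-smi` query shows how you would read the driver version on a real host):

```shell
# Is the installed driver at least the minimum NVIDIA lists for CUDA 9.0?
required="384.81"
driver="352.39"   # on a real host: driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
if [ "$(printf '%s\n%s\n' "$required" "$driver" | sort -V | head -n1)" = "$required" ]; then
    echo "driver $driver is new enough for CUDA 9.0"
else
    echo "driver $driver is too old for CUDA 9.0 (needs >= $required)"
fi
```

With the values from this issue it prints the "too old" branch, matching the runtime error above.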

@nicolefinnie

@mforde84 @tensorflowbutler

I hit exactly this problem, and someone else with the same combination (TensorFlow 1.11 + CUDA runtime 9.0 + cuDNN 7.3 + NVIDIA driver 390) hit it too, even though driver 390 is new enough for CUDA runtime 9.0. That person opened an issue on the NVIDIA DevTalk forum:

https://devtalk.nvidia.com/default/topic/1042575/cuda-driver-version-is-insufficient-for-cuda-runtime-version/?offset=2#5289688

I downgraded TensorFlow from 1.11 (the latest conda version) to 1.7 and the problem was solved. My question is: does newer TensorFlow, say 1.10+, have a dependency on specific NVIDIA driver / CUDA versions?

@tatianashp tatianashp assigned azaks2 and unassigned tatianashp Oct 13, 2018
@mforde84
Author

We upgraded to a recent version of drivers 396 and the issue resolved.

@nicolefinnie

nicolefinnie commented Oct 13, 2018

@mforde84 Thanks for the confirmation. That's what I was thinking too, but I had trouble upgrading to 396.54 due to a broken dependency. After reading your confirmation, I managed to install 396.54 and now it works with tensorflow 1.11.0. Thanks! I updated the ticket on the NVIDIA DevTalk as well.

@tatianashp tatianashp added the type:build/install Build and install issues label Oct 13, 2018
@azaks2

azaks2 commented Oct 15, 2018

tensorflow 1.11 + CUDA runtime 9.0 + cudnn 7.3 + nvidia driver 390
That combo should have worked. Note that with 396.54 there will be one more driver upgrade needed once TF switches to CUDA 10.

@hello-wangjj

@nicolefinnie, thanks, I downgraded the tensorflow version to 1.7 and this problem got solved.

@saskra

saskra commented Oct 17, 2018

I tested the recommendations in this thread, but I was not able to install any other driver than 390 on Ubuntu 18.04 and downgrading tensorflow to 1.7 resulted in a new error message:

2018-10-17 09:12:21.434933: E tensorflow/stream_executor/cuda/cuda_dnn.cc:343] Loaded runtime CuDNN library: 7.1.2 but source was compiled with: 7.2.1.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Segmentation fault (core dumped)

Which is strange, as I had installed version 7.3.1 on my system, but it seems that Anaconda installs its own cuDNN in the environment.
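The compatibility rule in that error message can be written out explicitly. This is a sketch of the check the log describes for cuDNN 7.0+ (major versions must match; the loaded minor version must be at least the minor version TF was compiled against), using the version pairs from the log:

```python
def cudnn_compatible(loaded, compiled):
    """Sketch of the rule in the TF error message above: for cuDNN 7.0+, the
    major versions must match and the loaded minor version must be >= the
    minor version TF was compiled against."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]

# The pairing from the log above: runtime 7.1.2 vs compiled-against 7.2.1
print(cudnn_compatible((7, 1, 2), (7, 2, 1)))   # False -> the crash above
# A 7.3.1 runtime satisfies a binary compiled against 7.2.1
print(cudnn_compatible((7, 3, 1), (7, 2, 1)))   # True
```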

@hello-wangjj

I tested the recommendations in this thread, but I was not able to install any other driver than 390 on Ubuntu 18.04 and downgrading tensorflow to 1.7 resulted in a new error message:

2018-10-17 09:12:21.434933: E tensorflow/stream_executor/cuda/cuda_dnn.cc:343] Loaded runtime CuDNN library: 7.1.2 but source was compiled with: 7.2.1.  CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
Segmentation fault (core dumped)

Which is strange, as I had installed version 7.3.1...

@saskra, I was using Deepin 15.8 with nvidia-driver 390.67, CUDA 9.0, cuDNN 7.0, and a miniconda-installed tensorflow-gpu 1.7, and the problem got solved.

@mforde84
Author

Saskra are you running in a container?

@saskra

saskra commented Oct 17, 2018

No. But I now found the solution: Anaconda creates an environment with its own incompatible cudnn version which has to be overwritten manually. :-)

@PhilipMay
Contributor

No. But I now found the solution: Anaconda creates an environment with its own incompatible cudnn version which has to be overwritten manually. :-)

I have the same problem. :-(
Which version of which exact conda module did you have to use to overwrite?

@saskra

saskra commented Oct 19, 2018

I have Ubuntu 18.04, which needs NVIDIA driver 390. Anaconda brings cuDNN 7.2.1, which seems to be too old for this driver version: https://anaconda.org/anaconda/cudnn. Now I am using the newest cuDNN version (7.3.1), as suggested by the official download site: https://developer.nvidia.com/rdp/cudnn-download. By the way, Anaconda's cuDNN version depends on its TensorFlow version; I have the newest one here as well (1.11).

PS: I suggested to update the version: ContinuumIO/anaconda-issues#10224

@Yongyao

Yongyao commented Nov 29, 2018

@mforde84 Would you mind sharing how you upgraded it?

@Huixxi

Huixxi commented Mar 25, 2019

check whether your nvidia-driver support your cuda version from here https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
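That lookup can be sketched as a small table. The 9.0 minimum (384.81) is the value quoted earlier in this thread; the other entries are transcribed from NVIDIA's release-notes table for x86_64 Linux, so double-check the linked page for your exact toolkit:

```python
# Minimum Linux driver per CUDA toolkit (transcribed from NVIDIA's
# release-notes table linked above; verify against the page itself).
MIN_DRIVER = {
    "9.0": (384, 81),
    "9.1": (390, 46),
    "9.2": (396, 26),
    "10.0": (410, 48),
}

def driver_supports(driver_version, cuda_version):
    """True if driver_version (e.g. '352.39') meets the minimum for
    cuda_version (e.g. '9.0')."""
    major, minor = (int(x) for x in driver_version.split(".")[:2])
    return (major, minor) >= MIN_DRIVER[cuda_version]

print(driver_supports("352.39", "9.0"))   # False -> the error in this issue
print(driver_supports("396.54", "9.0"))   # True
```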

@mmattklaus

@mforde84 Would you mind sharing how you upgraded it?

As for me, upgrading my driver worked out. I run a Windows 10 PC and use TF 1.13.
(Note: just an aside, I needed to activate my virtual environment and start Jupyter Notebook in that env before I was able to use TF in the notebook.)

Here is how I upgraded my driver:

  1. Open Device Manager
  2. Expand the display adapters
  3. Locate your NVIDIA Graphics adapter
  4. Right-click and click Update driver

Alternative

  • I found this software ( GeForce Experience ) on the NVIDIA website for my graphics family which can also be downloaded, installed and used to update the driver(s). This should work as well, though I didn't go that way.

@Huixxi

Huixxi commented Jun 23, 2019

@mforde84 Maybe you can get the solution from there. https://stackoverflow.com/q/41409842/7121726

@ghost

ghost commented Aug 13, 2019

Same issue here and I can't find an appropriate tensorflow version. I currently have ubuntu version 16.04.6, driver version 410.78, cuda version 10, conda version 4.7.11 and none of the above-mentioned tensorflow versions works for me. I tried 1.13.1, 1.7 and 1.14.
Anaconda installs cudnn with version 7.6.0. Edit: I forced conda to use the version 10.0 for cudatoolkit and not cuda10.1_0 as it was before (according to @saskra's suggestion), but nothing changed unfortunately.

Updating anaconda also didn't help. In fact, conda update --all and conda update conda outputs many new errors like:
InvalidArchiveError('Error with archive ... You probably need to delete and re-download or re-create this file. Message from libarchive was: ...')

Creating a conda environment with my current specs or simply running my python script also produces various InvalidArchiveError messages like above:

channels:
  - conda-forge
  - defaults
dependencies:
  - keras=2.2.4
  - nltk=3.3.0
  - numpy=1.15.4
  - pandas=0.23.4
  - python=3.6.6
  - scikit-learn=0.20.0
  - scipy=1.1.0
  - tensorflow=1.7
  - tensorflow-gpu=1.7
  - cython=0.29
  - pip:
    - fasttext==0.8.3
    - fuzzywuzzy==0.17.0
    - python-levenshtein==0.12.0
    - subsample==0.0.6
    - talos
    - tabulate==0.8.3

@agostini01

I had a similar issue using driver 384.130. It turned out that the cudatoolkit version inside my anaconda environment and the CUDA version supported by my driver did not match.

These two links helped me identify my driver and CUDA versions and then install the version of tensorflow_gpu that matched the CUDA on my machine.

To select the appropriate version based on your cuda installation:
https://www.tensorflow.org/install/source#tested_build_configurations

Version                Python version  Compiler  Build tools   cuDNN  CUDA
tensorflow_gpu-1.14.0  2.7, 3.3-3.7    GCC 4.8   Bazel 0.24.1  7.4    10.0
tensorflow_gpu-1.13.1  2.7, 3.3-3.7    GCC 4.8   Bazel 0.19.2  7.4    10.0
tensorflow_gpu-1.12.0  2.7, 3.3-3.6    GCC 4.8   Bazel 0.15.0  7      9
tensorflow_gpu-1.11.0  2.7, 3.3-3.6    GCC 4.8   Bazel 0.15.0  7      9
tensorflow_gpu-1.10.0  2.7, 3.3-3.6    GCC 4.8   Bazel 0.15.0  7      9
tensorflow_gpu-1.9.0   2.7, 3.3-3.6    GCC 4.8   Bazel 0.11.0  7      9

The CUDA versions may have minor versions (9.0, 9.2), so you should double-check what exactly you are installing with conda.
To check what you have inside your conda environment, and how to install a different version:
https://stackoverflow.com/a/55351774/2971299

So, I identified my cuda version

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

And installed the correct anaconda environment:

conda create -n gpu tensorflow-gpu==1.9.0 jupyter

@ghost

ghost commented Sep 19, 2019

Thank you very much @agostini01. I actually have all my versions aligned correctly. The only thing that actually worked is the second answer here: https://stackoverflow.com/questions/41402409/tensorflow-doesnt-seem-to-see-my-gpu
I uninstalled tensorflow and reinstalled tensorflow-gpu. Apparently they don't go well together?
Now Python sees my GPUs, and when I run watch nvidia-smi I can see my job using them.

@agostini01

@KonstantinaLazaridou no problem. I believe your suggested link is for when you are installing CUDA system-wide.

This line: conda create -n gpu tensorflow-gpu==1.9.0 jupyter cudatoolkit==XX should work as long as you match the anaconda tensorflow-gpu version with the correct anaconda cudatoolkit (XX) and the system-wide installed CUDA driver. Unfortunately I don't remember what to use for the XX value anymore.

Apparently they don't go well together?

Indeed! Nice catch. The advantage of using conda is that you can have tensorflow in one environment and tensorflow-gpu in another.

@MagaretJi

@mforde84 I had a similar issue using driver 384.81, but NVIDIA recommends driver 384.183 for the Tesla K80. So is upgrading to a recent driver version such as 396 a good choice?
GPU: Tesla K80
tensorflow-gpu: 1.10.0
cuDNN: 7.0.5
CUDA: 9.0

2019-12-17 09:55:46.558571: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2019-12-17 09:55:46.558747: E tensorflow/stream_executor/cuda/cuda_dnn.cc:463] possibly insufficient driver version: 384.81.0
2019-12-17 09:55:46.558864: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)

@turinglife

turinglife commented Mar 9, 2020

NVIDIA driver mismatch

my nvidia driver is 384.90.

before: error which is same as the title of the thread.
tensorflow-gpu 1.15.0 with cudatoolkit 10.0.130 + cudnn 7.6.5

after: Worked
tensorflow-gpu 1.12.0 with cudatoolkit 9.0

Solution:
$ conda uninstall cudatoolkit          # removes 10.0.130
$ conda install tensorflow-gpu=1.12 cudatoolkit=9.0


@shivam1702

shivam1702 commented Jun 14, 2021

This error also occurs if you create a symbolic link that points a CUDA shared-object name at a shared object with a different (higher) version.

For example, for me this error was occurring because I had a symbolic link from /usr/local/cuda-10.0/lib64/libcudart.so pointing towards /usr/local/cuda/lib64/libcudart.so.10.1, among other symlinks.

When I removed just this symlink, the error vanished. However, I noticed that there was no significant difference between GPU and CPU training times, even though the GPU process showed up in nvidia-smi while the CPU one obviously didn't. They were exactly the same. Weird issue.
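One way to audit for this kind of mismatch is to resolve every libcudart symlink and compare the resolved soname version against the toolkit directory it lives in. This sketch recreates the mismatched layout described above under a hypothetical /tmp path (the real files live under /usr/local, as in the comment):

```shell
# Recreate the mismatched layout under /tmp (hypothetical demo paths)
demo=/tmp/cudart-demo
mkdir -p "$demo/cuda-10.0/lib64"
touch "$demo/libcudart.so.10.1"
ln -sf "$demo/libcudart.so.10.1" "$demo/cuda-10.0/lib64/libcudart.so"

# Resolve the link: a cuda-10.0 tree pointing at a .so.10.1 is the red flag
readlink -f "$demo/cuda-10.0/lib64/libcudart.so"
# On a real system, audit everything at once with:
#   ls -l /usr/local/cuda*/lib64/libcudart.so*
```

The `readlink -f` output exposing a 10.1 library inside a cuda-10.0 tree is exactly the situation described above.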
