
Using tensorflow gpu 2.1 with Cuda 10.2 #34759

Closed
zaccharieramzi opened this issue Dec 2, 2019 · 38 comments
Labels: subtype: ubuntu/linux (Ubuntu/Linux Build/Installation Issues); TF 2.1 (for tracking issues in 2.1 release); type:build/install (Build and install issues)

Comments

@zaccharieramzi
Contributor

  • OS Platform and Distribution: Linux Ubuntu 16.04
  • TensorFlow installed from: pip
  • TensorFlow version: 2.1.0rc0
  • Python version: 3.6.8
  • Installed using virtualenv? pip? conda?: pip
  • CUDA/cuDNN version: 10.2
  • GPU model and memory: Quadro P5000, 16GB

Describe the problem

I want to use tensorflow-gpu==2.1.0rc0 with CUDA 10.2, but it does not seem to work right now.
When I use tensorflow-gpu==2.0.0, it works perfectly fine.

Provide the exact sequence of commands / steps that you executed before running into the problem

mkdir tests2 &&\
cd tests2 &&\
virtualenv -p /usr/bin/python3.6 venv &&\
source venv/bin/activate &&\
pip install tensorflow-gpu==2.1.0rc0 &&\
python -c 'import tensorflow'

Which gives the following warnings:

2019-12-02 15:23:46.869198: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2019-12-02 15:23:46.869227: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2019-12-02 15:23:47.516321: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2019-12-02 15:23:47.516433: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2019-12-02 15:23:47.516449: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
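
For longer logs, the names of the libraries that failed to load can be pulled out mechanically. The following is a minimal sketch, assuming the warning text keeps the "Could not load dynamic library '...'" wording shown above; the helper name `missing_libraries` is made up for illustration and is not part of TensorFlow.

```python
import re

# Pattern for the "Could not load dynamic library" warnings printed by
# TensorFlow's dso_loader, as they appear in the log above.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without repeats."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Shortened copies of three warning lines from the log above.
sample_log = """\
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
"""

print(missing_libraries(sample_log))
```

The version suffix in each name (here 10.1 for libcudart) is the version the binary was looking for, which is the useful part when comparing against what is installed.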

Any other info / logs
When I do locate libcudart.so, I get the following:

/usr/local/cuda-10.0/doc/man/man7/libcudart.so.7
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0
/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130
/usr/local/cuda-10.2/doc/man/man7/libcudart.so.7
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2.89

locate libnvinfer_plugin.so is empty.
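
The locate output can be compared against the version named in the warning. This is a rough sketch that only inspects file names (it does not load anything); `cudart_versions` is a hypothetical helper, and the path list is copied from the output above.

```python
import re

# Matches a runtime file name ending in a major.minor version,
# e.g. libcudart.so.10.2 (but not libcudart.so or libcudart.so.10.2.89).
CUDART = re.compile(r"libcudart\.so\.(\d+)\.(\d+)$")


def cudart_versions(paths):
    """Return the sorted (major, minor) CUDA runtime versions found among paths."""
    found = set()
    for path in paths:
        match = CUDART.search(path)
        if match:
            found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)


located = [
    "/usr/local/cuda-10.0/doc/man/man7/libcudart.so.7",
    "/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so",
    "/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0",
    "/usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0.130",
    "/usr/local/cuda-10.2/doc/man/man7/libcudart.so.7",
    "/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so",
    "/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2",
    "/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudart.so.10.2.89",
]

versions = cudart_versions(located)
print("installed runtimes:", versions)
print("10.1 present:", (10, 1) in versions)
```

On this machine the list contains 10.0 and 10.2 only, so the libcudart.so.10.1 requested in the warning is not there.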

@ymodak
Contributor

ymodak commented Dec 2, 2019

As the log shows (Could not load dynamic library 'libcudart.so.10.1'), the TF 2.1 binary looks for CUDA 10.1. You have to roll back to CUDA 10.1 to use the TF 2.1 binary; to use CUDA 10.2 you have to build TF from source.
Also, the tensorflow pip package (TF 2.1) now includes GPU support by default (same as tensorflow-gpu) on both Linux and Windows.

@zaccharieramzi
Contributor Author

Thanks for your answer.
I am a bit surprised though because I am able to use tf 2.0.0 just fine (with cuda 10.2). How is it possible?

@ymodak
Contributor

ymodak commented Dec 2, 2019

In tf 2.0 you are able to import the CPU version of tf.
The reason is that when you installed tf 2.0.0 without specifying the accelerator (gpu), it installed both CPU and GPU support.
To check this, you can run:

import tensorflow as tf
tf.test.is_gpu_available()
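
A deprecation warning quoted later in this thread names `tf.config.list_physical_devices('GPU')` as the replacement for `tf.test.is_gpu_available()`. A hedged sketch of that check follows; the wrapper `list_gpus` is hypothetical, and the fallback to `tf.config.experimental` is an assumption for installs that predate the non-experimental name.

```python
def list_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Prefer the non-experimental name; fall back to the experimental one
    # on versions that only ship tf.config.experimental (assumption: TF 2.0).
    lister = getattr(tf.config, "list_physical_devices", None)
    if lister is None:
        lister = tf.config.experimental.list_physical_devices
    return lister("GPU")


gpus = list_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

An empty list means TensorFlow imported but registered no GPU, which is the symptom several people report further down.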

@zaccharieramzi
Contributor Author

No, I specifically installed tensorflow-gpu and didn't get the warning at import. Beyond simply using this function to test, I monitored the GPU usage during training, and my training was much faster than when I masked the GPU.

@zaccharieramzi
Contributor Author

When I look at the logs for tf 2.0.0, I see that it loads libraries from the 10.0 version of CUDA. These are certainly legacy libraries that I still have installed, which is why it works even though my main CUDA is 10.2.
I'll roll back to 10.1 to use tf 2.1.0.
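
One way to see which runtime versions the dynamic loader can actually find, independent of TensorFlow, is to try opening each soname directly. This is only a sketch: `can_load` is a made-up helper, and the result depends on the loader configuration (LD_LIBRARY_PATH, ldconfig cache) of the shell it runs in.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the shared library `soname`."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


# Runtime versions mentioned in this thread.
for soname in ("libcudart.so.10.0", "libcudart.so.10.1", "libcudart.so.10.2"):
    print(soname, "loadable" if can_load(soname) else "not found")
```

On the setup described above one would expect 10.0 and 10.2 to be found only if their directories are on the loader path, and 10.1 not at all.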

@gadagashwini-zz gadagashwini-zz self-assigned this Dec 3, 2019
@gadagashwini-zz gadagashwini-zz added subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues TF 2.1 for tracking issues in 2.1 release type:build/install Build and install issues labels Dec 3, 2019
@gadagashwini-zz
Contributor

@zaccharieramzi, were you able to install tensorflow-gpu 2.1 with CUDA 10.1?

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Dec 5, 2019
@zaccharieramzi
Contributor Author

Hi @gadagashwini , I didn't have time to try. I can't really try right now, cos I am too afraid to mess up my conda environment atm(haha). But I can let you know in a few days.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 5, 2019
@gadagashwini-zz
Contributor

Thanks @zaccharieramzi. Please try and let us know how it progresses. Thanks!

@EwoutH
Contributor

EwoutH commented Dec 9, 2019

@zaccharieramzi They got Cuda 10.1 working in another thread: #34429 (comment). Could you verify?

@pisiiki

pisiiki commented Dec 12, 2019

I managed to build master with --config=v1 and cuda 10.2(gcc), numpy 1.17.4, however I get reproducible segfaults on code that runs fine with tf 1.x/cuda 10.1. OS is a fresh new Ubuntu 18.04, cuda/cudnn from debs. Hardware is a xeon 2695v2, 128GB RAM, 1060ti:

2019-12-12 09:53:39.102734: F ./tensorflow/core/util/gpu_launch_config.h:129] Check failed: work_element_count > 0 (0 vs. 0)
[xeon:22138] *** Process received signal ***
[xeon:22138] Signal: Aborted (6)
[xeon:22138] Signal code:  (-6)
[xeon:22138] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f650fe53f20]
[xeon:22138] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f650fe53e97]
[xeon:22138] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f650fe55801]
[xeon:22138] [ 3] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0xaa769c7)[0x7f6464a8b9c7]
[xeon:22138] [ 4] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow7functor9ApplyAdamIN5Eigen9GpuDeviceEfEclERKS3_NS2_9TensorMapINS2_6TensorIfLi1ELi1ElEELi16ENS2_11MakePointerEEESB_SB_NS7_INS2_15TensorFixedSizeIKfNS2_5SizesIJEEELi1ElEELi16ESA_EESH_SH_SH_SH_SH_NS7_INS8_ISD_Li1ELi1ElEELi16ESA_EEb+0x40f)[0x7f646322fb7f]
[xeon:22138] [ 5] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow11ApplyAdamOpIN5Eigen9GpuDeviceEfE7ComputeEPNS_15OpKernelContextE+0x52c)[0x7f646315015c]
[xeon:22138] [ 6] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xe6)[0x7f6475574b76]
[xeon:22138] [ 7] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0xf75665)[0x7f64755df665]
[xeon:22138] [ 8] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(+0xf75d2f)[0x7f64755dfd2f]
[xeon:22138] [ 9] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x4b1)[0x7f64756cdbc1]
[xeon:22138] [10] /home/i/.local/lib/python3.6/site-packages/tensorflow_core/python/../libtensorflow_framework.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x7f64756caed3]
[xeon:22138] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f)[0x7f648ff7266f]
[xeon:22138] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f650fbfd6db]
[xeon:22138] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f650ff3688f]
[xeon:22138] *** End of error message ***

Weirdest thing I'm doing in the code is to have tensors with a dimension axis at 0 (ie: shape@(40,4,0,50)). This happens because I'm doing a network architecture search process.

I'm rolling back to 10.1 meanwhile, Regards.

@EwoutH
Contributor

EwoutH commented Dec 12, 2019

@pisiiki Did your build include this patch? #34885

@pisiiki

pisiiki commented Dec 12, 2019

@EwoutH I think so; I did a pull on master yesterday and this was merged 6 days ago.

edit:

I have just crashed with a prebuilt conda tf-gpu 1.15 package. It seems that my code may crash with a segfault or get some numerical overflows depending on driver, python version, tf version, packages, etc. Conda has its own cuda 10.0 runtime for this package. I'm still on 440.33.01(cuda 10.2) driver btw.

I didn't notice before because my previous setup (py3.6 tf 1.15 cuda 10.1) simply threw exceptions. It turned some vars into infs on a train call and I raised the exceptions myself. I supposed all of these were regular instability/overflows, but now I think some may be related to 0-sized dimensions in some tensors.

I will rollback the driver to 10.1, however I guess this is worth checking out. I will also try to reproduce the error with minimal code.

@gadagashwini-zz
Contributor

@zaccharieramzi, any update on this issue?

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Dec 30, 2019
@zaccharieramzi
Contributor Author

Sorry, I didn't get a chance to check on this with the holiday season and all.
Would you rather have me try to build with cuda 10.2 or try and install cuda 10.1?

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 31, 2019
@lijiaying

I have a similar issue here. What helped me was building TensorFlow from source. It seems the prebuilt TensorFlow is not fully compatible with CUDA 10.2.

@zaccharieramzi
Contributor Author

I rolled back to cuda 10.1, and everything seems to work fine. I am going to close this since tensorflow 2.1 is not supposed to be directly usable with cuda 10.2, and from @lijiaying 's comment, I understand there is no issue with building from source.

@alokssingh

Did you find a solution for your problem?

@zaccharieramzi
Contributor Author

zaccharieramzi commented Jan 16, 2020

@aloksingh3110 like I said in the comment just above, I rolled back to cuda 10.1. This person has run into problems when building from source with cuda 10.2, so I advise you to roll back to 10.1.

@SalahAdDin

SalahAdDin commented Jan 21, 2020

I have the same problem, but using CUDA 10.1:

Python 3.5.2 (default, Oct  8 2019, 13:06:37) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-01-21 16:33:12.576855: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.27' not found (required by /usr/lib/x86_64-linux-gnu/libnvinfer.so.6)
2020-01-21 16:33:12.577361: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.27' not found (required by /usr/lib/x86_64-linux-gnu/libnvinfer.so.6)
2020-01-21 16:33:12.577381: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
>>> tf.test.is_gpu_available()
WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2020-01-21 16:33:15.483537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3491610000 Hz
2020-01-21 16:33:15.484394: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4de73f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-21 16:33:15.484448: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-01-21 16:33:15.487966: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-21 16:33:15.518212: I tensorflow/compiler/xla/service/platform_util.cc:205] StreamExecutor cuda device (0) is of insufficient compute capability: 3.5 required, device is 3.0
2020-01-21 16:33:15.518325: I tensorflow/compiler/jit/xla_gpu_device.cc:136] Ignoring visible XLA_GPU_JIT device. Device number is 0, reason: Internal: no supported devices found for platform CUDA
2020-01-21 16:33:15.518765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:05:00.0 name: Quadro K2000 computeCapability: 3.0
coreClock: 0.954GHz coreCount: 2 deviceMemorySize: 1.94GiB deviceMemoryBandwidth: 59.60GiB/s
2020-01-21 16:33:15.518997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-01-21 16:33:15.520344: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-01-21 16:33:15.521455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-01-21 16:33:15.521690: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-01-21 16:33:15.523131: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-01-21 16:33:15.523940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-01-21 16:33:15.527679: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-21 16:33:15.528518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1651] Ignoring visible gpu device (device: 0, name: Quadro K2000, pci bus id: 0000:05:00.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2020-01-21 16:33:15.528552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-21 16:33:15.528566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-01-21 16:33:15.528579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
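
Besides the TensorRT warnings, this log also reports that the device is ignored because of its compute capability. The sketch below pulls both numbers out of such a log; it assumes the exact wording of the two lines shown above, and `capability_report` is a hypothetical helper.

```python
import re

# Wording taken from the gpu_device.cc lines in the log above.
CAPABILITY = re.compile(r"computeCapability: (\d+)\.(\d+)")
MINIMUM = re.compile(r"minimum required Cuda capability is (\d+)\.(\d+)")


def capability_report(log_text):
    """Return (device, minimum, ok) as version tuples plus a bool, or None if not found."""
    device = CAPABILITY.search(log_text)
    minimum = MINIMUM.search(log_text)
    if not device or not minimum:
        return None
    device_cc = (int(device.group(1)), int(device.group(2)))
    minimum_cc = (int(minimum.group(1)), int(minimum.group(2)))
    return device_cc, minimum_cc, device_cc >= minimum_cc


sample_log = """\
pciBusID: 0000:05:00.0 name: Quadro K2000 computeCapability: 3.0
Ignoring visible gpu device (device: 0, name: Quadro K2000, pci bus id: 0000:05:00.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
"""

print(capability_report(sample_log))
```

For this log the device reports 3.0 against a minimum of 3.5, so the GPU is skipped even though the CUDA 10.1 libraries opened successfully.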

@hd1090

hd1090 commented Jan 22, 2020

I'm relatively new to this, so I might be wrong, but @SalahAdDin it seems like you're missing the libnvinfer library found in TensorRT (see here). CUDA 10.1 seems to have loaded fine.

@maximuslee1226

@hd1090. I tried to install tensorRT in cuda 10.1. My gpu is working, but tensorRT installation is erroring out on me with the following message. Do you know what could be wrong here?
(base) prompt$:~$ sudo apt install tensorrt
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
tensorrt : Depends: libnvinfer6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvinfer-plugin6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvparsers6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvonnxparsers6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvinfer-bin (= 6.0.1-1+cuda10.1) but it is not going to be installed
Depends: libnvinfer-dev (= 6.0.1-1+cuda10.1) but 7.0.0-1+cuda10.2 is to be installed
Depends: libnvinfer-plugin-dev (= 6.0.1-1+cuda10.1) but 7.0.0-1+cuda10.2 is to be installed
Depends: libnvparsers-dev (= 6.0.1-1+cuda10.1) but 7.0.0-1+cuda10.2 is to be installed
Depends: libnvonnxparsers-dev (= 6.0.1-1+cuda10.1) but 7.0.0-1+cuda10.2 is to be installed
Depends: libnvinfer-samples (= 6.0.1-1+cuda10.1) but it is not going to be installed
Depends: libnvinfer-doc (= 6.0.1-1+cuda10.1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
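
The report above says that tensorrt wants the cuda10.1 builds while apt would pick cuda10.2 (or 7.0.0) builds. A minimal sketch, not an apt feature, that extracts the exact versions being asked for so they can be reviewed or passed as package=version arguments to apt-get install; `required_pins` is a made-up helper, and whether apt can satisfy the pins depends on the repositories configured.

```python
import re

# One "Depends:" entry from apt's unmet-dependency report, e.g.
#   Depends: libnvinfer6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
DEPENDS = re.compile(r"Depends: (\S+) \(= ([^)]+)\)")


def required_pins(apt_output):
    """Return 'package=version' strings for every exact-version dependency listed."""
    return [f"{pkg}={version}" for pkg, version in DEPENDS.findall(apt_output)]


# Three lines copied from the report above.
sample = """\
tensorrt : Depends: libnvinfer6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvinfer-plugin6 (= 6.0.1-1+cuda10.1) but 6.0.1-1+cuda10.2 is to be installed
Depends: libnvinfer-dev (= 6.0.1-1+cuda10.1) but 7.0.0-1+cuda10.2 is to be installed
"""

pins = required_pins(sample)
print(pins)
```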

@SalahAdDin

@maximuslee1226
Try doing this:

!sudo apt-get install -y --no-install-recommends libnvinfer6=6.0.1-1+cuda10.0 \
    libnvinfer-dev=6.0.1-1+cuda10.0 \
    libnvinfer-plugin6=6.0.1-1+cuda10.0

Using CUDA 10.1.

@alihamid996

@zaccharieramzi did you find any solution other than rolling back to 10.1?

@zaccharieramzi
Contributor Author

@alihamid996 I didn't try anything else, so I couldn't tell you myself. @lijiaying 's comment suggests that it's possible to build tf 2.1 from source with cuda 10.2 though.

@antonywu

antonywu commented May 1, 2020

My head hurts just to see it is such a pain in the rear to get all the moving pieces exactly right to have GPU support. NVidia obviously wants you to install 10.2... and here goes Tensorflow only works with 10.1 out of box. Now I have to recompile the OpenCV from scratch.. Just wonderful.

@vitalyisaev2

Could you please clarify: are there any plans to support 10.2 in the next release of TensorFlow?

Unfortunately, there are no CUDA 10.1 packages for modern RHEL-based repos like CentOS 8 or Fedora 30+. Only CUDA 10.2 is available for these distributions.

@mariusmotea

The new sdcard image for Jetson Nano comes with CUDA 10.2 preinstalled, and older images cannot be downloaded anymore. Not sure if it is possible to downgrade, because it seems NVIDIA blacklists older packages.

@bm777

bm777 commented May 18, 2020

Is there any TensorFlow installation solution for people who have CUDA 10.2, gcc 7.5.0 and Ubuntu 18.04 installed? If so, is it only with TensorFlow 2.1?

@mihaimaruseac
Collaborator

#38194 (comment) seems to have a solution

@palisadoes

I had a similar problem with 10.2 using tf.config.experimental.list_physical_devices('GPU'). CUDA 10.2 was installed with the command below, after adding the Ubuntu 18.04 cuda and nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb repos per the TensorFlow documentation at https://www.tensorflow.org/install/gpu

apt-get install -y --no-install-recommends \
    cuda-10-2 \
    libcudnn7=7.6.5.32-1+cuda10.2 \
    libcudnn7-dev=7.6.5.32-1+cuda10.2 \
    libnvinfer7=7.0.0-1+cuda10.2 \
    libnvinfer-dev=7.0.0-1+cuda10.2 \
    libnvinfer-plugin7=7.0.0-1+cuda10.2

The error when trying tf.config.experimental.list_physical_devices('GPU') can be seen below:

$ python3
Python 3.8.2 (default, Apr 27 2020, 15:53:34) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf                            
>>> tf.config.experimental.list_physical_devices('GPU')
2020-05-20 14:02:50.885725: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-20 14:02:50.903802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-05-20 14:02:50.904116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1070 Ti computeCapability: 6.1
coreClock: 1.683GHz coreCount: 19 deviceMemorySize: 7.92GiB deviceMemoryBandwidth: 238.66GiB/s
2020-05-20 14:02:50.904232: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-05-20 14:02:50.905355: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-05-20 14:02:50.906434: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-05-20 14:02:50.906614: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-05-20 14:02:50.907827: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-05-20 14:02:50.908474: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-05-20 14:02:50.910932: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-05-20 14:02:50.910947: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1598] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>> 

This error Could not load dynamic library 'libcudart.so.10.1' was the clue.

This resolved it:

apt-get -y install cuda-cudart-10-1

@petervandenabeele

petervandenabeele commented May 25, 2020

UPDATE: WARNING below #34759 (comment)

FYI, we have a tested work-around (symlink fake libcudart.so.10.2 to real libcudart.so.10.1) for TF 2.2 and cuda 10.2 for Ubuntu 20.04 and Windows in #38194 (comment)
and #38194 (comment)

I have been using this for days now, mainly with CPU, and I never had trouble with it (so libcudart 10.1 and 10.2 really seem compatible, as was promised higher up in that thread NOT SURE !!).

@sanjoy
Contributor

sanjoy commented May 26, 2020

so libcudart 10.1 and 10.2 really seem compatible, as was promised higher up in that thread

I don't think libcudart 10.1 and 10.2 are ABI compatible (doc). Symlinking 10.2 to 10.1 may seem to work, but there is no guarantee that e.g. this will not fry your GPU.
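
To find out whether a machine already has such a cross-version link in place, the link's own name can be compared with the file it resolves to. This is a sketch under assumptions: `symlink_mismatch` is a hypothetical helper, it only compares major.minor numbers in file names, and the demonstration uses an empty stand-in file in a temporary directory rather than a real CUDA install.

```python
import os
import re
import tempfile

# major.minor version embedded in a libcudart file name
VERSION = re.compile(r"libcudart\.so\.(\d+\.\d+)")


def symlink_mismatch(path):
    """Return (name_version, target_version) if `path` resolves to a
    libcudart file of a different version than its name claims, else None."""
    requested = VERSION.search(os.path.basename(path))
    actual = VERSION.search(os.path.basename(os.path.realpath(path)))
    if requested and actual and requested.group(1) != actual.group(1):
        return requested.group(1), actual.group(1)
    return None


# Demonstration: a link named like the 10.1 runtime that TensorFlow asks for,
# pointing at a file named like the 10.2 runtime.
with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "libcudart.so.10.2.89")
    link = os.path.join(tmp, "libcudart.so.10.1")
    open(real, "w").close()
    os.symlink(real, link)
    result = symlink_mismatch(link)
    unchanged = symlink_mismatch(real)

print(result)
```

A non-None result only says that the names disagree; it says nothing about whether the two runtimes are ABI compatible.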

@petervandenabeele

Wow, that would be bad ...

To be clear, I am doing most of the actual work on the CPU (this smallish GM107GLM [Quadro M1200 Mobile] GPU with 2.5G free RAM, faces OOM very quickly for any real work, and when it does not face OOM, it seems a factor 2 slower than the 8 core CPU).

@braindotai

Can anyone tell about when tensorflow gpu would be able to run with cuda 10.2?

@mihaimaruseac
Collaborator

At least after the TF 2.4 release

@sanjoy
Contributor

sanjoy commented Jul 28, 2020

Can anyone tell about when tensorflow gpu would be able to run with cuda 10.2?

Our current plan is to move TF 2.4 to CUDA 11.

@PointCloudYC

Managed to install tensorflow-gpu 2.3 with cudatoolkit 10.1 on my CUDA 10.2 driver (Jan 19th, 2021):

conda create --name tf2 python=3.8.3
conda activate tf2
conda install cudnn==7.6.4
pip install tensorflow-gpu==2.3
