Tensorflow 2.2 and 2.3 not detecting GPU with CUDA 10.1 #43236

Closed
javedsha opened this issue Sep 15, 2020 · 25 comments
Labels
  • stale (This label marks the issue/pr stale - to be closed automatically if no activity)
  • stat:awaiting response (Status - Awaiting response from author)
  • subtype: ubuntu/linux (Ubuntu/Linux Build/Installation Issues)
  • TF 2.3 (Issues related to TF 2.3)
  • type:build/install (Build and install issues)

Comments

@javedsha

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 18.04):
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version: 2.3 and 2.2
  • Python version: 3.6
  • Installed using virtualenv? pip? conda?: venv and pip
  • GCC/Compiler version (if compiling from source): 7.5
  • CUDA/cuDNN version: 10.1
  • GPU model and memory: K80

Describe the problem
After installing TensorFlow, the GPU is not detected and I get the error: 'Cannot open dynamic library libcublas.so.10'.

Provide the exact sequence of commands / steps that you executed before running into the problem

  1. I followed all the steps from the official TensorFlow pages exactly as written: https://www.tensorflow.org/install/gpu and https://www.tensorflow.org/install/pip.
  2. In addition, I had to install cuda-toolkit separately.
  3. Finally, I added the CUDA 10.1 path to my .bashrc file.

How I fixed the problem:

I started with a clean VM on Azure with nothing installed, then followed the TensorFlow guides (above) to install the NVIDIA driver, CUDA 10.1, cuDNN, cuda-toolkit and TensorFlow.

After all these steps, /usr/local contained two CUDA folders (I don't know why):
/usr/local/cuda-10.1/lib64/
/usr/local/cuda-10.2/lib64/

The error I was getting was for the dynamic library 'libcublas.so.10'. This file was not present in 'cuda-10.1'; instead it was in 'cuda-10.2' (note that I installed everything in a venv).

I had to manually copy all the files (including the files inside the 'stubs' folder) from 'cuda-10.2' into 'cuda-10.1'. After that it worked.

This thread also mentions the issue and says that with CUDA 10.1 some of the libraries are installed differently: https://forums.developer.nvidia.com/t/cublas-for-10-1-is-missing/71015/4 (the steps there apply when the libraries are installed at system level rather than in a venv).
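To make the "which folder actually holds libcublas" check repeatable, here is a minimal Python sketch. The helper name `find_library` is only an illustration (it is not part of TensorFlow or CUDA), and the two directories are simply the ones from this report, so adjust them to your own layout.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def find_library(name: str, search_dirs: Iterable[str]) -> Dict[str, List[str]]:
    """Map each existing directory to the files in it whose name starts with `name`."""
    hits: Dict[str, List[str]] = {}
    for directory in search_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip toolkit directories that are not installed
        hits[str(root)] = sorted(
            p.name for p in root.iterdir() if p.name.startswith(name)
        )
    return hits


if __name__ == "__main__":
    # Directories taken from this report; they are assumptions about your system.
    dirs = ["/usr/local/cuda-10.1/lib64", "/usr/local/cuda-10.2/lib64"]
    for directory, files in find_library("libcublas.so", dirs).items():
        print(directory, "->", files or "not found")
```

An empty list for a directory means the toolkit folder exists but does not contain the library, which is the situation described above for 'cuda-10.1'.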

Expected Behaviour:
Either TensorFlow should locate the missing dynamic libraries automatically, or the install guide should explain how to fix this.

Note: The errors are similar when you install CUDA 10.2; only the dynamic library versions differ.

@javedsha javedsha added the type:build/install Build and install issues label Sep 15, 2020
@ravikyram ravikyram added the TF 2.3 Issues related to TF 2.3 label Sep 15, 2020
@ravikyram
Contributor

ravikyram commented Sep 15, 2020

@javedsha

TensorFlow v2.3 is compatible with CUDA 10.1 and cuDNN 7.6. For more information regarding this please take a look at the tested build configurations.

And the CUDA version mismatch query has been explained in this StackOverflow comment.

Can you paste the output of nvidia-smi?
Thanks!

@ravikyram ravikyram added the stat:awaiting response Status - Awaiting response from author label Sep 15, 2020
@javedsha
Author

@ravikyram

Output of NVIDIA-SMI:

Driver Version: 450.1.06
CUDA Version: 11.0

nvcc --version
10.1

I followed all the steps in the TensorFlow GPU guide. The only extra thing I did was install cuda-toolkit with 'sudo apt-get install cuda-toolkit'.

What am I doing wrong?
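The two numbers above answer different questions: the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, while nvcc reports the toolkit that is actually installed. A small sketch for pulling both out of the command output follows. The function names are made up for illustration, and the sample strings are only shaped like the values quoted in this thread, not captured output.

```python
import re
from typing import Optional


def driver_cuda_version(nvidia_smi_output: str) -> Optional[str]:
    """Return the 'CUDA Version' shown in the nvidia-smi header, if present."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", nvidia_smi_output)
    return match.group(1) if match else None


def toolkit_cuda_version(nvcc_output: str) -> Optional[str]:
    """Return the toolkit release printed by `nvcc --version`, if present."""
    match = re.search(r"release\s+([\d.]+)", nvcc_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative strings only; paste your own command output instead.
    smi = "Driver Version: 450.1.06    CUDA Version: 11.0"
    nvcc = "Cuda compilation tools, release 10.1, V10.1.243"
    print("driver supports up to CUDA", driver_cuda_version(smi))
    print("installed toolkit is CUDA", toolkit_cuda_version(nvcc))
```

A driver that reports 11.0 next to a 10.1 toolkit is therefore not a mismatch in itself.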

@ahtik
Contributor

ahtik commented Sep 16, 2020

@javedsha The following is a procedure I use for Ubuntu 18.04, confirmed to work with the Ubuntu-shipped python 3.6. Hope it helps to pinpoint your issue.

In your case, the trouble possibly started with the sudo apt-get install cuda-toolkit, as it's not fixed to 10.1. Having 10.1 parallel to 10.2 and 11.0 is not advisable, nor practically feasible due to the env vars.

Btw, the CUDA version reported by nvidia-smi is not necessarily the CUDA version that TensorFlow picks up (longer story), but with my installation procedure it should report 10.1.

# To start fresh, clean up all the nvidia-related packages. Be careful when using the same system as a desktop!
sudo apt-get --purge remove 'cuda*'
sudo apt-get --purge remove 'nvidia*'
sudo apt-get --purge remove 'libnvidia*'

# Check if all clean
sudo find /usr/local/cuda/ -name '*blas*'
sudo find /usr/lib/ -name '*blas*'

# CUDA 10.1 instructions for creating a locally available repo and installing from it
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
sudo apt-get update

# Make sure the driver number matches the GPU. Also -440 would most likely work.
sudo apt install nvidia-driver-418
sudo apt install cuda-10.1

# Make sure the libs are now in place
sudo find /usr/local/cuda/ -name '*blas*'
sudo find /usr/lib/ -name '*blas*'

# Run nvidia-smi for sanity check
nvidia-smi

python3 -m venv ~/.venv-tf2.3-sanity
. ~/.venv-tf2.3-sanity/bin/activate
pip install -U pip
pip install tensorflow==2.3
python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([10000, 10000])))"
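As an extra check besides `find`, the following sketch asks the dynamic loader directly whether each library can be opened, which is closer to what TensorFlow does at import time. The list of names is copied from the loader messages quoted later in this thread; the helper `try_load` is hypothetical, and this is only a rough diagnostic under those assumptions.

```python
import ctypes
from typing import Dict, Iterable

# Names as they appear in the TensorFlow 2.3 loader messages in this thread.
TF23_CUDA_LIBS = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def try_load(names: Iterable[str]) -> Dict[str, bool]:
    """Try to open each shared library through the system loader."""
    results: Dict[str, bool] = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the normal loader search path
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in try_load(TF23_CUDA_LIBS).items():
        print(f"{name}: {'ok' if ok else 'NOT FOUND'}")
```

Run it in the same shell (and venv) as TensorFlow so that the same LD_LIBRARY_PATH applies.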

@javedsha
Author

@ahtik is the local thing same as the stable version? I will give this a try today and will post here. Thank you.

@ahtik
Contributor

ahtik commented Sep 16, 2020 via email

@Zapunidi

I have the same problem with libcublas.so.10. Same OS, same Python version, TF 2.3 etc. The only differences are that I didn't use a venv and I have a different GPU.
I followed the Ubuntu 18.04 instructions from the official guide: https://www.tensorflow.org/install/gpu
I also found a cuda-10.2 folder next to the cuda-10.1 folder, with the former containing libcublas.so.10 and the latter containing all the other libs.

My solution was to install CUDA 10.2, even though that contradicts the guide. The GPU now works in TensorFlow. I took CUDA 10.2 from the NVIDIA website as a deb package.

Also, the guide itself (https://www.tensorflow.org/install/gpu) does not seem to be well written. It tells you to install CUPTI, although there is no way to install it separately. It reads: "Install CUPTI which ships with the CUDA® Toolkit. Append its installation directory to the $LD_LIBRARY_PATH environmental variable:" when IMHO it should read: "Install CUDA Toolkit. You will have CUPTI library installed. Append its installation directory to the $LD_LIBRARY_PATH environmental variable:" And I still don't get why the CUPTI section comes before the section on installing CUDA on Ubuntu. I hope my feedback is useful.

@ahtik
Contributor

ahtik commented Sep 17, 2020

@Zapunidi Indeed, the official guide for Ubuntu doesn't seem to work for me either (all other libs load fine, getting one Warning):

2020-09-17 12:50:29.307000: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-09-17 12:50:29.307313: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory
2020-09-17 12:50:29.334711: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-09-17 12:50:29.340930: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-09-17 12:50:29.391160: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-09-17 12:50:29.400149: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-09-17 12:50:29.507706: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7

Almost like something in the NVIDIA machine-learning repo still manages to force an upgrade from 10.1..

I do not have this warning when using the local installation method I posted previously. For cudnn and tensorrt/libnvinfer I have a separate tensorrt-cuda10.1 setup.
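When comparing setups it helps to reduce such a log to "opened" versus "failed" libraries. Here is a minimal sketch that does this for the two message formats shown above. The function name is hypothetical, and other TensorFlow versions may word these messages differently.

```python
import re
from typing import List, Tuple

# Patterns match the two dso_loader message formats quoted in this thread.
_OPENED = re.compile(r"Successfully opened dynamic library (\S+)")
_FAILED = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_loader_log(log_text: str) -> Tuple[List[str], List[str]]:
    """Return (opened, failed) library names found in a TensorFlow log."""
    opened: List[str] = []
    failed: List[str] = []
    for line in log_text.splitlines():
        match = _OPENED.search(line)
        if match:
            opened.append(match.group(1))
            continue
        match = _FAILED.search(line)
        if match:
            failed.append(match.group(1))
    return opened, failed


if __name__ == "__main__":
    import sys

    # Usage: python your_script.py 2>&1 | python this_file.py
    ok, bad = summarize_loader_log(sys.stdin.read())
    print("opened:", ok)
    print("failed:", bad)
```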

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Sep 19, 2020
@javedsha
Author

So the issue affects everyone. It would make sense to update the TensorFlow documentation, as it does not work.
@Zapunidi did you follow the same steps as in https://www.tensorflow.org/install/gpu, except that you installed CUDA 10.2 from the official package? Did you also install the CUDA toolkit? Could you please post all the steps in detail, as that could help everyone.

@Zapunidi

@Zapunidi did you follow the same steps as mentioned in https://www.tensorflow.org/install/gpu, except that you installed Cuda 10.2 from official package. Did you also installed CUDA toolkit? Could you please post all the steps (detailed) as it can help everyone.

I can't see the difference between CUDA and CUDA Toolkit. Even https://www.tensorflow.org/install/gpu juggles these two terms, as in "The following NVIDIA® software must be installed on your system: ... CUDA® Toolkit —TensorFlow supports CUDA® 10.1 (TensorFlow >= 2.1.0)..." Then, for the Linux setup, the manual just mentions "CUDA" without "Toolkit".
I didn't write down the exact steps, so my report is not reliable, sorry. I remember that I

  1. Installed CUDA Toolkit 10.1 from the link in the manual: https://developer.nvidia.com/cuda-toolkit-archive
  2. Installed cuDNN SDK 7.6.5 from the NVIDIA website.
  3. Rebooted.
  4. Executed every command from the Ubuntu 18.04 console commands block at https://www.tensorflow.org/install/gpu. I didn't reboot in the middle of the block, as it tells you to, because I already had the required drivers and kernel module. A violation, yes.

So it was not a clean install. I do not have a spare machine with a supported GPU to run a clean test for you guys. I also don't think GPU virtualization is mature enough to use a virtual machine on my primary PC.

@ahtik
Contributor

ahtik commented Sep 23, 2020

@Zapunidi Yes, CUDA and CUDA Toolkit are 100% the same. Skipping the reboot in the middle does not matter, as long as you still reboot after the last step.

One thing that might work is to run the "local" installation method on top of everything, like this, and see what happens after a reboot (taken from my comment above):

# CUDA 10.1 instructions for creating a locally available repo and installing from it
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
sudo apt-get update

sudo apt install cuda-10.1

# Check which libs are now where, just for your own sanity; env vars should be set after reboot by themselves.
sudo find /usr/local/cuda/ -name '*blas*'
sudo find /usr/lib/ -name '*blas*'

If this fails and you are still curious, you can try my instructions in #43236 (comment) using the "local" repo installation method; this way the CUDA version remains 10.1 and TF works fine. Just make sure not to reboot before the end, and ensure that the most recent nvidia-driver-418 suitable for your GPU is used (the same one that you currently have). This does involve some risk on a primary PC; I just don't know a better way to quickly clean up everything CUDA-related without removing the NVIDIA driver at the same time.

@ravikyram ravikyram added the stat:awaiting response Status - Awaiting response from author label Sep 25, 2020
@bnsblue

bnsblue commented Oct 1, 2020

The problem is that libcublas seems to be missing when installing cuda-10.1 via apt.
You will not find libcublas.so.10 under /usr/local/cuda-10.1/lib64/ (the default installation path).

A workaround seems to be installing cuda-10.1 via the runfile. I encountered another error during the installation process, but maybe it works for you.

Check this thread for more details: https://forums.developer.nvidia.com/t/cublas-for-10-1-is-missing/71015/18

@ahtik
Contributor

ahtik commented Oct 1, 2020

@bnsblue With the local installer method that I detailed in my comment above, libcublas.so.10 is installed as /usr/lib/x86_64-linux-gnu/libcublas.so.10 and everything works fine without additional tweaks [1]. This works for both Ubuntu 18.04 and 20.04. Indeed, the official TensorFlow GPU installation method does not work for me either. Btw, on Ubuntu 20.04 one should still use the 1804 repo in order to get access to cuda-10.1 (the 2004 apt repo only seems to have cuda-11).

[1] It has evolved a bit for our use and does not include the libcudnn7 and tensorrt bits, but this should still work as well. For the NVIDIA driver I am using v455. If you're interested, I can provide the full instructions that I'm using.

@javedsha
Author

javedsha commented Oct 4, 2020

@ravikyram why is this waiting for an author response? The steps in the documentation don't work.

@bnsblue

bnsblue commented Oct 6, 2020

@ahtik Thanks for the response! It would be awesome if you could share the full instructions :)

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Oct 6, 2020
@johntmyers

@ahtik I also tried your "local install" and libcublas.so.10 does not get installed into that location. Any ideas?

@johntmyers

Must have gotten something wrong the first time, the libraries show up now. But nvidia-smi fails now.

sudo find /usr/lib/ -name '*blas*'
/usr/lib/x86_64-linux-gnu/libnvblas.so.10.2.1.243
/usr/lib/x86_64-linux-gnu/libcublas_static.a
/usr/lib/x86_64-linux-gnu/libcublasLt_static.a
/usr/lib/x86_64-linux-gnu/libcublas.so.10
/usr/lib/x86_64-linux-gnu/libnvblas.so.10
/usr/lib/x86_64-linux-gnu/libcublasLt.so.10.2.1.243
/usr/lib/x86_64-linux-gnu/libcublas.so
/usr/lib/x86_64-linux-gnu/libnvblas.so
/usr/lib/x86_64-linux-gnu/stubs/libcublas.so
/usr/lib/x86_64-linux-gnu/stubs/libcublasLt.so
/usr/lib/x86_64-linux-gnu/libcublasLt.so
/usr/lib/x86_64-linux-gnu/libcublasLt.so.10
/usr/lib/x86_64-linux-gnu/libcublas.so.10.2.1.243
/usr/lib/pkgconfig/cublas-10.pc

Then nvidia-smi yields: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

I will continue to try and get this working.
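One detail in the listing above is easy to miss: the cuBLAS files carry version 10.2.1.243 even though they came from the CUDA 10.1 installer. That appears consistent with the NVIDIA forum remark, linked earlier in the thread, that cuBLAS is packaged differently from CUDA 10.1 onward. Here is a small sketch that extracts the full library versions from `find` output. The helper name is hypothetical and the regex assumes the usual `libNAME.so.X.Y...` naming.

```python
import re
from typing import Dict, Iterable

# Matches fully versioned shared objects such as .../libcublas.so.10.2.1.243
_VERSIONED_SO = re.compile(r"/(lib\w+)\.so\.(\d+(?:\.\d+)+)$")


def full_versions(paths: Iterable[str]) -> Dict[str, str]:
    """Map library base names to the full version found in their file names."""
    versions: Dict[str, str] = {}
    for path in paths:
        match = _VERSIONED_SO.search(path.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


if __name__ == "__main__":
    # A few lines copied from the listing in this comment.
    listing = [
        "/usr/lib/x86_64-linux-gnu/libnvblas.so.10.2.1.243",
        "/usr/lib/x86_64-linux-gnu/libcublas.so.10",
        "/usr/lib/x86_64-linux-gnu/libcublas.so.10.2.1.243",
    ]
    print(full_versions(listing))
```

Entries with only the major version, such as libcublas.so.10, are the names the loader looks up and are deliberately skipped here.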

@ahtik
Contributor

ahtik commented Oct 11, 2020

@johntmyers Did you make sure to restart the machine after all the driver and CUDA installation steps? This error is usually from not restarting.

@johntmyers

@ahtik Yes, same issue. FWIW I'm using 18.04 on Google Compute Engine, so I'm not sure if something there is not working properly.

@ravikyram
Contributor

@javedsha

Any updates on the issue, please? Thanks!

@ravikyram ravikyram added the stat:awaiting response Status - Awaiting response from author label Oct 19, 2020
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Oct 26, 2020
@jchwenger

@ahtik thanks for your setup above, I would also be grateful if you shared your TensorRT instructions!

@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.


@ravikyram ravikyram added the subtype: ubuntu/linux Ubuntu/Linux Build/Installation Issues label Nov 10, 2020
@guruvishnuvardan

guruvishnuvardan commented Dec 25, 2020

@ahtik

System information

OS Platform and Distribution (e.g., Linux Ubuntu 18.04): Ubuntu 18.04
TensorFlow installed from (source or binary): binary
TensorFlow version: 2.3
Python version: 3.6
Installed using virtualenv? pip? conda?: pip3
GCC/Compiler version (if compiling from source): 7.5
CUDA/cuDNN version: 10.1
GPU model and memory: 1080 TI

I followed exactly the instructions dated Sep 16 and am encountering the following error:

ioz@ioz-B250M-DS3H:~$ python3.6 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([10000, 10000])))"
2020-12-25 21:48:47.947244: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python import data
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/__init__.py", line 25, in <module>
    from tensorflow.python.data import experimental
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/__init__.py", line 125, in <module>
    from tensorflow.python.data.experimental.ops.parsing_ops import parse_example_dataset
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/parsing_ops.py", line 26, in <module>
    from tensorflow.python.ops import parsing_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/parsing_ops.py", line 27, in <module>
    from tensorflow.python.ops import parsing_config
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/parsing_config.py", line 31, in <module>
    from tensorflow.python.ops import sparse_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/sparse_ops.py", line 42, in <module>
    from tensorflow.python.ops import special_math_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/special_math_ops.py", line 30, in <module>
    import opt_einsum
ModuleNotFoundError: No module named 'opt_einsum'
ioz@ioz-B250M-DS3H:~$ python3.6 -c "import tensorflow as tf;

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
bash: unexpected EOF while looking for matching `"'
bash: syntax error: unexpected end of file
ioz@ioz-B250M-DS3H:~$ python3.6
Python 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-12-25 21:50:17.792384: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python import data
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/__init__.py", line 25, in <module>
    from tensorflow.python.data import experimental
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/__init__.py", line 125, in <module>
    from tensorflow.python.data.experimental.ops.parsing_ops import parse_example_dataset
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/experimental/ops/parsing_ops.py", line 26, in <module>
    from tensorflow.python.ops import parsing_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/parsing_ops.py", line 27, in <module>
    from tensorflow.python.ops import parsing_config
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/parsing_config.py", line 31, in <module>
    from tensorflow.python.ops import sparse_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/sparse_ops.py", line 42, in <module>
    from tensorflow.python.ops import special_math_ops
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/special_math_ops.py", line 30, in <module>
    import opt_einsum
ModuleNotFoundError: No module named 'opt_einsum'

Can you please help me with your thoughts.

Thanks
Guru
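A note on the traceback above: the import fails on `opt_einsum`, a Python package that pip normally installs together with TensorFlow, before any GPU library apart from libcudart is touched. This may therefore point to an incomplete pip install rather than a CUDA problem, although nobody in this thread has confirmed that. Here is a minimal sketch for checking which top-level modules are importable; the helper name is made up and only `opt_einsum` is taken from the traceback.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that the current interpreter cannot find."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # 'opt_einsum' comes from the traceback; extend the list as needed.
    print("missing:", missing_modules(["opt_einsum"]))
```

Run it with the same interpreter that raised the error (python3.6 here), since each interpreter has its own site-packages.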

@LucaUrbinati44

LucaUrbinati44 commented Apr 3, 2024

I thank @ahtik for his post. I would like to contribute my own recipe, derived from @ahtik's. Some comments are below.

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.243-418.87.00_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-1-local-10.1.243-418.87.00/7fa2af80.pub
sudo apt-get update

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/nvidia-driver-440_440.33.01-0ubuntu1_amd64.deb
sudo apt install -y ./nvidia-driver-440_440.33.01-0ubuntu1_amd64.deb 

sudo apt-mark hold libnvidia-cfg1-440 libnvidia-compute-440 libnvidia-decode-440 libnvidia-encode-440 libnvidia-fbc1-440 libnvidia-gl-440 libnvidia-ifr1-440 nvidia-compute-utils-440 nvidia-dkms-440 nvidia-driver-440 nvidia-kernel-common-440 nvidia-kernel-source-440 nvidia-utils-440 xserver-xorg-video-nvidia-440

sudo apt install cuda-drivers=440.33.01-1 cuda-runtime-10-1 cuda-demo-suite-10-1 cuda-10.1

wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn7_7.6.0.64-1+cuda10.1_amd64.deb
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libcudnn7-dev_7.6.0.64-1+cuda10.1_amd64.deb
sudo dpkg -i libcudnn7_7.6.0.64-1+cuda10.1_amd64.deb libcudnn7-dev_7.6.0.64-1+cuda10.1_amd64.deb
sudo dpkg -l | grep cudnn

cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2

My server GPU model: GeForce GTX 1070
Driver version (from cat /proc/driver/nvidia/version): 440.33.01

  • I had to force cuda-drivers to be 440.33.01-1 to make my nvidia-smi work after the installation.
  • I had to apt-mark hold some libraries to prevent their automatic upgrade during the installation of cuda 10.1.
  • I installed only Python 3.8 and TensorFlow 2.3.0 in my conda environment, to meet the TensorFlow requirements for CUDA 10.1: https://www.tensorflow.org/install/source#gpu.
  • On my system I had to use cuDNN 7.6.0 instead of 7.6.5 to make the error "E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error" disappear.
  • I did NOT install cudnn and cudatoolkit in my conda environment.
  • I exported the following environment variables at login in my .bashrc (see https://stackoverflow.com/a/64472380/11644517 for setting LD_LIBRARY_PATH correctly):
export CUDA_HOME=/usr/local/cuda
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:/usr/local/cuda-10.2/lib64
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CUDA_HOME
export TF_XLA_FLAGS="--tf_xla_enable_xla_devices"
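To confirm that a new shell really picked up these exports, a short sketch like the following can compare LD_LIBRARY_PATH against the directories you expect. The function name is hypothetical, and the directory list just mirrors the export lines above; it is not a requirement stated by TensorFlow.

```python
import os
from typing import Dict, Iterable, Optional


def library_path_status(
    required_dirs: Iterable[str], env_value: Optional[str] = None
) -> Dict[str, bool]:
    """Report which directories are listed in LD_LIBRARY_PATH (or in `env_value`)."""
    if env_value is None:
        env_value = os.environ.get("LD_LIBRARY_PATH", "")
    # Normalise entries so that a trailing slash does not cause a false negative.
    entries = {os.path.normpath(entry) for entry in env_value.split(":") if entry}
    return {d: os.path.normpath(d) in entries for d in required_dirs}


if __name__ == "__main__":
    # Directories mirror the exports above; adjust them to your installation.
    wanted = ["/usr/local/cuda/lib64", "/usr/local/cuda-10.2/lib64"]
    for directory, present in library_path_status(wanted).items():
        print(f"{directory}: {'listed' if present else 'missing from LD_LIBRARY_PATH'}")
```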

Check if the GPU is working with:

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
