
TF 2.16.1 Fails to work with GPUs #63362

Open
JuanVargas opened this issue Mar 10, 2024 · 134 comments
Assignees
Labels
awaiting review, comp:gpu, stat:awaiting tensorflower, TF 2.16, type:bug

Comments

@JuanVargas

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

TF 2.16.1

Custom code

No

OS platform and distribution

Linux Ubuntu 22.04.4 LTS

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

12.4

GPU model and memory

No response

Current behavior?

I created a python venv in which I installed TF 2.16.1 following your instructions: pip install tensorflow
When I run python, import tensorflow as tf, and issue tf.config.list_physical_devices('GPU'),
I get an empty list [ ].

I created another python venv, installed TF 2.16.1, only this time with the instructions:

python3 -m pip install tensorflow[and-cuda]

When I run that version, import tensorflow as tf, and issue

tf.config.list_physical_devices('GPU')

I also get an empty list.

BTW, I have no problems running TF 2.15.1 with GPUs on my box. Julia also works just fine with GPUs, and so does PyTorch.

Standalone code to reproduce the issue

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-03-09 19:15:45.018171: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-09 19:15:50.412646: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> tf.__version__
'2.16.1'

>>> tf.config.list_physical_devices('GPU')
2024-03-09 19:16:28.923792: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-03-09 19:16:29.078379: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>>
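A quick way to narrow down which libraries the "Cannot dlopen some GPU libraries" warning refers to is to probe the usual SONAMEs directly with ctypes (a sketch; the library names below are assumptions for a CUDA 12 / cuDNN 8 setup, and `probe_libs` is a hypothetical helper, not part of TensorFlow):

```python
# Probe which CUDA-related shared libraries the dynamic loader can find.
# Returns {soname: loadable?}; a False entry points at a library that
# TensorFlow would also fail to dlopen.
import ctypes

def probe_libs(sonames=("libcudart.so.12", "libcudnn.so.8", "libcublas.so.12")):
    results = {}
    for name in sonames:
        try:
            ctypes.CDLL(name)  # same mechanism TF uses to load the library
            results[name] = True
        except OSError:
            results[name] = False
    return results

print(probe_libs())
```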

Relevant log output

No response

@sh-shahrokhi

sh-shahrokhi commented Mar 10, 2024

It does not work with Python 3.12.2 either; same error. I installed TensorFlow with pip install tensorflow[and-cuda]

@damadorPL

The same error on bare Ubuntu and WSL2. 2.15 works without any problems with Python 3.11.

@DiegoMont

I have the same problem with Ubuntu 22.04.4 with the following environment:

  • tensorflow==2.16.1
  • Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
  • cuDNN 8.6.0.163
  • gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@AlpriElse

I'm not sure if this is the root cause, but I resolved my own issue which also surfaced as a "Cannot dlopen some GPU libraries." error when trying to run python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

To resolve my issue, I followed the tested build versions here:
https://www.tensorflow.org/install/source#gpu

and I needed to downgrade my existing installations from cuDNN 9 to 8.9 and from CUDA 12.4 to 12.3.

When you're on an NVIDIA download page like the one for the CUDA Toolkit, don't just download the latest version. See previous versions by clicking "Archive of Previous CUDA Releases".

@JuanVargas can you try moving to a tested build configuration for TF 2.16 by uninstalling your existing CUDA installation and downgrading to CUDA 12.3?

I followed this post to uninstall my existing cuda installation:
https://askubuntu.com/questions/530043/removing-nvidia-cuda-toolkit-and-installing-new-one

@DiegoMont can you try upgrading your cuDNN to 8.9 and CUDA to 12.3?
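Since the tested-build table keys off exact versions, it can help to print what pip actually installed before comparing against that page (a sketch; `wheel_versions` is a hypothetical helper, and the package names passed in are the ones the and-cuda extra typically pulls in, which is an assumption):

```python
# Report the versions of CUDA-related wheels installed in the current
# environment, for comparison against the tested configurations at
# https://www.tensorflow.org/install/source#gpu.
import importlib.metadata as md

def wheel_versions(names):
    versions = {}
    for name in names:
        try:
            versions[name] = md.version(name)
        except md.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

print(wheel_versions(["tensorflow", "nvidia-cudnn-cu12", "nvidia-cuda-runtime-cu12"]))
```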

@Gwyki

Gwyki commented Mar 11, 2024

I am having the same issue. Brand new Ubuntu 22.04 WSL2 image. A blank Conda environment with either Python 3.12.* or 3.11.* fails to correctly set up TensorFlow for GPU use when following the recommended:
pip install tensorflow[and-cuda]

Trying to list the physical devices results in:

2024-03-11 02:00:00.294704: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 02:00:00.709325: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-11 02:00:01.180225: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:2d:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 02:00:01.180445: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
  • cuDNN 8.9.*
  • CUDA 12.3
  • TensorFlow 2.16.1
  • TensorRT 8.6.1

Is this a new issue caused by the fact that no system CUDA appears to need to be installed separately in WSL2 anymore? I certainly didn't install one manually, and yet nvidia-smi happily reports CUDA version 12.3. It probably comes down to some env paths not being set correctly, but playing around with $CUDA_PATH and guessing the location within the conda environment has not resolved anything. TensorRT doesn't seem to be picked up either, yet it is definitely installed in the conda environment. PyTorch GPU visibility works as expected.

@SuryanarayanaY SuryanarayanaY added comp:gpu GPU related issues TF 2.16 labels Mar 11, 2024
@SuryanarayanaY
Collaborator

Hi @JuanVargas ,

For the GPU package you need to ensure the CUDA driver is installed, which can be verified with the nvidia-smi command. Then install the TF CUDA package with pip install tensorflow[and-cuda], which automatically installs the required CUDA/cuDNN libraries.

I have checked in Colab and was able to detect the GPU. Please refer to the attached gist.
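The two prerequisites named above (a visible driver and the pip-installed CUDA wheels) can be sanity-checked from Python before digging further (a sketch; `check_prereqs` is a hypothetical helper, not part of TensorFlow):

```python
# Check the two GPU prerequisites: nvidia-smi on PATH (driver installed)
# and the nvidia namespace package importable (i.e. the CUDA wheels from
# pip install tensorflow[and-cuda] are present).
import importlib.util
import shutil

def check_prereqs():
    return {
        "driver_visible": shutil.which("nvidia-smi") is not None,
        "cuda_wheels_installed": importlib.util.find_spec("nvidia") is not None,
    }

print(check_prereqs())
```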

@SuryanarayanaY SuryanarayanaY added the stat:awaiting response Status - Awaiting response from author label Mar 11, 2024
@damadorPL

damadorPL commented Mar 11, 2024

Double quotes in pip install because of zsh:

pip install "tensorflow[and-cuda]==2.16.1"                                                                       
 

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already satisfied: tensorflow==2.16.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (2.16.1)
Requirement already satisfied: absl-py>=1.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.1.0)
Requirement already satisfied: astunparse>=1.6.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.6.3)
Requirement already satisfied: flatbuffers>=23.5.26 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (24.3.7)
Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.5.4)
Requirement already satisfied: google-pasta>=0.1.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.2.0)
Requirement already satisfied: h5py>=3.10.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.10.0)
Requirement already satisfied: libclang>=13.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (16.0.6)
Requirement already satisfied: ml-dtypes~=0.3.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.3.2)
Requirement already satisfied: opt-einsum>=2.3.2 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.3.0)
Requirement already satisfied: packaging in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (24.0)
Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (4.25.3)
Requirement already satisfied: requests<3,>=2.21.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.31.0)
Requirement already satisfied: setuptools in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (69.1.1)
Requirement already satisfied: six>=1.12.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.16.0)
Requirement already satisfied: termcolor>=1.1.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.4.0)
Requirement already satisfied: typing-extensions>=3.6.6 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (4.10.0)
Requirement already satisfied: wrapt>=1.11.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.16.0)
Requirement already satisfied: grpcio<2.0,>=1.24.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.62.1)
Requirement already satisfied: tensorboard<2.17,>=2.16 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.16.2)
Requirement already satisfied: keras>=3.0.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.5)
Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.36.0)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (1.26.4)
Requirement already satisfied: nvidia-cublas-cu12==12.3.4.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.4.1)
Requirement already satisfied: nvidia-cuda-cupti-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: nvidia-cuda-nvcc-cu12==12.3.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.107)
Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.3.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.107)
Requirement already satisfied: nvidia-cuda-runtime-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: nvidia-cudnn-cu12==8.9.7.29 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (8.9.7.29)
Requirement already satisfied: nvidia-cufft-cu12==11.0.12.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (11.0.12.1)
Requirement already satisfied: nvidia-curand-cu12==10.3.4.107 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (10.3.4.107)
Requirement already satisfied: nvidia-cusolver-cu12==11.5.4.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (11.5.4.101)
Requirement already satisfied: nvidia-cusparse-cu12==12.2.0.103 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.2.0.103)
Requirement already satisfied: nvidia-nccl-cu12==2.19.3 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (2.19.3)
Requirement already satisfied: nvidia-nvjitlink-cu12==12.3.101 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorflow[and-cuda]==2.16.1) (12.3.101)
Requirement already satisfied: wheel<1.0,>=0.23.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from astunparse>=1.6.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.42.0)
Requirement already satisfied: rich in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (13.7.1)
Requirement already satisfied: namex in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.0.7)
Requirement already satisfied: dm-tree in ./miniconda3/envs/tf/lib/python3.11/site-packages (from keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.1.8)
Requirement already satisfied: charset-normalizer<4,>=2 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from requests<3,>=2.21.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2024.2.2)
Requirement already satisfied: markdown>=2.6.8 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.5.2)
Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.7.2)
Requirement already satisfied: werkzeug>=1.0.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.1)
Requirement already satisfied: MarkupSafe>=2.1.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from werkzeug>=1.0.1->tensorboard<2.17,>=2.16->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.1.5)
Requirement already satisfied: markdown-it-py>=2.2.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (3.0.0)
Requirement already satisfied: pygments<3.0.0,>=2.13.0 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (2.17.2)
Requirement already satisfied: mdurl~=0.1 in ./miniconda3/envs/tf/lib/python3.11/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.0.0->tensorflow==2.16.1->tensorflow[and-cuda]==2.16.1) (0.1.2)
nvidia-smi             
                                                                                           
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.60.01              Driver Version: 551.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 Ti     On  |   00000000:01:00.0  On |                  N/A |
|  0%   39C    P5             10W /  285W |    4334MiB /  12282MiB |     13%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        41      G   /Xwayland                                   N/A      |
+-----------------------------------------------------------------------------------------+

python3

Python 3.11.7 (main, Dec 15 2023, 18:12:31) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-03-11 09:36:29.601060: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-11 09:36:29.921637: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 09:36:30.793353: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
>>> print(tf.config.list_physical_devices('GPU'))
2024-03-11 09:36:33.878560: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 09:36:33.980099: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[]
>>>

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 11, 2024
@damadorPL

nvcc -V 
                                                                                                          
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

@damadorPL

damadorPL commented Mar 11, 2024

Got it to work :) First, go to
https://developer.nvidia.com/rdp/cudnn-archive

then download the Local Installer for Ubuntu 22.04 x86_64 (Deb),

unpack, and install libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb:

sudo dpkg -i libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb   
                                                           
Selecting previously unselected package libcudnn8.
(Reading database ... 47318 files and directories currently installed.)
Preparing to unpack libcudnn8_8.9.7.29-1+cuda12.2_amd64.deb ...
Unpacking libcudnn8 (8.9.7.29-1+cuda12.2) ...
Setting up libcudnn8 (8.9.7.29-1+cuda12.2) ...

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"  

                             
2024-03-11 10:27:47.879686: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-11 10:27:47.909157: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-11 10:27:48.316717: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-11 10:27:48.664469: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 10:27:48.688059: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-03-11 10:27:48.688111: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

@JuanVargas
Author

JuanVargas commented Mar 11, 2024 via email

@sh-shahrokhi

sh-shahrokhi commented Mar 11, 2024 via email

@JuanVargas
Author

JuanVargas commented Mar 11, 2024 via email

@JuanVargas
Author

JuanVargas commented Mar 11, 2024 via email

@sh-shahrokhi

sh-shahrokhi commented Mar 11, 2024 via email

@damadorPL

damadorPL commented Mar 11, 2024

https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ — you can get the .deb file there directly.

@Gwyki

Gwyki commented Mar 11, 2024

Thanks @sh-shahrokhi. I thought it was path related. Modified slightly to make it python version independent if you put it in your conda environment activation ([environment]/etc/activate.d/env_vars.sh).

NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")))
for dir in $NVIDIA_DIR/*; do
    if [ -d "$dir/lib" ]; then
        export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
    fi
done

This is not a resolution as this post install step should not be necessary.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

I can't seem to do similar tricks to resolve the TensorRT issues when installed similarly into the conda environment. Any ideas?
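For reference, the library-path discovery done by the shell loop above can also be done from Python, which makes it easy to see exactly which lib directories the wheels ship (a sketch; `nvidia_lib_dirs` is a hypothetical helper name):

```python
# Find the lib/ directories shipped inside the pip-installed nvidia-* wheels,
# i.e. the directories the shell loop above prepends to LD_LIBRARY_PATH.
import glob
import importlib.util
import os

def nvidia_lib_dirs():
    spec = importlib.util.find_spec("nvidia")
    if spec is None or not spec.submodule_search_locations:
        return []  # the nvidia namespace package is not installed
    root = list(spec.submodule_search_locations)[0]
    return sorted(d for d in glob.glob(os.path.join(root, "*", "lib"))
                  if os.path.isdir(d))

print(nvidia_lib_dirs())
```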

@sh-shahrokhi

sh-shahrokhi commented Mar 11, 2024

Thanks @sh-shahrokhi. I thought it was path related. Modified slightly to make it python version independent if you put it in your conda environment activation ([environment]/etc/activate.d/env_vars.sh).

NVIDIA_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)")))
for dir in $NVIDIA_DIR/*; do
    if [ -d "$dir/lib" ]; then
        export LD_LIBRARY_PATH="$dir/lib:$LD_LIBRARY_PATH"
    fi
done

This is not a resolution as this post install step should not be necessary.

W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

I can't seem to do similar tricks to resolve the TensorRT issues when installed similarly into the conda environment. Any ideas?

I don't actually use TensorRT, but I would check whether the required .so file for it is visible to TensorFlow. You might need to find the name of the required file in the TensorFlow source code.

This actually doesn't change the fact that the new tensorflow version should be tested by google team before release, or the bugs should be fixed. It seems they only care about having a working docker image, not anything else.

@Gwyki

Gwyki commented Mar 12, 2024

I have given up on TensorRT. I guess I won't be using it either.

This actually doesn't change the fact that the new tensorflow version should be tested by google team before release, or the bugs should be fixed. It seems they only care about having a working docker image, not anything else.

Agreed. Installing TF has always been hit or miss and it seems that in the many years since I last used TF that hasn't changed one bit.

@moozoo64

Well, I wasted 8 hours of my Sunday on this, setting up another PC from scratch, before reverting to the old version. Now looking to move off TensorFlow.

@mihaimaruseac
Collaborator

mihaimaruseac commented Mar 12, 2024

In general, we used to test RC versions before release. For example, we used to have RC0, RC1 and RC2 for TF 2.9. This gave people and downstream teams enough time to test and report issues.

It seems that 2.16.1 only had an RC0 (for 2.16.0).

The release process is (was?) like this:

  • cut the release branch (e.g., r2.17)
  • immediately trigger the release pipeline. This would create a few PRs to update version numbers, release notes, but after this step RC0 should be as close as possible to the version on master branch at the time the release branch has been cut. There should not be any code changes to the release branch at this point (except to maybe cherrypick fixes from master from hard bugs caused by cutting the branch at a wrong commit)
  • have at least a week of testing for downstream teams to test RC0
  • get fixes to discovered bugs landed on master, cherrypick them to release branch, after they are already tested on nightly releases
  • trigger RC1 pipeline. Again, no other code changes should occur now, except to fix bugs discovered during building
  • wait a week for downstream teams to test. If there are bugs, repeat the steps above for another RC, otherwise repeat the steps above for the final version.

Overall, this process would take number_of_RCs + 1 weeks with a possibility of a few more weeks of delay.

However, for the 2.16 release, although the branch was cut on Feb 8th, there has been only one RC. Most likely these issues can be solved by a patch release.


@JuanVargas
Author

I am closing this (unresolved) issue because I am told by the Keras/TF team that the issue is related to TF.

@sgkouzias

I'm generally confused about this setup with WSL 2: what exactly needs to be installed, and where?

When attempting to install using the command pip install tensorflow[and-cuda]

The following error is displayed:

ERROR: Could not find a version that satisfies the requirement nvidia-nccl-cu12==2.19.3; extra == "and-cuda" (from tensorflow[and-cuda]) (from versions: 0.0.1.dev5)
ERROR: No matching distribution found for nvidia-nccl-cu12==2.19.3; extra == "and-cuda"

TensorFlow 2.16.1: to support it, you need this (image).

python version:

Python 3.12.0

I'm running the command nvidia-smi and this is what it gives me:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.85                 Driver Version: 555.85         CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1050 Ti   WDDM  |   00000000:01:00.0  On |                  N/A |
|  0%   50C    P8             N/A /   90W |     863MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I installed CUDA and cuDNN (image):

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:30:42_Pacific_Standard_Time_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

wsl 2 (uname -m && cat /etc/*release):

x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.3 LTS"
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

I read in the NVIDIA manual that you need to install the CUDA Toolkit in Linux. I installed version 12.3 and still can't install TensorFlow.

Please explain it to me: what do I need to install, and where?

What do I need to install inside WSL 2, and what inside Windows?

I am confused about the Toolkit and cuDNN: what should be installed where?

@MrOxMasTer the purpose of installing WSL 2 is to install TensorFlow inside it; as a process it is unrelated to installing the CUDA Toolkit and cuDNN on Windows. Generally, installing WSL 2 allows you to run a full Linux environment within Windows, making it easier to develop and run applications that rely on Linux-based tools and libraries, such as TensorFlow 2.16.1. More specifically, in order to run the latest version of TensorFlow and utilize your GPU in WSL 2, according to the TensorFlow official documentation you must follow the TensorFlow installation procedure on WSL 2. Hopefully that makes more sense.

@thephet

thephet commented Jun 3, 2024

@sgkouzias Thank you. The solution from @niko247 worked, and it is what I am using.

@MrOxMasTer

@MrOxMasTer the purpose of installing WSL 2 is to install TensorFlow inside it; as a process it is unrelated to installing the CUDA Toolkit and cuDNN on Windows. Generally, installing WSL 2 allows you to run a full Linux environment within Windows, making it easier to develop and run applications that rely on Linux-based tools and libraries, such as TensorFlow 2.16.1. More specifically, in order to run the latest version of TensorFlow and utilize your GPU in WSL 2, according to the TensorFlow official documentation you must follow the TensorFlow installation procedure on WSL 2. Hopefully that makes more sense.

I also have to guess what the installation problem is:
[screenshot]

The most confusing setup I've ever seen. It is not clear where these commands need to be entered, or why: if all commands need to be entered in WSL 2, then why does your suggestion include a command that is run only on the side where the graphics driver is installed? Because clearly:

[screenshot]

After all, NVIDIA states in plain terms that the graphics driver must not be installed inside WSL, and yet you suggest a command that works only on the Windows side while saying the entire installation happens inside WSL.

How am I supposed to know what to install inside WSL? From the point of view of someone using this functionality for the first time, it sounds like nonsense: I don't understand how to use it, so naturally I will try to run things in Windows, because Windows is where I work. Even if I install it, how would I know that, for example, VS Code has a "WSL" extension that lets you connect to WSL?

@rkuo2000
Copy link

rkuo2000 commented Jun 7, 2024

@MrOxMasTer you can do every installation (mostly Python packages, via pip install or sudo apt install) in WSL Ubuntu as if you were using a PC running Ubuntu, except for the CUDA and cuDNN installation for Windows.

@sgkouzias
Copy link

sgkouzias commented Jun 7, 2024

@MrOxMasTer since you work in Windows you could simply refer to the TensorFlow official documentation to install TensorFlow with pip for Windows WSL2 (aka Windows Subsystem for Linux) and open the provided link for the official CUDA on WSL User Guide.
[screenshot]

Notice that the official CUDA on WSL User Guide clearly states that:

"Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2. The CUDA driver installed on Windows host will be stubbed inside the WSL 2 as libcuda.so, therefore users must not install any NVIDIA GPU Linux driver within WSL 2...."

[screenshot]

Also kindly note that the current issue, "TF 2.16.1 Fails to work with GPUs", concerns Linux operating systems and the additional steps that potentially need to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

Until today, the officially documented TensorFlow standard installation procedure for Linux users with GPUs has not included the additional steps required to perform deep learning experiments with TensorFlow version 2.16.1 and utilize a GPU locally. That's why I submitted a pull request (pending review) in good faith and for the sake of all users, as TensorFlow is "An Open Source Machine Learning Framework for Everyone".

Hope that the next patch version of TensorFlow will fix the bug as soon as possible!

@MrOxMasTer
Copy link

"Once a Windows NVIDIA GPU driver is installed on the system, CUDA becomes available within WSL 2. The CUDA driver installed on Windows host will be stubbed inside the WSL 2 as libcuda.so, therefore users must not install any NVIDIA GPU Linux driver within WSL 2...."

Yes, that is exactly my point: you do not need to install graphics drivers in WSL. But installing TensorFlow involves not only the graphics driver, it also involves cuDNN and the CUDA Toolkit (and possibly TensorRT), and it was not clear to me where each of those needed to be installed. I have since seen that everything except the graphics driver needs to be installed inside WSL.

@MrOxMasTer
Copy link

MrOxMasTer commented Jun 7, 2024

Also kindly note that the current issue opened "TF 2.16.1 Fails to work with GPUs" involves Linux Operating Systems and potentially the additional steps to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

My first acquaintance with TensorFlow, via this version, has not been pleasant. As I understand it, the specific problem is 2.16.1, and it does not work in WSL; nothing worked for me. So the question is: which version can be installed so that it works normally in WSL?

Also, for the record, installing Anaconda does not help either: you can install at most version 2.10 that way.

@sgkouzias
Copy link

sgkouzias commented Jun 7, 2024

Also kindly note that the current issue opened "TF 2.16.1 Fails to work with GPUs" involves Linux Operating Systems and potentially the additional steps to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

I started a not very pleasant acquaintance with tensorflow with this version. As I understand it, the specific reason is 2.16.1 and it does not work in wsl. Because nothing worked for me. And the question is which version can be installed so that it works normally in wsl.

Also, for the future, I will say that installing anaconda does not help either. You can install a maximum of 2.10 version on it

@MrOxMasTer I totally understand your frustration but I reassure you that TensorFlow version 2.16.1 can actually work with your cuda-enabled GPU.

You can try the following:

  1. Create a fresh conda virtual environment in WSL and activate it, like this:
conda create --name tf python=3.11
conda activate tf
  2. Within the fresh conda virtual environment tf created in the previous step run the following commands sequentially:
pip install --upgrade pip
pip install tensorflow[and-cuda]
  3. Set environment variables:

Note: This step is required in order to utilize your GPU but is not yet included in the official TensorFlow documentation. All NVIDIA libs are installed alongside TensorFlow because you ran the command pip install tensorflow[and-cuda] in the previous step!

Locate the directory for the conda environment in your terminal window by running in the terminal:

echo $CONDA_PREFIX

Enter that directory and create these subdirectories and files:

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

# Store original LD_LIBRARY_PATH 
export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" 

# Get the CUDNN directory 
CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))

# Set LD_LIBRARY_PATH to include CUDNN directory
export LD_LIBRARY_PATH=$(find ${CUDNN_DIR}/*/lib/ -type d -printf "%p:")${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Get the ptxas directory  
PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))

# Set PATH to include the directory containing ptxas
export PATH=$(find ${PTXAS_DIR}/*/bin/ -type d -printf "%p:")${PATH:+:${PATH}}

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

# Restore original LD_LIBRARY_PATH
export LD_LIBRARY_PATH="${ORIGINAL_LD_LIBRARY_PATH}"

# Unset environment variables
unset CUDNN_DIR
unset PTXAS_DIR

Verify the GPU setup:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Additionally, as I was informed the next version of TensorFlow will hopefully arrive within the next days!

I hope it helps!
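As a side note, the dirname(dirname(...)) steps in the activation hook can be traced in pure Python. The site-packages path below is a made-up example for illustration, not the location on any particular machine:

```python
import os

# Stand-in for the output of:
#   python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)"
# (hypothetical path, for illustration only)
cudnn_init = "/opt/conda/envs/tf/lib/python3.11/site-packages/nvidia/cudnn/__init__.py"

# dirname(dirname(...)) climbs from __init__.py up to the nvidia package dir,
# exactly what CUDNN_DIR=$(dirname $(dirname ...)) does in the hook
cudnn_dir = os.path.dirname(os.path.dirname(cudnn_init))
print(cudnn_dir)  # → /opt/conda/envs/tf/lib/python3.11/site-packages/nvidia

# The find ${CUDNN_DIR}/*/lib/ -printf "%p:" step then collects each lib
# directory and prepends it to LD_LIBRARY_PATH, roughly:
lib_dir = os.path.join(cudnn_dir, "cudnn", "lib")
ld_library_path = lib_dir + ":" + os.environ.get("LD_LIBRARY_PATH", "")
print(ld_library_path)
```

Tracing it this way makes it easier to spot a wrong path if the GPU still fails to appear after activation.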

@GorillaDaddy
Copy link

Thanks, but it doesn't work for me @sgkouzias. It does at least find some files. However, despite the GPU verification saying that 1 GPU is available, there is no GPU activity... it's all CPU.

@sgkouzias
Copy link

sgkouzias commented Jun 8, 2024

Thanks, but it doesn't work for me @sgkouzias. It does at least find some files. However, despite the verify gpu setup saying that 1 gpu is available, no gpu activity... all cpu

@GorillaDaddy well, in order to work, your setup should meet certain technical requirements (please first check the official TensorFlow documentation). As I am not aware of your setup I cannot really guess why your GPU does not seem to be properly utilized (if that is the case). However, here are some hints that I hope will help you:
a) check your OS compatibility,
b) check whether your GPU is compatible, and of course
c) check the Python version compatibility with the desired TensorFlow version,
d) train a deep learning model in Google Colab (you can use a TensorFlow ready-to-use dataset) on a GPU and time it; then train the same model on the same data on your PC and compare the training times.
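Hint (d) can be scripted with the standard library alone. The train_one_epoch function below is a stub standing in for your actual training call, not TensorFlow API:

```python
import time

def train_one_epoch():
    # Stub workload; replace with e.g. model.fit(x_train, y_train, epochs=1)
    return sum(i * i for i in range(100_000))

start = time.perf_counter()
train_one_epoch()
elapsed = time.perf_counter() - start
print(f"one epoch took {elapsed:.3f}s")
```

Run the same harness in Colab on a GPU runtime and on your own machine; if the local run is no faster than a CPU-only run, the GPU is likely not being used.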

@MrOxMasTer
Copy link

Also kindly note that the current issue opened "TF 2.16.1 Fails to work with GPUs" involves Linux Operating Systems and potentially the additional steps to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

[screenshot]

Hooray, it worked, thanks! But I get an error about NUMA. Is this normal? Could it be because I did not install CUDA and cuDNN as administrator?

@sgkouzias
Copy link

sgkouzias commented Jun 8, 2024

Also kindly note that the current issue opened "TF 2.16.1 Fails to work with GPUs" involves Linux Operating Systems and potentially the additional steps to be specified in the official TensorFlow documentation in order to utilize GPUs locally.

[screenshot]

Hooray, it worked, thanks, but I have a mistake with this NUMA. Is this normal? Could it be because I did not install Cuda and Cdn on behalf of the administrator?

@MrOxMasTer congratulations and thanks for the feedback.

The error "Your kernel may have been built without NUMA support" refers to the lack of NUMA (Non-Uniform Memory Access) support in the kernel you are using. NUMA is a memory architecture used in multiprocessor systems where the memory access time depends on the memory location relative to the processor.

NUMA support is important for optimizing memory access on systems with multiple CPUs or GPUs. It allows the operating system to allocate memory and schedule processes in a way that reduces memory access latency.

The Windows Subsystem for Linux (aka WSL) provides a Linux-compatible kernel interface developed by Microsoft and allows you to run Linux binaries on Windows. However, WSL's kernel might lack certain features present in a full-fledged Linux kernel, including NUMA support.

The lack of NUMA support might lead to suboptimal performance on systems with multiple processors or GPUs because the memory allocation might not be as efficient.

Consequently, you can safely ignore the warning (you can read more about it in this discussion on developer.nvidia.com).
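If the repeated NUMA message clutters the logs, TensorFlow's native log threshold can be raised before launching Python. This only hides the message and changes nothing about memory allocation (a sketch; it assumes the NUMA line is emitted at INFO level):

```shell
# TF_CPP_MIN_LOG_LEVEL: 0 = all, 1 = filter INFO, 2 = also filter WARNING, 3 = also filter ERROR
export TF_CPP_MIN_LOG_LEVEL=1
```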

@rednag
Copy link

rednag commented Jun 10, 2024

I'm also facing the same issue as the OP since I upgraded to 2.16.1. After downgrading to 2.15.1 everything runs smoothly.

TensorFlow version
TF 2.16.1

OS platform and distribution
Linux Ubuntu 22.04.4 LTS

Python version
3.10.12

CUDA/cuDNN version
12.4

Actually I want to import the ops package from keras, but it seems it is only available from Keras 3 onwards. If I upgrade Keras I also have to upgrade TensorFlow due to incompatibilities... but after the upgrade I'm no longer able to use the GPU.

@sgkouzias
Copy link

sgkouzias commented Jun 10, 2024

@rednag as I understand it you have two available options:

1) Keep TensorFlow version 2.15 and reinstall Keras 3 afterwards
According to the official Keras documentation you can simply:
pip install --upgrade keras after installing tensorflow version 2.15

2) Upgrade Tensorflow to version 2.16.1
You can upgrade to TensorFlow version 2.16.1 and utilize your GPU locally (Keras 3.0 will be installed as well) through following the steps below:

  1. Create a fresh conda virtual environment and activate it like this:
conda create --name tf python=3.11
conda activate tf
  2. pip install --upgrade pip,
  3. pip install tensorflow[and-cuda],
  4. Set environment variables:

Locate the directory for the conda environment in your terminal window by running in the terminal:

echo $CONDA_PREFIX

Enter that directory and create these subdirectories and files:

cd $CONDA_PREFIX
mkdir -p ./etc/conda/activate.d
mkdir -p ./etc/conda/deactivate.d
touch ./etc/conda/activate.d/env_vars.sh
touch ./etc/conda/deactivate.d/env_vars.sh

Edit ./etc/conda/activate.d/env_vars.sh as follows:

#!/bin/sh

# Store original LD_LIBRARY_PATH 
export ORIGINAL_LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" 

# Get the CUDNN directory 
CUDNN_DIR=$(dirname $(dirname $(python -c "import nvidia.cudnn; print(nvidia.cudnn.__file__)")))

# Set LD_LIBRARY_PATH to include CUDNN directory
export LD_LIBRARY_PATH=$(find ${CUDNN_DIR}/*/lib/ -type d -printf "%p:")${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Get the ptxas directory  
PTXAS_DIR=$(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)")))

# Set PATH to include the directory containing ptxas
export PATH=$(find ${PTXAS_DIR}/*/bin/ -type d -printf "%p:")${PATH:+:${PATH}}

Edit ./etc/conda/deactivate.d/env_vars.sh as follows:

#!/bin/sh

# Restore original LD_LIBRARY_PATH
export LD_LIBRARY_PATH="${ORIGINAL_LD_LIBRARY_PATH}"

# Unset environment variables
unset CUDNN_DIR
unset PTXAS_DIR
  5. Verify the GPU setup:
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

I have submitted the respective pull request to update the official TensorFlow installation guide and is currently pending review.

Additionally, as I was informed the next version of TensorFlow will hopefully arrive within the next days!

I hope it helps!

@rednag
Copy link

rednag commented Jun 10, 2024

Thank you for the fast reply. At the moment I'm using the old functions from the keras.src.utils and tf packages, but I'm looking forward to the new release.

@sgkouzias
Copy link

sgkouzias commented Jun 10, 2024

Thank you for the fast reply. At the moment I'm using the old functions from the keras.src.utils and tf packages, but I'm looking forward to the new release.

@rednag great. Another option to consider for fast model training with Keras 3 and GPU acceleration is to use JAX as the Keras backend.
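For reference, Keras 3 selects its backend from an environment variable read at import time; a minimal sketch, assuming JAX itself is already installed (e.g. via pip install "jax[cuda12]" for NVIDIA GPUs):

```shell
# Must be set before the Python process imports keras
export KERAS_BACKEND=jax
python3 -c "import keras; print(keras.backend.backend())"  # should print: jax (when JAX is installed)
```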

@eabase
Copy link

eabase commented Jun 14, 2024

Can someone explain why TF >2.10 cannot be run with a GPU on native Windows?
This makes no sense whatsoever: WSL, Conda, and other Python packages such as Torch all work with the GPU. So what is going on?

I.e. What is the problem and why is it not being addressed by the community?

@sh-shahrokhi
Copy link

sh-shahrokhi commented Jun 14, 2024

Can someone care to explain why TF >2.10 cannot be run with GPU in native windows? This totally makes no sense whatsoever, as all other HW, WSL, and Conda works with GPU. Including other python packages, such as Torch. So what is going on?

I.e. What is the problem and why is it not being addressed by the community?

Google removed the native Windows CUDA build starting with TF 2.11.
There is nothing you can do about it; building from source with CUDA will also fail on Windows.

@mihaimaruseac
Copy link
Collaborator

Everyone that cared about full support of TF is no longer in the team. See above comments for more details and differences

@eabase
Copy link

eabase commented Jun 15, 2024

@sh-shahrokhi

Google removed the native windows cuda build starting TF 2.11

Unfortunately that doesn't say anything. I don't see how you can "remove" any of that, apart from breaking the build scripts. Whatever you "remove" must still be present for all the other *nix builds. WSL is not that different from MSYS or MinGW, which these days aren't far from VS C/C++ builds.

@sh-shahrokhi
Copy link

sh-shahrokhi commented Jun 15, 2024

@sh-shahrokhi

Google removed the native windows cuda build starting TF 2.11

Unfortunately that doesn't say anything. I don't see how you can "remove" any of that, apart from breaking the build scripts. Whatever you "remove" must still be present for all other nix builds. WSL is not that different from MSYS, MinGW, which (no longer) is too far from VS C/C++ builds.

#58629
Also:
#59918

@ben-jy
Copy link

ben-jy commented Jun 17, 2024

[quoted exchange and @sgkouzias's conda environment setup instructions, repeated from the comments above]

Doesn't work for me :/ I even completely reinstalled WSL, but I still get an empty list when showing the available devices... Should CUDA be uninstalled on the Windows side? When I run "nvidia-smi", it says I have CUDA Version 12.5, even though I didn't install anything in WSL... Is that normal?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+

@sgkouzias
Copy link

[quoted exchange and @sgkouzias's conda environment setup instructions, repeated from the comments above]

Doesn't work for me :/ I even reinstalled completely WSL, but I still get an empty list when showing the available devices... Should CUDA be unistalled on Windows side ? When I use "nvidia-smi", it is written that I have the 12.5 Cuda Version, even if I didn't install anything on WSL... Is that normal ?

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.52.01 Driver Version: 555.99 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+

@ben-jy frankly I have no clue. Did you check the official documentation? Does your setup meet the technical requirements? What is the Python version in WSL2? Is it compatible with TensorFlow 2.16.1? What is the name of your NVIDIA GPU? The output of the nvidia-smi command in WSL2 looks normal, since your GPU driver is installed on Windows. However, you could try reinstalling everything (compatible GPU driver, afterwards WSL2, and then TensorFlow)...

@sh-shahrokhi
Copy link

sh-shahrokhi commented Jun 17, 2024 via email

@ben-jy
Copy link

ben-jy commented Jun 18, 2024

@sgkouzias I checked the official documentation, but I find it not very clear and a bit contradictory: the software requirements state that CUDA and cuDNN should be installed on the machine, but the pip package should install them automatically with TensorFlow, right? Besides, this Medium tutorial explains that CUDA should be installed neither on the Windows side nor on the WSL side, but via the pip package. Maybe I should try to uninstall everything CUDA-related on Windows...
Concerning your other questions:

  1. I have an RTX 3070 Ti, which is in the list of CUDA-enabled products.
  2. I use conda and I tried the install with Python 3.10 and 3.11, which are in the software requirements of the official documentation. Those versions are listed as compatible with TensorFlow 2.16.1, according to the PyPI package tags.

I will try a clean reinstall of my GPU driver, as well as uninstalling CUDA on the Windows side. If that doesn't work, I think it is better to install CUDA and cuDNN manually, along with an older TensorFlow version. It is still a shame that the official documentation of such a large and important library is so unclear.
