GPU image 'tensorflow:2.18-gpu-jupyter' does not come with GPU support (missing CUDA libraries) #81344

@SandraAnder

Description

Issue type

Support

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

2.18

Custom code

Yes

OS platform and distribution

No response

Mobile device

No response

Python version

No response

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

After updating from tensorflow:2.17-gpu-jupyter to tensorflow:2.18-gpu-jupyter, we expected GPU support to keep working. As per the 2.18 update, local drivers are not supported and an install of Hermetic CUDA is needed. To get GPU support we would need to install tensorflow[and-cuda] again via the requirements.txt file.

As users of tensorflow:2.18-gpu-jupyter who have read "Optional Features" at https://hub.docker.com/r/tensorflow/tensorflow, we expect either GPU support out of the box or the existence of a separate tag for [and-cuda].

Standalone code to reproduce the issue

The following is for Run:AI with Kubernetes:

# Submit a one-GPU job that prints the number of GPUs TensorFlow can see.
export job_name="acceptance-test-${CI_PIPELINE_ID}"
curl -Lsk -o /usr/local/bin/runai <URL>
chmod +x /usr/local/bin/runai
source runai_login
runai config project "$runai_project"
runai submit "$job_name" -i "$image:$build_number" -g 1 -- python3 -c 'import tensorflow as tf; print(len(tf.config.list_physical_devices("GPU")))'
# Poll until the job reports "Succeeded" (this loops forever if the job fails).
while [[ $(runai describe job "$job_name" | grep "Status:" | awk '{print $2}') != "Succeeded" ]]; do sleep 10; echo "Waiting for pod status to be completed..."; done
# pod_name is assumed to be set elsewhere to the pod created for the job.
kubectl logs "$pod_name" -n "runai-${runai_project}"
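
The polling loop in the script above never exits if the job ends in any state other than "Succeeded". A minimal sketch of a bounded wait, in Python, might look like the following. The helper names `parse_status` and `wait_for_job` are hypothetical, and the failure status names "Failed" and "Error" are assumptions, not confirmed Run:AI values.

```python
import time


def parse_status(describe_output):
    """Return the second field of the first line containing 'Status:'.

    Roughly mirrors `grep "Status:" | awk '{print $2}'` for a single match.
    """
    for line in describe_output.splitlines():
        if "Status:" in line:
            fields = line.split()
            if len(fields) >= 2:
                return fields[1]
    return None


def wait_for_job(get_status, timeout_s=600.0, interval_s=10.0,
                 failed_states=("Failed", "Error"),  # assumed status names
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'Succeeded'.

    Raises RuntimeError on an assumed failure state and TimeoutError once
    timeout_s has elapsed, instead of looping forever.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "Succeeded":
            return status
        if status in failed_states:
            raise RuntimeError(f"job ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

In a pipeline, `get_status` could be a small function that runs `runai describe job` through `subprocess.run(..., capture_output=True, text=True)` and passes the captured stdout to `parse_status`.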

Relevant log output

root@3bdc05a33062:/tf# python3 -c 'import tensorflow as tf; print(len(tf.config.list_physical_devices("GPU")))'
2024-11-29 13:25:40.471862: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1732886740.493342      11 cuda_dnn.cc:8498] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1732886740.499908      11 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-29 13:25:40.523155: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1732886743.034830      11 gpu_device.cc:2342] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
0

root@3bdc05a33062:/tf# nvidia-smi
Fri Nov 29 13:31:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:07:00.0 Off |                    0 |
| N/A   28C    P0              65W / 400W |      3MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
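
The "Cannot dlopen some GPU libraries" warning in the log above does not name the libraries that failed to load. A rough way to narrow it down from inside the container is to try opening the usual CUDA shared objects directly. This is only a sketch: the helper `find_missing_libraries` is hypothetical, and the sonames in `CANDIDATE_LIBRARIES` are assumptions for a CUDA 12.x / cuDNN 9 stack, not a list taken from TensorFlow itself.

```python
import ctypes

# Assumed sonames for a CUDA 12.x / cuDNN 9 stack; adjust them to the
# versions your TensorFlow build actually links against.
CANDIDATE_LIBRARIES = [
    "libcuda.so.1",       # driver library, normally provided by the host
    "libcudart.so.12",
    "libcublas.so.12",
    "libcufft.so.11",
    "libcusparse.so.12",
    "libcudnn.so.9",
]


def find_missing_libraries(names, loader=ctypes.CDLL):
    """Return the names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_libraries(CANDIDATE_LIBRARIES)
    if missing:
        print("Could not dlopen:", ", ".join(missing))
    else:
        print("All candidate libraries loaded.")
```

A "missing" result here is a hint rather than proof: libraries installed as pip wheels under site-packages may not be on the default loader search path even though TensorFlow can find them. In this report, `nvidia-smi` works, so the driver side is present and the gap is most likely in the CUDA user-space libraries.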
