Closed as not planned
Labels
TF 2.18; comp:gpu (GPU related issues); stat:awaiting tensorflower (Status - Awaiting response from tensorflower); type:support (Support issues)
Description
Issue type
Support
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
2.18
Custom code
Yes
OS platform and distribution
No response
Mobile device
No response
Python version
No response
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
No response
GPU model and memory
No response
Current behavior?
When updating from tensorflow:2.17-gpu-jupyter to tensorflow:2.18-gpu-jupyter, we expect GPU support. As per the 2.18 update, local drivers are not supported and an install of Hermetic CUDA is needed, so we would need to install tensorflow[and-cuda] again via the requirements.txt file.
As users of the tensorflow:2.18-gpu-jupyter image, having read "Optional Features" at https://hub.docker.com/r/tensorflow/tensorflow, we expect either GPU support or the existence of a separate tag for [and-cuda].
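A quick way to see whether the CUDA user-space libraries were installed through pip is to list the installed nvidia-* distributions. The following is a minimal sketch, assuming that tensorflow[and-cuda] pulls in its CUDA libraries as pip packages whose names start with nvidia-; the helper name installed_nvidia_packages is hypothetical.

```python
from importlib import metadata


def installed_nvidia_packages(prefix="nvidia-"):
    """Return {name: version} for installed distributions whose name starts with prefix."""
    found = {}
    for dist in metadata.distributions():
        # Assumption: CUDA wheels are published under names beginning with "nvidia-"
        name = dist.metadata.get("Name")
        if name and name.lower().startswith(prefix):
            found[name] = dist.version
    return found


if __name__ == "__main__":
    packages = installed_nvidia_packages()
    if not packages:
        print("No nvidia-* pip packages found; the CUDA wheels are probably not installed.")
    for name, version in sorted(packages.items()):
        print(f"{name}=={version}")
```

An empty result inside the 2.18-gpu-jupyter container would be consistent with the image shipping TensorFlow without the [and-cuda] extra, though it does not prove it.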
Standalone code to reproduce the issue
The following is for Run:AI with Kubernetes:
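# Job name is unique per CI pipeline run
export job_name="acceptance-test-${CI_PIPELINE_ID}"
# Download the Run:AI CLI (URL elided)
curl -Lsk -o /usr/local/bin/runai <URL>
chmod +x /usr/local/bin/runai
source runai_login
runai config project "$runai_project"
# Submit a 1-GPU job that prints how many GPUs TensorFlow can see
runai submit "$job_name" -i "$image:$build_number" -g 1 -- python3 -c 'import tensorflow as tf; print(len(tf.config.list_physical_devices("GPU")))'
# Poll every 10 seconds until the job reports Succeeded
while [[ $(runai describe job "$job_name" | grep "Status:" | awk '{print $2}') != "Succeeded" ]]; do sleep 10; echo Waiting for pod status to be completed...; done
# pod_name is not set in this snippet and must be defined beforehand
kubectl logs "$pod_name" -n "runai-${runai_project}"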
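The polling loop in the script never exits if the job ends in any state other than Succeeded. Below is a hedged Python sketch of the same wait with a timeout and a failure check. It shells out to the same runai describe job command used in the script; the function names, the failure state names ("Failed", "Error"), and the timeout values are assumptions for illustration, not documented Run:AI behavior.

```python
import subprocess
import time


def read_job_status(job_name):
    """Parse the 'Status:' line from `runai describe job`, as the shell loop does."""
    out = subprocess.run(
        ["runai", "describe", "job", job_name],
        capture_output=True, text=True, check=False,
    ).stdout
    for line in out.splitlines():
        if "Status:" in line:
            fields = line.split()
            # Mirror awk '{print $2}': take the second whitespace-separated field
            if len(fields) >= 2:
                return fields[1]
    return None


def wait_for_job(job_name, get_status=read_job_status, timeout_s=1800,
                 interval_s=10, failed_states=("Failed", "Error"),
                 sleep=time.sleep):
    """Return True once the job is Succeeded; raise on a failure state or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_name)
        if status == "Succeeded":
            return True
        if status in failed_states:  # assumed terminal failure states
            raise RuntimeError(f"job {job_name} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_name} not finished after {timeout_s}s")
        print("Waiting for pod status to be completed...")
        sleep(interval_s)
```

Passing get_status and sleep as parameters keeps the loop testable without a cluster; in CI, wait_for_job(job_name) with the defaults would behave like the shell loop but fail fast instead of hanging.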
Relevant log output
root@3bdc05a33062:/tf# python3 -c 'import tensorflow as tf; print(len(tf.config.list_physical_devices("GPU")))'
2024-11-29 13:25:40.471862: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1732886740.493342 11 cuda_dnn.cc:8498] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1732886740.499908 11 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-29 13:25:40.523155: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
W0000 00:00:1732886743.034830 11 gpu_device.cc:2342] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
0
root@3bdc05a33062:/tf# nvidia-smi
Fri Nov 29 13:31:56 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01 Driver Version: 535.216.01 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB Off | 00000000:07:00.0 Off | 0 |
| N/A 28C P0 65W / 400W | 3MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
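The "Cannot dlopen some GPU libraries" warning, combined with a working nvidia-smi, suggests the driver is visible in the container while some CUDA user-space libraries are not. A minimal sketch for checking which libraries the dynamic loader can find follows. The sonames listed are assumptions for a CUDA 12 / cuDNN 9 stack and may not match exactly what TensorFlow 2.18 loads; the helper name check_loadable is hypothetical.

```python
import ctypes

# Assumed sonames for a CUDA 12 / cuDNN 9 stack; adjust to the build in use.
CANDIDATE_LIBS = [
    "libcuda.so.1",      # driver library, normally provided by the host driver
    "libcudart.so.12",
    "libcublas.so.12",
    "libcufft.so.11",
    "libcudnn.so.9",
]


def check_loadable(names=CANDIDATE_LIBS):
    """Return {soname: bool} indicating whether the loader can open each library."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the library is not found on the loader path
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in check_loadable().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

Libraries installed through pip wheels live under site-packages and may not be on the default loader path, so a MISSING result here is a hint rather than proof that TensorFlow cannot load them.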