
cuDNN, cuFFT, and cuBLAS Errors #62075

Open
Ke293-x2Ek-Qe-7-aE-B opened this issue Oct 9, 2023 · 124 comments
Assignees
Labels
comp:gpu (GPU related issues) · stat:awaiting tensorflower (Status - Awaiting response from tensorflower) · TF2.14 (For issues related to Tensorflow 2.14.x) · type:build/install (Build and install issues)

Comments

@Ke293-x2Ek-Qe-7-aE-B

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

GIT_VERSION:v2.14.0-rc1-21-g4dacf3f368e VERSION:2.14.0

Custom code

No

OS platform and distribution

WSL2 Linux Ubuntu 22

Mobile device

No response

Python version

3.10, but I can try different versions

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

CUDA version: 11.8, cuDNN version: 8.7

GPU model and memory

NVIDIA Geforce GTX 1660 Ti, 8GB Memory

Current behavior?

When I run the GPU test from the TensorFlow install instructions, I get several errors and warnings.
I don't care about the NUMA stuff, but the first 3 errors are that TensorFlow was not able to load cuDNN. I would really like to be able to use it to speed up training some RNNs and FFNNs. I do get my GPU in the list of physical devices, so I can still train, but not as fast as with cuDNN.

Standalone code to reproduce the issue

python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

Relevant log output

2023-10-09 13:36:23.355516: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-09 13:36:23.355674: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-09 13:36:23.355933: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-09 13:36:23.413225: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-09 13:36:25.872586: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-09 13:36:25.916952: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-09 13:36:25.917025: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
@SuryanarayanaY
Collaborator

Hi @Ke293-x2Ek-Qe-7-aE-B ,

Starting from TF 2.14, TensorFlow provides a CUDA package that installs the cuDNN, cuFFT, and cuBLAS libraries for you.

You can use the pip install tensorflow[and-cuda] command for that.

Please try this command and let us know if it helps. Thank you!
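A practical note for anyone whose shell rejects the command above: the brackets in the extras syntax are glob characters in zsh and some other shells, so the package spec usually needs quoting (a later comment in this thread hit exactly this). A minimal sketch:

```shell
# The brackets are glob characters in zsh and some other shells; unquoted,
# zsh reports "no matches found: tensorflow[and-cuda]". Quoting avoids this.
pip install "tensorflow[and-cuda]"

# Then verify the GPU is visible:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```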

@SuryanarayanaY SuryanarayanaY added TF2.14 For issues related to Tensorflow 2.14.x stat:awaiting response Status - Awaiting response from author labels Oct 10, 2023
@Ke293-x2Ek-Qe-7-aE-B
Author

Ke293-x2Ek-Qe-7-aE-B commented Oct 10, 2023

@SuryanarayanaY I did not know that it now came bundled with cuDNN. I did install tensorflow with the [and-cuda] part, but I also installed the CUDA toolkit and cuDNN separately. I will try just installing the CUDA toolkit and then installing tensorflow[and-cuda].
Also, is there a way to install TensorFlow for GPU without it coming with cuDNN? If I just pip install tensorflow, will that install with GPU support, just without cuDNN, so that I can install it manually? I don't really need to, but I am curious whether it can be installed that way too.

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Oct 10, 2023
@Ke293-x2Ek-Qe-7-aE-B
Author

@SuryanarayanaY I tried several times, reinstalling Ubuntu, but it still doesn't work.

@AthiemoneZero

I also have the same issue, and it does not seem to be caused by the CUDA environment, as I rebuilt CUDA and cuDNN to match tf-2.14.0.

This is the log output I see:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2023-10-11 18:21:57.387396: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-10-11 18:21:57.415774: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-11 18:21:57.415847: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-11 18:21:57.415877: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-11 18:21:57.421400: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 18:21:58.155058: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-11 18:21:59.113217: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 18:21:59.152044: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 18:21:59.152153: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

@Ke293-x2Ek-Qe-7-aE-B
Author

@AthiemoneZero Because it still does output a GPU device at the bottom of the log, I am training on GPU, just without cuDNN. It will be slower, but it is better than nothing or training on CPU.

@AthiemoneZero

AthiemoneZero commented Oct 11, 2023

@AthiemoneZero Because it still does output a GPU device at the bottom of the log, I am training on GPU, just without cuDNN. It will be slower, but it is better than nothing or training on CPU.

Yeah. But I just found that when I downgrade to 2.13.0 version, errors in register won't appear again. It looks like this:

(TF) ephys3@ZhouLab-Ephy3:~$ python3 -c "import tensorrt as trt;import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2023-10-11 20:39:12.097457: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-11 20:39:12.130250: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 20:39:13.856721: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 20:39:13.870767: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 20:39:13.870941: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Although I haven't figured out how to solve the NUMA node error, I found some clues in another issue (I ran all of the above in WSL Ubuntu). This bug does not seem to be significant, according to an explanation on the NVIDIA forums. So I guess the registration errors might have something to do with the latest version, and the NUMA errors might be caused by the OS environment. I hope this information helps.

@Ke293-x2Ek-Qe-7-aE-B
Author

@AthiemoneZero I tried downgrading as well, but it didn't work for me. The NUMA errors are (as stated in the error message) because the kernel provided by Microsoft for WSL2 is not built with NUMA support. I tried cloning the repo (here) and building my own kernel with NUMA support, but that didn't work, so I am just ignoring those errors for now.

@AthiemoneZero

AthiemoneZero commented Oct 11, 2023

@Ke293-x2Ek-Qe-7-aE-B I rebuilt everything in an independent conda environment for TF. My steps were to create a TF env with Python 3.9.8 and run python3 -m pip install tensorflow[and-cuda] --user according to the instructions. After that, I tried python3 -m pip install tensorflow[and-cuda]==2.13.0 --user and found it solved some of the bugs.

@Ke293-x2Ek-Qe-7-aE-B
Author

@AthiemoneZero Thanks for the instructions. I'll try them and see if they work on my system. I have been using Python 3.10, so maybe that's why it didn't work. Did you have to install the CUDA toolkit?

@AthiemoneZero

AthiemoneZero commented Oct 11, 2023

@Ke293-x2Ek-Qe-7-aE-B I didn't run conda install cuda-toolkit here. I guess the [and-cuda] extra installed some of the dependencies for me.

@AthiemoneZero

But I did double-check the versions of CUDA and cuDNN. For that, I even downgraded them again and again.

@Ke293-x2Ek-Qe-7-aE-B
Author

@AthiemoneZero Usually, I would install the CUDA toolkit according to these instructions (here), then install cuDNN according to these instructions (here). I installed CUDA toolkit 11.8 and cuDNN 8.7, because they are the latest versions supported by TensorFlow according to their support table here. I guess using [and-cuda] installs all of that for you.
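Since version mismatches keep coming up in this thread, it can help to confirm which cuDNN is actually installed before blaming TF. A minimal stdlib-only sketch that parses the version defines out of cuDNN's header; the path /usr/include/cudnn_version.h is an assumption and varies by install method (cuDNN 8 moved these defines out of cudnn.h):

```python
import re
from pathlib import Path

def cudnn_version_from_header(text):
    """Extract (major, minor, patchlevel) from cudnn_version.h contents."""
    vals = {}
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(rf"#define\s+{key}\s+(\d+)", text)
        vals[key] = int(m.group(1)) if m else None
    return (vals["CUDNN_MAJOR"], vals["CUDNN_MINOR"], vals["CUDNN_PATCHLEVEL"])

# Example location; adjust for conda or pip-wheel installs.
header = Path("/usr/include/cudnn_version.h")
if header.exists():
    print(cudnn_version_from_header(header.read_text()))
```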

@AthiemoneZero

AthiemoneZero commented Oct 11, 2023

@Ke293-x2Ek-Qe-7-aE-B Apologies for my misunderstanding. I installed the CUDA toolkit the same way you described above before I went on to debug tf_gpu. I made sure my GPU and CUDA could perform well, as I had already run another CUDA task smoothly without TF. My concern is that some of TF's dependencies have to be pre-installed in a conda env, and this might be handled by [and-cuda] (my naive guess).

@Ke293-x2Ek-Qe-7-aE-B
Author

@AthiemoneZero I always install CUDA toolkit and cuDNN globally for the whole system, and then install TensorFlow in a miniconda environment. This doesn't work anymore with the newest versions of TensorFlow, so I'll try your instructions. It does make sense to install everything in a conda env, I just hadn't thought of that since my other method had worked in the past. Thanks for sharing what you did to make it work.

@AthiemoneZero

AthiemoneZero commented Oct 11, 2023

@Ke293-x2Ek-Qe-7-aE-B You're welcome. BTW, I also followed the instructions to configure the development environment, including suitable versions of Bazel and clang-16, before all my digging into the conda env.

@Ke293-x2Ek-Qe-7-aE-B
Author

Ke293-x2Ek-Qe-7-aE-B commented Oct 11, 2023

@AthiemoneZero Thanks, but it didn't work.

@FaisalAlj

Hello,

I'm experiencing the same issue, even though I meticulously followed all the instructions for setting up CUDA 11.8 and CuDNN 8.7. The error messages I'm encountering are as follows:

Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.
Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered.
Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered.

I've tried this with different versions of Python. Surprisingly, when I used Python 3.11, TensorFlow 2.13 was installed without these errors. However, when I used Python 3.10 or 3.9, I ended up with TensorFlow 2.14 and the aforementioned errors.

I've come across information suggesting that I may not need to manually install CUDA and CuDNN, as [and-cuda] should handle the installation of these components automatically.

Could someone please guide me on the correct approach to resolve this issue? I've tried various methods, but unfortunately, none of them have yielded a working solution.

P.S. I'm using conda in WSL 2 on Windows 11.

@nkinnaird

nkinnaird commented Oct 17, 2023

I am having the same issue as FaisalAlj above, on Windows 10 with the same versions of CUDA and CuDNN. The package tensorflow[and-cuda] is not found by pip. I've tried different versions of python and tensorflow without success. In my case I'm using virtualenv rather than conda.

Edit 1:
I appear to be able to install tensorflow[and-cuda] as long as I use quotes around the package, like: pip install "tensorflow[and-cuda]".

Edit 2:
I still appear to be getting these messages however, so I'm not sure I've installed things correctly.

@SuryanarayanaY
Collaborator

Hi @Ke293-x2Ek-Qe-7-aE-B ,

I have checked the installation on Colab (Linux environment) and observed the same logs, as per the attached gist.

These logs seem to be generated by the XLA compiler, but the GPU is detectable. A similar issue, #62002, has already been brought to the engineering team's attention.

CC: @learning-to-play

@SuryanarayanaY SuryanarayanaY added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 18, 2023
@qnlzgl

qnlzgl commented Feb 15, 2024

Same error here. I tried CUDA 12 and CUDA 11.8 (WSL2 on Ubuntu). Both have this issue.

@NOORLEICESTER

NOORLEICESTER commented Feb 15, 2024

Thank you for your comment @qnlzgl . I have attempted to fix the issue in various ways, but none have proven successful for me.

@qnlzgl

qnlzgl commented Feb 15, 2024

Thank you for your comment @qnlzgl . I have attempted to fix the issue in various ways, but none have proven successful for me.

I feel it's okay to leave the errors as they are. I get this error while importing TensorFlow, but I can still use GPUs as normal.
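For anyone taking this route and treating the messages as cosmetic, TensorFlow's C++ log level can be turned down via the TF_CPP_MIN_LOG_LEVEL environment variable, which must be set before the first import. Whether it hides these particular E-level registration lines may depend on the build, so treat this as a noise-reduction sketch, not a fix:

```python
import os

# TF_CPP_MIN_LOG_LEVEL controls TensorFlow's C++ logging:
# 0 = all messages, 1 = drop INFO, 2 = drop INFO+WARNING, 3 = drop INFO+WARNING+ERROR.
# It must be set before the first `import tensorflow` to take effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

# import tensorflow as tf  # imported afterwards, with C++ log noise suppressed
```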

@NOORLEICESTER

But I am getting an error in Keras because of the mentioned error, @qnlzgl

@SomeUserName1

SomeUserName1 commented Feb 19, 2024

Occurs with tensorflow[and-cuda]==2.15.0.post0
Tried building from source and installing via pip & conda.

$ python
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-02-19 12:55:16.996299: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-19 12:55:17.017067: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 12:55:17.017083: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 12:55:17.017652: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 12:55:17.020872: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print(tf.config.list_physical_devices('GPU'))
2024-02-19 12:56:09.845769: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-19 12:56:09.870239: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-19 12:56:09.870422: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
>>> 

On tf-nightly[and-cuda] this doesn't occur anymore.

$ python
Python 3.11.7 (main, Jan 29 2024, 16:03:57) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.platform import build_info as tf_build_info
2024-02-19 13:04:56.451101: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-19 13:04:56.474818: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print("cudnn_version",tf_build_info.build_info['cudnn_version'])
cudnn_version 8
>>> 
>>> print("cuda_version",tf_build_info.build_info['cuda_version'])
cuda_version 12.3
>>> print(tf.config.list_physical_devices('GPU'))
2024-02-19 13:13:48.002965: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-19 13:13:48.006970: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-02-19 13:13:48.007068: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

The NUMA message persists, even on Linux with NUMA configured.

System information:
(Arch Linux + Zen kernel patches + KDE Plasma)

$ uname -a
Linux somelegion 6.7.4-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Mon, 05 Feb 2024 22:07:37 +0000 x86_64 GNU/Linux
$ nvidia-smi
Mon Feb 19 13:09:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090 ...    On  | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P5              18W /  95W |   1414MiB / 16376MiB |     18%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
$ zgrep CONFIG_NUMA /proc/config.gz                                                                                                                                                                                             
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NUMA_KEEP_MEMINFO=y
sudo lspci -vvv -s 01:00.0 
01:00.0 VGA compatible controller: NVIDIA Corporation GN21-X11 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Lenovo GN21-X11
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 132
	IOMMU group: 16
	Region 0: Memory at 85000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at 4000000000 (64-bit, prefetchable) [size=16G]
	Region 3: Memory at 4400000000 (64-bit, prefetchable) [size=32M]
	Region 5: I/O ports at 6000 [size=128]
	Expansion ROM at 86080000 [virtual] [disabled] [size=512K]
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee003d8  Data: 0000
	Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM L1, Exit Latency L1 <16us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x16
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR+ 10BitTagReq- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b4] Vendor Specific Information: Len=14 <?>
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 34326183936ns
		Max no snoop latency: 34326183936ns
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=281600ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 16GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, **nvidia**

This might come from the NVIDIA side, if one believes the linked doc:

What: /sys/bus/pci/devices/.../numa_node
Date: Oct 2014
Contact: Prarit Bhargava prarit@redhat.com
Description:
This file contains the NUMA node to which the PCI device is
attached, or -1 if the node is unknown. The initial value
comes from an ACPI _PXM method or a similar firmware
source. If that is missing or incorrect, this file can be
written to override the node. In that case, please report
a firmware bug to the system vendor. Writing to this file
taints the kernel with TAINT_FIRMWARE_WORKAROUND, which
reduces the supportability of your system.
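The sysfs semantics quoted above are easy to probe with stdlib Python. Here read_numa_node is a hypothetical helper: it returns the node number, -1 for "unknown to the kernel/firmware" (which is what makes TensorFlow log "returning NUMA node zero"), or None when the file is missing entirely, as on WSL2 kernels built without NUMA support:

```python
from pathlib import Path

def read_numa_node(sysfs_path):
    """Return the NUMA node for a PCI device, or None if the file is unreadable.

    A value of -1 means the node is unknown (missing/incorrect ACPI _PXM),
    per the sysfs-bus-pci documentation quoted above.
    """
    try:
        return int(Path(sysfs_path).read_text().strip())
    except (OSError, ValueError):
        return None

# Example PCI address taken from the logs in this thread.
node = read_numa_node("/sys/bus/pci/devices/0000:01:00.0/numa_node")
print(node)
```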

@Nephalen

Hi. I'm having the exact same three errors after updating TF 2.11 to 2.14. In my case, tensorflow-gpu is installed from the conda-forge channel, and the CUDA library is installed through cudatoolkit-dev from conda-forge as well. TF 2.11 didn't show such errors.

It does not prevent GPU usage, but I observed a 10%-15% slowdown in model prediction speed in TF 2.14 compared to the exact same code in TF 2.11.

Could this be related to those errors? I'm kind of getting mixed messages from the overall discussion.

@ManzarIMalik

Same error on Ubuntu 22.04 LTS installed in WSL2 / Windows 11. Has anyone found a solution to this?

@SomeUserName1

SomeUserName1 commented Feb 21, 2024

pip uninstall tensorflow && pip install tf-nightly[and-cuda]

@ManzarIMalik

pip uninstall tensorflow && pip install tf-nightly[and-cuda]

This is not working either.

@Radiated-Coder

Hi Guys,

Seems like I'm late to the party. I am running TF on WSL2. Guess what: I have completely messed up my setup, which was running great on a TF 2.10 configuration.
I upgraded everything to a stable configuration as recommended by NVIDIA (see below), only to find out Microsoft killed NUMA.

TensorFlow : 2.15
Cuda kit : 12.3
CuDNN : 8.9
VS : Comm 2022
Python : 3.10
numactl : 2.0.14-3Ubuntu2 (force installed but no use)
UBUNTU : 24
CPU : Ryzen 7 5600
Mem : 32 Gigs
GPU : GeF RTX 4070 16 Gigs

Any help would be appreciated.
TY. I too will update here if I find a fix.

@NOORLEICESTER

@SomeUserName1
Unfortunately, I got the error below when I tried to run pip install tf-nightly[and-cuda]:
ERROR: Could not find a version that satisfies the requirement tf-nightly[and-cuda] (from versions: none)
ERROR: No matching distribution found for tf-nightly[and-cuda]

@SomeUserName1

Try quoting the package name

pip install 'tf-nightly[and-cuda]'

@NOORLEICESTER

Can anyone please comment on how to figure out this error?
ImportError: cannot import name 'cast' from partially initialized module 'keras.src.backend' (most likely due to a circular import)

@SomePersonSomeWhereInTheWorld

Python 3.9 same issue:

Python 3.9.18 (main, Jan  4 2024, 00:00:00) 
[GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-02-26 13:24:06.607339: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-02-26 13:24:06.609051: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 13:24:06.645912: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-26 13:24:06.646284: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2024-02-26 13:24:09.084670: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Num GPUs Available:  0
pip show tensorflow
Name: tensorflow
Version: 2.13.0
pip show tf-nightly
Name: tf-nightly
Version: 2.16.0
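A quick host-side check (a sketch; `nvidia-smi` ships with the NVIDIA driver, not with any pip package): when the log says "Could not find cuda drivers on your machine", first confirm the driver itself is installed, since no TensorFlow version can supply it:

```shell
# If the NVIDIA kernel driver is installed, nvidia-smi exists and lists the
# GPU; otherwise TensorFlow has nothing to talk to, regardless of version.
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_present=yes
    nvidia-smi --query-gpu=name,driver_version --format=csv
else
    driver_present=no
    echo "nvidia-smi not found: install the NVIDIA driver first"
fi
```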

@SomeUserName1

@NOORLEICESTER without further code it's difficult to say what's going on. I'd guess from the error that you actually have a circular import, so try to check your imports.

@SomePersonSomeWhereInTheWorld
Uninstall all TF versions except nightly to avoid version mismatches and the like.
Uninstall keras as well, since tf-nightly also installs keras-nightly.
You may want to remove the previous env and start with a clean slate.

Could not find cuda drivers on your machine, GPU will not be used.

Did you install the NVIDIA driver (via apt/rpm/pacman) and CUDA on your system?

@NOORLEICESTER

@ManzarIMalik that didn't work either.

@NOORLEICESTER

@SomeUserName1 that didn't work either.

@arvindbs2014

@AthiemoneZero Because it still does output a GPU device at the bottom of the log, I am training on GPU, just without cuDNN. It will be slower, but it is better than nothing or training on CPU.

Yeah. But I just found that when I downgrade to version 2.13.0, the registration errors no longer appear. It looks like this:

(TF) ephys3@ZhouLab-Ephy3:~$ python3 -c "import tensorrt as trt;import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

2023-10-11 20:39:12.097457: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-11 20:39:12.130250: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-11 20:39:13.856721: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 20:39:13.870767: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-11 20:39:13.870941: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] could not open file to read NUMA node: /sys/bus/pci/devices/0000:65:00.0/numa_node
Your kernel may have been built without NUMA support.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Although I haven't figured out how to solve the NUMA node error, I found some clues in another issue (I ran all of the above in WSL Ubuntu). According to an explanation on the NVIDIA forums, this warning does not seem to be significant. So I guess the registration errors may be related to the latest version, while the NUMA messages are caused by the OS environment. Hope this information helps.


The NUMA node problem can be solved this way:

1. Check nodes:

lspci | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation TU106 High Definition Audio Controller (rev a1)

The first line shows the address of the VGA-compatible device, the NVIDIA GeForce, as 01:00.0. The address differs per machine, so substitute your own in the steps below.

2. Check and change the NUMA setting value:

If you list /sys/bus/pci/devices/, you can see the following:

ls /sys/bus/pci/devices/

0000:00:00.0 0000:00:06.0 0000:00:15.0 0000:00:1c.0 0000:00:1f.3 0000:00:1f.6 0000:02:00.0
0000:00:01.0 0000:00:14.0 0000:00:16.0 0000:00:1d.0 0000:00:1f.4 0000:01:00.0
0000:00:02.0 0000:00:14.2 0000:00:17.0 0000:00:1f.0 0000:00:1f.5 0000:01:00.1

The 01:00.0 found above is visible here, with the 0000: domain prefix attached in front.

3. Check whether a NUMA node is assigned:

cat /sys/bus/pci/devices/0000:01:00.0/numa_node

-1

-1 means no NUMA node is assigned; 0 means the device is connected to node 0.

4. Fix it with the command below (the value resets on reboot):

echo 0 | sudo tee -a /sys/bus/pci/devices/0000:01:00.0/numa_node

0
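The per-device steps above can be batched (a hedged sketch; vendor ID 0x10de is NVIDIA's PCI vendor ID, and the written value does not survive a reboot, so rerun it from a startup script if you want it to persist):

```shell
#!/bin/sh
# Set numa_node to 0 for every NVIDIA PCI function (vendor 0x10de).
# DRY_RUN=1 (the default here) only prints what would be written;
# run with DRY_RUN=0 as root to actually apply the change.
DRY_RUN="${DRY_RUN:-1}"
for dev in /sys/bus/pci/devices/*; do
    [ -f "$dev/vendor" ] || continue
    [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
    echo "NVIDIA device: $dev (numa_node=$(cat "$dev/numa_node"))"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write 0 to $dev/numa_node"
    else
        echo 0 > "$dev/numa_node"
    fi
done
```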

@wangkuiyi
Contributor

It's been five months, yet the problem remains.

@maulberto3

It's been five months, yet the problem remains.

You are right, what a shame, I gave up and went to Rust.

Korred added a commit to Korred/unet-pp that referenced this issue Mar 5, 2024
…nd wsl

- "environment-win" holds a working env with Tensorflow 2.15 but only with CPU support (as GPU on bare Windows is not supported anymore)
- "environment-wsl" holds a working env for WSL Ubuntu with Tensorflow 2.13 with GPU support (as installing 2.15 with "tensorflow[and-cuda]" on WSL has issues with registering cuDNN, cuFFT, cuBLAS and the GPU is sometimes not being found - tensorflow/tensorflow#62075)
- a 2.13 trained model can still be used on a Windows machine with Tensorflow 2.15 (only when you save the model as a .h5 file instead of .keras)
@Hato1

Hato1 commented Mar 21, 2024

When 2.15.x didn't work, TensorFlow 2.16.1 (without CUDA) solved this issue for me. Python 3.10, CUDA driver 12.2, CUDA Toolkit 12.1, cuDNN 8.9.5.

@ManzarIMalik

It finally worked with TensorFlow 2.16.1 (upgrade to latest): pip install --upgrade tensorflow

@bojle

bojle commented Mar 22, 2024

Can confirm it works with 2.16.1. For those who have to resort to the 2.9.0 workaround (some of my packages are limited to 2.15), use Python <= 3.10 to install it.

@jeb2112

jeb2112 commented Apr 6, 2024

Can confirm that these error messages probably don't matter. I have an NVIDIA GeForce 1650 and had working TensorFlow (2.15, CUDA 12.2) and PyTorch envs. But the PyTorch env was for some code stuck at Python 3.6, which I could no longer debug in VS Code, so I created a new PyTorch env with CUDA 12.2 and torch 2.2.0. That wouldn't detect the GPU, so I backed off to CUDA 11.8 and torch 2.0.0, but it still wouldn't detect the GPU. The broken PyTorch attempt also broke the working TensorFlow env: it started giving me the messages above and wouldn't detect the GPU at all (i.e. via the list-physical-devices check). On a whim I rebooted, after which both the new PyTorch and TensorFlow envs are back to detecting the GPU. I still see those messages, but only in TensorFlow.

@mikechen66

Hi @Ke293-x2Ek-Qe-7-aE-B ,

Starting from TF 2.14, TensorFlow provides a CUDA extra which installs all the cuDNN, cuFFT, and cuBLAS libraries.

You can use the pip install tensorflow[and-cuda] command for that.

Please try this command and let us know if it helps. Thank you!

I checked the installation. The prerequisite for successfully installing TensorFlow 2.14 ~ 2.16 is that users need to install the NVIDIA Linux driver, CUDA Toolkit, and cuDNN in the base Linux environment and then set the environment variables in .bashrc, not in the base environment of Anaconda/Miniconda.
