cuDNN, cuFFT, and cuBLAS Errors #62075
Comments
Starting from TF 2.14, TensorFlow provides a CUDA package which can install all the cuDNN, cuFFT, and cuBLAS libraries. You can use `pip install tensorflow[and-cuda]`. Please try this command and let us know if it helps. Thank you!
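A minimal sketch of the suggested install, assuming a fresh virtual environment (the env name `tf-env` is illustrative, not from the thread):

```shell
# Create an isolated environment and install TensorFlow with its bundled CUDA libraries.
python3 -m venv tf-env
source tf-env/bin/activate
pip install 'tensorflow[and-cuda]'   # quotes keep the shell from globbing the brackets

# Verify that the GPU is visible (same check as in the issue report below).
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```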
@SuryanarayanaY I did not know that it now comes bundled with cuDNN. I installed tensorflow with the [and-cuda] extra, but I also installed the CUDA toolkit and cuDNN separately. I will try just installing the CUDA toolkit and then installing tensorflow[and-cuda].
@SuryanarayanaY I tried several times, reinstalling Ubuntu, but it still doesn't work.
I also have the same issue, and it seems not to be due to the CUDA environment, as I rebuilt CUDA and cuDNN to suit tf-2.14.0. This is the log output I found:
@AthiemoneZero Because it still outputs a GPU device at the bottom of the log, I am training on the GPU, just without cuDNN. It will be slower, but it is better than nothing, or than training on the CPU.
Yeah. But I just found that when I downgrade to version 2.13.0, the registration errors no longer appear. It looks like this:
Although I haven't figured out how to solve the NUMA node error, I found some clues in another issue (I ran all of the above in WSL Ubuntu). This bug seems not to be significant, according to the explanation on the NVIDIA forums. So I guess the registration errors might have something to do with the latest version, and the NUMA errors might be caused by the OS environment. Hope this information helps.
@AthiemoneZero I tried downgrading as well, but it didn't work for me. The NUMA errors occur (as stated in the error message) because the kernel provided by Microsoft for WSL2 is not built with NUMA support. I tried cloning the repo (here) and building my own kernel from source with NUMA support, but that didn't work, so I am just ignoring those errors for now.
@Ke293-x2Ek-Qe-7-aE-B I rebuilt everything in an independent conda environment for TF. My steps were to create a TF env with
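The conda-based setup described above might look roughly like this; the env name and Python version are assumptions, not the commenter's exact commands:

```shell
# Create a dedicated conda env and install TensorFlow (with bundled CUDA libs) inside it.
conda create -n tf-gpu python=3.10 -y
conda activate tf-gpu
pip install 'tensorflow[and-cuda]'
```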
@AthiemoneZero Thanks for the instructions. I'll try and see if it works on my system. I have been using
@Ke293-x2Ek-Qe-7-aE-B I didn't execute
But I did double-check the versions of CUDA and cuDNN. For this I even downgraded them again and again.
@AthiemoneZero Usually, I would install the CUDA toolkit according to these instructions (here), then install cuDNN according to these instructions (here). I installed CUDA toolkit version 11.8 and cuDNN version 8.7, because they are the latest supported by TensorFlow, according to their support table here. I guess using [and-cuda] installs all of that for you.
@Ke293-x2Ek-Qe-7-aE-B Apologies for my misunderstanding. I installed the CUDA toolkit the same way you described above before I went on to debug tf_gpu. I made sure my GPU and CUDA performed well, as I had run another task smoothly using CUDA but without TF. My concern is that some dependencies of TF have to be pre-installed in a conda env, and this might be handled by [and-cuda] (my naive guess).
@AthiemoneZero I always install CUDA toolkit and cuDNN globally for the whole system, and then install TensorFlow in a miniconda environment. This doesn't work anymore with the newest versions of TensorFlow, so I'll try your instructions. It does make sense to install everything in a conda env, I just hadn't thought of that since my other method had worked in the past. Thanks for sharing what you did to make it work.
@Ke293-x2Ek-Qe-7-aE-B You're welcome. BTW, I also followed the instructions to configure the development environment, including suitable versions of bazel and clang-16, before all my work inside the conda env.
@AthiemoneZero Thanks, but it didn't work.
Hello, I'm experiencing the same issue, even though I meticulously followed all the instructions for setting up CUDA 11.8 and cuDNN 8.7. The error messages I'm encountering are as follows: Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered.

I've tried this with different versions of Python. Surprisingly, when I used Python 3.11, TensorFlow 2.13 was installed without these errors. However, when I used Python 3.10 or 3.9, I ended up with TensorFlow 2.14 and the aforementioned errors.

I've come across information suggesting that I may not need to manually install CUDA and cuDNN, as [and-cuda] should handle the installation of these components automatically. Could someone please guide me on the correct approach to resolving this issue? I've tried various methods, but unfortunately, none of them have yielded a working solution.

P.S. I'm using conda in WSL 2 on Windows 11.
I am having the same issue as FaisalAlj above, on Windows 10 with the same versions of CUDA and CuDNN.
Same error here. I tried CUDA 12 and CUDA 11.8 (WSL2 on Ubuntu). All of them have this issue.
Thank you for your comment @qnlzgl. I have attempted to fix the issue in various ways, but none have proven successful for me.
I feel it's okay to leave the errors as they are. I get this error while importing tensorflow, but can still use GPUs normally.
But I am getting an error in keras because of the mentioned error @qnlzgl
Occurs with tensorflow[and-cuda]==2.15.0.post0
On tf-nightly[and-cuda] this doesn't occur anymore.
The NUMA thing persists, also on Linux with NUMA configured. System information:
Might be from the NVIDIA side if one believes the linked doc:
Hi. I'm having the exact same 3 errors after updating from tf 2.11 to 2.14. In my case, it does not prevent GPU usage, but I observed around a 10%-15% slowdown in model prediction speed in tf 2.14 compared to the exact same code in tf 2.11. Could this be related to those errors? I'm kind of getting mixed messages from the overall discussion.
Same error on an Ubuntu 22.04 LTS install in WSL2 / Windows 11. Has anyone found a solution to this?
This is not working either. |
Hi guys, seems like I'm late to the party. I am running TF on WSL2. Guess what, I have completely messed up my setup, which was running great on a TF 2.10 configuration. TensorFlow: 2.15. Any help would be appreciated.
@SomeUserName1 |
Try quoting the package name: `pip install 'tf-nightly[and-cuda]'`
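Quoting matters because zsh (and bash with certain glob options) treats the square brackets as a pattern; quoting passes the literal extras specifier through to pip. A quick illustration:

```shell
# Quoted: the shell passes the string through unchanged, so pip sees the extras syntax.
pkg='tf-nightly[and-cuda]'
echo "$pkg"   # prints: tf-nightly[and-cuda]
```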
Can anyone comment to help figure out this error?
Python 3.9 same issue:
@NOORLEICESTER without further code it's difficult to say what's going on. I'd guess from the error that you actually have a circular import, so try to check your imports. @SomePersonSomeWhereInTheWorld |
@ManzarIMalik That didn't work either.
@SomeUserName1 That didn't work either.
The NUMA non-zero problem can be solved this way. `lspci` shows the GPU: `01:00.0 VGA compatible controller: NVIDIA Corporation TU106 [GeForce RTX 2060 12GB] (rev a1)`. The PCI devices are `0000:00:00.0`, `0000:00:06.0`, `0000:00:15.0`, `0000:00:1c.0`, `0000:00:1f.3`, `0000:00:1f.6`, `0000:02:00.0`. Reading the GPU's `numa_node` file gives `-1`; -1 means no connection, and 0 means connected, so write `0` to it.
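A sketch of that workaround (the bus id below is an example; find yours with `lspci`, and note the change does not survive a reboot):

```shell
# Example PCI bus id for the GPU; replace with the id from `lspci | grep -i nvidia`.
BUS_ID=0000:01:00.0

# -1 here means the kernel reports no NUMA node for the device.
cat /sys/bus/pci/devices/$BUS_ID/numa_node

# Writing 0 assigns the device to NUMA node 0, silencing the warning.
echo 0 | sudo tee /sys/bus/pci/devices/$BUS_ID/numa_node
```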
It's been five months, yet the problem remains. |
You are right, what a shame, I gave up and went to Rust. |
…nd wsl - "environment-win" holds is a working env with Tensorflow 2.15 but only with CPU support (as GPU on bare Windows is not supported anymore) - "environment-wsl" holds a working env for WSL Ubuntu with Tensorflow 2.13 with GPU support (as installing 2.15 with "tensorflow[and-cuda]" on WSL has issues with registering cuDNN, cuFFT, cuBLAS and the GPU is sometimes not being found - tensorflow/tensorflow#62075) - a 2.13 trained model can still be used on a Windows machine with Tensorflow 2.15 (only when you save the model as a .h5 file instead of .keras)
…nd wsl - "environment-win" holds a working env with Tensorflow 2.15 but only with CPU support (as GPU on bare Windows is not supported anymore) - "environment-wsl" holds a working env for WSL Ubuntu with Tensorflow 2.13 with GPU support (as installing 2.15 with "tensorflow[and-cuda]" on WSL has issues with registering cuDNN, cuFFT, cuBLAS and the GPU is sometimes not being found - tensorflow/tensorflow#62075) - a 2.13 trained model can still be used on a Windows machine with Tensorflow 2.15 (only when you save the model as a .h5 file instead of .keras)
When 2.15.X didn't work, tensorflow 2.16.1 (without CUDA) solved this issue for me. Python 3.10, CUDA driver 12.2, CUDA Toolkit 12.1, cuDNN 8.9.5.
It finally worked with Tensorflow 2.16.1 (upgraded to latest).
Can confirm it works with 2.16.1. For those who have to resort to the 2.9.0 workaround (some of my packages are limited to 2.15), use python <= 3.10 to install
Can confirm that these error messages probably don't matter. I have an NVIDIA GeForce 1650 and I had working tensorflow (2.15, cuda 12.2) and pytorch envs. But the pytorch env was for some code that was stuck at python 3.6, which I could no longer debug in VS Code. So I created a new pytorch env with cuda-12.2 and torch 2.2.0. That wouldn't detect the GPU, so I backed off and went with cuda-11.8 and torch 2.0.0, but it still wouldn't detect the GPU.

The broken pytorch attempt also broke the working tensorflow env, so it started giving me the above messages and wouldn't detect the GPU at all (i.e. using the print physical devices check). On a whim, I rebooted, after which both the new pytorch and tensorflow envs are back to detecting the GPU. I still get those messages, but only in tensorflow.
I checked the installation. The prerequisite for successfully installing TensorFlow 2.14–2.16 is that users need to install the NVIDIA Linux driver, CUDA Toolkit, and cuDNN in the original Linux environment and then set the environment variables in bashrc, not in the base environment of Anaconda/Miniconda.
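The bashrc setup mentioned above might look like this; the install prefix `/usr/local/cuda-11.8` is an assumption, so adjust it to your toolkit version:

```shell
# Append to ~/.bashrc so every shell picks up the system-wide CUDA install.
export PATH=/usr/local/cuda-11.8/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```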
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
GIT_VERSION:v2.14.0-rc1-21-g4dacf3f368e VERSION:2.14.0
Custom code
No
OS platform and distribution
WSL2 Linux Ubuntu 22
Mobile device
No response
Python version
3.10, but I can try different versions
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA version: 11.8, cuDNN version: 8.7
GPU model and memory
NVIDIA Geforce GTX 1660 Ti, 8GB Memory
Current behavior?
When I run the GPU test from the TensorFlow install instructions, I get several errors and warnings.
I don't care about the NUMA stuff, but the first 3 errors are that TensorFlow was not able to load cuDNN. I would really like to be able to use it to speed up training some RNNs and FFNNs. I do get my GPU in the list of physical devices, so I can still train, but not as fast as with cuDNN.
Standalone code to reproduce the issue
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Relevant log output