Docker with GPU 2.3rc0 CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid #41132
Hmm... doesn't have trouble on my machine in the same container. Thanks a bunch for the exact replication commands.
Can you include the output of
Sure, here it is.
Since the TF docs mention that the NVIDIA® CUDA® Toolkit does not need to be installed, I thought the CUDA version on the host machine would not matter.
CUDA Compute Capability is inherent to your graphics card. There were some size-reduction changes to our binaries in 2.3 that adjusted handling of old capabilities (such as 5.0), but I believe TensorFlow 2.3 should still support capabilities as old as 3.5. Can you try to replicate this outside of a Docker container, to see whether it's related to your graphics card vs. the container environment? FYI @chsigg
We removed PTX for all but sm_70 from TF builds in cf1b6b3. We never shipped with kernels for sm_50, only sm_52. Apparently the driver was able to compile the PTX for sm_52 to sm_50, even though that's not officially supported. If you want to run on an sm_50 card, it would be best to build TF from source.
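For anyone on an sm_50 card who wants to try this, a minimal build-from-source sketch follows. It assumes a Linux checkout of the TensorFlow source tree and a working CUDA toolchain; `TF_CUDA_COMPUTE_CAPABILITIES` is the variable `./configure` reads, but exact steps vary by version, so treat this as an outline rather than the definitive recipe:

```shell
# Sketch only: ask the build to generate kernels for sm_50 (plus 7.0
# for modern cards). The bazel invocation is the usual pip-package
# target; both heavyweight steps are left commented out here.
export TF_CUDA_COMPUTE_CAPABILITIES=5.0,7.0
# ./configure   # answer Y when asked about CUDA support
# bazel build --config=cuda //tensorflow/tools/pip_package:build_pip_package
echo "will build kernels for: $TF_CUDA_COMPUTE_CAPABILITIES"
```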
Forgive my ignorance here, but with the change in cf1b6b3, I don't see
No, the driver will be able to JIT
I believe (It would indeed be a problem if we have to JIT for V100, since that would add startup latency for a popular GPU.) Edit: see https://github.com/tensorflow/tensorflow/blob/master/third_party/gpus/cuda_configure.bzl.
Actually, I get the same error message after I run bazel tests with GPU benchmarks. System: Ubuntu 18.04. Input:
But when I run models instead of running bazel test, it works fine.
Are you running these models also using a TF built from source, or are you running them using a pip-installed binary?
I think it should be the pip-installed binary.
Ok, so that could explain the discrepancy -- the TF binary you built for tests probably does not include sm_70 or compute_70. Can you grep for
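To check which architectures a binary actually embeds, you can scan it for the `sm_XX` / `compute_XX` tags. A sketch, run here against a dummy file so the commands are demonstrable end-to-end; in practice you would point the grep at the shared object your build produced (e.g. a `libtensorflow_framework.so` -- the path is hypothetical and depends on your build):

```shell
# Stand-in for the real library; replace with your build's .so.
printf 'elf...cubin sm_52...ptx compute_70...cubin sm_70...' > dummy_lib.so
# -a: treat binary data as text; -o: print each match on its own line.
grep -ao 'sm_[0-9]*\|compute_[0-9]*' dummy_lib.so | sort -u
```

On an official 2.3 binary you would expect sm_70/compute_70 among the results; their absence from a test build would explain the discrepancy described above.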
I searched. Let me clarify my environment: I used pip to install tensorflow, and downloaded the current tensorflow repo to run the bazel tests on.
It works for me after running
Hi all, I'm running into the same issue here. Both the Docker installation of TensorFlow and the local pip installation are giving me the same error:

tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

System information
I was able to reproduce this error in the Docker container using the steps listed above, but just following the steps listed here also leads to the same error. The GPU is detected OK. 2.2.0 works fine without any issues - both locally and through Docker.
@navganti PTAL here #41892 (comment).
@sanjoy I see! Thank you for the update.
This reverts commit 43e9ccd.
Reason for revert: TensorFlow 2.3 doesn't work on Linux GPU (tensorflow/tensorflow#41976, tensorflow/tensorflow#41132).
Change-Id: I819d10e8129aeaf57bf5f202600d0b5e1086000e
I'm seeing this error when I try to call TF_LoadSessionFromSavedModel (C API). It works correctly via the non-GPU Docker image, but fails with the GPU Docker image with the following error:

CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

I'm using the latest Docker image, i.e., tensorflow/tensorflow:latest-gpu built on 7/28/20.
CUDA 10.1 with cuDNN 7.4 and 7.5 both fail. (TF 2.3)
@sanjoy You're right, the chip (1660 Ti) absolutely does lack Tensor Cores, but that shouldn't be a problem at all. No GTX chip has Tensor Cores either, and the C API works fine on those. Also, training via Python is GPU-accelerated on my laptop with this chip. It's just the C API that's giving me this "device kernel image is invalid" error, which is clearly a bug somewhere. :(
@angerson I'll send a fix internally that matches the pip package config.
I have a commit that should fix this. I'll keep monitoring to see if there are any other issues.
@av8ramit Very sorry, I'm new to all of this. Will the fix eventually be available in the "latest" Docker image and the C API libraries that are posted to the web page that I linked to (above)? Do you have any idea when that might happen? Thanks in advance.
Unfortunately it appears my change did not work; digging into why today.
We're seeing the same issue as @motrek:
Then on the same system using
If I hack our C++ application to link against
OS: Ubuntu 20.04
Sorry I forgot to update this thread, but the latest GCS builds are built with the following computes:
Sorry for the [probably very basic] question, but how do you "link against python"? I found "_pywrap_tensorflow_internal.so" and tried linking with it, but I get a bunch of unresolved Python symbols. (As one would expect from reading your post.)
I'm working off of this page to get the libraries: https://www.tensorflow.org/install/lang_c There's a link to a "GCS bucket", but it goes to an XML file in my browser (Chrome), which seems like it's supposed to be read by a different piece of software that I don't have. I see that "GCS" is short for "Google Cloud Storage", but when I try to access "Google Cloud Storage" I'm asked to sign up for a service (including entering payment/credit card details). If somebody could point me in the right direction for accessing the files in this "GCS bucket" I would appreciate it. (Also, looking at that XML file in my browser, all of the files that were modified on 09-15 are 'libtensorflow-cpu-linux-x86_64.tar.gz' ... I don't see any files that seem like they would have GPU support?)
So that link is a browser representation of the GCS bucket.
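For anyone else confused by that XML: it is the bucket's listing API response (a `ListBucketResult` with one `<Key>` element per object), not a broken page. A sketch of pulling the object names out of it, run here against a saved sample listing -- the two filenames below are illustrative examples, not a claim about the bucket's actual contents; with curl you would fetch https://storage.googleapis.com/tensorflow/ itself:

```shell
# A small sample of the listing format the browser shows.
cat > listing.xml <<'EOF'
<ListBucketResult>
  <Contents><Key>libtensorflow/libtensorflow-gpu-linux-x86_64-1.15.0.tar.gz</Key></Contents>
  <Contents><Key>libtensorflow/libtensorflow-cpu-linux-x86_64-1.15.0.tar.gz</Key></Contents>
</ListBucketResult>
EOF
# Each <Key> is an object name; it downloads as
# https://storage.googleapis.com/tensorflow/<Key>
grep -o '<Key>[^<]*</Key>' listing.xml | sed -e 's/<Key>//' -e 's/<\/Key>//'
```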
Thanks for building this and posting this link. I appreciate it. I installed it and tried it -- I no longer got the error about "device kernel image is invalid", but I did get a bunch of errors about not being able to load libraries that seem related to CUDA 11. I'll have to consider whether I want to switch to CUDA 11; it seems like a big change, and I'm reluctant to touch anything since my training with GPU acceleration is already working well with my current setup.
Glad we were able to solve the first issue, sorry about the new issues you are facing. Do you mind uploading some logs so I can see if that's something we can fix on our end? The package was built with our CUDA 11 toolchain. |
I used
If the issue has been fixed in the nightly (which is TF 2.4) and not in the stable 2.3 release, then I'm going to have the same CUDA dependency problem that you do, since the default CUDA version in the Ubuntu 20.04 package repos is 10.1.
@av8ramit yeah, all the errors I'm seeing are to do with not being able to load CUDA 11 libraries, which I'm sure is expected if you built the libraries against CUDA 11 and I don't have CUDA 11 installed. I'm happy to post logs here but I can't imagine they would be useful to anybody. I will think about installing CUDA 11 later, but it might not be for a while, since I was able to get my code working by linking it against the Python libraries as parnham suggested.
@parnham Thanks so much for this. I don't have a good understanding of how the linker works with shared libraries but I was able to hack something together:
So everything seems pretty fragile but IT'S WORKING. Thanks so much. My C API inference workload is now ~5x as fast running on my GPU vs. the CPU. Great speedup. This is for a private project so I don't mind that I'm linking against those python libraries. If you can share what you did to allow gcc to link nicely against those libraries (not use absolute paths, etc.) that would be great, really appreciated. Thanks again!
Glad I could help a little @motrek

For linking to the

and then ran

There's a slightly cleaner solution to setting the "allow growth" option: include the experimental header

```c
#include <tensorflow/c/c_api_experimental.h>
```

and then use

```c
auto options = TF_NewSessionOptions();
auto config = TF_CreateConfig(true, true, 8);
TF_SetConfig(options, config->data, config->length, this->status);
TF_DeleteBuffer(config);
```

Use the session options as normal.
Apologies for adding more activity to this issue @av8ramit but we wanted to find out if there was going to be a point release of the TensorFlow C library v2.3 that has been patched with the correct CUDA capabilities?
Looping in the release manager. @geetachavan1 would we be able to patch the fix for libtensorflow and release new binaries with the correct CUDA capabilities? Happy to help get this done internally.
Nvidia 3090, Ubuntu 20.04, CUDA 10.1, cuDNN 7.6, Nvidia GPU driver 455: I have the same issue.
Hi. We have uploaded the 2.3.1 libtensorflow binaries. Apologies for the delay; I missed them during the patch release.
Using the link from https://www.tensorflow.org/install/lang_c#download and assuming that the new version was in the same location, I downloaded https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-gpu-linux-x86_64-2.3.1.tar.gz Unfortunately that build seems to have the same issue:
In fact, running a diff between the old and new libtensorflow.so.2.3.0 files shows them to be identical.
EDIT: The 2.3.0 and 2.3.1 tar.gz files have the same md5sum.
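A quick way to confirm whether a republished tarball actually changed is to compare checksums before unpacking anything. A sketch using two stand-in files (deliberately identical, mirroring the situation reported above); substitute the downloaded 2.3.0 and 2.3.1 archives:

```shell
# Stand-ins for the two downloads; replace with the real tar.gz files.
printf 'build contents' > libtensorflow-2.3.0.tar.gz
printf 'build contents' > libtensorflow-2.3.1.tar.gz
old=$(md5sum < libtensorflow-2.3.0.tar.gz)
new=$(md5sum < libtensorflow-2.3.1.tar.gz)
if [ "$old" = "$new" ]; then
  echo "identical: the re-upload did not change the artifact"
else
  echo "differs: genuinely new build"
fi
```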
This is interesting. Even if our recent changes to the build script had no effect, I would have expected the binaries to differ based on the patch release changes.
Apologies. It seems our CI uploaded the wrong package under the new name after we refactored parts of the CI. I think it should be fixed now; can you give it a try, please?
This now works for me. Thanks for sorting this out. Might want to update the links on the C API page to point to the new versions.
GPU: mobile GeForce GTX 1660 Ti
Thank you so much @av8ramit and @mihaimaruseac - v2.3.1 is now working for us also! |
No problem! Hats off to @geetachavan1 and @mihaimaruseac who did the heavy lifting! |
It seems that the Docker image tensorflow/tensorflow:2.3.0rc0-gpu won't work with my GPU, but on the other hand the image tensorflow/tensorflow:2.2.0rc0-gpu works fine.
In other words, the solution to the present issue was to "downgrade" to tensorflow/tensorflow:2.2.0rc0-gpu.
tensorflow/tensorflow:2.3.0rc0-gpu also works fine with CPU only.
System information
How to reproduce
Full stack trace: