
TF Java 2.3 failed on Tesla V100 GPU #104

Closed
roywei opened this issue Aug 24, 2020 · 1 comment · Fixed by #113

Comments

@roywei
Contributor

roywei commented Aug 24, 2020

Hi,

tensorflow-core-api:0.2.0-SNAPSHOT is failing on AWS EC2 instances with Tesla V100 GPUs. All of the core-api tests fail with the following error:

CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

I have root-caused the issue: CUDA compute capability 7.0 is not enabled during compilation. Running the following command and building from source again fixed the issue:
export TF_CUDA_COMPUTE_CAPABILITIES=7.0

Somehow, in the release build of tensorflow-core-api, compute 7.0 is not enabled, even though compute 3.5 and 7.0 should be default capabilities for TF 2.3 according to the references below. The Python packages and the main repo built from source work fine without any modification.
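As a minimal sketch of the workaround described above (the capability list is illustrative — it assumes you want Volta support in addition to the common 3.5 default; the subsequent build command for tensorflow-core-api is not shown here):

```shell
# Enable CUDA compute capability 7.0 (Volta / Tesla V100) alongside 3.5
# before rebuilding tensorflow-core-api from source.
# The exact list of capabilities to target is an assumption for illustration.
export TF_CUDA_COMPUTE_CAPABILITIES=3.5,7.0
echo "TF_CUDA_COMPUTE_CAPABILITIES=$TF_CUDA_COMPUTE_CAPABILITIES"
```

After setting the variable, rebuild from source as usual so the native library embeds kernels for the listed architectures.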

References:
tensorflow/tensorflow#41132
tensorflow/tensorflow@cf1b6b3

@saudet
Contributor

saudet commented Sep 10, 2020

BTW, from what I can see, it's compiling PTX code for 3.5 and 5.2, so it should still work.
Is it because you need to disable the JIT compiler?
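For context on the JIT question, a sketch of how the PTX JIT can be turned off (assumption: `CUDA_DISABLE_PTX_JIT` is the CUDA runtime environment variable for this, available in recent CUDA toolkits; it is not part of TF Java itself). With the JIT disabled, only precompiled SASS for the exact GPU architecture is usable, which would reproduce the failure above on a binary lacking a 7.0 cubin:

```shell
# Assumed CUDA runtime setting: with PTX JIT disabled, the runtime cannot
# JIT-compile embedded PTX (e.g. 3.5/5.2) for sm_70, so a library without a
# precompiled 7.0 kernel fails with "device kernel image is invalid".
export CUDA_DISABLE_PTX_JIT=1
echo "CUDA_DISABLE_PTX_JIT=$CUDA_DISABLE_PTX_JIT"
```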
