Fix driver version to install CUDA on ubuntu 18.04 #1599
The issue appears when running line 102 of this doc to install CUDA on Ubuntu 18.04. When I ran the following command exactly as written in the documentation, I ran into an error.

<code class="devsite-terminal">sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1
</code>

The error is as follows:

```
cuda-10-1 : Depends: cuda-runtime-10-1 (>= 10.1.243) but it is not going to be installed
            Depends: cuda-demo-suite-10-1 (>= 10.1.243) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
```

I finally figured out why this happened:

When you install nvidia-driver-430, apt actually installs nvidia-driver-440.

After that, installing the development and runtime libraries with the command above fails with the error shown: the cuda-10-1 dependencies cannot be installed.

That's because NVIDIA has updated their package to cuda-drivers=450.36.06-1, which is compatible with CUDA 11 but not CUDA 10. Installing cuda-10-1 actually pulls in cuda-drivers=450.36.06-1, which upgrades the whole system to nvidia-driver-450 with CUDA 11.

In conclusion:

1. Update the command from nvidia-driver-430 to nvidia-driver-440, because installing 430 actually installs 440.
2. Limit the cuda-drivers version to 440.64.00-1, that is, cuda-drivers=440.64.00-1.
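One way to check the diagnosis in the description is to compare the driver branch that was requested with the one that ended up installed. The following is a minimal sketch, not part of the PR: `driver_branch` and `report_branch_mismatch` are hypothetical helper names, and the commented `dpkg-query` and `apt-get --simulate` lines are just one possible way to gather the inputs on a real system.

```shell
#!/bin/sh
# Hypothetical helpers (not from the PR) for spotting a driver branch mismatch.

# Print the branch (leading number) of a driver version,
# for example 440.64.00-1 -> 440.
driver_branch() {
  printf '%s\n' "${1%%.*}"
}

# $1 = requested branch, $2 = installed driver version.
# Returns 0 when the branches match, 1 otherwise.
report_branch_mismatch() {
  if [ "$1" = "$(driver_branch "$2")" ]; then
    echo "OK: branch $1 is installed ($2)"
  else
    echo "MISMATCH: requested $1 but found $(driver_branch "$2") ($2)"
    return 1
  fi
}

# Example usage on a real system (commented out so that sourcing this
# file has no side effects):
#   dpkg-query -W -f='${Package} ${Version}\n' 'nvidia-driver-*'
#   apt-get install --simulate --no-install-recommends cuda-10-1
#   report_branch_mismatch 430 440.64.00-1
```

Running the simulated install first shows which packages apt would pull in without changing the system.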
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here to confirm.
@googlebot I signed it!
CLAs look good, thanks!
Nice work figuring this out—appreciate it! |
site/en/install/gpu.md (outdated)
  # Reboot. Check that GPUs are visible using the command: nvidia-smi

  # Install development and runtime libraries (~4GB)
  <code class="devsite-terminal">sudo apt-get install --no-install-recommends \
      cuda-10-1 \
      libcudnn7=7.6.4.38-1+cuda10.1 \
-     libcudnn7-dev=7.6.4.38-1+cuda10.1
+     libcudnn7-dev=7.6.4.38-1+cuda10.1 \
+     cuda-drivers=440.64.00-1
The hardcoded versions look brittle. @nluehr is there a way where we can avoid hardcoding the versions?
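One possible way to soften the hardcoding, sketched here under assumptions rather than proposed by anyone in the thread, is to fix only the driver branch and let a script pick the newest matching package version. `newest_in_branch` is a hypothetical helper; it assumes input shaped like `apt-cache madison` output (package, version, and source separated by `|`) and relies on GNU `sort -V`. The branch number itself would still be written into the docs.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): choose the newest cuda-drivers
# version within one driver branch from `apt-cache madison` style output.

# Reads lines such as "cuda-drivers | 440.64.00-1 | <source>" on stdin and
# prints the highest version whose branch matches $1 (for example 440).
# Prints nothing when no version in that branch is listed.
newest_in_branch() {
  awk -F'|' -v branch="$1" '
    { gsub(/[ \t]/, "", $2) }                 # strip padding around the version
    index($2, branch ".") == 1 { print $2 }   # keep versions in that branch
  ' | sort -V | tail -n 1
}

# Example usage on a real system (commented out to avoid side effects):
#   version="$(apt-cache madison cuda-drivers | newest_in_branch 440)"
#   sudo apt-get install --no-install-recommends cuda-10-1 "cuda-drivers=${version}"
```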
Hello! When I run the sh.compile.sh command, it compiles successfully, but when I run train.py I get an error: undefined symbol: _ZTIN10tensorflow8opkernelE. I searched online extensively but could not resolve it. I hope you can help me see what the problem is. Thanks again!

Maybe you could just install the prebuilt version rather than compiling it yourself:
conda install tensorflow==1.13.1 tensorflow-gpu==1.13.1
Thank you for your reply. The compilation problem has been solved, but a new problem has occurred, and the fixes I found online have no effect.
ValueError: Variable degridding/conv1/weights already exists, disallowed. Did you mean to set reuse=True or reuse=tf.AUTO_REUSE in VarScope? Originally defined at:
------------------ Original message ------------------
From: "hackerliang" <notifications@github.com>
Sent: Thursday, June 18, 2020, 11:48 PM
To: "tensorflow/docs" <docs@noreply.github.com>
Cc: "Dandelion's Fled" <794919561@qq.com>; "Comment" <comment@noreply.github.com>
Subject: Re: [tensorflow/docs] Fix driver version to install CUDA on ubuntu 18.04 (#1599)
Environment: Ubuntu 16.04; nvidia-driver-440; CUDA 10.0; cuDNN 7.5.0; tensorflow-gpu 1.13.1; gcc/g++ 5.4.0
Hello, I changed the TensorFlow version to 1.10.0. Compilation succeeds, but the error is still reported when running train.py.
Updating the drivers to the newly released r450 as in PR 1616 should fix this issue without needing to explicitly install cuda-drivers or pin its version.
@nluehr |
TF stable won't work with CUDA 11.
NVIDIA drivers are not bound to specific CUDA toolkit versions; they are backward compatible with older toolkits. So the r450 driver supports CUDA 10.1 and will work with stable TF releases.
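The backward-compatibility point can be sketched as a simple check that the installed driver is at least the minimum a given toolkit needs. This is an illustration under assumptions, not something from the thread: `version_at_least` is a hypothetical helper built on GNU `sort -V`, and the 418.39 minimum for CUDA 10.1 is my recollection of NVIDIA's release notes and should be confirmed there.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): check that an installed driver
# is at least the minimum a given CUDA toolkit needs.

# Succeeds when version $1 >= version $2, using GNU sort's version ordering.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# ASSUMPTION: 418.39 as the minimum Linux driver for CUDA 10.1; confirm the
# exact value in NVIDIA's CUDA release notes before relying on it.
MIN_DRIVER_FOR_CUDA_10_1="418.39"

# Example usage on a machine with the driver loaded (commented out):
#   installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
#   version_at_least "$installed" "$MIN_DRIVER_FOR_CUDA_10_1" && echo "driver is new enough"
```

Under that assumption, a 450-series driver passes the check for CUDA 10.1, which matches the claim that newer drivers work with older toolkits.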
sudo apt install nvidia-driver-450
This command actually installs NVIDIA driver 450 with CUDA 11.
So what we need to do is fix this thing ASAP in the following two ways:
The installation guide for the official website is outdated. |
Unfortunately, upgrading TF stable to CUDA 11 will not happen until 2.4, which will be cut in September.
@hackerliang, |
@nluehr |
There is a way to install driver 450 and CUDA 10.1: use the .run files that NVIDIA provides, which let you install only the driver or only the toolkit. You can install the 450 driver from the CUDA 11 toolkit (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal) and then install CUDA 10.1 (but not the driver) from the CUDA 10.1 toolkit. This works; I did it yesterday. But I don't think it can be added to the install instructions, since it's not straightforward.
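The runfile approach described above might look roughly like the following dry-run sketch. It only prints the commands rather than running them. `runfile_command` is a hypothetical helper, the file names in the usage comments are placeholders, and `--silent`, `--driver`, and `--toolkit` are the installer options as I understand them; confirm them with the installer's own `--help` output before use.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not from the thread): print the runfile
# invocation for installing only one component.

# $1 = path to a CUDA .run installer, $2 = component: "driver" or "toolkit".
runfile_command() {
  case "$2" in
    driver|toolkit) printf 'sudo sh %s --silent --%s\n' "$1" "$2" ;;
    *) echo "unknown component: $2" >&2; return 1 ;;
  esac
}

# Example with placeholder file names (substitute the files you downloaded):
#   runfile_command cuda_11_installer.run driver      # 450 driver only
#   runfile_command cuda_10.1_installer.run toolkit   # CUDA 10.1 toolkit only
```

Printing the commands first makes it easier to review each step before touching the installed driver.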
@hackerliang That issue should be fixed in the newly released 450.51.05 driver.
When I tried that install command for 450, it gave me an error. Can you paste your results from your try?
Following the instructions (and installing the 450 driver) on a clean system I see no errors. If I first install the 430 driver, I see the following error when attempting to install the 10.1 toolkit (same as reported at the top of this issue thread).
Now if I attempt to install the 450 driver over the top of the installed 430 driver I get a similar error.
This error is the result of the broken 430 driver package. Removing it first with What error are you seeing @hackerliang? |
@nluehr
|
@hackerliang the table linked refers to the 'CUDA compatibility platform', which is a Tesla-only feature that provides limited FORWARD compatibility (using an older driver with newer toolkits). Using older toolkits with newer drivers does not depend on that feature and is documented in section 2.2 of the same doc.
The driver was updated to 450. Is this PR still valid?