Fix driver version to install CUDA on ubuntu 18.04 #1599

Closed
wants to merge 4 commits into from

hackerliang

The issue appears when running line 102 of this doc to install CUDA on Ubuntu 18.04. When I ran the following command exactly as written in the documentation, I got an error:

sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1

The error is as follows:

cuda-10-1 : Depends: cuda-runtime-10-1 (>= 10.1.243) but it is not going to be installed
            Depends: cuda-demo-suite-10-1 (>= 10.1.243) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

I finally figured out why this happened:

When you install nvidia-driver-430, apt actually installs nvidia-driver-440.
After that, when you install the development and runtime libraries:

sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1 \
    libcudnn7-dev=7.6.4.38-1+cuda10.1

you get the error shown above: the cuda-10-1 dependencies cannot be installed.

That's because NVIDIA has updated its package to cuda-drivers=450.36.06-1, which is compatible with CUDA 11 but not CUDA 10.
When you install cuda-10-1, you are actually installing cuda-drivers=450.36.06-1, which upgrades the whole system to nvidia-driver-450 with CUDA 11.

In conclusion:
Update the command from nvidia-driver-430 to nvidia-driver-440, because installing 430 actually installs 440.
Pin the cuda-drivers version to 440.64.00-1, that is, cuda-drivers=440.64.00-1.
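
As a rough illustration of the diagnosis above, the following sketch reads the candidate version that apt reports for a package and extracts its driver branch. The helper names `driver_branch` and `candidate_version` are hypothetical and not part of apt; the sketch assumes the usual `apt-cache policy` output, which contains a `Candidate:` line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for inspecting which driver branch apt would pull in.

# Print the branch (text before the first dot) of a version such as 440.64.00-1.
driver_branch() {
    local version="$1"
    printf '%s\n' "${version%%.*}"
}

# Read `apt-cache policy <package>` output on stdin and print the
# version listed on its "Candidate:" line.
candidate_version() {
    awk '$1 == "Candidate:" { print $2; exit }'
}

# Example on a real system (requires apt and the NVIDIA repository):
#   driver_branch "$(apt-cache policy cuda-drivers | candidate_version)"
```

If the printed branch is not the one you expect, `apt-get install -s <packages>` simulates the install without changing the system, which may help confirm what apt intends to do before pinning a version.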

@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@googlebot added the "cla: no" (CLA has not been signed) label Jun 12, 2020
@hackerliang
Author

@googlebot I signed it!

@googlebot

CLAs look good, thanks!

@googlebot added the "cla: yes" (CLA has been signed) label and removed the "cla: no" (CLA has not been signed) label Jun 12, 2020
@hackerliang
Author

Installing nvidia-driver-430 actually installs nvidia-driver-440:
[image]

Dependency error:
[image]

When I pin the cuda-drivers version, the error is fixed:
[image]

@lamberta lamberta requested review from sanjoy and gunan June 12, 2020 19:57
@lamberta
Member

Nice work figuring this out—appreciate it!

# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
<code class="devsite-terminal">sudo apt-get install --no-install-recommends \
cuda-10-1 \
libcudnn7=7.6.4.38-1+cuda10.1 \
-libcudnn7-dev=7.6.4.38-1+cuda10.1
+libcudnn7-dev=7.6.4.38-1+cuda10.1 \
+cuda-drivers=440.64.00-1
Contributor

The hardcoded versions look brittle. @nluehr is there a way we can avoid hardcoding the versions?

Author

The following command works, but there may be a better solution:

sudo apt-get install --no-install-recommends \
    cuda-10-1 \
    libcudnn7=7.6.4.38-1+cuda10.1  \
    libcudnn7-dev=7.6.4.38-1+cuda10.1 \
    cuda-drivers=440\*

[image]

@skq-5233

Hello! When I run the sh.compile.sh command, it compiles successfully, but when I run train.py, I get an error: undefined symbol: _ZTIN10tensorflow8opkernelE. I searched extensively online but could not solve it. I hope you can help me find the problem. Thanks again!
Ubuntu: 16.04; nvidia-driver-440; CUDA: 10.0; cuDNN: 7.5.0; tensorflow-gpu: 1.13.1; gcc/g++: 5.4.0
[image]
@hackerliang

@hackerliang
Author

Maybe you could just install the prebuilt version rather than compiling it yourself:

conda install tensorflow==1.13.1 tensorflow-gpu==1.13.1

@skq-5233

skq-5233 commented Jun 19, 2020 via email

@nluehr
Contributor

nluehr commented Jul 8, 2020

Updating the drivers to the newly released r450 as in PR 1616 should fix this issue without needing to explicitly install cuda-drivers or pin its version.

@hackerliang
Author

@nluehr
The problem is that we don't know whether TensorFlow works well with CUDA 11. NVIDIA driver 450 is bound to CUDA 11, and this PR is meant to prevent the whole system from upgrading to NVIDIA driver 450 with CUDA 11.
If you have tested that all TensorFlow functionality works normally with CUDA 11, then it can be upgraded to CUDA 11.

@hackerliang hackerliang requested a review from gunan July 8, 2020 19:58
@yashk2810
Member

TF stable won't work with Cuda 11

@nluehr
Contributor

nluehr commented Jul 8, 2020

Nvidia drivers are not bound to specific CUDA toolkit versions, but are backward compatible with older toolkits. So the r450 driver supports CUDA 10.1 and will work with stable TF releases.
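
The backward-compatibility rule described here amounts to a minimum-version check: a toolkit needs a driver at or above some version, and any newer driver also qualifies. A minimal sketch follows, assuming GNU `sort -V` is available and assuming 418.39 as the minimum Linux driver for CUDA 10.1 (this number is an assumption here and should be verified against NVIDIA's release notes). The function names are hypothetical.

```bash
#!/usr/bin/env bash
# version_ge A B: succeed when version A is greater than or equal to version B,
# using the version ordering of GNU sort -V.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed minimum Linux driver for CUDA 10.1; verify against NVIDIA's release notes.
MIN_DRIVER_CUDA_10_1="418.39"

# Succeed when the given driver version is new enough for the CUDA 10.1 toolkit.
driver_supports_cuda_10_1() {
    version_ge "$1" "$MIN_DRIVER_CUDA_10_1"
}

# Example on a machine with a GPU and driver installed:
#   driver_supports_cuda_10_1 \
#       "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
```

Under that assumption, both the 440 and 450 driver branches pass the check, which matches the claim that the r450 driver supports CUDA 10.1.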

@hackerliang
Author

Nvidia drivers are not bound to specific CUDA toolkit versions, but are backward compatible with older toolkits. So the r450 driver supports CUDA 10.1 and will work with stable TF releases.

sudo apt install nvidia-driver-450

This command actually installs NVIDIA driver 450 with CUDA 11.

@hackerliang
Author

TF stable won't work with Cuda 11

So we need to fix this ASAP in the following two ways:

  1. Prevent the APT package manager from installing NVIDIA driver 450 with CUDA 11 (this PR).
  2. Upgrade TensorFlow stable so that it works with CUDA 11.

The installation guide on the official website is outdated.

@gunan
Contributor

gunan commented Jul 8, 2020

Unfortunately, upgrading TF stable to CUDA 11 will not happen until 2.4, which will be cut in September.
CUDA 11 support did not make it into the 2.3 release cut.

@nluehr
Contributor

nluehr commented Jul 8, 2020

@hackerliang, sudo apt install nvidia-driver-450 does not install the CUDA 11 toolkit, just the driver. The 10.1 toolkit is installed by sudo apt-get install --no-install-recommends cuda-10-1.

@hackerliang
Author

@nluehr
You can try installing nvidia-driver-450 and then installing CUDA 10.
You will find that there is a dependency error and that you cannot install CUDA 10, only CUDA 11.

@yashk2810
Member

There is a way to install driver 450 and CUDA 10.1.

Use the .run files that NVIDIA provides, which give you the option to install only the driver or only the toolkit. So you can install the 450 driver from the CUDA 11 toolkit (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal) and then install CUDA 10.1 (but not the driver) from the CUDA 10.1 toolkit.

This works; I did it yesterday. But I don't think it can be added to the install instructions, since it's not straightforward.

@nluehr
Contributor

nluehr commented Jul 8, 2020

@hackerliang That issue should be fixed in the newly released 450.51.05 driver.
@yashk2810 The 450.51.05 driver is available in the same nvidia developer debian repo already used in the current instructions. So the install instructions should work by just replacing the nvidia-driver-430 with nvidia-driver-450 like here.

@yashk2810
Member

yashk2810 commented Jul 8, 2020

When I tried that install command for 450, it gave me an error.

Can you paste the results from your attempt?

@nluehr
Contributor

nluehr commented Jul 9, 2020

Following the instructions (and installing the 450 driver) on a clean system I see no errors. If I first install the 430 driver, I see the following error when attempting to install the 10.1 toolkit (same as reported at the top of this issue thread).

apt-get install --no-install-recommends cuda-10-1 \
                                        libcudnn7=7.6.4.38-1+cuda10.1 \
                                        libcudnn7-dev=7.6.4.38-1+cuda10.1
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda-10-1 : Depends: cuda-runtime-10-1 (>= 10.1.243) but it is not going to be installed
             Depends: cuda-demo-suite-10-1 (>= 10.1.243) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Now if I attempt to install the 450 driver over the top of the installed 430 driver I get a similar error.

apt-get install --no-install-recommends nvidia-driver-450
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 nvidia-driver-450 : Depends: nvidia-dkms-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: nvidia-kernel-source-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: libnvidia-extra-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: nvidia-compute-utils-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: nvidia-utils-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: libnvidia-ifr1-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
                     Depends: libnvidia-fbc1-450 (= 450.51.05-0ubuntu1) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

This error is the result of the broken 430 driver package. Removing it first with apt-get remove nvidia-driver-430 and then installing the 450 driver succeeds. And with the 450 driver, the cuda 10.1 install also succeeds.
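
The recovery order described above could be scripted roughly as follows. This is only a sketch: it prints the commands so they can be reviewed before being run by hand, `recovery_plan` is a hypothetical helper, and the package names are the ones quoted in this thread.

```bash
#!/usr/bin/env bash
# Print (do not run) the commands for replacing a broken driver package
# and then installing the CUDA 10.1 toolkit.
# $1: driver branch to remove (e.g. 430), $2: driver branch to install (e.g. 450).
recovery_plan() {
    local old="$1" new="$2"
    printf 'sudo apt-get remove nvidia-driver-%s\n' "$old"
    printf 'sudo apt-get install --no-install-recommends nvidia-driver-%s\n' "$new"
    printf 'sudo apt-get install --no-install-recommends cuda-10-1 %s %s\n' \
        'libcudnn7=7.6.4.38-1+cuda10.1' 'libcudnn7-dev=7.6.4.38-1+cuda10.1'
}

# Example: show the plan for moving from the 430 package to the 450 driver.
recovery_plan 430 450
```

Printing rather than executing keeps the sketch safe to try on any machine; each printed command can also be checked first with `apt-get -s` to simulate it.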

What error are you seeing @hackerliang?

@hackerliang
Author

@nluehr
This happened a month ago, so I don't remember the details anymore, and I don't have a spare machine right now. Maybe someone else can help with this test. However, I found a table on NVIDIA's official website. I don't really know what N/A means.

[image]
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-application-compatibility

@nluehr
Contributor

nluehr commented Jul 13, 2020

@hackerliang the linked table refers to the 'CUDA compatibility platform', which is a Tesla-only feature that provides limited FORWARD compatibility (using older drivers with newer toolkits).

Using older toolkits with newer drivers does not depend on that feature and is documented in section 2.2 of the same doc.

@yashk2810
Member

The driver was updated to 450. Is this PR still valid?

Labels
cla: yes CLA has been signed
Projects
None yet
7 participants