This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

NVidia cuda support #1637

Open
flesicek opened this issue Feb 19, 2017 · 29 comments

Comments

@flesicek

Could you please consider supporting the NVIDIA CUDA driver implementation in RancherOS?

NVIDIA already provides Docker support here: https://github.com/NVIDIA/nvidia-docker

@doprdele

+1

1 similar comment
@lost-carrier

+1

@niusmallnan niusmallnan added this to the v1.4.0 milestone Mar 16, 2018
@wchao1241
Contributor

Working on support for the NVIDIA CUDA driver implementation in RancherOS. Currently there is a problem: the hardware device cannot be identified in the OS kernel, which means the driver cannot be installed.

Below is a comparison between RancherOS and Ubuntu 16.04:
RancherOS:
00:1e.0 Class 0302: 10de:102d

Ubuntu16.04:
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

@vincent99
Contributor

Drivers identify the hardware by their PCI IDs; 10de:102d is the ID of that card. There is no database of ID-to-human-friendly-name mappings loaded in RancherOS, but this shouldn't affect the driver detecting it.
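
If anyone wants to double-check, here is a quick sketch (assuming the card really is the device at 00:1e.0 shown above) of confirming that the kernel sees it even without the friendly-name database:

# raw vendor/device IDs from sysfs: 0x10de is NVIDIA, 0x102d is the GK210GL (Tesla K80)
cat /sys/bus/pci/devices/0000:00:1e.0/vendor
cat /sys/bus/pci/devices/0000:00:1e.0/device

# lspci -nn shows the numeric IDs, plus names when a pci.ids database is installed
lspci -nn | grep 10de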

@wchao1241
Contributor

@vincent99 Yes, drivers identify hardware by PCI IDs. The installation error was caused by other reasons. Thank you very much.

@kingsd041
Contributor

kingsd041 commented May 15, 2018

Tested with RancherOS v1.4.0-rc1.
@wchao1241 I verified this issue with reference to https://github.com/rancher/os-services/tree/master/n, but I encountered some errors while executing /var/lib/rancher/nvidia/build.sh that left me no way to continue.
The error output is below:

  ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.


  ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See
         /var/log/nvidia-installer.log for details.

  ERROR: The nvidia kernel module was not created.
  
  ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation
         problems in the README available on the Linux driver download page at www.nvidia.com.
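
The log file referenced in those messages usually contains the underlying compiler or kernel-headers failure, so the first thing worth checking is simply:

# last part of the installer log mentioned in the errors above
tail -n 50 /var/log/nvidia-installer.log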

@niusmallnan
Contributor

nvidia-docker cannot work with our new kernel; we have asked that community for help.
Keeping this open until we find a solution, and removing this feature from the v1.4.0 milestone.

@niusmallnan niusmallnan modified the milestones: v1.4.0, unscheduled May 28, 2018
@doprdele

Does this work now?

@niusmallnan
Contributor

We have fixed the kernel issue. I think we can add nvidia-docker support in the next release.

@doprdele

doprdele commented Aug 16, 2018 via email

@mcapuccini

Hi there! Any update on this?

@niusmallnan
Contributor

It can work in the Ubuntu console, but we want to support it in the default console.
We are making the final effort.

@niusmallnan niusmallnan modified the milestones: unscheduled, v1.5.1 Dec 25, 2018
@kingsd041 kingsd041 removed their assignment Dec 28, 2018
@tech98321469320842

@niusmallnan What is the current status of the NVIDIA CUDA integration? Is it possible to deploy it in some way?

@tech98321469320842

@niusmallnan When will version 1.5.1 be released?

@niusmallnan
Contributor

@tech98321469320842 At the end of Feb.

@niusmallnan
Contributor

I want to integrate nvidia-docker2; currently this mainly involves these projects.

They almost exclusively provide deb and rpm packages, and it seems difficult to install from binaries. So at this stage I don't plan to support it in the default console; I will give priority to supporting it in the Ubuntu console.

Usually we would just add the apt source and install the corresponding deb, but there is a problem in ROS: nvidia-docker2 depends on the docker deb packages, and ROS does not use debs to manage Docker.

# https://github.com/NVIDIA/nvidia-docker/blob/master/debian/control

Package: nvidia-docker2
Architecture: all
Breaks: nvidia-docker (<< 2.0.0)
Replaces: nvidia-docker (<< 2.0.0)
Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@

So I can customize nvidia-docker2 by simply removing this dependency.

Boot a VM (Ubuntu 18.04) and build the package after applying this patch:

diff --git a/debian/control b/debian/control
index d06d85f..86d2023 100644
--- a/debian/control
+++ b/debian/control
@@ -12,7 +12,7 @@ Package: nvidia-docker2
 Architecture: all
 Breaks: nvidia-docker (<< 2.0.0)
 Replaces: nvidia-docker (<< 2.0.0)
-Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@
+Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@)
 Description: nvidia-docker CLI wrapper
  Replaces nvidia-docker with a new implementation based on
  nvidia-container-runtime

Just run make 18.06.1-ubuntu18.04; you can replace 18.06.1 if you want to support another Docker version.
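
For example (hypothetical version string; this assumes a matching target exists in the repo's Makefile):

# build against a different Docker release instead
make 18.09.0-ubuntu18.04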

Boot a ROS (v1.5.0) instance and add the nvidia-docker repo. Note that we need to use the Ubuntu console:

ros console switch ubuntu

apt update && apt install gnupg

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt update

Install the packages:

apt install nvidia-container-runtime=2.0.0+docker18.06.1-1

# install your custom nvidia-docker2 package
dpkg -i nvidia-docker2_2.0.3+docker18.06.1-1_all.deb
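
Once the packages are in place, a quick sanity check of the runtime wiring (the same check used further down in this thread) is:

# the nvidia runtime should show up in the daemon's runtime list
docker info | grep -i runtime

# and a CUDA container should be able to reach the GPU
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi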

@niusmallnan niusmallnan removed this from the v1.5.1 milestone Feb 9, 2019
@rkdgo

rkdgo commented Feb 10, 2019

Tesla K80 installed

lspci | grep NVIDIA
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Is there a special way to install the NVIDIA driver on RancherOS 1.5.0?

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

reports an error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused "process_linux.go:385: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=8281 /var/lib/docker/overlay2/9b72f828525e4a83bd6084006f7caf8af91ad99b8249cc50e89de7053f24462e/merged]\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\""": unknown.
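
(For context: "no cuda-capable device is detected" from nvidia-container-cli usually means the NVIDIA kernel driver itself is not loaded on the host; the host-side checks below are only a sketch of where to look first.)

# is the nvidia kernel module loaded?
lsmod | grep nvidia

# were the device nodes created by the driver?
ls -l /dev/nvidia*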

@mathieupost

mathieupost commented Feb 19, 2019

I got a similar error

lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GK106M [GeForce GTX 765M] (rev a1)
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=8168 /var/lib/docker/overlay2/514c92b8bb41862f9810364512638163ada28875a193b4f4200e7d6563ee15ac/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.

although the error at the end is different: initialization error: driver error: failed to process request

EDIT:
It turned out the NVIDIA driver was not correctly installed for me. I can't install it correctly because the nouveau kernel module gets loaded on every boot, even though it is blacklisted by /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.
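
A hedged note on the blacklist: the file the NVIDIA installer writes normally contains just the two lines shown below, but a modprobe.d blacklist does not stop nouveau being loaded from the initramfs during early boot unless it is present there as well (or nouveau is blocked on the kernel command line, e.g. modprobe.blacklist=nouveau), which may be why it still loads every boot here.

# contents typically written by the NVIDIA installer
cat /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# confirm which driver is actually bound after boot
lsmod | grep -e nouveau -e nvidia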

@lygstate

@niusmallnan So is it fixed now?

@turley

turley commented Apr 4, 2019

I went through the steps outlined by @niusmallnan above and kept running into the following error:

nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1

After a little digging, I found that it's failing when trying to pivot_root here:
https://github.com/NVIDIA/libnvidia-container/blob/deccb2801502675bd283c6936861814dbca99ecd/src/nvc_ldcache.c#L117

I'm not sure why it's failing there or how to fix it, but thought I'd share my findings in case it helps someone else narrow down the issue.
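
In case it helps anyone digging further: libnvidia-container can produce a verbose report of what it sees on the host (this is the diagnostic the nvidia-docker project usually asks for in bug reports), something along these lines:

# verbose driver/device report from the container CLI
nvidia-container-cli -k -d /dev/tty info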

davidhyman added a commit to fiveai/nvidia-docker that referenced this issue May 28, 2019
@kidhasmoxy

@davidhyman did you get this to work on RancherOS using the patched package?

@tobylo

tobylo commented Sep 6, 2019

Is this issue still being worked on, and if so, is there any update on the status?

@Confusingboat

+1 interested in the status of this issue

3 similar comments
@stlaurentc

+1 interested in the status of this issue

@NM4

NM4 commented Nov 25, 2019

+1 interested in the status of this issue

@andrew-mcgrath

+1 interested in the status of this issue

@user-name-is-taken

For what it's worth, looking through the Rancher docs I found this page that talks about scheduling pods to nodes with GPUs.

@piersdd

piersdd commented Mar 21, 2020

+1 interested in the status of this issue

1 similar comment
@redbaron-gt

+1 interested in the status of this issue
