This repository has been archived by the owner on Oct 11, 2023. It is now read-only.

NVidia cuda support #1637

Open
flesicek opened this issue Feb 19, 2017 · 29 comments

Comments

@flesicek

Could you please consider supporting the NVIDIA CUDA driver implementation in RancherOS?

NVIDIA already provides Docker support here: https://github.com/NVIDIA/nvidia-docker

@doprdele

+1

1 similar comment
@lost-carrier

+1

@niusmallnan niusmallnan added this to the v1.4.0 milestone Mar 16, 2018
@wchao1241
Contributor

Working on support for the NVIDIA CUDA driver implementation in RancherOS. Currently there is a problem: the hardware device cannot be identified in the OS kernel, which means the driver cannot be installed.

Below is a comparison between RancherOS and Ubuntu 16.04:
RancherOS:
00:1e.0 Class 0302: 10de:102d

Ubuntu16.04:
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

@vincent99
Contributor

Drivers identify the hardware by their PCI IDs; 10de:102d is the ID of that card. There is no database of ID-to-human-friendly-name mappings loaded in RancherOS, but this shouldn't affect the driver detecting it.
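
If anyone wants to double-check, here is a quick sketch (assuming the card really is the device at 00:1e.0 shown above) of confirming that the kernel sees it even without the friendly-name database:

# raw vendor/device IDs from sysfs: 0x10de is NVIDIA, 0x102d is the GK210GL (Tesla K80)
cat /sys/bus/pci/devices/0000:00:1e.0/vendor
cat /sys/bus/pci/devices/0000:00:1e.0/device

# lspci -nn shows the numeric IDs, plus names when a pci.ids database is installed
lspci -nn | grep 10de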

@wchao1241
Contributor

@vincent99 Yes, drivers identify hardware by PCI IDs. The installation error was caused by other reasons. Thank you very much.

@kingsd041
Contributor

kingsd041 commented May 15, 2018

Tested with RancherOS v1.4.0-rc1.
@wchao1241 I verified this issue with reference to https://github.com/rancher/os-services/tree/master/n, but I encountered some errors while executing /var/lib/rancher/nvidia/build.sh that left me no way to continue.
The error output is below:

  ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.


  ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See
         /var/log/nvidia-installer.log for details.

  ERROR: The nvidia kernel module was not created.
  
  ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation
         problems in the README available on the Linux driver download page at www.nvidia.com.
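
The log file referenced in those messages usually contains the underlying compiler or kernel-headers failure, so the first thing worth checking is simply:

# last part of the installer log mentioned in the errors above
tail -n 50 /var/log/nvidia-installer.log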

@niusmallnan
Contributor

nvidia-docker cannot work with our new kernel; we have asked that community for help.
Keeping this open until we find a solution, and removing this feature from the v1.4.0 milestone.

@niusmallnan niusmallnan modified the milestones: v1.4.0, unscheduled May 28, 2018
@doprdele

Does this work now?

@niusmallnan
Contributor

We have fixed the kernel issue. I think we can add nvidia-docker support in the next release.

@doprdele

doprdele commented Aug 16, 2018 via email

@mcapuccini

Hi there! Any update on this?

@niusmallnan
Contributor

It can work in the Ubuntu console, but we want to support it in the default console.
We are making the final effort.

@niusmallnan niusmallnan modified the milestones: unscheduled, v1.5.1 Dec 25, 2018
@kingsd041 kingsd041 removed their assignment Dec 28, 2018
@tech98321469320842

@niusmallnan What is the current status of the NVIDIA CUDA integration? Is it possible to deploy it in some way?

@tech98321469320842

@niusmallnan When will version 1.5.1 be released?

@niusmallnan
Contributor

@tech98321469320842 At the end of Feb.

@niusmallnan
Contributor

I want to integrate nvidia-docker2; currently this mainly involves these projects.

They almost exclusively provide deb and rpm packages, and it seems difficult to install from binaries. So at this stage I don't plan to support it in the default console; I will give priority to supporting it in the Ubuntu console.

Usually we would just add the apt source and install the corresponding deb, but there is a problem in ROS: nvidia-docker2 depends on the docker deb packages, and ROS does not use debs to manage Docker.

# https://github.com/NVIDIA/nvidia-docker/blob/master/debian/control

Package: nvidia-docker2
Architecture: all
Breaks: nvidia-docker (<< 2.0.0)
Replaces: nvidia-docker (<< 2.0.0)
Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@

So I can customize nvidia-docker2 by simply removing this dependency.

Boot a VM (Ubuntu 18.04) and build the package after applying this patch:

diff --git a/debian/control b/debian/control
index d06d85f..86d2023 100644
--- a/debian/control
+++ b/debian/control
@@ -12,7 +12,7 @@ Package: nvidia-docker2
 Architecture: all
 Breaks: nvidia-docker (<< 2.0.0)
 Replaces: nvidia-docker (<< 2.0.0)
-Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@), @DOCKER_VERSION@
+Depends: ${misc:Depends}, nvidia-container-runtime (= @RUNTIME_VERSION@)
 Description: nvidia-docker CLI wrapper
  Replaces nvidia-docker with a new implementation based on
  nvidia-container-runtime

Just run make 18.06.1-ubuntu18.04; you can replace 18.06.1 if you want to support another Docker version.
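
For example (hypothetical version string; this assumes a matching target exists in the repo's Makefile):

# build against a different Docker release instead
make 18.09.0-ubuntu18.04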

Boot a ROS (v1.5.0) instance and add the nvidia-docker repo. Note that we need to use the Ubuntu console:

ros console switch ubuntu

apt update && apt install gnupg

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
  sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
apt update

Install the packages:

apt install nvidia-container-runtime=2.0.0+docker18.06.1-1

# install your custom nvidia-docker2 package
dpkg -i nvidia-docker2_2.0.3+docker18.06.1-1_all.deb
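
Once the packages are in place, a quick sanity check of the runtime wiring (the same check used further down in this thread) is:

# the nvidia runtime should show up in the daemon's runtime list
docker info | grep -i runtime

# and a CUDA container should be able to reach the GPU
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi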

@niusmallnan niusmallnan removed this from the v1.5.1 milestone Feb 9, 2019
@rkdgo

rkdgo commented Feb 10, 2019

Tesla K80 installed

lspci | grep NVIDIA
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

Is there a special way to install the NVIDIA driver on RancherOS 1.5.0?

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

reports an error:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused "process_linux.go:385: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 --pid=8281 /var/lib/docker/overlay2/9b72f828525e4a83bd6084006f7caf8af91ad99b8249cc50e89de7053f24462e/merged]\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\n\""": unknown.
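
(For context: "no cuda-capable device is detected" from nvidia-container-cli usually means the NVIDIA kernel driver itself is not loaded on the host; the host-side checks below are only a sketch of where to look first.)

# is the nvidia kernel module loaded?
lsmod | grep nvidia

# were the device nodes created by the driver?
ls -l /dev/nvidia*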

@mathieupost

mathieupost commented Feb 19, 2019

I got a similar error

lspci | grep NVIDIA
01:00.0 3D controller: NVIDIA Corporation GK106M [GeForce GTX 765M] (rev a1)
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=8168 /var/lib/docker/overlay2/514c92b8bb41862f9810364512638163ada28875a193b4f4200e7d6563ee15ac/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.

although the error at the end is different: initialization error: driver error: failed to process request

EDIT:
It turned out the NVIDIA driver was not correctly installed for me. I can't install it correctly because the nouveau kernel module gets loaded on every boot, even though it is blacklisted by /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.
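
A hedged note on the blacklist: the file the NVIDIA installer writes normally contains just the two lines shown below, but a modprobe.d blacklist does not stop nouveau being loaded from the initramfs during early boot unless it is present there as well (or nouveau is blocked on the kernel command line, e.g. modprobe.blacklist=nouveau), which may be why it still loads every boot here.

# contents typically written by the NVIDIA installer
cat /etc/modprobe.d/nvidia-installer-disable-nouveau.conf
blacklist nouveau
options nouveau modeset=0

# confirm which driver is actually bound after boot
lsmod | grep -e nouveau -e nvidia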

@lygstate

@niusmallnan So is it fixed now?

@turley

turley commented Apr 4, 2019

I went through the steps outlined by @niusmallnan above and kept running into the following error:

nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1

After a little digging, I found that it's failing when trying to pivot_root here:
https://github.com/NVIDIA/libnvidia-container/blob/deccb2801502675bd283c6936861814dbca99ecd/src/nvc_ldcache.c#L117

I'm not sure why it's failing there or how to fix it, but thought I'd share my findings in case it helps someone else narrow down the issue.
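
In case it helps anyone digging further: libnvidia-container can produce a verbose report of what it sees on the host (this is the diagnostic the nvidia-docker project usually asks for in bug reports), something along these lines:

# verbose driver/device report from the container CLI
nvidia-container-cli -k -d /dev/tty info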

davidhyman added a commit to fiveai/nvidia-docker that referenced this issue May 28, 2019
@kidhasmoxy

@davidhyman did you get this to work on RancherOS using the patched package?

@tobylo

tobylo commented Sep 6, 2019

Is this issue still being worked on, and if so, is there any update on the status?

@Confusingboat

+1 interested in the status of this issue

3 similar comments
@stlaurentc

+1 interested in the status of this issue

@NM4

NM4 commented Nov 25, 2019

+1 interested in the status of this issue

@andrew-mcgrath

+1 interested in the status of this issue

@user-name-is-taken

For what it's worth, looking through the Rancher docs I found this page that talks about scheduling pods to nodes with GPUs.

@piersdd

piersdd commented Mar 21, 2020

+1 interested in the status of this issue

1 similar comment
@redbaron-gt

+1 interested in the status of this issue
