
Add xgboost and stan install and test scripts #331

Draft · wants to merge 8 commits into base: master
Conversation

@RobJY (Contributor) commented Jan 11, 2022

Here's code to install and test GPU-supported xgboost and Stan. Please let me know if I should move or add anything.

I'm not sure I've put install_stan.R in the right place, since it needs to be run by users to install Stan. Is there a better place for it?

I got the xgboost test script from https://rdrr.io/cran/xgboost/src/demo/gpu_accelerated.R. Please let me know if you think there's something better we could use.

@RobJY RobJY changed the title Add xgboost stan install and test scripts Add xgboost and stan install and test scripts Jan 13, 2022
@cboettig (Member) commented:

These look good to me! @noamross any thoughts here?

@noamross (Collaborator) commented:

Only that I'm not sure these have been updated since @RobJY found the issue in NVIDIA/nvidia-docker#342, so they should likely change to install something other than nvidia-cuda-toolkit, right @RobJY?

@noamross (Collaborator) commented:

Also, these are interactive test scripts, not scripts that return a value or a specific error for failure. Wouldn't it be better to have something that can integrate into a smoke test?

@RobJY (Contributor, Author) commented Jan 31, 2022

That's right @noamross, I'll have to update the xgboost install script when I've got a fix working.

Resolved review threads: scripts/install_xgboost.sh (two threads), scripts/install_stan.R (outdated).
@RobJY (Contributor, Author) commented Feb 3, 2022

I've pushed an update based on a fix that I'll be deploying to our systems tomorrow. Let me explain what I think the issue is and a possible solution.

The issue we've been having seems to be that the CUDA libraries are so actively developed that it's very easy for a library to be updated after the CUDA driver has been installed, at which point the library versions on the host and in the container no longer match. Keeping nvidia-cuda-toolkit would probably be fine, but it pulls in so many packages that it's more likely to run into this problem. For that reason, the updated code uses a minimal set of libraries to help mitigate the issue.
Updating the driver would require scheduling downtime for a machine reboot, so it's better for us to lock the library versions and use apt-mark hold to try to keep them from being updated. This has worked in my test container.
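For example, the hold is along these lines (a rough sketch; the package names here are illustrative rather than the exact set the script installs):

    # Keep the GPU libraries from being upgraded past the installed driver.
    # Package names are illustrative; apt-mark hold works the same way for
    # whichever CUDA/NVIDIA libraries are actually installed.
    sudo apt-mark hold libcudart11.0 libnvidia-compute-460
    # Release the hold later, e.g. after a planned driver upgrade:
    # sudo apt-mark unhold libcudart11.0 libnvidia-compute-460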

It seems like using this script for Rocker builds has the opposite problem: we will be building images with the current versions of the libraries, but will have no way of knowing what version of the driver users will have on their host machines. It seems very likely that this will often result in this library mismatch error. For this reason I removed the versions from the apt call, and now that I think of it we probably don't want to hold the installed versions either; I'll update that too. Maybe the best way to deal with this is to have a document that helps users perform a set of steps to try to resolve the problem if they receive the library mismatch error message. That may be as simple as updating their driver and then updating the libraries in the container.

Is there a better way to handle this?

@RobJY (Contributor, Author) commented Feb 7, 2022

Also, these are interactive test scripts, not scripts that return a value or a specific error for failure. Wouldn't it be better to have something that can integrate into a smoke test?

I was thinking we'd collect the standard output from running this and check the return code; if the code isn't 0, we'd just echo the standard output. I see now that I should do that inside the script, though, rather than create more work outside it.
Thanks for pointing this out, @noamross!

I've added the bash scripts that do what I described above. Let me know if you'd like any other changes.
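For what it's worth, the wrapper is roughly of this shape (a minimal sketch; the script name test_xgboost.R is a placeholder rather than the actual file in this PR):

    #!/bin/bash
    # Run the GPU test, capture its output, and only echo the output when the
    # test fails, so the wrapper can slot into a smoke test.
    set -u

    output=$(Rscript test_xgboost.R 2>&1)
    status=$?

    if [ "$status" -ne 0 ]; then
        echo "xgboost GPU smoke test failed (exit code $status):"
        echo "$output"
    fi
    exit "$status"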

@RobJY (Contributor, Author) commented Mar 11, 2022

I just took another look at this to see if I could move it along. I could add some text about possible fixes a user could try if they encounter the Failed to initialize NVML: Driver/library version mismatch error message, but I think we first need to learn a little more about how the issue behaves on a long-running server, like the one we have on premises, with the fixes we've proposed there: we will not pin versions of the GPU packages, and we will install a new version of the NVIDIA driver with each update. This should cut down on the number of these mismatch errors, but I'm not sure it will completely eliminate them. I'll update this PR when we've gone through another local update and deploy.

@eitsupi eitsupi marked this pull request as draft May 4, 2022 10:38
@RobJY (Contributor, Author) commented Jul 14, 2022

We've gone through another deploy and hit the same 'version mismatch' issue. To correct it, I modified the script (attached as a text file) that we run on our host machine so that it installs the CUDA toolkit at version 11.1.

Curiously, nvidia-smi shows CUDA version = 11.7 on both the host and in the container after running the script, even though it previously failed in the container with the 'version mismatch' message. I've read that this version report can be incorrect, but I wanted to check. We don't have nvcc installed on the host, so I can't check the version there, and I'd rather not install it now and risk causing versions of other packages to change. We also don't have a /usr/local/cuda/version.txt file on either the host or in the container. I can see that in /usr/local we only have the directories cuda-11.1 and cuda, and cuda is just a symlink to cuda-11.1. For this reason I think we're actually running toolkit version 11.1 as intended, and this technique can be used by Rocker users to update their host machines to match the toolkit version in the container.
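(The check is just looking at where the standard CUDA install location points; a rough sketch:)

    # /usr/local/cuda is normally a symlink into the versioned install directory,
    # so this shows which toolkit is actually on the machine.
    ls -d /usr/local/cuda*          # here: /usr/local/cuda and /usr/local/cuda-11.1
    readlink -f /usr/local/cuda     # here: resolves to /usr/local/cuda-11.1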

The block of code at the bottom of the script that installs version 11.1 of the toolkit comes from this NVIDIA page, with the only modification being the change from cuda to cuda-11.1 in the last line. We could point Rocker users who hit this 'version mismatch' problem to that link with instructions to make the same modification, so that they install the 11.1 version of the toolkit on their host machine and hopefully correct the error as it did for us.

How does this sound?

install_container_toolkit.txt

@cboettig (Member) commented:

@RobJY thanks for the follow-up. The driver mismatch error is definitely an annoying one, and not one that I've completely wrapped my head around either. A few comments/observations so far:

Curiously, nvidia-smi shows CUDA version = 11.7 on both the host and in the container after running the script.

This is as expected. nvidia-smi shows the CUDA driver version, i.e. the host version, not the version of the toolkit software. The driver version must be >= the toolkit version, according to NVIDIA.
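In other words, the two numbers come from different tools (a quick sketch; nvcc is only available wherever the toolkit itself is installed):

    # Driver side: the "CUDA Version" in the nvidia-smi header is the highest
    # CUDA version the installed driver supports, not the installed toolkit.
    nvidia-smi
    # Toolkit side: the installed toolkit version is reported by nvcc instead.
    nvcc --version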

Updating the driver on the host while the container is still running is one way to get Failed to initialize NVML: Driver/library version mismatch, but I also see this error when I don't think I've done any updates on the host; have you observed that? Either way, simply doing docker restart <container_id> resolves the error, though that's obviously not ideal.

@RobJY (Contributor, Author) commented Jul 15, 2022

Thank you @cboettig for the explanation of the CUDA version from nvidia-smi! That makes sense.

You're right about suggesting a machine reboot; that should be the first thing we ask users to try. I did see on the forums that it was successful for many users. Unfortunately, it didn't work for me.

Let me explain what I think the issue was for our servers, and please let me know if you think I've got any of it wrong.
At the top of my attached script we install the updated driver and container toolkit. I think this installs package versions greater than those used by CUDA 11.1. To counteract this, I install the 11.1 toolkit on the host machine at the bottom of the script, which downgrades the versions installed earlier.
I have two pieces of evidence that this is what's happening:

  1. I no longer get the package mismatch error.
  2. It fails if I don't use the --allow-downgrades flag in my call to apt at the top of the script, with a message about not being able to satisfy dependencies when installing CUDA toolkit 11.1 (see the sketch after this list).
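Roughly, the flag just gets added to the apt call in question, e.g. (a sketch with illustrative package names, not the exact ones from my attached script):

    # Let apt move packages back to older versions when newer ones are already
    # installed; without this, the later CUDA 11.1 toolkit install fails with
    # unsatisfiable-dependency errors.
    sudo apt-get install -y --allow-downgrades nvidia-docker2 nvidia-container-toolkit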

Is there a better way to achieve this?

@cboettig (Member) commented:

Thanks @RobJY, and apologies: I missed that the script was not part of the commit but just attached to the comment. Very interesting! If it works, it works, but honestly this just deepens the puzzle for me -- this script is all being run outside the container, on the host machine, yeah?

I don't think you should need a toolkit of any kind installed on the host machine, though? Maybe I'm just confused. All software above the kernel level (i.e. not including the drivers) should be handled in the containerized layer? Or does that assumption go out the window when we're using nvidia-docker?

I've also had to mess with apt-pinning to get the nvidia container runtime to play nicely with upgraded drivers, as described here: pop-os/pop#1708 (comment). Maybe this is related to, or an alternative version of, your script above?
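(The pinning I mean is the standard apt preferences mechanism; a rough sketch, with illustrative packages and a placeholder version:)

    # Illustrative only: pin the container runtime packages so a routine
    # apt upgrade doesn't move them out from under the installed driver.
    printf '%s\n' \
      'Package: nvidia-container-toolkit libnvidia-container1 libnvidia-container-tools' \
      'Pin: version 1.9.*' \
      'Pin-Priority: 1001' \
      | sudo tee /etc/apt/preferences.d/nvidia-container-runtime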

But I still see containers that have been up a while start to throw the error Failed to initialize NVML: Unknown Error whenever I try GPU operations (e.g. nvidia-smi) until I restart them, even when (I think) I haven't updated the drivers, so I don't really know why that is... In these cases the host can still run nvidia-smi no problem; it's just the container that is stuck. Is that the situation you were seeing as well?

@RobJY (Contributor, Author) commented Jul 15, 2022

Apologies about the attached script. I did have it hidden at the bottom of the post. Yes, we're running the attached script on the host machine.

Yeah, I think you're right that the nvidia-docker install loads some packages on the host and those end up at a higher version than the ones in the container. This makes sense to me because, without any pinning, it's going to install the newest versions available. I also tried pinning versions earlier, but NVIDIA doesn't seem to distribute older versions of packages for very long, and when I went back to use the pinned script the older versions of the packages were unavailable.

I haven't seen the Failed to initialize NVML: Unknown Error message before. I wonder if something running on the GPU got the card(s) into a bad state. I was just experimenting with some ML code that can make use of multiple cards. We have two cards of one type and two of another type on the machine. When I tried running the code with one or two cards it was fine, but when I tried three or four it crashed, in such a way that we weren't able to use those cards until the machine was rebooted. I saw behavior like this at a previous job where we were doing GPU development: when a card crashes, it seems to take a reboot to make it usable again. I guess it makes sense that it can't report the nature of the error in that case, just that the card is not responsive.

@cboettig (Member) commented:

Thanks! I now recall a bit more: I think the error I get is related to cgroup2 versions in docker, discussed here: NVIDIA/nvidia-container-toolkit#251

It seems if I run with explicit --device binding, my long-running containers don't lose the GPU state and stay up, e.g. like so:

docker run --name rstudio \
    --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia0:/dev/nvidia0 \
    --env-file private_env.sh \
    --memory 50g \
    --network cirrus-net \
    --gpus 0 \
    rocker/ml-verse:4.2.1-cuda11.1

Feels like that shouldn't be necessary, but it needs some software to be fixed upstream first...

@RobJY (Contributor, Author) commented Jul 15, 2022

That's interesting. It looks like a nice clean solution to me. It probably protects you from other potential issues as well.
