--gpus flag #360

Closed · dtrudg opened this issue Oct 11, 2021 · 6 comments
Labels: enhancement (New feature or request)

Comments

dtrudg (Member) commented Oct 11, 2021

Describe the solution you'd like

The --gpus flag for Docker's NVIDIA runtime configures the nvidia-container-cli setup, so that e.g.

--gpus "all,capabilities=utility"

is equivalent to setting NVIDIA_VISIBLE_DEVICES=all and NVIDIA_DRIVER_CAPABILITIES=utility.

It would be an advantage to be able to use --gpus rather than requiring the individual environment variables to be set. A matching SINGULARITY_GPUS env var would be appropriate.

Note that with #361 we would read the NVIDIA_ env vars from the container instead of the host, so --gpus / SINGULARITY_GPUS are required to override.

Edit - as noted in the discussion below, because we aren't yet defaulting to --nvccli, it wouldn't be very friendly for --gpus not to apply to SingularityCE's own GPU setup. We would need to handle device binding / masking in that case - but we could ignore the capabilities portion, and perhaps only support numeric GPU IDs, not MIG UUIDs etc.
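For illustration, here is a minimal Go sketch of how a --gpus / SINGULARITY_GPUS value like the one above could be mapped onto the two NVIDIA_ env vars. It assumes a simplified "<devices>[,capabilities=<caps>]" form rather than Docker's full --gpus grammar (counts, quoting, device=UUID, etc.), and the helper name and default capability set are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// gpusToNvidiaEnv translates a simplified --gpus / SINGULARITY_GPUS value into
// values for NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES.
func gpusToNvidiaEnv(spec string) (visible, caps string) {
	var devices []string
	caps = "utility" // placeholder default; the real default is a design decision

	for _, part := range strings.Split(spec, ",") {
		part = strings.TrimSpace(part)
		switch {
		case part == "":
			// ignore empty tokens
		case strings.HasPrefix(part, "capabilities="):
			caps = strings.TrimPrefix(part, "capabilities=")
		case strings.HasPrefix(part, "device="):
			devices = append(devices, strings.TrimPrefix(part, "device="))
		default:
			devices = append(devices, part) // "all" or a numeric GPU index
		}
	}
	if len(devices) == 0 {
		devices = []string{"all"}
	}
	return strings.Join(devices, ","), caps
}

func main() {
	visible, caps := gpusToNvidiaEnv("all,capabilities=utility")
	fmt.Printf("NVIDIA_VISIBLE_DEVICES=%s NVIDIA_DRIVER_CAPABILITIES=%s\n", visible, caps)
}
```

Under --nvccli these values would simply be placed in the container environment for nvidia-container-cli to act on; under plain --nv the device list would instead have to feed the binding / masking logic discussed below.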

dtrudg (Member, Author) commented Oct 11, 2021

This is still worth pursuing prior to #361 (comment)

A --gpus flag / SINGULARITY_GPUS env var could override the host NVIDIA_xxx env vars for this purpose.

vsoch (Contributor) commented May 5, 2022

@dtrudg this looks like low-hanging fruit, so maybe I can help! Is this still desired, and if so, could you give a quick summary of what the implementation should do? E.g.,

  1. add a --gpus flag to run/exec/shell
  2. given the presence of the flag, set those NVIDIA env vars?
  3. and the same flag should be triggered with SINGULARITY_GPUS?

And this

Note that with #361 we would read the NVIDIA_ env vars from the container instead of the host, so --gpus / SINGULARITY_GPUS are required to override.

Should this be tackled after this first set of env vars is added, or at the same time? And if at the same time, could we chat about what that means? I'm not familiar with the current interaction with NVIDIA GPUs!

dtrudg (Member, Author) commented May 5, 2022

Hi @vsoch - it is, unfortunately, not as easy as it first seems.

For the case where the experimental --nvccli flag is used, and nvidia-container-cli sets up the GPUs in the container environment, we can just set the correct NVIDIA_xxx vars. nvidia-container-cli will then do the right thing based on those.

The catch is that --nvccli is still not our default in 3.10... most people will be using --nv only, where Singularity code is responsible for binding GPU devices into the container. There we would need to make sure our own code can interpret the value of a --gpus flag, and then mask or bind GPU devices from/into the container as appropriate. Currently we bind all devices, so there's a fair amount of logic that needs to go into this.
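To make that concrete, here is a rough Go sketch of the kind of selection logic the plain --nv path would need: bind only the control devices plus the device nodes whose numeric index was requested, rather than every /dev/nvidia* node. The paths and the bind step are simplified, and this is not SingularityCE's actual device-handling code:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// nvidiaDevicePaths returns the device nodes to bind for a comma-separated
// list of numeric GPU indices, e.g. "0,2". "all" (or an empty value) falls
// back to binding every GPU node, which is what --nv does today.
func nvidiaDevicePaths(gpus string) ([]string, error) {
	// Control devices needed alongside the per-GPU nodes.
	paths := []string{"/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/nvidia-uvm-tools"}

	if gpus == "" || gpus == "all" {
		matches, err := filepath.Glob("/dev/nvidia[0-9]*")
		if err != nil {
			return nil, err
		}
		return append(paths, matches...), nil
	}

	for _, tok := range strings.Split(gpus, ",") {
		idx, err := strconv.Atoi(strings.TrimSpace(tok))
		if err != nil {
			return nil, fmt.Errorf("only numeric GPU indices are handled here: %q", tok)
		}
		dev := fmt.Sprintf("/dev/nvidia%d", idx)
		if _, err := os.Stat(dev); err != nil {
			return nil, fmt.Errorf("requested GPU %d not present: %w", idx, err)
		}
		paths = append(paths, dev)
	}
	return paths, nil
}

func main() {
	paths, err := nvidiaDevicePaths("0")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println(paths) // device nodes that would be bound into the container
}
```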

dtrudg (Member, Author) commented May 5, 2022

I should say explicitly... if you'd like to take this on further... please reach out on Slack or similar and I can demonstrate some of the issues to you. I don't want to put you off completely here :-)

elezar (Contributor) commented Feb 20, 2023

@dtrudg from my perspective, I would like to argue against implementing a --gpus flag and would advocate for supporting CDI devices through the --device flag (or similar) as is done for podman.

dtrudg (Member, Author) commented Mar 1, 2023

Agreed. Seems clear that this would be better implemented via --device with CDI in #1394 and potentially #1395 ... given that things are moving in that direction generally across runtimes.
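For reference on the CDI direction: a fully-qualified CDI device name has the form vendor/class=name, e.g. nvidia.com/gpu=0. Below is a hand-rolled Go sketch of splitting such a --device value; a real implementation would use the CDI project's own Go packages rather than this illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// splitCDIDevice breaks a fully-qualified CDI device name ("vendor/class=name")
// into its parts. Validation here is minimal compared to the CDI spec.
func splitCDIDevice(qualified string) (vendor, class, name string, err error) {
	vendorClass, name, ok := strings.Cut(qualified, "=")
	if !ok || name == "" {
		return "", "", "", fmt.Errorf("not a qualified CDI device name: %q", qualified)
	}
	vendor, class, ok = strings.Cut(vendorClass, "/")
	if !ok || vendor == "" || class == "" {
		return "", "", "", fmt.Errorf("not a qualified CDI device name: %q", qualified)
	}
	return vendor, class, name, nil
}

func main() {
	vendor, class, name, err := splitCDIDevice("nvidia.com/gpu=0")
	if err != nil {
		panic(err)
	}
	fmt.Println(vendor, class, name) // nvidia.com gpu 0
}
```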

dtrudg closed this as not planned Mar 1, 2023