
Consider adding CDI --device support to the singularity native runtime #1395

Open
dtrudg opened this issue Mar 1, 2023 · 1 comment
Labels: enhancement (New feature or request), maybe (Features / changes that may be implemented in the future, depending on need & resources)

dtrudg commented Mar 1, 2023

Is your feature request related to a problem? Please describe.

SingularityCE doesn't currently support the new CDI standard for making hardware devices available in containers.

https://github.com/container-orchestrated-devices/container-device-interface
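
For context, CDI devices are referred to by fully-qualified names of the form vendor.com/class=name (e.g. nvidia.com/gpu=0), described by JSON/YAML spec files that vendors or admins place under /etc/cdi or /var/run/cdi. A CDI-aware runtime resolves the requested names against those specs and applies the resulting edits (device nodes, bind mounts, environment variables, hooks) to the container's OCI runtime spec. A minimal consumer sketch in Go, following the pattern in the project's documentation (exact package paths and signatures may differ between CDI releases):

```go
// Sketch only, based on the consumer example in the CDI project docs.
package device

import (
	"fmt"

	cdi "github.com/container-orchestrated-devices/container-device-interface/pkg/cdi"
	oci "github.com/opencontainers/runtime-spec/specs-go"
)

// injectCDIDevices resolves fully-qualified CDI device names (e.g.
// "nvidia.com/gpu=0") against the spec files under /etc/cdi and /var/run/cdi,
// and applies the resulting container edits to an OCI runtime spec.
func injectCDIDevices(spec *oci.Spec, devices ...string) error {
	unresolved, err := cdi.GetRegistry().InjectDevices(spec, devices...)
	if err != nil {
		return fmt.Errorf("CDI injection failed (unresolved: %v): %w", unresolved, err)
	}
	return nil
}
```

A native-mode --device flag could, in principle, be a thin wrapper around this resolution step.
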

The native singularity runtime can currently:

  • Perform a naive binding approach to expose NVIDIA GPUs in a container, via --nv.
  • Perform a naive binding approach to expose AMD GPUs in a container, via --rocm.
  • Call out to nvidia-container-cli to perform container setup for NVIDIA GPUs via --nvccli.

The --nv and --rocm naive binding approach cannot support a range of valuable functionality, such as masking specific GPUs, exposing only subsets of device functionality in a container, etc.

The --nvccli approach places trust in the vendor's nvidia-container-cli tool. In addition, NVIDIA are moving to CDI as the preferred method for container setup, so continuing to rely on nvidia-container-cli for direct, non-CDI container setup may result in a lack of support for future GPU features.

The existing mechanisms are vendor-specific, but we'd like to support e.g. Intel GPUs (#1094) without having to add more vendor-specific code and flags.

Describe the solution you'd like

  • Singularity should consider offering a --device flag that allows devices to be configured in the container via CDI, in the native runtime mode.
  • Singularity should, if this can be done compatibly, consider moving to a CDI-config-driven model for the --nv and --rocm naive binding, to avoid a proliferation of methods of exposing GPUs in the container (a rough sketch of a generated spec follows this list).
  • We should consider ongoing support, or deprecation, of the --nvccli approach. --nvccli support has some exceptions for specific workflows, and if CDI is the future it may be wise to avoid encouraging user reliance on --nvccli.
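
On the second point above, the information that the naive --nv binding gathers today (NVIDIA device nodes plus the userspace driver libraries located via the existing nvliblist/ldconfig lookup) could instead be written out as a CDI spec generated at container setup time, so that a single CDI code path performs the actual injection. A rough sketch, using locally defined types that mirror the published CDI JSON schema (the upstream specs-go package provides equivalent types); the device-node and library paths are illustrative and would be discovered per node:

```go
package device

import (
	"encoding/json"
	"os"
)

// Minimal local types mirroring the CDI JSON schema
// (cdiVersion, kind, devices[].containerEdits, ...).
type cdiSpec struct {
	Version string      `json:"cdiVersion"`
	Kind    string      `json:"kind"`
	Devices []cdiDevice `json:"devices"`
}

type cdiDevice struct {
	Name           string   `json:"name"`
	ContainerEdits cdiEdits `json:"containerEdits"`
}

type cdiEdits struct {
	Env         []string     `json:"env,omitempty"`
	DeviceNodes []cdiDevNode `json:"deviceNodes,omitempty"`
	Mounts      []cdiMount   `json:"mounts,omitempty"`
}

type cdiDevNode struct {
	Path string `json:"path"`
}

type cdiMount struct {
	HostPath      string   `json:"hostPath"`
	ContainerPath string   `json:"containerPath"`
	Options       []string `json:"options,omitempty"`
}

// writeNaiveNvidiaSpec writes a CDI spec that roughly reproduces what the
// current --nv naive binding does: expose the NVIDIA device nodes and bind
// the userspace driver libraries into the container. Paths are examples;
// a real implementation would discover them on the host at setup time.
func writeNaiveNvidiaSpec(path string) error {
	spec := cdiSpec{
		Version: "0.5.0",
		Kind:    "nvidia.com/gpu",
		Devices: []cdiDevice{{
			Name: "all",
			ContainerEdits: cdiEdits{
				DeviceNodes: []cdiDevNode{
					{Path: "/dev/nvidiactl"},
					{Path: "/dev/nvidia-uvm"},
					{Path: "/dev/nvidia0"},
				},
				Mounts: []cdiMount{{
					HostPath:      "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
					ContainerPath: "/usr/lib/x86_64-linux-gnu/libcuda.so.1",
					Options:       []string{"ro", "nosuid", "nodev", "bind"},
				}},
			},
		}},
	}

	data, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		return err
	}
	// e.g. written under /var/run/cdi at container setup time, so no static
	// per-node configuration is required beyond the driver installation.
	return os.WriteFile(path, data, 0o644)
}
```

Generating the spec on the fly, rather than shipping a static one per node, is what would preserve the "driver installation only" property noted under Additional context.
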

Additional context

We are committed to adding CDI support to the --oci runtime mode (#1394).

As the main focus for SingularityCE 4.0 development is the --oci mode, we might wish to avoid large changes to the native runtime in this cycle unless there is compelling support for them from users, so we need to gauge interest in CDI. Are users who are interested in CDI also likely to move to the 4.0 --oci mode, or do they want CDI support in the native singularity runtime mode?

Some users are reluctant to add additional tooling, and to manage extra system configuration, for GPUs on their systems. This is particularly the case where a cluster with heterogeneous GPU hardware (between nodes) is in operation. While singularity's --nv and --rocm naive binding is simple and doesn't offer e.g. GPU masking, it requires no node-specific configuration beyond driver installation. We should be conscious not to break this if we switch to a CDI approach for --nv / --rocm.

dtrudg added the enhancement and maybe labels on Mar 1, 2023

ArangoGutierrez (Contributor) commented:

++
