Add documentation for the systemd nvidia-container-toolkit.service #203

Open
wants to merge 1 commit into
base: main
from

Conversation

ArangoGutierrez
Collaborator

No description provided.

@ArangoGutierrez ArangoGutierrez self-assigned this Jun 27, 2025
Copilot

This comment was marked as outdated.

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-203

@@ -29,7 +30,68 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f

- You installed an NVIDIA GPU Driver.

### Procedure
### Automatic CDI Specification Generation (v1.18.0+)
Member

Suggested change
### Automatic CDI Specification Generation (v1.18.0+)
### Automatic CDI Specification Generation

Collaborator Author

done

As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:

- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
- Monitors changes to driver-related files (`modules.dep` and `modules.dep.bin`) to trigger regeneration
Member

I think this can be removed.

Collaborator Author

done

- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
- Monitors changes to driver-related files (`modules.dep` and `modules.dep.bin`) to trigger regeneration
- Runs automatically on system boot to ensure the specification is up to date
- Is enabled and started automatically during fresh installations
Member

Suggested change
- Is enabled and started automatically during fresh installations

It should be enabled at all times.

Comment on lines 42 to 45
The automatic service handles the following scenarios:
- First-time driver installation
- Driver upgrades
- System reboots
Member

Is this not just saying the same as the points above?

- Runtime topology changes (MIG, hot-plug, module unload/load)
- Manual configuration changes

For these scenarios, you may still need to manually regenerate the CDI specification.
Member

Can we link to the instructions here?
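
For reference, the manual regeneration this hunk refers to could look like the sketch below. It assumes the `/var/run/cdi/nvidia.yaml` output path used elsewhere in this PR. The `regenerate_cdi_spec` helper name and the `CDI_SPEC` variable are hypothetical, and the guard only skips the step on hosts where `nvidia-ctk` is not installed.

```shell
#!/bin/sh
# Sketch: manually regenerate the CDI specification, for example after
# MIG devices are reconfigured. CDI_SPEC is a hypothetical variable name;
# the default path is the one the automatic service writes to in this PR.
CDI_SPEC="${CDI_SPEC:-/var/run/cdi/nvidia.yaml}"

regenerate_cdi_spec() {
    # Skip cleanly when the toolkit CLI is not installed.
    if ! command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk not found; skipping CDI generation" >&2
        return 1
    fi
    # Write the specification, then list the device names it defines.
    sudo nvidia-ctk cdi generate --output="$CDI_SPEC" && nvidia-ctk cdi list
}

regenerate_cdi_spec || true
```

Keeping the path in one variable means the manual and automatic flows stay pointed at the same file if the location changes again.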

The automatic CDI refresh service does not handle:
- Driver removal (the CDI file is intentionally preserved)
- Runtime topology changes (MIG, hot-plug, module unload/load)
- Manual configuration changes
Member

What are "Manual configuration changes"?

```{note}
The automatic CDI refresh service does not handle:
- Driver removal (the CDI file is intentionally preserved)
- Runtime topology changes (MIG, hot-plug, module unload/load)
Member

Let's just replace this with "MIG device reconfiguration".

```bash
# /etc/nvidia-container-toolkit/cdi-refresh.env
NVIDIA_CTK_DEBUG=1
# Add other nvidia-ctk environment variables as needed
Member

Let's reference the nvidia-ctk cdi generate command to give a list of envvars.

Member

Is a systemctl daemon-reload needed to reload the env?

Comment on lines 77 to 79
# Check service status
$ sudo systemctl status nvidia-cdi-refresh.path
$ sudo systemctl status nvidia-cdi-refresh.service
Member

Let's add example output here

Comment on lines +88 to +118
# View service logs
$ sudo journalctl -u nvidia-cdi-refresh.service
```
Member

Split these into separate sections.

Collaborator

+1

We typically split up the sample command and the example output so it's clearer what folks need to run.
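
As a rough illustration of the status commands from this hunk kept apart from any output, here is a small helper. The `check_cdi_units` name is hypothetical, the unit names are the ones in this PR, and no output is reproduced because it depends on the host.

```shell
#!/bin/sh
# Sketch: report the state of the two units named in this PR.
# check_cdi_units is a hypothetical helper name.
check_cdi_units() {
    # Degrade gracefully on hosts without systemd tooling.
    if ! command -v systemctl >/dev/null 2>&1; then
        echo "systemctl not available on this host"
        return 0
    fi
    for unit in nvidia-cdi-refresh.path nvidia-cdi-refresh.service; do
        # is-active prints the unit state and exits non-zero when the
        # unit is not active, so the exit status is ignored here.
        state="$(systemctl is-active "$unit" 2>/dev/null || true)"
        echo "$unit: ${state:-unknown}"
    done
}

check_cdi_units
```

For the full log history, `sudo journalctl -u nvidia-cdi-refresh.service` from the hunk above remains the command to run.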

Comment on lines 81 to 82
# Manually trigger CDI generation
$ sudo systemctl start nvidia-cdi-refresh.service
Member

Question: Why are we listing this here and below in the manual generation section?

Collaborator Author

Yeah, on second thought I decided to remove this.

# Manually trigger CDI generation
$ sudo systemctl start nvidia-cdi-refresh.service

# Enable/disable the automatic monitoring
Member

Let's put this in a separate code block.

Also: What does "automatic monitoring" mean?

$ sudo systemctl start nvidia-cdi-refresh.service

# Enable/disable the automatic monitoring
$ sudo systemctl enable nvidia-cdi-refresh.path
Member

Should we add the --now flag?

Collaborator Author

done
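
A minimal sketch of what the `--now` variant looks like in practice, written as a dry run so it can be read and tried without touching the host. The `run` and `set_cdi_watch` helpers and the `DRY_RUN` variable are hypothetical; `systemctl enable --now` and `systemctl disable --now` are standard systemd options that also start or stop the unit in the same step.

```shell
#!/bin/sh
# Sketch: enable or disable the path unit that triggers CDI regeneration.
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

set_cdi_watch() {
    case "$1" in
        # --now starts the unit in addition to enabling it at boot.
        on)  run sudo systemctl enable --now nvidia-cdi-refresh.path ;;
        # --now stops the unit in addition to disabling it at boot.
        off) run sudo systemctl disable --now nvidia-cdi-refresh.path ;;
        *)   echo "usage: set_cdi_watch on|off" >&2; return 2 ;;
    esac
}

set_cdi_watch on
```

Without `--now`, `enable` only takes effect at the next boot or explicit start, which is presumably why the flag was suggested.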

$ sudo journalctl -u nvidia-cdi-refresh.service
```

### Manual CDI Specification Generation
Member

Given the automatic process now writes to /var/run/cdi/nvidia.yaml we should update the instructions below to output to this file instead.

Collaborator Author

done

@@ -77,6 +139,8 @@ You must generate a new CDI specification after any of the following changes:
- You use a location such as `/var/run/cdi` that is cleared on boot.

A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.

**Note**: With NVIDIA Container Toolkit v1.18.0+, the automatic CDI refresh service handles most of these scenarios automatically.
Member

Suggested change
**Note**: With NVIDIA Container Toolkit v1.18.0+, the automatic CDI refresh service handles most of these scenarios automatically.
**Note**: As of NVIDIA Container Toolkit v1.18.0, the automatic CDI refresh service handles most of these scenarios automatically.

Copilot

This comment was marked as outdated.


### Manual CDI Specification Generation

If you need to manually generate a CDI specification (for example, after MIG configuration changes or when using older versions), follow this procedure:
Collaborator

Suggested change
If you need to manually generate a CDI specification (for example, after MIG configuration changes or when using older versions), follow this procedure:
If you need to manually generate a CDI specification, for example, after MIG configuration changes or if you are using a Container Toolkit version before v1.18.0, follow this procedure:

Just a nit to future-proof the docs a bit.

# Add other nvidia-ctk environment variables as needed
```

For a complete list of available environment variables, refer to the `nvidia-ctk cdi generate` command documentation.
Collaborator

Is there a link to these docs?
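
On the environment file in this hunk, a hedged sketch of how a setting might be added. It assumes the service loads `/etc/nvidia-container-toolkit/cdi-refresh.env` through a systemd `EnvironmentFile=` directive, which this PR does not show. The `ENV_FILE` variable is hypothetical and defaults to a scratch path so the sketch runs without root.

```shell
#!/bin/sh
# Sketch: add a debug setting to the CDI refresh environment file.
# ENV_FILE is a hypothetical variable; on a real host it would point at
# /etc/nvidia-container-toolkit/cdi-refresh.env and need root to write.
ENV_FILE="${ENV_FILE:-${TMPDIR:-/tmp}/cdi-refresh.env}"

mkdir -p "$(dirname "$ENV_FILE")"

# Append the variable only if the file does not already set it.
if ! grep -q '^NVIDIA_CTK_DEBUG=' "$ENV_FILE" 2>/dev/null; then
    echo 'NVIDIA_CTK_DEBUG=1' >> "$ENV_FILE"
fi

# Assumption: the unit reads this file via EnvironmentFile=. systemd reads
# such files each time the service starts, so re-running the service picks
# up the change; daemon-reload is only needed after editing unit files.
#   sudo systemctl restart nvidia-cdi-refresh.service
cat "$ENV_FILE"
```

If the unit turns out to load its environment some other way, the note about `daemon-reload` would need to be revisited.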

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez requested a review from Copilot June 30, 2025 12:53
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds detailed documentation for the automatic and manual CDI specification generation via a new systemd service (nvidia-cdi-refresh), fixes typos in example workflows and release notes, and cleans up formatting in the install guide.

  • Introduces “Automatic CDI Specification Generation” with systemd path/service usage and customization instructions
  • Updates manual CDI generation path to /var/run/cdi/nvidia.yaml and revises related examples
  • Fixes spelling errors and removes an extraneous list marker

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
container-toolkit/sample-workload.md Corrected typo in “configure”
container-toolkit/release-notes.md Fixed “containers” spelling
container-toolkit/install-guide.md Removed stray bullet
container-toolkit/cdi-support.md Added systemd service documentation and updated paths
Comments suppressed due to low confidence (1)

container-toolkit/cdi-support.md:33

  • The PR title refers to nvidia-container-toolkit.service, but this section describes the nvidia-cdi-refresh units. Consider aligning the title and documentation to reference the correct service names or clarifying why they differ.
### Automatic CDI Specification Generation

@ArangoGutierrez
Collaborator Author

@a-mccarthy / @elezar PTAL

3 participants