Add documentation for the systemd nvidia-container-toolkit.service #203
container-toolkit/cdi-support.md
Outdated
@@ -29,7 +30,68 @@ CDI also improves the compatibility of the NVIDIA container stack with certain f

- You installed an NVIDIA GPU Driver.

### Procedure
### Automatic CDI Specification Generation (v1.18.0+)
### Automatic CDI Specification Generation (v1.18.0+)
### Automatic CDI Specification Generation
done
container-toolkit/cdi-support.md
Outdated
As of NVIDIA Container Toolkit `v1.18.0`, the CDI specification is automatically generated and updated by a systemd service called `nvidia-cdi-refresh`. This service:

- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
- Monitors changes to driver-related files (`modules.dep` and `modules.dep.bin`) to trigger regeneration
I think this can be removed.
done
container-toolkit/cdi-support.md
Outdated
- Automatically generates the CDI specification at `/var/run/cdi/nvidia.yaml` when NVIDIA drivers are installed or upgraded
- Monitors changes to driver-related files (`modules.dep` and `modules.dep.bin`) to trigger regeneration
- Runs automatically on system boot to ensure the specification is up to date
- Is enabled and started automatically during fresh installations
- Is enabled and started automatically during fresh installations
It should be enabled at all times.
container-toolkit/cdi-support.md
Outdated
The automatic service handles the following scenarios:
- First-time driver installation
- Driver upgrades
- System reboots
Is this not just saying the same as the points above?
container-toolkit/cdi-support.md
Outdated
- Runtime topology changes (MIG, hot-plug, module unload/load)
- Manual configuration changes

For these scenarios, you may still need to manually regenerate the CDI specification.
Can we link to the instructions here?
container-toolkit/cdi-support.md
Outdated
The automatic CDI refresh service does not handle:
- Driver removal (the CDI file is intentionally preserved)
- Runtime topology changes (MIG, hot-plug, module unload/load)
- Manual configuration changes
What are "Manual configuration changes"?
container-toolkit/cdi-support.md
Outdated
```{note}
The automatic CDI refresh service does not handle:
- Driver removal (the CDI file is intentionally preserved)
- Runtime topology changes (MIG, hot-plug, module unload/load)
Let's just replace this with "MIG device reconfiguration".
```bash
# /etc/nvidia-container-toolkit/cdi-refresh.env
NVIDIA_CTK_DEBUG=1
# Add other nvidia-ctk environment variables as needed
Let's reference the `nvidia-ctk cdi generate` command to give a list of envvars.
Is a `systemctl daemon-reload` needed to reload the env?
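On the reload question, a minimal sketch, assuming the service unit loads this file through an `EnvironmentFile=` directive (not confirmed in this thread). systemd reads environment files shortly before it starts the service process, so an edit to the file contents should be picked up on the next start of the service; `systemctl daemon-reload` is for changes to the unit files themselves. The helper name `show_env_assignments` is hypothetical and only prints what the file would contribute once comments and blank lines are ignored.

```shell
#!/bin/sh
# Hypothetical helper: print the KEY=VALUE lines of an environment file,
# skipping blank lines and '#' comment lines.
show_env_assignments() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Example usage on a host with the toolkit installed (path taken from the diff):
#   show_env_assignments /etc/nvidia-container-toolkit/cdi-refresh.env
#   sudo systemctl start nvidia-cdi-refresh.service   # the next run reads the edited file
```

If the unit turns out to set its variables with `Environment=` inside the unit file instead, a `daemon-reload` would be needed after editing it.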
container-toolkit/cdi-support.md
Outdated
# Check service status
$ sudo systemctl status nvidia-cdi-refresh.path
$ sudo systemctl status nvidia-cdi-refresh.service
Let's add example output here
# View service logs
$ sudo journalctl -u nvidia-cdi-refresh.service
```
Split these into separate sections.
+1
We typically split up the sample command and the example output so it's clearer what folks need to run.
container-toolkit/cdi-support.md
Outdated
# Manually trigger CDI generation
$ sudo systemctl start nvidia-cdi-refresh.service
Question: Why are we listing this here and below in the manual generation section?
Yeah, on second thought, I decided to remove this.
container-toolkit/cdi-support.md
Outdated
# Manually trigger CDI generation
$ sudo systemctl start nvidia-cdi-refresh.service

# Enable/disable the automatic monitoring
Let's put this in a separate code block.
Also: What does "automatic monitoring" mean?
container-toolkit/cdi-support.md
Outdated
$ sudo systemctl start nvidia-cdi-refresh.service

# Enable/disable the automatic monitoring
$ sudo systemctl enable nvidia-cdi-refresh.path
Should we add the `--now` flag?
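For reference, a minimal sketch of what `--now` changes: `systemctl enable --now` enables the unit and starts it in the same step, and `systemctl disable --now` disables it and stops it. The unit name is the one from the diff; the function names and the `SYSTEMCTL` override are hypothetical, added only so the commands can be previewed without touching the system.

```shell
#!/bin/sh
# SYSTEMCTL is a hypothetical override for dry runs, e.g. SYSTEMCTL=echo.
# When unset, the functions call the real "sudo systemctl".

# Enable the path unit for future boots and start it immediately.
enable_cdi_refresh() {
    ${SYSTEMCTL:-sudo systemctl} enable --now nvidia-cdi-refresh.path
}

# Disable the path unit and stop it immediately.
disable_cdi_refresh() {
    ${SYSTEMCTL:-sudo systemctl} disable --now nvidia-cdi-refresh.path
}
```

Without `--now`, `enable` only takes effect the next time the unit would be started, for example at boot, which is why the flag matters for a path unit that should begin watching right away.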
done
$ sudo journalctl -u nvidia-cdi-refresh.service
```

### Manual CDI Specification Generation
Given the automatic process now writes to `/var/run/cdi/nvidia.yaml`, we should update the instructions below to output to this file instead.
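A minimal sketch of what the updated manual step could look like, assuming the `--output` flag of `nvidia-ctk cdi generate` and the `nvidia-ctk cdi list` subcommand are used the same way as in the existing manual-generation instructions. The function names and the `NVIDIA_CTK` override are hypothetical, added only so the commands can be previewed without running them.

```shell
#!/bin/sh
# NVIDIA_CTK is a hypothetical override for dry runs, e.g. NVIDIA_CTK=echo.

# Regenerate the CDI specification at the path the automatic service writes to.
# An alternative output path can be passed as the first argument.
generate_cdi_spec() {
    ${NVIDIA_CTK:-sudo nvidia-ctk} cdi generate --output="${1:-/var/run/cdi/nvidia.yaml}"
}

# List the device names found in the generated specifications.
list_cdi_devices() {
    ${NVIDIA_CTK:-nvidia-ctk} cdi list
}
```

Writing to the same file as the automatic service keeps a single specification in place, so a later automatic refresh overwrites the manual one instead of leaving two specifications side by side.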
done
container-toolkit/cdi-support.md
Outdated
@@ -77,6 +139,8 @@ You must generate a new CDI specification after any of the following changes:
- You use a location such as `/var/run/cdi` that is cleared on boot.

A configuration change can occur when MIG devices are created or removed, or when the driver is upgraded.

**Note**: With NVIDIA Container Toolkit v1.18.0+, the automatic CDI refresh service handles most of these scenarios automatically.
**Note**: With NVIDIA Container Toolkit v1.18.0+, the automatic CDI refresh service handles most of these scenarios automatically.
**Note**: As of NVIDIA Container Toolkit v1.18.0, the automatic CDI refresh service handles most of these scenarios automatically.
0f836b0 to 62cb885
container-toolkit/cdi-support.md
Outdated
### Manual CDI Specification Generation

If you need to manually generate a CDI specification (for example, after MIG configuration changes or when using older versions), follow this procedure:
If you need to manually generate a CDI specification (for example, after MIG configuration changes or when using older versions), follow this procedure:
If you need to manually generate a CDI specification, for example, after MIG configuration changes or if you are using a Container Toolkit version before v1.18.0, follow this procedure:
just a nit to future proof the docs a bit.
container-toolkit/cdi-support.md
Outdated
# Add other nvidia-ctk environment variables as needed
```

For a complete list of available environment variables, refer to the `nvidia-ctk cdi generate` command documentation.
Is there a link to these docs?
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
62cb885 to e3ff414
Pull Request Overview
This PR adds detailed documentation for the automatic and manual CDI specification generation via a new systemd service (`nvidia-cdi-refresh`), fixes typos in example workflows and release notes, and cleans up formatting in the install guide.
- Introduces “Automatic CDI Specification Generation” with systemd path/service usage and customization instructions
- Updates manual CDI generation path to `/var/run/cdi/nvidia.yaml` and revises related examples
- Fixes spelling errors and removes an extraneous list marker
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| container-toolkit/sample-workload.md | Corrected typo in “configure” |
| container-toolkit/release-notes.md | Fixed “containers” spelling |
| container-toolkit/install-guide.md | Removed stray bullet |
| container-toolkit/cdi-support.md | Added systemd service documentation and updated paths |
Comments suppressed due to low confidence (1)
container-toolkit/cdi-support.md:33
- The PR title refers to `nvidia-container-toolkit.service`, but this section describes the `nvidia-cdi-refresh` units. Consider aligning the title and documentation to reference the correct service names or clarifying why they differ.
### Automatic CDI Specification Generation
@a-mccarthy / @elezar PTAL
No description provided.