Enhancement Description
Devices like GPUs sometimes fail. During the unconference at the recent Maintainers Summit in London, we asked what Kubernetes is still missing for AI/ML workloads. Handling GPU failure was high on the list.
Currently, it is the responsibility of the DRA driver to exclude unhealthy devices from the ResourceSlice. However, this means that only the driver is well positioned to mitigate those failures. For example, a failure can often be repaired simply by resetting the GPU device, without even rebooting the node. Different organizations are building their own tooling to manage this mitigation. A single API for surfacing device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.
This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. Maintenance tooling could use this information to attempt mitigation, alert administrators, or even reschedule larger multi-node jobs into new placement groups.
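To make the idea concrete, here is a minimal Go sketch of how maintenance tooling might consume such a status. The API shape has not been designed yet (the KEP is still TBD), so every type, field, and function name below (`HealthState`, `DeviceStatus`, `ResourceSliceStatus`, `UnhealthyDevices`) is a hypothetical placeholder for illustration only and is not part of any existing Kubernetes API.

```go
package main

import "fmt"

// HealthState is a hypothetical enumeration of device health values.
// The issue does not define one; this is an assumption for illustration.
type HealthState string

const (
	Healthy   HealthState = "Healthy"
	Unhealthy HealthState = "Unhealthy"
	Unknown   HealthState = "Unknown"
)

// DeviceStatus is a hypothetical per-device entry that a DRA driver
// might publish alongside the devices it advertises.
type DeviceStatus struct {
	Name    string      // device name as listed in the ResourceSlice
	Health  HealthState // health as reported by the driver
	Message string      // optional human-readable detail
}

// ResourceSliceStatus is a hypothetical stand-in for the proposed
// ResourceSlice.Status; the real structure is still to be designed.
type ResourceSliceStatus struct {
	Devices []DeviceStatus
}

// UnhealthyDevices returns the names of all devices that the driver
// reported as Unhealthy, in the order they appear in the status.
func UnhealthyDevices(status ResourceSliceStatus) []string {
	var names []string
	for _, d := range status.Devices {
		if d.Health == Unhealthy {
			names = append(names, d.Name)
		}
	}
	return names
}

func main() {
	status := ResourceSliceStatus{Devices: []DeviceStatus{
		{Name: "gpu-0", Health: Healthy},
		{Name: "gpu-1", Health: Unhealthy, Message: "reset required"},
	}}
	// A maintenance tool could attempt a reset or alert an administrator here.
	for _, name := range UnhealthyDevices(status) {
		fmt.Println("needs mitigation:", name)
	}
}
```

Under these assumptions, the driver only reports state, and the decision about what to do (reset, alert, reschedule) stays with whatever tooling reads the status.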
/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu
- One-line enhancement description (can be used as a release note): Enable DRA drivers to store device health and other device status in the ResourceSlice
- Kubernetes Enhancement Proposal: TBD
- Discussion Link: https://docs.google.com/document/d/1Zz_xhPemY28EqpcKSLPl-S7tWnOHPU_IDLVmZOoGS5k/edit?tab=t.0#bookmark=id.j08fcqbp6h8d
- Primary contact (assignee): @johnbelamaric
- Responsible SIGs: Node
- Enhancement target (which target equals to which milestone):
  - Alpha release target (x.y): 1.34
  - Beta release target (x.y):
  - Stable release target (x.y):
- Alpha
  - KEP (`k/enhancements`) update PR(s):
  - Code (`k/k`) update PR(s):
  - Docs (`k/website`) update PR(s):
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.