DRA: ResourceSlice Status for Device Health Tracking

### Enhancement Description

Devices such as GPUs sometimes fail. During the unconference at the recent Maintainers Summit in London, we asked what Kubernetes is still missing for AI/ML workloads, and handling GPU failure was high on the list.

Currently, the DRA driver is responsible for excluding unhealthy devices from the ResourceSlice. However, because an excluded device is no longer visible in the ResourceSlice, only the driver is well positioned to mitigate those failures. For example, a failure can often be repaired just by resetting the GPU device, without even rebooting the node. Different organizations are building different tooling to manage this mitigation. A single API for surfacing device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.

This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. Maintenance tooling could use this information to attempt mitigation, alert administrators, or even reschedule larger multi-node jobs into new placement groups.
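To make the intended consumer side concrete, here is a minimal sketch of how maintenance tooling might act on published device health. Nothing in it comes from a KEP: the `Health` states, the `DeviceStatus` and `ResourceSliceStatus` shapes, the `plan_actions` helper, and the `ResetRequired` reason string are all hypothetical placeholders, and the real API shape is still to be designed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Health(Enum):
    # Hypothetical health states; the proposal does not define any yet.
    HEALTHY = "Healthy"
    UNHEALTHY = "Unhealthy"
    UNKNOWN = "Unknown"


@dataclass
class DeviceStatus:
    # Hypothetical per-device entry a driver could publish.
    name: str
    health: Health
    reason: str = ""


@dataclass
class ResourceSliceStatus:
    # Hypothetical status block attached to a ResourceSlice.
    devices: list[DeviceStatus] = field(default_factory=list)


def plan_actions(status: ResourceSliceStatus, resettable_reasons: set[str]) -> dict[str, str]:
    """Map each non-healthy device to a mitigation: "reset" or "alert"."""
    actions: dict[str, str] = {}
    for dev in status.devices:
        if dev.health is Health.HEALTHY:
            continue
        # Only attempt a reset when the driver reported a reason known to be resettable.
        if dev.health is Health.UNHEALTHY and dev.reason in resettable_reasons:
            actions[dev.name] = "reset"
        else:
            actions[dev.name] = "alert"
    return actions


if __name__ == "__main__":
    status = ResourceSliceStatus(devices=[
        DeviceStatus("gpu-0", Health.HEALTHY),
        DeviceStatus("gpu-1", Health.UNHEALTHY, reason="ResetRequired"),
        DeviceStatus("gpu-2", Health.UNKNOWN),
    ])
    print(plan_actions(status, {"ResetRequired"}))
    # prints {'gpu-1': 'reset', 'gpu-2': 'alert'}
```

The point of the sketch is that the unhealthy device stays visible along with a reason, so tooling outside the driver can choose between a device reset, an alert, or rescheduling.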

/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu 

- One-line enhancement description (can be used as a release note): Enable DRA drivers to store device health and other device status in the ResourceSlice
- Kubernetes Enhancement Proposal: TBD
- Discussion Link: https://docs.google.com/document/d/1Zz_xhPemY28EqpcKSLPl-S7tWnOHPU_IDLVmZOoGS5k/edit?tab=t.0#bookmark=id.j08fcqbp6h8d
- Primary contact (assignee): @johnbelamaric 
- Responsible SIGs: Node
- Enhancement target (which target equals to which milestone):
  - Alpha release target (x.y): 1.34
  - Beta release target (x.y):
  - Stable release target (x.y):
- [ ] Alpha
  - [ ] KEP (`k/enhancements`) update PR(s):
  - [ ] Code (`k/k`) update PR(s):
  - [ ] Docs (`k/website`) update PR(s):



_Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently._
