DRA: ResourceSlice Status for Device Health Tracking #5283

Open
@johnbelamaric

Enhancement Description

Devices like GPUs sometimes fail. During the unconference at the recent Maintainers Summit in London, we asked what Kubernetes is still missing for AI/ML workloads. Handling GPU failure was high on the list.

Currently, it is the responsibility of the DRA driver to exclude unhealthy devices from the ResourceSlice. However, this means that only the driver is well positioned to mitigate those failures. For example, a failure can often be repaired just by resetting the GPU device, without even rebooting the node. Different organizations are building different tooling to manage this mitigation. A single API to surface device issues would help those efforts, and may even enable integration with existing tooling such as the pluggable Node Problem Detector.

This issue proposes a ResourceSlice.Status API that drivers can optionally use to publish health information about their devices. This can be used by maintenance tooling to attempt mitigation, or alert administrators, or even reschedule larger multi-node jobs into new placement groups.

/sig node
/wg device-management
/cc @pohly @klueska @SergeyKanzhelev @asm582 @tardieu

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Labels

sig/node: Categorizes an issue or PR as relevant to SIG Node.

wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status

📋 Backlog

Status

Pre-Alpha
