DRA: Handle permanent driver allocation failures #5322

Open
@nojnhuh

Enhancement Description

DRA drivers may encounter errors such that the devices allocated by kube-scheduler for a pod can never be successfully returned from the NodePrepareResources gRPC call to the driver. Currently, pods in that state are retried indefinitely, wasting CPU cycles in the kubelet and the DRA driver. This proposal describes a way to break that cycle of retries that are known to fail.

/sig node
/wg device-management
/assign @nojnhuh
/cc @pohly @lauralorenz @SergeyKanzhelev

  • One-line enhancement description (can be used as a release note): DRA: Handle permanent driver allocation failures
  • Kubernetes Enhancement Proposal: [in progress]
  • Discussion Link:
  • Primary contact (assignee): @nojnhuh
  • Responsible SIGs: SIG Node
  • Enhancement target (which target equals which milestone):
    • Alpha release target (x.y): 1.34
    • Beta release target (x.y):
    • Stable release target (x.y):
  • Alpha
    • KEP (k/enhancements) update PR(s):
    • Code (k/k) update PR(s):
    • Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

Metadata

Labels

sig/node: Categorizes an issue or PR as relevant to SIG Node.
wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status

📋 Backlog

Status

Triage
