Document that pods hanging in terminating if DRA driver is truly gone is WAI (re: 129402 discussion) #51012

Open
@pohly

Description

If the DRA driver is well and truly gone, then despite all the retry and reconciliation loops, a pod remains stuck in Terminating until its NodeUnprepareResources call completes without error. That is (currently) impossible without a connection between the kubelet and the driver.
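To make the blocking behavior concrete, here is a toy Go model of the termination gate. It is only a sketch and not kubelet code: `terminate`, `unprepareFunc`, and `errDriverUnavailable` are hypothetical names invented for illustration, and the retry count is bounded here only so the example finishes, whereas the pod described above stays in Terminating for as long as the call keeps failing.

```go
package main

import (
	"errors"
	"fmt"
)

// errDriverUnavailable is a hypothetical error standing in for
// "the kubelet has no connection to the DRA driver".
var errDriverUnavailable = errors.New("DRA driver unavailable")

// unprepareFunc models one NodeUnprepareResources attempt (hypothetical type).
type unprepareFunc func() error

// terminate retries unprepare up to maxAttempts times. It reports whether the
// pod could leave Terminating and how many attempts were made. The bound is an
// assumption of this sketch so that the example terminates.
func terminate(unprepare unprepareFunc, maxAttempts int) (done bool, attempts int) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		if err := unprepare(); err == nil {
			// Unprepare succeeded without error: the pod may finish terminating.
			return true, attempts
		}
	}
	// Every attempt failed: the pod is still Terminating.
	return false, maxAttempts
}

func main() {
	// Driver is truly gone: every attempt fails.
	gone := func() error { return errDriverUnavailable }

	// Driver comes back on the third attempt.
	calls := 0
	recovers := func() error {
		calls++
		if calls < 3 {
			return errDriverUnavailable
		}
		return nil
	}

	done, n := terminate(gone, 5)
	fmt.Printf("driver gone: done=%v after %d attempts\n", done, n)

	done, n = terminate(recovers, 5)
	fmt.Printf("driver back: done=%v after %d attempts\n", done, n)
}
```

In this model the only way out of the loop is a successful unprepare call, which mirrors why retries alone cannot help when the driver never returns.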

This is also true for networking plugins (as discussed in kubernetes/kubernetes#129402 (comment)) and for volumes/CSI drivers (kubernetes/kubernetes#129402 (comment)), which have external services that handle the cleanup asynchronously and sometimes without being tracked by the pod phase. Device Plugins don't have this issue, though they are at risk of leaving stuff lying around (per kubernetes/kubernetes#129402 (comment)).

This issue is to document this behavior as it pertains to DRA, describe why it is WAI (working as intended), and list the remediation steps available to a cluster administrator.

Labels

needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

wg/device-management: Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status

🏗 In progress
