KEP-5381: Mutable PersistentVolume Node Affinity #5382
base: master
Conversation
huww98 commented on Jun 6, 2025
- One-line PR description: adding new KEP
- Issue link: Mutable PersistentVolume Node Affinity #5381
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: huww98. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable.
2. Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement.
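For point 1, a rough Go sketch of the shape the validation change could take; the function, variable, and error wording here are hypothetical and not the actual apiserver code, which would gate the behavior behind a feature gate:

```go
package validation

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// mutableNodeAffinityEnabled stands in for a feature-gate lookup; in the real
// apiserver this would be a utilfeature gate check, not a package variable.
var mutableNodeAffinityEnabled = true

// validateNodeAffinityUpdate mimics the PV update validation path: with the
// gate off, spec.nodeAffinity must be unchanged; with it on, a changed value
// is accepted (structural validation of the new value still applies elsewhere).
func validateNodeAffinityUpdate(oldPV, newPV *v1.PersistentVolume) error {
	if mutableNodeAffinityEnabled {
		return nil
	}
	if !reflect.DeepEqual(oldPV.Spec.NodeAffinity, newPV.Spec.NodeAffinity) {
		return fmt.Errorf("spec.nodeAffinity: field is immutable")
	}
	return nil
}
```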
Points 2 and 3, though not directly covered by the title, are connected as upstream and downstream changes. They're currently combined, but if needed we'll split them into a separate KEP later.
Is this a new field, or can it fit into the mutable parameters map? Actually, never mind, there is the proto change below.
These modifications can be expressed by `VolumeAttributesClass` in Kubernetes.
But sometimes a modification to a volume comes with a change to its accessibility, such as:
1. migration of data from one zone to regional storage;
2. enabling features that are not supported by all the client nodes.
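As a concrete illustration of the kind of object involved, here is a `VolumeAttributesClass` built with the Go API types (storage `v1beta1`; it was `v1alpha1` in older releases). The driver name and the `type` parameter are made up, since VAC parameters are driver-specific:

```go
package main

import (
	"fmt"

	storagev1beta1 "k8s.io/api/storage/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A VolumeAttributesClass asking a (hypothetical) driver to turn a zonal
	// volume into a regional one; the parameter keys are driver-specific.
	vac := storagev1beta1.VolumeAttributesClass{
		ObjectMeta: metav1.ObjectMeta{Name: "regional-fast"},
		DriverName: "example.csi.vendor.com",
		Parameters: map[string]string{"type": "regional-ssd"},
	}
	fmt.Printf("%s -> driver %s, parameters %v\n", vac.Name, vac.DriverName, vac.Parameters)
}
```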
Though it deviates from nodeAffinity's original design (geography related), this scenario is currently useful for users, hence we include it.
How do we prevent a new update to the nodeAffinity from being incompatible with the current node?
Or maybe we have some rules where you can only widen the nodeAffinity but not shrink it.
I think we just don't enforce this. We ask SPs not to disrupt the running workload. With this promise, we allow the SP to return any new nodeAffinity, even if it is not compatible with the currently published nodes. The affinity will be ignored for running pods, just like `Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`.
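To make the "ignored during execution" semantics concrete, here is a deliberately simplified Go sketch of the check performed against PV node affinity at scheduling time. It only handles the `In` operator and is not the real matcher used by the scheduler; the point is that the affinity is evaluated against node labels only when a pod is scheduled, so updating it never affects already-running pods:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// pvMatchesNode is a simplified stand-in for the scheduler's volume binding
// check: terms are ORed, expressions within a term are ANDed, and only the
// "In" operator is handled here.
func pvMatchesNode(pv *v1.PersistentVolume, nodeLabels map[string]string) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return true // no affinity means the volume is accessible from every node
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		termMatches := true
		for _, req := range term.MatchExpressions {
			if req.Operator != v1.NodeSelectorOpIn {
				termMatches = false // simplified: other operators are not handled
				break
			}
			found := false
			for _, val := range req.Values {
				if nodeLabels[req.Key] == val {
					found = true
					break
				}
			}
			if !found {
				termMatches = false
				break
			}
		}
		if termMatches {
			return true
		}
	}
	return false
}

func main() {
	pv := &v1.PersistentVolume{}
	pv.Spec.NodeAffinity = &v1.VolumeNodeAffinity{
		Required: &v1.NodeSelector{NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchExpressions: []v1.NodeSelectorRequirement{{
				Key:      "topology.kubernetes.io/zone",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"zone-a", "zone-b"},
			}},
		}}},
	}
	fmt.Println(pvMatchesNode(pv, map[string]string{"topology.kubernetes.io/zone": "zone-a"})) // true
	fmt.Println(pvMatchesNode(pv, map[string]string{"topology.kubernetes.io/zone": "zone-c"})) // false
}
```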
* or being rejected from nodes that actually can access the volume, getting stuck in a `Pending` state.
By making the `PersistentVolume.spec.nodeAffinity` field mutable,
we give storage providers a chance to propagate the latest accessibility requirements to the scheduler,
How do we measure the consistency guarantee of the "latest" requirements?
That's a good question, but after some discussion we don't seem to have a suitable solution, especially since there is also an informer cache. We noticed that #4876 doesn't mention this problem either, so can we just ignore it?
No, we can't just ignore it. If I understood Sunny's question correctly, it's trying to poke at potential race conditions or issues which could lead to unexpected results. For example, (in the current proposal) if two ModifyVolume calls are made in parallel with different accessibility requirements, how do we know what "latest" is? In other words, what are the mechanisms that ensure the actual state of the world reflects the desired state?
These sorts of questions were addressed in KEP 4876 (#4875 (comment)).
I think this is fine. The latest requirement is the one corresponding to the `mutable_parameters` desired by the CO, passed in `ControllerModifyVolumeRequest`. In Kubernetes, external-resizer will save the VAC name to the PV status after ModifyVolume finishes. It can save the returned requirement to the PV just before that.
If anything wrong happens (race, crash, etc.), the CO can always issue another ControllerModifyVolume request to fetch the latest topology. This means we should require the SP to return `accessible_topology` if supported. I will update the KEP.
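A rough Go sketch of that ordering; the function and helper names (`applyModifyVolumeResult`, `markVACApplied`) are hypothetical and this is not the actual external-resizer code, but it shows why a crash between the two steps simply leads to a retry that converges on the SP's latest answer:

```go
package resizer

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markVACApplied is a placeholder for whatever bookkeeping the controller
// already does after ModifyVolume finishes (e.g. recording the applied VAC name).
func markVACApplied(ctx context.Context, c kubernetes.Interface, pv *corev1.PersistentVolume, vacName string) error {
	return nil
}

// applyModifyVolumeResult persists the accessibility requirement returned by
// ControllerModifyVolume to the PV first, and only then records the VAC as
// applied. If the process crashes in between, the modification is still
// considered in progress and is retried, fetching the topology again.
func applyModifyVolumeResult(ctx context.Context, c kubernetes.Interface,
	pv *corev1.PersistentVolume, newAffinity *corev1.VolumeNodeAffinity, vacName string) error {
	if newAffinity != nil {
		pv = pv.DeepCopy()
		pv.Spec.NodeAffinity = newAffinity
		updated, err := c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
		if err != nil {
			return err
		}
		pv = updated
	}
	return markVACApplied(ctx, c, pv, vacName)
}
```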
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. |
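For the SP side of this table, a minimal sketch (assuming the Go CSI bindings that include `ControllerModifyVolume`; the driver-internal `conflictsWithAttachedNodes` check is hypothetical) of returning the FAILED_PRECONDITION error:

```go
package driver

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controller struct{}

// conflictsWithAttachedNodes is a hypothetical driver-internal check for
// "would the new attributes shrink accessibility below the nodes the volume
// is currently attached to?".
func (c *controller) conflictsWithAttachedNodes(volumeID string, params map[string]string) bool {
	return false
}

// ControllerModifyVolume illustrates the error shape from the table above:
// 9 FAILED_PRECONDITION when the requested change conflicts with attached
// nodes, leaving the CO free to detach from the conflicting nodes and retry.
func (c *controller) ControllerModifyVolume(ctx context.Context, req *csi.ControllerModifyVolumeRequest) (*csi.ControllerModifyVolumeResponse, error) {
	if c.conflictsWithAttachedNodes(req.GetVolumeId(), req.GetMutableParameters()) {
		return nil, status.Error(codes.FailedPrecondition,
			"modification would make the volume inaccessible to some attached nodes")
	}
	// ... apply the modification with the storage backend ...
	return &csi.ControllerModifyVolumeResponse{}, nil
}
```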
Is this an infeasible error? What is the retry/rollback story for users?
Kubernetes does not perform any automatic correction for this; it is exposed as an Event like other errors.
Currently, external-resizer should just retry with exponential backoff. In the future, maybe we can retry after any ControllerUnpublishVolume succeeds, perhaps once external-resizer and external-attacher are combined into the same process.
Users should just roll back the VAC to the previous value to cancel the request.
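A small sketch of that rollback from the user's perspective: pointing the PVC back at the previous VolumeAttributesClass cancels the pending modification. The client wiring and names here are illustrative:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rollbackVAC sets spec.volumeAttributesClassName on the PVC back to the
// previously applied class, which cancels the in-flight modification.
func rollbackVAC(ctx context.Context, c kubernetes.Interface, namespace, pvcName, previousVAC string) error {
	pvc, err := c.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	pvc = pvc.DeepCopy()
	pvc.Spec.VolumeAttributesClassName = &previousVAC
	_, err = c.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}
```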
CRI or CNI may require updating that component before the kubelet.
-->
## Production Readiness Review Questionnaire
This requires filling in the PRR; once it is filled in, I'm happy to review it.
Thanks, we would like to delay this KEP to v1.35.
ACK - I'm marking this as deferred in our tracking board then.
```protobuf
  // from a given node when scheduling workloads.
  // This field is OPTIONAL. If it is not specified, the CO SHOULD assume
  // the topology is not changed by this modify volume request.
  repeated Topology accessible_topology = 5;
```
This should be an alpha_field.
We will extend the CSI specification to add:
```protobuf
message ControllerModifyVolumeResponse {
  option (alpha_message) = true;
```
We are moving VolumeAttributesClass to GA in 1.34. That means we need to move ControllerModifyVolume in the CSI spec from alpha to GA and cut a CSI spec release before the 1.34 release.
I'm wondering if we can move this message to GA while adding a new alpha_field.
I think we have two options to address this issue:
- Target the entire KEP at v1.35
- Isolate the VAC part, limiting mutable PV node affinity to v1.34; we still have one week before enhancement freeze

Maybe we can talk about this at the sig-storage meeting.