
KEP-5381: Mutable PersistentVolume Node Affinity #5382


Open
huww98 wants to merge 2 commits into base: master

Conversation


@huww98 huww98 commented Jun 6, 2025

  • One-line PR description: adding new KEP
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 6, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign xing-yang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 6, 2025

1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable.

2. Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement.

Points 2 and 3, though not strictly covered by the title, are connected as upstream and downstream changes. They are currently combined, but if needed, we will split them into a separate KEP later.

Contributor

@sunnylovestiramisu sunnylovestiramisu Jun 6, 2025


Is this a new field, or can it fit into the mutable parameters map? Actually, never mind, there is the proto change below.

These modifications can be expressed by `VolumeAttributesClass` in Kubernetes.
But sometimes a modification to a volume comes with a change to its accessibility, such as:
1. migration of data from a single zone to regional storage;
2. enabling features that are not supported by all the client nodes.

Though it deviates from nodeAffinity's original, geography-related design, this scenario is currently useful for users, hence we include it.

Contributor


How do we prevent an update to the nodeAffinity that is not compatible with the current node?

Contributor


Or maybe we have some rules saying you can only widen the nodeAffinity, not shrink it.

Author


I think we just don't enforce this. We ask SPs not to disrupt the running workload. With this promise, we allow the SP to return any new nodeAffinity, even if it is not compatible with the currently published nodes. The affinity will be ignored for running pods, just like `Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`.
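
To make the widening case concrete, here is a minimal sketch (not from the KEP; the zone names and helper function are made up) of replacing a PV's node affinity with a wider selector covering two zones instead of one. Only future scheduling consults the new selector; pods already running keep using the volume:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// widenedZoneAffinity builds a PV node affinity allowing the given zones.
// Zone names are hypothetical; the point is that the modification simply
// replaces spec.nodeAffinity with a new selector.
func widenedZoneAffinity(zones ...string) *corev1.VolumeNodeAffinity {
	return &corev1.VolumeNodeAffinity{
		Required: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "topology.kubernetes.io/zone",
					Operator: corev1.NodeSelectorOpIn,
					Values:   zones,
				}},
			}},
		},
	}
}

func main() {
	pv := &corev1.PersistentVolume{}
	pv.Spec.NodeAffinity = widenedZoneAffinity("zone-a")           // before ControllerModifyVolume
	pv.Spec.NodeAffinity = widenedZoneAffinity("zone-a", "zone-b") // after: widened to two zones
	fmt.Println(pv.Spec.NodeAffinity.Required.NodeSelectorTerms[0].MatchExpressions[0].Values)
}
```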

* or being rejected from nodes that actually can access the volume, getting stuck in a `Pending` state.

By making the `PersistentVolume.spec.nodeAffinity` field mutable,
we give storage providers a chance to propagate the latest accessibility requirements to the scheduler,
Contributor


How do we measure the consistency guarantee of the "latest" requirements?


That's a good question, but after some discussion we don't seem to have a suitable solution, especially since there is also an informer cache, and we noticed that #4876 doesn't mention this problem either, so can we just ignore it?

Member


No, we can't just ignore it. If I understood Sunny's question correctly, it's trying to poke at potential race conditions or issues which could lead to unexpected results. For example (in the current proposal), if two ModifyVolume calls are made in parallel with different accessibility requirements, how do we know what "latest" is? In other words, what are the mechanisms that ensure the actual state of the world reflects the desired state?

These sorts of questions were addressed in KEP 4876; see #4875 (comment).

Author


I think this is fine. The latest requirement is the one corresponding to the `mutable_parameters` desired by the CO, as passed in `ControllerModifyVolumeRequest`. In Kubernetes, external-resizer will save the VAC name to the PV status after ModifyVolume finishes. It can save the returned requirement to the PV just before that.

If anything goes wrong (a race, a crash, etc.), the CO can always issue another ControllerModifyVolume request to fetch the latest topology. This means we should require the SP to return `accessible_topology` whenever it is supported. I will update the KEP.
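
A rough sketch of that ordering, with made-up local types standing in for the real CSI messages and external-resizer code (this is not the actual implementation): persist the returned topology to the PV first, and only then record the VAC as applied, so a crash in between is recovered by simply re-issuing ControllerModifyVolume:

```go
package main

import (
	"context"
	"fmt"
)

// Local stand-ins for the CSI response and the PV; the real types live in the
// CSI spec and k8s.io/api. Field and helper names below are illustrative only.
type Topology map[string]string

type ModifyVolumeResponse struct {
	AccessibleTopology []Topology // proposed field; SP returns it whenever supported
}

type PersistentVolume struct {
	NodeAffinity []Topology // stands in for spec.nodeAffinity
	CurrentVAC   string     // stands in for status.currentVolumeAttributesClassName
}

// modifyAndRecord shows the proposed ordering: (1) call the SP, (2) write the
// returned topology to the PV, (3) mark the VAC as applied. A crash between
// (2) and (3) is handled by the next reconcile re-issuing ControllerModifyVolume,
// which returns the latest topology again, so the PV converges.
func modifyAndRecord(ctx context.Context, pv *PersistentVolume, vac string,
	modify func(ctx context.Context) (*ModifyVolumeResponse, error)) error {

	resp, err := modify(ctx) // (1) ControllerModifyVolume
	if err != nil {
		return err // retried with exponential backoff by the sidecar
	}
	if resp.AccessibleTopology != nil {
		pv.NodeAffinity = resp.AccessibleTopology // (2) persist topology on the PV
	}
	pv.CurrentVAC = vac // (3) only now record the VAC as applied
	return nil
}

func main() {
	pv := &PersistentVolume{NodeAffinity: []Topology{{"topology.kubernetes.io/zone": "zone-a"}}}
	_ = modifyAndRecord(context.Background(), pv, "regional-ssd",
		func(ctx context.Context) (*ModifyVolumeResponse, error) {
			return &ModifyVolumeResponse{AccessibleTopology: []Topology{
				{"topology.kubernetes.io/region": "region-1"},
			}}, nil
		})
	fmt.Println(pv.NodeAffinity, pv.CurrentVAC)
}
```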


| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. |
Contributor

@sunnylovestiramisu sunnylovestiramisu Jun 6, 2025


Is this an infeasible error? What is the retry/rollback story for users?

Author


Kubernetes does not perform any automatic correction for this; it is exposed as an Event, like other errors.

Currently, external-resizer should just retry with exponential backoff. In the future, maybe we can retry after any ControllerUnpublishVolume succeeds, perhaps once external-resizer and external-attacher are combined into the same process.

The user should just roll back the VAC to the previous value to cancel the request.
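
For example, a client-go sketch of that rollback (the PVC, namespace, and class names are placeholders): pointing `spec.volumeAttributesClassName` back at the previously applied class is what cancels the pending modification:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// rollbackVAC points the PVC back at the previously applied
// VolumeAttributesClass so external-resizer stops retrying the failed
// ControllerModifyVolume. All names passed in are placeholders.
func rollbackVAC(client kubernetes.Interface, namespace, pvcName, previousVAC string) error {
	ctx := context.Background()
	pvc, err := client.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	pvc.Spec.VolumeAttributesClassName = &previousVAC
	_, err = client.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	if err := rollbackVAC(kubernetes.NewForConfigOrDie(config), "default", "data-pvc", "zonal-ssd"); err != nil {
		panic(err)
	}
}
```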

CRI or CNI may require updating that component before the kubelet.
-->

## Production Readiness Review Questionnaire
Member


This requires filling PRR - once filled-in, I'm happy to review it.

Author


Thanks, we would like to delay this KEP to v1.35.

Member


ACK - I'm marking this as deferred in our tracking board then.

```protobuf
// from a given node when scheduling workloads.
// This field is OPTIONAL. If it is not specified, the CO SHOULD assume
// the topology is not changed by this modify volume request.
repeated Topology accessible_topology = 5;
```
Contributor


This should be an alpha_field.

We will extend CSI specification to add:
```protobuf
message ControllerModifyVolumeResponse {
  option (alpha_message) = true;
  // ...
}
```
Contributor


We are moving VolumeAttributesClass to GA in 1.34. That means we need to move ControllerModifyVolume in the CSI spec from alpha to GA and cut a CSI spec release before the 1.34 release.
I'm wondering if we can move this message to GA while adding a new alpha_field.


I think we have two options to address this issue:

  1. Target the entire KEP in v1.35
  2. Isolate the VAC part, keeping mutable PV node affinity targeted at v1.34; we still have one week before enhancement freeze

Maybe we can talk about this at the sig-storage meeting.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 13, 2025
@wojtek-t wojtek-t self-assigned this Jun 13, 2025