KEP-5381: Mutable PersistentVolume Node Affinity #5382
base: master
Conversation
huww98 commented on Jun 6, 2025
- One-line PR description: adding new KEP
- Issue link: Mutable PersistentVolume Node Affinity #5381
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: huww98. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
1. Change APIServer validation to allow `PersistentVolume.spec.nodeAffinity` to be mutable.
2. Change CSI Specification to allow `ControllerModifyVolume` to return a new accessibility requirement.
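For point 1, a rough Go sketch of the shape the validation change could take; the function, variable, and error wording here are hypothetical and not the actual apiserver code, which would gate the behavior behind a feature gate:

```go
package validation

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// mutableNodeAffinityEnabled stands in for a feature-gate lookup; in the real
// apiserver this would be a utilfeature gate check, not a package variable.
var mutableNodeAffinityEnabled = true

// validateNodeAffinityUpdate mimics the PV update validation path: with the
// gate off, spec.nodeAffinity must be unchanged; with it on, a changed value
// is accepted (structural validation of the new value still applies elsewhere).
func validateNodeAffinityUpdate(oldPV, newPV *v1.PersistentVolume) error {
	if mutableNodeAffinityEnabled {
		return nil
	}
	if !reflect.DeepEqual(oldPV.Spec.NodeAffinity, newPV.Spec.NodeAffinity) {
		return fmt.Errorf("spec.nodeAffinity: field is immutable")
	}
	return nil
}
```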
Points 2 and 3, though not directly covered by the title, are connected as upstream and downstream changes. They're currently combined, but if needed we'll split them into a separate KEP later.
Is this a new field, or can it fit into the mutable parameters map? Actually, never mind, there is the proto change below.
These modifications can be expressed by `VolumeAttributesClass` in Kubernetes.
But sometimes a modification to a volume comes with a change to its accessibility, such as:
1. migration of data from one zone to regional storage;
2. enabling features that are not supported by all the client nodes.
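As a concrete illustration of the kind of object involved, here is a `VolumeAttributesClass` built with the Go API types (storage `v1beta1`; it was `v1alpha1` in older releases). The driver name and the `type` parameter are made up, since VAC parameters are driver-specific:

```go
package main

import (
	"fmt"

	storagev1beta1 "k8s.io/api/storage/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// A VolumeAttributesClass asking a (hypothetical) driver to turn a zonal
	// volume into a regional one; the parameter keys are driver-specific.
	vac := storagev1beta1.VolumeAttributesClass{
		ObjectMeta: metav1.ObjectMeta{Name: "regional-fast"},
		DriverName: "example.csi.vendor.com",
		Parameters: map[string]string{"type": "regional-ssd"},
	}
	fmt.Printf("%s -> driver %s, parameters %v\n", vac.Name, vac.DriverName, vac.Parameters)
}
```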
Though it deviates from nodeAffinity's original design (geography related), this scenario is currently useful for users, hence we include it.
How do we prevent a new update to the nodeAffinity from being incompatible with the current node?
Or maybe we have some rules where you can only widen the nodeAffinity but not shrink it.
I think we just don't enforce this. We ask SPs not to disrupt the running workload. With this promise, we allow the SP to return any new nodeAffinity, even if it is not compatible with the currently published nodes. The affinity will be ignored for running pods, just like `Pod.spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution`.
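To make the "ignored during execution" semantics concrete, here is a deliberately simplified Go sketch of the check performed against PV node affinity at scheduling time. It only handles the `In` operator and is not the real matcher used by the scheduler; the point is that the affinity is evaluated against node labels only when a pod is scheduled, so updating it never affects already-running pods:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// pvMatchesNode is a simplified stand-in for the scheduler's volume binding
// check: terms are ORed, expressions within a term are ANDed, and only the
// "In" operator is handled here.
func pvMatchesNode(pv *v1.PersistentVolume, nodeLabels map[string]string) bool {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return true // no affinity means the volume is accessible from every node
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		termMatches := true
		for _, req := range term.MatchExpressions {
			if req.Operator != v1.NodeSelectorOpIn {
				termMatches = false // simplified: other operators are not handled
				break
			}
			found := false
			for _, val := range req.Values {
				if nodeLabels[req.Key] == val {
					found = true
					break
				}
			}
			if !found {
				termMatches = false
				break
			}
		}
		if termMatches {
			return true
		}
	}
	return false
}

func main() {
	pv := &v1.PersistentVolume{}
	pv.Spec.NodeAffinity = &v1.VolumeNodeAffinity{
		Required: &v1.NodeSelector{NodeSelectorTerms: []v1.NodeSelectorTerm{{
			MatchExpressions: []v1.NodeSelectorRequirement{{
				Key:      "topology.kubernetes.io/zone",
				Operator: v1.NodeSelectorOpIn,
				Values:   []string{"zone-a", "zone-b"},
			}},
		}}},
	}
	fmt.Println(pvMatchesNode(pv, map[string]string{"topology.kubernetes.io/zone": "zone-a"})) // true
	fmt.Println(pvMatchesNode(pv, map[string]string{"topology.kubernetes.io/zone": "zone-c"})) // false
}
```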
* or being rejected from nodes that actually can access the volume, getting stuck in a `Pending` state.
By making the `PersistentVolume.spec.nodeAffinity` field mutable,
we give storage providers a chance to propagate the latest accessibility requirements to the scheduler,
How do we measure the consistency guarantee of the "latest" requirements?
That's a good question, but after some discussion we don't seem to have a suitable solution, especially since there is also an informer cache. We noticed that #4876 doesn't mention this problem either, so can we just ignore it?
No, we can't just ignore it. If I understood Sunny's question correctly, it's trying to poke at potential race conditions or issues which could lead to unexpected results. For example, (in the current proposal) if two ModifyVolume calls are made in parallel with different accessibility requirements, how do we know what "latest" is? In other words, what are the mechanisms that ensure the actual state of the world reflects the desired state?
These sorts of questions were addressed in KEP 4876 (#4875 (comment)).
I think this is fine. The latest requirement is the one corresponding to the `mutable_parameters` desired by the CO, passed in `ControllerModifyVolumeRequest`. In Kubernetes, external-resizer will save the VAC name to the PV status after ModifyVolume finishes. It can save the returned requirement to the PV just before that.
If anything wrong happens (race, crash, etc.), the CO can always issue another ControllerModifyVolume request to fetch the latest topology. This means we should require the SP to return `accessible_topology` if supported. I will update the KEP.
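A rough Go sketch of that ordering; the function and helper names (`applyModifyVolumeResult`, `markVACApplied`) are hypothetical and this is not the actual external-resizer code, but it shows why a crash between the two steps simply leads to a retry that converges on the SP's latest answer:

```go
package resizer

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// markVACApplied is a placeholder for whatever bookkeeping the controller
// already does after ModifyVolume finishes (e.g. recording the applied VAC name).
func markVACApplied(ctx context.Context, c kubernetes.Interface, pv *corev1.PersistentVolume, vacName string) error {
	return nil
}

// applyModifyVolumeResult persists the accessibility requirement returned by
// ControllerModifyVolume to the PV first, and only then records the VAC as
// applied. If the process crashes in between, the modification is still
// considered in progress and is retried, fetching the topology again.
func applyModifyVolumeResult(ctx context.Context, c kubernetes.Interface,
	pv *corev1.PersistentVolume, newAffinity *corev1.VolumeNodeAffinity, vacName string) error {
	if newAffinity != nil {
		pv = pv.DeepCopy()
		pv.Spec.NodeAffinity = newAffinity
		updated, err := c.CoreV1().PersistentVolumes().Update(ctx, pv, metav1.UpdateOptions{})
		if err != nil {
			return err
		}
		pv = updated
	}
	return markVACApplied(ctx, c, pv, vacName)
}
```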
| Condition | gRPC Code | Description | Recovery Behavior |
|-----------|-----------|-------------|-------------------|
| Topology conflict | 9 FAILED_PRECONDITION | Indicates that the CO has requested a modification that would make the volume inaccessible to some already attached nodes. | Caller MAY detach the volume from the nodes that are in conflict and retry. |
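For the SP side of this table, a minimal sketch (assuming the Go CSI bindings that include `ControllerModifyVolume`; the driver-internal `conflictsWithAttachedNodes` check is hypothetical) of returning the FAILED_PRECONDITION error:

```go
package driver

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type controller struct{}

// conflictsWithAttachedNodes is a hypothetical driver-internal check for
// "would the new attributes shrink accessibility below the nodes the volume
// is currently attached to?".
func (c *controller) conflictsWithAttachedNodes(volumeID string, params map[string]string) bool {
	return false
}

// ControllerModifyVolume illustrates the error shape from the table above:
// 9 FAILED_PRECONDITION when the requested change conflicts with attached
// nodes, leaving the CO free to detach from the conflicting nodes and retry.
func (c *controller) ControllerModifyVolume(ctx context.Context, req *csi.ControllerModifyVolumeRequest) (*csi.ControllerModifyVolumeResponse, error) {
	if c.conflictsWithAttachedNodes(req.GetVolumeId(), req.GetMutableParameters()) {
		return nil, status.Error(codes.FailedPrecondition,
			"modification would make the volume inaccessible to some attached nodes")
	}
	// ... apply the modification with the storage backend ...
	return &csi.ControllerModifyVolumeResponse{}, nil
}
```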
Is this an infeasible error? What is the retry/rollback story for users?
Kubernetes does not perform any automatic correction for this; it is exposed as an Event like other errors.
Currently, external-resizer should just retry with exponential backoff. In the future, maybe we can retry after any ControllerUnpublishVolume succeeds, perhaps once external-resizer and external-attacher are combined into the same process.
Users should just roll back the VAC to the previous value to cancel the request.
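A small sketch of that rollback from the user's perspective: pointing the PVC back at the previous VolumeAttributesClass cancels the pending modification. The client wiring and names here are illustrative:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rollbackVAC sets spec.volumeAttributesClassName on the PVC back to the
// previously applied class, which cancels the in-flight modification.
func rollbackVAC(ctx context.Context, c kubernetes.Interface, namespace, pvcName, previousVAC string) error {
	pvc, err := c.CoreV1().PersistentVolumeClaims(namespace).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	pvc = pvc.DeepCopy()
	pvc.Spec.VolumeAttributesClassName = &previousVAC
	_, err = c.CoreV1().PersistentVolumeClaims(namespace).Update(ctx, pvc, metav1.UpdateOptions{})
	return err
}
```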
CRI or CNI may require updating that component before the kubelet.
-->
## Production Readiness Review Questionnaire
This requires filling in the PRR; once it is filled in, I'm happy to review it.
Thanks, we would like to delay this KEP to v1.35.
ACK - I'm marking this as deferred in our tracking board then.
```protobuf
  // from a given node when scheduling workloads.
  // This field is OPTIONAL. If it is not specified, the CO SHOULD assume
  // the topology is not changed by this modify volume request.
  repeated Topology accessible_topology = 5;
```
This should be an alpha_field.
We will extend the CSI specification to add:
```protobuf
message ControllerModifyVolumeResponse {
  option (alpha_message) = true;
```
We are moving VolumeAttributesClass to GA in 1.34. That means we need to move ControllerModifyVolume in the CSI spec from alpha to GA and cut a CSI spec release before the 1.34 release.
I'm wondering if we can move this message to GA while adding a new alpha_field.
I think we have two options to address this issue:
- Target the entire KEP at v1.35
- Isolate the VAC part, limiting mutable PV node affinity to v1.34; we still have one week before enhancement freeze

Maybe we can talk about this at the sig-storage meeting.