
Target GA for recover from Volume Expansion failure feature #5353


Merged (5 commits, Jun 16, 2025)

Conversation

gnufied
Member

@gnufied gnufied commented May 29, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 29, 2025
@gnufied gnufied force-pushed the bump-recovery-ga branch 2 times, most recently from c94f4b5 to bd6bae1 Compare May 29, 2025 22:39
@gnufied gnufied force-pushed the bump-recovery-ga branch from bd6bae1 to d52c7d4 Compare May 29, 2025 22:51
@kannon92
Contributor

Shadowing @deads2k on this. I had one item that jumped out at me while reading the PRR.

Was this metric added?

Are there any missing metrics that would be useful to have to improve observability of this feature?
We are planning to add new counter metrics that will record success and failure of recovery operations. In cases where recovery fails, the counter will forever be increasing until an admin action resolves the error.

Tentative name of metric is - operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}

The reason for using the PV name as a label is that we do not expect this feature to be used very often in a cluster, so it should be okay to use the names of the PVs that were recovered this way.

@kannon92
Contributor

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

The answers were a bit vague, so we should revisit that section and comment on the scalability questions with more information.

@gnufied
Member Author

gnufied commented May 30, 2025

Was this metric added?

No, we didn't add the metric. I was discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I am not sure if it is a blocker for GA.

@gnufied
Member Author

gnufied commented May 30, 2025

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially yes", because the answer depends on whether a recovery was attempted.

@kannon92
Contributor

Was this metric added?

No, we didn't add the metric. I was discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I am not sure if it is a blocker for GA.

It would be worth cleaning up the KEP then. I think you mention using that metric in a few places in the PRR.

@kannon92
Contributor

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially yes", because the answer depends on whether a recovery was attempted.

The "Potentially yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

Maybe you can expand on what you mean by recovery.
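(For context, "recovery" in this KEP is the user-facing flow of reducing a PVC's requested size after an expansion fails. A hypothetical sketch of a PVC mid-recovery; the name, sizes, and status values below are illustrative only, not taken from the KEP:)

```yaml
# Hypothetical PVC after a failed expansion: the user originally asked for
# 100Gi, the backend could not satisfy it, and the request was lowered to
# 20Gi so the resize can complete at the smaller size.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example                       # made-up name
spec:
  resources:
    requests:
      storage: 20Gi                       # reduced from the failed 100Gi request
status:
  capacity:
    storage: 10Gi                         # size currently provisioned
  allocatedResources:
    storage: 100Gi                        # the failed, larger allocation
  allocatedResourceStatus:
    storage: ControllerResizeInfeasible   # illustrative status value
```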

@deads2k
Contributor

deads2k commented Jun 2, 2025

No, we didn't add the metric. I was just discussing with storage folks and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From operational point of view, I can't see a reason why will an admin be interested in this metric. There are already other resizing related metrics, which work well enough.

What is the alternative method we recommend to a cluster-admin to determine when resize operations are failing on a particular node and/or across all nodes?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 2, 2025
@gnufied
Member Author

gnufied commented Jun 3, 2025

What is the alternative method we recommend to a cluster-admin to determine when resize operations are failing on a particular node and/or across all nodes?

So there are csi_sidecar_operations_seconds{method_name="/csi.v1.Controller/ControllerExpandVolume"}, csi_operations_seconds{method_name="/csi.v1.Node/NodeExpandVolume"}, and storage_operation_duration_seconds{operation_name="volume_fs_resize"}, which already track these operations on both the controller and node side. The csi_xxx metrics track gRPC error codes, so if either the controller- or node-side expansion is failing, it will be recorded appropriately and a metric emitted.

Some of these metrics have evolved since the KEP was originally written (this KEP is now 5 years old); I will update the KEP with the newer metric names.
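(For illustration, an admin could watch for failing expansions with queries along these lines; the exact label sets depend on the sidecar and kubelet versions, so treat this as a sketch rather than copy-paste alerts:)

```promql
# Controller-side expansion failures, by gRPC error code (sketch)
sum by (grpc_status_code) (
  rate(csi_sidecar_operations_seconds_count{
    method_name="/csi.v1.Controller/ControllerExpandVolume",
    grpc_status_code!="OK"
  }[5m])
)

# Node-side expansion failures (sketch; assumes the kubelet-emitted metric)
sum by (driver_name) (
  rate(csi_operations_seconds_count{
    method_name="/csi.v1.Node/NodeExpandVolume",
    grpc_status_code!="OK"
  }[5m])
)
```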

@gnufied
Member Author

gnufied commented Jun 3, 2025

The "Potentially yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

I have reworded those, ptal.

### GA

- The feature has been extensively tested in CI and deployed with drivers in the real world. For example: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
- The test grid already has tests for the feature.
Contributor

Do we have both presubmits and periodics executing this feature?

Member

The tests are still running only in alpha jobs, even though the feature is already in beta. I am going to LGTM the KEP, but I will block feature-gate graduation until the tests run in regular e2e jobs.

Member Author

@gnufied gnufied Jun 16, 2025

Err, I misread the text. I will move the tests to the normal, non-feature-gated default jobs.

@kannon92
Contributor

kannon92 commented Jun 6, 2025

  • Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?
    We have not fully tested upgrade and rollback but as part of beta process we will have it tested.

For stable, this needs to be explained and documented how this was done.


- The feature has been extensively tested in CI and deployed with drivers in the real world. For example: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
- The test grid already has tests for the feature.

### Upgrade / Downgrade Strategy
Member

Can you update the version skew strategy section with the fixes we made?

Member Author

Added.

@kannon92
Contributor

kannon92 commented Jun 6, 2025

Reading the version skew section, this feature seems to have a really complicated rollout plan. Is this documented anywhere? In what order do you recommend turning on these feature gates? Should someone enable this in external-resizer first, and then enable it in kas, kubelet, and kcm?

What is the recommendation for rolling this feature out for operators?

external-resizer must have the feature gate enabled for this to work. The feature gate is still beta. Could we consider GAing this feature first and then promote the other feature gates?

Edit: never mind. You call out that this is actually bad.

So the recommendation here would be to turn this feature gate on in all Kubernetes components first, and then promote the feature gate in external-resizer after the skew is satisfied?

@gnufied
Member Author

gnufied commented Jun 9, 2025

So the recommendation here would be to turn this feature gate on in all Kubernetes components first, and then promote the feature gate in external-resizer after the skew is satisfied?

Yes, that is generally recommended, and I have added this in the latest commit: d6f5b11

Having said that, there is some fail-safety built into this. Older kubelets can work with a newer resizer up to a certain version, and external-resizer can work with a newer api-server, etc. Please see the version skew section for more details.
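(A sketch of that ordering using the usual `--feature-gates` flags; how the flags are plumbed into each component varies by distribution, so this is illustrative, not prescriptive:)

```shell
# 1. Enable the gate on the core components (and all kubelets) first.
kube-apiserver          --feature-gates=RecoverVolumeExpansionFailure=true ...
kube-controller-manager --feature-gates=RecoverVolumeExpansionFailure=true ...
kubelet                 --feature-gates=RecoverVolumeExpansionFailure=true ...

# 2. Only once the version skew is satisfied, enable it on the sidecar.
csi-resizer --feature-gates=RecoverVolumeExpansionFailure=true ...
```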

@kannon92
Contributor

kannon92 commented Jun 9, 2025

LGTM from PRR shadow standpoint. Thank you for the updates!

/assign @deads2k

@deads2k
Contributor

deads2k commented Jun 9, 2025

PRR lgtm

/approve

@gnufied
Member Author

gnufied commented Jun 10, 2025

/assign @jsafrane @xing-yang @msau42

Member

@jmickey jmickey left a comment

Small nit: I believe this should be nested under the Graduation Criteria section.

@gnufied gnufied force-pushed the bump-recovery-ga branch from d6f5b11 to 4e06845 Compare June 11, 2025 15:13
### Upgrade / Downgrade Strategy

Not Applicable
- For the case of an older kubelet running with a newer resizer, the kubelet must handle the newer fields introduced by this feature even if the feature gate is not enabled. Having said that, if the kubelet is older and has no notion of this feature at all while `api-server` and `external-resizer` are newer, it is possible that the kubelet will not properly update `allocatedResourceStatus` after expansion is finished on the node. This is a known limitation, and if a user wants to keep running an older version of kubelet (older than v1.31 as of this writing, and hence unsupported), then they MUST NOT update external-resizer.
- In general, `external-resizer` should only be upgraded (and `RecoverVolumeExpansionFailure` enabled) after `kubelet`, `api-server`, and `kube-controller-manager` already have this feature enabled.
Member

Now that we support 3 version node skew, does that mean we realistically cannot enable the feature until 3 releases past GA?

Member Author

@gnufied gnufied Jun 12, 2025

Don't versions up to 1.31 (which have the kubelet fix) cover this, even if the version skew with the node is 3 versions? This does mean users must upgrade their older nodes to the latest z-streams if they want to upgrade the control plane and use the newer resizer.

Member

I guess for the kubelet case, it's not about enabling the feature; it's about running a node version that has the fix.

Member Author

Could we leave external-resizer in beta for this release while moving the k/k components to GA? Or are you thinking we should move everything to GA in 1.35?

Member

I think we can go ahead and GA external-resizer. My comment is just a wording nitpick: it currently says not to enable the resizer until kubelet has the feature enabled, which would imply kubelets have to be on 1.32 for default feature-gate settings.

@jsafrane
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2025
@k8s-ci-robot k8s-ci-robot merged commit 9c6ddfd into kubernetes:master Jun 16, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 16, 2025