
Target GA for recover from Volume Expansion failure feature #5353


Merged (5 commits, Jun 16, 2025)

Conversation

gnufied
Member

@gnufied gnufied commented May 29, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 29, 2025
@gnufied gnufied force-pushed the bump-recovery-ga branch 2 times, most recently from c94f4b5 to bd6bae1 Compare May 29, 2025 22:39
@gnufied gnufied force-pushed the bump-recovery-ga branch from bd6bae1 to d52c7d4 Compare May 29, 2025 22:51
@kannon92
Contributor

Shadowing @deads2k on this. I had one item that jumped out at me while reading the PRR.

Was this metric added?

Are there any missing metrics that would be useful to have to improve observability of this feature?
We are planning to add new counter metrics that will record success and failure of recovery operations. In cases where recovery fails, the counter will forever be increasing until an admin action resolves the error.

Tentative name of metric is - operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}

The reason for using the PV name as a label is that we do not expect this feature to be used very often in a cluster, so it should be okay to use the names of the PVs that were recovered this way.

@kannon92
Contributor

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

The answers were a bit vague, so we should revisit that section and comment on the scalability questions with more information.

@gnufied
Member Author

gnufied commented May 30, 2025

Was this metric added?

No, we didn't add the metric. I was discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I am not sure if it is a blocker for GA.

@gnufied
Member Author

gnufied commented May 30, 2025

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially yes", because the answer depends on whether a recovery was attempted.

@kannon92
Contributor

Was this metric added?

No, we didn't add the metric. I was discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I am not sure if it is a blocker for GA.

It would be worth cleaning up the KEP then. I think you mention using that metric in a few places in the PRR.

@kannon92
Contributor

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially yes", because the answer depends on whether a recovery was attempted.

The "Potentially yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

Maybe you can expand on what you mean by recovery.
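(For context, "recovery" in this KEP is the user-facing flow of reducing a PVC's requested size after an expansion fails. A hypothetical sketch of a PVC mid-recovery; the name, sizes, and status values below are illustrative only, not taken from the KEP:)

```yaml
# Hypothetical PVC after a failed expansion: the user originally asked for
# 100Gi, the backend could not satisfy it, and the request was lowered to
# 20Gi so the resize can complete at the smaller size.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-example                       # made-up name
spec:
  resources:
    requests:
      storage: 20Gi                       # reduced from the failed 100Gi request
status:
  capacity:
    storage: 10Gi                         # size currently provisioned
  allocatedResources:
    storage: 100Gi                        # the failed, larger allocation
  allocatedResourceStatus:
    storage: ControllerResizeInfeasible   # illustrative status value
```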

@deads2k
Contributor

deads2k commented Jun 2, 2025

No, we didn't add the metric. I was just discussing with storage folks and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From operational point of view, I can't see a reason why will an admin be interested in this metric. There are already other resizing related metrics, which work well enough.

What is the alternative method we recommend to a cluster-admin to determine when resize operations are failing on a particular node and/or across all nodes?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 2, 2025
@gnufied
Member Author

gnufied commented Jun 3, 2025

What is the alternative method we recommend to a cluster-admin to determine when resize operations are failing on a particular node and/or across all nodes?

So there are csi_sidecar_operations_seconds{method_name="/csi.v1.Controller/ControllerExpandVolume"}, csi_operations_seconds{method_name="/csi.v1.Node/NodeExpandVolume"}, and storage_operation_duration_seconds{operation_name="volume_fs_resize"}, which already track these operations on both the controller and node side. The csi_xxx metrics track gRPC error codes, so if either the controller- or node-side expansion is failing, it will be recorded appropriately and a metric emitted.

Some of these metrics have evolved since the KEP was originally written (this KEP is now 5 years old); I will update the KEP with the newer metric names.
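(For illustration, an admin could watch for failing expansions with queries along these lines; the exact label sets depend on the sidecar and kubelet versions, so treat this as a sketch rather than copy-paste alerts:)

```promql
# Controller-side expansion failures, by gRPC error code (sketch)
sum by (grpc_status_code) (
  rate(csi_sidecar_operations_seconds_count{
    method_name="/csi.v1.Controller/ControllerExpandVolume",
    grpc_status_code!="OK"
  }[5m])
)

# Node-side expansion failures (sketch; assumes the kubelet-emitted metric)
sum by (driver_name) (
  rate(csi_operations_seconds_count{
    method_name="/csi.v1.Node/NodeExpandVolume",
    grpc_status_code!="OK"
  }[5m])
)
```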

@gnufied
Member Author

gnufied commented Jun 3, 2025

The "Potentially yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

I have reworded those, ptal.

### GA

- The feature has been extensively tested in CI and deployed with drivers in the real world. For example: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
- The test grid already has tests for the feature.
Contributor

Do we have both presubmits and periodics executing this feature?

Member

The tests are still running only in alpha jobs, even though the feature is already in beta. I am going to LGTM the KEP, but I will block feature-gate graduation until the tests run in regular e2e jobs.

Member Author

@gnufied gnufied Jun 16, 2025

Err, I misread the text. I will move the tests to the normal, non-feature-gated default jobs.

@kannon92
Contributor

kannon92 commented Jun 6, 2025

  • Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?
    We have not fully tested upgrade and rollback but as part of beta process we will have it tested.

For stable, this needs to be explained and documented how this was done.


- The feature has been extensively tested in CI and deployed with drivers in the real world. For example: https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166
- The test grid already has tests for the feature.

### Upgrade / Downgrade Strategy
Member

Can you update the version skew strategy section with the fixes we made?

Member Author

Added.

@kannon92
Contributor

kannon92 commented Jun 6, 2025

Reading the version skew section, this feature seems to have a really complicated rollout plan. Is this documented anywhere? In what order do you recommend turning on these feature gates? Should someone enable this in external-resizer first, and then enable it in kas, kubelet, and kcm?

What is the recommendation for rolling this feature out for operators?

external-resizer must have the feature gate enabled for this to work. The feature gate is still beta. Could we consider GAing this feature first and then promote the other feature gates?

Edit: never mind. You call out that this is actually bad.

So the recommendation here would be to turn this feature gate on in all Kubernetes components first, and then promote the feature gate in external-resizer after the skew is satisfied?

@gnufied
Member Author

gnufied commented Jun 9, 2025

So the recommendation here would be to turn this feature gate on in all Kubernetes components first, and then promote the feature gate in external-resizer after the skew is satisfied?

Yes, that is generally recommended, and I have added this in the latest commit: d6f5b11

Having said that, there is some fail-safety built into this. Older kubelets can work with a newer resizer up to a certain version, and external-resizer can work with a newer api-server, etc. Please see the version skew section for more details.
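(A sketch of that ordering using the usual `--feature-gates` flags; how the flags are plumbed into each component varies by distribution, so this is illustrative, not prescriptive:)

```shell
# 1. Enable the gate on the core components (and all kubelets) first.
kube-apiserver          --feature-gates=RecoverVolumeExpansionFailure=true ...
kube-controller-manager --feature-gates=RecoverVolumeExpansionFailure=true ...
kubelet                 --feature-gates=RecoverVolumeExpansionFailure=true ...

# 2. Only once the version skew is satisfied, enable it on the sidecar.
csi-resizer --feature-gates=RecoverVolumeExpansionFailure=true ...
```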

@kannon92
Contributor

kannon92 commented Jun 9, 2025

LGTM from PRR shadow standpoint. Thank you for the updates!

/assign @deads2k

@deads2k
Contributor

deads2k commented Jun 9, 2025

PRR lgtm

/approve

@gnufied
Member Author

gnufied commented Jun 10, 2025

/assign @jsafrane @xing-yang @msau42

Member

@jmickey jmickey left a comment

Small nit: I believe this should be nested under the Graduation Criteria section.

@gnufied gnufied force-pushed the bump-recovery-ga branch from d6f5b11 to 4e06845 Compare June 11, 2025 15:13
### Upgrade / Downgrade Strategy

Not Applicable
- For the case of an older kubelet running with a newer resizer, the kubelet must handle the newer fields introduced by this feature even if the feature gate is not enabled. Having said that, if the kubelet is older and has no notion of this feature at all while `api-server` and `external-resizer` are newer, it is possible that the kubelet will not properly update `allocatedResourceStatus` after expansion is finished on the node. This is a known limitation, and if a user wants to keep running an older version of kubelet (older than v1.31 as of this writing, and hence unsupported), then they MUST NOT update external-resizer.
- In general, `external-resizer` should only be upgraded (and `RecoverVolumeExpansionFailure` enabled) after `kubelet`, `api-server`, and `kube-controller-manager` already have this feature enabled.
Member

Now that we support 3 version node skew, does that mean we realistically cannot enable the feature until 3 releases past GA?

Member Author

@gnufied gnufied Jun 12, 2025

Don't versions up to 1.31 (which have the kubelet fix) cover this, even if the version skew with the node is 3 versions? This does mean users must upgrade their older nodes to the latest z-streams if they want to upgrade the control plane and use the newer resizer.

Member

I guess for the kubelet case, it's not about enabling the feature; it's about running a node version that has the fix.

Member Author

Could we leave external-resizer in beta for this release while moving the k/k components to GA? Or are you thinking we should move everything to GA in 1.35?

Member

I think we can go ahead and GA external-resizer. My comment is just a wording nitpick: it currently says not to enable the resizer until kubelet has the feature enabled, which would imply kubelets have to be on 1.32 for default feature-gate settings.

@jsafrane
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2025
@k8s-ci-robot k8s-ci-robot merged commit 9c6ddfd into kubernetes:master Jun 16, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 16, 2025