-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Target GA for recover from Volume Expansion failure feature #5353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c94f4b5
to
bd6bae1
Compare
Shadowing @deads2k on this. I had one item that jumped at me reading the PRR. Was this metric added?
|
It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation. The answers were a bit vague so we should revisit that section and comment on the scalability questions with more information. |
No, we didn't add the metric. I was just discussing with storage folks and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From operational point of view, I can't see a reason why will an admin be interested in this metric. There are already other resizing related metrics, which work well enough. The metric item was tentative, so not sure if blocker for GA. |
I am adding some more information for GA graduation criteria but scalability requirements as such has not changed between beta and GA. Which bits you found vague? I answered couple of questions with - "Potentially Yes", because answer depends on whether a recovery was attempted. |
It would be worth cleaning up the KEP then. I think you mention using that metric in a few places in the PRR. |
The potentially yes jumped out at me because it sounded like we have not yet investigated if this would occur. Maybe you can expand on what you mean by recovery. |
What is the alternative method we recommend to a cluster-admin to determine when resize operations are failing on a particular node and/or failing across all nodes. |
So there is Some of these metrics have evolved since KEP (This KEP is now 5 years old) was originally written, I will update the KEP with newer metric names. |
I have reworded those, ptal. |
### GA | ||
|
||
- The feature has been extensively tested in CI and deployed with drivers in real world. For example - https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166 | ||
- The test grid already has tests for the feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have both presubmits and periodics executing this feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests are still running only in alpha jobs, despite the feature is already in beta. I am going to LGTM the KEP, but I will block feature gate graduation until the test run in regular e2e jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err I misread the text. I will move the test for normal non-featuregated default.
For stable, this needs to be explained and documented how this was done. |
|
||
- The feature has been extensively tested in CI and deployed with drivers in real world. For example - https://github.com/kubernetes-sigs/azuredisk-csi-driver/blob/master/charts/latest/azuredisk-csi-driver/templates/csi-azuredisk-controller.yaml#L166 | ||
- The test grid already has tests for the feature. | ||
|
||
### Upgrade / Downgrade Strategy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the version skew strategy section with the fixes we made?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
Reading the Version skew section, this feature seems to have a really complicated rollout plan. Is this documented anywhere? What order do you recommend turning on these features gates? Should someone enable this in external-resizer first? And then enable it in kas, kubelet, kcm? What is the recommendation for rolling this feature out for operators? external-resizer must have the feature gate enabled for this to work. The feature gate is still beta. Could we consider GAing this feature first and then promote the other feature gates? edit: NVM. You call this out that this is actually bad.. So the recommendation here would be to turn this feature gate on in all kubernetes components first and then you promote the feature gate in external-resizer after skew is satisfied? |
Yes that is generally recommended and I have added this in the latest commit. d6f5b11 Having said that, there is some fail-safety builtinto this. Older kubelets can work with newer resizer upto a certain version and external-resizer can work with newer api-server etc. Please see version skew section for more details. |
LGTM from PRR shadow standpoint. Thank you for the updates! /assign @deads2k |
PRR lgtm /approve |
/assign @jsafrane @xing-yang @msau42 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nit, I believe this should be nested under the Graduation Criteria
d6f5b11
to
4e06845
Compare
### Upgrade / Downgrade Strategy | ||
|
||
Not Applicable | ||
- For the case of older kubelet running with newer resizer, the kubelet must handle the newer fields introduced by this feature even if feature-gate is not enabled. Having said that, if kubelet is older and has no notion of this feature at all and `api-server` and `external-resizer` is newer, it is possible that kubelet will not properly update `allocatedResourceStatus` after expansion is finished on the node. This is a known limitation and if user wants to keep running older version of kubelet (older than v1.31 as of this writing and hence unsupported), then they MUST not update external-resizer. | ||
- In general `external-resizer` should only be upgraded(and `RecoverVolumeExpansionFailure` enabled), after `kubelet`, `api-server` and `kube-controller-manager` already has this feature enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we support 3 version node skew, does that mean we realistically cannot enable the feature until 3 releases past GA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't versions upto 1.31 (which have kubelet fix) cover this? Even if version skew with node is 3 versions? This does mean - users must upgrade their older nodes to latest z-streams, if they want to upgrade control-plane and use newer resizer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess for the kubelet case, it's not about enabling the feature. It's about being a node version that has the fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave external-resizer beta in this release while moving k/k components GA? Or are you thinking we should move everything GA in 1.35 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can go ahead and GA external-resizer. My comment is just a wording nitpick because it currently says don't enable resizer until kubelet has the feature enabled, which would imply kubelets have to be on 1.32 for default feature gate settings.
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, gnufied, jsafrane The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
#1790