Skip to content

KEP-961: Bump Statefulset MaxUnavailable to beta (built on top of #4474) #5228

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Jun 17, 2025

Conversation

Edwinhr716
Copy link
Contributor

Supersedes #4474

  • One-line PR description: Bump Statefulset MaxUnavailable to beta
  • Other comments: kept @kerthcet's and @knelasevero's commits here. My commits: adding metric information

Metrics PR: kubernetes/kubernetes#130951

kerthcet and others added 10 commits April 3, 2025 17:13
Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: kerthcet <kerthcet@gmail.com>
Signed-off-by: Lucas Severo Alves <lseveroa@redhat.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 3, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Apr 3, 2025
@Edwinhr716
Copy link
Contributor Author

/assign @janetkuo

@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 3, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 3, 2025
cluster required to make on upgrade, in order to maintain previous behavior?
- This is a new field, it will only be enabled after an upgrade. To maintain the
previous behavior, you can disable the feature gate manually in Beta or leave
the field unconfigured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can discuss the case of when existing clusters already use this feature in alpha before upgrade.

Also, discuss what happens if the feature gate was disabled, and the field was set, before upgrading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added both cases, PTAL

If their StatefulSet rollingUpdate section has the field maxUnavailable specified with
a value different than 1.
The below command should show maxUnavailable value:
a value different than 1. While in alpha and beta, the feature-gate needs to be enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can still set it to 1 when this feature is enabled, right? Perhaps they just need to check the existence of the maxUnavailable field, and that the feature gate is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, it can be both. During rollout though you won't see a behavioral change if it is set to 1, but you will see it if it set to anything else. As an example, LWS sets the value even if the feature flag was disabled. The only way we could tell it wasn't enabled by default was by testing it with a value larger than one.

@soltysh soltysh mentioned this pull request Apr 17, 2025
8 tasks
@soltysh
Copy link
Contributor

soltysh commented Jun 5, 2025

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 5, 2025
@kannon92
Copy link
Contributor

kannon92 commented Jun 13, 2025

PRR shadow:

Taking a look before @wojtek-t.

@soltysh calls out here that there are a lot of TODO, bugs and metrics to add for alpha -> beta generation.

I even see unresolved threads about existing bugs that have blocked the promotion before. I see that PRs are up for these.

As part of https://github.com/kubernetes/enhancements/tree/master/keps/sig-architecture/5241-beta-featuregate-promotion-requirements#summary, a requirement for features to be promoted to beta is that metrics, features and bugs are all complete.

I think the PRR seems fine to me but I want to call out that all these items must be complete before the feature gate gets turned on.

My main concern is beta requirements. This seems like a pretty hefty list to achieve in a few weeks.

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just two minor comments from me - other than that it LGTM from PRR pov and is waiting for SIG approval from @soltysh


Administrators should monitor the following metrics to assess the need for a rollback:

- `statefulset_unavailability_violation`: Ths metric reflects the number of times the maxUnavailable condition is violated (i.e. spec.replicas - availableReplicas > maxUnavailable).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure that this metric will get added before enabling the feature by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

@soltysh
Copy link
Contributor

soltysh commented Jun 17, 2025

I think the PRR seems fine to me but I want to call out that all these items must be complete before the feature gate gets turned on.

I think that's the best path forward. I will personally ensure we DON'T enable the feature gate before all the missing bits pointed out here are addressed.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
from sig-apps pov

/hold
for @wojtek-t PRR approval

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 17, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2025
@wojtek-t
Copy link
Member

The PRR looks fine - please just ensure that the following happen before FG is turned on to beta:

  • the metric is added as describe
  • the enablement test mentioned in the PRR
  • the manual test is run
    [in addition to some other tests from tests section]

/approve PRR
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Edwinhr716, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wojtek-t
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot merged commit 4a9783f into kubernetes:master Jun 17, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 17, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Done in SIG Apps Jun 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

8 participants