Skip to content

KEP-2837: Beta Graduation Criteria for 1.34 #5362

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ndixita
Copy link
Contributor

@ndixita ndixita commented Jun 3, 2025

  • One-line PR description: Update details around alignment managers, autoscalers
  • Other comments: For Beta, following features are in scope:
    • eviction manager
    • events associated with topology, memory and cpu manager to bubble up unsupported alignment operations for guaranteed pods with pod level resources
    • implementation details around HPA, VPA and cluster autoscaler
    • moved in place pod level resources resize to Adding Alpha criteria for KEP to support IPPR & Pod Level Resources #5423
    • full support for alignment managers moved to future work
    • support for windows moved to future work
    • advanced vpa algorithm to recommend proportionate pod limits with container limits moved to future work and subject to further discussions

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 3, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 3, 2025
Copy link

@pravk03 pravk03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user attempts to configure scope=container for such a pod

Topology Manager scope (container/pod) is a node level setting.

@ndixita ndixita force-pushed the pod-level-resources-beta branch 3 times, most recently from 3da0389 to 471221b Compare June 11, 2025 23:04
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2025
@ndixita ndixita marked this pull request as ready for review June 11, 2025 23:10
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2025
@k8s-ci-robot k8s-ci-robot requested a review from palnabarun June 11, 2025 23:10
@ndixita
Copy link
Contributor Author

ndixita commented Jun 12, 2025

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label Jun 12, 2025
@ndixita ndixita force-pushed the pod-level-resources-beta branch from b776557 to 3f3e191 Compare June 12, 2025 17:13
@ndixita
Copy link
Contributor Author

ndixita commented Jun 12, 2025

/assign @tallclair

@ndixita ndixita force-pushed the pod-level-resources-beta branch from 3f3e191 to 5e556cc Compare June 12, 2025 20:12
@ndixita ndixita changed the title Adding details for CPU, memory and topology manager implementation an… KEP-2837: Beta Graduation Criteria Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 18, 2025
@ndixita ndixita force-pushed the pod-level-resources-beta branch 6 times, most recently from 5fb75de to 04f28d7 Compare June 18, 2025 21:53
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 18, 2025
@ndixita ndixita force-pushed the pod-level-resources-beta branch from 04f28d7 to 7a26908 Compare June 18, 2025 22:19
@ndixita ndixita requested a review from sanposhiho June 18, 2025 22:23
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ndixita, sanposhiho
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric, mrunalp for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ndixita ndixita force-pushed the pod-level-resources-beta branch from 7a26908 to 5773936 Compare June 18, 2025 22:41
Copy link
Member

@tallclair tallclair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@ndixita ndixita force-pushed the pod-level-resources-beta branch from 5773936 to 5b51362 Compare June 18, 2025 23:16
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@ndixita ndixita force-pushed the pod-level-resources-beta branch from 5b51362 to 1d8e4df Compare June 18, 2025 23:24
@mrunalp
Copy link
Contributor

mrunalp commented Jun 18, 2025

Can we summarize in the description and/or the commit message the proposed changes (some are lost in the discussion threads)?

@tallclair
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
Signed-off-by: ndixita <ndixita@google.com>
@ndixita ndixita force-pushed the pod-level-resources-beta branch from 1d8e4df to 83bf364 Compare June 19, 2025 06:11
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 3 PRR questions (tests for enablement/disablement, upgrade->downgrade->upgrade and new fields estimation) are the remaining blocking items.

- apiserver_rejected_requests
- schedule_attempts_total{result="error|unschedulable"}
- node_collector_evictions_total`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- node_collector_evictions_total`
- node_collector_evictions_total

- node_collector_evictions_total`
- started_pods_errors_total
- started_containers_errors_total`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- started_containers_errors_total`
- started_containers_errors_total

@@ -71,7 +72,7 @@
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [VPA](#vpa)
- [VPA](#vpa-1)
<!-- /toc -->


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure to include the second one in the release, it's currently missing lead-opted-in label and milestone set. This needs to be done by sig-node leads.

* Cgroup settings when pod-level resources are set.
* Validate scheduling and admission.
* Validate the containers with no limits set are throttled on CPU when CPU usage reaches Pod level CPU limits.
* Validate the containers with no limits set are OOMKilled when memory usage
reaches Pod level memory limits.
* Test the correct values in TotalResourcesRequested.

- [test name](https://github.com/ndixita/kubernetes/blob/master/test/e2e/common/node/pod_level_resources.go): [SIG Node](https://testgrid.k8s.io/sig-node-presubmits#pr-kubelet-serial-e2e-podresources), [triage search](https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&sig=node&job=pull-kubernetes-node-kubelet-serial-podresources)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- [test name](https://github.com/ndixita/kubernetes/blob/master/test/e2e/common/node/pod_level_resources.go): [SIG Node](https://testgrid.k8s.io/sig-node-presubmits#pr-kubelet-serial-e2e-podresources), [triage search](https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&sig=node&job=pull-kubernetes-node-kubelet-serial-podresources)
- [Pod Level Resources](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/pod_level_resources.go): [SIG Node](https://testgrid.k8s.io/sig-node-presubmits#pr-kubelet-serial-e2e-podresources), [triage search](https://storage.googleapis.com/k8s-triage/index.html?ci=0&pr=1&sig=node&job=pull-kubernetes-node-kubelet-serial-podresources)

`k8s.io/kubernetes/pkg/kubelet/cm`: `20250618` - 18.4
`k8s.io/kubernetes/pkg/kubelet/kuberuntime`: `20250618` - 69.1
`k8s.io/kubernetes/pkg/scheduler/framework` - `20250618` - 71.7
`k8s.io/kubernetes/pkg/apis/core/validation` - `20250618` - 84.7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you're missing integration tests? If none are applicable, please make sure to add such information.

* Pod Level Resources Support With In Place Pod Vertical Scaling KEP is past alpha.
* User feedback (ideally from at least two distinct users) is green
* Resource Allocation Managers i.e. Topology, Memory and CPU managers support with
Pod-level resources is past alpha.

### Upgrade / Downgrade Strategy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In ###### Are there any tests for feature enablement/disablement? you're describing tests that were planned for alpha, can you add there links to those tests?
  2. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? is missing answers, it currently mentions only that it will be tested? Was it? Can you describe the steps?
  3. In ###### Will enabling / using this feature result in increasing size or count of the existing API objects? you're missing estimated size increase, and amount of new objects (see comments above that section)

Copy link

@raywainman raywainman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for SIG-Autoscaling

method.
Collaboration with sig-autoscaling has been established to integrate support for
VPA with Pod-level resources, slated for VPA 1.5. The changes to support pod-level
resources in VPA will be worked on in two phases:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - you can remove the "two phases" part

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.