KEP-4800: Promote prefer-align-cpus-by-uncorecache CPUManager feature to beta #5390
Conversation
wongchar commented on Jun 9, 2025
- One-line PR description: Promoting CPUManager feature prefer-align-cpus-by-uncorecache to beta
- Issue link: Split L3 Cache Topology Awareness in CPU Manager #5109
- Other comments:
Welcome @wongchar!
Hi @wongchar. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from c53c16d to 02d43d2
/ok-to-test
Force-pushed from 723bade to f67d977
We need to fill in the beta graduation criteria. I can think of:
- tests to ensure compatibility with the other relevant cpumanager options
- tests to ensure and report the incompatibility with other relevant cpumanager options. How should this incompatibility surface?
- review whether we have missing features: is kubernetes/kubernetes#131850 a beta requirement?
Force-pushed from 7375c52 to 4af3b23
Agreed. Updated graduation criteria.
Force-pushed from 71c843c to f68fee3
There are several comments that need to be addressed.
@@ -292,6 +293,13 @@ N/A. This feature requires a e2e test for testing.
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.

#### Beta
Earlier in the document, in the unit tests section, you've listed new tests to be added. Do we need to update that section with appropriate links?
Similarly, were e2e tests added for this functionality? It seems this PR added some; if so, can you update that section accordingly?
Regarding e2e tests: this work started pre-#5242. Let's add them.
We do have some e2e tests, but these only cover the metrics reporting.
This comment of mine still holds. I don't see the test section filled in according to the template.
Added e2e tests for metrics.
Additional e2e tests are needed and have been added to the beta graduation scope.
@@ -336,7 +344,7 @@ To enable this feature requires enabling the feature gates for static policy in
For `CPUManager` it is a requirement going from `none` to `static` policy cannot be done dynamically because of the `cpu_manager_state` file. The node needs to be drained and the policy checkpoint file (`cpu_manager_state`) need to be removed before restarting Kubelet. This feature specifically relies on the `static` policy being enabled.

- [x] Feature gate (also fill in values in `kep.yaml`)
-   - Feature gate name: `CPUManagerAlphaPolicyOptions`
+   - Feature gate name: `CPUManagerBetaPolicyOptions`
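For context on what the gate rename gates: below is a minimal sketch of the maturity-gating pattern the kubelet applies to CPU Manager policy options (a standalone illustration, not the actual kubelet source; the gate and option names mirror the KEP, everything else is made up):

```go
package main

import "fmt"

// Hypothetical feature-gate state; in the kubelet this comes from the
// --feature-gates flag or the kubelet configuration file.
var featureGates = map[string]bool{
	"CPUManagerPolicyAlphaOptions": false, // alpha gates default to disabled
	"CPUManagerPolicyBetaOptions":  true,  // beta gates default to enabled
}

// Policy options grouped by maturity. With this KEP,
// prefer-align-cpus-by-uncorecache moves from the alpha set to the beta set.
var (
	alphaOptions = map[string]bool{}
	betaOptions  = map[string]bool{"prefer-align-cpus-by-uncorecache": true}
)

// checkPolicyOptionAvailable mirrors the kubelet's gating logic: an option
// is usable only when the feature gate for its maturity level is enabled.
func checkPolicyOptionAvailable(option string) error {
	switch {
	case alphaOptions[option] && !featureGates["CPUManagerPolicyAlphaOptions"]:
		return fmt.Errorf("option %q requires CPUManagerPolicyAlphaOptions", option)
	case betaOptions[option] && !featureGates["CPUManagerPolicyBetaOptions"]:
		return fmt.Errorf("option %q requires CPUManagerPolicyBetaOptions", option)
	case !alphaOptions[option] && !betaOptions[option]:
		return fmt.Errorf("unknown CPU Manager policy option: %q", option)
	}
	return nil
}

func main() {
	// With the option now in the beta set, it works under the default gates.
	fmt.Println(checkPolicyOptionAvailable("prefer-align-cpus-by-uncorecache")) // <nil>
}
```

The practical effect of the promotion is that the option no longer requires explicitly enabling an alpha-level gate; the beta-level gate is on by default.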
Below, in the question '###### Are there any tests for feature enablement/disablement?', can you link those tests?
In '###### What specific metrics should inform a rollback?' I'm missing explicit metric(s) being called out.
'###### What steps should be taken if SLOs are not being met to determine the problem?' is missing an answer.
In '###### How can a rollout or rollback fail? Can it impact already running workloads?': is there a possibility that a kubelet restart will fail after enabling this feature? If so, what can fail, and how should one react to it?
In '###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?' I suggest checking out https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md and answering that question using data from that file.
Updated the PRR.
I might still not have a clear understanding of the SLO you are looking for. I originally mentioned latency, which seems to be the objective in the link provided, but the KEPs for other CPUManager policy options seem to track the provided metric for feature enablement instead.
Let me know if this is not what you were looking for and whether you can provide more context.
Thanks
> In '###### What specific metrics should inform a rollback?' I'm missing explicit metric(s) being called out.

We can use `kubelet_container_aligned_compute_resources_count`.
> What steps should be taken if SLOs are not being met to determine the problem?

My take on this: resource allocation in kubelet is done during the admission stage, and this feature plugs into the resource allocation flow. The feature is best-effort, so it can't cause failed admission; it can, however, add admission delay. The SLI here is the pod admission time, measured from the moment the kubelet begins admission to the end of the admission stage (captured by `topology_manager_admission_duration_ms`).
So the SLO can be: "In a default Kubernetes installation, 99th percentile per cluster-day <= X".
Meaning: the slowdown this feature adds to the admission phase, which contributes to pod startup latency, should have an upper bound and should be a fraction of the admission time without the feature enabled. E.g., causing admission to take up to double the time would be bounded, but not acceptable.
We can refine further, but this should be a good starting point.
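To make that SLO concrete: a minimal sketch (made-up bucket numbers, simplified interpolation) of estimating a p99 from the `topology_manager_admission_duration_ms` histogram buckets, which is essentially what Prometheus' `histogram_quantile(0.99, ...)` computes:

```go
package main

import "fmt"

// bucket mirrors one Prometheus histogram bucket: le is the upper bound
// in milliseconds, count is the cumulative number of observations.
type bucket struct {
	le    float64
	count float64
}

// quantile finds the bucket the target rank falls into, then interpolates
// linearly inside it -- the same idea histogram_quantile uses.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevLe, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return prevLe + (b.le-prevLe)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLe, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Illustrative cluster-day sample: 1000 admissions, most under 10ms.
	buckets := []bucket{{10, 900}, {25, 980}, {50, 997}, {100, 1000}}
	fmt.Printf("estimated p99 admission time: %.1f ms\n", quantile(0.99, buckets)) // ~39.7 ms
}
```

The SLO check is then "estimated p99 <= X" with and without the option enabled, which bounds the slowdown the feature is allowed to introduce.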
Force-pushed from cdf199d to 388cb73
#### Beta

- Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
- Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
This was raised by me during the alpha stage, but as a discussion point. While this seemed (and seems) logical to me, I don't have any hard data to back the claim that this improves over the current algorithm.
I've read the KEP and the past convos again. In the above comment I was partially wrong because I didn't remember the nature of the change. My bad. So, the main thing is that the KEP implies the aforementioned sorting, which the implementation doesn't do (see: #5110 (comment)); incidentally, this seems to leave a potential optimization on the table. But the main point is the divergence between the implementation and the design, which we agreed to rectify (and this is what I was not remembering). I thus don't think this is a new feature or a change, but rather something between an optimization and a bugfix, so it totally makes sense in the context of the beta process.
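For concreteness, a minimal sketch of the ordering change under discussion (hypothetical types and field names; the real logic lives in the cpumanager CPU assignment code):

```go
package main

import (
	"fmt"
	"sort"
)

// uncoreCache is a hypothetical view of one last-level-cache domain.
type uncoreCache struct {
	id       int // cache ID: the current iteration order is ascending by this
	freeCPUs int // CPUs still unassigned within this cache's CPU set
}

// sortByFreeCPUs orders candidates by most available CPUs first (what the
// KEP text implies), falling back to the numeric ID for determinism.
func sortByFreeCPUs(caches []uncoreCache) {
	sort.SliceStable(caches, func(i, j int) bool {
		if caches[i].freeCPUs != caches[j].freeCPUs {
			return caches[i].freeCPUs > caches[j].freeCPUs
		}
		return caches[i].id < caches[j].id
	})
}

func main() {
	caches := []uncoreCache{{id: 0, freeCPUs: 2}, {id: 1, freeCPUs: 8}, {id: 2, freeCPUs: 8}}
	sortByFreeCPUs(caches)
	fmt.Println(caches) // [{1 8} {2 8} {0 2}]
	// Numerical order would try cache 0 first, even though a request for,
	// say, 6 CPUs could be fully aligned on cache 1 or 2.
}
```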
- Add test cases to ensure functional compatibility with existing CPUManager options
- Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
- Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
- Add metric for uncore cache alignment and incorporate to E2E tests
Metrics and tests should be covered in kubernetes/kubernetes#130133. Of course it's possible there are gaps; if so, feel free to point them out.
If we are up for benchmarking, this can be a testing ground to justify the sorting change. I'm not pushing for this myself, though; I'm just open to the option.
Helpful and welcome, but no longer necessary; please see the above comment.
Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
In response, drain the node and restart the kubelet.
Thanks for elaborating. I think this is an overly cautious stance. In our case the feature is best-effort, does not require changes to the cpumanager state file, has no dependency on extra fields (cadvisor has long provided the data we need), and a kubelet restart must not affect running workloads anyway. So the chances for a rollout or rollback to fail are admittedly non-zero, but reasonably tiny. This is especially true for rollback.
A rollout may fail in the sense that, the feature being best-effort, it can cause the workload to run without proper LLC alignment, because of internal resource fragmentation or because of the workload characteristics. This is nuanced: the workload will actually run (alignment is best-effort), but it won't enjoy the LLC alignment and thus the promised performance gain.
The metrics `container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` can let the user notice these "failures" and take corrective measures, but rollback will not help, because of the best-effort nature of the change.
###### What specific metrics should inform a rollback?

Increased pod startup time/latency
`AlignedUncoreCache` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
Generally correct, but please see above about the names of the metrics and the nature of the "rollback".
The metric name is not correct; it's a variable name. We need the externally visible name.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- Utilized proposed 'container_aligned_compute_resources_count' in PR#127155 to be extended for uncore cache alignment count.
+ No.
Nothing I can think of either
@@ -409,16 +417,16 @@ Reference podresources API to determine CPU assignment.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Measure the time to deploy pods under default settings and compare to the time to deploy pods with align-by-uncorecache enabled. Time difference should be negligible.
CPUset allocation should be on the fewest amount of uncore caches as possible on the node.
I think we can reuse some of the content we discussed in https://github.com/kubernetes/enhancements/pull/5390/files#r2144953827.
This still holds.
Updated SLO, thanks for the explanation!
sig-node review.
From the node perspective, the beta content matches the previous conversation and agreements (my memory needed a little boost):
- we will address the gap identified during the second alpha cycle
- we will address the gap identified from feedback (kubernetes/kubernetes#131850)
- we will add the necessary e2e test coverage
Hence, LGTM from the node perspective for beta1.
@@ -289,8 +290,16 @@ N/A. This feature requires a e2e test for testing.

#### Alpha

- Feature implemented behind a feature gate flag option
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.
- Test cases created for feature
Let's clarify that we did unit tests, added relevant metrics, and added e2e tests for the metrics. It could read like:
- Added unit test coverage
- Added metrics to cover observability needs
- Added e2e tests for the metrics only
done
Several comments still hold:
- the test section is still not filled in according to the template
- metric names and SLO
and a few minor, non-blocking nits. But the above two will need to be addressed before this can merge.
@@ -38,7 +38,7 @@ milestone:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "CPUManagerPolicyAlphaOptions"
At the bottom of this document there is a metrics section that needs to be filled in.
Unit tests will be implemented to test if the feature is enabled/disabled.
E2e node serial suite can be used to test the enablement/disablement of the feature since it allows the kubelet to be restarted.

E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Do you have those tests already? Those are a requirement for beta promotion.
Tests will need to be added. Added e2e test coverage to the beta graduation scope.
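A minimal sketch of what such an enablement/disablement unit test could look like (`SetFeatureGateDuringTest` is the real helper from k8s.io/component-base/featuregate/testing; `checkPolicyOptionAvailable` is a hypothetical stand-in for the kubelet's actual validation entry point):

```go
package cpumanager

import (
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// Flip the beta gate both ways and assert the option is accepted or
// rejected accordingly; the gate is restored automatically on cleanup.
func TestUncoreCacheOptionGating(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate,
			features.CPUManagerPolicyBetaOptions, enabled)

		// checkPolicyOptionAvailable: hypothetical validation entry point.
		err := checkPolicyOptionAvailable("prefer-align-cpus-by-uncorecache")
		if enabled && err != nil {
			t.Errorf("option rejected with gate enabled: %v", err)
		}
		if !enabled && err == nil {
			t.Errorf("option accepted with gate disabled")
		}
	}
}
```

The e2e-node variant would do the same check, but by restarting the kubelet with the gate toggled in its configuration, as the comment above notes.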
E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
See PR#130133 (https://github.com/kubernetes/kubernetes/pull/130133)
Nit: can we just link to a place in the code, not a PR?
@@ -397,7 +405,7 @@ Reference CPUID info in podresources API to be able to verify assignment.

###### How can an operator determine if the feature is in use by workloads?

Reference podresources API to determine CPU assignment and CacheID assignment per container.
- Use proposed 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
+ Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
Same comment as above: please change that to a pointer to the code, not a PR.
/approve
the PRR section
/approve
/lgtm
My comments were addressed, so I think we can move forward.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, soltysh, wongchar. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.