KEP-4800: Promote prefer-align-cpus-by-uncorecache CPUManager feature to beta #5390
Conversation
wongchar commented on Jun 9, 2025
- One-line PR description: Promoting CPUManager feature prefer-align-cpus-by-uncorecache to beta
- Issue link: Split L3 Cache Topology Awareness in CPU Manager #5109
- Other comments:
Welcome @wongchar!
Hi @wongchar. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from c53c16d to 02d43d2
/ok-to-test
Force-pushed from 723bade to f67d977
We need to fill in the beta graduation criteria. I can think of:
- tests to ensure compatibility with the other relevant cpumanager options
- tests to ensure and report the incompatibility with other relevant cpumanager options. How should this incompatibility surface?
- review whether we have missing features: is kubernetes/kubernetes#131850 a beta requirement?
Force-pushed from 7375c52 to 4af3b23
Agreed. Updated graduation criteria.
Force-pushed from 71c843c to f68fee3
There are several comments that need to be addressed.
@@ -292,6 +293,13 @@ N/A. This feature requires a e2e test for testing.
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.

#### Beta
Earlier in the document, in the unit tests section, you've listed new tests to be added. Do we need to update that section with appropriate links?
Similarly, were e2e tests added for this functionality? It seems this PR added some; if so, can you update that section accordingly?
Regarding e2e tests: this work started pre-#5242. Let's add them.
We do have some e2e tests, but these only cover the metrics reporting.
This comment of mine still holds. I don't see the test section filled in according to the template.
Added e2e tests for metrics.
Additional e2e tests are needed and have been added to the beta graduation scope.
@@ -336,7 +344,7 @@ To enable this feature requires enabling the feature gates for static policy in
For `CPUManager` it is a requirement going from `none` to `static` policy cannot be done dynamically because of the `cpu_manager_state` file. The node needs to be drained and the policy checkpoint file (`cpu_manager_state`) need to be removed before restarting Kubelet. This feature specifically relies on the `static` policy being enabled.

- [x] Feature gate (also fill in values in `kep.yaml`)
-   - Feature gate name: `CPUManagerAlphaPolicyOptions`
+   - Feature gate name: `CPUManagerBetaPolicyOptions`
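For context on what the gate rename gates: below is a minimal sketch of the maturity-gating pattern the kubelet applies to CPU Manager policy options (a standalone illustration, not the actual kubelet source; the gate and option names mirror the KEP, everything else is made up):

```go
package main

import "fmt"

// Hypothetical feature-gate state; in the kubelet this comes from the
// --feature-gates flag or the kubelet configuration file.
var featureGates = map[string]bool{
	"CPUManagerPolicyAlphaOptions": false, // alpha gates default to disabled
	"CPUManagerPolicyBetaOptions":  true,  // beta gates default to enabled
}

// Policy options grouped by maturity. With this KEP,
// prefer-align-cpus-by-uncorecache moves from the alpha set to the beta set.
var (
	alphaOptions = map[string]bool{}
	betaOptions  = map[string]bool{"prefer-align-cpus-by-uncorecache": true}
)

// checkPolicyOptionAvailable mirrors the kubelet's gating logic: an option
// is usable only when the feature gate for its maturity level is enabled.
func checkPolicyOptionAvailable(option string) error {
	switch {
	case alphaOptions[option] && !featureGates["CPUManagerPolicyAlphaOptions"]:
		return fmt.Errorf("option %q requires CPUManagerPolicyAlphaOptions", option)
	case betaOptions[option] && !featureGates["CPUManagerPolicyBetaOptions"]:
		return fmt.Errorf("option %q requires CPUManagerPolicyBetaOptions", option)
	case !alphaOptions[option] && !betaOptions[option]:
		return fmt.Errorf("unknown CPU Manager policy option: %q", option)
	}
	return nil
}

func main() {
	// With the option now in the beta set, it works under the default gates.
	fmt.Println(checkPolicyOptionAvailable("prefer-align-cpus-by-uncorecache")) // <nil>
}
```

The practical effect of the promotion is that the option no longer requires explicitly enabling an alpha-level gate; the beta-level gate is on by default.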
Below, in the question '###### Are there any tests for feature enablement/disablement?', can you link those tests?
In '###### What specific metrics should inform a rollback?' I'm missing explicit metric(s) being called out.
'###### What steps should be taken if SLOs are not being met to determine the problem?' is missing an answer.
In '###### How can a rollout or rollback fail? Can it impact already running workloads?': is there a possibility that a kubelet restart will fail after enabling this feature? If so, what can fail, and how should one react to it?
In '###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?' I suggest checking out https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md and answering that question using data from that file.
Updated the PRR.
I might still not have a clear understanding of the SLO you are looking for. I originally mentioned latency, which seems to be the objective in the link provided, but the KEPs for other CPUManager policy options seem to track the provided metric for feature enablement instead.
Let me know if this is not what you were looking for and whether you can provide more context.
Thanks
> In '###### What specific metrics should inform a rollback?' I'm missing explicit metric(s) being called out.

We can use `kubelet_container_aligned_compute_resources_count`.
> What steps should be taken if SLOs are not being met to determine the problem?

My take on this: resource allocation in kubelet is done during the admission stage, and this feature plugs into the resource allocation flow. The feature is best-effort, so it can't cause failed admission; it can, however, add admission delay. The SLI here is the pod admission time, measured from the moment the kubelet begins admission to the end of the admission stage (captured by `topology_manager_admission_duration_ms`).
So the SLO can be: "In a default Kubernetes installation, 99th percentile per cluster-day <= X".
Meaning: the slowdown this feature adds to the admission phase, which contributes to pod startup latency, should have an upper bound and should be a fraction of the admission time without the feature enabled. E.g., causing admission to take up to double the time would be bounded, but not acceptable.
We can refine further, but this should be a good starting point.
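To make that SLO concrete: a minimal sketch (made-up bucket numbers, simplified interpolation) of estimating a p99 from the `topology_manager_admission_duration_ms` histogram buckets, which is essentially what Prometheus' `histogram_quantile(0.99, ...)` computes:

```go
package main

import "fmt"

// bucket mirrors one Prometheus histogram bucket: le is the upper bound
// in milliseconds, count is the cumulative number of observations.
type bucket struct {
	le    float64
	count float64
}

// quantile finds the bucket the target rank falls into, then interpolates
// linearly inside it -- the same idea histogram_quantile uses.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevLe, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			return prevLe + (b.le-prevLe)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLe, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Illustrative cluster-day sample: 1000 admissions, most under 10ms.
	buckets := []bucket{{10, 900}, {25, 980}, {50, 997}, {100, 1000}}
	fmt.Printf("estimated p99 admission time: %.1f ms\n", quantile(0.99, buckets)) // ~39.7 ms
}
```

The SLO check is then "estimated p99 <= X" with and without the option enabled, which bounds the slowdown the feature is allowed to introduce.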
Force-pushed from cdf199d to 388cb73
#### Beta

- Address bug fixes: ability to schedule odd-integer CPUs for uncore cache alignment
- Add missing feature: sort uncore caches by largest quantity of available CPUs instead of numerical order
This was raised by me during the alpha stage, but as a discussion point. While this seemed (and seems) logical to me, I don't have any hard data to back the claim that this improves over the current algorithm.
I've read the KEP and the past convos again. In the above comment I was partially wrong because I didn't remember the nature of the change. My bad. So, the main thing is that the KEP implies the aforementioned sorting, which the implementation doesn't do (see: #5110 (comment)); incidentally, this seems to leave a potential optimization on the table. But the main point is the divergence between the implementation and the design, which we agreed to rectify (and this is what I was not remembering). I thus don't think this is a new feature or a change, but rather something between an optimization and a bugfix, so it totally makes sense in the context of the beta process.
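For concreteness, a minimal sketch of the ordering change under discussion (hypothetical types and field names; the real logic lives in the cpumanager CPU assignment code):

```go
package main

import (
	"fmt"
	"sort"
)

// uncoreCache is a hypothetical view of one last-level-cache domain.
type uncoreCache struct {
	id       int // cache ID: the current iteration order is ascending by this
	freeCPUs int // CPUs still unassigned within this cache's CPU set
}

// sortByFreeCPUs orders candidates by most available CPUs first (what the
// KEP text implies), falling back to the numeric ID for determinism.
func sortByFreeCPUs(caches []uncoreCache) {
	sort.SliceStable(caches, func(i, j int) bool {
		if caches[i].freeCPUs != caches[j].freeCPUs {
			return caches[i].freeCPUs > caches[j].freeCPUs
		}
		return caches[i].id < caches[j].id
	})
}

func main() {
	caches := []uncoreCache{{id: 0, freeCPUs: 2}, {id: 1, freeCPUs: 8}, {id: 2, freeCPUs: 8}}
	sortByFreeCPUs(caches)
	fmt.Println(caches) // [{1 8} {2 8} {0 2}]
	// Numerical order would try cache 0 first, even though a request for,
	// say, 6 CPUs could be fully aligned on cache 1 or 2.
}
```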
- Add test cases to ensure functional compatibility with existing CPUManager options
- Add test cases to ensure and report incompatibility with existing CPUManager options that are not supported with prefer-align-cpus-by-uncore-cache
- Additional benchmarks to show performance benefit of prefer-align-cpus-by-uncore-cache feature
- Add metric for uncore cache alignment and incorporate to E2E tests
Metrics and tests should be covered in kubernetes/kubernetes#130133. Of course it's possible there are gaps; if so, feel free to point them out.
If we are up for benchmarking, this can be a testing ground to justify the sorting change. I'm not pushing for this myself, though; I'm just open to the option.
Helpful and welcome, but no longer necessary; please see the above comment.
Rollout/rollback should not fail since the feature is hidden behind feature gates and will not be enabled by default.
Enabling the feature will require the Kubelet to restart, introducing potential for kubelet to fail to start or crash, which can affect existing workloads.
In response, drain the node and restart the kubelet.
Thanks for elaborating. I think this is an overly cautious stance. In our case the feature is best-effort, does not require changes to the cpumanager state file, has no dependency on extra fields (cadvisor has long provided the data we need), and a kubelet restart must not affect running workloads anyway. So the chances for a rollout or rollback to fail are admittedly non-zero, but reasonably tiny. This is especially true for rollback.
A rollout may fail in the sense that, the feature being best-effort, it can cause the workload to run without proper LLC alignment, because of internal resource fragmentation or because of the workload characteristics. This is nuanced: the workload will actually run (alignment is best-effort), but it won't enjoy the LLC alignment and thus the promised performance gain.
The metrics `container_aligned_compute_resources_count` and `container_aligned_compute_resources_failure_count` can let the user notice these "failures" and take corrective measures, but rollback will not help, because of the best-effort nature of the change.
###### What specific metrics should inform a rollback?

Increased pod startup time/latency
`AlignedUncoreCache` metric can be tracked to measure if there are issues in the cpuset allocation that can determine if a rollback is necessary.
Generally correct, but please see above about the names of the metrics and the nature of the "rollback".
The metric name is not correct; it's a variable name. We need the externally visible name.
###### Are there any missing metrics that would be useful to have to improve observability of this feature?

- Utilized proposed 'container_aligned_compute_resources_count' in PR#127155 to be extended for uncore cache alignment count.
+ No.
Nothing I can think of either
@@ -409,16 +417,16 @@ Reference podresources API to determine CPU assignment.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

Measure the time to deploy pods under default settings and compare to the time to deploy pods with align-by-uncorecache enabled. Time difference should be negligible.
CPUset allocation should be on the fewest amount of uncore caches as possible on the node.
I think we can reuse some of the content we discussed in https://github.com/kubernetes/enhancements/pull/5390/files#r2144953827.
This still holds.
Updated SLO, thanks for the explanation!
sig-node review.
From the node perspective, the beta content matches the previous conversation and agreements (my memory needed a little boost):
- we will address the gap identified during the second alpha cycle
- we will address the gap identified from feedback (kubernetes/kubernetes#131850)
- we will add the necessary e2e test coverage
Hence, LGTM from the node perspective for beta1.
@@ -289,8 +290,16 @@ N/A. This feature requires a e2e test for testing.

#### Alpha

- Feature implemented behind a feature gate flag option
- E2E Tests will be skipped until nodes with uncore cache can be provisioned within CI hardware. Work is ongoing to add required systems (https://github.com/kubernetes/k8s.io/issues/7339). E2E testing will be required to graduate to beta.
- Providing a metric to verify uncore cache alignment will be required to graduate to beta.
- Test cases created for feature
Let's clarify that we did unit tests, added relevant metrics, and added e2e tests for the metrics. It could read like:
- Added unit test coverage
- Added metrics to cover observability needs
- Added e2e tests for the metrics only
done
Several comments still hold:
- the test section is still not filled in according to the template
- metric names and SLO
and a few minor, non-blocking nits. But the above two will need to be addressed before this can merge.
@@ -38,7 +38,7 @@ milestone:

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
  - name: "CPUManagerPolicyAlphaOptions"
At the bottom of this document there is a metrics section that needs to be filled in.
Unit tests will be implemented to test if the feature is enabled/disabled.
E2e node serial suite can be used to test the enablement/disablement of the feature since it allows the kubelet to be restarted.

E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Do you have those tests already? Those are a requirement for beta promotion.
Tests will need to be added. Added e2e test coverage to the beta graduation scope.
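A minimal sketch of what such an enablement/disablement unit test could look like (`SetFeatureGateDuringTest` is the real helper from k8s.io/component-base/featuregate/testing; `checkPolicyOptionAvailable` is a hypothetical stand-in for the kubelet's actual validation entry point):

```go
package cpumanager

import (
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// Flip the beta gate both ways and assert the option is accepted or
// rejected accordingly; the gate is restored automatically on cleanup.
func TestUncoreCacheOptionGating(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate,
			features.CPUManagerPolicyBetaOptions, enabled)

		// checkPolicyOptionAvailable: hypothetical validation entry point.
		err := checkPolicyOptionAvailable("prefer-align-cpus-by-uncorecache")
		if enabled && err != nil {
			t.Errorf("option rejected with gate enabled: %v", err)
		}
		if !enabled && err == nil {
			t.Errorf("option accepted with gate disabled")
		}
	}
}
```

The e2e-node variant would do the same check, but by restarting the kubelet with the gate toggled in its configuration, as the comment above notes.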
E2E test will demonstrate default behavior is preserved when `CPUManagerPolicyOptions` feature gate is disabled.
Metric created to check uncore cache alignment after cpuset is determined and utilized in E2E tests with feature enabled.
See PR#130133 (https://github.com/kubernetes/kubernetes/pull/130133)
Nit: can we just link to a place in the code, not a PR?
@@ -397,7 +405,7 @@ Reference CPUID info in podresources API to be able to verify assignment.

###### How can an operator determine if the feature is in use by workloads?

Reference podresources API to determine CPU assignment and CacheID assignment per container.
- Use proposed 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
+ Use 'container_aligned_compute_resources_count' metric which reports the count of containers getting aligned compute resources. See PR#127155 (https://github.com/kubernetes/kubernetes/pull/127155).
Same comment as above: please change that to a pointer to the code, not a PR.
/approve
the PRR section
/approve
/lgtm
My comments were addressed, so I think we can move forward.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ffromani, soltysh, wongchar. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.