[KEP-5254] DRA: Constraints with CEL #5391
Conversation
aws-ruhip commented Jun 10, 2025
- One-line PR description: Add support for new CEL expression constraint in DRA
- Issue link: DRA: Constraints with CEL #5254
- Other comments:
Welcome @aws-ruhip!

Hi @aws-ruhip. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Force-pushed from 409e0f7 to f803450
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aws-ruhip. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from b6c73a0 to a97c185
Force-pushed from e4440e1 to b392c23
/wg device-management
I haven't looked at the current KEP yet; I only checked whether we have the right reviewers assigned.
- "@klueska" | ||
- "@pohly" | ||
approvers: | ||
- "@alculquicondor" # SIG Scheduling |
I will be OOO starting from tomorrow, so I cannot promise the review. @macsko will be OOO from Thursday and @sanposhiho may have limited bandwidth as well.
- "@pohly" | ||
approvers: | ||
- "@alculquicondor" # SIG Scheduling | ||
- "@dchen1107" # SIG Node |
You can remove SIG node entirely. This does not affect the kubelet at all.
- "@geetasg" | ||
owning-sig: sig-scheduling | ||
participating-sigs: | ||
- sig-node |
You can remove SIG node entirely. This does not affect the kubelet at all.
@@ -0,0 +1,3 @@
kep-number: 5254
alpha:
  approver: "@johnbelamaric"
I'm not sure whether @johnbelamaric has time - I doubt it.
Let's see who from the PRR reviewers takes this, if it's not too late.
@@ -0,0 +1,3 @@
kep-number: 5254
alpha:
  approver: "@johnbelamaric"
approver: "@johnbelamaric" | |
approver: "@deads2k" |
I'm not sure of the context, but David is already assigned on the PRR board.
Some general comments and some PRR comments.
The kube-scheduler is responsible for selecting devices to allocate based on this new CELDeviceConstraint. The scheduler already supports selecting devices according to MatchAttribute - this KEP will extend that to also support CELDeviceConstraint.
#### Non-monotonic constraint evaluation |
Does this mean that we have to evaluate the CEL against every possible set of devices to determine if they match?
Yes, the CEL expression cannot be evaluated until the required number of devices is present in the device set. If the devices do not match, a new device set is evaluated. This process repeats until a device set is found that satisfies the CEL expression.
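(As an aside, here is a rough sketch of that search, not the actual allocator code; `device`, `evalConstraint`, and `findDeviceSet` are hypothetical names. The allocator enumerates candidate sets of the required size and evaluates the expression against each complete set.)

```go
package main

import "fmt"

// device stands in for a candidate device and its attributes (hypothetical type).
type device struct {
	name  string
	attrs map[string]string
}

// evalConstraint stands in for evaluating the compiled CEL expression against a
// complete device set; only a set of the required size can be evaluated at all.
func evalConstraint(set []device) bool {
	// Stand-in rule: all selected devices must share the same "domain" attribute.
	for _, d := range set {
		if d.attrs["domain"] != set[0].attrs["domain"] {
			return false
		}
	}
	return true
}

// findDeviceSet tries combinations of `required` devices out of `candidates` and
// returns the first set that satisfies the constraint, if any.
func findDeviceSet(candidates []device, required int) ([]device, bool) {
	var recurse func(start int, current []device) ([]device, bool)
	recurse = func(start int, current []device) ([]device, bool) {
		if len(current) == required {
			if evalConstraint(current) {
				return append([]device(nil), current...), true
			}
			return nil, false // reject the whole set and keep searching
		}
		for i := start; i < len(candidates); i++ {
			if set, ok := recurse(i+1, append(current, candidates[i])); ok {
				return set, true
			}
		}
		return nil, false
	}
	return recurse(0, nil)
}

func main() {
	candidates := []device{
		{name: "gpu-0", attrs: map[string]string{"domain": "a"}},
		{name: "gpu-1", attrs: map[string]string{"domain": "b"}},
		{name: "gpu-2", attrs: map[string]string{"domain": "a"}},
	}
	set, ok := findDeviceSet(candidates, 2)
	fmt.Println(ok, set)
}
```

The important property is that nothing can be pruned per device: only a complete set can be accepted or rejected, which is what makes the constraint non-monotonic.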
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, through the feature gate. Pods that are already running will continue unaffected. New workloads trying to utilize the feature will not work. Admission will fail due to API error.
kubelet admission or kube-apiserver admission? kube-apiserver admission plugins are optional. Additionally, I don't see any description of admission in the KEP above, so I'm not clear on how this would happen.
Since we are modifying the resource claim template API spec to add this new cel constraint, any workloads trying to use this cel constraint while the feature gate is disabled will see an error like "cel.expression is not a field"
Since we are modifying the resource claim template API spec to add this new cel constraint, any workloads trying to use this cel constraint while the feature gate is disabled will see an error like "cel.expression is not a field"
That's not a common approach and extremely uncommon in the admission stage of the apiserver flow. Can you link to where you're familiar with an existing feature that behaves this way?
@deads2k We will not be relying on admission plugins. We will update the KEP to clarify the feature gate behavior. Thanks for the feedback!
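(For readers following the thread, a minimal sketch of the API shape being discussed; field and type names are hypothetical placeholders and may not match the final KEP. The new CEL constraint would sit next to the existing MatchAttribute on the constraint type and be guarded by the feature gate.)

```go
// Hypothetical sketch of the constraint type discussed in this thread; the actual
// field names and types in the KEP/API may differ.
package resource

// CELDeviceConstraint would hold a CEL expression evaluated against a candidate
// set of devices (placeholder shape).
type CELDeviceConstraint struct {
	// Expression is a CEL expression that must evaluate to true for the
	// selected set of devices.
	Expression string `json:"expression"`
}

// DeviceConstraint already exists with MatchAttribute; the new field would be
// added alongside it and dropped or rejected when the feature gate is disabled.
type DeviceConstraint struct {
	// Requests limits the constraint to the listed requests (existing field).
	Requests []string `json:"requests,omitempty"`

	// MatchAttribute requires that all selected devices share the same value
	// for this attribute (existing field; simplified to a string here).
	MatchAttribute *string `json:"matchAttribute,omitempty"`

	// CEL is the new, feature-gated constraint (hypothetical field name).
	CEL *CELDeviceConstraint `json:"cel,omitempty"`
}
```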
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
scheduler throughput and the scheduler latency both seem like valuable metrics here.
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Merely enabling the feature has no resource impact. When in use, resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.
- Merely enabling the feature has no resource impact. When in use, resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.
+ Resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.
Assume the feature is going to be used when writing PRR answers. How will we check the actual cost in a running cluster?
For CEL selectors, we use cost estimation from the cel k8s library: https://github.com/kubernetes/kubernetes/blob/2cc3dbf2250177c60b2967e3e8481ac3f7e33829/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go#L191
We will leverage the same mechanism for CEL constraints.
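(As an illustrative sketch only, assuming cel-go's `EstimateCost` with a trivial fallback estimator; the linked `compile.go` uses a richer, size-aware estimator and a different environment. Validation could reject an expression whose worst-case estimate exceeds a configured limit.)

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
	"github.com/google/cel-go/checker"
)

// defaultEstimator falls back to cel-go's built-in cost model by returning nil
// from both methods (the real DRA code plugs in a size-aware estimator).
type defaultEstimator struct{}

func (defaultEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate { return nil }
func (defaultEstimator) EstimateCallCost(function, overloadID string, target *checker.AstNode, args []checker.AstNode) *checker.CallEstimate {
	return nil
}

// maxCost is a placeholder for the per-expression limit (the KEP references the
// existing CELSelectorExpressionMaxCost mechanism for the real value).
const maxCost = 1_000_000

func main() {
	// "device" as a dyn variable is a placeholder; the real DRA CEL environment differs.
	env, err := cel.NewEnv(cel.Variable("device", cel.DynType))
	if err != nil {
		panic(err)
	}
	ast, issues := env.Compile(`device.driver == "example.com"`)
	if issues != nil && issues.Err() != nil {
		panic(issues.Err())
	}
	est, err := env.EstimateCost(ast, defaultEstimator{})
	if err != nil {
		panic(err)
	}
	if est.Max > maxCost {
		fmt.Printf("rejected: estimated worst-case cost %d exceeds limit %d\n", est.Max, maxCost)
		return
	}
	fmt.Printf("accepted: estimated cost range [%d, %d]\n", est.Min, est.Max)
}
```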
Are you going to expose that as a metric? Also, I remember this as an estimate, not the actual. This new API is going to result in repeated evaluations and seems pertinent to track its (potentially significant) expense.
cc @wojtek-t
Are you going to expose that as a metric? Also, I remember this as an estimate, not the actual. This new API is going to result in repeated evaluations and seems pertinent to track its (potentially significant) expense.
Did we have any metric once we introduced CEL for validation in the apiserver?
I don't have any reasonable suggestion off the top of my head:
- what we really care about is CPU itself, but I think data about how much is used here in particular can only be read from profiles
- the only thing that comes to mind is trying to report how many combinations are evaluated (and then using some average CEL-evaluation cost to estimate the overall cost); but that's super hand-wavy
Summarizing a slack thread in here. Permutations take smaller numbers and make really big ones. On a cluster with 10^3 nodes and 10^3 pods using it, an expression that is P(16,8) becomes a potential worst case of 10^14 checks (compared with 10^6 without this change).
On the surface this appears to be a big change. Agreeing beforehand on what parameters make a good benchmark, and thinking through acceptable thresholds for changes to that benchmark with sig-scalability, needs to happen. I haven't yet thought of a different way to be comfortable with the PRR scaling questions here.
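(For concreteness, a back-of-the-envelope check of those numbers, assuming the worst case scales as nodes × pods × P(16,8) as in the comment above:)

```go
package main

import "fmt"

// permutations returns P(n, k) = n! / (n-k)!, the number of ordered
// selections of k devices out of n candidates.
func permutations(n, k int) uint64 {
	result := uint64(1)
	for i := 0; i < k; i++ {
		result *= uint64(n - i)
	}
	return result
}

func main() {
	perAllocation := permutations(16, 8) // 518,918,400 ≈ 5.2e8 orderings per allocation
	nodes, pods := uint64(1000), uint64(1000)

	fmt.Printf("P(16,8) = %d (≈ %.1e)\n", perAllocation, float64(perAllocation))
	// Without this change the worst case is roughly nodes * pods ≈ 1e6 checks;
	// with it, every pod on every node may explore the full permutation space.
	fmt.Printf("worst case checks ≈ %.1e\n", float64(nodes*pods*perAllocation))
}
```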
Also, I remember this as an estimate, not the actual.
CEL can also track actual cost during execution, which is indicative of cpu usage: https://pkg.go.dev/github.com/google/cel-go/cel#EvalDetails.ActualCost
This is used in CEL selector evaluation here: https://github.com/kubernetes/kubernetes/blob/b0b52f4fb29223f829b50ae58323c20a7764dfc9/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L970
We can surface a metric that tracks evaluation cost for CEL constraint workloads.
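(A minimal sketch of that runtime tracking with cel-go; the environment, variable name, and expression here are placeholders rather than the actual DRA CEL environment.)

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	env, err := cel.NewEnv(cel.Variable("device", cel.DynType))
	if err != nil {
		panic(err)
	}
	ast, issues := env.Compile(`device.driver == "example.com"`)
	if issues != nil && issues.Err() != nil {
		panic(issues.Err())
	}
	// OptTrackCost makes Eval report the actual cost spent on this evaluation,
	// which could be accumulated into a metric per constraint evaluation.
	prg, err := env.Program(ast, cel.EvalOptions(cel.OptTrackCost))
	if err != nil {
		panic(err)
	}
	out, details, err := prg.Eval(map[string]any{
		"device": map[string]any{"driver": "example.com"},
	})
	if err != nil {
		panic(err)
	}
	if details != nil {
		if cost := details.ActualCost(); cost != nil {
			fmt.Printf("result=%v actualCost=%d\n", out, *cost)
		}
	}
}
```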
cc @pohly @johnbelamaric do we have rough scaling targets for DRA devices currently?
Force-pushed from b392c23 to a10403e
A malicious or buggy workload can specify CEL expressions that degrade the performance of constraint evaluation and scheduling. We will specify a limit on evaluation cost for the expression. There is already a mechanism to cap this today with CEL selectors that we can reuse. [Reference](https://github.com/kubernetes/kubernetes/blob/6188e5cb7b2f106b047493b7b498c1882723cab4/pkg/apis/resource/types.go#L910-L933)

#### Performance at scale
I strongly feel that just adding a CEL expression is a simple solution, but it significantly (even exponentially) increases the complexity of the filtering phase in all scenarios not supported by MatchAttribute. If we could group use cases and implement faster algorithms or, at least, reduce the problem space for them, that would be beneficial.
Also, keep workload-aware scheduling in mind, which will be introduced in upcoming releases. In some scenarios, it might need to run the filtering multiple times, and relying on really heavy algorithms could completely slow down scheduling.
But, we could leave this to the beta stage or next revision of alpha, so please document it accordingly in the KEP (and say that this has to be revisited before beta).
@macsko Thank you for the review. We will update the KEP as per the feedback. Can you please link the issue/info on workload-aware scheduling? cc @aws-ruhip
The issue: kubernetes/kubernetes#132192
After enabling the feature, if the scheduler_pending_pods metric in the kube-scheduler suddenly increases, then perhaps scheduling no longer works as intended.
Or when no new pods are created in the cluster but the metric stays constant, indicating that no pods have been scheduled.
Force-pushed from a10403e to e5a7fb4