
[KEP-5254] DRA: Constraints with CEL #5391


Open

wants to merge 1 commit into master from dra-constraints-with-cel

Conversation

aws-ruhip

  • One-line PR description: Add support for new CEL expression constraint in DRA
  • Other comments:


linux-foundation-easycla bot commented Jun 10, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: aws-ruhip / name: Ruhi Prasad (e5a7fb4)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 10, 2025
@k8s-ci-robot
Contributor

Welcome @aws-ruhip!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @aws-ruhip. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from dom4ha June 10, 2025 06:05
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 10, 2025
@k8s-ci-robot k8s-ci-robot requested a review from macsko June 10, 2025 06:05
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jun 10, 2025
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 10, 2025
@geetasg

geetasg commented Jun 10, 2025

@dims
Member

dims commented Jun 10, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 10, 2025
@aws-ruhip aws-ruhip force-pushed the dra-constraints-with-cel branch 2 times, most recently from 409e0f7 to f803450 Compare June 10, 2025 17:23
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aws-ruhip
Once this PR has been reviewed and has the lgtm label, please assign johnbelamaric, macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aws-ruhip aws-ruhip force-pushed the dra-constraints-with-cel branch 4 times, most recently from b6c73a0 to a97c185 Compare June 10, 2025 18:53
@geetasg geetasg mentioned this pull request Jun 11, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 12, 2025
@geetasg geetasg force-pushed the dra-constraints-with-cel branch from e4440e1 to b392c23 Compare June 12, 2025 15:28
@k8s-ci-robot k8s-ci-robot removed the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 12, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 12, 2025
@pohly
Contributor

pohly commented Jun 13, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 13, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 13, 2025
Contributor

@pohly pohly left a comment

I haven't looked at the current KEP yet, I only checked whether we have the right reviewers assigned.

- "@klueska"
- "@pohly"
approvers:
- "@alculquicondor" # SIG Scheduling
Contributor

@dom4ha, @macsko: which of you can take point on this one?

Member

I will be OOO starting from tomorrow, so I cannot promise the review. @macsko will be OOO from Thursday and @sanposhiho may have limited bandwidth as well.

- "@pohly"
approvers:
- "@alculquicondor" # SIG Scheduling
- "@dchen1107" # SIG Node
Contributor

You can remove SIG node entirely. This does not affect the kubelet at all.

- "@geetasg"
owning-sig: sig-scheduling
participating-sigs:
- sig-node
Contributor

You can remove SIG node entirely. This does not affect the kubelet at all.

@@ -0,0 +1,3 @@
kep-number: 5254
alpha:
approver: "@johnbelamaric"
Contributor

I'm not sure whether @johnbelamaric has time - I doubt it.

Let's see who from the PRR reviewers takes this, if it's not too late.

@@ -0,0 +1,3 @@
kep-number: 5254
alpha:
approver: "@johnbelamaric"
Contributor

Suggested change:
- approver: "@johnbelamaric"
+ approver: "@deads2k"

I'm not sure of the context, but David is already assigned on the PRR board.

Contributor

@deads2k deads2k left a comment

Some general comments and some PRR comments.


The kube-scheduler is responsible for selecting devices to allocate based on this new CELDeviceConstraint. The scheduler already supports selecting devices according to MatchAttribute - this KEP will extend that to also support CELDeviceConstraint.

#### Non-monotonic constraint evaluation
Contributor

Does this mean that we have to evaluate the CEL against every possible set of devices to determine if they match?

Author

Yes, the CEL expression cannot be evaluated until the required number of devices are present in the device set. If the set does not satisfy the expression, a new device set will be evaluated. This process repeats until a device set that satisfies the CEL expression is found.
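
As a rough illustration of that search loop, here is a minimal standalone sketch using cel-go. The `devices` variable, the attribute layout, and the expression are made-up stand-ins for the example, not the KEP's API:

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

// combinations returns every subset of size k from items.
func combinations(items []map[string]any, k int) [][]map[string]any {
	var out [][]map[string]any
	var rec func(start int, cur []map[string]any)
	rec = func(start int, cur []map[string]any) {
		if len(cur) == k {
			out = append(out, append([]map[string]any(nil), cur...))
			return
		}
		for i := start; i < len(items); i++ {
			rec(i+1, append(cur, items[i]))
		}
	}
	rec(0, nil)
	return out
}

func main() {
	// "devices" is a hypothetical variable: the candidate device set under evaluation.
	env, err := cel.NewEnv(
		cel.Variable("devices", cel.ListType(cel.MapType(cel.StringType, cel.DynType))),
	)
	if err != nil {
		panic(err)
	}

	// Example constraint: every chosen device must come from the same pool.
	ast, iss := env.Compile(`devices.all(d, d.pool == devices[0].pool)`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}
	prg, err := env.Program(ast)
	if err != nil {
		panic(err)
	}

	candidates := []map[string]any{
		{"name": "gpu-0", "pool": "a"},
		{"name": "gpu-1", "pool": "b"},
		{"name": "gpu-2", "pool": "a"},
	}
	need := 2 // number of devices the claim requests

	// The expression can only be checked against a complete set, so try each
	// combination of the required size until one satisfies it.
	for _, set := range combinations(candidates, need) {
		out, _, err := prg.Eval(map[string]any{"devices": set})
		if err != nil {
			continue // treat evaluation errors as "does not match"
		}
		if matched, _ := out.Value().(bool); matched {
			fmt.Println("found matching set:", set)
			return
		}
	}
	fmt.Println("no matching set")
}
```

The point is the shape of the loop: compile once, then evaluate the same program against each candidate set of the required size until one matches.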


###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, through the feature gate. Pods that are already running will continue unaffected. New workloads trying to utilize the feature will not work. Admission will fail due to API error.
Contributor

kubelet admission or kube-apiserver admission? kube-apiserver admission plugins are optional. Additionally, I don't see any description of admission in the KEP above, so I'm not clear on how this would happen.

Author

Since we are modifying the resource claim template API spec to add this new cel constraint, any workloads trying to use this cel constraint while the feature gate is disabled will see an error like "cel.expression is not a field"

Contributor

Since we are modifying the resource claim template API spec to add this new cel constraint, any workloads trying to use this cel constraint while the feature gate is disabled will see an error like "cel.expression is not a field"

That's not a common approach and extremely uncommon in the admission stage of the apiserver flow. Can you link to where you're familiar with an existing feature that behaves this way?


@deads2k We will not be relying on admission plugins. We will update the KEP to clarify the feature gate behavior. Thanks for the feedback!
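
For illustration, a self-contained sketch of the usual in-tree pattern for feature-gated API fields (drop the disabled field on create unless the stored object already uses it, rather than relying on an admission plugin). The type, field, and gate names below are hypothetical stand-ins, not the final API:

```go
package main

import "fmt"

// Stand-ins for the real API types; the CEL field is the hypothetical addition.
type CELDeviceConstraint struct{ Expression string }

type DeviceConstraint struct {
	Requests       []string
	MatchAttribute *string
	CEL            *CELDeviceConstraint // hypothetical new field
}

type ClaimSpec struct{ Constraints []DeviceConstraint }

// featureEnabled stands in for a real feature-gate check
// (e.g. utilfeature.DefaultFeatureGate.Enabled(...)).
var featureEnabled = false

func celConstraintInUse(spec *ClaimSpec) bool {
	if spec == nil {
		return false
	}
	for _, c := range spec.Constraints {
		if c.CEL != nil {
			return true
		}
	}
	return false
}

// dropDisabledFields clears the CEL field on create/update when the gate is off,
// unless the existing object already uses it (so stored objects keep working).
func dropDisabledFields(newSpec, oldSpec *ClaimSpec) {
	if featureEnabled || celConstraintInUse(oldSpec) {
		return
	}
	for i := range newSpec.Constraints {
		newSpec.Constraints[i].CEL = nil
	}
}

func main() {
	spec := &ClaimSpec{Constraints: []DeviceConstraint{
		{Requests: []string{"gpu"}, CEL: &CELDeviceConstraint{Expression: `true`}},
	}}
	dropDisabledFields(spec, nil)
	fmt.Printf("CEL after drop: %v\n", spec.Constraints[0].CEL) // <nil> while the gate is off
}
```

With the gate off, the new field is cleared before validation ever sees it, while objects that already use the field keep working if the gate is later disabled.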

question.
-->

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
Contributor

scheduler throughput and the scheduler latency both seem like valuable metrics here.


###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

Merely enabling the feature has no resource impact. When in use, resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.
Contributor

Suggested change:
- Merely enabling the feature has no resource impact. When in use, resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.
+ Resource consumption will scale with the number of device combinations being evaluated, varying based on specific allocation patterns.

Assume the feature is going to be used when writing PRR answers. How will we check the actual cost in a running cluster?

Author

For CEL selectors, we use cost estimation from the cel k8s library: https://github.com/kubernetes/kubernetes/blob/2cc3dbf2250177c60b2967e3e8481ac3f7e33829/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go#L191

We will leverage the same mechanism for CEL constraints.

Contributor

Are you going to expose that as a metric? Also, I remember this as an estimate, not the actual. This new API is going to result in repeated evaluations and seems pertinent to track its (potentially significant) expense.

cc @wojtek-t

Member

Are you going to expose that as a metric? Also, I remember this as an estimate, not the actual. This new API is going to result in repeated evaluations and seems pertinent to track its (potentially significant) expense.

Did we have any metric once we introduced CEL for validation in apiserver?

I don't have any reasonable suggestion off the top of my head:

  • what we really care about is CPU itself, but I think how much is used here in particular can only be read from profiles
  • the only thing that comes to mind is reporting how many combinations are evaluated (and then using some average CEL-evaluation cost to estimate the total cost); but that's super hand-wavy

Contributor

Summarizing a Slack thread here: permutations turn small numbers into really big ones. On a cluster with 10^3 nodes and 10^3 pods using it, an expression that is P(16,8) becomes a potential worst case of 10^14 checks (compared with 10^6 without this change).
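(Roughly: P(16,8) = 16·15·14·13·12·11·10·9 ≈ 5.2×10^8 ordered selections per evaluation, and 10^3 pods × 10^3 nodes × 5.2×10^8 ≈ 5×10^14, versus 10^3 × 10^3 = 10^6 evaluations without per-combination constraint checks.)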

On the surface this appears to be a big change. Agreeing with sig-scalability beforehand on what parameters make a good benchmark, and thinking through acceptable thresholds for changes to that benchmark, needs to happen. I haven't yet thought of a different way to be comfortable with the PRR scaling questions here.

Author

Also, I remember this as an estimate, not the actual.

CEL can also track actual cost during execution, which is indicative of cpu usage: https://pkg.go.dev/github.com/google/cel-go/cel#EvalDetails.ActualCost

This is used in CEL selector evaluation here: https://github.com/kubernetes/kubernetes/blob/b0b52f4fb29223f829b50ae58323c20a7764dfc9/staging/src/k8s.io/dynamic-resource-allocation/structured/allocator.go#L970

We can surface a metric that tracks evaluation cost for CEL constraint workloads.
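
A minimal sketch of what that could look like with cel-go's runtime cost tracking; the expression, variable name, and cost cap are illustrative assumptions, not what the KEP specifies:

```go
package main

import (
	"fmt"

	"github.com/google/cel-go/cel"
)

func main() {
	env, err := cel.NewEnv(
		cel.Variable("devices", cel.ListType(cel.MapType(cel.StringType, cel.DynType))),
	)
	if err != nil {
		panic(err)
	}
	ast, iss := env.Compile(`devices.all(d, d.pool == devices[0].pool)`)
	if iss != nil && iss.Err() != nil {
		panic(iss.Err())
	}

	// Track the actual cost of each evaluation and abort if it exceeds a cap.
	prg, err := env.Program(ast,
		cel.EvalOptions(cel.OptTrackCost),
		cel.CostLimit(10000), // illustrative cap, not the KEP's number
	)
	if err != nil {
		panic(err)
	}

	input := map[string]any{"devices": []map[string]any{
		{"name": "gpu-0", "pool": "a"},
		{"name": "gpu-1", "pool": "a"},
	}}
	out, details, err := prg.Eval(input)
	if err != nil {
		panic(err) // includes the case where the cost limit was exceeded
	}
	if details != nil {
		if cost := details.ActualCost(); cost != nil {
			// This per-evaluation cost could be accumulated into a scheduler metric.
			fmt.Printf("matched=%v actualCost=%d\n", out.Value(), *cost)
		}
	}
}
```

Accumulating that per-evaluation cost across all device sets tried for a claim would give a per-scheduling-attempt number that could be exported as a histogram.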

Member

cc @pohly @johnbelamaric do we have rough scaling targets for DRA devices currently?

@github-project-automation github-project-automation bot moved this from Needs Triage to Needs Review in SIG Scheduling Jun 16, 2025
@aws-ruhip aws-ruhip force-pushed the dra-constraints-with-cel branch from b392c23 to a10403e Compare June 18, 2025 04:56

A malicious or buggy workload can specify CEL expressions that degrade the performance of constraint evaluation and scheduling. We will specify a limit on evaluation cost for the expression. There is already a mechanism to cap this today with CEL selectors that we can reuse. [Reference](https://github.com/kubernetes/kubernetes/blob/6188e5cb7b2f106b047493b7b498c1882723cab4/pkg/apis/resource/types.go#L910-L933)

#### Performance at scale
Member

I strongly feel that just adding a CEL expression is a simple solution, but it significantly (even exponentially) increases the complexity of the filtering phase in all scenarios not supported by MatchAttribute. If we could group use cases and implement faster algorithms or, at least, reduce the problem space for them, that would be beneficial.
Also, keep workload-aware scheduling in mind, which will be introduced in upcoming releases. In some scenarios it might need to run the filtering multiple times, and relying on really heavy algorithms could slow scheduling down significantly.

But, we could leave this to the beta stage or next revision of alpha, so please document it accordingly in the KEP (and say that this has to be revisited before beta).


@macsko Thank you for the review. We will update the KEP per the feedback. Can you please link an issue or more info on workload-aware scheduling? cc @aws-ruhip

Member

that might indicate a serious problem?
-->

After enabling the feature, if the scheduler_pending_pods metric in the kube-scheduler suddenly increases, then perhaps scheduling no longer works as intended.
Member

Or when no new pods are created in the cluster but the metric stays constant, indicating that no pending pods are being scheduled.

@aws-ruhip aws-ruhip force-pushed the dra-constraints-with-cel branch from a10403e to e5a7fb4 Compare June 18, 2025 19:48
Labels
cncf-cla: yes · kind/kep · ok-to-test · sig/scheduling · size/XXL · wg/device-management
Projects
Status: 👀 In review
Status: Needs Review