
KEP-5194: DRA: ReservedFor Workloads in 1.34 #5379


Open · wants to merge 5 commits into base: master

Conversation

@mortent (Member) commented Jun 5, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 5, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 5, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jun 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 5, 2025
@@ -0,0 +1,3 @@
kep-number: 5194
alpha:
approver: "@johnbelamaric"
Member:

Probably best to pick someone else since I will be out all but two days between now and KEP freeze

Member Author:

Changed this to @wojtek-t. @wojtek-t are you ok with that?

Member:

Yes - I can take it

and therefore know that it is safe to deallocate the `ResourceClaim`.

This requires that the controller/user has permissions to update the status
subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
Member:

So, if the other controllers need to do it, then they each need to be granted update permissions to the status. Whereas for the resource claim controller to do it, it needs get/list/watch for that resource type. I think that the latter option has way less opportunity for orphaned resource claims. I would suggest we lead with that as the primary mechanism.

If the resource claim controller sees something in there and doesn't have the right permissions, it can complain in events and logs.

Realistically, this probably means giving the resourceclaim controller the ability to watch deployments, jobs, and statefulsets. And maybe a few others.
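For illustration, the extra read access this would imply for the resourceclaim controller might look something like the following sketch (the exact group/resource list is an assumption based on the types named above):

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

func main() {
	// Hypothetical additional RBAC for the resourceclaim controller under
	// this approach: read-only access to the workload types it would need
	// to watch in order to clean up ReservedFor references itself.
	rule := rbacv1.PolicyRule{
		APIGroups: []string{"apps", "batch"},
		Resources: []string{"deployments", "statefulsets", "jobs"},
		Verbs:     []string{"get", "list", "watch"},
	}
	fmt.Printf("%+v\n", rule)
}
```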

Member Author:

So I've updated the design here a bit.

What I think might be a challenge with making the resourceclaim controller responsible for removing the reference in the ReservedFor list is that we can probably only do this safely when the resource has been deleted (and even then I guess there can be orphaned pods). If a Deployment is scaled down to zero, there will not be any pods using the ResourceClaim, so it could potentially be deallocated. But I don't think the resourceclaim controller can deallocate just based on the number of replicas, since there would be a race between the deallocation and the possible creation of new pods. If the workload controller is responsible for this, it should be able to handle it safely.

Member Author:

Based on other feedback and thinking more about this, I have changed this to actually put the responsibility on the controllers. Two primary reasons for this:

  • It lets us implement a single solution that is available to both in-tree and out-of-tree controllers.
  • Controllers have more context about the workloads, so they are in a better position than the resourceclaim controller to decide when deallocation is safe.
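As a rough sketch of what that could look like from a workload controller's side (assuming the v1beta1 resource.k8s.io client; releaseClaim is a hypothetical helper, not an existing API):

```go
package workload

import (
	"context"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim removes the workload's own (non-pod) reference from
// status.ReservedFor once the controller has decided that no pods will
// consume the claim anymore; the resourceclaim controller can then
// deallocate the claim when the list becomes empty.
func releaseClaim(ctx context.Context, cs kubernetes.Interface, ns, claimName string, workloadUID types.UID) error {
	claim, err := cs.ResourceV1beta1().ResourceClaims(ns).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []resourcev1beta1.ResourceClaimConsumerReference
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID { // drop only our own reference
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	// Requires update permission on the resourceclaims/status subresource.
	_, err = cs.ResourceV1beta1().ResourceClaims(ns).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```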

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already running will continue to run and the allocated
devices will remain so.
Member:

What happens when they finish, though? Won't the dangling reference to, say, a Job in the ReservedFor do something weird in the deallocation process?

Member Author:

Yeah, I was thinking that removing them was handled by the workload controller and therefore wouldn't be affected if the feature was disabled. But with the change making the resourceclaim controller responsible for removing the reference, that is obviously not true. And even if we made it the responsibility of the workload controller, that functionality would probably be covered by the same feature gate.

I have updated this section.

Member:

What happens if kubelet restarts in the meantime? Won't it try to re-admit the pod (and fail, because now it would try to validate ReservedFor, which is not set to the pod)?

@mortent (Member Author) commented Jun 9, 2025

/assign @pohly

@mortent (Member Author) commented Jun 9, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 9, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 10, 2025
@wojtek-t wojtek-t self-assigned this Jun 10, 2025

Rather than expecting the `ReservedFor` field to contain an exhaustive list of
all pods using the ResourceClaim, we propose letting the controller managing
a ResourceClaim specify the reference to the resource consuming the claim
Member:

Isn't it enough to specify the ResourceClaim itself, which all pods that are supposed to use the allocated device refer to?

What is the use case for using a different object, since it may potentially cover a different set of pods than the ones that refer to this ResourceClaim?

Member Author:

So the main reason for having the reference here is that we need a way to signal to the resourceclaim controller that there are no more pods consuming the claim and it can therefore be deallocated. We have discussed two ways this can happen:

  • The resourceclaim controller checks whether the referenced resource exists, and if not, concludes there are no pods consuming the claim and deallocates it.
  • The workload controller decides when there are no more pods consuming the resourceclaim and therefore decides it can be deallocated.

In both cases, I think it is useful that we capture information about the "owning workload" of the ResourceClaim and I think the ReservedFor field seems like a reasonable place to do it.

As for having a workload managing pods where not all of them are consuming the ResourceClaim, for that scenario I think it will be up to the workload controller to decide when it is safe to remove the reference in the ReservedFor list and therefore deallocate the claim.
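For example, the workload-level entry could reuse the existing consumer reference type; the names below are made up:

```go
package main

import (
	"fmt"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// One Deployment-level entry in ReservedFor instead of one entry per pod.
	ref := resourcev1beta1.ResourceClaimConsumerReference{
		APIGroup: "apps",
		Resource: "deployments", // lowercase plural resource name
		Name:     "inference-server",
		UID:      types.UID("d9f7-example"), // placeholder; UID of the owning Deployment
	}
	fmt.Printf("%+v\n", ref)
}
```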

Member:

+1 to Morten

I think in theory we could consider ownerRef for that, but given we may want to create this ResourceClaim before the workload, there would be a chance of races here. So I think that having this explicit pointer is reasonable.

The `ReservedFor` list already accepts generic resource references, so this
field doesn't need to be changed. However, we are proposing adding two new
fields to the `ResourceClaim` type:
* `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in
Member:

If we assume that the ResourceClaim itself could be specified (and so all pods referring to it), why not treat it as the default way of reserving resources (after making the feature flag-gated)?

Is there any reason why the new approach cannot replace the existing one?

Member Author:

So I think there are two ways we use the ReservedFor list today:

  • To determine when the ResourceClaim can be deallocated.
  • To find all pods consuming (or referencing) the ResourceClaim.

We can deallocate when there are no more pods consuming the ResourceClaim. Without the ReservedFor list of pods, the only entity that can determine this is the controller responsible for managing creation/deletion of the pods, i.e. the workload controller. And it "communicates" with the resourceclaim controller by removing the reference in the ReservedFor list.

Finding the pods referencing the ResourceClaim is not sufficient for deallocation, because we have a race between the check and new pods being created. But for situations where we just need the pods referencing the claim, those are pretty easy to find. And the proposal here is that we will do that in the device_taint_eviction controller.
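To illustrate that second use, a sketch of how a component like the device_taint_eviction controller could find the pods referencing a claim; as said above, this is fine for eviction but not a safe basis for deallocation:

```go
package eviction

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsReferencingClaim lists the pods in a namespace that reference the
// given ResourceClaim. New pods can still appear after this returns, which
// is why the result cannot be used to decide that deallocation is safe.
func podsReferencingClaim(ctx context.Context, cs kubernetes.Interface, ns, claimName string) ([]corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var out []corev1.Pod
	for _, pod := range pods.Items {
		if podReferencesClaim(&pod, claimName) {
			out = append(out, pod)
		}
	}
	return out, nil
}

func podReferencesClaim(pod *corev1.Pod, claimName string) bool {
	// Directly named claims in the pod spec.
	for _, pc := range pod.Spec.ResourceClaims {
		if pc.ResourceClaimName != nil && *pc.ResourceClaimName == claimName {
			return true
		}
	}
	// Claims created from a ResourceClaimTemplate get generated names,
	// recorded in pod.Status.ResourceClaimStatuses.
	for _, s := range pod.Status.ResourceClaimStatuses {
		if s.ResourceClaimName != nil && *s.ResourceClaimName == claimName {
			return true
		}
	}
	return false
}
```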

the spec which resource is the consumer of the `ResourceClaim`. When the first pod
referencing the `ResourceClaim` is scheduled, the reference will be copied into
the `status.ReservedFor` list.
* `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA
Member:

The name ReservedForAnyPod may be a bit misleading and inaccurate. The claim would be reserved for any pod that is referencing this claim (or the other object specified in spec.ReservedFor).

Actually, the name ReservedFor itself is misleading; we should rather say that the ResourceClaim is AllocatedFor all pods that are referencing it. We should probably also say that the ResourceClaim should be deallocated when none of the pods referencing it is scheduled (assumed in the scheduler cache).

Member Author:

So I've tried to distinguish between a pod referencing a claim and a pod consuming a claim. The pod references in the list give us the latter, since they are set by the scheduler. But without the explicit list of pods, we have to find another way to handle deallocation. As mentioned in another comment, I don't think there is a safe way to do deallocation by listing the pods referencing the claim.

I agree that the naming isn't necessarily perfect, but I think the new names mostly make sense, assuming we keep the existing names. But I'm definitely open to changing them.
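To make the shape of the proposal concrete, a sketch of the two proposed fields as I read the KEP text (not merged API; names and placement subject to review):

```go
package sketch

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
)

// Sketch of the proposed additions only; existing fields are elided.
type ResourceClaimSpec struct {
	// ...existing ResourceClaimSpec fields elided...

	// ReservedFor lets the creator of the claim name the single (non-pod)
	// resource consuming it. When the first pod referencing the claim is
	// scheduled, this reference is copied into status.ReservedFor.
	ReservedFor *resourcev1beta1.ResourceClaimConsumerReference
}

type AllocationResult struct {
	// ...existing AllocationResult fields elided...

	// ReservedForAnyPod is set by the DRA scheduler plugin when a non-pod
	// reference is used; the kubelet then skips the per-pod ReservedFor
	// check at admission (see the excerpt later in this conversation).
	ReservedForAnyPod bool
}
```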

deleted or finish running. An empty list means there are no current consumers of the claim
and it can be deallocated.

#### Finding pods using a ResourceClaim
Member:

ReservedFor is also used to avoid races between different schedulers scheduling pods that share the same ResourceClaim (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md)

It may be hard to replace this use case since currently there is no good way of coordinating claim allocation between different schedulers. We are working on KEP #5287, which allows setting NominatedNodeName after the scheduler decides on a pod placement (the pod is assumed, meaning it's scheduled but not bound yet), but as currently designed this field also conveys more information than just whether a pod is assumed, so unfortunately this bit of information cannot be used directly.

@wojtek-t @sanposhiho @macsko

Member:

We are also working on an ultimate way of reserving resources that could address the problem of races between schedulers, but there are no details to share yet.

Member Author:

Do you have an example of how this could become a problem with multiple schedulers? I'm not familiar with that situation.

@mortent mortent requested a review from dom4ha June 10, 2025 22:27


##### Deallocation
The resourceclaim controller will remove Pod references from the `ReservedFor` list just
like it does now using the same logic. For non-Pod references, the controller will recognize
a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
Member:

I think this is somewhat misleading. I believe we should either make it work for an arbitrary owner or not at all.
Ignoring out-of-tree resources is not friendly to out-of-tree controllers, and additionally may lead to situations where people assume that it works (based on experience with in-tree types) and end up leaking resources...

I would personally suggest requiring a controller to unset it for Alpha and adding a beta graduation criterion to revisit that decision.

Member Author:

Yeah, I agree. I also think workload controllers are in a better position to decide when deallocation is safe, so there are several reasons for moving this responsibility to the workload controllers.

in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
it will be subject to the 256 limit on the size of the `ReservedFor` list and the
controller creating the `ResourceClaim` will not find the reference it expects
in the `ReservedFor` list when it tries to remove it.
Member:

To clarify: if there is already a reference to a pod in ReservedFor, then for newly scheduled pods the scheduler will continue adding those, despite spec.ReservedFor being set to something else, right?

[If so, it would be useful to add that explicitly here too]

Member Author:

Yeah, looking at this again I don't think there are any good ways to completely avoid situations where there might be both pod and non-pod references in the ReservedFor list. So I've called out here that this is something the controllers need to be able to handle.


`ReservedFor` list, despite there being a non-pod reference here. So it ends up with
both pod and non-pod references in the list. We need to make sure the system can
handle this, as it might also happen as a result of disablement and then re-enablement
of the feature.
Member:

Can we even do something about that?
Does kubelet do admission again after restart? @SergeyKanzhelev for your input

We're ending up in a situation where some pods may be listed and some aren't listed...

I have a feeling that we need to make an explicit recommendation that before downgrade you need to ensure that ReservedFor contains the list of pods.

Member Author:

I don't think there is a way to avoid this situation. Even if the kubelet doesn't do admission again after restart, the enabled->disabled->enabled flow would still lead to a situation where we would have both pod and non-pod references.

I think having both types of references can be managed. The workload controller shouldn't remove the reference until all pods consuming the claim have been removed, so once that happens there shouldn't be any pod references left (or, if there is a race, they should be removed soon after) and the claim can be deallocated. But it makes for weird semantics that the list can contain an incomplete list of pods.

So if providing recommendations about how to do safe downgrades or feature disablement is an option, that might be better in the long run. Once the feature reaches GA, it can't be removed through disablement or rollback, so this situation would no longer be possible.

Member:

> So if providing recommendations about how to do safe downgrades or feature disablement is an option, that might be better in the long run. Once the feature reaches GA, it can't be removed through disablement or rollback, so this situation would no longer be possible.

So I think that we should basically do both:

  • explicitly recommend that before downgrading (or disabling the feature) you ensure that the non-pod references in ReservedFor are replaced with the actual pod list [and that we will not provide any automation for it]
  • try to build some safety mechanism so the system doesn't blow up if that didn't happen (like a controller not removing its reference until there is a pod in that list)

Does that make sense?

Member Author:

Yeah, I think we can handle pod and non-pod references in the list. I've updated the section to be more specific here.
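A sketch of the kind of guard this implies for a workload controller, following the rule from my earlier comment (don't remove your own reference while pod references remain in the list); the helper name is made up:

```go
package workload

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// safeToRelease returns true only when no pod references remain besides the
// workload's own entry, so a mixed list (e.g. after disabling and then
// re-enabling the feature) is drained before the claim can be deallocated.
func safeToRelease(claim *resourcev1beta1.ResourceClaim, workloadUID types.UID) bool {
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID == workloadUID {
			continue // our own reference
		}
		if ref.Resource == "pods" {
			return false // a pod is still listed; keep our reference for now
		}
	}
	return true
}
```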

@wojtek-t (Member):

The missing sections will have to be filled in for Beta, but for Alpha this is good enough from a PRR perspective.

/approve PRR

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. If the kubelet sees that the `status.allocation.ReservedForAnyPod` is set, it will skip
the check that the Pod is listed in the `ReservedFor` list and just run the pod.

1. If the DRA scheduler plugin is trying to find candidates for deallocation in
Member:

The scheduler may need to deallocate a ResourceClaim in PostFilter, otherwise it won't be able to find an alternative allocation for it, so it cannot leave this up to an external component.

In workload scheduling, ResourceClaims may be allocated and deallocated several times during scheduling, so it has to be a fast, in-memory process.
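For reference, a minimal sketch of the kubelet-side check described in the excerpt above; since ReservedForAnyPod is only proposed, it is passed in explicitly here rather than read from the allocation, and the helper name is hypothetical:

```go
package kubeletcheck

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// podMayUseClaim mirrors the admission check: with the proposed
// ReservedForAnyPod flag set, the per-pod ReservedFor lookup is skipped.
func podMayUseClaim(claim *resourcev1beta1.ResourceClaim, podUID types.UID, reservedForAnyPod bool) bool {
	if claim.Status.Allocation == nil {
		return false // claim not allocated yet
	}
	if reservedForAnyPod {
		return true // proposed behavior: any pod referencing the claim may run
	}
	// Existing behavior: the pod must be listed in status.ReservedFor.
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID == podUID {
			return true
		}
	}
	return false
}
```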

Labels

cncf-cla: yes · kind/kep · sig/scheduling · size/XL · wg/device-management
Projects
Status: 👀 In review
Status: Needs Triage

6 participants