
KEP-5194: DRA: ReservedFor Workloads in 1.34 #5379


Open · wants to merge 5 commits into base: master

Conversation

@mortent (Member) commented Jun 5, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 5, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jun 5, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Jun 5, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 5, 2025
@@ -0,0 +1,3 @@
kep-number: 5194
alpha:
approver: "@johnbelamaric"
Member:

Probably best to pick someone else since I will be out all but two days between now and KEP freeze

Member Author:

Changed this to @wojtek-t. @wojtek-t are you ok with that?

Member:

Yes - I can take it

and therefore know that it is safe to deallocate the `ResourceClaim`.

This requires that the controller/user has permissions to update the status
subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
Member:

So, if the other controllers need to do it, then they each need to be granted update permissions to the status. Whereas for the resource claim controller to do it, it needs get/list/watch for that resource type. I think that the latter option has way less opportunity for orphaned resource claims. I would suggest we lead with that as the primary mechanism.

If the resource claim controller sees something in there and doesn't have the right permissions, it can complain in events and logs.

Realistically, this probably means giving the resourceclaim controller the ability to watch deployments, jobs, and statefulsets. And maybe a few others.
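For illustration, the extra read access this would imply for the resourceclaim controller might look something like the following sketch (the exact group/resource list is an assumption based on the types named above):

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

func main() {
	// Hypothetical additional RBAC for the resourceclaim controller under
	// this approach: read-only access to the workload types it would need
	// to watch in order to clean up ReservedFor references itself.
	rule := rbacv1.PolicyRule{
		APIGroups: []string{"apps", "batch"},
		Resources: []string{"deployments", "statefulsets", "jobs"},
		Verbs:     []string{"get", "list", "watch"},
	}
	fmt.Printf("%+v\n", rule)
}
```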

Member Author:

So I've updated the design here a bit.

What I think might be a challenge with making the resourceclaim controller responsible for removing the reference in the ReservedFor list is that we can probably only do this safely when the resource has been deleted (and even then I guess there can be orphaned pods). If a Deployment is scaled down to zero, there will not be any pods using the ResourceClaim, so it could potentially be deallocated. But I don't think the resourceclaim controller can deallocate just based on the number of replicas, since there would be a race between the deallocation and the possible creation of new pods. If the workload controller is responsible for this, it should be able to handle it safely.

Member Author:

Based on other feedback and thinking more about this, I have changed this to actually put the responsibility on the controllers. Two primary reasons for this:

  • It lets us implement a single solution that is available to both in-tree and out-of-tree controllers.
  • Controllers have more context about the workloads, so they are in a better position than the resourceclaim controller to decide when deallocation is safe.
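As a rough sketch of what that could look like from a workload controller's side (assuming the v1beta1 resource.k8s.io client; releaseClaim is a hypothetical helper, not an existing API):

```go
package workload

import (
	"context"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// releaseClaim removes the workload's own (non-pod) reference from
// status.ReservedFor once the controller has decided that no pods will
// consume the claim anymore; the resourceclaim controller can then
// deallocate the claim when the list becomes empty.
func releaseClaim(ctx context.Context, cs kubernetes.Interface, ns, claimName string, workloadUID types.UID) error {
	claim, err := cs.ResourceV1beta1().ResourceClaims(ns).Get(ctx, claimName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	var kept []resourcev1beta1.ResourceClaimConsumerReference
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID { // drop only our own reference
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	// Requires update permission on the resourceclaims/status subresource.
	_, err = cs.ResourceV1beta1().ResourceClaims(ns).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```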

###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already running will continue to run and the allocated
devices will remain so.
Member:

What happens when they finish, though? Won't the dangling reference to, say, a Job in the ReservedFor do something weird in the deallocation process?

Member Author:

Yeah, I was thinking that removing them was handled by the workload controller and therefore wouldn't be affected if the feature was disabled. But with the change making the resourceclaim controller responsible for removing the reference, that is obviously not true. And even if we made it the responsibility of the workload controller, that functionality would probably be covered by the same feature gate.

I have updated this section.

Member:

What happens if kubelet restarts in the meantime? Won't it try to re-admit the pod (and fail, because now it would try to validate ReservedFor, which is not set to the pod)?

@mortent (Member Author) commented Jun 9, 2025

/assign @pohly

@mortent (Member Author) commented Jun 9, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 9, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 10, 2025
@wojtek-t wojtek-t self-assigned this Jun 10, 2025

Rather than expecting the `ReservedFor` field to contain an exhaustive list of
all pods using the ResourceClaim, we propose letting the controller managing
a ResourceClaim specify the reference to the resource consuming the claim
Member:

Isn't it enough to specify the ResourceClaim itself, which all pods that are supposed to use the allocated device refer to?

What is the use case for using a different object, since it may potentially cover a different set of pods than the ones that refer to this ResourceClaim?

Member Author:

So the main reason for having the reference here is that we need a way to signal to the resourceclaim controller that there are no more pods consuming the claim and it can therefore be deallocated. We have discussed two ways this can happen:

  • The resourceclaim controller checks whether the referenced resource exists, and if not, concludes there are no pods consuming the claim and deallocates it.
  • The workload controller decides when there are no more pods consuming the resourceclaim and therefore decides it can be deallocated.

In both cases, I think it is useful that we capture information about the "owning workload" of the ResourceClaim and I think the ReservedFor field seems like a reasonable place to do it.

As for having a workload managing pods where not all of them are consuming the ResourceClaim, for that scenario I think it will be up to the workload controller to decide when it is safe to remove the reference in the ReservedFor list and therefore deallocate the claim.
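For example, the workload-level entry could reuse the existing consumer reference type; the names below are made up:

```go
package main

import (
	"fmt"

	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// One Deployment-level entry in ReservedFor instead of one entry per pod.
	ref := resourcev1beta1.ResourceClaimConsumerReference{
		APIGroup: "apps",
		Resource: "deployments", // lowercase plural resource name
		Name:     "inference-server",
		UID:      types.UID("d9f7-example"), // placeholder; UID of the owning Deployment
	}
	fmt.Printf("%+v\n", ref)
}
```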

Member:

+1 to Morten

I think in theory we could consider ownerRef for that, but given we may want to create this ResourceClaim before the workload, there would be a chance of races here. So I think that having this explicit pointer is reasonable.

The `ReservedFor` list already accepts generic resource references, so this
field doesn't need to be changed. However, we are proposing adding two new
fields to the `ResourceClaim` type:
* `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in
Member:

If we assume that the ResourceClaim itself could be specified (and so all pods referring to it), why not treat it as the default way of reserving resources (after making the feature flag-gated)?

Is there any reason why the new approach cannot replace the existing one?

Member Author:

So I think there are two ways we use the ReservedFor list today:

  • To determine when the ResourceClaim can be deallocated.
  • To find all pods consuming (or referencing) the ResourceClaim.

We can deallocate when there are no more pods consuming the ResourceClaim. Without the ReservedFor list of pods, the only entity that can determine this is the controller responsible for managing creation/deletion of the pods, i.e. the workload controller. And it "communicates" with the resourceclaim controller by removing the reference in the ReservedFor list.

Finding the pods referencing the ResourceClaim is not sufficient for deallocation, because we have a race between the check and new pods being created. But for situations where we just need the pods referencing the claim, those are pretty easy to find. And the proposal here is that we will do that in the device_taint_eviction controller.
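To illustrate that second use, a sketch of how a component like the device_taint_eviction controller could find the pods referencing a claim; as said above, this is fine for eviction but not a safe basis for deallocation:

```go
package eviction

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// podsReferencingClaim lists the pods in a namespace that reference the
// given ResourceClaim. New pods can still appear after this returns, which
// is why the result cannot be used to decide that deallocation is safe.
func podsReferencingClaim(ctx context.Context, cs kubernetes.Interface, ns, claimName string) ([]corev1.Pod, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var out []corev1.Pod
	for _, pod := range pods.Items {
		if podReferencesClaim(&pod, claimName) {
			out = append(out, pod)
		}
	}
	return out, nil
}

func podReferencesClaim(pod *corev1.Pod, claimName string) bool {
	// Directly named claims in the pod spec.
	for _, pc := range pod.Spec.ResourceClaims {
		if pc.ResourceClaimName != nil && *pc.ResourceClaimName == claimName {
			return true
		}
	}
	// Claims created from a ResourceClaimTemplate get generated names,
	// recorded in pod.Status.ResourceClaimStatuses.
	for _, s := range pod.Status.ResourceClaimStatuses {
		if s.ResourceClaimName != nil && *s.ResourceClaimName == claimName {
			return true
		}
	}
	return false
}
```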

the spec which resource is the consumer of the `ResourceClaim`. When the first pod
referencing the `ResourceClaim` is scheduled, the reference will be copied into
the `status.ReservedFor` list.
* `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA
Member:

The name ReservedForAnyPod may be a bit misleading and inaccurate. The claim would be reserved for any pod that is referencing this claim (or the other object specified in spec.ReservedFor).

Actually, the name ReservedFor itself is misleading; we should rather say that the ResourceClaim is AllocatedFor all pods that are referencing it. We should probably also say that the ResourceClaim should be deallocated when none of the pods referencing it is scheduled (assumed in the scheduler cache).

Member Author:

So I've tried to distinguish between a pod referencing a claim and a pod consuming a claim. The pod references in the list give us the latter, since they are set by the scheduler. But without the explicit list of pods, we have to find another way to handle deallocation. As mentioned in another comment, I don't think there is a safe way to do deallocation by listing the pods referencing the claim.

I agree that the naming isn't necessarily perfect, but I think the new names mostly make sense, assuming we keep the existing names. But I'm definitely open to changing them.
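To make the shape of the proposal concrete, a sketch of the two proposed fields as I read the KEP text (not merged API; names and placement subject to review):

```go
package sketch

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
)

// Sketch of the proposed additions only; existing fields are elided.
type ResourceClaimSpec struct {
	// ...existing ResourceClaimSpec fields elided...

	// ReservedFor lets the creator of the claim name the single (non-pod)
	// resource consuming it. When the first pod referencing the claim is
	// scheduled, this reference is copied into status.ReservedFor.
	ReservedFor *resourcev1beta1.ResourceClaimConsumerReference
}

type AllocationResult struct {
	// ...existing AllocationResult fields elided...

	// ReservedForAnyPod is set by the DRA scheduler plugin when a non-pod
	// reference is used; the kubelet then skips the per-pod ReservedFor
	// check at admission (see the excerpt later in this conversation).
	ReservedForAnyPod bool
}
```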

deleted or finish running. An empty list means there are no current consumers of the claim
and it can be deallocated.

#### Finding pods using a ResourceClaim
Member:

ReservedFor is also used to avoid races between different schedulers scheduling pods that share the same ResourceClaim (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md)

It may be hard to replace this use case since currently there is no good way of coordinating claim allocation between different schedulers. We are working on KEP #5287, which allows setting NominatedNodeName after the scheduler decides on a pod placement (the pod is assumed, meaning it's scheduled but not bound yet), but as currently designed this field also conveys more information than just whether a pod is assumed, so unfortunately this bit of information cannot be used directly.

@wojtek-t @sanposhiho @macsko

Member:

We are also working on an ultimate way of reserving resources that could address the problem of races between schedulers, but there are no details to share yet.

Member Author:

Do you have an example of how this could become a problem with multiple schedulers? I'm not familiar with that situation.

@mortent mortent requested a review from dom4ha June 10, 2025 22:27


##### Deallocation
The resourceclaim controller will remove Pod references from the `ReservedFor` list just
like it does now using the same logic. For non-Pod references, the controller will recognize
a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
Member:

I think this is somewhat misleading. I believe we should either make it work for an arbitrary owner or not at all.
Ignoring out-of-tree resources is not friendly to out-of-tree controllers, and additionally may lead to situations where people assume that it works (based on experience with in-tree types) and end up leaking resources...

I would personally suggest requiring a controller to unset it for Alpha and adding a beta graduation criterion to revisit that decision.

Member Author:

Yeah, I agree. I also think workload controllers are in a better position to decide when deallocation is safe, so there are several reasons for moving this responsibility to the workload controllers.

in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
it will be subject to the 256 limit on the size of the `ReservedFor` list and the
controller creating the `ResourceClaim` will not find the reference it expects
in the `ReservedFor` list when it tries to remove it.
Member:

To clarify: if there is already a reference to a pod in ReservedFor, then for newly scheduled pods the scheduler will continue adding those, despite spec.ReservedFor being set to something else, right?

[If so, it would be useful to add that explicitly here too]

Member Author:

Yeah, looking at this again I don't think there are any good ways to completely avoid situations where there might be both pod and non-pod references in the ReservedFor list. So I've called out here that this is something the controllers need to be able to handle.


`ReservedFor` list, despite there being a non-pod reference here. So it ends up with
both pod and non-pod references in the list. We need to make sure the system can
handle this, as it might also happen as a result of disablement and then re-enablement
of the feature.
Member:

Can we even do something about that?
Does kubelet do admission again after restart? @SergeyKanzhelev for your input

We're ending up in a situation where some pods may be listed and some aren't listed...

I have a feeling that we need to make an explicit recommendation that before downgrade you need to ensure that ReservedFor contains the list of pods.

Member Author:

I don't think there is a way to avoid this situation. Even if the kubelet doesn't do admission again after restart, the enabled->disabled->enabled flow would still lead to a situation where we would have both pod and non-pod references.

I think having both types of references can be managed. The workload controller shouldn't remove the reference until all pods consuming the claim have been removed, so once that happens there shouldn't be any pod references left (or, if there is a race, they should be removed soon after) and the claim can be deallocated. But it makes for weird semantics that the list can contain an incomplete list of pods.

So if providing recommendations about how to do safe downgrades or feature disablement is an option, that might be better in the long run. Once the feature reaches GA, it can't be removed through disablement or rollback, so this situation would no longer be possible.

Member:

> So if providing recommendations about how to do safe downgrades or feature disablement is an option, that might be better in the long run. Once the feature reaches GA, it can't be removed through disablement or rollback, so this situation would no longer be possible.

So I think that we should basically do both:

  • explicitly recommend that before downgrading (or disabling the feature) you ensure that the non-pod references in ReservedFor are replaced with the actual pod list [and that we will not provide any automation for it]
  • try to build some safety mechanism so the system doesn't blow up if that didn't happen (like a controller not removing its reference until there is a pod in that list)

Does that make sense?

Member Author:

Yeah, I think we can handle pod and non-pod references in the list. I've updated the section to be more specific here.
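A sketch of the kind of guard this implies for a workload controller, following the rule from my earlier comment (don't remove your own reference while pod references remain in the list); the helper name is made up:

```go
package workload

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// safeToRelease returns true only when no pod references remain besides the
// workload's own entry, so a mixed list (e.g. after disabling and then
// re-enabling the feature) is drained before the claim can be deallocated.
func safeToRelease(claim *resourcev1beta1.ResourceClaim, workloadUID types.UID) bool {
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID == workloadUID {
			continue // our own reference
		}
		if ref.Resource == "pods" {
			return false // a pod is still listed; keep our reference for now
		}
	}
	return true
}
```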

@wojtek-t (Member):

The missing sections will have to be filled in for Beta, but for Alpha this is good enough from a PRR perspective.

/approve PRR

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, wojtek-t
Once this PR has been reviewed and has the lgtm label, please assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

1. If the kubelet sees that the `status.allocation.ReservedForAnyPod` is set, it will skip
the check that the Pod is listed in the `ReservedFor` list and just run the pod.

1. If the DRA scheduler plugin is trying to find candidates for deallocation in
Member:

The scheduler may need to deallocate a ResourceClaim in PostFilter, otherwise it won't be able to find an alternative allocation for it, so it cannot leave this up to an external component.

In workload scheduling, ResourceClaims may be allocated and deallocated several times during scheduling, so it has to be a fast, in-memory process.
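For reference, a minimal sketch of the kubelet-side check described in the excerpt above; since ReservedForAnyPod is only proposed, it is passed in explicitly here rather than read from the allocation, and the helper name is hypothetical:

```go
package kubeletcheck

import (
	resourcev1beta1 "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

// podMayUseClaim mirrors the admission check: with the proposed
// ReservedForAnyPod flag set, the per-pod ReservedFor lookup is skipped.
func podMayUseClaim(claim *resourcev1beta1.ResourceClaim, podUID types.UID, reservedForAnyPod bool) bool {
	if claim.Status.Allocation == nil {
		return false // claim not allocated yet
	}
	if reservedForAnyPod {
		return true // proposed behavior: any pod referencing the claim may run
	}
	// Existing behavior: the pod must be listed in status.ReservedFor.
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID == podUID {
			return true
		}
	}
	return false
}
```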

Labels

cncf-cla: yes · kind/kep · sig/scheduling · size/XL · wg/device-management
Projects
Status: 👀 In review
Status: Needs Triage

6 participants