KEP-5194: DRA: ReservedFor Workloads in 1.34 #5379
Conversation
@@ -0,0 +1,3 @@
kep-number: 5194
alpha:
  approver: "@johnbelamaric"
Probably best to pick someone else since I will be out all but two days between now and KEP freeze
Yes - I can take it
and therefore know that it is safe to deallocate the `ResourceClaim`.

This requires that the controller/user has permissions to update the status
subresource of the `ResourceClaim`. The resourceclaim controller will also try to detect if
So, if the other controllers need to do it, then they each need to be granted update permissions to the status. Whereas for the resource claim controller to do it, it needs get/list/watch for that resource type. I think that the latter option has way less opportunity for orphaned resource claims. I would suggest we lead with that as the primary mechanism.
If the resource claim controller sees something in there and doesn't have the right permissions, it can complain in events and logs.
Realistically, this probably means giving the resourceclaim controller the ability to watch deployments, jobs, and statefulsets. And maybe a few others.
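For illustration, the two permission shapes being compared here might look roughly like the following RBAC rules. This is only a sketch; the exact groups, resources, and verbs are assumptions for the example, not taken from the KEP.

```go
package sketch

import rbacv1 "k8s.io/api/rbac/v1"

// Option A: each workload controller updates resourceclaims/status itself.
var workloadControllerRule = rbacv1.PolicyRule{
	APIGroups: []string{"resource.k8s.io"},
	Resources: []string{"resourceclaims/status"},
	Verbs:     []string{"get", "update", "patch"},
}

// Option B: the resourceclaim controller watches the workload types instead.
var resourceClaimControllerRule = rbacv1.PolicyRule{
	APIGroups: []string{"apps", "batch"},
	Resources: []string{"deployments", "statefulsets", "jobs"},
	Verbs:     []string{"get", "list", "watch"},
}
```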
So I've updated the design here a bit.
What I think might be a challenge with making the resource claim controller responsible for removing the reference in the `ReservedFor` list is that we can probably only do this safely when the resource has been deleted (and even then I guess there can be orphaned pods). If a Deployment is scaled down to zero, there will not be any pods using the ResourceClaim, so it could potentially be deallocated. But I don't think the resource claim controller can deallocate just based on the number of replicas, since there will be a race between the deallocation and the possible creation of new pods. If the workload controller is responsible for this, I think it should be able to handle it safely.
Based on other feedback and thinking more about this, I have changed this to actually put the responsibility on the controllers. Two primary reasons for this:
- It lets us implement a single solution that is available to both in-tree and out-of-tree controllers.
- Controllers have more context about the workloads, so they are in a better position than the resourceclaim controller to decide when deallocation is safe.
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes. Applications that were already running will continue to run and the allocated
devices will remain so.
What happens when they finish, though? Won't the dangling reference to, say, a Job in the ReservedFor do something weird in the deallocation process?
Yeah, I was thinking that removing them was handled by the workload controller and therefore wouldn't be affected if the feature was disabled. But with the change that makes the resource claim controller responsible for removing the reference, that is obviously not true. And even if we made it the responsibility of the workload controller, that functionality would probably be covered by the same feature gate.
I have updated this section.
What happens if kubelet restarts in the meantime? Won't it try to re-admit the pod (and it will fail because now it would try to validate the `ReservedFor`, which is not set to the pod)?
/assign @pohly
/wg device-management
Rather than expecting the `ReservedFor` field to contain an exhaustive list of
all pods using the ResourceClaim, we propose letting the controller managing
a ResourceClaim specify the reference to the resource consuming the claim
Isn't it enough to specify the ResourceClaim itself, to which all pods that are supposed to use the allocated device refer?
What is the use case for using a different object, since it may potentially cover a different set of pods than the ones that refer to this ResourceClaim?
So the main reason for having the reference here is that we need a way to signal to the resourceclaim controller that there are no more pods consuming the claim and it can therefore be deallocated. We have discussed two ways this can happen:
- The resourceclaim controller checks whether the referenced resource exists, and if not, concludes there are no pods consuming the claim and deallocates it.
- The workload controller decides when there are no more pods consuming the ResourceClaim and therefore decides it can be deallocated.

In both cases, I think it is useful that we capture information about the "owning workload" of the ResourceClaim, and the `ReservedFor` field seems like a reasonable place to do it.

As for having a workload managing pods where not all of them are consuming the ResourceClaim, for that scenario I think it will be up to the workload controller to decide when it is safe to remove the reference in the `ReservedFor` list and therefore deallocate the claim.
+1 to Morten
I think in theory we could consider `ownerRef` for that, but given we may want to create this ResourceClaim before the workload, there would be a chance of races here. So I think that having this explicit pointer is reasonable.
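For illustration, a minimal sketch of what such a non-pod reference could look like, assuming the existing `ResourceClaimConsumerReference` type from `resource.k8s.io/v1beta1`; the workload name and UID are made up for the example:

```go
package main

import (
	"fmt"

	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	// The existing ReservedFor entry type is already a generic
	// APIGroup/Resource/Name/UID tuple, so a Deployment reference fits
	// without changing the schema.
	ref := resourceapi.ResourceClaimConsumerReference{
		APIGroup: "apps",
		Resource: "deployments",
		Name:     "inference-server",       // hypothetical workload name
		UID:      types.UID("example-uid"), // normally the Deployment's real UID
	}
	fmt.Printf("owning workload reference: %+v\n", ref)
}
```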
The `ReservedFor` list already accepts generic resource references, so this
field doesn't need to be changed. However, we are proposing adding two new
fields to the `ResourceClaim` type:
* `spec.ReservedFor` which allows the creator of a `ResourceClaim` to specify in
If we assume that the ResourceClaim itself could be specified (and so all pods referring to it), why not treat it as the default way of reserving resources (after gating it behind a feature flag)?
Is there any reason why the new approach cannot replace the existing one?
So I think there are two ways we use the `ReservedFor` list today:
- To determine when the ResourceClaim can be deallocated.
- To find all pods consuming (or referencing) the ResourceClaim.

We can deallocate when there are no more pods consuming the ResourceClaim. Without the `ReservedFor` list of pods, the only entity that can determine this is the controller responsible for managing creation/deletion of the pods, i.e. the workload controller. And it "communicates" with the resourceclaim controller by removing the reference in the `ReservedFor` list.

Finding the pods referencing the ResourceClaim is not sufficient for deallocation, because we have a race between the check and new pods being created. But for the situations where we just need the pods referencing the claim, that is pretty easy to find. And the proposal here is that we will do that in the device_taint_eviction controller.
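As a rough sketch of the "finding referencing pods" part, assuming pods reference the claim directly by name through `spec.resourceClaims` (claims created from a template, which are resolved via the pod status, are left out for brevity):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// podsReferencingClaim returns the pods whose spec references the named
// ResourceClaim directly. As noted above, this is good enough for use cases
// like taint-based eviction, but not for deciding deallocation, because new
// referencing pods can appear right after the check.
func podsReferencingClaim(pods []corev1.Pod, claimName string) []corev1.Pod {
	var out []corev1.Pod
	for _, pod := range pods {
		for _, prc := range pod.Spec.ResourceClaims {
			if prc.ResourceClaimName != nil && *prc.ResourceClaimName == claimName {
				out = append(out, pod)
				break
			}
		}
	}
	return out
}
```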
the spec which resource is the consumer of the `ResourceClaim`. When the first pod
referencing the `ResourceClaim` is scheduled, the reference will be copied into
the `status.ReservedFor` list.
* `status.allocation.ReservedForAnyPod` which will be set to `true` by the DRA
The name `ReservedForAnyPod` may be a bit misleading and inaccurate. The claim would be reserved for any pod that is referencing this claim (or for the other object specified in `spec.ReservedFor`).

Actually the name `ReservedFor` itself is misleading, as we should rather say that the ResourceClaim is AllocatedFor all pods that are referencing it. We probably should also say that the ResourceClaim should be deallocated when none of the pods referencing it is scheduled (assumed in the scheduler cache).
So I've tried to make a distinction between a pod referencing a claim and a pod consuming a claim. The pod references in the list give us the latter, since they are set by the scheduler. But without the explicit list of pods, we have to find another way to handle deallocation. As mentioned in another comment, I don't think there is a safe way to do deallocation by listing the pods referencing the claim.

Agree that the naming isn't necessarily perfect, but I think the new names mostly make sense assuming we keep the existing names. But definitely open to changing them.
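To make the naming discussion concrete, here is a sketch of the two proposed additions as local stand-in types. The names and placement follow the KEP text under discussion, not any released API:

```go
package sketch

import resourceapi "k8s.io/api/resource/v1beta1"

// proposedResourceClaimSpec stands in for the addition to ResourceClaimSpec.
type proposedResourceClaimSpec struct {
	// ReservedFor optionally names the workload consuming this claim. When the
	// first pod referencing the claim is scheduled, the scheduler copies this
	// single reference into status.ReservedFor instead of listing every pod.
	ReservedFor *resourceapi.ResourceClaimConsumerReference
}

// proposedAllocationResult stands in for the addition to AllocationResult.
type proposedAllocationResult struct {
	// ReservedForAnyPod tells the kubelet that any pod referencing the claim
	// may use it, so the per-pod ReservedFor check can be skipped.
	ReservedForAnyPod bool
}
```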
deleted or finish running. An empty list means there are no current consumers of the claim
and it can be deallocated.

#### Finding pods using a ResourceClaim
ReservedFor is also used to avoid races between different schedulers scheduling pods that share the same ResourceClaim (see https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md)
It may be hard to replace this use case since currently there is no good way of coordinating claim allocation between different schedulers. We are working on KEP #5287, which allows setting NominatedNodeName after the scheduler decides on a pod placement (the pod is assumed, meaning it's scheduled but not bound yet), but the way it's designed now, this field also conveys more information than just whether a pod is assumed, so unfortunately this bit of information cannot be used directly.
We are also working on a more general mechanism for reserving resources that could address the problem of races between schedulers, but there are no details to share yet.
Do you have an example of how this could become a problem with multiple schedulers? I'm not familiar with that situation.
##### Deallocation
The resourceclaim controller will remove Pod references from the `ReservedFor` list just
like it does now using the same logic. For non-Pod references, the controller will recognize
a small number of built-in types, starting with `Deployment`, `StatefulSet` and `Job`, and will
I think this is somewhat misleading. I believe we should either make it work for an arbitrary owner or not at all.
Ignoring out-of-tree resources makes the feature unfriendly to out-of-tree controllers, and additionally may lead to situations where people assume that it works (based on experience with in-tree types) and end up leaking resources...
I would personally suggest requiring the controller to unset it for Alpha, and adding a beta graduation criterion to revisit that decision.
Yeah, I agree. I also think workload controllers are in a better position to make decisions on when deallocation is safe, so there are several reasons for moving this responsibility to the workload controllers.
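A minimal sketch of what that responsibility might look like on the workload controller side, assuming the typed client for `resource.k8s.io/v1beta1` and the semantics proposed in this KEP (the function name and flow are illustrative, not part of the design):

```go
package sketch

import (
	"context"

	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	resourceclient "k8s.io/client-go/kubernetes/typed/resource/v1beta1"
)

// releaseClaim removes the workload's own entry from status.ReservedFor once
// the controller has decided no pods consume the claim anymore, which lets the
// resourceclaim controller deallocate it. This requires update permission on
// the resourceclaims/status subresource.
func releaseClaim(ctx context.Context, client resourceclient.ResourceV1beta1Interface,
	claim *resourceapi.ResourceClaim, workloadUID types.UID) error {

	kept := claim.Status.ReservedFor[:0]
	for _, ref := range claim.Status.ReservedFor {
		if ref.UID != workloadUID {
			kept = append(kept, ref)
		}
	}
	claim.Status.ReservedFor = kept
	_, err := client.ResourceClaims(claim.Namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
	return err
}
```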
in the `spec.ReservedFor` list. As a result, the workload will get scheduled, but
it will be subject to the 256 limit on the size of the `ReservedFor` list and the
controller creating the `ResourceClaim` will not find the reference it expects
in the `ReservedFor` list when it tries to remove it.
To clarify - if there is already a reference to a pod in `ReservedFor`, then for newly scheduled pods the scheduler will continue adding those, despite `spec.ReservedFor` being set to something else, right?
[If so, it would be useful to add that explicitly here too]
Yeah, looking at this again I don't think there are any good ways to completely avoid situations where there might be both pod and non-pod references in the `ReservedFor` list. So I've called out here that this is something the controllers need to be able to handle.
`ReservedFor` list, despite there being a non-pod reference here. So it ends up with
both pod and non-pod references in the list. We need to make sure the system can
handle this, as it might also happen as a result of disablement and then re-enablement
of the feature.
Can we even do something about that?
Does kubelet do admission again after restart? @SergeyKanzhelev for your input
We're ending up in a situation where some pods may be listed and some aren't listed...
I have a feeling that we need to make an explicit recommendation that before downgrade you need to ensure that ReservedFor contains the list of pods.
I don't think there is a way to avoid this situation. Even if the kubelet doesn't do admission again after restart, the enabled->disabled->enabled flow would still lead to a situation where we would have both pod and non-pod references.

I think having both types of references can be managed. The workload controller shouldn't remove the reference until all pods consuming the claim have been removed, so once that happens there shouldn't be any pod references left (or if there is a race, they should be removed soon after) so the claim can be deallocated. But it makes for weird semantics that the list can contain an incomplete list of pods.

So if providing recommendations about how to do safe downgrades or feature disablement is an option, that might be better in the long run. Once the feature reaches GA, it can no longer be disabled or rolled back, so this situation would no longer be possible.
> Once the feature reaches GA, it can no longer be disabled or rolled back, so this situation would no longer be possible.
So I think that we should basically do both:
- explicitly recommend that before downgrading (or disabling the feature) you should ensure that the `ReservedFor` for non-pod references is replaced with the actual pod list [and that we will not provide any automation for it]
- try to build some safety mechanism to not blow up the system if that didn't happen (like controllers not removing their reference while there is still a pod in that list)

Does that make sense?
Yeah, I think we can handle pod and non-pod references in the list. I've updated the section to be more specific here.
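A sketch of that safety mechanism, assuming the `resource.k8s.io/v1beta1` Go types; the helper name is made up for illustration:

```go
package sketch

import resourceapi "k8s.io/api/resource/v1beta1"

// safeToRemoveWorkloadRef captures the safety idea from this thread: a
// workload controller keeps its non-Pod reference in place while any Pod
// entries (for example, left over after a disable/enable cycle) are still
// listed, so the claim is not deallocated under a running pod.
func safeToRemoveWorkloadRef(claim *resourceapi.ResourceClaim) bool {
	for _, ref := range claim.Status.ReservedFor {
		if ref.APIGroup == "" && ref.Resource == "pods" {
			return false // a Pod still holds a reservation
		}
	}
	return true
}
```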
The missing sections will have to be filled in for Beta, but for Alpha this is good enough from PRR perspective. /approve PRR
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mortent, wojtek-t
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
1. If the kubelet sees that the `status.allocation.ReservedForAnyPod` is set, it will skip
   the check that the Pod is listed in the `ReservedFor` list and just run the pod.

1. If the DRA scheduler plugin is trying to find candidates for deallocation in
Scheduler may need to deallocate a ResourceClaim in PostFilter, otherwise it won't be able to find an alternative allocation for it, so it cannot leave this up to an external component.
In workload scheduling, ResourceClaims may be allocated and deallocated several times during scheduling, so it has to be a fast in-memory process.
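For the kubelet step quoted above, a rough sketch of the admission check using local stand-in types; only the relevant fields are shown, and `ReservedForAnyPod` is the proposed field from this KEP, not part of any released API:

```go
package sketch

import "k8s.io/apimachinery/pkg/types"

// Minimal stand-in types for the parts of the claim status the check needs.
type consumerRef struct{ UID types.UID }

type allocation struct{ ReservedForAnyPod bool }

type claimStatus struct {
	Allocation  *allocation
	ReservedFor []consumerRef
}

// podMayUseClaim sketches the kubelet-side check: with ReservedForAnyPod set,
// any referencing pod is admitted; otherwise the pod's UID must appear in
// ReservedFor, as it does today.
func podMayUseClaim(status claimStatus, podUID types.UID) bool {
	if status.Allocation == nil {
		return false // claim not allocated yet
	}
	if status.Allocation.ReservedForAnyPod {
		return true
	}
	for _, ref := range status.ReservedFor {
		if ref.UID == podUID {
			return true
		}
	}
	return false
}
```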
One-line PR description: DRA: ReservedFor Workloads as alpha in 1.34
Issue link: DRA: ReservedFor Workloads #5194
Other comments: first revision