kep-5278: nominated node name for an expected pod placement #5287
Conversation
sanposhiho commented on May 7, 2025
- One-line PR description: add kep-5278: nominated node name for an expected pod placement
- Issue link: Use NominatedNodeName to express the expected pod placement #5278
- Other comments:
keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
/cc @macsko
/hold Not to merge it without sig-scheduling approval
/sig autoscaling
So, they know where those pods are likely going to be scheduled.
By specifying their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node,
There is one more thing that I would like to be able to handle, which is related to CA (and Karpenter).
As CA, when I'm creating some nodes, I have already computed a placement for them and would like to let kube-scheduler reuse that (if it places pods differently, they may no longer fit). But the main point is that the nodes don't yet exist.
I chatted with @x13n and, even though it's not the case now, it's possible to change CA so that it knows the name of the node it will create. In that case, it would be able to set the NominatedNodeName appropriately.
But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear that field.
I can think of a few different options:
- we try to schedule, but if the pod is unschedulable and the node with the NNN name doesn't exist, we leave the NNN as is
- we don't even try to schedule if the NNN node doesn't exist [that is risky though, because if the node never appears, we will never schedule]
- some hybrid where CA additionally sets some annotation on the pod like "nnn-is-coming: true" :)
But we also need a mechanism so that when a new node comes up, we first see if there are any pods with NNN set to that node and try to schedule them first (probably taking into account whether there are higher-priority ones).
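The retry-on-node-add mechanism described above could be sketched roughly as follows. This is only an illustration in Go with simplified stand-in types, not the real kube-scheduler structures; `podsToRetryOnNodeAdd` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the real scheduler types (hypothetical).
type Pod struct {
	Name              string
	Priority          int32
	NominatedNodeName string
}

// podsToRetryOnNodeAdd returns the pending pods that nominated the newly
// registered node, highest priority first, so the scheduler can try them
// before other pods grab the new capacity.
func podsToRetryOnNodeAdd(pending []Pod, newNode string) []Pod {
	var nominated []Pod
	for _, p := range pending {
		if p.NominatedNodeName == newNode {
			nominated = append(nominated, p)
		}
	}
	sort.SliceStable(nominated, func(i, j int) bool {
		return nominated[i].Priority > nominated[j].Priority
	})
	return nominated
}

func main() {
	pending := []Pod{
		{Name: "a", Priority: 10, NominatedNodeName: "node-new"},
		{Name: "b", Priority: 50, NominatedNodeName: "node-new"},
		{Name: "c", Priority: 99, NominatedNodeName: "node-other"},
	}
	for _, p := range podsToRetryOnNodeAdd(pending, "node-new") {
		fmt.Println(p.Name)
	}
}
```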
But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear that field.
Good point.
we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
I like this over others.
One additional point is what if the pod triggers the preemption. If the pod can go somewhere after the preemption, we should prioritize that, rather than waiting for a new node to be registered.
So, the condition (for not clearing NNN) should be:
- the node with NNN doesn't exist
- the scheduler isn't trying to put a new NNN (i.e., preemption wasn't triggered in the scheduling cycle)
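Combining the two bullets, the keep-or-clear decision could be sketched as below. This is a hypothetical helper using plain booleans in place of real scheduler state, purely to pin down the logic.

```go
package main

import "fmt"

// shouldKeepNominatedNodeName sketches the condition discussed above
// (hypothetical helper, not actual kube-scheduler code): after a failed
// scheduling attempt, leave NNN in place only when the nominated node has
// not been registered yet AND this cycle did not produce a new nomination
// (i.e. preemption did not pick a different node).
func shouldKeepNominatedNodeName(nominatedNodeExists, preemptionNominatedNewNode bool) bool {
	return !nominatedNodeExists && !preemptionNominatedNewNode
}

func main() {
	// CA nominated a node that is still being provisioned: keep NNN.
	fmt.Println(shouldKeepNominatedNodeName(false, false)) // true
	// Preemption found room elsewhere: overwrite rather than wait.
	fmt.Println(shouldKeepNominatedNodeName(false, true)) // false
}
```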
we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]
Yeah, risky. If external components (or even the scheduler after preemption) set an existing node's name, but the node is deleted right after that, the pod could get stuck.
Also, if many existing running pods are deleted after CA triggers the node creation, we might be able to schedule pending pods over there without waiting for new nodes to come up.
So, the condition (for not clearing NNN) should be:
- the node with NNN doesn't exist
- the scheduler isn't trying to put a new NNN (i.e., preemption wasn't triggered in the scheduling cycle)
Fair - if scheduler in the meantime found a different place for that pod, we should take that into account.
The main question that I have based on that is:
- do we really want to clear NNN first and later set it to the new value
- or do we actually want to allow for "changing NNN in place"
The latter would allow us to avoid an additional API call, so I think we should consider it.
Yeah, I mentioned somewhere in the KEP having a validation webhook to prevent "changing NNN", but probably we cannot do that. From a performance perspective, of course, and also because an external component might set NNN in the moment between the scheduler clearing it and setting a new value. (What should the scheduler do then? Clear it again, or give up setting a new value? Complexity.)
So, it's better to allow changing NNN from non-empty, with just one API call.
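Changing NNN in place is indeed a single status update; a minimal sketch of the patch body follows. The `nnnStatusPatch` helper is hypothetical; the real call would go through client-go's Patch against the `status` subresource, and only the shape of the body matters here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nnnStatusPatch builds the body of a single merge patch against the pod's
// status subresource that replaces NNN in place, whether or not it was
// previously set. (Sketch only: real code would send this via
// clientset.CoreV1().Pods(ns).Patch(..., "status").)
func nnnStatusPatch(newNode string) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{
			"nominatedNodeName": newNode,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := nnnStatusPatch("node-b")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"status":{"nominatedNodeName":"node-b"}}
}
```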
Even if the implementation were changed so that the CA knows the name of the node it plans to provision in advance, isn’t it still possible that the node might fail to be provisioned properly? For example, if compute resources or GPUs in a specific cloud zone are exhausted, the underlying instance might not be provisioned at all. In that case, wouldn't there be a need to change the NNN in place? Or does the CA (or Karpenter) only assign the NNN when it's certain that the node will eventually join the cluster?
As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler to reuse that (if it places pods differently they may no longer fit)
Currently we cannot prevent kube-scheduler from placing pods differently, as the semantics of nomination are best effort in this regard. In particular, whenever the scheduler changes a CA decision, there is a risk that a group of pods would no longer be schedulable. I think we should clearly call it a non-goal; this type of scheduling would hopefully be covered by reservations.
we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
It sounds doable. There is a risk that NNN would point to no longer existing node, but scheduler would either find a new place for it or it would remain unschedulable and CA should find a new home for it.
+1 to both of your comments
A potential improvement (though a much bigger change than this KEP on its own) would be to externalize upcoming nodes. Cluster Autoscaler currently only holds them in memory, while Karpenter has NodeClaims, which are its own CRD. Making the scheduler aware of upcoming nodes would not only let it make better decisions when handling pods with NNN set, it would also allow it to "schedule" pods on upcoming nodes when there is no capacity available but we expect it to be provisioned. It could potentially even bring NNN field ownership back to the scheduler, if upcoming nodes had a list of the pods that triggered their creation.
I love that idea.
Agree, but it's a much bigger change (not only touching scheduling and CA, but also other components, e.g. GC).
I would like this to happen too, but I don't think it necessarily should stop a flavor of this KEP from landing first.
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
the pod could end up having different `NominatedNodeName` and `NodeName`.
Probably we should clear `NominatedNodeName` when the pod is bound. (at the binding API)
+1 to it
But technically, nothing prevents you from setting NominatedNodeName after the NodeName is set, so there can be divergence anyway.
Couldn't apiserver reject setting NominatedNodeName on bound pods?
Couldn't apiserver reject setting NominatedNodeName on bound pods?
We can do that too. But, the scenario here is:
- The binding cycle sets NNN.
- Someone sets NNN. (while the binding cycle is still running)
- The binding cycle calls the binding api.
- NNN and NodeName would be different.
So, we need to clear NNN at the binding API.
We need both (clearing NNN at binding and preventing setting NNN on bound pods).
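Both halves of that rule can be sketched together as below, using a simplified stand-in `Pod` type and hypothetical functions; the real logic would live in the binding API and apiserver status validation.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified pod model (hypothetical, for illustration only).
type Pod struct {
	NodeName          string // spec.nodeName, set by the binding API
	NominatedNodeName string // status.nominatedNodeName
}

// bind mirrors the first half of the rule: when the binding API assigns
// the node, it also clears any NNN so the two fields cannot diverge.
func bind(p *Pod, node string) {
	p.NodeName = node
	p.NominatedNodeName = ""
}

// validateNNNUpdate mirrors the second half: the apiserver rejects
// setting NNN on a pod that is already bound.
func validateNNNUpdate(p *Pod, newNNN string) error {
	if p.NodeName != "" && newNNN != "" {
		return errors.New("may not set nominatedNodeName on a bound pod")
	}
	p.NominatedNodeName = newNNN
	return nil
}

func main() {
	p := &Pod{NominatedNodeName: "node-a"}
	bind(p, "node-a")
	fmt.Println(p.NominatedNodeName == "")             // true: cleared at binding
	fmt.Println(validateNNNUpdate(p, "node-b") != nil) // true: rejected
}
```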
### The scheduler puts `NominatedNodeName`
After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
I just went through: kubernetes/kubernetes#125491 (comment)
Is there anything we need to do to accommodate what @dom4ha described there about double-accounting?
A new flow would be like this:
- The scheduler decides where the pod goes to at the scheduling cycle.
- At the end of the scheduling cycle, it "assumes" the pod on the node (= reserves the pod's place on the scheduler's cache)
- At the beginning of the binding cycle, it adds NNN to the Pod (added by this KEP)
- The event handler receives the pod's update (the pod getting NNN); but, given the pod is assumed, the update is just ignored.
So, I don't see any problem here. Adding NNN after "assume" shouldn't cause anything.
Though maybe I don't 100% understand what @dom4ha was concerned about.
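The assume-then-ignore behavior in the flow above could be sketched as follows, with a stand-in cache type rather than the real scheduler cache.

```go
package main

import "fmt"

// Cache is a tiny stand-in for the scheduler cache: once a pod is
// "assumed", its place on the node is already reserved in the cache, so
// the self-inflicted update that merely adds NNN at the start of the
// binding cycle carries no new information.
type Cache struct {
	assumed map[string]bool // pod name -> assumed on some node
}

// OnPodUpdate returns true when the update actually needs processing.
func (c *Cache) OnPodUpdate(podName string, onlyNNNChanged bool) bool {
	if c.assumed[podName] && onlyNNNChanged {
		return false // already accounted for via assume; ignore
	}
	return true
}

func main() {
	c := &Cache{assumed: map[string]bool{"pod-1": true}}
	fmt.Println(c.OnPodUpdate("pod-1", true)) // false: ignored
	fmt.Println(c.OnPodUpdate("pod-2", true)) // true: not assumed
}
```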
The point was that the nominated node and the assumed node are two different things for the scheduler. We are now going to use one of these concepts for both cases, which was my concern.
Because of this concern we started discussions about reservations to better understand the problem space before we start changing nomination semantics. So on one hand, we'd like to move things forward and fix a few problems, but on the other hand, we may still want to continue investigating how other similar cases relate to nomination.
I wonder whether we're in a position where we understand them and should jump into implementation, or should use this proposal to explore various options. I'd prefer to avoid a situation where we use NNN just because it's easier to reuse something that already exists.
What do you think about listing different use cases first? The obvious ones that I see are:
- Scheduler decides to preempt some pods and wants to inform other components that there is a pod which is about to take the freed-up space (the final assignment decision can be changed at any time)
- Scheduler decided to bind a pod, but performs pre-binding steps first, so wants to inform other components that the binding process started (the final assignment is already decided, only consecutive preemption may change it)
- CA case
- Scheduler itself tries to find a place for group of pods in gang-scheduling case
- Kueue case (external scheduler does gang-scheduling)
etc.
Looking into above list, CA case sounds the most similar to the preemption case (a nomination is a suggestion that can change without any harm). Delayed binding is a bit different one though.
Your 5 use cases are also what I could come up with now, but none of them seem problematic to me if we just reuse NNN. (Do they to you?)
I'd prefer to avoid a situation when we use NNN just because it's easier to reuse something that already exists.
Unless we come up with problems or further use cases that NNN cannot accommodate, it's natural to pick up the simplest solution, just reusing NNN. Rather, the ease/simplicity is actually a big factor.
We can expand NNN's concept and still say NNN is the node name proposed for the next step, in general. If external components set it or preemption sets it (i.e., before scheduling cycles), NNN means the proposal to the scheduling cycle. If the scheduling cycle sets it, NNN means the proposal to the binding cycle.
I don't think we need to try coming up with all possible future use cases right now (impossible), especially because NNN is an existing field. We can design a completely new thing when such a use case or problem actually comes up as time goes on, and switch away from NNN (if needed). That's my thought process.
I updated the KEP once, based on the discussion that we've had so far.
/retest
This LGTM - @dom4ha - I think this is waiting for your review before we ping PRR approver.
#### NominatedNodeName can already be set by other components now.
There aren't any guardrails preventing other components from setting NominatedNodeName now.
I think there are rules in RBAC preventing other components from updating Pod.status, where the nominatedNodeName is defined.
Where is that? Do we really have such?
RBAC is per object (or subresource) - there is no way (and no plans) to allow per-field restrictions or conditions in RBAC.
So no - if you have permissions to update PodStatus, you can also set NNN.
Who has permission to update PodStatus today?
In such cases, the semantics are not well defined now and the outcome may not match user expectations.
This KEP is a step towards clarifying these semantics instead of maintaining the status quo.
I think it's not clarifying; it's supposed to allow other components to act as if they were co-schedulers, without a mechanism by which kube-scheduler could validate their decisions.
kube-scheduler treats it as a hint to try a given node first. So yes - it does validation; it never blindly accepts it.
Shouldn't we analyze the cases in which validation may fail and how the scheduler communicates that?
For instance, what if there is overallocation? We have a legitimate overallocation case in preemption, and we also know there can be races setting NNN due to our async (delayed) API calls.
As stated, the scheduler doesn't clear NNN on its own. Even if NNN causes overallocation, the scheduling cycle just ignores NNN, but doesn't clear it on its own.
Because, as you said, sometimes overallocation might be an expected scenario (while waiting for the external component or the preemption process to complete). If NNN's overallocation is really not expected, then the component who set NNN should clear it.
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
the beginning of the binding cycle.
- No external components can overwrite `NominatedNodeName` set by a different component.
- If `NominatedNodeName` is set, the component who set it is responsible for updating or
How can a component know that it set the NNN (owns it)?
We should've clarified that. They can check `ManagedFields`.
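Such an ownership check could look roughly like this. The `ManagedFieldsEntry` below is a trimmed stand-in for the real `metav1.ManagedFieldsEntry`, and real code would parse the FieldsV1 JSON rather than substring-match it; this only illustrates the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// ManagedFieldsEntry is a minimal stand-in for metav1.ManagedFieldsEntry
// (only the pieces needed here). The real entries record, per manager,
// which fields it owns in FieldsV1, a JSON structure with "f:" keys.
type ManagedFieldsEntry struct {
	Manager     string
	Subresource string
	FieldsV1    string // raw JSON of the owned field set
}

// ownsNNN reports whether the given manager currently owns
// status.nominatedNodeName according to managedFields. (Sketch only.)
func ownsNNN(entries []ManagedFieldsEntry, manager string) bool {
	for _, e := range entries {
		if e.Manager == manager && e.Subresource == "status" &&
			strings.Contains(e.FieldsV1, `"f:nominatedNodeName"`) {
			return true
		}
	}
	return false
}

func main() {
	entries := []ManagedFieldsEntry{
		{Manager: "cluster-autoscaler", Subresource: "status",
			FieldsV1: `{"f:status":{"f:nominatedNodeName":{}}}`},
	}
	fmt.Println(ownsNNN(entries, "cluster-autoscaler")) // true
	fmt.Println(ownsNNN(entries, "kueue"))              // false
}
```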
Technically it's doable, but I think that metadata is different-level information that is designed to assure consistency of api-server updates. It does not seem right to use it to denote ownership (whoever updates a field first wins the ownership, unless it's kube-scheduler).
Shall such components watch whether kube-scheduler hasn't overwritten their hint and give up updating it anymore unless kube-scheduler clears it (which it rarely, but sometimes, does)? Can a component set NNN again? I guess the autoscaler could set it again, but what if Kueue wants to perform its follow-up scheduling (because the previous one was incorrect)? It cannot overwrite because it does not own it anymore?
My thinking is that allowing many components to write to the same field is something that we don't typically do in Kubernetes. If we do, then there is a common pattern to sync to the current state and reconsider the decision. In our case we'd have a more complicated rule, as described above.
I know how we can reuse this field to denote decisions or intentions made by kube-scheduler, and I'd prefer to keep the ownership of this field and possibly extend it when we know how resource reservation should work. We now effectively enable resource reservation and many schedulers, but with a very imperfect story behind it that maybe addresses the autoscaler case, but nothing more.
It does not seem right to use it to denote ownership (whoever updates a field first wins the ownership, unless it's kube-scheduler).
Hmm, I'm not sure why it doesn't look right to you. Is it mentioned somewhere as a bad practice, or do you have a specific reason? If it looks really weird to use it, I'm fine with creating another supplemental field to declare "who set NNN", but I'm not very sure if using `ManagedFields` is so bad that we have to explicitly create such a new field to declare the ownership.
Shall such components watch whether kube-scheduler hasn't overwritten their hint and give up updating it anymore unless kube-scheduler clears it (which it rarely, but sometimes, does)?
If kube-scheduler set it, they shouldn't update NNN because the NNN value is not theirs anymore.
Can component set NNN again?
As long as NNN's value is set by themselves, they can decide whether to clear it or update it with a new node name. (if their NNN is not valid anymore)
what if Kueue want to perform its follow up scheduling (because previous one was incorrect)? It cannot overwrite becuause it does not own it anymore?
Ideally, Kueue should clear or update it when it observes `PodScheduled: false`, which indicates their NNN didn't work. It can overwrite because it's still theirs, unless NNN is overwritten by the scheduler.
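That responsibility could be sketched as a small reconcile step; the types and `reconcileNNN` helper below are hypothetical stand-ins, not Kueue's actual controller code.

```go
package main

import "fmt"

// PodStatus is a hypothetical stand-in for the relevant pod status bits.
type PodStatus struct {
	NominatedNodeName string
	PodScheduled      bool // is the PodScheduled condition True?
}

// reconcileNNN sketches the owner's responsibility described above: if the
// NNN this component set did not lead to successful scheduling, either
// update it with a recomputed node or clear it ("" clears) so it stops
// reserving capacity. If the scheduler overwrote NNN, it's not ours.
func reconcileNNN(s *PodStatus, myNNN string, recomputed string) {
	if s.NominatedNodeName != myNNN {
		return // overwritten by the scheduler; not ours anymore
	}
	if s.PodScheduled {
		return // our nomination worked; nothing to do
	}
	s.NominatedNodeName = recomputed
}

func main() {
	s := &PodStatus{NominatedNodeName: "node-a", PodScheduled: false}
	reconcileNNN(s, "node-a", "node-b")
	fmt.Println(s.NominatedNodeName) // node-b
}
```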
But, for the long term, especially for a required TAS, we should allow it to perform a follow-up scheduling in a better way because they can't modify the required node selector after un-gating the pods. (NodeSelector is immutable when it's un-gated)
This NNN + required node selector combination would have a general use case for such an external scheduler. What we need for such external schedulers is to somehow allow them to mutate the required node selector even after un-gating the pods, or to allow them to gate the pods again to modify the selector.
After those kinds of improvements are done, Kueue with a required TAS (or some other external schedulers) can re-compute and update NNN together with the required node selector when it observes `PodScheduled: false`.
A bit went off track, but that's the whole context that I have for Kueue/external scheduler stuff. (Full discussion can be found at #5329 (comment))
My thinking is that allowing many components to write to the same field is something that we don't typically do in Kubernetes.
Right, that's the reason we have a clear rule that NNN has to be owned by the component who set it, and only the owner component or the scheduler can update/clear it.
We now effectively enable resource reservation and many schedulers, but with a very imperfect story behind it that maybe addresses the autoscaler case, but nothing more.
I don't get what you want to say here. Do you not want to enable resource reservations made by other components at all? Please explicitly point out what the imperfect story is, or rather, what the problem with the current proposal is. Otherwise, we cannot start a discussion of the cause of the imperfections and how we can address/improve them.
I think one place where a field is not owned by a single component is the `unschedulable` field in the node spec. IIUC we don't generally know who cordoned a node (kubectl or some automation), but whoever did it has to assume it is safe to un-cordon, even though there may be some other controller that would like to keep the node cordoned. Are there other fields like this?
I think one place where a field is not owned by a single component is unschedulable field in node spec.
We had an offline chat with Tim and he didn't consider this an issue either, so we're good here.
Hmm, I'm not sure why it doesn't look right to you. Is it mentioned somewhere as a bad practice, or do you have a specific reason?
I'm not sure what the practice looks like, so it was just my gut feeling. We can defer the decision to implementation. It sounds clear to me that components need to track whether they own the NNN value, so we probably should add that to the KEP.
This NNN + required node selector would have a general usecase for such an external scheduler. What we need for such external schedulers is to somehow allow them to mutate the required node selector even after un-gating the pods, or allow them to gate the pods again to modify the selector.
I also think that it's a good pattern that Kueue and other external schedulers could follow. NNN is best effort, so whenever node selector changes, scheduler will follow it no matter if NNN was set or not.
I don't get what you want to say around this. You don't want to enable resource reservations made by other components at all?
Sorry for not being precise. In the Kueue case, setting both NNN and the node selector means that the scheduler cannot change NNN, because of the node selector. So in case the pod cannot be scheduled for some reason (when the Kueue decision conflicts with kube-scheduler), the resources remain reserved anyway. In this situation Kueue should clear NNN, and I think it's already described in the KEP.
@mimowo what Kueue currently does in situation when a workload scheduled by Kueue is in fact unschedulable in kube-scheduler? Does Kueue change node selectors or removes pods?
@mimowo what Kueue currently does in situation when a workload scheduled by Kueue is in fact unschedulable in kube-scheduler? Does Kueue change node selectors or removes pods?
Currently we let the Pod stay Pending. To mitigate this user can configure "waitForPodsReady" which after a timeout (5min) will evict the entire Job (with all Pods), and re-try queuing and scheduling from scratch.
Does Kueue change node selectors or removes pods?
Workload eviction (in case of hitting waitForPodsReady) means suspending the Job (in case of a k8s Job this is `spec.suspend=true`). Then, the k8s Job controller deletes the Pods. The Job re-creates the Pods when the workload scheduling is re-tried by Kueue.
Currently we let the Pod stay Pending. To mitigate this user can configure "waitForPodsReady" which after a timeout (5min) will evict the entire Job (with all Pods), and re-try queuing and scheduling from scratch.
It sounds reasonable to keep the resources "reserved" for that time then. Obviously some higher priority pods could still jump in that place.
@mimowo Please note that we brought back the Kueue use case, so setting NNN by Kueue can speed up scheduling of some bigger workloads.
@mimowo Please note that we brought back the Kueue use case, so setting NNN by Kueue can speed up scheduling of some bigger workloads.
Awesome!
Moreover:
- Regardless of who set `NominatedNodeName`, its readers should always take that into
consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
- In case of faulty components (e.g. overallocating the nodes), these decisions will
This is the most problematic case. Incorrect or faulty decisions would influence cluster behavior and, for instance, prevent cluster scale-down. I don't think it's a good idea.
Clearing NNN by the scheduler would open a possibility for other components to race for setting NNN (for instance Kueue and autoscaling). I really think that we need a separate concept of nominated resource hints (not only for nodes) which would be solely owned by external components. The scheduler would convert them to NNN as soon as it lets a pod go through a scheduling cycle.
It's debatable where exactly that should happen; I think in Reserve or even earlier (an unschedulable pod can have NNN set if the nominated node does not exist yet, or if it's not ready yet, as in the Node Readiness Gates proposal).
It's possible to achieve that in an additive way - if we were to introduce a new field "state", then the component would set `NNN=<proposal> state="hint"` and then the scheduler could promote it to `NNN=<proposal> state="nomination"`.
I think it's a core thing - yes, we could try to solve every single aspect at once, but it will take years to agree and execute - as long as we see a compatible path forward, that's not needed.
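The hint-to-nomination promotion mentioned above could be sketched as follows. The `state` field and every type here are purely hypothetical; nothing like this exists in the API today.

```go
package main

import "fmt"

// NNNState is a hypothetical state accompanying NominatedNodeName,
// distinguishing an external hint from a scheduler-confirmed nomination.
type NNNState string

const (
	StateHint       NNNState = "hint"       // set by an external component
	StateNomination NNNState = "nomination" // promoted by kube-scheduler
)

// Nomination pairs the proposed node with its state.
type Nomination struct {
	NodeName string
	State    NNNState
}

// promote is what the scheduler would do after running the pod through a
// scheduling cycle and confirming the external proposal.
func promote(n *Nomination) {
	if n.State == StateHint {
		n.State = StateNomination
	}
}

func main() {
	n := &Nomination{NodeName: "node-a", State: StateHint}
	promote(n)
	fmt.Println(n.State) // nomination
}
```

Being additive, old clients that only read `NominatedNodeName` would keep working unchanged, which is the compatible path forward the comment refers to.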
Incorrect or faulty decisions would influence cluster behavior and for instance prevent cluster scale down. I don't think it's a good idea.
My thought here is:
Even NNN put by scheduler preemption could be wrong, but today it influences other components (i.e., CA), as I argued in another thread. I'm not saying that influence is wrong, because CA should take ongoing preemption into consideration so that it doesn't create a new node for the preempting pod, which would be a double effort. But the same thing can be said of NNN set by other external components.
As long as the external component that puts NNN clears (or updates) it as soon as it finds the current NNN has become invalid, it should be fine. That's its responsibility.
Clearing NNN by scheduler would open a possiblity for other components to race for setting NNN
As stated, it's not a good practice to have multiple components that might set NNN. There should be only one. Of course the scheduler would be another component that sets NNN, but as mentioned, the scheduler's NNN takes precedence.
Even if we separate the field, the same thing happens under the same condition: many components try to put the hint field.
My question is still open. Do we want to allow the system to consume potentially incorrect decisions of other components which kube-scheduler is not able to correct?
If you propose how to correct it, why isn't that part of this KEP?
Do we want to allow the system to consume potentially incorrect decisions of other components which kube-scheduler is not able to correct?
Right, the scheduler cannot determine "this NNN is invalid".
I believe other components have to take this decision into consideration because NNN is a general sign that "someone is working on getting this pod scheduled there" ("someone" might be the cluster autoscaler, the scheduler preemption process, etc.).
The other components have to take it into consideration so that they don't duplicate effort for the same pod. For example, CA today ignores pods with NNN set by preemption because it believes "NNN is set -> the preemption happened -> no need to create a new node". Similar idea.
If you propose how to correct it, why it's not a part of this KEP?
I don't propose that kube-scheduler correct it, because the component that set NNN is the only one that can tell "this NNN is invalid". That's why, instead, I propose that the component that set NNN is responsible for updating/clearing NNN when it becomes invalid.
For example, the cluster autoscaler sets NNN on a non-existing node. From the scheduler's point of view, it cannot tell whether this NNN is valid (= a new node is coming, on its way) or no longer valid (= CA created the node, but somehow the creation failed). All the scheduler can do is believe:
- As long as NNN is there, the cluster autoscaler is correctly scaling up a new instance, and so far so good.
- If NNN becomes invalid, the cluster autoscaler should immediately clear it (or update it with a new one)
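The contract sketched in these bullets (the component that wrote NNN owns it, and must clear or replace it the moment it becomes invalid) could look roughly like the following reconciliation pass. All names here (`podNomination`, `reconcile`, `nodeHealthy`) are hypothetical illustrations, not the actual cluster-autoscaler code:

```go
package main

import "fmt"

// podNomination models a nomination owned by an external component
// (e.g. a cluster autoscaler); field names are illustrative only.
type podNomination struct {
	pod           string
	nominatedNode string
}

// reconcile is a single pass of the owning component's loop.
// nodeHealthy reports whether the nominated placement is still on
// track (the node exists, or is still coming up). When it is not,
// the owner clears its own stale hint instead of leaving it behind.
func reconcile(n *podNomination, nodeHealthy func(string) bool) {
	if n.nominatedNode == "" {
		return // nothing nominated, nothing to own
	}
	if !nodeHealthy(n.nominatedNode) {
		fmt.Printf("clearing stale NNN %q on pod %q\n", n.nominatedNode, n.pod)
		n.nominatedNode = "" // in the real API this would be a status update
	}
}

func main() {
	nom := &podNomination{pod: "web-0", nominatedNode: "new-node-1"}
	// Simulate a failed scale-up: the nominated node never became healthy.
	reconcile(nom, func(string) bool { return false })
	fmt.Println(nom.nominatedNode == "") // true: the owner cleared its hint
}
```

The key design point from the thread is that only the writer of the hint runs such a loop; the scheduler itself never decides that someone else's NNN is invalid.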
For example, the cluster autoscaler sets NNN on non-existing node. From the scheduler pov, it cannot tell whether this NNN is valid (= a new node is coming, on its way) or this NNN is not valid anymore (= CA creates the node, but somehow the creation fails).
This is also a good case. What will happen when the node finally appears, but has initial taints that prevent scheduling? Shall the scheduler find a new NNN for those pods then?
As stated, the scheduler doesn't clear NNN on its own, which allows us to address your scenario. So, the scheduler keeps trying the NNN node (the new tainted node) first until the taints are removed, unless the pod triggers preemption and NNN is updated by the scheduler.
This is also a good case. What will happen when the node finally appears, but has initial taints that prevent scheduling? Shall the scheduler find a new NNN for those pods then?
Adding to what @sanposhiho already mentioned above: NNN doesn't prevent the scheduler from putting the pod elsewhere. So imagine that we have NNN set to something invalid (whether a non-existing node or an overloaded node, it doesn't matter). When there is a different place to put that pod and the scheduler tries that pod, it will start by checking NNN, but it will fall back to other nodes if NNN doesn't work. So if there is a different place for it, the scheduler will use it and place the pod there, despite a badly chosen NNN.
This is what I also meant when we were discussing it internally by "as long as scheduler can say veto to decision set via NNN this seems fine" - and the scheduler by definition can do that, because NNN is (and will continue to be) only a hint of the place that "we believe would work best for this pod".
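The behavior described here (honor the hint when feasible, veto it and fall back otherwise) can be sketched as below; `node`, `fits`, and `pickNode` are simplified stand-ins for illustration, not the real kube-scheduler types or plugin API:

```go
package main

import "fmt"

// node is a toy stand-in for a cluster node.
type node struct {
	name    string
	freeCPU int
}

// fits reports whether a pod requesting reqCPU can land on n.
func fits(n node, reqCPU int) bool { return n.freeCPU >= reqCPU }

// pickNode tries the NominatedNodeName first; if the nominated node is
// absent or infeasible, it silently falls back to evaluating all nodes,
// so a badly chosen NNN never blocks the pod.
func pickNode(nodes []node, nominated string, reqCPU int) (string, bool) {
	if nominated != "" {
		for _, n := range nodes {
			if n.name == nominated && fits(n, reqCPU) {
				return n.name, true // hint honored, rest of the nodes skipped
			}
		}
		// NNN infeasible or node missing: veto the hint and fall through.
	}
	for _, n := range nodes {
		if fits(n, reqCPU) {
			return n.name, true
		}
	}
	return "", false // genuinely unschedulable
}

func main() {
	nodes := []node{{name: "node-a", freeCPU: 2}, {name: "node-b", freeCPU: 8}}
	got, _ := pickNode(nodes, "node-b", 4)
	fmt.Println(got) // node-b: the nominated node is feasible, so it wins
	got, _ = pickNode(nodes, "node-a", 4)
	fmt.Println(got) // node-b: the hint was infeasible, scheduler fell back
}
```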
#### Race condition
If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle, |
Note that the async API KEP introduces a new race here, since the scheduler could set NNN in its cache, but not in the API. In the meantime an external component can set it to something different, so when we receive the external update, we need to distinguish whether the different NNN is an NNN change made by the external component, or the race condition described above, in which case it can be ignored.
Separating nomination hints should help us track external changes and allow us to always ignore any incoming NNNs (as we currently do), as scheduler is the only one who can set it.
IMHO it should help us be more clear if we keep NNN in cache from NN (assume)
Note that async API KEP introduces a new race here, since scheduler could set NNN in its cache, but not in api. In the meantime external component can set it to something different, so when we receive the external update, we need to distinguish whether the different NNN is a NNN change that external component did or it's the race condition described above and it can be ignored.
Right, but this is a general problem that we have to solve in the async KEP, as I mentioned at #5249 (comment). We can have the same problem with any field of any kind of resource.
Separating nomination hints should help us track external changes and allow us to always ignore any incoming NNNs (as we currently do), as scheduler is the only one who can set it.
Right, but this reason does not justify separation enough, because, again, it can happen anywhere, and we cannot have duplicated fields every time we have the same problem with some fields in some objects.
Right, but this reason does not justify separation enough, because, again, it can happen anywhere, and we cannot have duplicated fields every time we have the same problem with some fields in some objects.
No, the scheduler modifies only the fields it owns, and that's probably true for most other fields. It would be nice to hear what the exceptions are.
Yes, there are areas that are affected indirectly (like node resource allocation), but we're the owner of this area, and kube-scheduler was designed to solve that problem. We should identify all such new areas while designing the async API call, so as not to miss anything and not introduce new ones.
As I mentioned in #5249 (comment), I agree NNN would be the only case within the default scheduler, but at the same time, we cannot tell/control what custom plugins do or how custom resources are designed. So, that all depends on the scope of #5249.
Also, again, either way, this reason alone is not big enough to make the field separate.
We can just prioritize the NNN patch made by the scheduler even if an external component puts NNN at a similar time. This is aligned with what we stated: the scheduler's NNN takes precedence over others.
I don't get what you want to say around this. You don't want to enable resource reservations made by other components at all? Please explicitly point out the imperfect story you're seeing, or, I'd say, a problem with the current proposal. Otherwise, we cannot start a discussion of the cause of the imperfections and how we can address/improve them.
Sorry for being unclear here. What I meant is that resource reservation is quite a broad topic, and I think we all agree that we're not discussing all the possibilities and don't attempt to do that yet. This is why I'm calling it imperfect in this sense, and not the ultimate reservation solution, which most likely will look different (@wojtek-t is working on a more specific proposal).
So it's a bit unfortunate that nomination hints in fact also cause implicit reservation, but since we agreed that it's good enough, I think we can consider bringing the Kueue case back. Even though they express scheduling constraints by setting node labels, giving the scheduler a nomination hint would allow us to bypass the Filtering and Scoring phases completely, which may really matter for large workloads. WDYT?
@dom4ha - the user story for Kueue is already mentioned in the KEP and I don't think it requires any changes in the KEP; I'm tentatively ok with that but I think we can further discuss that during implementation
In other words, from the state machine perspective, there is a visible difference in who set the
`NominatedNodeName`. If it was the scheduler, it may mean that there is already an ongoing preemption.
If it was an external component, it's just a hint that may even be ignored.
However, if we look from the consumption point of view - these are effectively the same. We want
I wouldn't say so. Nomination hints may be unintentionally wrong, and the system is not able to distinguish whether the nomination is really accepted by the system and should be the "plan of record", or is incorrect; and we don't have a mechanism to clear or correct it.
If the scheduler doesn't have a better option in mind, it effectively is the plan of record for the system. And the scheduler can change it (but it can change it now, too).
As I mentioned in several places, can you show some example scenarios that cause trouble if we cannot distinguish them? I just haven't heard any, and hence we have kept this design from the beginning. I do know we cannot distinguish them, but at the same time I don't know what happens if we can't. Without that, we cannot discuss this matter on the same page.
Here's my detailed assumption (basically, all the discussion happened at #5329 (comment)), if it helps:
In the first place, we must assume that the nomination set by the scheduler preemption could also be wrong.
So, anyone (including the scheduler preemption, the cluster autoscaler, and any external component) can be wrong. Some of that might be because they don't do enough scheduling simulation. But another reason is that there's an inevitable time gap between when they compute the cluster state and put NNN, and when the scheduling cycle actually processes the pod. So, even if the scheduling simulation is perfect, NNN could be wrong, and that's the reason we're saying NNN is best effort.
So, essentially the scheduler preemption vs external components: I don't see the difference. Both of them are: 1. compute things, 2. put NNN, 3. maybe change the cluster state (2 and 3 can be opposite or async depending on the components)
And, in order to wait for (3) to complete (in the case of CA, that's node creation; in the case of preemption, that's pod termination, etc.), other components should take NNN into consideration to prevent duplicated effort to make the pod schedulable, regardless of who set NNN. After all, if (3) fails, or (3) completes but NNN is no longer correct (e.g., CA creates a new node, but the node is filled by other high-priority pods), then the component should clear NNN or update it with a new value.
This is the reason of this statement here in the KEP: "if we look from consumption point of view - these are effectively the same. We want to expose the information, that as of now a given node is considered as a potential placement"
To add to my last comment though, I'm not saying we shouldn't separate the fields. It's just that I don't see the need to do it right now, and we can do it later when we actually see the need.
I just don't want to introduce such a rich design without (yet) seeing actual problems or use cases that are clear to us right now.
To add to my last comment though, I'm not saying we shouldn't separate fields. It's just that I don't see the need of doing it right now, and we can do that later when actually we see the need of that.
Provided that we can do that in an additive way, which we believe we can. So I agree with that.
We are limited in time, so we probably won't be able to resolve all the concerns today. So let's go ahead with what we have and get others' feedback. In the worst case we will cut the scope of the proposal. The original delayed-binding use case is very important and less controversial, so we should be able to deliver that.
To make it clear: overall it's LGTM from me.
Several comments from PRR review
Why should this KEP _not_ be implemented?
-->

## Alternatives
Two questions, that I'm not sure where to post:
1. How `NominatedNodeName` plays with `NodeName` - can we ensure this is explicitly pointed out in this document?
2. Have you considered adding a PodSpec `NominatedNodeName` to cover the use-case Wojtek mentioned, to clearly separate nomination (status) from hint (spec)? I'd imagine this being at least explored in the alternatives section with an explicit reason why it was rejected.
1. NNN is treated as the "current plan of record", but something that is not necessarily a final decision. NodeName is the actual binding decision.
2. Yes - it was explored, and we decided that distinguishing that for now is not really needed and can be done in an additive way in the future. So let's not add the API until we are really sure it's needed.
/approve
the PRR, with the caveat that the upgrade->downgrade->upgrade section will get updated as soon as the implementation lands and it's possible to test that
There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
However, we don't have any validation of how that currently works.
To support the use cases mentioned above, we will adjust the scheduler to do the following:
- if NominatedNodeName is set, but the corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon]
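A minimal sketch of the adjusted rule in the bullet above; `clearNNNOnUnschedulable` and its parameters are invented for illustration (the real change lives in kube-scheduler's handling of unschedulable pods):

```go
package main

import "fmt"

// clearNNNOnUnschedulable decides whether to drop a pod's NominatedNodeName
// when the pod turns out to be unschedulable. keepForMissingNode toggles the
// proposed behavior: keep the hint when the node doesn't exist yet, assuming
// it may appear soon (e.g. a scale-up in flight).
func clearNNNOnUnschedulable(nominated string, nodeExists, keepForMissingNode bool) bool {
	if nominated == "" {
		return false // nothing to clear
	}
	if !nodeExists && keepForMissingNode {
		return false // proposed: keep the hint for a node that may appear
	}
	return !nodeExists // prior behavior: clear when the node is gone
}

func main() {
	fmt.Println(clearNNNOnUnschedulable("new-node-1", false, true))  // false: hint kept
	fmt.Println(clearNNNOnUnschedulable("new-node-1", false, false)) // true: old behavior clears
}
```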
I agree, a warning might be too expensive, but a v(1) or v(2) log is good.
5. ensure that it gets cleared
6. upgrade
7. set NNN to a non-existing node
8. ensure it isn't cleared again
Normally, this is required for beta graduation, but given this is starting at beta, it's hard to do the test before the implementation. But please, once you land the code change, update this fragment of the document with the steps performed and the results, and ping me so I'm aware this has happened.
ACK - that was the original intention.
/label tide/merge-method-squash
/hold do we need anything else before merging?
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dom4ha, sanposhiho, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I think we're fine. If needed, we can always open a follow-up. /lgtm