kep-5278: nominated node name for an expected pod placement #5287
Conversation
sanposhiho commented on May 7, 2025
- One-line PR description: add kep-5278: nominated node name for an expected pod placement
- Issue link: Use NominatedNodeName to express the expected pod placement #5278
- Other comments:
keps/sig-scheduling/5278-nominated-node-name-for-expectation/README.md
/cc @macsko
/hold Not to merge it without sig-scheduling approval
/sig autoscaling
So, they know where those pods are likely going to be scheduled.
By specifying their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node,
There is one more thing that I would like to be able to handle, which is related to CA (and Karpenter).
As CA, when I'm creating some nodes, I have already computed a placement for them and would like to let kube-scheduler reuse that (if it places pods differently, they may no longer fit). But the main point is that the nodes don't yet exist.
I chatted with @x13n and, even though it's not the case now, it's possible to change CA so that it knows the name of the node it will create. In that case, it would be able to set the NominatedNodeName appropriately.
But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear that field.
I can think of a few different options:
- we try to schedule, but if the pod is unschedulable and the node with the NNN name doesn't exist, we leave the NNN as is
- we don't even try to schedule if the NNN node doesn't exist [that is risky though, because if the node never appears, we will never schedule]
- some hybrid where CA additionally sets some annotation on the pod like "nnn-is-coming: true" :)
But we also need a mechanism so that when a new node comes up, we first see if there are any pods with NNN set to that node and try to schedule them first (probably taking into account whether there are higher-priority ones).
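The retry-on-node-add mechanism described above could be sketched roughly as follows. This is only an illustration in Go with simplified stand-in types, not the real kube-scheduler structures; `podsToRetryOnNodeAdd` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for the real scheduler types (hypothetical).
type Pod struct {
	Name              string
	Priority          int32
	NominatedNodeName string
}

// podsToRetryOnNodeAdd returns the pending pods that nominated the newly
// registered node, highest priority first, so the scheduler can try them
// before other pods grab the new capacity.
func podsToRetryOnNodeAdd(pending []Pod, newNode string) []Pod {
	var nominated []Pod
	for _, p := range pending {
		if p.NominatedNodeName == newNode {
			nominated = append(nominated, p)
		}
	}
	sort.SliceStable(nominated, func(i, j int) bool {
		return nominated[i].Priority > nominated[j].Priority
	})
	return nominated
}

func main() {
	pending := []Pod{
		{Name: "a", Priority: 10, NominatedNodeName: "node-new"},
		{Name: "b", Priority: 50, NominatedNodeName: "node-new"},
		{Name: "c", Priority: 99, NominatedNodeName: "node-other"},
	}
	for _, p := range podsToRetryOnNodeAdd(pending, "node-new") {
		fmt.Println(p.Name)
	}
}
```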
But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear that field.
Good point.
we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
I like this over others.
One additional point is what if the pod triggers the preemption. If the pod can go somewhere after the preemption, we should prioritize that, rather than waiting for a new node to be registered.
So, the condition (for not clearing NNN) should be:
- the node with NNN doesn't exist
- the scheduler isn't trying to put a new NNN (i.e., preemption wasn't triggered in the scheduling cycle)
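Combining the two bullets, the keep-or-clear decision could be sketched as below. This is a hypothetical helper using plain booleans in place of real scheduler state, purely to pin down the logic.

```go
package main

import "fmt"

// shouldKeepNominatedNodeName sketches the condition discussed above
// (hypothetical helper, not actual kube-scheduler code): after a failed
// scheduling attempt, leave NNN in place only when the nominated node has
// not been registered yet AND this cycle did not produce a new nomination
// (i.e. preemption did not pick a different node).
func shouldKeepNominatedNodeName(nominatedNodeExists, preemptionNominatedNewNode bool) bool {
	return !nominatedNodeExists && !preemptionNominatedNewNode
}

func main() {
	// CA nominated a node that is still being provisioned: keep NNN.
	fmt.Println(shouldKeepNominatedNodeName(false, false)) // true
	// Preemption found room elsewhere: overwrite rather than wait.
	fmt.Println(shouldKeepNominatedNodeName(false, true)) // false
}
```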
we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]
Yeah, risky. If external components (or even the scheduler after preemption) set an existing node's name, but the node is deleted right after that, the pod could get stuck.
Also, if many existing running pods are deleted after CA triggers the node creation, we might be able to schedule pending pods over there without waiting for new nodes to come up.
So, the condition (for not clearing NNN) should be:
- the node with NNN doesn't exist
- the scheduler isn't trying to put a new NNN (i.e., preemption wasn't triggered in the scheduling cycle)
Fair - if scheduler in the meantime found a different place for that pod, we should take that into account.
The main question that I have based on that is:
- do we really want to clear NNN first and later set it to the new value
- or do we actually want to allow for "changing NNN in place"
The latter would allow us to avoid an additional API call, so I think we should consider it.
Yeah, I mentioned somewhere in the KEP having a validation webhook to prevent "changing NNN", but probably we cannot do that. From a performance perspective, of course, and also because an external component might set NNN in the moment between the scheduler clearing it and setting a new value. (What should the scheduler do then? Clear it again, or give up setting a new value? Complexity.)
So, it's better to allow changing NNN from non-empty, with just one API call.
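Changing NNN in place is indeed a single status update; a minimal sketch of the patch body follows. The `nnnStatusPatch` helper is hypothetical; the real call would go through client-go's Patch against the `status` subresource, and only the shape of the body matters here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nnnStatusPatch builds the body of a single merge patch against the pod's
// status subresource that replaces NNN in place, whether or not it was
// previously set. (Sketch only: real code would send this via
// clientset.CoreV1().Pods(ns).Patch(..., "status").)
func nnnStatusPatch(newNode string) ([]byte, error) {
	patch := map[string]any{
		"status": map[string]any{
			"nominatedNodeName": newNode,
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, err := nnnStatusPatch("node-b")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // {"status":{"nominatedNodeName":"node-b"}}
}
```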
Even if the implementation were changed so that the CA knows the name of the node it plans to provision in advance, isn’t it still possible that the node might fail to be provisioned properly? For example, if compute resources or GPUs in a specific cloud zone are exhausted, the underlying instance might not be provisioned at all. In that case, wouldn't there be a need to change the NNN in place? Or does the CA (or Karpenter) only assign the NNN when it's certain that the node will eventually join the cluster?
As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler to reuse that (if it places pods differently they may no longer fit)
Currently we cannot prevent kube-scheduler from placing pods differently, as the semantics of nomination are best effort in this regard. In particular, whenever the scheduler changes a CA decision, there is a risk that a group of pods would no longer be schedulable. I think we should clearly call it a non-goal; this type of scheduling would hopefully be covered by reservations.
we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
It sounds doable. There is a risk that NNN would point to no longer existing node, but scheduler would either find a new place for it or it would remain unschedulable and CA should find a new home for it.
+1 to both of your comments
A potential improvement (though a much bigger change than this KEP on its own) would be to externalize upcoming nodes. Cluster Autoscaler currently only holds them in memory, while Karpenter has NodeClaims, which are its own CRD. Making the scheduler aware of upcoming nodes would not only let it make better decisions when handling pods with NNN set, it would also allow it to "schedule" pods on upcoming nodes when there is no capacity available but we expect it to be provisioned. It could potentially even bring NNN field ownership back to the scheduler, if upcoming nodes had a list of the pods that triggered their creation.
I love that idea.
Agree, but it's a much bigger change (not only touching scheduling and CA, but also other components, e.g. GC).
I would like this to happen too, but I don't think it necessarily should stop a flavor of this KEP from landing first.
But, if an external component updates `NominatedNodeName` that is set by the scheduler,
the pod could end up having different `NominatedNodeName` and `NodeName`.
Probably we should clear `NominatedNodeName` when the pod is bound. (at the binding API)
+1 to it
But technically, nothing prevents you from setting NominatedNodeName after the NodeName is set, so there can be divergence anyway.
Couldn't apiserver reject setting NominatedNodeName on bound pods?
Couldn't apiserver reject setting NominatedNodeName on bound pods?
We can do that too. But, the scenario here is:
- The binding cycle sets NNN.
- Someone sets NNN. (while the binding cycle is still running)
- The binding cycle calls the binding api.
- NNN and NodeName would be different.
So, we need to clear NNN at the binding API.
We need both (clearing NNN at binding and preventing setting NNN on bound pods).
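Both halves of that rule can be sketched together as below, using a simplified stand-in `Pod` type and hypothetical functions; the real logic would live in the binding API and apiserver status validation.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified pod model (hypothetical, for illustration only).
type Pod struct {
	NodeName          string // spec.nodeName, set by the binding API
	NominatedNodeName string // status.nominatedNodeName
}

// bind mirrors the first half of the rule: when the binding API assigns
// the node, it also clears any NNN so the two fields cannot diverge.
func bind(p *Pod, node string) {
	p.NodeName = node
	p.NominatedNodeName = ""
}

// validateNNNUpdate mirrors the second half: the apiserver rejects
// setting NNN on a pod that is already bound.
func validateNNNUpdate(p *Pod, newNNN string) error {
	if p.NodeName != "" && newNNN != "" {
		return errors.New("may not set nominatedNodeName on a bound pod")
	}
	p.NominatedNodeName = newNNN
	return nil
}

func main() {
	p := &Pod{NominatedNodeName: "node-a"}
	bind(p, "node-a")
	fmt.Println(p.NominatedNodeName == "")             // true: cleared at binding
	fmt.Println(validateNNNUpdate(p, "node-b") != nil) // true: rejected
}
```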
### The scheduler puts `NominatedNodeName`
After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
I just went through: kubernetes/kubernetes#125491 (comment)
Is there anything we need to do to accommodate what @dom4ha described there about double-accounting?
A new flow would be like this:
- The scheduler decides where the pod goes to at the scheduling cycle.
- At the end of the scheduling cycle, it "assumes" the pod on the node (= reserves the pod's place on the scheduler's cache)
- At the beginning of the binding cycle, it adds NNN to the Pod (added by this KEP)
- The event handler receives the pod's update (the pod getting NNN); but, given the pod is assumed, the update is just ignored.
So, I don't see any problem here. Adding NNN after "assume" shouldn't cause anything.
Though maybe I don't 100% understand what @dom4ha was concerned about.
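The assume-then-ignore behavior in the flow above could be sketched as follows, with a stand-in cache type rather than the real scheduler cache.

```go
package main

import "fmt"

// Cache is a tiny stand-in for the scheduler cache: once a pod is
// "assumed", its place on the node is already reserved in the cache, so
// the self-inflicted update that merely adds NNN at the start of the
// binding cycle carries no new information.
type Cache struct {
	assumed map[string]bool // pod name -> assumed on some node
}

// OnPodUpdate returns true when the update actually needs processing.
func (c *Cache) OnPodUpdate(podName string, onlyNNNChanged bool) bool {
	if c.assumed[podName] && onlyNNNChanged {
		return false // already accounted for via assume; ignore
	}
	return true
}

func main() {
	c := &Cache{assumed: map[string]bool{"pod-1": true}}
	fmt.Println(c.OnPodUpdate("pod-1", true)) // false: ignored
	fmt.Println(c.OnPodUpdate("pod-2", true)) // true: not assumed
}
```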
The point was that the nominated node and the assumed node are two different things for the scheduler. We are now going to use one of these concepts for both cases, which was my concern.
Because of this concern we started discussions about reservations to better understand the problem space before we start changing nomination semantics. So on one hand, we'd like to move things forward and fix a few problems, but on the other hand, we may still want to continue investigating how other similar cases relate to nomination.
I wonder whether we're in a position where we understand them and should jump into implementation, or should use this proposal to explore various options. I'd prefer to avoid a situation where we use NNN just because it's easier to reuse something that already exists.
What do you think about listing different use cases first? The obvious ones that I see are:
- Scheduler decides to preempt some pods and wants to inform other components that there is a pod which is about to take the freed-up space (the final assignment decision can be changed at any time)
- Scheduler decided to bind a pod, but performs pre-binding steps first, so wants to inform other components that the binding process started (the final assignment is already decided, only consecutive preemption may change it)
- CA case
- Scheduler itself tries to find a place for group of pods in gang-scheduling case
- Kueue case (external scheduler does gang-scheduling)
etc.
Looking into above list, CA case sounds the most similar to the preemption case (a nomination is a suggestion that can change without any harm). Delayed binding is a bit different one though.
Your 5 use cases are also what I could come up with now, but none of them seem problematic to me if we just reuse NNN. (Do they to you?)
I'd prefer to avoid a situation when we use NNN just because it's easier to reuse something that already exists.
Unless we come up with problems or further use cases that NNN cannot accommodate, it's natural to pick up the simplest solution, just reusing NNN. Rather, the ease/simplicity is actually a big factor.
We can expand NNN's concept and still say NNN is the node name proposed for the next step, in general. If external components set it or preemption sets it (i.e., before scheduling cycles), NNN means the proposal to the scheduling cycle. If the scheduling cycle sets it, NNN means the proposal to the binding cycle.
I don't think we need to try coming up with all possible future use cases right now (impossible), especially because NNN is an existing field. We can design a completely new thing when such a use case or problem actually comes up as time goes on, and switch away from NNN (if needed). That's my thought process.
I updated the KEP once, based on the discussion that we've had so far.
/retest
This LGTM - @dom4ha - I think this is waiting for your review before we ping PRR approver.
#### NominatedNodeName can already be set by other components now.
There aren't any guardrails preventing other components from setting NominatedNodeName now.
I think there are rules in RBAC preventing other components from updating Pod.status, where the nominatedNodeName is defined.
Where is that? Do we really have such?
RBAC is per object (or subresource) - there is no way (and no plans) to allow per-field restrictions or conditions in RBAC.
So no - if you have permissions to update PodStatus, you can also set NNN.
Who has permission to update PodStatus today?
In such cases, the semantics are not well defined now and the outcome may not match user expectations.
This KEP is a step towards clarifying these semantics instead of maintaining the status quo.
I think it's not clarifying; it's supposed to allow other components to act as if they were co-schedulers, without a mechanism by which kube-scheduler could validate their decisions.
kube-scheduler treats it as a hint to try a given node first. So yes - it does validation; it never blindly accepts it.
Shouldn't we analyze the cases in which validation may fail and how the scheduler communicates that?
For instance, what if there is overallocation? We have a legitimate overallocation case in preemption, and we also know there can be races setting NNN due to our async (delayed) API calls.
As stated, the scheduler doesn't clear NNN on its own. Even if NNN causes overallocation, the scheduling cycle just ignores NNN, but doesn't clear it on its own.
Because, as you said, sometimes overallocation might be an expected scenario (while waiting for the external component or the preemption process to complete). If NNN's overallocation is really not expected, then the component who set NNN should clear it.
- Scheduler is allowed to overwrite `NominatedNodeName` at any time in case of preemption or
the beginning of the binding cycle.
- No external components can overwrite `NominatedNodeName` set by a different component.
- If `NominatedNodeName` is set, the component who set it is responsible for updating or
How can a component know that it set the NNN (owns it)?
We should've clarified that. They can check `ManagedFields`.
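Such an ownership check could look roughly like this. The `ManagedFieldsEntry` below is a trimmed stand-in for the real `metav1.ManagedFieldsEntry`, and real code would parse the FieldsV1 JSON rather than substring-match it; this only illustrates the idea.

```go
package main

import (
	"fmt"
	"strings"
)

// ManagedFieldsEntry is a minimal stand-in for metav1.ManagedFieldsEntry
// (only the pieces needed here). The real entries record, per manager,
// which fields it owns in FieldsV1, a JSON structure with "f:" keys.
type ManagedFieldsEntry struct {
	Manager     string
	Subresource string
	FieldsV1    string // raw JSON of the owned field set
}

// ownsNNN reports whether the given manager currently owns
// status.nominatedNodeName according to managedFields. (Sketch only.)
func ownsNNN(entries []ManagedFieldsEntry, manager string) bool {
	for _, e := range entries {
		if e.Manager == manager && e.Subresource == "status" &&
			strings.Contains(e.FieldsV1, `"f:nominatedNodeName"`) {
			return true
		}
	}
	return false
}

func main() {
	entries := []ManagedFieldsEntry{
		{Manager: "cluster-autoscaler", Subresource: "status",
			FieldsV1: `{"f:status":{"f:nominatedNodeName":{}}}`},
	}
	fmt.Println(ownsNNN(entries, "cluster-autoscaler")) // true
	fmt.Println(ownsNNN(entries, "kueue"))              // false
}
```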
Technically it's doable, but I think that metadata is different-level information that is designed to assure consistency of api-server updates. It does not seem right to use it to denote ownership (whoever updates a field first wins the ownership, unless it's kube-scheduler).
Shall such components watch whether kube-scheduler hasn't overwritten their hint and give up updating it anymore unless kube-scheduler clears it (which it rarely, but sometimes, does)? Can a component set NNN again? I guess the autoscaler could set it again, but what if Kueue wants to perform its follow-up scheduling (because the previous one was incorrect)? It cannot overwrite because it does not own it anymore?
My thinking is that allowing many components to write to the same field is something that we don't typically do in Kubernetes. If we do, then there is a common pattern to sync to the current state and reconsider the decision. In our case we'd have a more complicated rule, as described above.
I know how we can reuse this field to denote decisions or intentions made by kube-scheduler, and I'd prefer to keep the ownership of this field and possibly extend it when we know how resource reservation should work. We now effectively enable resource reservation and many schedulers, but with a very imperfect story behind it that maybe addresses the autoscaler case, but nothing more.
It does not seem right to use it to denote ownership (whoever updates a field first wins the ownership, unless it's kube-scheduler).
Hmm, I'm not sure why it doesn't look right to you. Is it mentioned somewhere as a bad practice, or do you have a specific reason? If it looks really weird to use it, I'm fine with creating another supplemental field to declare "who set NNN", but I'm not very sure if using `ManagedFields` is so bad that we have to explicitly create such a new field to declare the ownership.
Shall such components watch whether kube-scheduler hasn't overwritten their hint and give up updating it anymore unless kube-scheduler clears it (which it rarely, but sometimes, does)?
If kube-scheduler set it, they shouldn't update NNN because the NNN value is not theirs anymore.
Can component set NNN again?
As long as NNN's value is set by themselves, they can decide whether to clear it or update it with a new node name. (if their NNN is not valid anymore)
what if Kueue want to perform its follow up scheduling (because previous one was incorrect)? It cannot overwrite becuause it does not own it anymore?
Ideally, Kueue should clear or update it when it observes `PodScheduled: false`, which indicates their NNN didn't work. It can overwrite because it's still theirs, unless NNN is overwritten by the scheduler.
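That responsibility could be sketched as a small reconcile step; the types and `reconcileNNN` helper below are hypothetical stand-ins, not Kueue's actual controller code.

```go
package main

import "fmt"

// PodStatus is a hypothetical stand-in for the relevant pod status bits.
type PodStatus struct {
	NominatedNodeName string
	PodScheduled      bool // is the PodScheduled condition True?
}

// reconcileNNN sketches the owner's responsibility described above: if the
// NNN this component set did not lead to successful scheduling, either
// update it with a recomputed node or clear it ("" clears) so it stops
// reserving capacity. If the scheduler overwrote NNN, it's not ours.
func reconcileNNN(s *PodStatus, myNNN string, recomputed string) {
	if s.NominatedNodeName != myNNN {
		return // overwritten by the scheduler; not ours anymore
	}
	if s.PodScheduled {
		return // our nomination worked; nothing to do
	}
	s.NominatedNodeName = recomputed
}

func main() {
	s := &PodStatus{NominatedNodeName: "node-a", PodScheduled: false}
	reconcileNNN(s, "node-a", "node-b")
	fmt.Println(s.NominatedNodeName) // node-b
}
```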
But, for the long term, especially for a required TAS, we should allow it to perform a follow-up scheduling in a better way because they can't modify the required node selector after un-gating the pods. (NodeSelector is immutable when it's un-gated)
This NNN + required node selector combination would have a general use case for such an external scheduler. What we need for such external schedulers is to somehow allow them to mutate the required node selector even after un-gating the pods, or to allow them to gate the pods again to modify the selector.
After those kinds of improvements are done, Kueue with a required TAS (or some other external schedulers) can re-compute and update NNN together with the required node selector when it observes `PodScheduled: false`.
A bit went off track, but that's the whole context that I have for Kueue/external scheduler stuff. (Full discussion can be found at #5329 (comment))
My thinking is that allowing many components to write to the same field is something that we don't typically do in Kubernetes.
Right, that's the reason we have a clear rule that NNN has to be owned by the component who set it, and only the owner component or the scheduler can update/clear it.
We now effectively enable resource reservation and many schedulers, but with a very imperfect story behind it that maybe addresses the autoscaler case, but nothing more.
I don't get what you want to say here. Do you not want to enable resource reservations made by other components at all? Please explicitly point out what the imperfect story is, or rather, what the problem with the current proposal is. Otherwise, we cannot start a discussion of the cause of the imperfections and how we can address/improve them.
I think one place where a field is not owned by a single component is the `unschedulable` field in the node spec. IIUC we don't generally know who cordoned a node (kubectl or some automation), but whoever did it has to assume it is safe to un-cordon, even though there may be some other controller that would like to keep the node cordoned. Are there other fields like this?
I think one place where a field is not owned by a single component is unschedulable field in node spec.
We had an offline chat with Tim and he didn't consider this an issue either, so we're good here.
Hmm, I'm not sure why it doesn't look right to you. Is it mentioned somewhere as a bad practice, or do you have a specific reason?
I'm not sure what the practice looks like, so it was just my gut feeling. We can defer the decision to implementation. It sounds clear to me that components need to track whether they own the NNN value, so we probably should add that to the KEP.
This NNN + required node selector would have a general usecase for such an external scheduler. What we need for such external schedulers is to somehow allow them to mutate the required node selector even after un-gating the pods, or allow them to gate the pods again to modify the selector.
I also think that it's a good pattern that Kueue and other external schedulers could follow. NNN is best effort, so whenever node selector changes, scheduler will follow it no matter if NNN was set or not.
I don't get what you want to say around this. You don't want to enable resource reservations made by other components at all?
Sorry for not being precise. In the Kueue case, setting both NNN and the node selector means that the scheduler cannot change NNN, because of the node selector. So in case the pod cannot be scheduled for some reason (when the Kueue decision conflicts with kube-scheduler), the resources remain reserved anyway. In this situation Kueue should clear NNN, and I think it's already described in the KEP.
@mimowo what Kueue currently does in situation when a workload scheduled by Kueue is in fact unschedulable in kube-scheduler? Does Kueue change node selectors or removes pods?
@mimowo what Kueue currently does in situation when a workload scheduled by Kueue is in fact unschedulable in kube-scheduler? Does Kueue change node selectors or removes pods?
Currently we let the Pod stay Pending. To mitigate this user can configure "waitForPodsReady" which after a timeout (5min) will evict the entire Job (with all Pods), and re-try queuing and scheduling from scratch.
Does Kueue change node selectors or removes pods?
Workload eviction (in case of hitting waitForPodsReady) means suspending the Job (in case of a k8s Job this is `spec.suspend=true`). Then, the k8s Job controller deletes the Pods. The Job re-creates the Pods when the workload scheduling is re-tried by Kueue.
Currently we let the Pod stay Pending. To mitigate this user can configure "waitForPodsReady" which after a timeout (5min) will evict the entire Job (with all Pods), and re-try queuing and scheduling from scratch.
It sounds reasonable to keep the resources "reserved" for that time then. Obviously some higher priority pods could still jump in that place.
@mimowo Please note that we brought back the Kueue use case, so setting NNN by Kueue can speed up scheduling of some bigger workloads.
@mimowo Please note that we brought back the Kueue use case, so setting NNN by Kueue can speed up scheduling of some bigger workloads.
Awesome!
Moreover:
- Regardless of who set `NominatedNodeName`, its readers should always take that into
consideration (e.g. ClusterAutoscaler or Karpenter when trying to scale down nodes).
- In case of faulty components (e.g. overallocating the nodes), these decisions will
This is the most problematic case. Incorrect or faulty decisions would influence cluster behavior and, for instance, prevent cluster scale-down. I don't think it's a good idea.
Clearing NNN by the scheduler would open a possibility for other components to race for setting NNN (for instance Kueue and autoscaling). I really think that we need a separate concept of nominated resource hints (not only for nodes) which would be solely owned by external components. The scheduler would convert them to NNN as soon as it lets a pod go through a scheduling cycle.
It's debatable where exactly that should happen; I think in Reserve or even earlier (an unschedulable pod can have NNN set if the nominated node does not exist yet, or if it's not ready yet, as in the Node Readiness Gates proposal).
It's possible to achieve that in an additive way - if we were to introduce a new field "state", then the component would set `NNN=<proposal> state="hint"` and then the scheduler could promote it to `NNN=<proposal> state="nomination"`.
I think it's a core thing - yes, we could try to solve every single aspect at once, but it will take years to agree and execute - as long as we see a compatible path forward, that's not needed.
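The hint-to-nomination promotion mentioned above could be sketched as follows. The `state` field and every type here are purely hypothetical; nothing like this exists in the API today.

```go
package main

import "fmt"

// NNNState is a hypothetical state accompanying NominatedNodeName,
// distinguishing an external hint from a scheduler-confirmed nomination.
type NNNState string

const (
	StateHint       NNNState = "hint"       // set by an external component
	StateNomination NNNState = "nomination" // promoted by kube-scheduler
)

// Nomination pairs the proposed node with its state.
type Nomination struct {
	NodeName string
	State    NNNState
}

// promote is what the scheduler would do after running the pod through a
// scheduling cycle and confirming the external proposal.
func promote(n *Nomination) {
	if n.State == StateHint {
		n.State = StateNomination
	}
}

func main() {
	n := &Nomination{NodeName: "node-a", State: StateHint}
	promote(n)
	fmt.Println(n.State) // nomination
}
```

Being additive, old clients that only read `NominatedNodeName` would keep working unchanged, which is the compatible path forward the comment refers to.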
Incorrect or faulty decisions would influence cluster behavior and for instance prevent cluster scale down. I don't think it's a good idea.
My thought here is:
Even NNN put by scheduler preemption could be wrong, but today it influences other components (i.e., CA), as I argued in another thread. I'm not saying that influence is wrong, because CA should take ongoing preemption into consideration so that it doesn't create a new node for the preempting pod, which would be a double effort. But the same thing can be said of NNN set by other external components.
As long as the external component that puts NNN clears (or updates) it as soon as it finds the current NNN has become invalid, it should be fine. That's its responsibility.
Clearing NNN by scheduler would open a possiblity for other components to race for setting NNN
As stated, it's not a good practice to have multiple components that might set NNN. There should be only one. Of course the scheduler would be another component that sets NNN, but as mentioned, the scheduler's NNN takes precedence.
Even if we separate the field, the same thing happens under the same condition: many components try to put the hint field.
My question is still open. Do we want to allow the system to consume potentially incorrect decisions of other components which kube-scheduler is not able to correct?
If you propose how to correct it, why isn't that part of this KEP?
Do we want to allow the system to consume potentially incorrect decisions of other components which kube-scheduler is not able to correct?
Right, the scheduler cannot determine "this NNN is invalid".
I believe other components have to take this decision into consideration because NNN is a general sign that "someone is working on getting this pod scheduled there" ("someone" might be the cluster autoscaler, the scheduler preemption process, etc.).
The other components have to take it into consideration so that they don't duplicate effort for the same pod. For example, CA today ignores pods with NNN set by preemption because it believes "NNN is set -> the preemption happened -> no need to create a new node". Similar idea.
If you propose how to correct it, why it's not a part of this KEP?
I don't propose that kube-scheduler correct it, because the component that set NNN is the only one that can tell "this NNN is invalid". That's why, instead, I propose that the component that set NNN is responsible for updating/clearing NNN when it becomes invalid.
For example, the cluster autoscaler sets NNN on a non-existing node. From the scheduler's point of view, it cannot tell whether this NNN is valid (= a new node is coming, on its way) or no longer valid (= CA created the node, but somehow the creation failed). All the scheduler can do is believe:
- As long as NNN is there, the cluster autoscaler is correctly scaling up a new instance, and so far so good.
- If NNN becomes invalid, the cluster autoscaler should immediately clear it (or update it with a new one)
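The contract sketched in these bullets (the component that wrote NNN owns it, and must clear or replace it the moment it becomes invalid) could look roughly like the following reconciliation pass. All names here (`podNomination`, `reconcile`, `nodeHealthy`) are hypothetical illustrations, not the actual cluster-autoscaler code:

```go
package main

import "fmt"

// podNomination models a nomination owned by an external component
// (e.g. a cluster autoscaler); field names are illustrative only.
type podNomination struct {
	pod           string
	nominatedNode string
}

// reconcile is a single pass of the owning component's loop.
// nodeHealthy reports whether the nominated placement is still on
// track (the node exists, or is still coming up). When it is not,
// the owner clears its own stale hint instead of leaving it behind.
func reconcile(n *podNomination, nodeHealthy func(string) bool) {
	if n.nominatedNode == "" {
		return // nothing nominated, nothing to own
	}
	if !nodeHealthy(n.nominatedNode) {
		fmt.Printf("clearing stale NNN %q on pod %q\n", n.nominatedNode, n.pod)
		n.nominatedNode = "" // in the real API this would be a status update
	}
}

func main() {
	nom := &podNomination{pod: "web-0", nominatedNode: "new-node-1"}
	// Simulate a failed scale-up: the nominated node never became healthy.
	reconcile(nom, func(string) bool { return false })
	fmt.Println(nom.nominatedNode == "") // true: the owner cleared its hint
}
```

The key design point from the thread is that only the writer of the hint runs such a loop; the scheduler itself never decides that someone else's NNN is invalid.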
For example, the cluster autoscaler sets NNN on non-existing node. From the scheduler pov, it cannot tell whether this NNN is valid (= a new node is coming, on its way) or this NNN is not valid anymore (= CA creates the node, but somehow the creation fails).
This is also a good case. What will happen when the node finally appears, but has initial taints that prevent scheduling? Shall the scheduler find a new NNN for those pods then?
As stated, the scheduler doesn't clear NNN on its own, which allows us to address your scenario. So, the scheduler keeps trying the NNN node (the new tainted node) first until the taints are removed, unless the pod triggers preemption and NNN is updated by the scheduler.
This is also a good case. What will happen when the node finally appears, but has initial taints that prevent scheduling? Shall the scheduler find a new NNN for those pods then?
Adding to what @sanposhiho already mentioned above: NNN doesn't prevent the scheduler from putting the pod elsewhere. So imagine that we have NNN set to something invalid (whether a non-existing node or an overloaded node, it doesn't matter). When there is a different place to put that pod and the scheduler tries that pod, it will start by checking NNN, but it will fall back to other nodes if NNN doesn't work. So if there is a different place for it, the scheduler will use it and place the pod there, despite a badly chosen NNN.
This is what I also meant when we were discussing it internally by "as long as scheduler can say veto to decision set via NNN this seems fine" - and the scheduler by definition can do that, because NNN is (and will continue to be) only a hint of the place that "we believe would work best for this pod".
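The behavior described here (honor the hint when feasible, veto it and fall back otherwise) can be sketched as below; `node`, `fits`, and `pickNode` are simplified stand-ins for illustration, not the real kube-scheduler types or plugin API:

```go
package main

import "fmt"

// node is a toy stand-in for a cluster node.
type node struct {
	name    string
	freeCPU int
}

// fits reports whether a pod requesting reqCPU can land on n.
func fits(n node, reqCPU int) bool { return n.freeCPU >= reqCPU }

// pickNode tries the NominatedNodeName first; if the nominated node is
// absent or infeasible, it silently falls back to evaluating all nodes,
// so a badly chosen NNN never blocks the pod.
func pickNode(nodes []node, nominated string, reqCPU int) (string, bool) {
	if nominated != "" {
		for _, n := range nodes {
			if n.name == nominated && fits(n, reqCPU) {
				return n.name, true // hint honored, rest of the nodes skipped
			}
		}
		// NNN infeasible or node missing: veto the hint and fall through.
	}
	for _, n := range nodes {
		if fits(n, reqCPU) {
			return n.name, true
		}
	}
	return "", false // genuinely unschedulable
}

func main() {
	nodes := []node{{name: "node-a", freeCPU: 2}, {name: "node-b", freeCPU: 8}}
	got, _ := pickNode(nodes, "node-b", 4)
	fmt.Println(got) // node-b: the nominated node is feasible, so it wins
	got, _ = pickNode(nodes, "node-a", 4)
	fmt.Println(got) // node-b: the hint was infeasible, scheduler fell back
}
```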
#### Race condition
If an external component adds `NominatedNodeName` to the pod that is going through a scheduling cycle, |
Note that the async API KEP introduces a new race here, since the scheduler could set NNN in its cache, but not in the API. In the meantime an external component can set it to something different, so when we receive the external update, we need to distinguish whether the different NNN is an NNN change made by the external component, or the race condition described above, in which case it can be ignored.
Separating nomination hints should help us track external changes and allow us to always ignore any incoming NNNs (as we currently do), as scheduler is the only one who can set it.
IMHO it should help us be more clear if we keep NNN in cache from NN (assume)
Note that async API KEP introduces a new race here, since scheduler could set NNN in its cache, but not in api. In the meantime external component can set it to something different, so when we receive the external update, we need to distinguish whether the different NNN is a NNN change that external component did or it's the race condition described above and it can be ignored.
Right, but this is a general problem that we have to solve in the async KEP, as I mentioned at #5249 (comment). We can have the same problem with any field of any kind of resource.
Separating nomination hints should help us track external changes and allow us to always ignore any incoming NNNs (as we currently do), as scheduler is the only one who can set it.
Right, but this reason does not justify separation enough, because, again, it can happen anywhere, and we cannot have duplicated fields every time we have the same problem with some fields in some objects.
Right, but this reason does not justify separation enough, because, again, it can happen anywhere, and we cannot have duplicated fields every time we have the same problem with some fields in some objects.
No, the scheduler modifies only the fields it owns, and that's probably true for most other fields. It would be nice to hear what the exceptions are.
Yes, there are areas that are affected indirectly (like node resource allocation), but we're the owner of this area, and kube-scheduler was designed to solve that problem. We should identify all such new areas while designing the async API call, so as not to miss anything and not introduce new ones.
As I mentioned in #5249 (comment), I agree NNN would be the only case within the default scheduler, but at the same time, we cannot tell/control what custom plugins do or how custom resources are designed. So, that all depends on the scope of #5249.
Also, again, either way, this reason alone is not big enough to make the field separate.
We can just prioritize the NNN patch made by the scheduler even if an external component puts NNN at a similar time. This is aligned with what we stated: the scheduler's NNN takes precedence over others.
I don't get what you want to say around this. You don't want to enable resource reservations made by other components at all? Please explicitly point out the imperfect story you're seeing, or, I'd say, a problem with the current proposal. Otherwise, we cannot start a discussion of the cause of the imperfections and how we can address/improve them.
Sorry for being unclear here. What I meant is that resource reservation is quite a broad topic, and I think we all agree that we're not discussing all the possibilities and don't attempt to do that yet. This is why I'm calling it imperfect in this sense, and not the ultimate reservation solution, which most likely will look different (@wojtek-t is working on a more specific proposal).
So it's a bit unfortunate that nomination hints in fact also cause implicit reservation, but since we agreed that it's good enough, I think we can consider bringing the Kueue case back. Even though they express scheduling constraints by setting node labels, giving the scheduler a nomination hint would allow us to bypass the Filtering and Scoring phases completely, which may really matter for large workloads. WDYT?
@dom4ha - the user story for Kueue is already mentioned in the KEP and I don't think it requires any changes in the KEP; I'm tentatively ok with that but I think we can further discuss that during implementation
In other words, from the state machine perspective, there is a visible difference in who set the
`NominatedNodeName`. If it was the scheduler, it may mean that there is already an ongoing preemption.
If it was an external component, it's just a hint that may even be ignored.
However, if we look from the consumption point of view - these are effectively the same. We want
I wouldn't say so. Nomination hints may be unintentionally wrong, and the system is not able to distinguish whether the nomination is really accepted by the system and should be the "plan of record", or is incorrect; and we don't have a mechanism to clear or correct it.
If the scheduler doesn't have a better option in mind, it effectively is the plan of record for the system. And the scheduler can change it (but it can change it now, too).
As I mentioned in several places, can you show some example scenarios that cause trouble if we cannot distinguish them? I just haven't heard any, and hence we have kept this design from the beginning. I do know we cannot distinguish them, but at the same time I don't know what happens if we can't. Without that, we cannot discuss this matter on the same page.
Here's my detailed assumption (basically, all the discussion happened at #5329 (comment)), if it helps:
In the first place, we must assume that the nomination set by the scheduler preemption could also be wrong.
So, anyone (including the scheduler preemption, the cluster autoscaler, and any external component) can be wrong. Some of that might be because they don't do enough scheduling simulation. But another reason is that there's an inevitable time gap between when they compute the cluster state and put NNN, and when the scheduling cycle actually processes the pod. So, even if the scheduling simulation is perfect, NNN could be wrong, and that's the reason we're saying NNN is best effort.
So, essentially the scheduler preemption vs external components: I don't see the difference. Both of them are: 1. compute things, 2. put NNN, 3. maybe change the cluster state (2 and 3 can be opposite or async depending on the components)
And, in order to wait for (3) to complete (in the case of CA, that's node creation; in the case of preemption, that's pod termination, etc.), other components should take NNN into consideration to prevent duplicated effort to make the pod schedulable, regardless of who set NNN. After all, if (3) fails, or (3) completes but NNN is no longer correct (e.g., CA creates a new node, but the node is filled by other high-priority pods), then the component should clear NNN or update it with a new value.
This is the reason of this statement here in the KEP: "if we look from consumption point of view - these are effectively the same. We want to expose the information, that as of now a given node is considered as a potential placement"
To add to my last comment though, I'm not saying we shouldn't separate the fields. It's just that I don't see the need to do it right now, and we can do it later when we actually see the need.
I just don't want to introduce such a rich design without (yet) seeing actual problems or use cases that are clear to us right now.
To add to my last comment though, I'm not saying we shouldn't separate fields. It's just that I don't see the need of doing it right now, and we can do that later when actually we see the need of that.
Provided that we can do that in an additive way, which we believe we can. So I agree with that.
We are limited in time, so we probably won't be able to resolve all the concerns today. So let's go ahead with what we have and get others' feedback. In the worst case we will cut the scope of the proposal. The original delayed-binding use case is very important and less controversial, so we should be able to deliver that.
To make it clear: overall it's LGTM from me.
Several comments from PRR review
Why should this KEP _not_ be implemented?
-->

## Alternatives
Two questions, that I'm not sure where to post:
1. How `NominatedNodeName` plays with `NodeName` - can we ensure this is explicitly pointed out in this document?
2. Have you considered adding a PodSpec `NominatedNodeName` to cover the use-case Wojtek mentioned, to clearly separate nomination (status) from hint (spec)? I'd imagine this being at least explored in the alternatives section with an explicit reason why it was rejected.
1. NNN is treated as the "current plan of record", but something that is not necessarily a final decision. NodeName is the actual binding decision.
2. Yes - it was explored, and we decided that distinguishing that for now is not really needed and can be done in an additive way in the future. So let's not add the API until we are really sure it's needed.
/approve
the PRR, with the caveat that the upgrade->downgrade->upgrade section will get updated as soon as the implementation lands and it's possible to test that
There aren't any restrictions preventing other components from setting NominatedNodeName as of now.
However, we don't have any validation of how that currently works.
To support the use cases mentioned above, we will adjust the scheduler to do the following:
- if NominatedNodeName is set, but the corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon]
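A minimal sketch of the adjusted rule in the bullet above; `clearNNNOnUnschedulable` and its parameters are invented for illustration (the real change lives in kube-scheduler's handling of unschedulable pods):

```go
package main

import "fmt"

// clearNNNOnUnschedulable decides whether to drop a pod's NominatedNodeName
// when the pod turns out to be unschedulable. keepForMissingNode toggles the
// proposed behavior: keep the hint when the node doesn't exist yet, assuming
// it may appear soon (e.g. a scale-up in flight).
func clearNNNOnUnschedulable(nominated string, nodeExists, keepForMissingNode bool) bool {
	if nominated == "" {
		return false // nothing to clear
	}
	if !nodeExists && keepForMissingNode {
		return false // proposed: keep the hint for a node that may appear
	}
	return !nodeExists // prior behavior: clear when the node is gone
}

func main() {
	fmt.Println(clearNNNOnUnschedulable("new-node-1", false, true))  // false: hint kept
	fmt.Println(clearNNNOnUnschedulable("new-node-1", false, false)) // true: old behavior clears
}
```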
I agree, a warning might be too expensive, but a v(1) or v(2) log is good.
5. ensure that it gets cleared
6. upgrade
7. set NNN to a non-existing node
8. ensure it isn't cleared again
Normally, this is required for beta graduation, but given this is starting at beta, it's hard to do the test before the implementation. But please, once you land the code change, update this fragment of the document with the steps performed and the results, and ping me so I'm aware this has happened.
ACK - that was the original intention.
/label tide/merge-method-squash
/hold do we need anything else before merging?
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: dom4ha, sanposhiho, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I think we're fine. If needed, we can always open a follow-up. /lgtm