KEP-5229: Asynchronous API calls during scheduling #5249

Conversation

macsko
Member

@macsko macsko commented Apr 16, 2025

  • One-line PR description: Add KEP-5229
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 16, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Apr 16, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Apr 16, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 16, 2025
@macsko
Member Author

macsko commented Apr 16, 2025

/cc @dom4ha @sanposhiho

I have written some possible approaches to these API calls to start the discussion. I will not be visibly active for the next three weeks, but feel free to comment.

- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).


#### 2.2: Make the API calls queued
Member Author

I feel like queueing API calls might be a good direction long term and probably won't be that hard to implement.

Member

@sanposhiho sanposhiho left a comment

Great summary! I'll take time reading through it and put some comments after that.

This would require implementing advanced logic for queueing API calls in the kube-scheduler and migrating **all** pod-based API calls done during scheduling to this method,
potentially including the binding API call. The new component should be able to resolve any conflicts in the incoming API calls as well as parallelize them properly,
e.g., don't parallelize two updates of the same pod. This requires [making the API calls queued](#22-make-the-api-calls-queued) or
[sending API calls through a kube-scheduler's cache](#23-send-api-calls-through-a-kube-schedulers-cache) to be implemented.
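As a rough illustration only (hypothetical names, not the KEP's design), such a component could serialize calls per pod while still parallelizing across pods:

```go
package apicallqueue

import "sync"

// call is a hypothetical record of one pending API call for a single pod.
type call struct {
	podUID string
	send   func() error // the actual request to the apiserver
}

// Dispatcher keeps at most one call per pod in flight, while calls for
// different pods run in parallel, so two updates of the same pod are never
// sent concurrently.
type Dispatcher struct {
	mu       sync.Mutex
	inFlight map[string]bool   // pod UID -> a call is currently executing
	pending  map[string][]call // pod UID -> calls waiting behind the in-flight one
}

func NewDispatcher() *Dispatcher {
	return &Dispatcher{inFlight: map[string]bool{}, pending: map[string][]call{}}
}

// Enqueue starts the call immediately, or parks it behind the call that is
// already in flight for the same pod.
func (d *Dispatcher) Enqueue(c call) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.inFlight[c.podUID] {
		d.pending[c.podUID] = append(d.pending[c.podUID], c)
		return
	}
	d.inFlight[c.podUID] = true
	go d.run(c)
}

func (d *Dispatcher) run(c call) {
	_ = c.send() // retries and error handling are omitted in this sketch
	d.mu.Lock()
	defer d.mu.Unlock()
	if waiting := d.pending[c.podUID]; len(waiting) > 0 {
		d.pending[c.podUID] = waiting[1:]
		go d.run(waiting[0])
		return
	}
	delete(d.inFlight, c.podUID)
	delete(d.pending, c.podUID)
}
```

Conflict resolution (merging or dropping queued calls that became irrelevant) would sit on top of such a structure, and that is where most of the real complexity lives.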
Member

Conceptually what you need here is something pretty similar to DeltaFIFO we use in client-go:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/delta_fifo.go

We wouldn't operate on objects but on their updates; still, it's conceptually exactly what you need:

  • tracking changes to a given object all together
  • a concept of deduping (if I already scheduled a pod but didn't yet send the failures, don't send them; if I have multiple failures not yet sent, is only the last one valid?, etc.) - a rough sketch of this follows below
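A possible shape of that dedupe, with illustrative names only:

```go
package apicallqueue

import "sync"

// statusUpdate is an illustrative stand-in for a queued, not-yet-sent update.
type statusUpdate struct {
	reason, message string
}

// pendingUpdates collapses queued updates per object key, DeltaFIFO-style:
// if several failures queue up before anything is sent, only the last one
// remains, and a failure that became irrelevant can be dropped entirely.
type pendingUpdates struct {
	mu     sync.Mutex
	latest map[string]statusUpdate // object key -> newest queued update
}

func newPendingUpdates() *pendingUpdates {
	return &pendingUpdates{latest: map[string]statusUpdate{}}
}

func (p *pendingUpdates) enqueue(key string, u statusUpdate) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.latest[key] = u // overwrites anything older queued for the same object
}

func (p *pendingUpdates) drop(key string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.latest, key) // e.g. the pod got scheduled, so the failure is moot
}
```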

Member

Sounds promising, but after a brief look it's not super clear to me how it works and whether it addresses all the problems.

The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance, I suspect some of them may block re-scheduling, while others may not.

Member

I think my comment could have been misleading.
It was supposed to be a meta-comment - I don't think we can really reuse any code from DeltaFIFO.
What I was trying to say is that it's effectively the same conceptual pattern that DeltaFIFO uses.

Member

The logic on the scheduler side may still be quite complex depending on which type of update is pending. For instance, I suspect some of them may block re-scheduling, while others may not.

That's interesting - in particular:

  1. what are the examples where an API call should prevent further attempts to do something with a given pod?
    [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]

  2. is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?

Member

  1. what are the examples where an API call should prevent further attempts to do something with a given pod?
    [The only one I can see is preemption, but it's not even for a given pod, so it doesn't belong to this category]
  2. is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?

I don't know the answer off the top of my head, but it's not obvious to me yet that all updates can be skipped and that we don't need to wait for updates to be persisted. That's why adding an analysis section and a categorization of the API calls is needed.

One example that comes to my mind is preemption of a just-bound pod. It would be safer to let the binding call happen before we start the preemption process, to avoid a situation where we preempt something that is not bound yet.

Member

One example that comes to my mind is preemption of a just-bound pod. It would be safer to let the binding call happen before we start the preemption process, to avoid a situation where we preempt something that is not bound yet.

That's an interesting scenario - I just don't think that we should let the binding happen there. We probably need smarter deduping logic.
If we know that we will preempt the pod, there is no point in doing the binding. We need to think about what it means from the "pre-binding" perspective (how to ensure we will not leak something), but directly moving the pod to a failed state probably seems better than letting kubelet start it and bring it down immediately after.

Member

One example that comes to my mind is preemption of a just-bound pod. It would be safer to let the binding call happen before we start the preemption process, to avoid a situation where we preempt something that is not bound yet.

I guess it actually happens today? Because we're running the binding cycle asynchronously.
Let's say pod-1 goes to the binding cycle (after "assume"), pod-2 triggers the preemption, and pod-2 could preempt pod-1 before pod-1's binding cycle completes.
And, what happens today (I guess) is that, because pod-1 is deleted, the binding cycle for pod-1 would just fail.

But, that is a good example showing the difficulty of this KEP: possibly, not only API calls for the same object, but also API calls for different objects could depend on one another...

Member Author

@macsko macsko May 20, 2025

  1. what are the examples where an API call should prevent further attempts to do something with a given pod?

I haven't found any such API call. Even preemption should not be a problem.

Another thing to consider is how to update the Pod's status in the scheduler's memory. Now, since the API calls are blocking, we don't need to persist the status change before the call, because the event handlers will soon receive the Pod/Update event with the current status and update the Pod object there.

  1. is there any case where, if I have queued two different API calls for a given pod P, we actually want to send both, not just the last one?

We have three kinds of API calls for a Pod in the scheduler: deletion (preemption), binding, and status update (unschedulable or NNN). There is no case where we would want to send two of these calls for a single Pod (see the API calls categorization section).
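To make the categorization concrete, an illustrative dedupe rule (hypothetical names, not the KEP's types) could look like this:

```go
package apicallqueue

// podCallKind sketches the three kinds of Pod-based API calls listed above.
type podCallKind int

const (
	podStatusUpdate podCallKind = iota // unschedulable condition or nominatedNodeName
	podBinding
	podDeletion // preemption
)

// supersedes says whether an incoming call makes a still-queued call for the
// same Pod irrelevant: deletion by preemption cancels anything queued for the
// Pod, a binding makes a stale status update pointless, and a newer status
// update simply replaces an older one. This illustrates the rule discussed in
// the thread; it is not the KEP's final policy.
func supersedes(incoming, queued podCallKind) bool {
	switch incoming {
	case podDeletion:
		return true
	case podBinding, podStatusUpdate:
		return queued == podStatusUpdate
	}
	return false
}
```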

@wojtek-t wojtek-t self-assigned this Apr 25, 2025
Making one universal approach to handling API calls in the kube-scheduler could make these calls consistent, as well as allow better control over the number of dispatched goroutines. Asynchronous preemption could also be migrated to this approach.

### Goals
Member

Looking from the high level, I think we may have the following goals (although it's not clear we're able to achieve all of them):

  1. Make the scheduling cycle free of blocking api-calls (any async option is fine)
  2. Skip some types of updates if they would soon become irrelevant due to consecutive updates
  3. Prioritize high-importance updates (binding) over low-importance ones if updates to the api-server get throttled

Maybe we have more, but I think agreeing first on our goals is important before we can select a solution. One of my doubts here is that we may not know yet all the requirements we may have once we start designing workload-aware scheduling and reservations. More api calls may appear at that time.

Member

+1 to it, but I think there is also a question about priorities. In my mental model:

  1. eliminating blocking calls is P0
  2. skipping soon-to-be-irrelevant updates is P1
  3. prioritization of updates is nice to have [I would like that, but it's not necessary if it adds a lot of complexity]

Member

I'm feeling like the third one is the next step, and we can just scope it out for now, at least from this KEP. Of course, we should keep it in mind when discussing this KEP though.
I agree that all three are important, but we don't have to solve all of them at once in one KEP here; the third one looks complex enough to deserve being discussed in another KEP.

Member

which seems to be roughly aligned with what I wrote above

Member

Agree with the above; the order was also on purpose.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 20, 2025
@macsko
Member Author

macsko commented May 20, 2025

I added a section with API calls categorization and adjusted the goals. PTAL

To go further, we could probably agree on some high-level concept (queuing without blocking a pod?).

if the newest status is stored in-memory.
- API calls for non-Pod resources (5 - 10) should be further analyzed as they are not likely to consider the Pod-based API calls,
hence implementing those shouldn't block making (1 - 3) calls asynchronous.

Member

Is there any relevance order between calls for different pods?
I think there isn't (i.e. we don't require NNN to be set before preempting the first victim).

But even if there isn't - it would make sense to call it out explicitly here.

Member

Is there any relevance order between calls for different pods?

I think there is because we need to protect the space freed up by preempted pods for the nominated pod. Sure, pods don't disappear immediately, but we shouldn't rely on it. That obviously puts into question whether we can prioritize some updates over others... which could change the order.

Alternatively we may still want to treat a preemption as one async operation that includes setting nomination and removing pods in the right sequence.

Member

@sanposhiho sanposhiho May 28, 2025

because we need to protect the space freed up by preempted pods for the nominated pod.

But, we store NNN in the nominator when we update NNN. I.e., even if we make the API call async, as long as we report NNN to the nominator synchronously, the space should be reserved and not stolen by other pods' scheduling cycles.
So, I believe we can simply make this NNN API call in the preemption asynchronous.

Member

@sanposhiho sanposhiho May 28, 2025

Or, if we go with the scheduler cache idea, we don't have to be worried at all about that kind of scenario. Any updates are reflected in the scheduler cache synchronously, and things that happen later can immediately refer to the update in the scheduler cache (regardless of whether the API call is done asynchronously or not).

Member

Unless the scheduler restarts in the meantime and we lose the in-memory state.

Alternatively we may still want to treat a preemption as one async operation that includes setting nomination and removing pods in the right sequence.

which effectively introduces the ordering
The question is whether in-memory state is good enough, at least initially.

Member

I referred only to the ordering of api calls, not cache updates, which in turn can become fully synchronous again (we achieve the non-blocking goal by async api calls only).

Member Author

Is there any relevance order between calls for different pods?

I'd say there isn't. Any two API calls made by the scheduler for different pods are independent, and I don't see any counter-example.

I believe we can make this NNN API call in the preemption asynchronous, simply.

That's right.

The question is whether in-memory state is good enough, at least initially.

WDYM by "initially" here?

- Updating Pod status (1, 2) could be less important and called if there is space for it.
It's worth considering if setting `nominatedNodeName` (3) should have the same priority or higher,
because the higher delay might affect other components like Cluster Autoscaler.
- API calls for non-Pod resources (5 - 10) could be analyzed case by case, but are likely less important than (5) and (4).
Member

nit: 6-10

Also - I'm not sure I agree with it. As an example - volume binding is on a critical path for pod binding - so delaying volume binding delays pod binding - which affects overall pod startup. Which is probably what we should actually optimize for.

Member

@sanposhiho sanposhiho May 28, 2025

+1. I actually rather think API calls for non-Pod resources are likely equally important to 4 and 5 (at least for the current default scheduler) because, as described in 6-10 above, most of them are in the binding cycles, which are on the critical path, or in PostFilter (similar to preemption - making space for pending pods).

Member

But, the difficulty here is that we cannot just say "the API calls for non-Pod resources are likely important -> let's prioritize them", because obviously there will be exceptions, especially when it comes to a scheduler with custom plugins.
So, I basically agree with the KEP saying that this is actually case by case.

Member

I guess it leads to another question: should we consider API calls being made within binding cycles? They're already async, so can we just treat them as a single async group of multiple API calls? And, in this KEP, we would only need to add canceling functionality to the binding cycles.

Member

I guess it leads to another question: should we consider API calls being made within binding cycles?

Maybe I misunderstood your intention here, but if we don't consider them, we will face races due to these calls not being coordinated with the others. If the pod was earlier unschedulable, but the call was queued for too long, we need to cancel it once binding starts. So we need coordination between these. From that perspective, having a single place to handle everything sounds better.

Member Author

I guess it leads to another question: should we consider API calls being made within binding cycles?

I think we need to consider pod-based API calls (status update, binding, preemption) to be able to effectively cancel/skip them. Another aspect is the 6 - 10 calls, which I think don't need to be considered, at least initially. The only concern might be the DRA's PostFilter, which can make a blocking API call (but then we would probably need to make all ResourceClaim API calls asynchronous, for reasons similar to the Pod-based ones).

- Simplifies introducing new API calls to the kube-scheduler if the collision handling logic is configured correctly.

Cons:
- Requires implementing complex, advanced queueing logic.
Member

I would say it's debatable how complex that really is - my subjective claim is that it's actually not that complex.

Member

It really depends on what we're doing with this queueing logic. If it's just going to be a queue, it's obviously simple.
But, as mentioned at 2.2, if we are going to merge multiple operations into one, things will get complicated.

- Cannot be used for the `nominatedNodeName` scenario, requiring additional effort and separate handling.


#### 1.3: Use advanced queue and don't block the pod from being scheduled in the meantime
Member

I'm personally strongly for this option - because it allows handling all calls the same way and optimizes for latency.
I think the additional upfront complexity (which, btw, will be well encapsulated in a single place) is a cost that we should be willing to pay for the fact that it becomes unified and simple to use.

Member

I also agree that this should be the way to go. If we reach consensus here, I'd move the other options to Alternatives considered and present one proposal.

- `nominatedNodeName` scenario support would require more effort in (1.1) or (1.2).


#### 2.2: Make the API calls queued
Member

I believe that this should be the path forward - the primary reason to prefer it over 2.3 is that it allows encapsulating the logic in a single place, so the burden is not spread across different (arguably much harder to reason about) places. And it allows avoiding races and enables optimizations (compared to 2.1).

Member

On the other hand, how do we handle external changes? The obvious case is the disappearance of a pod that was supposed to be preempted. It's probably quite an easy case, as we can skip our update, but what if the nominated node name changes once we allow external actors to change it?

Member

Things may change behind the scenes - so there might be races. But for a particular object, relying on optimistic concurrency should just work.

If you're worried about the scheduler preempting something and an external actor changing NNN in the meantime - that has an inherent race in it no matter what we do - we may try to minimize the window, but it will always be there, and I'm not sure I would actually try to optimize for it.

Member

Update - after thinking about it I changed my mind and we may prefer 2.3 actually: #5249 (comment)

Both of the above API calls could be migrated to the new mechanism.

In-tree plugins' operations that involve non-pod API calls during scheduling and could be made asynchronous,
but not necessarily in the first place:
Member

I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.

Member

Moreover, we need to consider the changes planned in the Extended Resource KEP, so it will be an additional challenge to put them together.

Member

because when canceling pod binding, we probably want to cancel any pre-binding calls as well

When do we cancel pod binding?

Member

@sanposhiho sanposhiho May 28, 2025

Never mind, I can now guess you meant what is described later in this KEP (↓):

  • Pod deletion caused by preemption (4) should cancel all Pod-based API calls for such a Pod.

Member Author

I think we may want to consider them from the very beginning, because when canceling pod binding, we probably want to cancel any pre-binding calls as well.

Yes, that could be an optimization, but it is not necessary.

Btw, currently when pod Y is in the PreBind stage and pod X wants to preempt pod Y, a delete API call for pod Y is sent to the apiserver. If pod Y were at WaitOnPermit, preemption would just cancel it, without sending the API call. Shouldn't we optimize similarly for PreBind, given that the number of use cases for PreBind (DRA) is growing? Do we have any reason against it?


@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 9, 2025
@macsko macsko requested review from dom4ha and sanposhiho June 10, 2025 07:11
Member

@dom4ha dom4ha left a comment

Overall looks good. I think the proposal description is mainly missing a statement that some procedures will still be separate goroutines with the ability to wait for the API calls to finish. This way we don't have to implement dependencies between calls that depend on each other.

We should also make it clear which option we want to follow. IIUC, option C is the most reasonable one and easy to implement.


This proposal could be implemented as a second step extension of proposal A.

### Summary of API call management
Member

I think it's worth describing what happens on update failures, either due to conflicts or just reaching the max number of retries. For instance, what happens when binding fails? Is it expected that a goroutine waits for binding completion and reacts to a failure? Are failed updates for which nobody waits ignored (like status updates)?

Member Author

Added the examples to the Retrying API calls section.

@macsko
Member Author

macsko commented Jun 16, 2025

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jun 16, 2025
Member

@wojtek-t wojtek-t left a comment

Just some minor comments from me - other than that PRR lgtm.

and if it's not expected, custom logic could always be added using an `APICall` interface.

However, to support other potential use cases and have the newest object possible in the cache (proposals B and C, and optionally A),
merging the object received by event handlers with API call details should also be added.
Member

Note that merging with objects received from event handlers isn't a full solution. Given those are eventually consistent, there will always be a period of time when the object has already changed but the change is not yet observed by the cache.

Member

I assume our update will fail due to such a conflict, so this is why we should describe how to react to this and other types of failures.

Member

Agree - but propagating objects from handlers isn't the full solution - this may still happen given their eventually consistent nature.

Member Author

@macsko macsko Jun 17, 2025

Yes, I know that there will be such a window when we would still have to merge it, even after executing the call.

But I'm now wondering: what if we execute the update API call and just write the returned updated object to the cache? Then, when we process the updates in the event handlers, we could ignore those that have an older ResourceVersion (as we already have a newer object in the cache).
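For illustration, a minimal sketch of that check, assuming ResourceVersion can be compared numerically (it is formally an opaque string; the assumption holds for the default etcd storage) and using hypothetical helper names:

```go
package apicallqueue

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// isStale reports whether a pod delivered by the event handlers is older than
// the object the scheduler already wrote to its cache after a successful
// update call; such events could simply be ignored, as described above.
func isStale(cached, incoming *v1.Pod) bool {
	c, errC := strconv.ParseUint(cached.ResourceVersion, 10, 64)
	i, errI := strconv.ParseUint(incoming.ResourceVersion, 10, 64)
	if errC != nil || errI != nil {
		return false // when in doubt, accept the incoming object
	}
	return i <= c
}
```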

Member

Yes - we should do that.
But still that isn't a full solution - we can still face conflicts etc. that we need to handle.

Member Author

What conflicts do you mean? If we make an update based on an older object?

Member

Yeah. I mean that no matter what we do, the actual version committed in etcd can be different from what the scheduler thinks it is.

Member Author

Yes, the scheduler could indeed operate on an older version of an object, but we face similar problems with the current implementation. The key is to ensure this is considered when handling an API call. However, it doesn't have any impact on the current (Pod-based) API calls that are (or will be) made by the scheduler.

Member

One potential conflict that we should not ignore is when a pod gets assigned manually by a user. I think that we should have some rules about which fields the scheduler can blindly override and which require pulling them in first and issuing rescheduling of the conflicting changes. NodeName (assignment) is one of them, but possibly there are others, like ReservedFor in ResourceClaims.

Member Author

I think that we should have some rules about which fields the scheduler can blindly override and which require pulling them in first and issuing rescheduling of the conflicting changes. NodeName (assignment) is one of them

I don't think there is much we need to do about that. For the actual use cases, overriding will be enough (maybe with handling external binding/deletion separately). We could defer this discussion to the implementation phase and see what exactly will be needed then.

#### Enqueueing an API call while a previous one is in-flight

One other possible scenario is when an API call is executing (is in-flight) and a new API call for the same object wants to be added.
If both have the same call type, standard merging logic could be used, i.e., merge the new API call with the API call in flight.
Member

I don't understand this - if the old one is already in-flight, how can we merge those?

Member Author

We need to keep the details of the API call in the map, even if the call is in-flight.

Let's say a status update API call is in-flight and wants to set the Pod condition PodScheduled to false with reason foo.
If the next status update is then queued for the same Pod, it can verify via merge whether it actually changes anything. If it doesn't, it can be skipped. If it does, the merge should ignore changes that duplicate the call that is in-flight.
However, the skip is the more important use case.
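For illustration, that check could look roughly like this, with a simplified, hypothetical shape of a status-update call:

```go
package apicallqueue

import v1 "k8s.io/api/core/v1"

// podSchedulingStatus is a hypothetical, simplified view of what a queued
// status-update call wants to write: the PodScheduled condition plus a reason.
type podSchedulingStatus struct {
	scheduled v1.ConditionStatus
	reason    string
}

// shouldEnqueue compares a newly requested status update with the one already
// in flight for the same Pod: if the new call would not change anything beyond
// what is being written right now, it is skipped; otherwise it is queued (and
// only needs to carry what actually differs from the in-flight call).
func shouldEnqueue(newCall, inFlight podSchedulingStatus) bool {
	return newCall != inFlight
}
```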

Member

If it does, the merge should ignore changes that duplicate the call that is in-flight.

This assumes that the request succeeds (or if not - will be retried).

But anyway - I think I get your point - the wording is misleading though.
It's not really that we merge the new incoming request with the old one. It's that we adjust the new incoming one to reflect changes that are already in flight.

It's a phrasing issue - merging for me means that we combine these two into a single one, which is not what we're doing.

Member Author

or if not - will be retried.

I'd say that this assumes that the request failure will be handled, if needed, as explained in the Retrying API calls section.

It's not really the we merge the new incoming request with the old one. It's that we adjust the new incoming one to reflect changes that are already in flight.

Yes, that's right. The wording is probably not that precise, as you said. I'll refine it to make it clearer.

@wojtek-t
Member

This looks fine from PRR perspective.

/approve PRR

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 17, 2025
@dom4ha
Member

dom4ha commented Jun 17, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dom4ha, macsko, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dom4ha
Member

dom4ha commented Jun 17, 2025

/hold
/lgtm

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 17, 2025
Member

@sanposhiho sanposhiho left a comment

Left some comments, but LGTM overall.
I have several unknowns around APIQueue, but I agree with the overall direction, and we can clarify such details in the implementation part.

This might be a good place to talk about core concepts and how they relate.
-->

### Risks and Mitigations
Member

We should consider "custom plugins might start to use this for use cases that we haven't considered" as a risk. There could be many questions like "what if (custom) plugins do this?".
We need to clarify and document a solid recommendation for plugin developers to guide them properly: what kinds of API calls are supported gracefully by this async API mechanism, and what API calls are not supported or don't make sense.
Or, as a first step, maybe we can even start off saying "we don't yet support custom plugins" until we're fully ready.

Member Author

Good point, I added the section about it. I think we could partially support these plugins in the first version, but we have to carefully document allowed use cases.


// APICall describes the API call to be made and stores all required data to make the call,
// e.g. fields that should be updated or the object to be added/removed.
type APICall interface {
Member

@sanposhiho sanposhiho Jun 17, 2025

So, should all the API calls for the same APICallType use the same APICall implementation? What if two different APICall implementations are used to make an API call for a certain APICallType? How would/should APIQueue behave - how are those two Merge()d? Or will we somehow try to prevent such a situation in the first place?

Member Author

@macsko macsko Jun 18, 2025

Yes, we start with the assumption that we have one implementation per CallType, but we might consider extending this behavior in the future. Added it explicitly.

Member

@sanposhiho sanposhiho Jun 18, 2025

At the implementation stage, we need to figure out:

  • how to prevent two different APICall implementations from being used for the same APICallType.
  • how to replace a certain APICall that everyone is using, e.g., custom plugins may want to change how status_update is handled and hence the status_update APICall implementation that is used by everyone.
    • meaning, no one should directly depend on a specific APICall implementation. Everyone should use each APICall implementation indirectly so that we can change it.

I guess the APICall implementations for each type should be stored properly in a single place that is accessible to everyone (a rough sketch of such a registry follows below).
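One possible shape of that, sketched with hypothetical names (the KEP does not define such a registry yet):

```go
package apicallqueue

import (
	"fmt"
	"sync"

	v1 "k8s.io/api/core/v1"
)

// APICallType and APICall stand in for the KEP's types; their exact shapes
// here are assumptions made only for this sketch.
type APICallType string

type APICall interface {
	Execute() error
}

// callRegistry is the "single place accessible to everyone": it maps each
// APICallType to the constructor of its APICall implementation, so nothing
// depends on a concrete implementation and a type's implementation can be
// swapped (e.g. by a custom plugin) in exactly one spot.
type callRegistry struct {
	mu           sync.Mutex
	constructors map[APICallType]func(pod *v1.Pod) APICall
}

// Register refuses a second implementation for the same APICallType, which is
// one way to prevent two different APICall implementations from coexisting.
func (r *callRegistry) Register(t APICallType, c func(pod *v1.Pod) APICall) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.constructors == nil {
		r.constructors = map[APICallType]func(pod *v1.Pod) APICall{}
	}
	if _, ok := r.constructors[t]; ok {
		return fmt.Errorf("APICallType %q already has an implementation", t)
	}
	r.constructors[t] = c
	return nil
}

// New builds the APICall for the given type, or reports that none is registered.
func (r *callRegistry) New(t APICallType, pod *v1.Pod) (APICall, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.constructors[t]
	if !ok {
		return nil, false
	}
	return c(pod), true
}
```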

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@wojtek-t
Member

/lgtm

Unholding based on #5249 (review)

/hold cancel

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jun 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit 3d9a053 into kubernetes:master Jun 18, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 18, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Done in SIG Scheduling Jun 18, 2025