Create a proposal for Buffers API. #8151


Merged
merged 1 commit into kubernetes:master on Aug 12, 2025

Conversation

jbtk
Contributor

@jbtk jbtk commented May 21, 2025

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Adds proposal for Buffers API

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/documentation Categorizes issue or PR as related to documentation. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 21, 2025
@k8s-ci-robot
Contributor

Hi @jbtk. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 21, 2025
@k8s-ci-robot k8s-ci-robot requested a review from vadasambar May 21, 2025 09:36
@k8s-ci-robot k8s-ci-robot requested a review from x13n May 21, 2025 09:36
@jbtk
Contributor Author

jbtk commented May 21, 2025

The initial discussion about this proposal was in a doc: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?tab=t.0

(there are still unresolved conversations there)

@jbtk jbtk force-pushed the buffersproposal branch from d304178 to 4bebd67 Compare May 22, 2025 10:01
@jbtk
Contributor Author

jbtk commented May 22, 2025

Summary of open topics:

  • controller vs library used by autoscalers for translating buffers to pod spec
  • k8s quota limitations (currently lacking in the proposal; will research that and fill in the blanks in the proposal)

@jbtk jbtk force-pushed the buffersproposal branch from 4bebd67 to ee9915b Compare May 27, 2025 12:03
@raywainman
Member

raywainman commented May 27, 2025

/assign @jackfrancis

@jackfrancis - adding you here because I know this is related to things you've thought about in the past with regards to pre-provisioning capacity.

@jbtk
Contributor Author

jbtk commented Jun 2, 2025

Update:

  1. added k8s resource quota handling to proposal
  2. after some discussions I am keeping a separate controller in the proposal. When implementing we will reconsider embedding the pod spec (due to the spec and validation that would need to be copied and maintained)

Collaborator

@towca towca left a comment

The overall idea and direction looks good to me, but IMO we should clarify some fine details in the proposal.

@jackfrancis @gjtempleton Could you give the proposal a high-level review?

@jonathan-innis Could you review the proposal? It's translated from the doc we've already discussed.

If the user uses autoscaling the cluster size will be adjusted to the number of
currently running pods. This means that often when a new pod is created its
scheduling time will need to include the time of creating a new node and
attaching it to the cluster. Many users care deeply about pod scheduling time
Contributor

There is a nontrivial number of writes (events, etc.) associated with evicting, deleting, creating, and scheduling pods. etcd is pretty sensitive to writes -- I suspect there are pretty huge perf benefits to this if you were to measure it.

* User pods can use the available capacity without any configuration changes
(strictly additive, can be managed by the user who owns the pod as well as
cluster administrators)
* Open possibility for cloud providers to offer different types of buffers if
Contributor

@ellistarn ellistarn Jun 5, 2025

I'm struggling with this use case. Abstraction/extensibility often comes at the cost of complexity. I'd be wary of expanding to this scope unless there was a very concrete use case that could not be met in a vendor agnostic way.

Contributor Author

As part of the proposal I am adding a buffer type field (string). It will allow offering different types of buffers, both vendor agnostic and vendor specific.

Ideas that are floating around:

  • with the resource hotplug proposal we could provide a buffer that a node can be resized into, so that new capacity comes up faster (KEP: KEP-3953: Node Resource Hot Plug enhancements#3955)
  • offering a buffer of stopped nodes that can be restarted faster than creating a new node.

I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.
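To make this concrete, a rough sketch of how such a type field could sit in the API is below; the field name, values, and comments are illustrative assumptions rather than the agreed design:

type CapacityBufferSpec struct {
    // Type selects how the buffer is realized. An empty value could mean the
    // default active buffer (virtual pods backed by regular nodes), while
    // vendor- or autoscaler-specific values (for example a hypothetical
    // "stopped-vms") could select other provisioning mechanisms.
    Type string

    // ... remaining fields (pod shape source, number of chunks, etc.) ...
}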


I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.

It's less about complexity on the implementation and more about complexity and cognitive burden of the different "types" of buffers on the user -- if I have multiple buffer types that orchestrate buffers in completely different ways but are using the same GVK, I feel like that's asking for users to get really really confused around what's happening where and why.

At that point, you're better off just creating a separate API yourself. The other option here is that we just allow users to convert the API into pods however they want and they can just plug-on to the buffer using annotations. As we get more concrete use-cases for what folks are adding annotations for, we can start to add features


Contributor Author

  1. Actually I think it is a good thing that there is a single API surface to manage different types of buffers, so the user can test and compare how different buffers behave. Example: I have a workload; I can set up an active buffer or a stopped-VMs buffer. With a single field change it is possible to flip the buffer type and compare performance and cost. Yes - this means that this API needs to be flexible, but since it comes down to pod-like chunks of capacity it should be relatively easy to translate it to any kind of buffer.
  2. About users getting confused - yes, buffers may be able to provide various buffer types in the future, but there will be a single surface to view all the possible overprovisioning mechanisms.

I think that annotations are a good idea if someone adds a new buffer type and the API is not enough for them. I would want to avoid a situation though where we create something that the user did not intend because an annotation failed to parse (for example because an older version of the buffer controller did not support it). Therefore I think that adding a field for type ensures that we can fail fast in case the configuration tries to accomplish something that is not necessarily possible.

additional pods that are representing a buffer (for example treating them as
virtual pods) so that the capacity in the cluster becomes available.

### How do we determine the pod spec for targets using scale subresource?
Contributor

Why not use a pod template spec inline? Imagine the case where you want to leave room for a CI job, but the job doesn't exist in the cluster yet, so you have nothing to reference.

Contributor Author

The proposal has both the option to reference an object that implements the scale subresource and the option to reference a pod template. The user will be able to choose one or the other.

Note that referencing a pod template is tricky. Not every controller requires the user to define one (for example a deployment just embeds it, and copy-pasting is prone to errors coming from the fact that one changes and not the other), and some pods can change over time (for example if VPA is used and scales the pods).
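For illustration, a rough sketch of how the two mutually exclusive pod shape sources could look, loosely based on the type snippets quoted later in this thread; the exact field names and markers are assumptions:

type CapacityBufferSpec struct {
    // optional
    // oneof=PodShapeSource
    // Reference to an object that implements the scale subresource (for example
    // a Deployment or ReplicaSet); the pod shape is derived from the target.
    ScalableRef *ScalableRef

    // optional
    // oneof=PodShapeSource
    // Reference to a PodTemplate in the same namespace; a static shape that can
    // drift from the workload it was copied from, as noted above.
    PodTemplateRef *LocalObjectRef
}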

require more implementation work within the cluster autoscaler in order to
support virtual pods with DRA resources.
* More features that further help users to manage the buffers:
* Buffer schedule - option to define a buffer that is active only during
Contributor

If Buffer supported the scale subresource, you could hook it up to things like https://keda.sh/docs/2.11/scalers/cron/ to achieve this use case.

Contributor Author

This is a potential feature for the future. On the original doc there is discussion about this: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs

Once I implement buffers I will look into this more: how hard it would be to achieve this using just existing tools vs a dedicated field. Thanks for the link to the tool - I will check it out besides testing with plain k8s CronJobs.

Also, implementing the scale subresource would mean that we could use things like HPA to adjust the size of the buffer, which could be useful as well.
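As a sketch only: if the CRD were generated with kubebuilder (an assumption, the proposal does not commit to any tooling) and the buffer exposed a replica count, the scale subresource could be enabled with a standard marker; the .spec.replicas and .status.replicas paths below are hypothetical:

// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas
type CapacityBuffer struct {
    Spec   CapacityBufferSpec
    Status CapacityBufferStatus
}

With something like that in place, an HPA or a KEDA cron scaler could target the Buffer object like any other scalable resource.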

// Depending on the type of referenced object one of the configurations below will be used:
// - for a crd implementing scale subresource (like deployment, job, replicaset)
// the ReplicaBasedBufferSpec will be used
// - for GKE CCC or Karpenter NodePool the NodeClassBufferSpec will be used
Contributor

For node-centric headroom, I suspect that this will be implemented in the NodePool API in Karpenter, and not part of this buffer API.

Contributor Author

I am not sure what the Karpenter team's plans are. We talked with them about this proposal and they commented on the original doc (https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?pli=1&disco=AAABjKBEw_Q).

There is also a separate section on why we do not plan to expose an API of "I just want to have 2 extra nodes"; most likely this would make sense in the context of the NodePool API.

Contributor Author

Just to clarify:

  • buffers focus on the capacity in the cluster; we want to offer the users a single API surface to define spare capacity - whether it is based on a single workload or shared by workloads that run on a single CCC/Karpenter pool
  • in GKE, CCC focuses on setting up a policy/configuration. This is why we wanted a separate, provider-agnostic way of expressing extra capacity that should be provisioned. I am not sure about the Karpenter node pool philosophy.
  • Because CCC is a policy and a node class buffer (name TBD ;P) is a way to have a shared buffer, they may have different access control. Also there may be many workloads on a single pool, and it may be much more convenient to specify two separate buffers of different chunk shapes rather than a single buffer, which would make more sense from the NodePool API perspective

@towca as FYI

@jbtk jbtk force-pushed the buffersproposal branch from ee9915b to a8fb68a Compare June 9, 2025 11:13
@jbtk jbtk force-pushed the buffersproposal branch 3 times, most recently from b2f35b9 to 0c81ab0 Compare June 13, 2025 11:49
@jbtk jbtk force-pushed the buffersproposal branch 2 times, most recently from ecd9845 to f7a96ac Compare July 28, 2025 15:03
@jbtk
Contributor Author

jbtk commented Jul 28, 2025

Updated the proposal to reflect the changes. The field names slightly differ from what @jonathan-innis proposed - please review and comment directly on the pr.

Note: While discussing with @jonathan-innis he noticed that there might be security considerations when referencing across namespaces. For now I changed the pod template reference to a string (name only) within the same namespace and I am discussing options with @liggitt. Open option - adding a reference grant to allow cross-namespace referencing.
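A minimal sketch of a same-namespace, name-only reference, following the usual Kubernetes LocalObjectReference pattern (the exact shape of the type is an assumption):

type LocalObjectRef struct {
    // Name of the referenced object, resolved in the buffer's own namespace,
    // so no namespace field is needed and the cross-namespace concerns above
    // do not apply.
    Name string
}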

@jbtk jbtk force-pushed the buffersproposal branch 2 times, most recently from 78cb364 to 06ff1ba Compare August 7, 2025 11:00
@jbtk
Contributor Author

jbtk commented Aug 7, 2025

@ellistarn @towca

I changed the names and yaml structure based on the comment from @liggitt (thanks!).

Are there any more things that are blocking this PR from being approved for merge?

Member

@liggitt liggitt left a comment

API types lgtm overall, swept the updated yaml / go types and noted a few spots where optionality wasn't quite right

}

type ScalableRef struct {
// optional
Member

this should be required, even if the value provided is an explicit "" for the core API group
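Applying that comment, the field could look roughly as follows; the APIGroup field name is an assumption based on the "" wording above:

type ScalableRef struct {
    // required
    // APIGroup of the referenced scalable object; an explicit empty string ""
    // selects the core API group.
    APIGroup string

    // ... kind, name, etc. ...
}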

@ellistarn
Contributor

@ellistarn @towca

I changed the names and yaml structure based on the comment from @liggitt (thanks!).

Are there any more things that are blocking this PR from being approved for merge?

Lgtm

@towca
Collaborator

towca commented Aug 11, 2025

The current proposal LGTM. We got an LGTM from Karpenter and Jordan as well. IMO this is good to merge as soon as the remaining nits from me and Jordan are addressed.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2025
type CapacityBufferSpec struct {
// optional
// oneof=PodShapeSource
PodTemplateRef *LocalObjectRef
Contributor

for ScalableRef we have the comment:

// Reference to an object of a kind that has scale subresource and sets its label selector field.

Let's add a similar comment for PodTemplateRef, maybe something like:

// Reference to a pre-defined PodTemplate resource in the same namespace that declares a static buffer
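For example, together with the existing markers the field could then read (illustrative placement only):

type CapacityBufferSpec struct {
    // Reference to a pre-defined PodTemplate resource in the same namespace that declares a static buffer
    // optional
    // oneof=PodShapeSource
    PodTemplateRef *LocalObjectRef
}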

@jackfrancis
Contributor

I added some additional nits, but overall I'm happy with the direction of the API, which represents a consensus between @ellistarn @liggitt @towca (and others). lgtm after incorporation of that last-mile feedback.

@jbtk jbtk force-pushed the buffersproposal branch from 06ff1ba to 0ffe04d Compare August 11, 2025 19:01
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 11, 2025
@towca
Collaborator

towca commented Aug 12, 2025

Holding for the one remaining discussion: #8151 (comment). Feel free to unhold after the discussion is finalized.

/lgtm
/approve
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Aug 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbtk, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 12, 2025
@jbtk
Contributor Author

jbtk commented Aug 12, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 12, 2025
@k8s-ci-robot k8s-ci-robot merged commit 612fdbb into kubernetes:master Aug 12, 2025
7 checks passed
Labels
  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/documentation Categorizes issue or PR as related to documentation.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test.
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.