Create a proposal for Buffers API. #8151
Conversation
Hi @jbtk. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The initial discussion about this proposal was in a doc: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo (there are still unresolved conversations there)
Summary of open topics:
/assign @jackfrancis

@jackfrancis - adding you here because I know this is related to things you've thought about in the past with regard to pre-provisioning capacity.
Update:
The overall idea and direction looks good to me, but IMO we should clarify some fine details in the proposal.
@jackfrancis @gjtempleton Could you give the proposal a high-level review?
@jonathan-innis Could you review the proposal? It's translated from the doc we've already discussed.
If the user uses autoscaling the cluster size will be adjusted to the number of currently running pods. This means that often when a new pod is created its scheduling time will need to include the time of creating a new node and attaching it to the cluster. Many users care deeply about pod scheduling time
There is a nontrivial number of writes (events, etc.) associated with evicting, deleting, creating, and scheduling pods. etcd is pretty sensitive to writes -- I suspect there are pretty huge perf benefits to this if you were to measure it.
* User pods can use the available capacity without any configuration changes (strictly additive, can be managed by the user who owns the pod as well as cluster administrators)
* Open possibility for cloud providers to offer different types of buffers if
I'm struggling with this use case. Abstraction/extensibility often comes at the cost of complexity. I'd be wary of expanding to this scope unless there was a very concrete use case that could not be met in a vendor agnostic way.
As part of the proposal I am adding a buffer type field (string). It will allow offering different types of buffers, both vendor agnostic and vendor specific.
Ideas that are floating around:
- with the resource hotplug proposal we could provide a buffer that a node can be resized into, so that new capacity comes up faster (KEP-3953: Node Resource Hot Plug, enhancements#3955)
- offering a buffer of stopped nodes that can be restarted faster than creating a new node.
I do not expect that this will add any more complexity, as the complexity would live in the controller and autoscaler and can be vendor or autoscaler specific.
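For illustration only, a type discriminator on the buffer spec could look like the sketch below; the field name, JSON tag, and the example type values are assumptions, not the proposed API:

```go
package main

import "fmt"

// Hypothetical sketch of the buffer type field discussed above. A plain
// string lets vendor-agnostic values (e.g. "active") and vendor-specific
// ones (e.g. "stopped-vms") coexist without changing the GVK.
type CapacityBufferSpec struct {
	// Type selects which buffer implementation handles this object.
	Type string `json:"type,omitempty"`
}

func main() {
	spec := CapacityBufferSpec{Type: "active"}
	fmt.Println(spec.Type)
}
```

Flipping the buffer implementation would then be a one-field change, as described in the comparison use case below.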
I do not expect that there will be any more complexity added as the complexity would be in the controller and autoscaler and can be vendor or autoscaler specific.
It's less about complexity on the implementation and more about complexity and cognitive burden of the different "types" of buffers on the user -- if I have multiple buffer types that orchestrate buffers in completely different ways but are using the same GVK, I feel like that's asking for users to get really really confused around what's happening where and why.
At that point, you're better off just creating a separate API yourself. The other option here is that we just allow users to convert the API into pods however they want, and they can just plug on to the buffer using annotations. As we get more concrete use cases for what folks are adding annotations for, we can start to add features.
- Actually I think it is a good thing that there is a single API surface to manage different types of buffers, so the user can test and compare how different buffers behave. Example: I have a workload; I can set up an active buffer or a stopped-VMs buffer. With a single field change it is possible to flip the buffer type and compare performance and cost. Yes, this means that this API needs to be flexible, but since it comes down to pod-like chunks of capacity it should be relatively easy to translate it to any kind of buffer.
- About users getting confused: yes, buffers may be able to provide various buffer types in the future, but there will be a single surface to view all the possible overprovisioning mechanisms.

I think that annotations are a good idea if someone adds a new buffer type and the API is not enough for them. I would want to avoid a situation though where we create something the user did not intend because something failed to parse an annotation (because an older version of the buffer controller did not support it). Therefore I think that adding a field for the type ensures that we can fail fast in case the configuration tries to accomplish something that is not necessarily possible.
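The fail-fast behaviour described here could be sketched as below; the set of supported type values and the function name are hypothetical, chosen only to illustrate rejecting an unknown type instead of silently ignoring it:

```go
package main

import "fmt"

// Hypothetical set of buffer types this controller version understands.
// An older controller would simply not have newer entries here.
var supportedBufferTypes = map[string]bool{
	"active":      true,
	"stopped-vms": true,
}

// validateBufferType fails fast on a type this controller cannot handle,
// rather than quietly provisioning something the user did not intend.
func validateBufferType(t string) error {
	if !supportedBufferTypes[t] {
		return fmt.Errorf("unsupported buffer type %q", t)
	}
	return nil
}

func main() {
	fmt.Println(validateBufferType("active"))
	fmt.Println(validateBufferType("warm-disks"))
}
```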
additional pods that are representing a buffer (for example treating them as virtual pods) so that the capacity in the cluster becomes available.
### How do we determine the pod spec for targets using scale subresource? |
Why not use a pod template spec inline? Imagine the case where you want to leave room for a CI job, but the job doesn't exist in the cluster yet, so you have nothing to reference.
The proposal has both the option to reference an object that implements the scale subresource and the option to reference a pod template. The user will be able to choose one or the other.
Note that referencing a pod template is tricky. Not every controller requires the user to define one (a deployment, for example, just has it embedded - copy-pasting is prone to errors coming from the fact that one copy changes and the other does not) and some pods can change over time (for example if VPA is used and scales the pods).
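The "one or the other" choice could be enforced with a oneof-style validation; the type and field names below mirror the snippets quoted later in this thread but are otherwise an assumption, not the final API:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative reference types: one points at an object implementing the
// scale subresource, the other at a PodTemplate in the same namespace.
type ScalableRef struct{ Name string }
type LocalObjectRef struct{ Name string }

type CapacityBufferSpec struct {
	ScalableRef    *ScalableRef
	PodTemplateRef *LocalObjectRef
}

// validatePodShapeSource enforces that exactly one pod-shape source is set.
func validatePodShapeSource(s CapacityBufferSpec) error {
	if (s.ScalableRef == nil) == (s.PodTemplateRef == nil) {
		return errors.New("exactly one of scalableRef or podTemplateRef must be set")
	}
	return nil
}

func main() {
	// A buffer for a CI job that does not exist in the cluster yet,
	// shaped by a pre-defined pod template (hypothetical name).
	buf := CapacityBufferSpec{PodTemplateRef: &LocalObjectRef{Name: "ci-job-template"}}
	fmt.Println(validatePodShapeSource(buf))
}
```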
require more implementation work within the cluster autoscaler in order to support virtual pods with DRA resources.
* More features that further help users to manage the buffers:
  * Buffer schedule - option to define a buffer that is active only during
If Buffer supported the scale subresource, you could hook it up to things like https://keda.sh/docs/2.11/scalers/cron/ to achieve this use case.
This is a potential feature for the future. On the original doc there is a discussion about this: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?disco=AAABjKBExBs
Once I implement buffers I will look into this more: how hard it would be to have it using just existing tools vs a dedicated field. Thanks for the link to the tool - I will check it out in addition to testing with just k8s CronJobs.
Also, implementing the scale subresource would mean that we could use things like HPA to adjust the size of the buffer, which could be useful as well.
// Depending on the type of referenced object one of the configurations below will be used:
// - for a crd implementing scale subresource (like deployment, job, replicaset)
//   the ReplicaBasedBufferSpec will be used
// - for GKE CCC or Karpenter NodePool the NodeClassBufferSpec will be used
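The dispatch that the quoted comment describes could be sketched roughly as below; the function and the exact kind strings are hypothetical, only the spec names come from the quote:

```go
package main

import "fmt"

// bufferSpecFor sketches choosing a buffer sub-spec from the referenced
// object's kind: scale-subresource kinds get the replica-based spec,
// node-class kinds (e.g. a Karpenter NodePool or GKE CCC) the node-class one.
func bufferSpecFor(kind string) string {
	switch kind {
	case "Deployment", "Job", "ReplicaSet":
		return "ReplicaBasedBufferSpec"
	case "NodePool", "ComputeClass":
		return "NodeClassBufferSpec"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(bufferSpecFor("Deployment"))
	fmt.Println(bufferSpecFor("NodePool"))
}
```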
For node-centric headroom, I suspect that this will be implemented in the NodePool API in Karpenter, and not part of this buffer API.
I am not sure what the plans of the Karpenter team are. We talked with them about this proposal; they commented on the original doc (https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?pli=1&disco=AAABjKBEw_Q).
There is also a separate section on why we do not plan to expose an API of "I just want to have 2 extra nodes"; most likely that would make sense in the context of the NodePool API.
Just to clarify:
- buffers focus on the capacity in the cluster; we want to offer users a single API surface to define spare capacity - whether it is based on a single workload or shared by workloads that run on a single CCC/Karpenter pool
- in GKE, CCC focuses on setting up a policy/configuration. This is why we wanted to have a separate, provider-agnostic way of expressing extra capacity that should be provisioned. I am not sure about the Karpenter node pool philosophy.
- because CCC is a policy and a node class buffer (tbd the name ;P) is a way to have a shared buffer, they may have different access control. Also there may be many workloads on a single pool, and it may be much more convenient to specify two separate buffers of different chunk shapes rather than a single buffer, which would make more sense from the NodePool API perspective
@towca as FYI
Updated the proposal to reflect the changes. The field names slightly differ from what @jonathan-innis proposed - please review and comment directly on the PR. Note: while discussing with @jonathan-innis he noticed that there might be security considerations for referencing across namespaces. For now I changed the pod template reference to be a string (name only) within the same namespace, and I am discussing options with @liggitt. Open option: adding a ReferenceGrant to allow cross-namespace referencing.
I changed the names and yaml structure based on the comments from @liggitt (thanks!). Are there any more things blocking this PR from being approved for merge?
API types lgtm overall, swept the updated yaml / go types and noted a few spots where optionality wasn't quite right
}

type ScalableRef struct {
	// optional
this should be required, even if the value provided is an explicit "" for the core API group
Lgtm

The current proposal LGTM. We got an LGTM from Karpenter and Jordan as well. IMO this is good to merge as soon as the remaining nits from me and Jordan are addressed.

/lgtm
type CapacityBufferSpec struct {
	// optional
	// oneof=PodShapeSource
	PodTemplateRef *LocalObjectRef
For ScalableRef we have the comment:
// Reference to an object of a kind that has scale subresource and sets its label selector field.
Let's add a similar comment for PodTemplateRef, maybe something like:
// Reference to a pre-defined PodTemplate resource in the same namespace that declares a static buffer

I added some additional nits, but overall I'm happy with the direction of the API, which represents a consensus between @ellistarn @liggitt @towca (and others). lgtm after incorporation of that last-mile feedback
Holding for the one remaining discussion: #8151 (comment). Feel free to unhold after the discussion is finalized.

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbtk, towca

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/unhold
What type of PR is this?
/kind documentation
What this PR does / why we need it:
Adds proposal for Buffers API
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: