
docs: Node Overlay RFC #2166


Merged
merged 2 commits into kubernetes-sigs:main from node-overlay-rfc on Jul 17, 2025

Conversation

engedaam
Contributor

@engedaam engedaam commented Apr 25, 2025

Fixes

Description
This RFC is an update to the original Node Overlay RFC, opened about six months ago, and goes over what is planned for the feature:

kind: NodeOverlay
metadata:
  name: default
spec:
  priceDiscount: "90%"
  capacity:
    smarter-devices/fuse: 1
---
kind: NodeOverlay
metadata:
  name: memory
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values: ["2048"]
  priceAdjustment: "+0.109"
  price: "1.422"
  capacity:
    smarter-devices/fuse: 23

How was this change tested?

  • N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 25, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 25, 2025
@engedaam engedaam mentioned this pull request Apr 25, 2025
@engedaam engedaam changed the title RFC: Node Overlay docs: Node Overlay RFC Apr 25, 2025
@coveralls

coveralls commented Apr 25, 2025

Pull Request Test Coverage Report for Build 16333981357

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 82.125%

Files with Coverage Reduction                 New Missed Lines        %
pkg/controllers/disruption/consolidation.go          4             88.14%

Totals
Change from base Build 16332590768: -0.03%
Covered Lines: 10342
Relevant Lines: 12593

💛 - Coveralls

@engedaam engedaam force-pushed the node-overlay-rfc branch 3 times, most recently from ce506d2 to c155548 on April 25, 2025 at 22:22
@GnatorX

GnatorX commented May 5, 2025

Thanks for tackling this problem with a new revision!

@engedaam engedaam force-pushed the node-overlay-rfc branch 4 times, most recently from fa43c6a to e2e038b on June 4, 2025 at 15:46
@engedaam engedaam force-pushed the node-overlay-rfc branch 4 times, most recently from 72592c5 to 2241bf1 on June 10, 2025 at 18:44
@Beardface123

Hello all, I wanted to provide a short consumer perspective on this proposed feature.

Our organization has several needs that would benefit from the proposed NodeOverlay feature.

  1. We're exploring nodepools to use ARM, with at least one nodepool requiring the fuse-plugin. (We're currently using a very undesirable workaround for this and are eagerly awaiting a solution for this in Karpenter).
  2. Cost is extremely important, so any ability to overlay savings plans on top of particular instances for better node allocations is great (see the sketch after this list).
  3. We're daemonset heavy, so including licensing costs of these daemonsets where applicable would be very helpful.
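
For point 2, a rough sketch of how a savings-plan discount might be expressed with the proposed API, reusing the priceDiscount field from the example above; the instance types and discount value are made up purely for illustration:

kind: NodeOverlay
metadata:
  name: savings-plan-discount
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.2xlarge", "m5.4xlarge"] # instances covered by a hypothetical savings plan
  priceDiscount: "30%" # illustrative discount, not a real rate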

We're still exploring where new nodepools would be necessary, but if these features were to be implemented directly into the NodePool API, I fear it would near exponentially increase the number of node pools we'd need to maintain as Karpenter admins. I would welcome the ease of an overlay, especially one that would target specific instances through a familiar spec. I would accept the increased blast radius and put it on our teams to ensure proper testing is done in case bad changes are made to an overlay. Just the opinion of one Karpenter admin, of course. 👍

In short, it's fantastic that Karpenter can smartly choose the right instance for the demand, but we're starting to see some shortcomings when it comes to more complex logic. An optional API to implement complex logic on a subset of nodes in a given nodepool seems like a good potential path forward.

For transparency, we are also looking for the ability to dynamically adjust EBS throughput and IOPS depending on instance size, but I don't think NodeOverlay would apply here since this is part of the EC2NodeClass.
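
For reference, that tuning currently lives as static values on the EC2NodeClass, along these lines (the values below are illustrative):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      iops: 4000      # fixed per NodeClass, not per instance size
      throughput: 250 # MiB/s, also fixed per NodeClass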

@ellistarn
Contributor

but if these features were to be implemented directly into the NodePool API I fear it would near exponentially increase the number of node pools we'd need to maintain as Karpenter admins.

Great feedback @Beardface123! Can you expand a little bit on why you'd imagine this would increase the number of NodePools? I initially thought this too, but realized there's an alternative path where .spec.overlays (plural) would let you do complex overrides within a single nodepool. The downside is that if you wanted this config everywhere, you'd need to duplicate it across nodepools.
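
For concreteness, a hypothetical sketch of that alternative; the .spec.overlays field name and shape here are speculative and not part of the RFC:

kind: NodePool
metadata:
  name: general
spec:
  overlays: # hypothetical inline field, not in the RFC
  - requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["arm64"]
    capacity:
      smarter-devices/fuse: 1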

Your comment here seems to suggest that you would scope these overrides to a nodepool anyways:

An optional API to implement complex logic on a subset of nodes in a given nodepool


It's a bit nuanced, so I'm very curious about your thinking on this.

@Beardface123

Can you expand a little bit on why you'd imagine this would increase the number of NodePools?

What I wanted to avoid is an implementation that would require the number of NodePools to grow exponentially.

Given an example of a general node pool:

  • requires amd64 and arm64 architectures, and the arm64 nodes need a fuse-plugin custom resource
  • a wide range of sizes from 2xlarge to 16xlarge. The idea of adding estimated daemonset overhead to these nodes is appealing, but sometimes licensing varies by total CPU for the node. The flat cost adjustment may vary for every instance size.

Given the scenario, we might need to take a single NodePool, split it in two to support the fuse-plugin, and then multiply by five for each available instance size. This is what I had in mind in my first post. I think I would accept several implementations as long as they keep the number of NodePools we manage under control. Also, upon re-reading the thread from a month ago, this might not be what was intended at all. 🙃

Your comment here seems to suggest that you would scope these overrides to a nodepool anyways

I didn't express myself correctly, sorry about that. To correct myself I should've said "An optional API to implement complex logic on a subset of nodes across specified NodePools"

If I were to try to represent my opinions concisely in a few points now that I've re-read the threads and given it another round of thought:

  1. I think the NodeOverlay API object is a good idea, because it will create reusable overlays that could be applied to multiple NodePools.
  2. In my opinion, NodeOverlay API objects should either have an opt-out concept in the NodePool or an opt-in, which would somewhat blend the concept with your proposed solution of .spec.overlays within the NodePool. I prefer the latter.
  3. It's up to people smarter than me, but if a NodeOverlay could be referenced in a NodePool, it could be an ordered list (see the sketch after this list). This might remove the need for a NodeOverlay weight, since different nodepools might have different priorities on overlays.
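
A purely hypothetical sketch of the opt-in reference idea from point 3; the overlayRefs field does not exist in the proposal:

kind: NodePool
metadata:
  name: general
spec:
  overlayRefs:                # hypothetical field; evaluated in listed order
  - name: fuse-plugin         # hypothetical NodeOverlay names
  - name: daemonset-licensing # ordering here would replace a NodeOverlay weight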

Ultimately, the overlay concept in general is very appealing in either implementation, but I think the preference is to use a separate API object to define overlays and have it propagate to all nodepools that reference it. The alternative is a more cumbersome experience if there are many NodePools with diverse overlays. If set manually, I'd worry about user error; if we template overlays with Helm or other tools, that's time my team needs to spend figuring it out.

Again, sorry for the misunderstanding up front.

@gillg

gillg commented Jul 3, 2025

Out of curiosity, where do the Karpenter logic and the NodeOverlay fit into this schema?
Will it work properly with a special scheduler extension + device plugin?

                                                                        +----------------------------+
                                                                        | POD Manifest               |
                                                                        | with Request               |
                                                                        | aws.amazon.com/neuroncore:2|
                                                                        |                            |
                                                                        |                            |
                                                    2                   +-------------+--------------+
                                         +--------------------------------+           |
                                         |                                |           |
                                         |                                |           | 3
          +------------------------------+-----+                          |           |
          |           Kubelet in INF1/TRN1 Node|                          |           |
          |                                    +<-----------+             |           |
          +-----+---------------------+--------+            |       +-----v-----------v--------------+
                |                     ^                     |       |          Kube-Scheduler        |
                |                     |                     |       |                                |
                |                     |                     |       +--^------+---------------+------+
              9 |                  1  |                     |          |      |               |
                |                     |                    8|         5|      |4              |
                |                     |                     |          |      |               |
                |                     |                     |          |      |               |6
                v                     |                     |          |      |               |
          +-----+---------------------+--------+            |       +--+------v---------------v------+
          |    neuron-device-plugin            |            +-------+       neuron|scheduler|ext     |
          |    in INF1/TRN1 node               |                    +---------------------+----------+
          +----+----------------------+--------+                                          |
               |                      |                                                   |7
               |                      |10                                                 |
               |                      |                                                   v
             11|                      |                                         +---------+-------+
               |                      |                                         |POD Manifest:    |
               |                      |                                         |Annotation:      |
               |                      |                                         |NEURON_CORES:2,3 |
               v                      +---------------------------------------->+                 |
--device=/dev/neuron1 --env NEURON_RT_VISIBLE_CORES=2,3                         |                 |
                                                                                |                 |
                                                                                +-----------------+

1. neuron-device-plugin returns the list of Neuron cores/devices to kubelet
2. Kubelet advertises the Core/Device list to K8s API server (in turn to kube-scheduler)
3. POD Request for neuron cores/devices [Kube-Scheduler picks up the POD creation request]
4. kube-scheduler calls the neuron-scheduler-extn filter function with list of nodes and POD Specification
5. neuron-scheduler-extn scans through the nodes, filters out nodes with non-contiguous cores/devices, and returns the nodes that are capable of supporting the given POD specification
6. kube-scheduler calls the neuron-scheduler-extn bind function with pod and node
7. neuron-scheduler-extn updates the POD annotation with allocated neuron core/device Ids (contiguous)
8. neuron-scheduler-extn sends the bind request to kubelet of the selected node
9. Kubelet calls the Alloc function of the neuron-device-plugin
10. neuron-device-plugin queries the POD Annotation for allocated core/device Ids
11. neuron-device-plugin exports the devices & visible cores to the container runtime

Source: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler-flow.html#k8s-neuron-scheduler-flow
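
For reference, the pod request at the top of the diagram corresponds to a spec like this (the image name is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: inference-app
spec:
  containers:
  - name: app
    image: my-neuron-image:latest # placeholder image
    resources:
      limits:
        aws.amazon.com/neuroncore: 2 # the extended resource advertised by the device plugin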

@engedaam
Contributor Author

@gillg The only difference NodeOverlay makes is that Karpenter will now be aware of resources that can be present on the nodes once an overlay is defined. Karpenter will not be aware of any custom scheduler.
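
For example, a sketch along the lines of the RFC's capacity field; the instance types and core count here are illustrative:

kind: NodeOverlay
metadata:
  name: neuron-cores
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["inf1.xlarge", "trn1.2xlarge"] # illustrative instance types
  capacity:
    aws.amazon.com/neuroncore: 4 # illustrative count; lets Karpenter account for the resource when provisioning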

@engedaam engedaam force-pushed the node-overlay-rfc branch 3 times, most recently from 61dd5ea to 6303cde on July 16, 2025 at 23:54
@engedaam engedaam force-pushed the node-overlay-rfc branch 2 times, most recently from 50fb097 to 9a9454f on July 17, 2025 at 00:35
Member

@jonathan-innis jonathan-innis left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: engedaam, jonathan-innis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2025
@engedaam
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2025
@jmdeal
Member

jmdeal commented Jul 17, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2025
@engedaam
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 17, 2025
@k8s-ci-robot k8s-ci-robot merged commit b5b9a45 into kubernetes-sigs:main Jul 17, 2025
14 checks passed
@engedaam engedaam deleted the node-overlay-rfc branch July 17, 2025 22:27