Skip to content

Target NodeSwap GA in v1.34 #5387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 19, 2025

Conversation

iholder101
Copy link
Contributor

  • One-line PR description: Target NodeSwap GA in v1.34.
  • Other comments:

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 8, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 8, 2025
@iholder101 iholder101 mentioned this pull request Jun 8, 2025
69 tasks
@SergeyKanzhelev
Copy link
Member

/hold

I believe the ability to disable swap per-Pod is a GA blocker with this KEP.

As discussed in a few meetings, one of the requested feature for the node-level swap enablement is an ability to disable swap per-Pod. We already offer a few cases when swap is not enabled in this KEP. However, there are still cases when customer may want to opt-out from swap. Moreover, if this is a security requirement, the ability to disable swap on Pod must be done per-Pod using mechanisms like admission web hooks.

Some alternatives were discussed:

  1. Make customer configure labels and taints to prevent scheduling on nodes with swap enabled. This is a very weak way to opt-out Pods from being scheduled. First, cluster admin may enable swap and forget to taint/label the node. Second, customer will need to re-design how Pods are bin-packed to nodes if disabling of swap will require a separate node.
  2. Make swap disablement be done via direct cgroup manipulation or NRI plugins and annotations. This option exposes nodes to root-level access daemonsets of init containers. Also this will not work for managed k8s environment, where nodes are not fully-owned by end users.

There might be other alternatives, but in my opinion, they ultimately not worth the effort, keeping in ming these great benefits of enable per-Pod swap. So the proposal is to assume that pod without any value configured is non-swappable. And allow to either enable LimitedSwap (part of this KEP) or Explicit swap limit (new KEP). The benefits are:

  1. This approach will be the most secure as cluster admin cannot "surprise" application owners by enabling swap on nodes. (Mass enablement of swap on Pods is still possible by implementing a small webhook).
  2. All nodes can enable swap with very small side effect to anything else. And then any Pod can decide to "allow swap" and will get swap whenever it is landed. No need to coordinate nodes with swap enabled and Pods that want swap.
  3. We can remove the assumption that guaranteed pod doesn't want swap. This assumption was made as a precaution with limited customer validation. Keeping the restriction forever for guaranteed pods is another customer confusion and will be eliminated by explicit enablement of swap.

As for the timeline, I would argue that introducing this field may keep the pod in beta for another release and then GA after validation. @thockin to keep me honest on the requirements here.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 10, 2025
@thockin
Copy link
Member

thockin commented Jun 10, 2025

keep me honest on the requirements here.

I'm having difficulty putting the threads together into a single picture. Can you clarify the question?

the proposal is to assume that pod without any value configured is non-swappable.

This is certainly the SAFE answer

@SergeyKanzhelev
Copy link
Member

keep me honest on the requirements here.

I'm having difficulty putting the threads together into a single picture. Can you clarify the question?

So the #2400 was in beta for a while. And the graduation criteria we set is to have an API for future per-Pod swap approved. So we know that we have a way forward. I was advocating for including per-Pod enablemend/disablement in #2400 and not wait for the full support of per-Pod swap enablement, limits, etc. And I am thinking that adding a way to enable/disable swap per-Pod is not resetting #2400 to alpha. And it can progress to GA soon as long as we can demonstrate it was used. Just wanted to confirm that this API addition is not resetting the KEP to alpha.

@thockin
Copy link
Member

thockin commented Jun 10, 2025

#5359 (per-pod swap control in API) does not reset #2400, but I would question if you want to GA that without the per-pod control?

@SergeyKanzhelev
Copy link
Member

but I would question if you want to GA that without the per-pod control?

we are on the same page. I think disabling/enabling per-Pod will be an adoption blocker for this feature for many users.

@iholder101
Copy link
Contributor Author

iholder101 commented Jun 12, 2025

As discussed in a few meetings, one of the requested feature for the node-level swap enablement is an ability to disable swap per-Pod.

Further to our conversation offline, there had seemed to be a miscommunication in sig-node meetings. I'm sorry about that.

However, I just want to re-iterate this KEP's scope, from the merged KEP design:

This KEP aims to introduce basic swap enablement and leave further extensions to follow-up KEPs. This way Kubernetes users / vendors would be able to use swap in a basic manner quickly while extensions would be brought to discussion in dedicated KEPs that would progress in the meantime.

For example, to achieve this goal, this KEP does not introduce any APIs that allow customizing how the feature behaves, but instead only determines whether the feature is enabled or disabled... Within the scope of this KEP we will support only two basic behaviors... Both do not provide any customizability... behaviour is automatic and implicit that requires minimum user intervention. As mentioned above, in the very near future, follow-up KEPs would bring API extension and customizability.

While this KEP sets the ground for extending the API in follow-ups through "swap behaviors", changing APIs, especially at the pod-level, is highly complex and controversial. Therefore, it is out of scope for this KEP.

As can be seen, we've agreed and merged to the design to focus on basic swap enablement with no pod-level API changes. This is mentioned specifically on the KEP and captures most of the Summary section.

Since (after the KEP design was merged already..) folks were insisting on providing clarity for the pod-level APIs, I've opened #5359, although it's clearly not part of the KEP's scope AFAICT.

@iholder101
Copy link
Contributor Author

cc @dchen1107 @haircommander

@iholder101 iholder101 marked this pull request as ready for review June 12, 2025 11:18
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 12, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp June 12, 2025 11:18
@SergeyKanzhelev
Copy link
Member

From slack from @iholder101:

I am arguing the default to be no-swap (if not specified).

I strongly disagree with this approach.
For the vast majority of pod/app owners, swap configuration is a node-level concern that should be left for the admin or some higher level persona to deal with. In other words, most pod owners just don't care. On the other hand, having to explicitly turn on swap on a per-pod basis will make this feature almost unusable.

Can you please clarify why it will be unusable for people who wants swap? Is it because they do not own the app? Or it is too much of a burden to install the webhook?

I feel like there may be some confirmation bias when asking somebody who wants swap. People who want swap would wish for minimal friction enablement. When people who doesn't want swap will now need to deal with labels, taints, etc. I think the opposite should be better when people who doesn't want swap should spend less effort than people who wants it

@ajaysundark
Copy link

IMHO, for swap as a whole feature, ideal default would be 'disabled' by default. Most production use-cases prefer predictability, and when pods don't explicitly declare swap preferences, shouldn't cause runtime uncertainty. but unfortunately it would break the (implicit) behavior for existing users.

The trade-off question is API consistency vs not disrupting existing users.

@stmcginnis stmcginnis mentioned this pull request Jun 14, 2025
4 tasks
@iholder101
Copy link
Contributor Author

iholder101 commented Jun 15, 2025

I feel like there may be some confirmation bias when asking somebody who wants swap.

I want to stress again that, AFAICT, the vast majority of app owners do not want to use swap, nor are they trying to avoid using it. Usually, they just don't care, and treat it as a node-level admin configuration that should be dealt with by someone else.

Just like with regular Linux processes, in rare cases, app owners know they want to opt-out, and processes can ask the kernel to opt-out of swap in such cases, but a process never opts-in or asks to use swap.

I also remind that the whole swap feature is opt-in, on a node-level basis (just like with any Linux host).

When people who doesn't want swap will now need to deal with labels, taints, etc. I think the opposite should be better when people who doesn't want swap should spend less effort than people who wants it

Are we talking short or long term? which is more important?

As I've emphasized here, the whole idea of this KEP was to enable basic swap while opening the door for further extensions, that would be made incrementally. The idea was to deliver minimal functionality now, gather real-world feedback, and then gradually evolve the feature, adding APIs, eviction behavior, scheduling mechanisms, and more.

However, the current suggestion seems to set aside that vision in favor of short-term concerns. Since swap-aware scheduling hasn't been designed yet, supporting it still requires manual intervention from administrators. Rather than investing time in a more thoughtful and extensible design, we risk rushing into a quick fix that may constrain us down the line.

In other words, the proposed API offers modest short-term gains at the expense of long-term flexibility. A better approach, in my view, would be to state clearly: "This feature is production-usable, but incomplete. Today, certain aspects - like scheduling properties - must be managed manually. Only enable it if you're comfortable taking on that responsibility". That framing leaves the door open for robust, future-proof solutions, rather than locking us into a slightly safer setup now that could hinder extensibility later.

Looking ahead, even once we support proper scheduling, eviction policies, and swap-limit controls, we'll remain bound to an API that requires admins to configure --fail-swap-on=false, set swapBehavior, and implement a webhook or MPA just to override the default swap behavior. Beyond being cumbersome, this locks us into a model where every pod owner can opt out of swap - ultimately limiting our design space for future improvements.


To summarize, I think it would be best to stick to the KEP's original plan, GA 2400 and discuss KEP-5359 in parallel.

Saying that, TBH, after so many releases I'm investing so much in this, and after so many times things were agreed, merged to the KEP, then were completely revisited - I'm worn down.

If people think that an opt-in approach is the right way to go - let's do it - as long as we GA swap in the shorter term possible.

@dchen1107 @haircommander @thockin @tallclair I'd appreciate your thoughts on this.

Signed-off-by: Itamar Holder <iholder@redhat.com>
@deads2k
Copy link
Contributor

deads2k commented Jun 18, 2025

I'm going to stay out of the sig conversation of "should this be replaced by something else".

The PRR itself was updated during the various betas and looks to still be in good shape. This approval is only for the PRR.

/approve

Copy link
Contributor

@mrunalp mrunalp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
based on #5360 (comment)

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, iholder101, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2025
@mrunalp
Copy link
Contributor

mrunalp commented Jun 19, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@mrunalp
Copy link
Contributor

mrunalp commented Jun 19, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 042f517 into kubernetes:master Jun 19, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jun 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants