
fix: update deletionTimestamp on terminating pods when it is after nodeDeletionTimestamp #2316

Merged

Conversation

cosimomeli
Contributor

Description
When a node receives the unreachable taint, the Kubernetes taint controller triggers the deletion of all pods after 5 minutes. When the Node Repair threshold is reached, Karpenter's drain procedure waits until every pod is either evicted or stuck in termination (i.e., past its deletionTimestamp), but if a pod has a long termination grace period (RabbitMQ operator pods have 7 days, for example) the node will wait too long before being deleted.

To improve the forced termination, I changed the drain logic so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.
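A minimal sketch in Go of the alignment step, assuming hypothetical names (reDeleteWithAlignedGrace, nodeTerminationTime) and a controller-runtime client; the actual change lives in Karpenter's terminator:

```go
package terminator

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reDeleteWithAlignedGrace deletes an already-terminating pod a second time
// with a grace period trimmed so that its deletionTimestamp lands on the
// node's termination deadline. Sketch only; names are assumptions, not the
// upstream implementation.
func reDeleteWithAlignedGrace(ctx context.Context, c client.Client, pod *corev1.Pod, nodeTerminationTime time.Time) error {
	grace := int64(time.Until(nodeTerminationTime).Seconds())
	if grace < 0 {
		grace = 0 // deadline already passed: delete immediately
	}
	// Re-deleting a terminating pod with a shorter grace period moves its
	// deletionTimestamp earlier; the API server only ever shortens it.
	return c.Delete(ctx, pod, client.GracePeriodSeconds(grace))
}
```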

How was this change tested?
I added a unit test for this and also tested the change with both an Unhealthy Node on AWS (dead kubelet) and a simple node deletion.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


linux-foundation-easycla bot commented Jun 17, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 17, 2025
@k8s-ci-robot
Contributor

Welcome @cosimomeli!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 17, 2025
@k8s-ci-robot
Contributor

Hi @cosimomeli. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 17, 2025
@coveralls

coveralls commented Jun 17, 2025

Pull Request Test Coverage Report for Build 16201702039

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 6 of 6 (100.0%) changed or added relevant lines in 2 files are covered.
  • 171 unchanged lines in 13 files lost coverage.
  • Overall coverage decreased (-0.1%) to 81.916%

Files with Coverage Reduction                           New Missed Lines   %
pkg/controllers/disruption/consolidation.go             3                  88.14%
pkg/controllers/disruption/drift.go                     4                  87.76%
pkg/controllers/disruption/singlenodeconsolidation.go   4                  93.62%
pkg/controllers/disruption/emptiness.go                 5                  87.3%
pkg/controllers/state/statenode.go                      5                  87.05%
pkg/controllers/controllers.go                          9                  0.0%
pkg/controllers/disruption/multinodeconsolidation.go    10                 86.76%
pkg/test/ratelimitinginterface.go                       10                 0.0%
pkg/controllers/disruption/helpers.go                   11                 87.43%
pkg/controllers/disruption/validation.go                15                 81.92%
Totals
Change from base Build 15692799185: -0.1%
Covered Lines: 10219
Relevant Lines: 12475

💛 - Coveralls

@jonathan-innis
Member

/assign @engedaam

Amanuel implemented Node Autorepair, so I'm assigning him since he's the relevant owner.

@chicco785

Hey @engedaam, any estimated time for the review? Thanks!

@engedaam
Contributor

engedaam commented Jul 9, 2025

When the Node Repair threshold is reached, Karpenter's drain procedure waits until every pod is either evicted or stuck in termination (i.e., past its deletionTimestamp), but if a pod has a long termination grace period (RabbitMQ operator pods have 7 days, for example) the node will wait too long before being deleted.

Currently, Karpenter does not immediately drain pods when initiating a Node Repair action. Instead, it relies on a tolerationDuration configured by the cloud provider. For example, in the AWS Provider, unreachable nodes are given a 30-minute toleration duration before Karpenter begins the process of deleting the node. During this termination period, Karpenter waits for pods to be terminated, which is handled by the drain logic implemented in the terminator.go file (specifically at this line: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/terminator/terminator.go#L140). The behavior you're describing in this PR aligns with our current expectations. To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?
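For context, a sketch of the shape such a cloud-provider policy takes; the struct below is modeled on Karpenter's repair-policy concept, with field names that should be treated as illustrative:

```go
package example

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// RepairPolicy pairs a node condition with how long Karpenter tolerates it
// before force-terminating the node. Illustrative shape, not the exact type.
type RepairPolicy struct {
	ConditionType      corev1.NodeConditionType
	ConditionStatus    corev1.ConditionStatus
	TolerationDuration time.Duration
}

// An unreachable node reports Ready=Unknown; tolerating it for 30 minutes
// matches the AWS provider behavior described above.
var unreachable = RepairPolicy{
	ConditionType:      corev1.NodeReady,
	ConditionStatus:    corev1.ConditionUnknown,
	TolerationDuration: 30 * time.Minute,
}
```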

To improve the forced termination, I changed the drain logic so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.

Can you help me understand why this would help here? We only really look at the deletionTimestamp when filtering pods, not for deciding when to force delete.

@chicco785

Currently, Karpenter does not immediately drain pods when initiating a Node Repair action. Instead, it relies on a tolerationDuration configured by the cloud provider. For example, in the AWS Provider, unreachable nodes are given a 30-minute toleration duration before Karpenter begins the process of deleting the node. During this termination period, Karpenter waits for pods to be terminated, which is handled by the drain logic implemented in the terminator.go file (specifically at this line: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/node/termination/terminator/terminator.go#L140). The behavior you're describing in this PR aligns with our current expectations. To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?

As far as I understood, if a pod has a long termination grace period, the node will not be removed at the end of the node toleration duration; instead it will wait out the pod's termination grace period. For example, RabbitMQ operator pods have a 7-day termination period, so the node won't be terminated for 7 days.

@cosimomeli can explain better.

@cosimomeli
Contributor Author

To better understand any potential issues, could you provide a specific example where you've observed Karpenter taking longer than the configured toleration duration to terminate an unhealthy node?

Hello @engedaam, thanks for the answer.
When a node becomes unreachable, Karpenter triggers Node Repair after 30 minutes (on AWS). Meanwhile, the Kubernetes taint controller starts evicting pods after 5 minutes when a node receives the node.kubernetes.io/unreachable taint.

Karpenter's terminator logic immediately drains every pod on the node, since the node.health controller sets the node termination timestamp to the current timestamp. This is effectively a forced shutdown, and the termination goes as expected, but there is one exception:
The podsToDelete here filters out every pod that is already terminating and has not yet passed its graceful termination period. This means that if I have a pod with a very long termination period (my example was RabbitMQ, with 7 days), the termination controller will not touch it, and Karpenter will wait for its natural termination before deleting the node.
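A sketch of the filter being described, with hypothetical names (podsToDelete, nodeTerminationTime); the real code is at the terminator.go line linked above:

```go
package terminator

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// podsToDelete keeps the pods that still need a delete call during a drain.
// Sketch of the logic under discussion, not the exact upstream code.
func podsToDelete(pods []*corev1.Pod, nodeTerminationTime *time.Time) []*corev1.Pod {
	var out []*corev1.Pod
	for _, p := range pods {
		if p.DeletionTimestamp == nil {
			out = append(out, p) // not yet terminating: normal drain path
			continue
		}
		// Pre-fix, every already-terminating pod was skipped, no matter how
		// far away its deletionTimestamp was. The fix re-includes pods whose
		// deletionTimestamp would outlive the node termination deadline.
		if nodeTerminationTime != nil && p.DeletionTimestamp.Time.After(*nodeTerminationTime) {
			out = append(out, p)
		}
	}
	return out
}
```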

To improve the forced termination, I changed the drain logic so that terminating pods whose deletionTimestamp falls after the nodeTerminationTimestamp are deleted again, aligning their deletionTimestamp with the nodeTerminationTimestamp.

Can you help me understand why this would help here? We only really look at the deletionTimestamp when filtering pods, not for deciding when to force delete.

My change has the effect of adding to podsToDelete the pods whose graceful shutdown period ends after the node termination deadline; this way they are deleted again with a shorter grace period that is compatible with the node termination deadline.

@cosimomeli
Contributor Author

Considering that after 30 minutes all the pods (excluding the ones with an explicit toleration for node.kubernetes.io/unreachable) have already been deleted by the taint controller, as documented here, it's not uncommon for a terminating pod to need a second delete to change its default deletion time and speed up the draining process.
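For reference, the exemption mentioned above is an explicit toleration like the following (expressed as a Go struct; leaving TolerationSeconds unset means the pod tolerates the taint indefinitely):

```go
package example

import corev1 "k8s.io/api/core/v1"

// A pod carrying this toleration is never evicted by the taint controller
// when its node becomes unreachable; pods without it are evicted after the
// default 300-second toleration window.
var tolerateUnreachable = corev1.Toleration{
	Key:      "node.kubernetes.io/unreachable",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoExecute,
}
```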

@engedaam
Contributor

Considering that after 30 minutes all the pods (excluding the ones with an explicit toleration for node.kubernetes.io/unreachable) have already been deleted by the taint controller, as documented here, it's not uncommon for a terminating pod to need a second delete to change its default deletion time and speed up the draining process.

This clears things up and thanks for the thorough explanation!

@engedaam
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2025
Contributor

@engedaam engedaam left a comment


Just one small nit

Co-authored-by: Amanuel Engeda <74629455+engedaam@users.noreply.github.com>
@engedaam
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 10, 2025
@jmdeal
Member

jmdeal commented Jul 11, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cosimomeli, jmdeal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2025
@k8s-ci-robot k8s-ci-robot merged commit 58bf160 into kubernetes-sigs:main Jul 11, 2025
16 checks passed
rlanhellas pushed a commit to rlanhellas/karpenter that referenced this pull request Jul 12, 2025
…tionTimestamp (kubernetes-sigs#2316)

Co-authored-by: Amanuel Engeda <74629455+engedaam@users.noreply.github.com>
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
7 participants