
Fix: cooldown reset on pod restart #8057


Open

MenD32 wants to merge 12 commits into master from fix/cooldown-reset-on-pod-redeploy

Conversation

@MenD32 MenD32 (Contributor) commented Apr 24, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes a bug where, upon deployment restart, cluster-autoscaler resets the node scale-down timer, which can cause nodes to never scale down in extreme cases.

Which issue(s) this PR fixes:

Fixes #7873

Special notes for your reviewer:

This issue has already affected me, so I'm highly motivated to fix it ASAP and will be available over the next few days to resolve it quickly. If you have a better way of fixing this HA issue, please suggest one 😄

Does this PR introduce a user-facing change?

fixed: Cluster Autoscaler now persists scale-down state across pod restarts with the new `--max-deletion-candidate-ttl` flag. 
Previously, when a CA pod restarted, all node scale-down timers were reset, which could prevent nodes from scaling down in certain scenarios (such as after OOMKills). 
Now, deletion candidate taints with timestamps are honored across restarts if they're within the configurable TTL period. 
The flag defaults to 0 (maintaining previous behavior), but can be set to a duration (e.g., "15m") to retain scale-down state information for that period after pod restarts.
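
For context, a minimal sketch of a duration flag that defaults to 0 and accepts values such as "15m", as the release note describes. This assumes the flag is a standard Go flag.Duration wired like CA's other flags; the help text and variable name here are illustrative, not copied from the PR:

package main

import (
	"flag"
	"fmt"
)

// maxDeletionCandidateTTL mirrors the flag described in the release note above;
// the wording is illustrative, not the PR's actual declaration.
var maxDeletionCandidateTTL = flag.Duration("max-deletion-candidate-ttl", 0,
	"How long deletion-candidate state written before a restart is still trusted. 0 keeps the previous behavior.")

func main() {
	flag.Parse() // e.g. --max-deletion-candidate-ttl=15m
	if *maxDeletionCandidateTTL > 0 {
		fmt.Printf("trusting deletion candidate taints younger than %s across restarts\n", *maxDeletionCandidateTTL)
	} else {
		fmt.Println("deletion candidate taints are cleared on startup (previous behavior)")
	}
}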

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 24, 2025
@k8s-ci-robot k8s-ci-robot requested a review from vadasambar April 24, 2025 02:05
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 24, 2025
@k8s-ci-robot (Contributor)

Hi @MenD32. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 24, 2025
@MenD32 MenD32 changed the title Fix: cooldown reset on pod redeploy Fix: cooldown reset on pod restart Apr 24, 2025
@@ -17,6 +17,7 @@ limitations under the License.
package planner

import (
ctx "context"
Contributor

let's do the "normal" go thing and not use an import alias for the context package

I have a PR here to address the current patterns in code (which you may have copied): #7664

Contributor Author

cool, didn't know that

@@ -35,6 +35,11 @@ import (
klog "k8s.io/klog/v2"
)

const (
// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/since"
Contributor

preference for NodeUnneededSinceAnnotation = "cluster-autoscaler.kubernetes.io/unneeded-since"

Contributor Author @MenD32, Apr 24, 2025

Sounds good, I'll change it

}
nodeAnnotations[unneeded.NODE_COOLDOWN_SINCE_ANNOTATION] = p.latestUpdate.Format(time.RFC3339)
updatedNode.SetAnnotations(nodeAnnotations)
p.context.AutoscalingKubeClients.ClientSet.CoreV1().Nodes().Update(ctx.TODO(), updatedNode, metav1.UpdateOptions{})
Contributor

@towca @x13n can you comment if this is the right place to initiate an update to the k8s API of the running cluster?

Collaborator

I'd delegate adding an annotation/condition to a separate util like we do for taints: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/taints/taints.go.

Also I think it should either retry the Update in case of conflicts (like we do for taints in the file linked above), or use Patch instead to just merge the new annotation in.
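
To make the Patch alternative concrete, here is a minimal sketch assuming client-go's Nodes().Patch with a JSON merge patch; the annotation key, package, and helper name are placeholders for illustration, not code from this PR:

package nodeannotations

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	kube_client "k8s.io/client-go/kubernetes"
)

// patchUnneededSince merges a single annotation into the Node object instead of
// issuing a full Update, so a concurrent write by another controller doesn't
// cause a conflict error that needs retrying.
func patchUnneededSince(client kube_client.Interface, nodeName string, since time.Time) error {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				"cluster-autoscaler.kubernetes.io/unneeded-since": since.Format(time.RFC3339),
			},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return fmt.Errorf("marshaling annotation patch: %w", err)
	}
	_, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.MergePatchType, data, metav1.PatchOptions{})
	return err
}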

Member

+1, planner is meant for planning, it shouldn't talk to outside world at all.

Contributor Author

That sounds right. It shouldn't be in the planner.

I'm not sure where to put it, though, maybe a new method in the actuator?
It doesn't seem right to put it in any of the existing actuator methods...

@elmiko elmiko (Contributor) left a comment

i think this makes sense, but we'll need more reviews from the core maintainers.

also, is it possible to have a unit test for this behavior?

cc @BigDarkClown @towca

@@ -35,6 +35,11 @@ import (
klog "k8s.io/klog/v2"
)

const (
// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/since"
Contributor

has this annotation already been decided?

i'm just curious if we should mention "cooldown" anywhere in there.

Contributor Author

It's open to change, @jackfrancis suggested cluster-autoscaler.kubernetes.io/since so that's probably what I'll go with

Contributor

I think it was probably a typo but just in case! My suggestion is cluster-autoscaler.kubernetes.io/unneeded-since. It doesn't look like we refer to "cooldown" as a concept in the docs anywhere so annotating the node with that jargon might confuse users?

Collaborator

This is unfortunately another thing that ideally would be standardized between Node autoscalers and the rest of K8s. Then other components could see which Nodes are being considered for consolidation and react to that - we've definitely had requests for such a feature.

Also, doesn't it fit better as a condition instead of an annotation?

@jonathan-innis We have a proposal for a Node annotation/condition that would specify whether a Node is considered unneeded by CA and for how long. This can be used to persist the unneededness over container restarts. IIRC Karpenter has the same "unneeded Node" concept as part of consolidation. IMO we should standardize the condition under node-autoscaling.k8s.io, so that other K8s components can react to Nodes being considered unneeded. WDYT?
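
For illustration, a sketch of what such a Node condition could look like. The condition type under node-autoscaling.k8s.io is purely hypothetical; nothing has been standardized in this thread:

package nodeconditions

import (
	"fmt"
	"time"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unneededCondition builds a hypothetical "considered unneeded" Node condition.
// The type and reason names are invented for this sketch; they are not part of any API today.
func unneededCondition(unneededSince time.Time) apiv1.NodeCondition {
	return apiv1.NodeCondition{
		Type:               "node-autoscaling.k8s.io/Unneeded",
		Status:             apiv1.ConditionTrue,
		LastTransitionTime: metav1.NewTime(unneededSince),
		Reason:             "UnneededByAutoscaler",
		Message:            fmt.Sprintf("considered unneeded since %s", unneededSince.Format(time.RFC3339)),
	}
}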

@x13n (Member) commented Apr 25, 2025

Cluster Autoscaler doesn't know, when restarting, how long it was down nor what has happened over that time. Because of that, it currently errs on the side of restarting unneeded timers. For all we know, the nodes were fully utilized just a few seconds ago! This design decision extends beyond unneeded time - it is the reason why CA clears up all deletion candidate taints on startup.

Of course, we can revisit this decision, but this PR is insufficient. First of all, we should ensure deletion candidate taints are in sync with the notion of nodes being unneeded. Secondly, we need a mitigation for the case when CA is down for a long time (think days) and then comes up and starts to happily delete nodes. Since we are storing unneeded-since on the node object, there should be some "safety period" during which it is safe to assume the node stayed unneeded. Seconds is fine, days isn't, so there's a fine balance somewhere in between.

Finally, a node condition would probably be a better idea than an arbitrary annotation. Another alternative would be to use the deletion candidate taint value. The main upside of that second approach is that we're already doing this:

taint := apiv1.Taint{
	Key:    DeletionCandidateTaint,
	Value:  fmt.Sprint(time.Now().Unix()),
	Effect: apiv1.TaintEffectPreferNoSchedule,
}
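
To make the "safety period" idea concrete, here is a minimal sketch of reading that taint value back after a restart and trusting it only within a bounded window. The helper name and error handling are assumptions, not this PR's code; the key value is how the DeletionCandidateTaint constant is defined in CA's taints util at the time of writing, so treat it as context rather than part of this change:

package deletioncandidates

import (
	"strconv"
	"time"

	apiv1 "k8s.io/api/core/v1"
	klog "k8s.io/klog/v2"
)

// DeletionCandidateTaint is the taint key referenced in the snippet above.
const DeletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

// trustedDeletionCandidateSince returns the timestamp stored in the deletion
// candidate taint if it is recent enough to trust after a restart, or nil if the
// taint is missing, unparsable, or older than the given safety period.
func trustedDeletionCandidateSince(node *apiv1.Node, safetyPeriod time.Duration) *time.Time {
	for _, taint := range node.Spec.Taints {
		if taint.Key != DeletionCandidateTaint {
			continue
		}
		unix, err := strconv.ParseInt(taint.Value, 10, 64)
		if err != nil {
			klog.Warningf("ignoring unparsable DeletionCandidate taint on %s: %v", node.Name, err)
			return nil
		}
		since := time.Unix(unix, 0)
		if time.Since(since) > safetyPeriod {
			// Too old: CA may have been down for days, so don't resume the timer.
			return nil
		}
		return &since
	}
	return nil
}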

@MenD32 (Contributor Author) commented Apr 25, 2025

@x13n Got it. I still find it very important to persist this data across restarts, because not doing so can cause a very troubling pattern:

Some 'poison pill' enters the system and causes CA to fail -> the pod gets killed -> the restart resets the unneeded cooldown -> nothing gets scaled down.

I hit this with a large scale-up of nodes: at some point the memory limit was exceeded (and it was pretty generous), and nodes stopped scaling down.

I find this critical to fix, so please let me know how I can help; I'd rather not keep it running like that in my home lab...

Do you think that we could revisit the conversation regarding this pattern in CA?

I am more than happy to help in any way I can regarding this issue, of course.

@jackfrancis (Contributor)

To speak to @x13n's larger point: the fundamental issue (of which your encounter is one particular downstream symptom @MenD32) is that CA state is kept in-memory. Offloading that onto the Kubernetes cluster (for example via resource-specific annotations) in a reliably idempotent way (to account for disruptions to the CA runtime and/or the resources themselves) would be needed. This is very hard to do in an acceptably durable, secure way using standard interfaces like node taints and annotations (or deployment/daemonset/pod) via standard Kubernetes RBAC.

It would be possible, though.

I'm curious if @jonathan-innis or @njtran or anyone else in the Karpenter project has explored offloading some part of the autoscaler runtime to the cluster to better accommodate disruptions while maintaining reliable idempotency?

@x13n (Member) commented Apr 29, 2025

What I would propose to do here:

  • Introduce a new flag --max-taint-age. (Better naming suggestions are welcome 🙂) Defaulting it to 0 would preserve existing behavior, while allowing non-0 if desired.
  • When cleaning up taints on startup here:

        // Make sure we are only cleaning taints from selected node groups.
        selectedNodes := filterNodesFromSelectedGroups(a.CloudProvider, allNodes...)
        taints.CleanAllToBeDeleted(selectedNodes,
            a.AutoscalingContext.ClientSet, a.Recorder, a.CordonNodeBeforeTerminate)
        if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount == 0 {
            // Clean old taints if soft taints handling is disabled
            taints.CleanAllDeletionCandidates(allNodes,
                a.AutoscalingContext.ClientSet, a.Recorder)

    only delete the taints that have already reached the max taint age. (Probably rename CleanAllDeletionCandidates to something along the lines of CleanStaleDeletionCandidates?)
  • Add a new method to scaledown.Planner initialization, similar to existing Update method (probably most code could be shared there) that will set timestamps for unneeded nodes inferred from taints found on existing nodes.

Thoughts? @jackfrancis @towca
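
For illustration (not from any comment in this thread), a minimal sketch of the startup filtering step described in the proposal above, building on the taint-reading sketch shown earlier; all names are hypothetical, not code from the PR:

package deletioncandidates

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// staleDeletionCandidates returns the nodes that should still go through
// deletion-candidate cleanup on startup: everything except nodes whose taint is
// fresh enough (within maxTaintAge) to trust, so the planner can re-learn
// "unneeded since" from those instead of restarting the timer. With
// maxTaintAge == 0 every node is treated as stale, which preserves today's behavior.
func staleDeletionCandidates(nodes []*apiv1.Node, maxTaintAge time.Duration) []*apiv1.Node {
	var stale []*apiv1.Node
	for _, node := range nodes {
		if maxTaintAge > 0 && trustedDeletionCandidateSince(node, maxTaintAge) != nil {
			continue
		}
		stale = append(stale, node)
	}
	return stale
}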

@MenD32 (Contributor Author) commented May 2, 2025

I'm currently pausing work on this PR until there is consensus on the solution's design, since there is the larger issue of how CA handles persistence and offloads state to the k8s cluster.

@towca (Collaborator) commented May 12, 2025

(quoting @x13n's proposal above in full)

I like this idea, sounds good! The exact same logic could be applied to the corresponding Node conditions once we add those (so maybe the flag shouldn't mention "taint"; wdyt about --max-deletion-candidate-staleness?).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 14, 2025
@x13n (Member) commented May 15, 2025

Makes sense. --max-deletion-candidate-staleness is a bit long but I guess more future-proof. Would love to have something shorter, but honestly I don't have a better idea.

@jackfrancis (Contributor)

(quoting @x13n's proposal and @towca's reply above in full)

--node-deletion-candidate-ttl ?

@elmiko (Contributor) commented May 16, 2025

imo, ttl is easier to understand than staleness.

Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from 7b966e2 to f31f6ac Compare May 17, 2025 17:10
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 17, 2025
@@ -463,7 +463,7 @@ func TestUpdateClusterState(t *testing.T) {
wantUnremovable: []string{"n1", "n2", "n3", "n4"},
},
{
name: "Simulation timeout is hitted",
Contributor Author

that is just a typo I noticed, I can revert if that's an issue

Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from 4ea8af4 to 9b45b33 Compare June 5, 2025 06:53
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 5, 2025
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from 9b45b33 to 3392ab5 Compare June 5, 2025 06:56
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 5, 2025
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from 3392ab5 to f2bb059 Compare June 5, 2025 06:59
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 5, 2025
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from f2bb059 to f22dcae Compare June 5, 2025 07:03
@x13n x13n (Member) left a comment

I'm very confused by all the node annotation logic. Why do you think it is needed?


apiv1 "k8s.io/api/core/v1"
klog "k8s.io/klog/v2"
)

const (
// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/unneeded-since"
Member

I don't think we should have this as an annotation, but for future reference: Please use CamelCase, we don't use ALL_CAPS for constants generally.

@@ -226,6 +226,7 @@ var (
enableDynamicResourceAllocation = flag.Bool("enable-dynamic-resource-allocation", false, "Whether logic for handling DRA (Dynamic Resource Allocation) objects is enabled.")
clusterSnapshotParallelism = flag.Int("cluster-snapshot-parallelism", 16, "Maximum parallelism of cluster snapshot creation.")
checkCapacityProcessorInstance = flag.String("check-capacity-processor-instance", "", "Name of the processor instance. Only ProvisioningRequests that define this name in their parameters with the key \"processorInstance\" will be processed by this CA instance. It only refers to check capacity ProvisioningRequests, but if not empty, best-effort atomic ProvisioningRequests processing is disabled in this instance. Not recommended: Until CA 1.35, ProvisioningRequests with this name as prefix in their class will be also processed.")
maxDeletionCandidateStaleness = flag.Duration("max-deletion-candidate-staleness", time.Duration(0), "Maximum time a node can be marked as removable before the marking becomes stale. This sets the TTL of Cluster-Autoscaler's state if the Cluster-Autoscaler deployment becomes inactive")
Member

nit: now it says max-deletion-candidate-ttl not node-deletion-candidate-ttl. The word max is a bit redundant, as it is implied by ttl.

@@ -338,6 +338,9 @@ type AutoscalingOptions struct {
ProactiveScaleupEnabled bool
// PodInjectionLimit limits total number of pods while injecting fake pods.
PodInjectionLimit int
// MaxDeletionCandidateStaleness is the maximum time a node can be marked as removable without being deleted.
// This is used to prevent nodes from being stuck in the removable state if the CA deployment becomes inactive.
MaxDeletionCandidateStaleness time.Duration
Member

ditto, Max -> Node?

for _, taintKey := range taintKeys {
taintsPresent = taintsPresent || HasTaint(node, taintKey)
}
taintsPresent = taintsPresent || HasTaint(node, taintKey)
Member

At this point you don't need the initial assignment of false anymore. This can be simplified to:

if !HasTaint(node, taintKey) {
  continue
}

markedForDeletionTime, err := GetDeletionCandidateTime(node)
if err != nil {
klog.Warningf("Error while getting DeletionCandidate time for node %v: %v", node.Name, err)
return false
Member

This taint should only ever be used by Cluster Autoscaler. Also, trying to infer the taint owner based on whether its value can be parsed as time is very brittle anyway.

return true
}
if time.Since(*markedForDeletionTime) > deletionCandidateTTL {
timeSinceDeletion := time.Since(*markedForDeletionTime)
Member

nit: This can be moved outside of the if block and used in the condition.
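
A tiny sketch of that refactor, with names taken from the diff context (illustrative only, not the PR's code):

package deletioncandidates

import "time"

// isStaleDeletionCandidate applies the nit above: compute the elapsed time once
// and reuse it in the condition.
func isStaleDeletionCandidate(markedForDeletionTime time.Time, deletionCandidateTTL time.Duration) bool {
	timeSinceDeletion := time.Since(markedForDeletionTime)
	return timeSinceDeletion > deletionCandidateTTL
}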

func MarkDeletionCandidate(node *apiv1.Node, client kube_client.Interface) (*apiv1.Node, error) {
taint := apiv1.Taint{
// GetDeletionCandidateTaint returns a taint that marks the node as a DeletionCandidate for Cluster Autoscaler.
func GetDeletionCandidateTaint() apiv1.Taint {
Member

Right, but then the const could probably be renamed to DeletionCandidateTaintKey?
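
A sketch of what the rename could look like; the key value mirrors the existing constant in CA's taints util, but treat the exact spelling and package layout here as assumptions:

package taints

import (
	"fmt"
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// DeletionCandidateTaintKey is the suggested new name for the taint-key constant.
const DeletionCandidateTaintKey = "DeletionCandidateOfClusterAutoscaler"

// GetDeletionCandidateTaint builds the soft taint whose value carries the
// "marked at" timestamp that the PR wants to read back after a restart.
func GetDeletionCandidateTaint() apiv1.Taint {
	return apiv1.Taint{
		Key:    DeletionCandidateTaintKey,
		Value:  fmt.Sprint(time.Now().Unix()),
		Effect: apiv1.TaintEffectPreferNoSchedule,
	}
}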

}
updated[name].since = since

// Update node annotation with the timestamp
Member

Why would we do that?

@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from f22dcae to 0af55ac Compare June 5, 2025 16:43
@MenD32 (Contributor Author) commented Jun 5, 2025

Hi, I accidentally messed up this branch by pushing the old implementation (with the unneeded-since annotation) instead of the new taint-based one, which lived on a different branch locally; when I pulled to respond to the PR, the two got mixed together.

I'm sorry for the confusion, I'll fix the PR right now.

MenD32 added 3 commits June 5, 2025 23:48
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch 5 times, most recently from bd6214a to 203bff3 Compare June 6, 2025 09:32
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
@MenD32 MenD32 force-pushed the fix/cooldown-reset-on-pod-redeploy branch from 203bff3 to 6590d55 Compare June 6, 2025 09:37
@MenD32 MenD32 requested a review from x13n June 11, 2025 11:10
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster Autoscaler gets OOMKilled instead of scaling down
6 participants