Fix: cooldown reset on pod restart #8057
base: master
Conversation
Hi @MenD32. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@@ -17,6 +17,7 @@ limitations under the License.
package planner

import (
	ctx "context"
let's do the "normal" go thing and not use an import alias for the context package
I have a PR here to address the current patterns in code (which you may have copied): #7664
cool, didn't know that
@@ -35,6 +35,11 @@ import (
	klog "k8s.io/klog/v2"
)

const (
	// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
	NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/since"
preference for NodeUnneededSinceAnnotation = "cluster-autoscaler.kubernetes.io/unneeded-since"
Sounds good, I'll change it
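As a side note, a minimal sketch of how such an annotation could be read back after a restart (the annotation key follows the suggestion above; the helper name and placement are assumptions, not part of this PR):

```go
package unneeded

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// unneededSince recovers the time a node was marked unneeded from its
// annotation, so the scale-down timer can survive a CA pod restart.
func unneededSince(node *apiv1.Node) (time.Time, bool) {
	v, ok := node.Annotations["cluster-autoscaler.kubernetes.io/unneeded-since"]
	if !ok {
		return time.Time{}, false
	}
	t, err := time.Parse(time.RFC3339, v)
	if err != nil {
		return time.Time{}, false
	}
	return t, true
}
```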
}
nodeAnnotations[unneeded.NODE_COOLDOWN_SINCE_ANNOTATION] = p.latestUpdate.Format(time.RFC3339)
updatedNode.SetAnnotations(nodeAnnotations)
p.context.AutoscalingKubeClients.ClientSet.CoreV1().Nodes().Update(ctx.TODO(), updatedNode, metav1.UpdateOptions{})
I'd delegate adding an annotation/condition to a separate util like we do for taints: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/utils/taints/taints.go.
Also I think it should either retry the Update in case of conflicts (like we do for taints in the file linked above), or use Patch instead to just merge the new annotation in.
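A rough sketch of the Patch approach (helper name and placement are assumptions; a strategic merge patch only touches the annotation, so it avoids read-modify-write conflicts on Update):

```go
package annotations

import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	kube_client "k8s.io/client-go/kubernetes"
)

// PatchNodeAnnotation merges a single annotation into a node with a
// strategic merge patch, so no retry-on-conflict loop is needed.
func PatchNodeAnnotation(ctx context.Context, client kube_client.Interface, nodeName, key, value string) error {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{key: value},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.StrategicMergePatchType, data, metav1.PatchOptions{})
	return err
}
```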
+1, the planner is meant for planning; it shouldn't talk to the outside world at all.
That sounds right. It shouldn't be in the planner.
I'm not sure where to put it, though, maybe a new method in the actuator?
It doesn't seem right to put it in any of the existing actuator methods...
I think this makes sense, but we'll need more reviews from the core maintainers.
Also, is it possible to have a unit test for this behavior?
@@ -35,6 +35,11 @@ import (
	klog "k8s.io/klog/v2"
)

const (
	// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
	NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/since"
Has this annotation already been decided?
I'm just curious if we should mention "cooldown" anywhere in there.
It's open to change. @jackfrancis suggested cluster-autoscaler.kubernetes.io/since, so that's probably what I'll go with.
I think it was probably a typo but just in case! My suggestion is cluster-autoscaler.kubernetes.io/unneeded-since. It doesn't look like we refer to "cooldown" as a concept in the docs anywhere, so annotating the node with that jargon might confuse users?
This is unfortunately another thing that ideally would be standardized between Node autoscalers and the rest of K8s. Then other components could see which Nodes are being considered for consolidation and react to that - we've definitely had requests for such a feature.
Also, doesn't it fit better as a condition instead of an annotation?
@jonathan-innis We have a proposal for a Node annotation/condition that would specify whether a Node is considered unneeded by CA and for how long. This can be used to persist the unneededness over container restarts. IIRC Karpenter has the same "unneeded Node" concept as part of consolidation. IMO we should standardize the condition under node-autoscaling.k8s.io, so that other K8s components can react to Nodes being considered unneeded. WDYT?
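Purely as illustration, such a condition might look like the following (the type name and reason are assumptions, nothing here is a decided API):

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// unneededCondition builds a hypothetical standardized condition that a
// node autoscaler could set when it considers a Node for consolidation.
func unneededCondition(since metav1.Time) apiv1.NodeCondition {
	return apiv1.NodeCondition{
		Type:               "node-autoscaling.k8s.io/Unneeded", // assumed name, not standardized
		Status:             apiv1.ConditionTrue,
		LastTransitionTime: since,
		Reason:             "ConsolidationCandidate", // assumed reason
		Message:            "Node is considered unneeded by the node autoscaler",
	}
}
```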
Cluster Autoscaler doesn't know, when restarting, how long it was down nor what has happened over that time. Because of that, it currently errs on the side of restarting unneeded timers. For all we know, the nodes were fully utilized just a few seconds ago! This design decision extends beyond unneeded time: it is the reason why CA clears all deletion candidate taints on startup. Of course, we can revisit this decision, but this PR is insufficient. First of all, we should ensure deletion candidate taints are in sync with the notion of nodes being unneeded. Secondly, we need a mitigation for the case when CA is down for a long time (think days) and then comes up and starts to happily delete nodes. Since we are storing
Finally, a node condition would probably be a better idea than an arbitrary annotation. Another alternative would be to use the deletion candidate taint value. The main upside of that second approach is that we're already doing this: see autoscaler/cluster-autoscaler/utils/taints/taints.go, lines 175 to 179 at 6771ca4.
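For reference, a sketch of that taint-value approach (the taint key is the one CA already uses; the Unix-seconds encoding matches the linked taints.go helper, but treat the exact shape as an assumption):

```go
package taints

import (
	"strconv"
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// DeletionCandidateTaint is the taint key CA uses to mark scale-down candidates.
const DeletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

// getDeletionCandidateTime recovers when a node was marked as a deletion
// candidate, assuming the taint value encodes Unix seconds.
func getDeletionCandidateTime(node *apiv1.Node) (*time.Time, bool) {
	for _, taint := range node.Spec.Taints {
		if taint.Key != DeletionCandidateTaint {
			continue
		}
		sec, err := strconv.ParseInt(taint.Value, 10, 64)
		if err != nil {
			return nil, false
		}
		t := time.Unix(sec, 0)
		return &t, true
	}
	return nil, false
}
```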
@x13n I got it. I find it very important to persist this data across restarts, because not doing so can cause a very troubling failure pattern: some 'poison pill' enters the system and causes CA to fail -> the pod gets killed -> the restart resets the unneeded cooldown -> nothing gets scaled down. I hit this issue with a large scale-up of nodes: at some point the memory limit was exceeded (it was pretty generous), and nodes stopped scaling down. I find this critical to fix, so please let me know how I can help, because it is vital for me to not have this running like that in my home lab... Do you think that we could revisit the conversation regarding this pattern in CA? I am more than happy to help in any way I can regarding this issue, of course.
To speak to @x13n's larger point: the fundamental issue (of which your encounter is one particular downstream symptom, @MenD32) is that CA state is kept in memory. Offloading that onto the Kubernetes cluster (for example via resource-specific annotations) in a reliably idempotent way (to account for disruptions to the CA runtime and/or the resources themselves) would be needed. This is very hard to do in an acceptably durable, secure way using standard interfaces like node taints and annotations (or deployment/daemonset/pod) via standard Kubernetes RBAC. It would be possible, though. I'm curious if @jonathan-innis or @njtran or anyone else in the Karpenter project has explored offloading some part of the autoscaler runtime to the cluster to better accommodate disruptions while maintaining reliable idempotency?
What I would propose to do here:
Thoughts? @jackfrancis @towca
I'm currently pausing work on this PR until there is consensus on the solution's design, since there is the larger issue of how CA handles persistence and offloading state to the k8s cluster.
I like this idea, sounds good! The exact same logic could be applied to the corresponding Node conditions once we add those (so maybe the flag shouldn't mention "taint", wdyt about
Makes sense.
imo,
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Force-pushed 7b966e2 to f31f6ac
@@ -463,7 +463,7 @@ func TestUpdateClusterState(t *testing.T) {
	wantUnremovable: []string{"n1", "n2", "n3", "n4"},
},
{
	name: "Simulation timeout is hitted",
That is just a typo I noticed; I can revert it if that's an issue.
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Force-pushed 4ea8af4 to 9b45b33
Force-pushed 9b45b33 to 3392ab5
Force-pushed 3392ab5 to f2bb059
Force-pushed f2bb059 to f22dcae
I'm very confused with all the node annotation logic. Why do you think it is needed?
	apiv1 "k8s.io/api/core/v1"
	klog "k8s.io/klog/v2"
)

const (
	// NODE_COOLDOWN_SINCE_ANNOTATION is the annotation used to store the time when a node was found to be unneeded.
	NODE_COOLDOWN_SINCE_ANNOTATION = "cluster-autoscaler.kubernetes.io/unneeded-since"
I don't think we should have this as an annotation, but for future reference: Please use CamelCase, we don't use ALL_CAPS for constants generally.
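For future reference, the Go-style spelling would be (the name mirrors the earlier review suggestion):

```go
// NodeUnneededSinceAnnotation stores the time when a node was found to be unneeded.
const NodeUnneededSinceAnnotation = "cluster-autoscaler.kubernetes.io/unneeded-since"
```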
@@ -226,6 +226,7 @@ var (
	enableDynamicResourceAllocation = flag.Bool("enable-dynamic-resource-allocation", false, "Whether logic for handling DRA (Dynamic Resource Allocation) objects is enabled.")
	clusterSnapshotParallelism = flag.Int("cluster-snapshot-parallelism", 16, "Maximum parallelism of cluster snapshot creation.")
	checkCapacityProcessorInstance = flag.String("check-capacity-processor-instance", "", "Name of the processor instance. Only ProvisioningRequests that define this name in their parameters with the key \"processorInstance\" will be processed by this CA instance. It only refers to check capacity ProvisioningRequests, but if not empty, best-effort atomic ProvisioningRequests processing is disabled in this instance. Not recommended: Until CA 1.35, ProvisioningRequests with this name as prefix in their class will be also processed.")
	maxDeletionCandidateStaleness = flag.Duration("max-deletion-candidate-staleness", time.Duration(0), "Maximum time a node can be marked as removable before the marking becomes stale. This sets the TTL of Cluster-Autoscaler's state if the Cluster-Autoscaler deployment becomes inactive.")
nit: now it says max-deletion-candidate-ttl, not node-deletion-candidate-ttl. The word max is a bit redundant, as it is implied by ttl.
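A sketch of the check such a TTL flag would drive (names are assumed; treating a zero value as "feature disabled" matches the flag's zero default but is an assumption):

```go
package staleness

import "time"

// isMarkingStale reports whether a deletion-candidate marking is too old
// to trust after a CA outage. A zero TTL is assumed to disable persistence,
// keeping the current behavior of discarding all markings on startup.
func isMarkingStale(markedAt time.Time, ttl time.Duration) bool {
	if ttl == 0 {
		return true
	}
	return time.Since(markedAt) > ttl
}
```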
@@ -338,6 +338,9 @@ type AutoscalingOptions struct {
	ProactiveScaleupEnabled bool
	// PodInjectionLimit limits total number of pods while injecting fake pods.
	PodInjectionLimit int
	// MaxDeletionCandidateStaleness is the maximum time a node can be marked as removable without being deleted.
	// This is used to prevent nodes from being stuck in the removable state if the CA deployment becomes inactive.
	MaxDeletionCandidateStaleness time.Duration
ditto, Max -> Node?
for _, taintKey := range taintKeys {
	taintsPresent = taintsPresent || HasTaint(node, taintKey)
}
taintsPresent = taintsPresent || HasTaint(node, taintKey)
At this point you don't need the initial assignment of false anymore. This can be simplified to:

if !HasTaint(node, taintKey) {
	continue
}
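Folded into the surrounding loop from the diff, the suggestion would read something like this (a sketch; HasTaint is the package's existing helper):

```go
// anyTaintPresent reports whether the node carries any of the given taint keys.
func anyTaintPresent(node *apiv1.Node, taintKeys []string) bool {
	for _, taintKey := range taintKeys {
		if !HasTaint(node, taintKey) {
			continue
		}
		return true
	}
	return false
}
```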
markedForDeletionTime, err := GetDeletionCandidateTime(node)
if err != nil {
	klog.Warningf("Error while getting DeletionCandidate time for node %v: %v", node.Name, err)
	return false
This taint should only ever be used by Cluster Autoscaler. Also, trying to infer the taint owner based on whether its value can be parsed as time is very brittle anyway.
	return true
}
if time.Since(*markedForDeletionTime) > deletionCandidateTTL {
timeSinceDeletion := time.Since(*markedForDeletionTime)
nit: This can be moved outside of the if block and used in the condition.
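That is, something like (a sketch; names come from the diff above):

```go
timeSinceDeletion := time.Since(*markedForDeletionTime)
if timeSinceDeletion > deletionCandidateTTL {
	// marking is stale; handle accordingly
}
```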
func MarkDeletionCandidate(node *apiv1.Node, client kube_client.Interface) (*apiv1.Node, error) {
	taint := apiv1.Taint{
// GetDeletionCandidateTaint returns a taint that marks the node as a DeletionCandidate for Cluster Autoscaler.
func GetDeletionCandidateTaint() apiv1.Taint {
Right, but then the const could probably be renamed to DeletionCandidateTaintKey?
}
updated[name].since = since

// Update node annotation with the timestamp
Why would we do that?
Force-pushed f22dcae to 0af55ac
Hi, I accidentally messed up the branch by pushing the old implementation (with the unneeded-since annotation) instead of the new implementation that uses taints, which existed on a different branch locally; when I pulled to respond to the PR, the two got mixed together. I'm sorry for the confusion, I'll fix the PR right now.
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Force-pushed bd6214a to 203bff3
Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
Force-pushed 203bff3 to 6590d55
…dateNodes Signed-off-by: MenD32 <amit.mendelevitch@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
Fixes a bug where, upon deployment restart, cluster-autoscaler resets the node scale-down timer, which can cause nodes to never scale down in extreme cases.
Which issue(s) this PR fixes:
Fixes #7873
Special notes for your reviewer:
This issue has already affected me, and I'm highly motivated to fix it ASAP, so I should be available during the next few days to resolve this quickly. If you have a better way of fixing this HA issue, please suggest one 😄
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: