fix(agent): scale down using agent shutdown hook #22

lucaspin · 2023-12-06T21:02:03Z

Motivation

This change, present in Kubernetes 1.26+, added validation that prevents multiple HPAs from pointing at the same target. That restriction breaks our autoscaling.

Solution

Instead of using two HPAs (one for scaling up and one for scaling down), we now use a single HPA, which only scales the agent deployment up. Scaling down is handled by a shutdown hook, which runs every time an agent has been idle for a specified period of time. The idle timeout can be configured through the agent.autoscaling.idleTimeoutForScaleDown configuration value; by default, it is 30 minutes.

Shutdown hook logic

Since we can't choose which pods to delete when scaling a deployment, when an agent gets idle and shuts down, we:

1. Annotate the pod of the agent that is shutting down with a pod deletion cost of -1.
2. Decrease the number of replicas in the deployment by 1. This only happens if we are not already at the minimum replica count, specified by agent.autoscaling.min. If we are already at the minimum, we delete the pod without decreasing the replica count, to avoid potentially getting into a CrashLoopBackOff.
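
The decision made by the two steps above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `ScaleDownPlan` and `plan_scale_down` are hypothetical names, and the function only computes what to do rather than calling the Kubernetes API. The annotation key `controller.kubernetes.io/pod-deletion-cost` is the standard Kubernetes one for pod deletion cost.

```python
from dataclasses import dataclass
from typing import Optional

# Standard Kubernetes annotation: pods with a lower cost are removed first
# when a ReplicaSet scales down.
POD_DELETION_COST = "controller.kubernetes.io/pod-deletion-cost"


@dataclass
class ScaleDownPlan:
    """Hypothetical container for the actions the shutdown hook would take."""
    pod_patch: dict               # merge patch for the idle agent's pod
    new_replicas: Optional[int]   # new replica count, or None to leave it unchanged
    delete_pod: bool              # True if the pod should be deleted directly


def plan_scale_down(current_replicas: int, min_replicas: int) -> ScaleDownPlan:
    # Step 1: always mark this pod as the cheapest one to remove.
    pod_patch = {"metadata": {"annotations": {POD_DELETION_COST: "-1"}}}

    if current_replicas > min_replicas:
        # Step 2: shrink the deployment; the annotated pod is preferred for removal.
        return ScaleDownPlan(pod_patch, current_replicas - 1, delete_pod=False)

    # Already at the minimum: keep the replica count and delete the pod,
    # so that a replacement pod is created.
    return ScaleDownPlan(pod_patch, None, delete_pod=True)
```

For example, `plan_scale_down(3, 1)` plans a decrease to 2 replicas, while `plan_scale_down(1, 1)` leaves the replica count alone and plans a pod deletion instead.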

Since multiple shutdown hooks can be executing at the same time, we use optimistic locking on the deployment, using a semaphoreci.com/handle annotation.
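
As a rough illustration of the optimistic-locking idea, the sketch below uses an in-memory stand-in for the deployment. Everything in it is an assumption made for the example: `FakeDeploymentStore`, `Conflict`, and `scale_down_once` are hypothetical names, and the version counter only models the role that the semaphoreci.com/handle annotation plays; the description above does not spell out how that annotation is actually compared.

```python
import threading


class Conflict(Exception):
    """Raised when the deployment changed after it was read."""


class FakeDeploymentStore:
    """In-memory stand-in for the API server holding a single deployment."""

    def __init__(self, replicas: int):
        self._lock = threading.Lock()  # the server applies one write at a time
        self._replicas = replicas
        self._version = 0              # assumption: models the handle annotation

    def read(self):
        with self._lock:
            return self._replicas, self._version

    def write(self, replicas: int, expected_version: int) -> None:
        with self._lock:
            if expected_version != self._version:
                raise Conflict()       # someone else wrote first
            self._replicas = replicas
            self._version += 1


def scale_down_once(store: FakeDeploymentStore, min_replicas: int,
                    max_attempts: int = 10) -> bool:
    """Decrease replicas by 1; return False if already at the minimum."""
    for _ in range(max_attempts):
        replicas, version = store.read()
        if replicas <= min_replicas:
            return False
        try:
            store.write(replicas - 1, expected_version=version)
            return True
        except Conflict:
            continue                   # re-read and retry with fresh state
    raise RuntimeError("gave up after repeated conflicts")
```

With this shape, several hooks running concurrently never push the replica count below the minimum, because a write based on stale state is rejected and retried against the current count.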

lucaspin · 2023-12-08T13:47:03Z

Due to Kubernetes' eventual consistency model, and its lack of more granular control over deployment scaling operations, this solution is too messy. Too many race conditions all over the place. Even though it "works", it would be terrible to maintain and troubleshoot, so I'm closing this.

lucaspin added 11 commits December 5, 2023 16:44

fix: scale down using pod annotations

91e1821

fixes

9039c5b

fixes

9fd3849

lock deployment when scaling down

4bbf72a

do not scale down below min

d9bc46b

delete pod when current=min

b72cbd8

fixes

cfe0462

make idle timeout configurable

05d058a

update DOCS

4127669

use optimistic locking

06d402a

handle deployment deleted right when agent is idle and goes down

3725f71

lucaspin closed this Dec 8, 2023

lucaspin deleted the fix/scale-down branch January 25, 2024 10:54

lucaspin mentioned this pull request Jan 25, 2024

feat: add new chart for agent-k8s-controller #23

Merged
