
Define custom metrics for auto scaling alarm triggers #63

Closed
villasv opened this issue Nov 23, 2018 · 1 comment

Comments

@villasv
Owner

villasv commented Nov 23, 2018

I once thought the current setup was the best we could get from SQS, but that's only true if we limit ourselves to the built-in SQS metrics. It solved #36, but it also restricted the auto scaling feature too much.

The current setup does not adapt to load and only works for Airflow clusters with bursty workloads, that is, clusters that spend most of the day idle. I believe we can get much better auto scaling behavior using custom metrics. If we glue the auto scaling functionality to CloudWatch alarms on the built-in SQS metrics, we'll be baking in a lot of opinionated assumptions.

If we define a custom metric, it's reasonably easy to override it without altering the stack too much. Only the default will be opinionated, and it can easily be adjusted to each user's own situation.

Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html

@villasv
Owner Author

villasv commented Nov 27, 2018

Strategy Proposal

  • estimate a load average of the cluster based on an estimate of how many workers are idle
  • scale up or down based on minimum and maximum load averages

Specification

Celery has an autoscale feature that can be tweaked to control scaling inside each machine, and that should be the best driver for handling large variations in processing between tasks. Here we are primarily focused on scaling in response to long queues, not high CPU usage, as that is the most common scenario with Airflow during backfills or a sudden schedule of many concurrent tasks.

Our goal is simply to increase the number of available workers, trusting that the machines are correctly sized to keep them busy. But there's no simple way to assess how many workers are busy or idle, so this proposal tries to estimate that proportion using only SQS and ASG metrics.

Because task durations can vary wildly, it's hard to count how many workers are supposed to be busy based on messages received. On the other hand, idle workers will repeatedly poll the queue for tasks, so if we can measure the amount of polling we can guess how many workers are idle.

(for a given interval of time t)
NOER  := NumberOfEmptyReceives
W_i   := (average) number of idle workers
fpoll := (average) idle worker's frequency of queue polling

[Eq. 1]
NOER ~ W_i * fpoll * t
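
To make Eq. 1 concrete, here's a small worked example that inverts it to estimate the number of idle workers from the empty-receive count. The fpoll value is the empirically measured one from EDIT2 below; the 5-minute window and the NOER sample are made-up numbers for illustration.

```python
# Hypothetical worked example of Eq. 1, solved for W_i.
fpoll = 0.098444   # empty receives per second per idle worker (EDIT2 below)
t = 300            # assumed measurement window: 5 minutes, in seconds
noer = 59          # made-up NumberOfEmptyReceives sum over the window

w_idle = noer / (fpoll * t)   # Eq. 1 rearranged: W_i ~ NOER / (fpoll * t)
print(round(w_idle, 1))       # ~2.0 idle workers
```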

We may then define the cluster load average as the fraction of workers that are busy, making the gross approximation that workers are either busy or idle (i.e., setup time is negligible).

l   := cluster load average
W   := (average) number of workers
W_b := (average) number of busy workers

[Eq. 2]
W = W_b + W_i
[Eq. 3]
l = W_b/W

We can plug equations 1 and 2 into 3 and get:

[Eq. 4]
l = 1 - W_i/W ~ 1 - NOER / (W * fpoll * t)
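
As a sanity check on the algebra, Eq. 4 is a one-liner in code. This is just a sketch of the formula itself; the function and argument names are chosen here for illustration, and the clamp to [0, 1] is an extra assumption to absorb small drifts of the approximation outside that range.

```python
def cluster_load_average(noer: float, workers: float, fpoll: float, t: float) -> float:
    """Eq. 4: l ~ 1 - NOER / (W * fpoll * t), clamped to [0, 1]."""
    load = 1.0 - noer / (workers * fpoll * t)
    return min(max(load, 0.0), 1.0)

# e.g. cluster_load_average(noer=59, workers=4, fpoll=0.098444, t=300) -> ~0.5
```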

This approximation has a few instabilities, like a varying fpoll and the ramp-up time associated with new workers, but it sounds promising. We also have to take care of the special cases, namely when there are no workers yet.

That case is handled with a simple test: if the queue is empty, the load of an empty cluster is zero; if there's something on the queue, the load of an empty cluster is 100%. This edge case handling suffices to guarantee that we scale up from zero workers to at least one when there's something to be executed, and stay put when there's nothing.
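
Putting Eq. 4 and the edge case together, a metric-publishing script along the following lines could run periodically and push the load average as a custom CloudWatch metric for the scaling alarms to watch. This is only a sketch under assumptions: the queue name/URL, ASG name, namespace and metric name are placeholders, and the boto3 wiring is one possible way to collect the inputs, not the actual stack implementation.

```python
import datetime

import boto3

# Placeholder resource names; the real stack would provide these.
QUEUE_NAME = "airflow-celery-queue"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/airflow-celery-queue"
ASG_NAME = "airflow-workers-asg"
FPOLL = 0.098444   # empty receives per second per idle worker (EDIT2)
PERIOD = 300       # measurement window in seconds

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")
sqs = boto3.client("sqs")


def get_empty_receives(now):
    """Sum of NumberOfEmptyReceives over the last window (NOER in Eq. 1)."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="NumberOfEmptyReceives",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - datetime.timedelta(seconds=PERIOD),
        EndTime=now,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    return datapoints[0]["Sum"] if datapoints else 0.0


def get_worker_count():
    """Number of InService instances in the worker auto scaling group (W)."""
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    return sum(1 for i in instances if i["LifecycleState"] == "InService")


def get_queue_depth():
    """Approximate number of visible messages waiting in the queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def compute_load():
    workers = get_worker_count()
    if workers == 0:
        # Edge case above: an empty cluster is 0% loaded if the queue is
        # empty, 100% loaded if there is anything waiting to be executed.
        return 0.0 if get_queue_depth() == 0 else 1.0
    noer = get_empty_receives(datetime.datetime.utcnow())
    load = 1.0 - noer / (workers * FPOLL * PERIOD)   # Eq. 4
    return min(max(load, 0.0), 1.0)


# Publish the result as a custom metric that the scaling alarms can use.
cloudwatch.put_metric_data(
    Namespace="Airflow/Cluster",
    MetricData=[{"MetricName": "LoadAverage", "Value": compute_load(), "Unit": "None"}],
)
```

Where this would run (a small scheduled Lambda, a cron job on the scheduler instance, etc.) is a separate decision and not part of this sketch.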

EDIT1: worker concurrency doesn't affect message polling
EDIT2: fpoll has been experimentally defined as 0.098444...

@villasv villasv closed this as completed Nov 30, 2018