
Define custom metrics for auto scaling alarm triggers #63

Closed
villasv opened this issue Nov 23, 2018 · 1 comment

Comments

@villasv
Owner

villasv commented Nov 23, 2018

I once thought the current setup was the best we could get from SQS, but that's only true if we limit ourselves to the built-in SQS metrics. It solved #36, but it also restricted the auto scaling feature too much.

The current setup does not adapt to load and only works for Airflow clusters with bursty workloads, that is, clusters that spend most of the day idle. I believe we can get much better auto scaling behavior using custom metrics. If we glue the auto scaling functionality to CloudWatch alarms on the built-in SQS metrics, we'll be baking in a lot of opinionated assumptions.

If we define a custom metric, it's reasonably easy to override it without altering the stack too much. Only the default will be opinionated, and it can easily be adjusted to each user's own situation.

Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html

@villasv
Owner Author

villasv commented Nov 27, 2018

Strategy Proposal

  • estimate a load average of the cluster based on an estimate of how many workers are idle
  • scale up or down based on minimum and maximum load averages

Specification

Celery has an autoscale feature that can be tweaked to control scaling inside each machine, and that should be the best driver for handling large variations in processing between tasks. Here we are primarily focused on scaling in response to long queues, not high CPU usage, as that is the most common scenario with Airflow during backfills or a sudden schedule of many concurrent tasks.

Our goal is simply to increase the number of available workers, trusting that the machines are correctly sized to keep them busy. But there's no simple way to assess how many workers are busy or idle, so this proposal tries to estimate that proportion using only SQS and ASG metrics.

Because task durations can vary wildly, it's hard to count how many workers are supposed to be busy based on messages received. On the other hand, idle workers will repeatedly poll the queue for tasks, so if we can measure the amount of polling we can guess how many workers are idle.

(for a given interval of time t)
NOER  := NumberOfEmptyReceives
W_i   := (average) number of idle workers
fpoll := (average) idle worker's frequency of queue polling

[Eq. 1]
NOER ~ W_i * fpoll * t
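
To make Eq. 1 concrete, here's a small worked example that inverts it to estimate the number of idle workers from the empty-receive count. The fpoll value is the empirically measured one from EDIT2 below; the 5-minute window and the NOER sample are made-up numbers for illustration.

```python
# Hypothetical worked example of Eq. 1, solved for W_i.
fpoll = 0.098444   # empty receives per second per idle worker (EDIT2 below)
t = 300            # assumed measurement window: 5 minutes, in seconds
noer = 59          # made-up NumberOfEmptyReceives sum over the window

w_idle = noer / (fpoll * t)   # Eq. 1 rearranged: W_i ~ NOER / (fpoll * t)
print(round(w_idle, 1))       # ~2.0 idle workers
```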

We may then define the cluster load average as the fraction of workers that are busy, making the gross approximation that workers are either busy or idle (i.e., setup time is negligible).

l   := cluster load average
W   := (average) number of workers
W_b := (average) number of busy workers

[Eq. 2]
W = W_b + W_i
[Eq. 3]
l = W_b/W

We can plug equations 1 and 2 into 3 and get:

[Eq. 4]
l = 1 - W_i/W ~ 1 - NOER / (W * fpoll * t)
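
As a sanity check on the algebra, Eq. 4 is a one-liner in code. This is just a sketch of the formula itself; the function and argument names are chosen here for illustration, and the clamp to [0, 1] is an extra assumption to absorb small drifts of the approximation outside that range.

```python
def cluster_load_average(noer: float, workers: float, fpoll: float, t: float) -> float:
    """Eq. 4: l ~ 1 - NOER / (W * fpoll * t), clamped to [0, 1]."""
    load = 1.0 - noer / (workers * fpoll * t)
    return min(max(load, 0.0), 1.0)

# e.g. cluster_load_average(noer=59, workers=4, fpoll=0.098444, t=300) -> ~0.5
```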

This approximation has a few instabilities, like a varying fpoll and the ramp-up time associated with new workers, but it sounds promising. We also have to take care of the special cases, namely when there are no workers yet.

That case is handled with a simple test: if the queue is empty, the load of an empty cluster is zero; if there's something on the queue, the load of an empty cluster is 100%. This edge case handling suffices to guarantee that we scale up from zero workers to at least one when there's something to be executed, and stay put when there's nothing.
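
Putting Eq. 4 and the edge case together, a metric-publishing script along the following lines could run periodically and push the load average as a custom CloudWatch metric for the scaling alarms to watch. This is only a sketch under assumptions: the queue name/URL, ASG name, namespace and metric name are placeholders, and the boto3 wiring is one possible way to collect the inputs, not the actual stack implementation.

```python
import datetime

import boto3

# Placeholder resource names; the real stack would provide these.
QUEUE_NAME = "airflow-celery-queue"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/airflow-celery-queue"
ASG_NAME = "airflow-workers-asg"
FPOLL = 0.098444   # empty receives per second per idle worker (EDIT2)
PERIOD = 300       # measurement window in seconds

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")
sqs = boto3.client("sqs")


def get_empty_receives(now):
    """Sum of NumberOfEmptyReceives over the last window (NOER in Eq. 1)."""
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="NumberOfEmptyReceives",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - datetime.timedelta(seconds=PERIOD),
        EndTime=now,
        Period=PERIOD,
        Statistics=["Sum"],
    )
    datapoints = stats["Datapoints"]
    return datapoints[0]["Sum"] if datapoints else 0.0


def get_worker_count():
    """Number of InService instances in the worker auto scaling group (W)."""
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    return sum(1 for i in instances if i["LifecycleState"] == "InService")


def get_queue_depth():
    """Approximate number of visible messages waiting in the queue."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def compute_load():
    workers = get_worker_count()
    if workers == 0:
        # Edge case above: an empty cluster is 0% loaded if the queue is
        # empty, 100% loaded if there is anything waiting to be executed.
        return 0.0 if get_queue_depth() == 0 else 1.0
    noer = get_empty_receives(datetime.datetime.utcnow())
    load = 1.0 - noer / (workers * FPOLL * PERIOD)   # Eq. 4
    return min(max(load, 0.0), 1.0)


# Publish the result as a custom metric that the scaling alarms can use.
cloudwatch.put_metric_data(
    Namespace="Airflow/Cluster",
    MetricData=[{"MetricName": "LoadAverage", "Value": compute_load(), "Unit": "None"}],
)
```

Where this would run (a small scheduled Lambda, a cron job on the scheduler instance, etc.) is a separate decision and not part of this sketch.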

EDIT1: worker concurrency doesn't affect message polling
EDIT2: fpoll has been experimentally defined as 0.098444...

@villasv villasv closed this as completed Nov 30, 2018