This repository has been archived by the owner on Oct 11, 2021. It is now read-only.
I once thought the current setup was the best we could get from SQS, but that's true only when considering SQS's built-in metrics. It solved #36, but restricted the auto scaling feature too much.
The current setup does not adapt to load and works only for Airflow clusters that burst tasks, that is, spend most of the day idle. I believe we can get much better auto scaling behavior using custom metrics. If we glue the auto scaling functionality to CloudWatch alarms on the SQS metrics, we'll be making lots of opinionated assumptions.
If we define a custom metric instead, it's reasonably easy to override it without altering the stack too much. Only the default will be opinionated, and it can be easily adjusted to each situation. The proposal is to:
- estimate a load average of the cluster based on an estimate of how many workers are idle
- scale up or down based on minimum and maximum load averages
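As a sketch of the second bullet, the scaling decision driven by such a load metric could look like the function below. The threshold values and the single-worker step size are hypothetical defaults of mine, not part of this proposal; a real stack would take them from the CloudWatch alarm configuration.

```python
def desired_workers(current: int, load: float,
                    low: float = 0.25, high: float = 0.75) -> int:
    """Step the desired worker count up or down based on load bounds.

    `low` and `high` are hypothetical defaults; in practice they would
    come from the minimum/maximum load average alarms.
    """
    if load > high:
        return current + 1          # scale up: cluster mostly busy
    if load < low and current > 0:
        return current - 1          # scale down: cluster mostly idle
    return current                  # within bounds: hold steady
```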
Specification
Celery has an autoscale feature that can be tweaked to control scaling inside each machine, and that should be the best driver to handle high variation in processing between tasks. We are primarily focused on scaling in response to big queues, not high CPU usage, as this is the most common scenario with Airflow during backfills or a sudden schedule of many concurrent tasks.
Our goal is simply to increase the amount of available workers, trusting that machines are correctly sized to keep them busy, but there's no simple way to assess how many are busy or idle. This proposal tries to estimate that proportion using only SQS and ASG metrics.
Because task durations can vary wildly, it's hard to count how many workers are supposed to be busy based on messages received. On the other hand, idle workers will repeatedly poll the queue for tasks, so if we can measure the amount of polling we can guess how many workers are idle.
(for a given interval of time t)
NOER := NumberOfEmptyReceives
W_i := (average) number of idle workers
fpoll := (average) frequency at which an idle worker polls the queue
[Eq. 1]
NOER ~ W_i * fpoll * t
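Inverting Eq. 1 gives an estimate of the idle worker count from the metric we can actually read. A quick sanity check with illustrative numbers (the fpoll value is the one from EDIT2; NOER is made up to land on a round answer):

```python
# Illustrative numbers only; fpoll ~ 0.098444 Hz per EDIT2 in this issue.
fpoll = 0.098444   # empty receives per second per idle worker
t = 300            # metric interval in seconds (5 minutes)
noer = 59.0664     # NumberOfEmptyReceives observed over the interval

# Rearranging Eq. 1: W_i ~ NOER / (fpoll * t)
w_idle = noer / (fpoll * t)
```

With these numbers the estimate comes out to two idle workers.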
We may then define the cluster load average as the ratio of busyness for the workers, making the gross approximation that workers are either busy or idle (negligible setup time)
l := cluster load average
W := (average) number of workers
W_b := (average) number of busy workers
[Eq. 2]
W = W_b + W_i
[Eq. 3]
l = W_b/W
We can plug equations 1 and 2 into 3 and get:
[Eq. 4]
l = 1 - W_i/W ~ 1 - NOER / (W * fpoll * t)
This approximation has a few instabilities, such as a varying fpoll and the ramp-up associated with new workers, but sounds promising. We also have to take care of special cases, namely when there are no workers yet.
That case is handled with a simple test: if the queue is empty, the load of an empty cluster is zero. If there's something on the queue, the load of an empty cluster is 100%. This edge case handling suffices to guarantee that we'll scale up from no workers to at least one worker if there's something to be executed and stay put when there's nothing.
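Putting Eq. 4 and the empty-cluster test together, a minimal sketch of the metric (the names are mine, and I clamp the result to [0, 1] since measurement noise in NOER can push the raw ratio slightly outside that range):

```python
def cluster_load(noer: float, workers: int, fpoll: float, t: float,
                 queue_size: int) -> float:
    """Estimate the cluster load average per Eq. 4.

    noer: NumberOfEmptyReceives over the interval t (seconds).
    workers: current (average) number of workers, e.g. from the ASG.
    queue_size: approximate number of messages visible in the queue.
    """
    if workers == 0:
        # Empty cluster: 100% load if there is work waiting, else 0%.
        return 1.0 if queue_size > 0 else 0.0
    load = 1.0 - noer / (workers * fpoll * t)
    return min(max(load, 0.0), 1.0)  # clamp against measurement noise
```

With the scale-up-from-zero rule in place, a non-empty queue on an empty cluster reports 100% load and triggers the first worker; from then on Eq. 4 takes over.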
EDIT1: worker concurrency doesn't affect message polling
EDIT2: fpoll has been experimentally defined as 0.098444...
Reference: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html