Description
Motivation
For example: if target_replica_concurrency is 1 and current concurrency is 2, but 10 seconds ago it was 3 and 10 seconds before that it was 4, then it's not necessary to scale up, because concurrency is already falling toward the target.
Initial design thoughts
The metric to consider could be a projection of when the concurrency will reach the target_replica_concurrency. This could be computed, for example, by looking at the total in flight over the past window period (not the rolling averages), and drawing a best fit line. The configuration would be "don't scale up if it's under-provisioned, but projected to reach target_replica_concurrency within X amount of time (default: 0s)". Also support the reverse: "don't scale down if it's over-provisioned, but projected to reach target_replica_concurrency within X amount of time (default: 0s)".
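A rough sketch of the projection idea, to make the proposal concrete. This is not an existing API: fit_line, seconds_until_target, should_scale_up and the tolerance_s parameter (the "X amount of time" above) are hypothetical names, and the samples are assumed to be raw (timestamp, per-replica concurrency) pairs rather than rolling averages.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, concurrency); assumed shape


def fit_line(samples: Sequence[Sample]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / n
    mean_c = sum(c for _, c in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (c - mean_c) for t, c in samples)
    slope = cov / var_t
    return slope, mean_c - slope * mean_t


def seconds_until_target(
    samples: Sequence[Sample], target: float, now: float
) -> Optional[float]:
    """Seconds until the fitted line reaches `target`, or None if the
    trend is flat or moving away from the target."""
    slope, intercept = fit_line(samples)
    current = slope * now + intercept
    if current == target:
        return 0.0
    if slope == 0:
        return None
    eta = (target - current) / slope
    return eta if eta > 0 else None


def should_scale_up(
    samples: Sequence[Sample],
    target: float,
    now: float,
    tolerance_s: float = 0.0,  # hypothetical config knob; 0s keeps today's behavior
) -> bool:
    """Scale up only if under-provisioned and NOT projected to reach
    the target within `tolerance_s` seconds."""
    latest = samples[-1][1]
    if latest <= target:
        return False  # not under-provisioned
    eta = seconds_until_target(samples, target, now)
    return not (eta is not None and eta <= tolerance_s)


# The example from the motivation: 4, then 3, then 2, with a target of 1.
history = [(-20.0, 4.0), (-10.0, 3.0), (0.0, 2.0)]
print(seconds_until_target(history, target=1.0, now=0.0))
print(should_scale_up(history, target=1.0, now=0.0))
print(should_scale_up(history, target=1.0, now=0.0, tolerance_s=15.0))
```

With the default of 0s the decision is unchanged from a purely threshold-based check; only a non-zero tolerance suppresses the scale-up. The scale-down case would be symmetric, with the under-provisioned check flipped.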
Question: How to handle changes in the replica count over that period? Perhaps only consider requested replicas (vs live replicas)?