
chore: RFC #1858 - 2020-04-06 - Automatic rate limit adjustment #2329

Merged
merged 15 commits on Apr 30, 2020
161 changes: 161 additions & 0 deletions rfcs/2020-04-06-1858-automatically-adjust-limits.md
@@ -0,0 +1,161 @@
# RFC 1858 - 2020-04-06 - Automatically Adjust Limits

Contributor

Suggested change:
- # RFC 1858 - 2020-04-06 - Automatically Adjust Limits
+ # RFC 1858 - 2020-04-06 - Automatically Adjust HTTP Request Limits

I think we should make it clear that this RFC focuses on HTTP-based sinks only.

Member Author

This is true, and it's worth clarifying that this doesn't apply universally, but the HTTP part seems accidental. It is documented to apply at the level of the tower crate service builder, so any service that uses that framework would have this capability. Now, it happens that all of our HTTP services use the tower framework, and vice versa, but it isn't necessarily so.


This RFC proposes a new scheme for rate limiting requests to external
services in order to maximize the sustained transmission rate over
varying conditions.

Contributor

Suggested change:
- This RFC proposes a new scheme for rate limiting requests to external
- services in order to maximize the sustained transmission rate over
- varying conditions.
+ This RFC proposes a new scheme for rate-limiting HTTP requests to
+ external services in order to maximize the sustained transmission
+ rate over varying conditions.


## Motivation

Vector users commonly run into the problem of internal service rate
limiting. This is not an external service limiting how much data it will
accept from us, but Vector's own rate limiting based on the `request`
parameters. Alternatively, users can run into the opposite problem of
overwhelming a downstream service to the point where it becomes
unresponsive and starts queueing requests. Instead, Vector should be
automatically rate limiting its requests to maximally fill the service's
capacity without overwhelming it and causing additional problems.

Contributor

Suggested change:
- automatically rate limiting its requests to maximally fill the service's
- capacity without overwhelming it and causing additional problems.
+ automatically rate-limiting its HTTP requests to maximally fill the
+ service's capacity without overwhelming it and causing additional
+ problems.


Most sinks in Vector have their request structure managed by the [tower

Contributor

Suggested change:
- Most sinks in Vector have their request structure managed by the [tower
+ Most sinks in Vector have their HTTP request structure managed by the [tower

crate](https://github.com/tower-rs/tower). This service builder allows
for setting how requests are sent to remote sinks. In particular, Vector
fixes the number of requests that may be simultaneously in flight (AKA
the concurrency limit) and the maximum rate at which requests may be
sent, expressed in terms of the number of requests over some time
interval (AKA the rate limit number and duration).

Many of these parameters _must_ be adjusted by the Vector administrator
to maximize the throughput for each service. For high-volume sites, this
can require considerable trial-and-error experimentation before a
satisfactory set of parameters is found. More importantly, changes on
the service side, whether in available processing power, bandwidth,
latency (such as that caused by link congestion), or the number of
other agents delivering to the same sink, can cause the same settings
to impede utilization rather than improve it.

Since many of these factors that affect delivery rate are not fixed
values but instead will vary considerably throughout the life of a
service, it is impossible to choose the "best" parameters that will fit
all these conditions. Instead, Vector should adopt an approach that
allows for dynamic adjustment of the underlying parameters based on the
current conditions.

When service limits are reached, Vector will experience a number of
undesirable phenomena, notably increased request latency, timeouts, and
deferrals. These all decrease the overall flow rate while increasing
actual bandwidth usage.

## Guide-level Proposal

There are two levels of controls in play: static and dynamic. The
existing controls are static and best describe service limits, such as
maximum allowed request rates. What is needed is a set of dynamic
controls that adapt to the underlying conditions and scale the service
utilization on the fly.

Since the controls under consideration are dependent on some form of
queueing, the controls will be inserted at the same level as
`TowerRequestSettings`. The existing rate limit controls will remain in
order to provide a hard upper bound on service utilization (for example,
to prevent over-use violations), while utilization within that bound
will be adjusted dynamically by varying the concurrency.

The underlying control will replace the `tower::limit::ConcurrencyLimit`
layer with a new custom layer that dynamically adjusts the concurrency
limit based on current conditions. It will track each request's result
status (i.e. success or deferral) and the round-trip time (RTT). This
will require a modified `tower::limit::Limit` structure that will add
and remove permits as needed. A new `ResponseFuture` will forward the
result of the request back to the invoking `ConcurrencyLimit` after a
completed `poll` (in addition to the usual action of releasing the
permit on `drop`).
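
A minimal sketch of the state such a controller might carry, assuming a
semaphore-based permit scheme (all names here are hypothetical, not the
eventual implementation):

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};
use tokio::sync::Semaphore;

/// Hypothetical state for the dynamic concurrency layer described above.
struct AutoConcurrencyLimit {
    /// Permits gating how many requests may be in flight at once. Permits are
    /// added (additive increase) or withheld (multiplicative decrease) as the
    /// limit changes.
    semaphore: Arc<Semaphore>,
    /// The current dynamic concurrency limit, starting small and bounded
    /// above by the statically configured in-flight limit.
    current_limit: usize,
    /// Moving average (EWMA) of observed round-trip times.
    average_rtt: Option<Duration>,
}

/// What the response future would report back to the controller after a
/// completed `poll`, in addition to releasing its permit on drop.
struct RequestOutcome {
    started_at: Instant,
    completed_at: Instant,
    /// False for deferrals (e.g. HTTP 429) or any other failure.
    succeeded: bool,
}
```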

The algorithm used to control the limit will follow the AIMD framework
(a rough code sketch follows the list):

* The controller will maintain a moving average RTT of past requests
using an exponentially weighted moving average (EWMA). The weighting
(α) is to be experimentally determined.

* The current response's RTT is compared to this moving average:

* If less than or equal to the average, the concurrency will be
incremented by one (additive increase) up to a maximum of the
in-flight limit.

* If greater than the average, or the result was a failure of any
kind, the concurrency will be reduced by a factor of one half
(multiplicative decrease) down to a minimum of one.
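
For concreteness, here is a rough sketch of the adjustment step implied
by these rules, written as a standalone function with hypothetical names
(the real logic would live inside the controller sketched earlier):

```rust
use std::time::Duration;

/// One AIMD adjustment based on a completed request, following the rules
/// above. `alpha` is the EWMA weighting; every name here is illustrative.
fn adjust_concurrency(
    current_limit: usize,
    in_flight_limit: usize,
    average_rtt: &mut Option<Duration>,
    observed_rtt: Duration,
    succeeded: bool,
    alpha: f64,
) -> usize {
    // Compare against the moving average of *past* requests; the first
    // request has nothing to compare against, so treat it as "not slower".
    let slower_than_average = match *average_rtt {
        Some(avg) => observed_rtt > avg,
        None => false,
    };

    // Fold the observation into the EWMA: avg <- alpha * rtt + (1 - alpha) * avg
    *average_rtt = Some(match *average_rtt {
        Some(avg) => observed_rtt.mul_f64(alpha) + avg.mul_f64(1.0 - alpha),
        None => observed_rtt,
    });

    if succeeded && !slower_than_average {
        // Additive increase, capped at the statically configured in-flight limit.
        (current_limit + 1).min(in_flight_limit)
    } else {
        // Multiplicative decrease on a slow response or any failure, floor of one.
        (current_limit / 2).max(1)
    }
}
```

Called from the response future's completion path, the returned limit
would then drive how many permits the controller exposes.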

## Prior Art

* [TCP congestion control algorithms](https://en.wikipedia.org/wiki/TCP_congestion_control)
* [Additive Increase/Multiplicative Decrease](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease)
* [Netflix Technology Blog: Performance Under Load](https://medium.com/@NetflixTechBlog/performance-under-load-3e6fa9a60581)
* [JINSPIRED - Adaptive Safety Control (archive.org)](https://web.archive.org/web/20130105023839/http://www.jinspired.com/site/jxinsight-opencore-6-4-ea-11-released-adaptive-safety-control)

## Sales Pitch

This proposal:

* provides a simple and understandable mechanism for varying resource
utilization of sinks;

* adapts an existing design to avoid reinventing known good solutions;

* is minimally invasive to the existing code base while applying to most
sinks;

* minimizes the amount of configuration required to produce the ideal
(most efficient and performant) configuration; and

* does not impose hard limits on flow rates while respecting configured
limits.

## Drawbacks

Since the underlying parameters that control when requests are throttled
will be abstracted behind an additional layer, it will become harder to
reason about the causes of bandwidth limits.

## Rationale

* As referenced earlier, the proposed mechanism borrows from _proven_
mechanisms designed to manage flow control under varying conditions,
making it a good choice for the first pass implementation.

* A moving average is used to smooth out small variations in latency
without completely ignoring them.

* EWMA is chosen as an averaging mechanism as it avoids having to
maintain memory of past observations beyond a single
value. Mathematically it is the simplest possible moving average.
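
For reference, this reduces to a single stored value updated once per
observation, in the standard form below (r is the newly observed RTT,
the bar denotes the stored average, and α is the weighting above):

```latex
\bar{r} \leftarrow \alpha \, r + (1 - \alpha) \, \bar{r}
```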

## Alternatives

* Instead of managing the concurrency, we could alter the maximum
request rate or maximum bandwidth usage. This runs into the difficulty
of how to set the minimum bound both before any data has been
collected and after hard backoffs, while concurrency has a trivially
obvious minimum bound and is better able to flex with load.

* Instead of comparing the RTT against a moving average, we could simply
use the previous observation (mathematically equivalent to an EWMA
weighting of α=1).

## Outstanding Questions

* The ideal value for the weighting α is unknown. Too large a value will

Contributor

We should probably start with a low alpha and let the user customize this. I think that is what makes sense; with a low enough alpha, we should be able to eventually step up to the "real" value.

amplify the effect of short-term changes to RTT. Too small a value may
excessively delay responding to real changes.

* Some experimentation may be required to determine a small zone around
the average that is still considered "equal" to avoid excessive
flapping of the concurrency level without allowing the RTT to grow
unbounded and overload the sink.

## Plan Of Attack

* [ ] Submit a PR with spike-level code _roughly_ demonstrating the change.
* [ ] Benchmark the approach under various conditions to determine a good

Contributor

This, I think, needs much more detail. How exactly do you plan to test/benchmark this change to ensure it:

  1. Behaves as expected.
  2. Does not regress over time.
  3. Allows us to expand as we learn more.

For example, starting simple, it would help to define a set of scenarios and their ideal outcomes. We could then replay these scenarios to verify behavior as we hone in on the optimal algorithm. In other words, tests 😄. But I think we should agree on those test criteria within this RFC. Ex:

  1. Service suddenly becomes unresponsive with sustained request timeouts.
    • Vector should eventually reduce to an in_flight_limit of 1.
  2. Service gradually increases in response time by 100ms each request.
    • Vector should eventually reduce to an in_flight_limit of 1.
  3. Service implements a rate limit of 5 requests per second. If this limit is exceeded a 429 is returned:
    • Vector's in_flight_limit should hover around 5.
  4. etc..

It would be nice to have these static expectations to keep us honest. Beyond that, I think we would benefit from a randomized simulation of some sort. For example, we could define criteria that would cause a mock service to become unresponsive (shut down) and the test fails if the service reaches that state. The service would implement a rate-limit that randomly adjusts, and if the algorithm works, Vector should correctly hone in on that rate-limit as it adjusts. This should all be done within defined bounds that, if exceeded, would shut down the service.

Easier said than done! I'm happy to explain better. And I have to assume there is prior art for this kind of testing. If not, it would make for an excellent blog post!
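
Purely as an illustration of how such static expectations might be
encoded (the types, scenarios, and expected ranges below are
hypothetical, not something agreed in this thread):

```rust
use std::time::Duration;

/// How the mock downstream service misbehaves in a given scenario.
enum ServiceBehavior {
    /// Every request times out.
    Unresponsive,
    /// Response time grows by this amount on each request.
    GrowingLatency(Duration),
    /// Requests beyond this per-second rate receive a 429.
    RateLimited { requests_per_second: u32 },
}

/// A static expectation of where the dynamic concurrency limit should settle.
struct Scenario {
    behavior: ServiceBehavior,
    /// Inclusive (min, max) range the limit is expected to settle into.
    expected_limit: (usize, usize),
}

fn scenarios() -> Vec<Scenario> {
    vec![
        Scenario {
            behavior: ServiceBehavior::Unresponsive,
            expected_limit: (1, 1),
        },
        Scenario {
            behavior: ServiceBehavior::GrowingLatency(Duration::from_millis(100)),
            expected_limit: (1, 1),
        },
        Scenario {
            behavior: ServiceBehavior::RateLimited { requests_per_second: 5 },
            // As noted further down in this thread, the settling point here
            // depends on RTT (roughly rate limit x RTT), so any fixed range
            // is only illustrative.
            expected_limit: (1, 10),
        },
    ]
}
```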

Member Author

I agree this needs a more concrete testing and benchmarking plan, and these are all good ideas. Does all of this detail really belong in this implementation plan, or am I free to spread it elsewhere and just reference it here?

Contributor

Not at all. Feel free to add a section or add it under the "Guide Level Proposal" section.

Contributor (@binarylogic, Apr 17, 2020)

And to clarify the "Plan of Attack", it should be progressive and outline an optimal plan. It does not mean we will do all of the items. For example, if a simulator is part of the plan, I would expect that to be at the end unless it's absolutely necessary. It's possible we won't do it depending on how everything comes together.

Member Author

BTW condition 3 above doesn't actually compute: the rate limit and concurrency are different units. A rate limit of 5 requests per second might be achieved with any number of different concurrency levels. It depends on what the RTT is between Vector and the service. More precisely, Vector's concurrency should approximate rate_limit × RTT.
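
To put purely illustrative numbers on that relationship (these figures are mine, not from the discussion): at a limit of 5 requests per second with a 200 ms RTT, roughly 5 × 0.2 = 1 request in flight already saturates the limit, whereas the same limit with a 2 s RTT needs a concurrency of around 5 × 2 = 10 to keep the service busy.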

value for α.
* [ ] ………