
chore: RFC #1858 - 2020-04-06 - Automatic rate limit adjustment #2329

Merged
15 commits merged into master from auto-backoff-rfc on Apr 30, 2020

Conversation

@bruceg (Member) commented Apr 14, 2020

🎉

Ref #1858

Signed-off-by: Bruce Guenter <bruce@timber.io>

Rendered

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg added the labels "type: task", "domain: networking", "domain: sinks", and "have: should" on Apr 14, 2020
@bruceg self-assigned this on Apr 14, 2020
@github-actions commented:
Great PR! Please pay attention to the following items before merging:

Files matching rfcs/**:

  • Have at least 3 team members approved this RFC?

This is an automatically generated QA checklist based on modified files

@Hoverbear (Contributor) left a comment

This seems like a well-considered, minimal, known-good solution!

Signed-off-by: Bruce Guenter <bruce@timber.io>
@lukesteensen (Member) left a comment

This sounds great! Convenient that it's entirely transparent to the user and that we still remain within the existing configured limits. We'll just want to make sure to spend time making the logs as informative as possible so users can understand what's happening. We may also want to let people opt out.

@LucioFranco (Contributor) left a comment

Great write-up! One thing we may also want to consider is some sort of jitter when updating the concurrency limit. This should help prevent multiple Vector instances from deciding to raise their limits at the same moment and possibly overwhelming the downstream system.


## Outstanding Questions

* The ideal value for the weighting α is unknown. Too large a value will
Contributor left a comment:

We should probably start with a low alpha and let the user customize it. I think that makes sense; with a low enough alpha we should still be able to eventually step up to the "real" value.
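For reference, the α weighting under discussion behaves like a standard exponentially weighted moving average over the observed RTTs: a small α reacts slowly and smooths out noise, while an α close to 1 chases individual observations. A minimal Rust sketch of that behavior (the type and names here are illustrative, not from the RFC or Vector):

```rust
/// Exponentially weighted moving average of observed RTTs.
/// A small `alpha` reacts slowly (smooth but laggy); an `alpha`
/// close to 1 tracks each individual observation.
struct Ewma {
    alpha: f64,
    average: Option<f64>,
}

impl Ewma {
    fn new(alpha: f64) -> Self {
        Self { alpha, average: None }
    }

    /// Fold a new RTT observation (in seconds) into the running average.
    fn update(&mut self, rtt: f64) -> f64 {
        let avg = match self.average {
            None => rtt,
            Some(avg) => self.alpha * rtt + (1.0 - self.alpha) * avg,
        };
        self.average = Some(avg);
        avg
    }
}

fn main() {
    // Start with a low alpha, as suggested above; a single outlier
    // barely moves the average.
    let mut ewma = Ewma::new(0.1);
    for rtt in [0.050, 0.055, 0.048, 0.250, 0.052] {
        println!("observed {rtt:.3}s -> smoothed {:.3}s", ewma.update(rtt));
    }
}
```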

@bruceg (Member, Author) commented Apr 16, 2020

> One thing we may also want to consider is some sort of jitter when updating the concurrency limit. This should help prevent multiple Vector instances from deciding to raise their limits at the same moment and possibly overwhelming the downstream system.

That is at least partially mitigated by the slow increase. If the slow increase causes the downstream to be overwhelmed, there are probably capacity issues anyway (i.e. too many clients). Additionally, all the servers will be making those decisions at separate times, so there is some jitter built in. Still, it is worth thinking about.
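If jitter does turn out to be needed, one inexpensive option would be to randomize the interval between limit adjustments rather than the decision itself. A sketch under that assumption (the `rand` dependency, the constant, and the function are hypothetical, not something the RFC specifies):

```rust
use rand::Rng; // assumes the `rand` crate is available
use std::time::Duration;

/// Base interval between concurrency-limit adjustments (made-up value).
const BASE_INTERVAL: Duration = Duration::from_millis(500);

/// Add up to 50% random jitter so that many Vector instances pointed at
/// the same downstream service don't all raise their limits in lockstep.
fn next_adjustment_interval() -> Duration {
    let jitter = rand::thread_rng().gen_range(0.0..0.5);
    BASE_INTERVAL.mul_f64(1.0 + jitter)
}

fn main() {
    for _ in 0..3 {
        println!("next adjustment in {:?}", next_adjustment_interval());
    }
}
```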

Bruce Guenter added 2 commits April 16, 2020 16:54
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) left a comment

Great start! One big comment about the "plan of attack". Happy to chat through it more if it helps.

@@ -0,0 +1,164 @@
# RFC 1858 - 2020-04-06 - Automatically Adjust Limits
Contributor left a comment:

Suggested change:
- # RFC 1858 - 2020-04-06 - Automatically Adjust Limits
+ # RFC 1858 - 2020-04-06 - Automatically Adjust HTTP Request Limits

I think we should make it clear that this RFC focuses on HTTP based sinks only.

@bruceg (Member, Author) replied:

This is true, and it's worth clarifying that this doesn't apply universally, but the HTTP part seems accidental. It is documented to apply at the level of the tower crate service builder, so any service that uses that framework would have this capability. Now, it happens that all of our HTTP services use the tower framework, and vice versa, but it isn't necessarily so.

Comment on lines 3 to 5
This RFC proposes a new scheme for rate limiting requests to external
services in order to maximize the sustained transmission rate over
varying conditions.
Contributor left a comment:

Suggested change:
- This RFC proposes a new scheme for rate limiting requests to external
- services in order to maximize the sustained transmission rate over
- varying conditions.
+ This RFC proposes a new scheme for rate-limiting HTTP requests to
+ external services in order to maximize the sustained transmission
+ rate over varying conditions.

Comment on lines 15 to 16
automatically rate limiting its requests to maximally fill the service's
capacity without overwhelming it and causing additional problems.
Contributor left a comment:

Suggested change:
- automatically rate limiting its requests to maximally fill the service's
- capacity without overwhelming it and causing additional problems.
+ automatically rate-limiting its HTTP requests to maximally fill the
+ service's capacity without overwhelming it and causing additional
+ problems.

Most sinks in Vector have their request structure managed by the [tower
Contributor left a comment:

Suggested change:
- Most sinks in Vector have their request structure managed by the [tower
+ Most sinks in Vector have their HTTP request structure managed by the [tower

## Plan Of Attack

* [ ] Submit a PR with spike-level code _roughly_ demonstrating the change.
* [ ] Benchmark the approach under various conditions to determine a good
Contributor left a comment:

This, I think, needs much more detail. How exactly do you plan to test/benchmark this change to ensure it:

  1. Behaves as expected.
  2. Does not regress over time.
  3. Allows us to expand as we learn more.

For example, starting simple, it would help to define a set of scenarios and their ideal outcomes. We could then replay these scenarios to verify behavior as we home in on the optimal algorithm. In other words, tests 😄 . But I think we should agree on those test criteria within this RFC. Ex:

  1. Service suddenly becomes unresponsive with sustained request timeouts.
    • Vector should eventually reduce to an in_flight_limit of 1.
  2. Service gradually increases in response time by 100ms each request.
    • Vector should eventually reduce to an in_flight_limit of 1.
  3. Service implements a rate limit of 5 requests per second. If this limit is exceeded, a 429 is returned:
    • Vector's in_flight_limit should hover around 5.
  4. etc.

It would be nice to have these static expectations to keep us honest. Beyond that, I think we would benefit from a randomized simulation of some sort. For example, we could define criteria that would cause a mock service to become unresponsive (shut down), and the test fails if the service reaches that state. The service would implement a rate limit that randomly adjusts, and, if the algorithm works, Vector should correctly home in on that rate limit as it adjusts. This should all be done within defined bounds that, if exceeded, would shut down the service.

Easier said than done! I'm happy to explain further. And I have to assume there is prior art for this kind of testing. If not, it would make for an excellent blog post!
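One way to turn scenarios like these into executable checks is a small deterministic simulation: a mock service that throttles above a fixed capacity, driven by a deliberately simplified additive-increase/multiplicative-decrease controller, with assertions on where the limit settles. A rough sketch of the idea (none of these types exist in Vector, and the mock enforces a concurrency cap rather than a requests-per-second rate to keep the units simple):

```rust
/// Mock downstream service that "throttles" (responds with 429s) whenever
/// more than `capacity` requests are in flight at once.
struct MockService {
    capacity: usize,
}

impl MockService {
    /// Returns true if a batch of `in_flight` concurrent requests succeeds.
    fn call(&self, in_flight: usize) -> bool {
        in_flight <= self.capacity
    }
}

/// Toy additive-increase / multiplicative-decrease controller, standing in
/// for whatever algorithm the RFC ultimately lands on.
struct Controller {
    limit: usize,
    max_limit: usize,
}

impl Controller {
    fn observe(&mut self, throttled: bool) {
        if throttled {
            self.limit = (self.limit / 2).max(1); // back off hard on throttling
        } else {
            self.limit = (self.limit + 1).min(self.max_limit); // probe upward slowly
        }
    }
}

#[test]
fn limit_hovers_around_service_capacity() {
    let service = MockService { capacity: 5 };
    let mut controller = Controller { limit: 1, max_limit: 100 };

    let mut observed = Vec::new();
    for _ in 0..200 {
        let throttled = !service.call(controller.limit);
        controller.observe(throttled);
        observed.push(controller.limit);
    }

    // After warm-up, the limit should hover around the real capacity
    // rather than running away to `max_limit`.
    let steady_state = &observed[100..];
    assert!(steady_state.iter().all(|&limit| limit <= 7));
    assert!(steady_state.iter().any(|&limit| limit >= 4));
}
```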

@bruceg (Member, Author) replied:

I agree this needs a more concrete testing and benchmarking plan, and these are all good ideas. Does all of this detail really belong in this implementation plan, or am I free to spread it elsewhere and just reference it here?

Contributor replied:

Not at all. Feel free to add a section or add it under the "Guide Level Proposal" section.

@binarylogic (Contributor) commented Apr 17, 2020:

And to clarify the "Plan of Attack", it should be progressive and outline an optimal plan. It does not mean we will do all of the items. For example, if a simulator is part of the plan, I would expect that to be at the end unless it's absolutely necessary. It's possible we won't do it depending on how everything comes together.

@bruceg (Member, Author) replied:

BTW, condition 3 above doesn't quite compute: a rate limit and a concurrency limit are in different units. A rate limit of 5 requests per second might be achieved at any number of different concurrency levels; it depends on the RTT between Vector and the service. More precisely, Vector's concurrency should approximate rate_limit × RTT.
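To make the units concrete: by Little's law, sustained rate ≈ concurrency / RTT, so the concurrency that saturates a given rate limit is roughly rate_limit × RTT. A tiny illustration with made-up numbers:

```rust
fn main() {
    // Little's law: sustained rate = concurrency / RTT, so the
    // concurrency needed to reach a rate limit is rate_limit * RTT.
    let rate_limit = 5.0; // requests per second allowed by the service
    for rtt in [0.1_f64, 0.5, 2.0] {
        let concurrency = rate_limit * rtt;
        println!("RTT {rtt:>4.1}s -> ~{concurrency:.1} requests in flight saturate the limit");
    }
}
```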

binarylogic and others added 4 commits April 17, 2020 17:06
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) commented:

Also, @bruceg, given this comment, can we make sure to add observability to the "plan of attack"? You can add a separate section for that as well if you'd like.

@bruceg (Member, Author) commented Apr 17, 2020

> Also, @bruceg, given this comment, can we make sure to add observability to the "plan of attack"?

Definitely, that is important too. Our InternalEvent framework doesn't seem to be well suited to what we will want to expose, but some kind of metrics would be good.

@binarylogic (Contributor) commented Apr 17, 2020

Well, I would like to think about emitting events so that we're exposing behavior through the logs. I understand that we don't want to emit events at a rapid pace, so maybe we can aggregate and emit them on an interval?

I'd put what you're thinking in the RFC so that we can comment inline and discuss.

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg (Member, Author) commented Apr 17, 2020

So a new kind of logging framework then, having attributes of both event logs (emitted immediately to the console) and metrics (emitted only periodically). Does that match what you are thinking? Would that be useful to other parts of Vector?

@binarylogic (Contributor) replied:

We should probably discuss the recent event-driven changes. It is my understanding that metrics aggregate and emit on a 1-second interval. So we can create/update metrics at a rapid pace if we like.
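As a miniature of that aggregate-then-flush idea (this is not how Vector's InternalEvent or metrics machinery is implemented; the counter, message, and interval are made up for illustration):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // The hot path can bump a counter as often as it likes...
    let throttled_requests = Arc::new(AtomicU64::new(0));

    // ...while a background task flushes the aggregate once per second,
    // so the logs are not flooded with one line per event.
    let counter = Arc::clone(&throttled_requests);
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(1));
        let count = counter.swap(0, Ordering::Relaxed);
        if count > 0 {
            println!("requests were rate limited {count} times in the last second");
        }
    });

    // Simulated hot path.
    for _ in 0..10 {
        throttled_requests.fetch_add(1, Ordering::Relaxed);
        thread::sleep(Duration::from_millis(150));
    }
    thread::sleep(Duration::from_secs(2));
}
```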

Bruce Guenter added 3 commits April 20, 2020 10:52
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>

* a histogram metric recording the observed RTTs,

* a histogram metric recording the effective concurrency limit, and
Contributor left a comment:

This is probably better represented as a gauge, right?

@bruceg (Member, Author) replied:

Well, the effective limit will go up and down as Vector bumps into limits, and it is integer-valued over a small range (e.g. 1-10), so I figured a histogram would be able to describe the minimum, the maximum, and the values where Vector spent the most time.
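A std-only sketch of the distinction being made here (the sample values are invented; real metrics would go through Vector's metrics pipeline, not a BTreeMap):

```rust
use std::collections::BTreeMap;

fn main() {
    // Effective concurrency limits sampled over some window.
    let limits: [u32; 12] = [1, 2, 3, 4, 5, 5, 5, 6, 5, 4, 5, 5];

    // A gauge only remembers the most recent value...
    let gauge = *limits.last().unwrap();

    // ...while a histogram of the same samples also shows the minimum,
    // the maximum, and the values Vector spent the most time at.
    let mut histogram: BTreeMap<u32, u32> = BTreeMap::new();
    for &limit in &limits {
        *histogram.entry(limit).or_insert(0) += 1;
    }

    println!("gauge (last value): {gauge}");
    println!("histogram (value -> count): {histogram:?}");
}
```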


* a histogram metric recording the effective concurrency limit, and

* a histogram metric recording the actual concurrent requests in flight.
Contributor left a comment:

This should probably be a gauge as well.

Comment on lines 165 to 168
* a rate-limited event logged every time a remote service explicitly
limits the request rate:

`Requests throttled by remote service, concurrency reduced.`
Contributor left a comment:

I don't think this is accurate. Vector should receive a 429 response and that'll be logged like any other error response, right? What we want, instead, is to log when Vector internally rate limits. That is, your algorithm has adjusted the in_flight_limit to 10 and an 11th request gets rate limited. When that happens we should replace this log:

TRACE tower_limit::rate::service: rate limit exceeded, disabling service

with this log:

WARN sink{name=out type=http} vector::sinks::http has been internally rate-limited, `requests.in_flight_limit` is set to 10 and this is request 11.

What do you think?

Contributor replied:

This has been discussed in #1391, but I think it's very important we communicate this properly to users, so it should be included in the scope of this RFC.

@bruceg (Member, Author) replied:

I agree that would be a useful thing to report, and it clears up some misunderstanding. It does sound, however, like something better suited to a metric than a log message. Even when rate limited, we likely want a counter for how often it happens.

@bruceg (Member, Author) commented Apr 24, 2020

The question about logging when the concurrency limit is exceeded raises another point. If we have enough data flow to maintain a concurrency of 4, for example, and service times are not increasing, the algorithm expressed here would raise the concurrency limit all the way to the maximum. I think we shouldn't raise the limit beyond 5 in that case. In this way, we only raise the limit to X+1 once we observe X concurrent requests going through without issues. Also, if there is a spike of traffic, we don't immediately blow past downstream limits but rather continue to ramp up the concurrency once the load is actually there. Does this make sense?
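A sketch of that rule in isolation (the function and parameter names are hypothetical; in practice this decision would sit alongside the RTT-based back-off described in the RFC):

```rust
/// Decide the next concurrency limit from what happened in the last
/// adjustment period. The limit is only raised one step above the
/// concurrency actually observed to succeed, so a sudden traffic spike
/// ramps up gradually instead of jumping straight to `max_limit`.
fn next_limit(
    current_limit: usize,
    max_in_flight_observed: usize,
    rtt_increasing: bool,
    max_limit: usize,
) -> usize {
    if rtt_increasing {
        // Back off: the service is showing signs of strain.
        (current_limit / 2).max(1)
    } else if max_in_flight_observed >= current_limit {
        // The whole allowance was used without trouble; probe one step higher.
        (current_limit + 1).min(max_limit)
    } else {
        // Demand never filled the current limit, so keep the limit at
        // most one above what was actually observed.
        current_limit.min(max_in_flight_observed + 1).max(1)
    }
}

#[test]
fn does_not_run_away_when_traffic_is_light() {
    // Sustained concurrency of 4 with healthy RTTs should cap the limit at 5.
    let mut limit = 10;
    for _ in 0..5 {
        limit = next_limit(limit, 4, false, 1000);
    }
    assert_eq!(limit, 5);
}
```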

Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) replied:
@bruceg very much so. I agree with that behavior. I'd update the RFC with any relevant changes.

@binarylogic (Contributor) added:

@bruceg that's another good example of a test case we should verify: a sudden increase in volume does not blow away the downstream service. That's why I think it'd be good to think through these cases and add them to this RFC; that way we can just execute on them once we're all in agreement.

Bruce Guenter added 2 commits April 29, 2020 16:48
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Comment on lines +173 to +177
* If the sender experiences a sudden increase in volume of events,
Vector will not overload the remote service with concurrent requests.
Instead, Vector will use the maximum concurrency previously set, which
will be at most one higher than the previously observed limit, and
continue to ramp up to the configured maximum from there.
Contributor left a comment:

👍

@binarylogic self-requested a review on April 29, 2020

@binarylogic (Contributor) left a comment

This looks great. Nice work! Looking forward to starting on this.

@bruceg merged commit ed8abcd into master on Apr 30, 2020
@bruceg deleted the auto-backoff-rfc branch on April 30, 2020
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request on Dec 9, 2020:
chore: RFC #1858 - 2020-04-06 - Automatic rate limit adjustment (vectordotdev#2329)

Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
Labels: type: task (Generic non-code related tasks), domain: networking (Anything related to Vector's networking), domain: sinks (Anything related to Vector's sinks), have: should (We should have this feature, but it is not required; medium priority)
Projects: none yet
Development: Successfully merging this pull request may close these issues: none yet
5 participants