
chore: RFC #1858 - 2020-04-06 - Automatic rate limit adjustment #2329

Merged
15 commits merged into master from auto-backoff-rfc on Apr 30, 2020

Conversation

@bruceg (Member) commented Apr 14, 2020

🎉

Ref #1858

Signed-off-by: Bruce Guenter <bruce@timber.io>

Rendered

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg added the labels "type: task", "domain: networking", "domain: sinks", and "have: should" on Apr 14, 2020
@bruceg self-assigned this on Apr 14, 2020
@github-actions commented:
Great PR! Please pay attention to the following items before merging:

Files matching rfcs/**:

  • Have at least 3 team members approved this RFC?

This is an automatically generated QA checklist based on modified files

@Hoverbear (Contributor) left a comment

This seems like a well-considered, minimal, known-good solution!

Signed-off-by: Bruce Guenter <bruce@timber.io>
@lukesteensen (Member) left a comment

This sounds great! Convenient that it's entirely transparent to the user and that we still remain within the existing configured limits. We'll just want to make sure to spend time making the logs as informative as possible so users can understand what's happening. We may also want to let people opt out.

@LucioFranco (Contributor) left a comment

Great write-up! One thing we may also want to consider is some sort of jitter when updating the concurrency limit. This should help prevent multiple Vector instances from deciding to raise their limits at the same moment and possibly overwhelming the downstream system.


## Outstanding Questions

* The ideal value for the weighting α is unknown. Too large a value will
Contributor left a comment:

We should probably start with a low alpha and let the user customize it. I think that makes sense; with a low enough alpha we should still be able to eventually step up to the "real" value.
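For reference, the α weighting under discussion behaves like a standard exponentially weighted moving average over the observed RTTs: a small α reacts slowly and smooths out noise, while an α close to 1 chases individual observations. A minimal Rust sketch of that behavior (the type and names here are illustrative, not from the RFC or Vector):

```rust
/// Exponentially weighted moving average of observed RTTs.
/// A small `alpha` reacts slowly (smooth but laggy); an `alpha`
/// close to 1 tracks each individual observation.
struct Ewma {
    alpha: f64,
    average: Option<f64>,
}

impl Ewma {
    fn new(alpha: f64) -> Self {
        Self { alpha, average: None }
    }

    /// Fold a new RTT observation (in seconds) into the running average.
    fn update(&mut self, rtt: f64) -> f64 {
        let avg = match self.average {
            None => rtt,
            Some(avg) => self.alpha * rtt + (1.0 - self.alpha) * avg,
        };
        self.average = Some(avg);
        avg
    }
}

fn main() {
    // Start with a low alpha, as suggested above; a single outlier
    // barely moves the average.
    let mut ewma = Ewma::new(0.1);
    for rtt in [0.050, 0.055, 0.048, 0.250, 0.052] {
        println!("observed {rtt:.3}s -> smoothed {:.3}s", ewma.update(rtt));
    }
}
```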

@bruceg (Member, Author) commented Apr 16, 2020

> One thing we may also want to consider is some sort of jitter when updating the concurrency limit. This should help prevent multiple Vector instances from deciding to raise their limits at the same moment and possibly overwhelming the downstream system.

That is at least partially mitigated by the slow increase. If the slow increase causes the downstream to be overwhelmed, there are probably capacity issues anyway (i.e. too many clients). Additionally, all the servers will be making those decisions at separate times, so there is some jitter built in. Still, it is worth thinking about.
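If jitter does turn out to be needed, one inexpensive option would be to randomize the interval between limit adjustments rather than the decision itself. A sketch under that assumption (the `rand` dependency, the constant, and the function are hypothetical, not something the RFC specifies):

```rust
use rand::Rng; // assumes the `rand` crate is available
use std::time::Duration;

/// Base interval between concurrency-limit adjustments (made-up value).
const BASE_INTERVAL: Duration = Duration::from_millis(500);

/// Add up to 50% random jitter so that many Vector instances pointed at
/// the same downstream service don't all raise their limits in lockstep.
fn next_adjustment_interval() -> Duration {
    let jitter = rand::thread_rng().gen_range(0.0..0.5);
    BASE_INTERVAL.mul_f64(1.0 + jitter)
}

fn main() {
    for _ in 0..3 {
        println!("next adjustment in {:?}", next_adjustment_interval());
    }
}
```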

Bruce Guenter added 2 commits April 16, 2020 16:54
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) left a comment

Great start! One big comment about the "plan of attack". Happy to chat through it more if it helps.

@@ -0,0 +1,164 @@
# RFC 1858 - 2020-04-06 - Automatically Adjust Limits
Contributor left a comment:

Suggested change:
- # RFC 1858 - 2020-04-06 - Automatically Adjust Limits
+ # RFC 1858 - 2020-04-06 - Automatically Adjust HTTP Request Limits

I think we should make it clear that this RFC focuses on HTTP based sinks only.

@bruceg (Member, Author) replied:

This is true, and it's worth clarifying that this doesn't apply universally, but the HTTP part seems accidental. It is documented to apply at the level of the tower crate service builder, so any service that uses that framework would have this capability. Now, it happens that all of our HTTP services use the tower framework, and vice versa, but it isn't necessarily so.

Comment on lines 3 to 5
This RFC proposes a new scheme for rate limiting requests to external
services in order to maximize the sustained transmission rate over
varying conditions.
Contributor left a comment:

Suggested change:
- This RFC proposes a new scheme for rate limiting requests to external
- services in order to maximize the sustained transmission rate over
- varying conditions.
+ This RFC proposes a new scheme for rate-limiting HTTP requests to
+ external services in order to maximize the sustained transmission
+ rate over varying conditions.

Comment on lines 15 to 16
automatically rate limiting its requests to maximally fill the service's
capacity without overwhelming it and causing additional problems.
Contributor left a comment:

Suggested change:
- automatically rate limiting its requests to maximally fill the service's
- capacity without overwhelming it and causing additional problems.
+ automatically rate-limiting its HTTP requests to maximally fill the
+ service's capacity without overwhelming it and causing additional
+ problems.

Most sinks in Vector have their request structure managed by the [tower
Contributor left a comment:

Suggested change:
- Most sinks in Vector have their request structure managed by the [tower
+ Most sinks in Vector have their HTTP request structure managed by the [tower

## Plan Of Attack

* [ ] Submit a PR with spike-level code _roughly_ demonstrating the change.
* [ ] Benchmark the approach under various conditions to determine a good
Contributor left a comment:

This, I think, needs much more detail. How exactly do you plan to test/benchmark this change to ensure it:

  1. Behaves as expected.
  2. Does not regress over time.
  3. Allows us to expand as we learn more.

For example, starting simple, it would help to define a set of scenarios and their ideal outcomes. We could then replay these scenarios to verify behavior as we home in on the optimal algorithm. In other words, tests 😄 . But I think we should agree on those test criteria within this RFC. Ex:

  1. Service suddenly becomes unresponsive with sustained request timeouts.
    • Vector should eventually reduce to an in_flight_limit of 1.
  2. Service gradually increases in response time by 100ms each request.
    • Vector should eventually reduce to an in_flight_limit of 1.
  3. Service implements a rate limit of 5 requests per second. If this limit is exceeded, a 429 is returned:
    • Vector's in_flight_limit should hover around 5.
  4. etc.

It would be nice to have these static expectations to keep us honest. Beyond that, I think we would benefit from a randomized simulation of some sort. For example, we could define criteria that would cause a mock service to become unresponsive (shut down), and the test fails if the service reaches that state. The service would implement a rate limit that randomly adjusts, and, if the algorithm works, Vector should correctly home in on that rate limit as it adjusts. This should all be done within defined bounds that, if exceeded, would shut down the service.

Easier said than done! I'm happy to explain further. And I have to assume there is prior art for this kind of testing. If not, it would make for an excellent blog post!
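One way to turn scenarios like these into executable checks is a small deterministic simulation: a mock service that throttles above a fixed capacity, driven by a deliberately simplified additive-increase/multiplicative-decrease controller, with assertions on where the limit settles. A rough sketch of the idea (none of these types exist in Vector, and the mock enforces a concurrency cap rather than a requests-per-second rate to keep the units simple):

```rust
/// Mock downstream service that "throttles" (responds with 429s) whenever
/// more than `capacity` requests are in flight at once.
struct MockService {
    capacity: usize,
}

impl MockService {
    /// Returns true if a batch of `in_flight` concurrent requests succeeds.
    fn call(&self, in_flight: usize) -> bool {
        in_flight <= self.capacity
    }
}

/// Toy additive-increase / multiplicative-decrease controller, standing in
/// for whatever algorithm the RFC ultimately lands on.
struct Controller {
    limit: usize,
    max_limit: usize,
}

impl Controller {
    fn observe(&mut self, throttled: bool) {
        if throttled {
            self.limit = (self.limit / 2).max(1); // back off hard on throttling
        } else {
            self.limit = (self.limit + 1).min(self.max_limit); // probe upward slowly
        }
    }
}

#[test]
fn limit_hovers_around_service_capacity() {
    let service = MockService { capacity: 5 };
    let mut controller = Controller { limit: 1, max_limit: 100 };

    let mut observed = Vec::new();
    for _ in 0..200 {
        let throttled = !service.call(controller.limit);
        controller.observe(throttled);
        observed.push(controller.limit);
    }

    // After warm-up, the limit should hover around the real capacity
    // rather than running away to `max_limit`.
    let steady_state = &observed[100..];
    assert!(steady_state.iter().all(|&limit| limit <= 7));
    assert!(steady_state.iter().any(|&limit| limit >= 4));
}
```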

@bruceg (Member, Author) replied:

I agree this needs a more concrete testing and benchmarking plan, and these are all good ideas. Does all of this detail really belong in this implementation plan, or am I free to spread it elsewhere and just reference it here?

Contributor replied:

Not at all. Feel free to add a section or add it under the "Guide Level Proposal" section.

@binarylogic (Contributor) commented Apr 17, 2020:

And to clarify the "Plan of Attack", it should be progressive and outline an optimal plan. It does not mean we will do all of the items. For example, if a simulator is part of the plan, I would expect that to be at the end unless it's absolutely necessary. It's possible we won't do it depending on how everything comes together.

@bruceg (Member, Author) replied:

BTW, condition 3 above doesn't quite compute: a rate limit and a concurrency limit are in different units. A rate limit of 5 requests per second might be achieved at any number of different concurrency levels; it depends on the RTT between Vector and the service. More precisely, Vector's concurrency should approximate rate_limit × RTT.
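To make the units concrete: by Little's law, sustained rate ≈ concurrency / RTT, so the concurrency that saturates a given rate limit is roughly rate_limit × RTT. A tiny illustration with made-up numbers:

```rust
fn main() {
    // Little's law: sustained rate = concurrency / RTT, so the
    // concurrency needed to reach a rate limit is rate_limit * RTT.
    let rate_limit = 5.0; // requests per second allowed by the service
    for rtt in [0.1_f64, 0.5, 2.0] {
        let concurrency = rate_limit * rtt;
        println!("RTT {rtt:>4.1}s -> ~{concurrency:.1} requests in flight saturate the limit");
    }
}
```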

binarylogic and others added 4 commits April 17, 2020 17:06
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) commented:

Also, @bruceg, given this comment, can we make sure to add observability to the "plan of attack"? You can add a separate section for that as well if you'd like.

@bruceg (Member, Author) commented Apr 17, 2020

> Also, @bruceg, given this comment, can we make sure to add observability to the "plan of attack"?

Definitely, that is important too. Our InternalEvent framework doesn't seem to be well suited to what we will want to expose, but some kind of metrics would be good.

@binarylogic (Contributor) commented Apr 17, 2020

Well, I would like to think about emitting events so that we're exposing behavior through the logs. I understand that we don't want to emit events at a rapid pace, so maybe we can aggregate and emit them on an interval?

I'd put what you're thinking in the RFC so that we can comment inline and discuss.

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg (Member, Author) commented Apr 17, 2020

So a new kind of logging framework then, having attributes of both event logs (emitted immediately to the console) and metrics (emitted only periodically). Does that match what you are thinking? Would that be useful to other parts of Vector?

@binarylogic (Contributor) replied:

We should probably discuss the recent event-driven changes. It is my understanding that metrics aggregate and emit on a 1-second interval. So we can create/update metrics at a rapid pace if we like.
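As a miniature of that aggregate-then-flush idea (this is not how Vector's InternalEvent or metrics machinery is implemented; the counter, message, and interval are made up for illustration):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // The hot path can bump a counter as often as it likes...
    let throttled_requests = Arc::new(AtomicU64::new(0));

    // ...while a background task flushes the aggregate once per second,
    // so the logs are not flooded with one line per event.
    let counter = Arc::clone(&throttled_requests);
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(1));
        let count = counter.swap(0, Ordering::Relaxed);
        if count > 0 {
            println!("requests were rate limited {count} times in the last second");
        }
    });

    // Simulated hot path.
    for _ in 0..10 {
        throttled_requests.fetch_add(1, Ordering::Relaxed);
        thread::sleep(Duration::from_millis(150));
    }
    thread::sleep(Duration::from_secs(2));
}
```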

Bruce Guenter added 3 commits April 20, 2020 10:52
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>

* a histogram metric recording the observed RTTs,

* a histogram metric recording the effective concurrency limit, and
Contributor left a comment:

This is probably better represented as a gauge, right?

@bruceg (Member, Author) replied:

Well, the effective limit will go up and down as Vector bumps into limits, and it is integer-valued over a small range (e.g. 1-10), so I figured a histogram would be able to describe the minimum, the maximum, and the values where Vector spent the most time.
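A std-only sketch of the distinction being made here (the sample values are invented; real metrics would go through Vector's metrics pipeline, not a BTreeMap):

```rust
use std::collections::BTreeMap;

fn main() {
    // Effective concurrency limits sampled over some window.
    let limits: [u32; 12] = [1, 2, 3, 4, 5, 5, 5, 6, 5, 4, 5, 5];

    // A gauge only remembers the most recent value...
    let gauge = *limits.last().unwrap();

    // ...while a histogram of the same samples also shows the minimum,
    // the maximum, and the values Vector spent the most time at.
    let mut histogram: BTreeMap<u32, u32> = BTreeMap::new();
    for &limit in &limits {
        *histogram.entry(limit).or_insert(0) += 1;
    }

    println!("gauge (last value): {gauge}");
    println!("histogram (value -> count): {histogram:?}");
}
```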


* a histogram metric recording the effective concurrency limit, and

* a histogram metric recording the actual concurrent requests in flight.
Contributor left a comment:

This should probably be a gauge as well.

Comment on lines 165 to 168
* a rate-limited event logged every time a remote service explicitly
limits the request rate:

`Requests throttled by remote service, concurrency reduced.`
Contributor left a comment:

I don't think this is accurate. Vector should receive a 429 response and that'll be logged like any other error response, right? What we want, instead, is to log when Vector internally rate limits. That is, your algorithm has adjusted the in_flight_limit to 10 and an 11th request gets rate limited. When that happens we should replace this log:

TRACE tower_limit::rate::service: rate limit exceeded, disabling service

with this log:

WARN sink{name=out type=http} vector::sinks::http has been internally rate-limited, `requests.in_flight_limit` is set to 10 and this is request 11.

What do you think?

Contributor replied:

This has been discussed in #1391, but I think it's very important we communicate this properly to users, so it should be included in the scope of this RFC.

@bruceg (Member, Author) replied:

I agree that would be a useful thing to report, and it clears up some misunderstanding. It does sound, however, like something better suited to a metric than a log message. Even when rate limited, we likely want a counter for how often it happens.

@bruceg (Member, Author) commented Apr 24, 2020

The question about logging when the concurrency limit is exceeded raises another point. If we have enough data flow to maintain a concurrency of 4, for example, and service times are not increasing, the algorithm expressed here would raise the concurrency limit all the way to the maximum. I think we shouldn't raise the limit beyond 5 in that case. In this way, we only raise the limit to X+1 once we observe X concurrent requests going through without issues. Also, if there is a spike of traffic, we don't immediately blow past downstream limits but rather continue to ramp up the concurrency once the load is actually there. Does this make sense?
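A sketch of that rule in isolation (the function and parameter names are hypothetical; in practice this decision would sit alongside the RTT-based back-off described in the RFC):

```rust
/// Decide the next concurrency limit from what happened in the last
/// adjustment period. The limit is only raised one step above the
/// concurrency actually observed to succeed, so a sudden traffic spike
/// ramps up gradually instead of jumping straight to `max_limit`.
fn next_limit(
    current_limit: usize,
    max_in_flight_observed: usize,
    rtt_increasing: bool,
    max_limit: usize,
) -> usize {
    if rtt_increasing {
        // Back off: the service is showing signs of strain.
        (current_limit / 2).max(1)
    } else if max_in_flight_observed >= current_limit {
        // The whole allowance was used without trouble; probe one step higher.
        (current_limit + 1).min(max_limit)
    } else {
        // Demand never filled the current limit, so keep the limit at
        // most one above what was actually observed.
        current_limit.min(max_in_flight_observed + 1).max(1)
    }
}

#[test]
fn does_not_run_away_when_traffic_is_light() {
    // Sustained concurrency of 4 with healthy RTTs should cap the limit at 5.
    let mut limit = 10;
    for _ in 0..5 {
        limit = next_limit(limit, 4, false, 1000);
    }
    assert_eq!(limit, 5);
}
```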

Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic (Contributor) replied:
@bruceg very much so. I agree with that behavior. I'd update the RFC with any relevant changes.

@binarylogic (Contributor) added:

@bruceg that's another good example of a test case we should verify: a sudden increase in volume does not blow away the downstream service. That's why I think it'd be good to think through these cases and add them to this RFC; that way we can just execute on them once we're all in agreement.

Bruce Guenter added 2 commits April 29, 2020 16:48
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Comment on lines +173 to +177
* If the sender experiences a sudden increase in volume of events,
Vector will not overload the remote service with concurrent requests.
Instead, Vector will use the maximum concurrency previously set, which
will be at most one higher than the previously observed limit, and
continue to ramp up to the configured maximum from there.
Contributor left a comment:

👍

@binarylogic self-requested a review on April 29, 2020

@binarylogic (Contributor) left a comment

This looks great. Nice work! Looking forward to starting on this.

@bruceg merged commit ed8abcd into master on Apr 30, 2020
@bruceg deleted the auto-backoff-rfc branch on April 30, 2020
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request on Dec 9, 2020:
chore: RFC #1858 - 2020-04-06 - Automatic rate limit adjustment (vectordotdev#2329)

Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Brian Menges <brian.menges@anaplan.com>
Labels: type: task (Generic non-code related tasks), domain: networking (Anything related to Vector's networking), domain: sinks (Anything related to Vector's sinks), have: should (We should have this feature, but it is not required; medium priority)
Projects: none yet
Development: Successfully merging this pull request may close these issues: none yet
5 participants