enhancement(sinks): Adaptive Request Concurrency #3094

Merged: bruceg merged 83 commits into master from auto-concurrency on Aug 11, 2020
Conversation

@bruceg (Member) commented Jul 17, 2020

Here is my implementation of automatic concurrency management, which is in use in all "service2"-based sinks (which appears to be everything except prometheus and aws_cloudwatch_logs).

I will continue to work on the tests started in auto_concurrency/service.rs, but otherwise I believe this is feature complete.

Closes #2529
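
For context: as the commit series below describes, the controller smooths observed request RTTs with an EWMA ("Introduce new EWMA abstraction") and adjusts concurrency from that signal. A minimal sketch of the smoothing, using hypothetical names rather than Vector's actual types:

```rust
/// A minimal EWMA sketch; names are illustrative, not Vector's actual code.
struct Ewma {
    average: Option<f64>,
    alpha: f64, // smoothing factor in (0, 1]; higher weights recent samples more
}

impl Ewma {
    fn new(alpha: f64) -> Self {
        Self { average: None, alpha }
    }

    /// Fold a new RTT sample (in seconds) into the running average.
    fn update(&mut self, sample: f64) -> f64 {
        let avg = match self.average {
            None => sample, // the first sample seeds the average
            Some(avg) => self.alpha * sample + (1.0 - self.alpha) * avg,
        };
        self.average = Some(avg);
        avg
    }
}
```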

Bruce Guenter and others added 30 commits July 17, 2020 14:08
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
The implementation is complicated by the fact that we can only decrease
concurrency by acquiring permits and forgetting them. If there aren't
enough free permits, this requires waiting for them when polling for
readiness.

Signed-off-by: Bruce Guenter <bruce@timber.io>
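
A minimal sketch of the acquire-and-forget mechanism described in the commit message above, assuming a recent tokio `Semaphore`; the type and method names are illustrative, not Vector's actual code:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

/// Sketch of a semaphore whose effective size can change at runtime.
struct ShrinkableSemaphore(Arc<Semaphore>);

impl ShrinkableSemaphore {
    /// Growing is immediate: just hand the semaphore more permits.
    fn expand(&self, n: usize) {
        self.0.add_permits(n);
    }

    /// Shrinking must acquire permits and "forget" them so they are never
    /// returned to the pool. If not enough permits are free, this waits for
    /// in-flight requests to finish, which is why polling for readiness can
    /// block on a concurrency decrease.
    async fn shrink(&self, n: usize) {
        for _ in 0..n {
            let permit = self.0.acquire().await.expect("semaphore closed");
            permit.forget();
        }
    }
}
```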
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Has issues that need correcting:

* has recursive locking via `fn contract` and `fn expand`
* adjusts on every measurement, not once per RTT
* adjusts for any response, does not differentiate

Signed-off-by: Bruce Guenter <bruce@timber.io>
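
The second issue above (adjusting on every measurement) is addressed later in the series by the "Limit adjustments to once per RTT" commit. A hypothetical sketch of such gating, not taken from the PR:

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch: allow at most one concurrency adjustment per RTT window.
struct AdjustmentGate {
    next_update: Instant,
}

impl AdjustmentGate {
    /// Returns true at most once per `rtt` period; measurements arriving
    /// before the window elapses are aggregated rather than acted on.
    fn should_adjust(&mut self, now: Instant, rtt: Duration) -> bool {
        if now >= self.next_update {
            self.next_update = now + rtt;
            true
        } else {
            false
        }
    }
}
```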
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
I think the problem is that the Timeout layer needs to be inside the
AutoConcurrencyLimit layer, but the Error type in the Timeout layer is a
Box<Error>, which differs from the RetryLogic Error type.

Signed-off-by: Bruce Guenter <bruce@timber.io>
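
A rough sketch of that layering, using tower's stock concurrency_limit and timeout layers as stand-ins (assuming tower 0.4 with the "limit" and "timeout" features; AutoConcurrencyLimit itself is Vector's own layer):

```rust
use std::time::Duration;
use tower::{Service, ServiceBuilder};

/// Place the timeout *inside* the concurrency limit so the limiter observes
/// timeouts as completed (failed) requests. Note that tower's Timeout boxes
/// the error type, which is the clash with RetryLogic's concrete Error type
/// mentioned in the commit message above.
fn build<S, R>(inner: S) -> impl Service<R>
where
    S: Service<R>,
    S::Error: Into<tower::BoxError>,
{
    ServiceBuilder::new()
        .concurrency_limit(8) // outer layer, analogous to AutoConcurrencyLimit
        .timeout(Duration::from_secs(60)) // inner per-request timeout
        .service(inner)
}
```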
Signed-off-by: Ana Hobden <operator@hoverbear.org>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
let mut inner = self.inner.lock().expect("Controller mutex is poisoned");

#[cfg(test)]
let mut stats = self.stats.lock().expect("Stats mutex is poisoned");
@MOZGIII (Contributor) commented Aug 5, 2020


It might be useful to switch to try_lock here, and in the rest of the places if possible, to avoid the additional locking affecting test results.

@bruceg (Member, Author)

So, if we use lock, then the stats may not reflect how the sink will operate in practice because there is additional lock contention. If we use try_lock, then the stats may not reflect how the sink actually did operate because it sometimes didn't take the lock and so couldn't report stats, resulting in incomplete data. Either way, stats can't always reflect actual operation, which I agree is unfortunate. Using try_lock also requires additional conditionals when trying to report stats. I think I'll stick with lock unless there is another consideration.

Now, I can reorganize the code a bit to make the time the mutex is held even shorter, which will reduce the effect. Do you have any thoughts on whether it is better to take the mutex once and hold it for a bit longer in adjust_to_back_pressure, or to take it once for each place it is needed and drop it immediately?
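
For illustration (not the PR's code), the try_lock variant with its extra conditional looks roughly like this:

```rust
use std::sync::Mutex;

struct Stats {
    observations: u64,
}

/// `lock` always records but can add contention that perturbs the very
/// timing being measured; `try_lock` never blocks but silently drops
/// observations under contention, leaving incomplete data.
fn record(stats: &Mutex<Stats>) {
    if let Ok(mut s) = stats.try_lock() {
        s.observations += 1;
    }
    // else: the extra conditional path — this sample is simply lost
}
```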

@MOZGIII (Contributor) commented Aug 6, 2020

Yeah, I see the issue with the try_lock access to stats.

I had some thoughts on using lock-free ring buffers (channels?) to collect the stats for test purposes, and then to aggregate them separately. But it's very early - I'm still reading through the code and wrapping my head around what's going on.

A bit preliminary, but to me, implementing channel-based mocks to tightly control the execution in lock-step looks promising here. This would allow us to assert the states in all the right places, and eliminate the need for manual locking and the associated race conditions.

The idea is that the code under test sends notifications over a channel as it reaches certain states, and then waits on another (rx) channel before continuing. The test logic waits for a particular notification, performs its assertions, and sends a message that lets the tested code continue execution. We might even be able to use exact values in the tests and make the tests deterministic.

Again, this is very preliminary - I'll have to take a deeper look at whether this has any actual benefits over what we have already - but I'm very curious to hear your thoughts on this.
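
A minimal, self-contained sketch of that handshake, using hypothetical state names and std channels rather than tokio's:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The code under test announces each state here...
    let (state_tx, state_rx) = mpsc::channel::<&'static str>();
    // ...and waits here before continuing.
    let (resume_tx, resume_rx) = mpsc::channel::<()>();

    let worker = thread::spawn(move || {
        state_tx.send("acquired permit").unwrap();
        resume_rx.recv().unwrap(); // blocked until the test has asserted
        state_tx.send("sent request").unwrap();
        resume_rx.recv().unwrap();
    });

    // Test side: wait for each notification, assert, then release.
    assert_eq!(state_rx.recv().unwrap(), "acquired permit");
    resume_tx.send(()).unwrap();
    assert_eq!(state_rx.recv().unwrap(), "sent request");
    resume_tx.send(()).unwrap();
    worker.join().unwrap();
}
```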

@bruceg (Member, Author)

I had considered using channels for reporting test metrics, but couldn't see a path that was as straightforward as using locking. It is worth considering, though.

I see there being roughly three levels of testing: low-level (does it take the steps the code says it should), behavior (do those steps produce the expected high-level behavior), and performance (does it interoperate optimally with live servers). It sounds like you are describing some fairly low-level tests, covering some of the steps tested in auto_concurrency/service.rs, using tokio's timers to manually advance the steps. It might be worth covering other paths, though.

@MOZGIII (Contributor) commented Aug 7, 2020

Yes, you're right - the approach I'm thinking about is just a step above, and very similar to, tracing the code along its execution in lock-step, and asserting that exactly what's expected happens.

I'm thinking of tokio timers as just one of the ways to advance.
Consider keeping the state at the test end, rather than as part of the struct, and only passing it into the Controller code via a channel. That channel would pass ownership of and access to the state, which would then naturally be passed between the logical threads to interleave the updates and assertions.

@MOZGIII (Contributor) commented Aug 5, 2020

I ran tests locally and got some failures.

The command I used was TEST_LOG=trace cargo test --lib -- sinks::util::auto_concurrency --nocapture 2>&1 | tee log.txt.

Output: log.txt

Update: if I run cargo test --lib without filtering the tests, the outcome is the same. It also reproduces reliably.

I encourage others to run the tests locally to gather more stats (literally).

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg (Member, Author) commented Aug 5, 2020

Hmm, I can definitely confirm the stats issues you are having, as I am seeing them now too. When I added the tests, though, I ran them many times to ensure the boundaries were reasonable. While it would be good to track down what changed, and to adjust the boundaries to ensure tests now pass, this makes me wonder if these tests are reliable enough to keep around.

Bruce Guenter added 2 commits August 5, 2020 17:58
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@binarylogic added the domain: performance label (Anything related to Vector's performance) and removed the type: performance label on Aug 6, 2020
This restores the previous defaults for the fixed concurrency limits,
and enables the automatic limiter if `request.in_flight_limit = "auto"`.

This fixes the self-tests that were accidentally broken by commit
8605d64, which changed the behavior of specifying an in_flight_limit
number in the settings.

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg (Member, Author) commented Aug 6, 2020

The issue is that commit 8605d64 switched the behavior of specifying a number for in_flight_limit, effectively putting the tests back on fixed concurrency instead of variable concurrency. I discovered this while working on making the variable concurrency opt-in, and I have a fix.

@MOZGIII (Contributor) left a review

Looks good overall! We might want to revisit the tests, though.

Bruce Guenter added 6 commits August 10, 2020 15:45
The previous cheap fix to make in_flight_limit an opt-in parameter did
not properly keep the concurrency limit fixed: when RTT increased, it
would still decrease the concurrency at some point, reverting to the
variable concurrency behavior. This change completely turns off the
varying mechanism when in_flight_limit is not auto.

Signed-off-by: Bruce Guenter <bruce@timber.io>
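
A hypothetical sketch of the resulting switch (illustrative types, not the actual settings code):

```rust
/// The in-flight limit is either a fixed number or fully adaptive.
enum InFlightLimit {
    Auto,
    Fixed(usize),
}

/// With a fixed setting, the varying mechanism is completely bypassed;
/// only "auto" follows the controller's current adaptive value.
fn effective_limit(setting: &InFlightLimit, adaptive: usize) -> usize {
    match setting {
        InFlightLimit::Auto => adaptive,
        InFlightLimit::Fixed(n) => *n,
    }
}
```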
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
Signed-off-by: Bruce Guenter <bruce@timber.io>
@lukesteensen (Member) left a review

Looks good! Excited to get this in and collect some real-world feedback.

Ideally these should run with exact timing, but I haven't worked that in
yet.

Signed-off-by: Bruce Guenter <bruce@timber.io>
@bruceg merged commit 0f15567 into master on Aug 11, 2020
@bruceg deleted the auto-concurrency branch on August 11, 2020 20:20
fanatid pushed a commit that referenced this pull request Aug 13, 2020
* Simplify retry logic with `matches!`

* Import base concurrency limiter from tower

* Use new stub AutoConcurrencyLimit layer in service2-based sinks

* Drop unused Load impl on AutoConcurrencyLimit

* Adjust pub markers and drop more unused methods

* Introduce Controller wrapper for the semaphore

The implementation is complicated by the fact that we can only decrease
concurrency by acquiring permits and forgetting them. If there aren't
enough free permits, this requires waiting for them when polling for
readiness.

* Link the semaphore controller and start time into the response future

* Add function to controller to calculate average RTT

* Initial implementation of RTT-based concurrency adjustment

* Limit adjustments to once per RTT

* Introduce new EWMA abstraction

* Wrap the shrinkable semaphore in its own module

* Aggregate responses for each RTT interval

* Handle back pressure through applying RetryLogic

Signed-off-by: Bruce Guenter <bruce@timber.io>

* Fixup ;)

Signed-off-by: Ana Hobden <operator@hoverbear.org>

* Add statistics-based behavior tests

* Move creation of Metric from a Measurement into `src/event/metric.rs`

* Move `get_controller` and `capture_metrics` into src/metrics.rs

* Introduce an `assert_within` macro for range matches

* Add and test internal metrics for the controller

* Add get_ref and get_mut methods to access internals of our service layers

* Make test_util into a module

* Update Cargo for new get_ref methods in tokio03

* Enhance the test_util stats

* Make the automatic limit opt-in by default

This restores the previous defaults for the fixed concurrency limits,
and enables the automatic limiter if `request.in_flight_limit = "auto"`.

* Only run the tests on unix systems, due to timing variability problems

Ideally these should run with exact timing, but I haven't worked that in
yet.

Signed-off-by: Bruce Guenter <bruce@timber.io>

Co-authored-by: Ana Hobden <operator@hoverbear.org>
fanatid pushed a second commit that referenced this pull request on Aug 13, 2020, with the same message as above, additionally signed off by Kirill Fomichev <fanatid@ya.ru>.
@binarylogic changed the title from "enhancement(sinks): Automatic concurrency management" to "enhancement(sinks): Adaptive concurrency management" on Nov 26, 2020
@binarylogic changed the title from "enhancement(sinks): Adaptive concurrency management" to "enhancement(sinks): Adaptive Request Concurrency" on Nov 26, 2020
mengesb pushed a commit to jacobbraaten/vector that referenced this pull request on Dec 9, 2020, with the same message as above, additionally signed off by Brian Menges <brian.menges@anaplan.com>.
@jszwedko restored the auto-concurrency branch on September 27, 2023 18:41
@jszwedko deleted the auto-concurrency branch on September 27, 2023 18:41

Labels

domain: networking (Anything related to Vector's networking), domain: performance (Anything related to Vector's performance), domain: sinks (Anything related to Vector's sinks), type: enhancement (A value-adding code change that enhances its existing functionality)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement spike level framework for automatic request limiting

6 participants