
rt: reduce no-op wakeups in the multi-threaded scheduler #4383

Merged
merged 4 commits on Jan 13, 2022

Conversation

carllerche (Member) commented Jan 5, 2022

This PR reduces the number of times worker threads wake up without having work to do in the multi-threaded scheduler. Unnecessary wake-ups are expensive and slow down the scheduler. I have observed this change reduce no-op wakes by up to 50%.

The multi-threaded scheduler is work-stealing. When a worker has tasks to process and other workers are idle (parked), the idle workers must be unparked so that they can steal work from the busy worker. However, unparking threads is expensive, so there is an optimization that skips the unpark when a worker already exists in a "searching" state (unparked and looking for work). This works well, but transitioning from 1 searching worker to 0 introduces a race condition in which a thread unpark can be lost:

  • thread 1: the last searching worker is about to exit the searching state.
  • thread 2: needs to unpark a thread, but skips it because a searching worker exists.
  • thread 1: exits the searching state without seeing thread 2's work.

Because this should be a rare condition, Tokio solves this by always unparking a new worker when the current worker:

  • is the last searching worker
  • is transitioning out of searching
  • has work to process.

When the newly unparked worker wakes, if the race condition described above happened, "thread 2"'s work will be found. Otherwise, it will just go back to sleep.
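The rule above can be sketched with a minimal model of the "searching workers" counter. This is illustrative only, not Tokio's actual Idle implementation, and the method names here are assumptions based on the description in this PR:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative model of the searching-worker counter described above;
// not Tokio's actual code.
struct Idle {
    num_searching: AtomicUsize,
}

impl Idle {
    fn new() -> Self {
        Idle { num_searching: AtomicUsize::new(0) }
    }

    // A notifier skips the (expensive) unpark when some worker is
    // already searching for work.
    fn should_unpark_on_new_work(&self) -> bool {
        self.num_searching.load(Ordering::SeqCst) == 0
    }

    fn transition_to_searching(&self) {
        self.num_searching.fetch_add(1, Ordering::SeqCst);
    }

    // Returns true when the caller was the last searching worker.
    // Per the rule above, a last searcher that still has work must
    // then unpark a peer, in case a notification raced with this
    // transition and was skipped.
    fn transition_from_searching(&self) -> bool {
        self.num_searching.fetch_sub(1, Ordering::SeqCst) == 1
    }
}

fn main() {
    let idle = Idle::new();
    idle.transition_to_searching();
    // While a searcher exists, new work does not unpark anyone...
    assert!(!idle.should_unpark_on_new_work());
    // ...so the last searcher leaving the state must cover the race
    // by unparking a peer if it found work.
    assert!(idle.transition_from_searching());
}
```

The boolean returned by `transition_from_searching` is what lets the last searcher know it is responsible for waking a peer.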

Now we come to the issue at hand. A bug incorrectly set a worker to "searching" when the I/O driver unparked its thread. When the scheduler was only partially loaded and could operate with a single active worker, the I/O driver would unpark that worker as new I/O events arrived; the worker would incorrectly transition to "searching", find the work generated by the inbound I/O events, incorrectly transition from last searcher to no searchers, and unpark a new thread. The new thread would wake, find no work, and go back to sleep.

Note that when the scheduler is fully saturated, this change has no impact: most workers are always unparked, so the optimization that avoids unparking threads (described at the top) already applies.

Benchmarks

Hyper

This benches Hyper's "hello" server using wrk -t1 -c400 -d10s http://127.0.0.1:3000/

master

Requests/sec: 118495.92
Transfer/sec: 9.94MB

this PR

Requests/sec: 135261.99
Transfer/sec: 11.35MB

mini-redis

This benches mini-redis using redis-benchmark -c 5 -t get,set

master

====== SET ======
  100000 requests completed in 1.03 seconds
  5 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

99.17% <= 0.1 milliseconds
99.82% <= 0.2 milliseconds
99.92% <= 0.3 milliseconds
99.98% <= 0.4 milliseconds
99.99% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
96711.80 requests per second

====== GET ======
  100000 requests completed in 1.03 seconds
  5 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

99.68% <= 0.1 milliseconds
99.82% <= 0.2 milliseconds
99.93% <= 0.3 milliseconds
99.98% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.7 milliseconds
97370.98 requests per second

this PR

====== SET ======
  100000 requests completed in 0.83 seconds
  5 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

99.11% <= 0.1 milliseconds
99.72% <= 0.2 milliseconds
99.92% <= 0.3 milliseconds
99.98% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
120627.27 requests per second

====== GET ======
  100000 requests completed in 0.81 seconds
  5 parallel clients
  3 bytes payload
  keep alive: 1
  multi-thread: no

99.72% <= 0.1 milliseconds
99.85% <= 0.2 milliseconds
99.89% <= 0.3 milliseconds
99.97% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
123152.71 requests per second

@github-actions github-actions bot added the R-loom Run loom tests on this PR label Jan 5, 2022
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime labels Jan 6, 2022
@carllerche force-pushed the investigate-noop-wakes branch 2 times, most recently from 9bc41b1 to 175d8b8 on January 11, 2022 00:16
tokio/src/runtime/thread_pool/worker.rs:

@@ -620,8 +620,7 @@ impl Core {
     // If a task is in the lifo slot, then we must unpark regardless of
     // being notified
     if self.lifo_slot.is_some() {
-        worker.shared.idle.unpark_worker_by_id(worker.index);
-        self.is_searching = true;
+        self.is_searching = !worker.shared.idle.unpark_worker_by_id(worker.index);
carllerche (Member, Author) commented:
This is the key bit. When the worker wakes, it only enters the "searching" state if it was unparked by another thread. This distinguishes an unpark from another thread, which signals the worker to steal work, from an unpark by the I/O driver because events arrived.
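The return-value semantics being relied on here can be sketched as follows. This is an illustrative model, not Tokio's real Idle type; the sleep-set representation is an assumption:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Illustrative model (not Tokio's actual code): the sleep set holds
// the ids of currently parked workers.
struct Idle {
    sleepers: Mutex<HashSet<usize>>,
}

impl Idle {
    // Returns true only if *this* call removed the worker from the
    // sleep set. A false return means some other thread already
    // unparked this worker, i.e. the unpark carries a "steal work"
    // signal and the worker should enter the searching state.
    fn unpark_worker_by_id(&self, id: usize) -> bool {
        self.sleepers.lock().unwrap().remove(&id)
    }
}

fn main() {
    // Worker 0 is parked; the I/O driver path unparks it directly.
    let idle = Idle { sleepers: Mutex::new(HashSet::from([0])) };
    let woke_self = idle.unpark_worker_by_id(0);
    // is_searching = !woke_self: if we removed ourselves from the
    // sleep set, no other thread unparked us, so this wake came from
    // the I/O driver and must not count as "searching".
    let is_searching = !woke_self;
    assert!(!is_searching);
}
```

A second call for the same id returns false, which models the case where another thread won the race and the wake should be treated as a steal signal.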

hawkw (Member) replied:

it might be worth having a comment in the code explaining that? not a hard blocker though.

carllerche (Member, Author) replied:

done

@carllerche carllerche changed the title rt: investigate noop wakes rt: reduce no-op wakeups in the multi-threaded scheduler Jan 12, 2022
@carllerche carllerche marked this pull request as ready for review January 12, 2022 06:06
carllerche (Member, Author) commented:
@Darksonn @hawkw this should be ready for review, it would be nice to have it confirmed in a "real app" too.

blt (Contributor) commented Jan 12, 2022:

> @Darksonn @hawkw this should be ready for review, it would be nice to have it confirmed in a "real app" too.

I'm planning on wiring this up into vector and running it through our integrated benchmarks. Assuming all goes well I'll link numbers here in a handful of hours.

hawkw (Member) approved these changes Jan 12, 2022:
This looks good to me --- the actual change is pretty simple and makes sense based on the explanation in the PR description.

It might be nice to add a bit to the comments in this code explaining this behavior, but that's not a hard blocker.

I'd like to test this out in Linkerd today or tomorrow, I'll post benchmark results if I get a chance to do that!

tokio/src/runtime/thread_pool/worker.rs (outdated review thread, resolved)
blt (Contributor) commented Jan 12, 2022:

No change in observed vector throughput, per these results. That said, we don't track CPU use in our experimental setup and the observation duration in a PR is constrained to 200 seconds to make turn-around time feasible.

carllerche (Member, Author) commented:
@blt thanks for checking. It would be interesting to know the CPU load, as that would tell us whether your workload falls within the scope of this change. If you are saturating all workers in your tests, there would be no visible improvement, since the fix only applies when just a few workers are kept busy.

blt (Contributor) commented Jan 12, 2022:

Unfortunately we don't capture that kind of saturation information yet, though we do have an issue for it: vectordotdev/vector#10456. That said, we're intending to fully saturate vector so it's reasonable to believe there's little idle time.
