automata: reduce regex contention considerably #1080

BurntSushi · 2023-08-30T23:19:37Z

Context: A Regex uses internal mutable space (called a Cache)
while executing a search. Since a Regex really wants to be easily
shared across multiple threads simultaneously, it follows that a
Regex either needs to provide search functions that accept a &mut Cache (thereby pushing synchronization to a problem for the caller
to solve) or it needs to do synchronization itself. While there are
lower level APIs in regex-automata that do the former, they are
less convenient. The higher level APIs, especially in the regex
crate proper, need to do some kind of synchronization to give a
search the mutable Cache that it needs.

The current approach to that synchronization essentially uses a
Mutex<Vec<Cache>> with an optimization for the "owning" thread
that lets it bypass the Mutex. The owning thread optimization
makes it so the single threaded use case essentially doesn't pay for
any synchronization overhead, and that all works fine. But once the
Regex is shared across multiple threads, that Mutex<Vec<Cache>>
gets hit. And if you're doing a lot of regex searches on short
haystacks in parallel, that Mutex comes under extremely heavy
contention. To the point that a program can slow down by enormous
amounts.

This PR attempts to address that problem.

Note that it's worth pointing out that this issue can be worked
around.

The simplest work-around is to clone a Regex and send it to other
threads instead of sharing a single Regex. This won't use any
additional memory (a Regex is reference counted internally),
but it will force each thread to use the "owner" optimization
described above. This does mean, for example, that you can't
share a Regex across multiple threads conveniently with a
lazy_static/OnceCell/OnceLock/whatever.

The other work-around is to use the lower level search APIs on a
meta::Regex in the regex-automata crate. Those APIs accept a
&mut Cache explicitly. In that case, you can use the thread_local
crate or even an actual thread_local! or something else entirely.

I wish I could say this PR was a home run that fixed the contention
issues with Regex once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple other heuristics. The stack is chosen
by doing self.stacks[thread_id % 8]. It's a pretty dumb strategy,
but it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a cloned and shared
approach. The cloned approach clones the regex before sending it to
each thread where as the shared approach shares a single regex across
multiple threads. The cloned approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
shared approach, however, hit the shared stack behind a mutex and
suffers majorly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a
24-core CPU).

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'

And here's what it looks like after this PR:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
    6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'

So instead of things getting over 215x slower in the 64 thread case, it
"only" gets 6x slower.

Closes #934

film42 · 2023-08-31T14:27:46Z

One question I posed on twitter was what is the best case scenario here where all threads are sharing the same Pool (which will be much worse than the cloned case). I didn't see that the thread_id assignment was made by this library to be an incrementing number which means the bench where 8 threads are testing in parallel is the best case. And, if you scale up to 16 threads with MAX_POOL_STACKS=16 you get similar performance as 8 threads. At 32, it begins to slow down further and I suspect it's cpu contention starting to build up on the Pool. This was run on a 16c/32t threadripper cpu.

Max pool stack @ 8. Including here as a reference for the benchmark running on my machine. THREADS=16 is expected to be slower (and match your PR comment), and it does.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/examples/repro
  Time (mean ± σ):      41.4 ms ±   6.7 ms    [User: 206.4 ms, System: 1.7 ms]
  Range (min … max):    30.2 ms …  57.4 ms    65 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/examples/repro
  Time (mean ± σ):     132.9 ms ±  10.5 ms    [User: 956.1 ms, System: 210.2 ms]
  Range (min … max):   113.0 ms … 147.9 ms    24 runs

Max pool stack @ 16. This matches the case where the max pool stack is 8 and you run with 8 threads. That's good.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/examples/repro
  Time (mean ± σ):      34.2 ms ±   4.8 ms    [User: 314.5 ms, System: 3.1 ms]
  Range (min … max):    26.9 ms …  43.0 ms    77 runs

Max pool stack @ 32. This is consistently ~2x slower than Max pool stack @ 16.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=32 ./target/release/examples/repro
  Time (mean ± σ):      81.9 ms ±   6.7 ms    [User: 1667.0 ms, System: 8.9 ms]
  Range (min … max):    66.4 ms …  94.4 ms    33 runs

So I think this strategy of sharing a pool at 8 threads is a great first step and should be included. If someone has a clever idea for growing the stack when concurrent callers exceed the pool size in a way that isn't slow to manage, that would be worth pursuing.

Even so, for multi-core apps that need "as fast as possible" throughput but also depend on regex, cloned is still 2-5x faster. It would be cool if there was a clippy lint that could warn you about this.

> **Context:** A `Regex` uses internal mutable space (called a `Cache`) > while executing a search. Since a `Regex` really wants to be easily > shared across multiple threads simultaneously, it follows that a > `Regex` either needs to provide search functions that accept a `&mut > Cache` (thereby pushing synchronization to a problem for the caller > to solve) or it needs to do synchronization itself. While there are > lower level APIs in `regex-automata` that do the former, they are > less convenient. The higher level APIs, especially in the `regex` > crate proper, need to do some kind of synchronization to give a > search the mutable `Cache` that it needs. > > The current approach to that synchronization essentially uses a > `Mutex<Vec<Cache>>` with an optimization for the "owning" thread > that lets it bypass the `Mutex`. The owning thread optimization > makes it so the single threaded use case essentially doesn't pay for > any synchronization overhead, and that all works fine. But once the > `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>` > gets hit. And if you're doing a lot of regex searches on short > haystacks in parallel, that `Mutex` comes under extremely heavy > contention. To the point that a program can slow down by enormous > amounts. > > This PR attempts to address that problem. > > Note that it's worth pointing out that this issue can be worked > around. > > The simplest work-around is to clone a `Regex` and send it to other > threads instead of sharing a single `Regex`. This won't use any > additional memory (a `Regex` is reference counted internally), > but it will force each thread to use the "owner" optimization > described above. This does mean, for example, that you can't > share a `Regex` across multiple threads conveniently with a > `lazy_static`/`OnceCell`/`OnceLock`/whatever. > > The other work-around is to use the lower level search APIs on a > `meta::Regex` in the `regex-automata` crate. Those APIs accept a > `&mut Cache` explicitly. In that case, you can use the `thread_local` > crate or even an actual `thread_local!` or something else entirely. I wish I could say this PR was a home run that fixed the contention issues with `Regex` once and for all, but it's not. It just makes things a fair bit better by switching from one stack to eight stacks for the pool, plus a couple other heuristics. The stack is chosen by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but it limits extra memory usage while at least reducing contention. Obviously, it works a lot better for the 8-16 thread case, and while it helps with the 64-128 thread case too, things are still pretty slow there. A benchmark for this problem is described in #934. We compare 8 and 16 threads, and for each thread count, we compare a `cloned` and `shared` approach. The `cloned` approach clones the regex before sending it to each thread where as the `shared` approach shares a single regex across multiple threads. The `cloned` approach is expected to be fast (and it is) because it forces each thread into the owner optimization. The `shared` approach, however, hit the shared stack behind a mutex and suffers majorly from contention. Here's what that benchmark looks like before this PR for 64 threads (on a 24-core CPU). ``` $ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master" Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 128.3 ms, System: 5.7 ms] Range (min … max): 7.7 ms … 11.1 ms 278 runs Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master Time (mean ± σ): 1.938 s ± 0.036 s [User: 4.827 s, System: 41.401 s] Range (min … max): 1.885 s … 1.992 s 10 runs Summary 'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran 215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master' ``` And here's what it looks like after this PR: ``` $ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro" Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro Time (mean ± σ): 9.0 ms ± 0.6 ms [User: 127.6 ms, System: 6.2 ms] Range (min … max): 7.9 ms … 11.7 ms 287 runs Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro Time (mean ± σ): 55.0 ms ± 5.1 ms [User: 1050.4 ms, System: 12.0 ms] Range (min … max): 46.1 ms … 67.3 ms 57 runs Summary 'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran 6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro' ``` So instead of things getting over 215x slower in the 64 thread case, it "only" gets 6x slower. Closes #934

See #1080 for thoughts on why this doesn't seem to help.

BurntSushi · 2023-09-02T14:06:54Z

I also experimented with another optimization here suggested by @Shnatsel where, instead of creating new pool values when it's under heavy contention, we clone from an existing pool value. The idea here is that cloning should be cheaper overall especially when it comes to the lazy DFA because it will avoid searches needing to re-compute the transition table. This optimization does complicate the implementation somewhat, but it's not terrible. This optimization also requires a different benchmark than the one in #934, since the one in #934 is just an empty regex and doesn't really benefit from having parts of the transition table pre-computed.

So I played around with that idea. I tweaked the benchmark to use \pL{10}[A-Z] as a pattern and πρςστυφχψωZ as a haystack (among other sizes). I chose that pattern and haystack because it should cause the lazy DFA to compute some new transition for each position in the haystack, thus making cache reuse as "valuable" as possible. With the cloning optimization, I was able to observe some very small improvement, but timings usually seemed to come out in a wash.

I wrote this on Discord explaining why I think the optimization doesn't help:

@Shnatsel my theory after trying some more inputs and thinking about
it a bit more is that there is a feedback loop. when contention is
low, cloning from an existing cache doesn't help because, well, the
cloning optimization only really kicks in when contention is high.
one thing that limits contention is longer search times. so as you
grow the size of the regex and/or haystack, search times increase
and contention drops because more time is being spent in the regex
engine. therefore, it seems likely that the cloning optimization is
only going to have a chance of helping when the search time is very
short. but when the search time is very short, there's less of an
opportunity (i think) for the cache reuse to actually help.

i have confirmed, via profiling, that without the cloning
optimization, the search itself is spending a bit more time building
the transition table. and that with the cloning optimization,
that part of the profile essentially disappears. so the cloning
optimization is behaving as intended, but it's just not helping as
much as i think we hoped it would. namely, even if it makes the
search times faster, that in turn is going to increase contention and
apparently negate the gains.

In short, there's a feedback loop where making search times faster doesn't necessarily lead to overall faster times because it increases contention.

Even though there is a plausibly small improvement in one benchmark, I've also seen it result in slower times overall. Since the win isn't clear and since it complicates the implementation, I'm going to forgo this optimization until new evidence emerges that suggests it's a good idea.

If anyone else wants to experiment, the optimization is implemented in the ag/betta-pool-clone-optimization branch. I probably won't keep that branch around forever, but I'll leave it up for now.

BurntSushi · 2023-09-02T15:14:44Z

This PR is on crates.io regex 1.9.5 and regex-automata 0.3.8.

BurntSushi mentioned this pull request Aug 30, 2023

sharing one regex across many threads can lead to big slowdowns due to mutex contention #934

Closed

BurntSushi force-pushed the ag/betta-pool branch from 2aa6744 to 142bc91 Compare September 1, 2023 21:27

BurntSushi changed the title ~~automata: reduce regex contention somewhat~~ automata: reduce regex contention considerably Sep 1, 2023

BurntSushi force-pushed the ag/betta-pool branch from 142bc91 to b3facd3 Compare September 1, 2023 21:50

BurntSushi added a commit that referenced this pull request Sep 2, 2023

automata: experiment with a cloning optimization

e021533

See #1080 for thoughts on why this doesn't seem to help.

BurntSushi merged commit f578d74 into master Sep 2, 2023
16 checks passed

BurntSushi deleted the ag/betta-pool branch September 2, 2023 14:42

BurntSushi restored the ag/betta-pool branch September 2, 2023 15:13

BurntSushi deleted the ag/betta-pool branch September 2, 2023 15:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

automata: reduce regex contention considerably #1080

automata: reduce regex contention considerably #1080

BurntSushi commented Aug 30, 2023 •

edited

film42 commented Aug 31, 2023 •

edited

BurntSushi commented Sep 2, 2023

BurntSushi commented Sep 2, 2023

automata: reduce regex contention considerably #1080

automata: reduce regex contention considerably #1080

Conversation

BurntSushi commented Aug 30, 2023 • edited

film42 commented Aug 31, 2023 • edited

BurntSushi commented Sep 2, 2023

BurntSushi commented Sep 2, 2023

BurntSushi commented Aug 30, 2023 •

edited

film42 commented Aug 31, 2023 •

edited