To reproduce, create a `Cargo.toml` with this:

```toml
[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"
```
And in the same directory, create a `main.rs` containing:

```rust
use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an entirely
            // fresh pool from which to get mutable scratch space for each
            // search. Basically, a 'Regex' internally looks like this:
            //
            //   struct Regex {
            //       // Among other things, this contains the literal
            //       // prefilters and the Thompson VM bytecode
            //       // instructions.
            //       read_only: Arc<ReadOnly>,
            //       // Contains space used by the regex matcher
            //       // during search time. e.g., The DFA transition
            //       // table for the lazy DFA or the set of active
            //       // threads for the Thompson NFA simulation.
            //       pool: Pool<ScratchSpace>,
            //   }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool. That
            // thread gets access to a special value ONLY accessible to that
            // thread, whereas all other threads accessing the pool get sent
            // through the "slow" path via a mutex. When a lot of threads
            // share the same regex **with the same pool**, this mutex comes
            // under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       pop pointer out of the pool
            //       mutex unlock
            //   do a search
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       push pointer back into pool
            //       mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e., when
            // the haystack is tiny, as in this case, the mutex contention ends
            // up dominating the runtime. As the number of threads increases,
            // the contention gets worse and worse and thus runtime blows up.
            //
            // But, all of that contention can be avoided by giving each thread
            // a fresh regex and thus each one gets its own pool and each
            // thread gets the "first" thread optimization applied. So the
            // internal access for the mutable scratch space now looks like
            // this:
            //
            //   is "first" thread optimization active?
            //   YES: return pointer to special mutable scratch space
            //   do a search
            //   is "first" thread optimization active?
            //   YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues. Specifically,
            // its memory usage scaled with the number of active threads using
            // a regex, whereas the current approach scales with the number of
            // active threads *simultaneously* using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero** competing
            // resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware of
            // threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }

    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc but then share it across all threads.
        // Each thread in turn competes with the single regex's shared memory
        // pool for mutable scratch space to use during a search. This is what
        // ultimately causes this 'shared' benchmark to be much slower than the
        // 'cloned' benchmark when run with many threads. Indeed, profiling it
        // reveals that most of the time is spent in the regex internal 'Pool'
        // type's 'get' and 'get_slow' methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}

fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}
```
Now build and run the benchmark:

```
$ cargo build --release
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
As noted in the comments in the code above, the only difference between these two benchmarks is that `cloned` creates a fresh `Regex` for each thread, whereas `shared` uses the same `Regex` across multiple threads. This leads to contention on a single `Mutex` in the latter case and overall very poor performance. (See the code comments above for more info.)
We can confirm this by looking at a profile. Here is a screenshot from `perf` for the `cloned` benchmark:

And now for the `shared` benchmark:

As we can see, in the `shared` benchmark, virtually all of the time is being spent locking and unlocking the mutex.