
sharing one regex across many threads can lead to big slowdowns due to mutex contention #934

@BurntSushi

Description

To reproduce, create a Cargo.toml with this:

[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"

And in the same directory, create a main.rs containing:

use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an entirely
            // fresh pool from which to get mutable scratch space for each
            // search. Basically, a 'Regex' internally looks like this:
            //
            //   struct Regex {
            //     // Among other things, this contains the literal
            //     // prefilters and the Thompson VM bytecode
            //     // instructions.
            //     read_only: Arc<ReadOnly>,
            //     // Contains space used by the regex matcher
            //     // during search time. e.g., The DFA transition
            //     // table for the lazy DFA or the set of active
            //     // threads for the Thompson NFA simulation.
            //     pool: Pool<ScratchSpace>,
            //   }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool. That
            // thread gets access to a special value ONLY accessible to that
            // thread, whereas all other threads accessing the pool get
            // sent through the "slow" path via a mutex. When a lot of threads
            // share the same regex **with the same pool**, this mutex comes
            // under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       pop pointer out of the pool
            //       mutex unlock
            //   do a search
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       push pointer back into pool
            //       mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e., when
            // the haystack is tiny, as in this case, the mutex contention ends
            // up dominating the runtime. As the number of threads increases,
            // the contention gets worse and worse and thus runtime blows up.
            //
            // But, all of that contention can be avoided by giving each thread
            // a fresh regex and thus each one gets its own pool and each
            // thread gets the "first" thread optimization applied. So the
            // internal access for the mutable scratch space now looks like
            // this:
            //
            //   is "first" thread optimization active?
            //   YES: return pointer to special mutable scratch space
            //   do a search
            //   is "first" thread optimization active?
            //   YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues. Specifically,
            // its memory usage scaled with the number of active threads using
            // a regex, whereas the current approach scales with the number of
            // active threads *simultaneously* using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero** competing
            // resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware of
            // threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }

    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc but then share it across all threads.
        // Each thread in turn competes for the single regex's shared memory
        // pool whenever it needs mutable scratch space during a search. This
        // is what ultimately causes this 'shared' benchmark to be much slower
        // than the 'cloned' benchmark when run with many threads. Indeed,
        // profiling it reveals that most of the time is spent in the regex
        // crate's internal 'Pool' type's 'get' and 'get_slow' methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}

fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}

Now build and run the benchmark:

$ cargo build --release 
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

As noted in the comments in the code above, the only difference between these two benchmarks is that cloned creates a fresh Regex for each thread, whereas shared uses the same Regex across all threads. The latter leads to contention on a single Mutex and, in turn, very poor performance. (See the code comments above for more info.)
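
To make the access pattern described in the code comments more concrete, here is a simplified model of the shared pool's slow path. This is not the regex crate's actual 'Pool' implementation; the SharedPool, Scratch, and is_match_like names below are made up purely for illustration:

use std::sync::Mutex;

// Stand-in for the per-search mutable scratch space (e.g. the lazy DFA's
// transition table). The real type lives inside the regex crate; this is
// only a model of the pattern described in the comments above.
struct Scratch;

// Simplified model of the shared pool's slow path: every search checks
// scratch space out from behind a mutex and checks it back in afterwards.
struct SharedPool {
    stack: Mutex<Vec<Scratch>>,
}

impl SharedPool {
    fn is_match_like(&self, haystack: &str) -> bool {
        // mutex lock, pop scratch space out of the pool, mutex unlock
        let scratch = self.stack.lock().unwrap().pop().unwrap_or(Scratch);
        // "do a search" -- with an empty pattern and an eight byte haystack
        // this is nearly free, so the two critical sections around it are
        // where the threads pile up.
        let matched = haystack.contains("");
        // mutex lock, push scratch space back into the pool, mutex unlock
        self.stack.lock().unwrap().push(scratch);
        matched
    }
}

In the cloned benchmark each thread owns its own pool and takes the "first thread" fast path instead, so there is no lock to fight over; in the shared benchmark, sixteen threads funnel through that one mutex twice per is_match call.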

We can confirm this by looking at a profile. Here is a screenshot from perf for the cloned benchmark:

[perf profile screenshot for the cloned benchmark]

And now for the shared benchmark:

[perf profile screenshot for the shared benchmark]

As we can see, in the shared benchmark, virtually all of the time is being spent locking and unlocking the mutex.
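
As the comments in the code note, callers can opt out of this contention by giving each thread its own clone of the Regex. Here is one minimal sketch of how an application might do that, using std's thread_local! macro; the LOCAL_RE and local_is_match names are hypothetical and not part of the regex crate:

use std::cell::RefCell;

use regex::Regex;

thread_local! {
    // One lazily-initialized clone of the regex per thread. Cloning is cheap
    // (the compiled program is reference counted), and each clone owns a
    // fresh scratch-space pool, so every thread gets the "first thread"
    // fast path instead of fighting over a shared mutex.
    static LOCAL_RE: RefCell<Option<Regex>> = RefCell::new(None);
}

fn local_is_match(shared: &Regex, haystack: &str) -> bool {
    LOCAL_RE.with(|slot| {
        let mut slot = slot.borrow_mut();
        let re = slot.get_or_insert_with(|| shared.clone());
        re.is_match(haystack)
    })
}

A trade-off similar to the old thread_local-crate approach mentioned in the comments applies here: memory usage scales with the number of threads that ever use the regex, not just those using it simultaneously.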
