
implement a faster memory pool #188

@BurntSushi

Description


Problem

When a regex search executes, it has to choose a matching engine (sometimes more than one) to carry out the search. Each matching engine needs some amount of fixed mutable space on the heap to carry out a search, which I'll call "scratch space." In general, this space is reusable and reusing the space leads to significant performance benefits when using a regular expression to carry out multiple searches. (For example, the scratch space may contain computed DFA states.) Scratch space is used every time a regular expression executes a search. For example, calling re.find_iter("...") will execute possibly many searches, depending on how many matches it finds.

Here are some constraints I've been working with:

  1. A Regex must be Send and Sync. This permits one to share a regex across multiple threads without any external synchronization.
  2. The regex crate should never spawn a thread.
  3. It's OK to have multiple copies of the scratch space. In other words, it's OK to use a little bit more memory to reduce contention.
  4. Using a regex in a single thread is far more prevalent than sharing a regex across multiple threads, so it is OK to optimize for the single threaded use case.

Constraint (1) is the killer, because it means synchronizing concurrent access to mutable state. For example, one might have an Arc<Regex> where the Regex is used simultaneously among multiple threads. If we gave up on the Sync bound, then callers would need to either Clone regular expressions to use them across multiple threads, or put them in a Mutex. A Mutex is a bit surprising since Regex doesn't have any public mutable methods, and it also performs very poorly, because one search completely blocks all other searches. Cloning is somewhat acceptable, if a little wasteful. That is, if we dropped the Sync bound, I'd expect users to clone regular expressions when they want to use them across multiple threads.

Constraint (2) just seems like good sense. To me, a thread spawned as a result of running a regex violates the principle of least surprise.

Constraint (3) permits the implementation to be simple and should make contention be a non-factor. If we were more memory conscious, we'd never copy the scratch space, which would mean that each of the regex engines would be forced to do their own type of synchronization. Not only is this much more complex, but it means contention could be a real problem while searching, which seems unfortunate.

Given all of this, it is my belief that the key thing worth optimizing is the overhead of synchronization itself. The cost of copying the scratch space should be amortized through reuse, and contention should be extremely limited since synchronization only needs to occur at the start and end of every search. (I suppose contention could become an issue if a regex that matches very often on very short spans is used simultaneously across multiple threads. The good thing about this is that the caller could work around this by simply cloning the regex, avoiding contention altogether.)

Current solution

The most straightforward solution to this problem is to wrap some collection data structure in a Mutex. The data structure could trivially be a queue, stack or (singly) linked list. The current implementation is so simple that it can be digested in a quick skim. In particular, the only operations we need to support are push and pop. No ordering invariants are necessary since all copies of scratch space are equally usable for every regex search.

Benchmarks

I have three sets of benchmarks representing three pretty easy-to-implement strategies. The first is the baseline, which uses a RefCell<Vec<T>>. This obviously does no synchronization, so in this world, Regex doesn't impl Sync. The second is the current solution, which uses a Mutex<Vec<T>>. The third uses a lock-free stack from crossbeam, TreiberStack<T>.

Here is a comparison between the baseline and Mutex<Vec<T>>, showing only benchmarks with more than a 10% difference (a positive percent indicates how much slower mutexes are than refcells):

$ cargo-benchcmp rust-refcell.5 rust-mutex.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-mutex.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         79 (4,936 MB/s)                   30  61.22%
misc::anchored_literal_short_match  48 (541 MB/s)           77 (337 MB/s)                     29  60.42%
misc::easy0_1K                      226 (4,650 MB/s)        275 (3,821 MB/s)                  49  21.68%
misc::easy0_32                      182 (324 MB/s)          225 (262 MB/s)                    43  23.63%
misc::easy1_1K                      257 (4,062 MB/s)        298 (3,503 MB/s)                  41  15.95%
misc::easy1_32                      160 (325 MB/s)          193 (269 MB/s)                    33  20.62%
misc::medium_32                     255 (235 MB/s)          289 (207 MB/s)                    34  13.33%
misc::one_pass_long_prefix          182 (142 MB/s)          218 (119 MB/s)                    36  19.78%
misc::one_pass_long_prefix_not      154 (168 MB/s)          189 (137 MB/s)                    35  22.73%
misc::one_pass_short                112 (151 MB/s)          147 (115 MB/s)                    35  31.25%
misc::one_pass_short_not            112 (151 MB/s)          146 (116 MB/s)                    34  30.36%
sherlock::everything_greedy         6,838,126 (87 MB/s)     8,261,531 (72 MB/s)        1,423,405  20.82%
sherlock::everything_greedy_nl      5,502,273 (108 MB/s)    6,761,252 (87 MB/s)        1,258,979  22.88%
sherlock::letters                   34,884,851 (17 MB/s)    65,399,440 (9 MB/s)       30,514,589  87.47%
sherlock::letters_lower             33,173,868 (17 MB/s)    60,289,303 (9 MB/s)       27,115,435  81.74%
sherlock::letters_upper             3,787,833 (157 MB/s)    4,650,326 (127 MB/s)         862,493  22.77%
sherlock::name_alt4                 257,680 (2,308 MB/s)    295,505 (2,013 MB/s)          37,825  14.68%
sherlock::name_whitespace           53,913 (11,035 MB/s)    60,149 (9,890 MB/s)            6,236  11.57%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,056,631 (289 MB/s)         422,625  25.86%
sherlock::words                     12,970,070 (45 MB/s)    20,320,171 (29 MB/s)       7,350,101  56.67%

And here is a comparison between the baseline and TreiberStack<T>:

$ cargo-benchcmp rust-refcell.5 rust-treiber.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter   diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         129 (3,023 MB/s)                    80  163.27%
misc::anchored_literal_short_match  48 (541 MB/s)           129 (201 MB/s)                      81  168.75%
misc::easy0_1K                      226 (4,650 MB/s)        319 (3,294 MB/s)                    93   41.15%
misc::easy0_32                      182 (324 MB/s)          268 (220 MB/s)                      86   47.25%
misc::easy1_1K                      257 (4,062 MB/s)        342 (3,052 MB/s)                    85   33.07%
misc::easy1_32                      160 (325 MB/s)          241 (215 MB/s)                      81   50.62%
misc::hard_32                       307 (192 MB/s)          382 (154 MB/s)                      75   24.43%
misc::medium_1K                     609 (1,727 MB/s)        695 (1,513 MB/s)                    86   14.12%
misc::medium_32                     255 (235 MB/s)          339 (176 MB/s)                      84   32.94%
misc::no_exponential                485 (206 MB/s)          576 (173 MB/s)                      91   18.76%
misc::not_literal                   271 (188 MB/s)          350 (145 MB/s)                      79   29.15%
misc::one_pass_long_prefix          182 (142 MB/s)          268 (97 MB/s)                       86   47.25%
misc::one_pass_long_prefix_not      154 (168 MB/s)          234 (111 MB/s)                      80   51.95%
misc::one_pass_short                112 (151 MB/s)          191 (89 MB/s)                       79   70.54%
misc::one_pass_short_not            112 (151 MB/s)          190 (89 MB/s)                       78   69.64%
sherlock::everything_greedy         6,838,126 (87 MB/s)     10,548,218 (56 MB/s)         3,710,092   54.26%
sherlock::ing_suffix                3,027,510 (196 MB/s)    3,548,306 (167 MB/s)           520,796   17.20%
sherlock::ing_suffix_limited_space  2,965,394 (200 MB/s)    3,444,708 (172 MB/s)           479,314   16.16%
sherlock::letters                   34,884,851 (17 MB/s)    109,569,267 (5 MB/s)        74,684,416  214.09%
sherlock::letters_lower             33,173,868 (17 MB/s)    106,638,088 (5 MB/s)        73,464,220  221.45%
sherlock::letters_upper             3,787,833 (157 MB/s)    6,230,733 (95 MB/s)          2,442,900   64.49%
sherlock::name_alt4                 257,680 (2,308 MB/s)    363,242 (1,637 MB/s)           105,562   40.97%
sherlock::name_whitespace           53,913 (11,035 MB/s)    71,406 (8,331 MB/s)             17,493   32.45%
sherlock::quotes                    947,223 (628 MB/s)      1,065,966 (558 MB/s)           118,743   12.54%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,651,996 (224 MB/s)         1,017,990   62.30%
sherlock::words                     12,970,070 (45 MB/s)    32,089,109 (18 MB/s)        19,119,039  147.41%

I note that the comparisons above seem entirely expected to me. Outside of noise, synchronization makes matching universally slower. Moreover, the benchmarks that actually show up in the list (which is a subset of all benchmarks) correspond to benchmarks with many searches over short haystacks, or many short matches over long haystacks. This is exactly the case where constant overhead will make a difference.

And, to make it easier to read, a comparison between Mutex<Vec<T>> and TreiberStack<T>:

$ cargo-benchcmp rust-mutex.5 rust-treiber.5 --threshold 10
name                                rust-mutex.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter   diff %
misc::anchored_literal_long_match   79 (4,936 MB/s)       129 (3,023 MB/s)                    50   63.29%
misc::anchored_literal_short_match  77 (337 MB/s)         129 (201 MB/s)                      52   67.53%
misc::easy0_1K                      275 (3,821 MB/s)      319 (3,294 MB/s)                    44   16.00%
misc::easy0_32                      225 (262 MB/s)        268 (220 MB/s)                      43   19.11%
misc::easy1_1K                      298 (3,503 MB/s)      342 (3,052 MB/s)                    44   14.77%
misc::easy1_32                      193 (269 MB/s)        241 (215 MB/s)                      48   24.87%
misc::hard_32                       334 (176 MB/s)        382 (154 MB/s)                      48   14.37%
misc::medium_32                     289 (207 MB/s)        339 (176 MB/s)                      50   17.30%
misc::no_exponential                520 (192 MB/s)        576 (173 MB/s)                      56   10.77%
misc::not_literal                   297 (171 MB/s)        350 (145 MB/s)                      53   17.85%
misc::one_pass_long_prefix          218 (119 MB/s)        268 (97 MB/s)                       50   22.94%
misc::one_pass_long_prefix_not      189 (137 MB/s)        234 (111 MB/s)                      45   23.81%
misc::one_pass_short                147 (115 MB/s)        191 (89 MB/s)                       44   29.93%
misc::one_pass_short_not            146 (116 MB/s)        190 (89 MB/s)                       44   30.14%
sherlock::everything_greedy         8,261,531 (72 MB/s)   10,548,218 (56 MB/s)         2,286,687   27.68%
sherlock::everything_greedy_nl      6,761,252 (87 MB/s)   5,485,594 (108 MB/s)        -1,275,658  -18.87%
sherlock::ing_suffix                3,210,605 (185 MB/s)  3,548,306 (167 MB/s)           337,701   10.52%
sherlock::ing_suffix_limited_space  3,117,710 (190 MB/s)  3,444,708 (172 MB/s)           326,998   10.49%
sherlock::letters                   65,399,440 (9 MB/s)   109,569,267 (5 MB/s)        44,169,827   67.54%
sherlock::letters_lower             60,289,303 (9 MB/s)   106,638,088 (5 MB/s)        46,348,785   76.88%
sherlock::letters_upper             4,650,326 (127 MB/s)  6,230,733 (95 MB/s)          1,580,407   33.98%
sherlock::name_alt4                 295,505 (2,013 MB/s)  363,242 (1,637 MB/s)            67,737   22.92%
sherlock::name_whitespace           60,149 (9,890 MB/s)   71,406 (8,331 MB/s)             11,257   18.72%
sherlock::the_whitespace            2,056,631 (289 MB/s)  2,651,996 (224 MB/s)           595,365   28.95%
sherlock::words                     20,320,171 (29 MB/s)  32,089,109 (18 MB/s)        11,768,938   57.92%

I admit that I found it somewhat surprising that a lock-free stack was being beaten by a mutex, but this is almost certainly due to my complete ignorance of lock-free algorithms. (Hence, why I'm writing this ticket.)

Other things

When using a TreiberStack<T>, perf top reports mem::epoch::participant::Participant::enter as a potential hotspot.

When using a Mutex<Vec<T>>, perf top reports pthread_mutex_lock and __pthread_mutex_unlock_usercnt as potential hot spots.

In both cases, other hotspots of course appear as well, such as methods in the DFA, memchr and so on.

Another thing I've noticed in profiling is how much time is being spent in the Drop impl for PoolGuard. I tracked this down to time spent in memcpy, so I moved the representation of scratch spaces to a Box, which definitely helped, especially for the DFA whose cache struct isn't tiny. I'm not sure what can be done about this though.

???

So what can we do here? Is crossbeam's TreiberStack the best lock-free algorithm available to us? Does crossbeam have overhead that we could eliminate with a custom implementation? Other ideas?

Scope

In my view, the overhead of synchronization is holding us back from being more competitive with PCRE on very short matches/haystacks.
