## Problem
When a regex search executes, it has to choose a matching engine (sometimes more than one) to carry out the search. Each matching engine needs some amount of fixed mutable space on the heap to carry out a search, which I'll call "scratch space." In general, this space is reusable, and reusing it leads to significant performance benefits when a regular expression carries out multiple searches. (For example, the scratch space may contain computed DFA states.) Scratch space is used every time a regular expression executes a search. For example, calling `re.find_iter("...")` will execute possibly many searches, depending on how many matches it finds.
Here are some constraints I've been working with:

1. A `Regex` must be `Send` and `Sync`. This permits one to share a regex across multiple threads without any external synchronization.
2. The `regex` crate should never spawn a thread.
3. It's OK to have multiple copies of the scratch space. In other words, it's OK to use a little bit more memory to reduce contention.
4. Using a regex in a single thread is far more prevalent than sharing a regex across multiple threads, so it is OK to optimize for the single-threaded use case.
Constraint (1) is the killer, because it means synchronizing concurrent access to mutable state. For example, one might have an `Arc<Regex>` where the `Regex` is used simultaneously by multiple threads. If we gave up on the `Sync` bound, then callers would need to either `Clone` regular expressions to use them across multiple threads, or put them in a `Mutex`. The latter is a bit surprising, since `Regex` doesn't have any public mutable methods. It is also very poor performance-wise, because one search would completely block all other searches. The former, cloning, is somewhat acceptable, if a little wasteful. That is, if we dropped the `Sync` bound, I'd expect users to clone regular expressions if they want to use them across multiple threads.
Constraint (2) just seems like good sense. To me, a thread spawned as a result of running a regex violates the principle of least surprise.
Constraint (3) permits the implementation to be simple and should make contention a non-factor. If we were more memory conscious, we'd never copy the scratch space, which would mean that each of the regex engines would be forced to do their own type of synchronization. Not only is this much more complex, but it means contention could be a real problem while searching, which seems unfortunate.
Given all of this, it is my belief that the key thing worth optimizing is the overhead of synchronization itself. The cost of copying the scratch space should be amortized through reuse, and contention should be extremely limited since synchronization only needs to occur at the start and end of every search. (I suppose contention could become an issue if a regex that matches very often on very short spans is used simultaneously across multiple threads. The good thing about this is that the caller could work around this by simply cloning the regex, avoiding contention altogether.)
## Current solution
The most straightforward solution to this problem is to wrap some collection data structure in a `Mutex`. The data structure could trivially be a queue, stack, or (singly) linked list. The current implementation is so simple that it can be digested in a quick skim. In particular, the only operations we need to support are `get` and `pop`. No ordering invariants are necessary, since all copies of scratch space are equally usable for every regex search.
## Benchmarks
I have three sets of benchmarks representing three pretty easy-to-implement strategies. The first is the baseline, which uses a `RefCell<Vec<T>>`. This obviously does no synchronization, so in this world, `Regex` doesn't impl `Sync`. The second is the current solution, which uses `Mutex<Vec<T>>`. The third uses a lock-free stack from `crossbeam`, namely `TreiberStack<T>`.

Here is a comparison between the baseline and `Mutex<Vec<T>>`, showing only benchmarks with more than a 10% difference (a positive percent indicates how much slower mutexes are than refcells):
```
$ cargo-benchcmp rust-refcell.5 rust-mutex.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-mutex.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         79 (4,936 MB/s)                   30  61.22%
misc::anchored_literal_short_match  48 (541 MB/s)           77 (337 MB/s)                     29  60.42%
misc::easy0_1K                      226 (4,650 MB/s)        275 (3,821 MB/s)                  49  21.68%
misc::easy0_32                      182 (324 MB/s)          225 (262 MB/s)                    43  23.63%
misc::easy1_1K                      257 (4,062 MB/s)        298 (3,503 MB/s)                  41  15.95%
misc::easy1_32                      160 (325 MB/s)          193 (269 MB/s)                    33  20.62%
misc::medium_32                     255 (235 MB/s)          289 (207 MB/s)                    34  13.33%
misc::one_pass_long_prefix          182 (142 MB/s)          218 (119 MB/s)                    36  19.78%
misc::one_pass_long_prefix_not      154 (168 MB/s)          189 (137 MB/s)                    35  22.73%
misc::one_pass_short                112 (151 MB/s)          147 (115 MB/s)                    35  31.25%
misc::one_pass_short_not            112 (151 MB/s)          146 (116 MB/s)                    34  30.36%
sherlock::everything_greedy         6,838,126 (87 MB/s)     8,261,531 (72 MB/s)        1,423,405  20.82%
sherlock::everything_greedy_nl      5,502,273 (108 MB/s)    6,761,252 (87 MB/s)        1,258,979  22.88%
sherlock::letters                   34,884,851 (17 MB/s)    65,399,440 (9 MB/s)       30,514,589  87.47%
sherlock::letters_lower             33,173,868 (17 MB/s)    60,289,303 (9 MB/s)       27,115,435  81.74%
sherlock::letters_upper             3,787,833 (157 MB/s)    4,650,326 (127 MB/s)         862,493  22.77%
sherlock::name_alt4                 257,680 (2,308 MB/s)    295,505 (2,013 MB/s)          37,825  14.68%
sherlock::name_whitespace           53,913 (11,035 MB/s)    60,149 (9,890 MB/s)            6,236  11.57%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,056,631 (289 MB/s)         422,625  25.86%
sherlock::words                     12,970,070 (45 MB/s)    20,320,171 (29 MB/s)       7,350,101  56.67%
```
And here is a comparison between the baseline and `TreiberStack<T>`:
```
$ cargo-benchcmp rust-refcell.5 rust-treiber.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         129 (3,023 MB/s)                    80  163.27%
misc::anchored_literal_short_match  48 (541 MB/s)           129 (201 MB/s)                      81  168.75%
misc::easy0_1K                      226 (4,650 MB/s)        319 (3,294 MB/s)                    93  41.15%
misc::easy0_32                      182 (324 MB/s)          268 (220 MB/s)                      86  47.25%
misc::easy1_1K                      257 (4,062 MB/s)        342 (3,052 MB/s)                    85  33.07%
misc::easy1_32                      160 (325 MB/s)          241 (215 MB/s)                      81  50.62%
misc::hard_32                       307 (192 MB/s)          382 (154 MB/s)                      75  24.43%
misc::medium_1K                     609 (1,727 MB/s)        695 (1,513 MB/s)                    86  14.12%
misc::medium_32                     255 (235 MB/s)          339 (176 MB/s)                      84  32.94%
misc::no_exponential                485 (206 MB/s)          576 (173 MB/s)                      91  18.76%
misc::not_literal                   271 (188 MB/s)          350 (145 MB/s)                      79  29.15%
misc::one_pass_long_prefix          182 (142 MB/s)          268 (97 MB/s)                       86  47.25%
misc::one_pass_long_prefix_not      154 (168 MB/s)          234 (111 MB/s)                      80  51.95%
misc::one_pass_short                112 (151 MB/s)          191 (89 MB/s)                       79  70.54%
misc::one_pass_short_not            112 (151 MB/s)          190 (89 MB/s)                       78  69.64%
sherlock::everything_greedy         6,838,126 (87 MB/s)     10,548,218 (56 MB/s)         3,710,092  54.26%
sherlock::ing_suffix                3,027,510 (196 MB/s)    3,548,306 (167 MB/s)           520,796  17.20%
sherlock::ing_suffix_limited_space  2,965,394 (200 MB/s)    3,444,708 (172 MB/s)           479,314  16.16%
sherlock::letters                   34,884,851 (17 MB/s)    109,569,267 (5 MB/s)        74,684,416  214.09%
sherlock::letters_lower             33,173,868 (17 MB/s)    106,638,088 (5 MB/s)        73,464,220  221.45%
sherlock::letters_upper             3,787,833 (157 MB/s)    6,230,733 (95 MB/s)          2,442,900  64.49%
sherlock::name_alt4                 257,680 (2,308 MB/s)    363,242 (1,637 MB/s)           105,562  40.97%
sherlock::name_whitespace           53,913 (11,035 MB/s)    71,406 (8,331 MB/s)             17,493  32.45%
sherlock::quotes                    947,223 (628 MB/s)      1,065,966 (558 MB/s)           118,743  12.54%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,651,996 (224 MB/s)         1,017,990  62.30%
sherlock::words                     12,970,070 (45 MB/s)    32,089,109 (18 MB/s)        19,119,039  147.41%
```
I note that the comparisons above seem entirely expected to me. Outside of noise, synchronization makes matching universally slower. Moreover, the benchmarks that actually show up in the list (which is a subset of all benchmarks) are those with many searches over short haystacks, or many short matches over long haystacks. This is exactly the case where constant per-search overhead makes a difference.
And, to make it easier to read, a comparison between `Mutex<Vec<T>>` and `TreiberStack<T>`:
```
$ cargo-benchcmp rust-mutex.5 rust-treiber.5 --threshold 10
name                                rust-mutex.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   79 (4,936 MB/s)       129 (3,023 MB/s)                    50  63.29%
misc::anchored_literal_short_match  77 (337 MB/s)         129 (201 MB/s)                      52  67.53%
misc::easy0_1K                      275 (3,821 MB/s)      319 (3,294 MB/s)                    44  16.00%
misc::easy0_32                      225 (262 MB/s)        268 (220 MB/s)                      43  19.11%
misc::easy1_1K                      298 (3,503 MB/s)      342 (3,052 MB/s)                    44  14.77%
misc::easy1_32                      193 (269 MB/s)        241 (215 MB/s)                      48  24.87%
misc::hard_32                       334 (176 MB/s)        382 (154 MB/s)                      48  14.37%
misc::medium_32                     289 (207 MB/s)        339 (176 MB/s)                      50  17.30%
misc::no_exponential                520 (192 MB/s)        576 (173 MB/s)                      56  10.77%
misc::not_literal                   297 (171 MB/s)        350 (145 MB/s)                      53  17.85%
misc::one_pass_long_prefix          218 (119 MB/s)        268 (97 MB/s)                       50  22.94%
misc::one_pass_long_prefix_not      189 (137 MB/s)        234 (111 MB/s)                      45  23.81%
misc::one_pass_short                147 (115 MB/s)        191 (89 MB/s)                       44  29.93%
misc::one_pass_short_not            146 (116 MB/s)        190 (89 MB/s)                       44  30.14%
sherlock::everything_greedy         8,261,531 (72 MB/s)   10,548,218 (56 MB/s)         2,286,687  27.68%
sherlock::everything_greedy_nl      6,761,252 (87 MB/s)   5,485,594 (108 MB/s)        -1,275,658  -18.87%
sherlock::ing_suffix                3,210,605 (185 MB/s)  3,548,306 (167 MB/s)           337,701  10.52%
sherlock::ing_suffix_limited_space  3,117,710 (190 MB/s)  3,444,708 (172 MB/s)           326,998  10.49%
sherlock::letters                   65,399,440 (9 MB/s)   109,569,267 (5 MB/s)        44,169,827  67.54%
sherlock::letters_lower             60,289,303 (9 MB/s)   106,638,088 (5 MB/s)        46,348,785  76.88%
sherlock::letters_upper             4,650,326 (127 MB/s)  6,230,733 (95 MB/s)          1,580,407  33.98%
sherlock::name_alt4                 295,505 (2,013 MB/s)  363,242 (1,637 MB/s)            67,737  22.92%
sherlock::name_whitespace           60,149 (9,890 MB/s)   71,406 (8,331 MB/s)             11,257  18.72%
sherlock::the_whitespace            2,056,631 (289 MB/s)  2,651,996 (224 MB/s)           595,365  28.95%
sherlock::words                     20,320,171 (29 MB/s)  32,089,109 (18 MB/s)        11,768,938  57.92%
```
I admit that I found it somewhat surprising that a lock-free stack is being beaten by a mutex, but that is probably due to my complete ignorance of lock-free algorithms. (Hence why I'm writing this ticket.)
## Other things
When using a `TreiberStack<T>`, `perf top` reports `mem::epoch::participant::Participant::enter` as a potential hotspot.

When using a `Mutex<Vec<T>>`, `perf top` reports `pthread_mutex_lock` and `__pthread_mutex_unlock_usercnt` as potential hotspots.

In both cases, other hotspots of course also appear, such as methods in the DFA, `memchr`, etc.
Another thing I've noticed in profiling is how much time is spent in the `Drop` impl for `PoolGuard`. I tracked this down to time spent in `memcpy`, so I moved the representation of scratch spaces to a `Box`, which definitely helped, especially for the DFA, whose cache struct isn't tiny. I'm not sure what more can be done about this, though.
## ???
So what can we do here? Is a `TreiberStack` in `crossbeam` the best lock-free algorithm we can use? Does `crossbeam` have overhead that we could eliminate with a custom implementation? Other ideas?
## Scope
In my view, the overhead of synchronization is holding us back from being more competitive with PCRE on very short matches/haystacks.