## Problem
When a regex search executes, it has to choose a matching engine (sometimes more than one) to carry out the search. Each matching engine needs some amount of fixed mutable space on the heap to carry out a search, which I'll call "scratch space." In general, this space is reusable, and reusing it leads to significant performance benefits when a regular expression carries out multiple searches. (For example, the scratch space may contain computed DFA states.) Scratch space is used every time a regular expression executes a search. For example, calling `re.find_iter("...")` will execute possibly many searches, depending on how many matches it finds.
Here are some constraints I've been working with:

1. A `Regex` must be `Send` and `Sync`. This permits one to share a regex across multiple threads without any external synchronization.
2. The `regex` crate should never spawn a thread.
3. It's OK to have multiple copies of the scratch space. In other words, it's OK to use a little bit more memory to reduce contention.
4. Using a regex in a single thread is far more prevalent than sharing a regex across multiple threads, so it is OK to optimize for the single-threaded use case.
Constraint (1) is the killer, because it means synchronizing concurrent access to mutable state. For example, one might have an `Arc<Regex>` where the `Regex` is used simultaneously by multiple threads. If we gave up on the `Sync` bound, then callers would need to either `Clone` regular expressions to use them across multiple threads, or put them in a `Mutex`. The latter is a bit surprising, since `Regex` doesn't have any public mutable methods. It is also very poor performance-wise, because one search would completely block all other searches. The former, cloning, is somewhat acceptable, if a little wasteful. That is, if we dropped the `Sync` bound, I'd expect users to clone regular expressions if they want to use them across multiple threads.
Constraint (2) just seems like good sense. To me, a thread spawned as a result of running a regex violates the principle of least surprise.
Constraint (3) permits the implementation to be simple and should make contention a non-factor. If we were more memory conscious, we'd never copy the scratch space, which would mean that each of the regex engines would be forced to do their own type of synchronization. Not only is this much more complex, but it means contention could be a real problem while searching, which seems unfortunate.
Given all of this, it is my belief that the key thing worth optimizing is the overhead of synchronization itself. The cost of copying the scratch space should be amortized through reuse, and contention should be extremely limited since synchronization only needs to occur at the start and end of every search. (I suppose contention could become an issue if a regex that matches very often on very short spans is used simultaneously across multiple threads. The good thing about this is that the caller could work around this by simply cloning the regex, avoiding contention altogether.)
## Current solution
The most straightforward solution to this problem is to wrap some collection data structure in a `Mutex`. The data structure could trivially be a queue, stack, or (singly) linked list. The current implementation is so simple that it can be digested in a quick skim. In particular, the only operations we need to support are `get` and `pop`. No ordering invariants are necessary, since all copies of scratch space are equally usable for every regex search.
## Benchmarks
I have three sets of benchmarks representing three pretty easy-to-implement strategies. The first is the baseline, which uses a `RefCell<Vec<T>>`. This obviously does no synchronization, so in this world, `Regex` doesn't impl `Sync`. The second is the current solution, which uses `Mutex<Vec<T>>`. The third uses a lock-free stack from `crossbeam`, namely `TreiberStack<T>`.

Here is a comparison between the baseline and `Mutex<Vec<T>>`, showing only benchmarks with more than a 10% difference (a positive percent indicates how much slower mutexes are than refcells):
```
$ cargo-benchcmp rust-refcell.5 rust-mutex.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-mutex.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         79 (4,936 MB/s)                   30  61.22%
misc::anchored_literal_short_match  48 (541 MB/s)           77 (337 MB/s)                     29  60.42%
misc::easy0_1K                      226 (4,650 MB/s)        275 (3,821 MB/s)                  49  21.68%
misc::easy0_32                      182 (324 MB/s)          225 (262 MB/s)                    43  23.63%
misc::easy1_1K                      257 (4,062 MB/s)        298 (3,503 MB/s)                  41  15.95%
misc::easy1_32                      160 (325 MB/s)          193 (269 MB/s)                    33  20.62%
misc::medium_32                     255 (235 MB/s)          289 (207 MB/s)                    34  13.33%
misc::one_pass_long_prefix          182 (142 MB/s)          218 (119 MB/s)                    36  19.78%
misc::one_pass_long_prefix_not      154 (168 MB/s)          189 (137 MB/s)                    35  22.73%
misc::one_pass_short                112 (151 MB/s)          147 (115 MB/s)                    35  31.25%
misc::one_pass_short_not            112 (151 MB/s)          146 (116 MB/s)                    34  30.36%
sherlock::everything_greedy         6,838,126 (87 MB/s)     8,261,531 (72 MB/s)        1,423,405  20.82%
sherlock::everything_greedy_nl      5,502,273 (108 MB/s)    6,761,252 (87 MB/s)        1,258,979  22.88%
sherlock::letters                   34,884,851 (17 MB/s)    65,399,440 (9 MB/s)       30,514,589  87.47%
sherlock::letters_lower             33,173,868 (17 MB/s)    60,289,303 (9 MB/s)       27,115,435  81.74%
sherlock::letters_upper             3,787,833 (157 MB/s)    4,650,326 (127 MB/s)         862,493  22.77%
sherlock::name_alt4                 257,680 (2,308 MB/s)    295,505 (2,013 MB/s)          37,825  14.68%
sherlock::name_whitespace           53,913 (11,035 MB/s)    60,149 (9,890 MB/s)            6,236  11.57%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,056,631 (289 MB/s)         422,625  25.86%
sherlock::words                     12,970,070 (45 MB/s)    20,320,171 (29 MB/s)       7,350,101  56.67%
```
And here is a comparison between the baseline and `TreiberStack<T>`:
```
$ cargo-benchcmp rust-refcell.5 rust-treiber.5 --threshold 10
name                                rust-refcell.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   49 (7,959 MB/s)         129 (3,023 MB/s)                    80  163.27%
misc::anchored_literal_short_match  48 (541 MB/s)           129 (201 MB/s)                      81  168.75%
misc::easy0_1K                      226 (4,650 MB/s)        319 (3,294 MB/s)                    93  41.15%
misc::easy0_32                      182 (324 MB/s)          268 (220 MB/s)                      86  47.25%
misc::easy1_1K                      257 (4,062 MB/s)        342 (3,052 MB/s)                    85  33.07%
misc::easy1_32                      160 (325 MB/s)          241 (215 MB/s)                      81  50.62%
misc::hard_32                       307 (192 MB/s)          382 (154 MB/s)                      75  24.43%
misc::medium_1K                     609 (1,727 MB/s)        695 (1,513 MB/s)                    86  14.12%
misc::medium_32                     255 (235 MB/s)          339 (176 MB/s)                      84  32.94%
misc::no_exponential                485 (206 MB/s)          576 (173 MB/s)                      91  18.76%
misc::not_literal                   271 (188 MB/s)          350 (145 MB/s)                      79  29.15%
misc::one_pass_long_prefix          182 (142 MB/s)          268 (97 MB/s)                       86  47.25%
misc::one_pass_long_prefix_not      154 (168 MB/s)          234 (111 MB/s)                      80  51.95%
misc::one_pass_short                112 (151 MB/s)          191 (89 MB/s)                       79  70.54%
misc::one_pass_short_not            112 (151 MB/s)          190 (89 MB/s)                       78  69.64%
sherlock::everything_greedy         6,838,126 (87 MB/s)     10,548,218 (56 MB/s)         3,710,092  54.26%
sherlock::ing_suffix                3,027,510 (196 MB/s)    3,548,306 (167 MB/s)           520,796  17.20%
sherlock::ing_suffix_limited_space  2,965,394 (200 MB/s)    3,444,708 (172 MB/s)           479,314  16.16%
sherlock::letters                   34,884,851 (17 MB/s)    109,569,267 (5 MB/s)        74,684,416  214.09%
sherlock::letters_lower             33,173,868 (17 MB/s)    106,638,088 (5 MB/s)        73,464,220  221.45%
sherlock::letters_upper             3,787,833 (157 MB/s)    6,230,733 (95 MB/s)          2,442,900  64.49%
sherlock::name_alt4                 257,680 (2,308 MB/s)    363,242 (1,637 MB/s)           105,562  40.97%
sherlock::name_whitespace           53,913 (11,035 MB/s)    71,406 (8,331 MB/s)             17,493  32.45%
sherlock::quotes                    947,223 (628 MB/s)      1,065,966 (558 MB/s)           118,743  12.54%
sherlock::the_whitespace            1,634,006 (364 MB/s)    2,651,996 (224 MB/s)         1,017,990  62.30%
sherlock::words                     12,970,070 (45 MB/s)    32,089,109 (18 MB/s)        19,119,039  147.41%
```
I note that the comparisons above seem entirely expected to me. Outside of noise, synchronization makes matching universally slower. Moreover, the benchmarks that actually show up in the list (which is a subset of all benchmarks) are those with many searches over short haystacks, or many short matches over long haystacks. This is exactly the case where constant per-search overhead makes a difference.
And, to make it easier to read, a comparison between `Mutex<Vec<T>>` and `TreiberStack<T>`:
```
$ cargo-benchcmp rust-mutex.5 rust-treiber.5 --threshold 10
name                                rust-mutex.5 ns/iter  rust-treiber.5 ns/iter    diff ns/iter  diff %
misc::anchored_literal_long_match   79 (4,936 MB/s)       129 (3,023 MB/s)                    50  63.29%
misc::anchored_literal_short_match  77 (337 MB/s)         129 (201 MB/s)                      52  67.53%
misc::easy0_1K                      275 (3,821 MB/s)      319 (3,294 MB/s)                    44  16.00%
misc::easy0_32                      225 (262 MB/s)        268 (220 MB/s)                      43  19.11%
misc::easy1_1K                      298 (3,503 MB/s)      342 (3,052 MB/s)                    44  14.77%
misc::easy1_32                      193 (269 MB/s)        241 (215 MB/s)                      48  24.87%
misc::hard_32                       334 (176 MB/s)        382 (154 MB/s)                      48  14.37%
misc::medium_32                     289 (207 MB/s)        339 (176 MB/s)                      50  17.30%
misc::no_exponential                520 (192 MB/s)        576 (173 MB/s)                      56  10.77%
misc::not_literal                   297 (171 MB/s)        350 (145 MB/s)                      53  17.85%
misc::one_pass_long_prefix          218 (119 MB/s)        268 (97 MB/s)                       50  22.94%
misc::one_pass_long_prefix_not      189 (137 MB/s)        234 (111 MB/s)                      45  23.81%
misc::one_pass_short                147 (115 MB/s)        191 (89 MB/s)                       44  29.93%
misc::one_pass_short_not            146 (116 MB/s)        190 (89 MB/s)                       44  30.14%
sherlock::everything_greedy         8,261,531 (72 MB/s)   10,548,218 (56 MB/s)         2,286,687  27.68%
sherlock::everything_greedy_nl      6,761,252 (87 MB/s)   5,485,594 (108 MB/s)        -1,275,658  -18.87%
sherlock::ing_suffix                3,210,605 (185 MB/s)  3,548,306 (167 MB/s)           337,701  10.52%
sherlock::ing_suffix_limited_space  3,117,710 (190 MB/s)  3,444,708 (172 MB/s)           326,998  10.49%
sherlock::letters                   65,399,440 (9 MB/s)   109,569,267 (5 MB/s)        44,169,827  67.54%
sherlock::letters_lower             60,289,303 (9 MB/s)   106,638,088 (5 MB/s)        46,348,785  76.88%
sherlock::letters_upper             4,650,326 (127 MB/s)  6,230,733 (95 MB/s)          1,580,407  33.98%
sherlock::name_alt4                 295,505 (2,013 MB/s)  363,242 (1,637 MB/s)            67,737  22.92%
sherlock::name_whitespace           60,149 (9,890 MB/s)   71,406 (8,331 MB/s)             11,257  18.72%
sherlock::the_whitespace            2,056,631 (289 MB/s)  2,651,996 (224 MB/s)           595,365  28.95%
sherlock::words                     20,320,171 (29 MB/s)  32,089,109 (18 MB/s)        11,768,938  57.92%
```
I admit that I found it somewhat surprising that a lock-free stack is being beaten by a mutex, but that is probably due to my complete ignorance of lock-free algorithms. (Hence why I'm writing this ticket.)
## Other things
When using a `TreiberStack<T>`, `perf top` reports `mem::epoch::participant::Participant::enter` as a potential hotspot.

When using a `Mutex<Vec<T>>`, `perf top` reports `pthread_mutex_lock` and `__pthread_mutex_unlock_usercnt` as potential hotspots.

In both cases, other hotspots of course also appear, such as methods in the DFA, `memchr`, etc.
Another thing I've noticed in profiling is how much time is spent in the `Drop` impl for `PoolGuard`. I tracked this down to time spent in `memcpy`, so I moved the representation of scratch spaces to a `Box`, which definitely helped, especially for the DFA, whose cache struct isn't tiny. I'm not sure what more can be done about this, though.
## ???
So what can we do here? Is a `TreiberStack` in `crossbeam` the best lock-free algorithm we can use? Does `crossbeam` have overhead that we could eliminate with a custom implementation? Other ideas?
## Scope
In my view, the overhead of synchronization is holding us back from being more competitive with PCRE on very short matches/haystacks.