Optimize the DFA inner loop. #202

BurntSushi · 2016-04-13T01:33:00Z

This employs a number of tricks to make the inner loop faster:

1. Elide bounds checks with unsafe. This is the first use of
   unsafe in regex. It is justified below with benchmarks, and comments
   in the source make an argument for correctness.
2. Store metadata about states in the upper bits of a state pointer.
   This reduces the amount of branching needed.
3. Create an inner inner loop that handles all transitions between
   non-dead, non-match and non-start states (i.e., the majority of cases).
   In particular, this lets us avoid having to check specifically whether
   each state is a match state or not. It is unrolled 4 times.
4. Start states are only treated specially if there is a prefix detected
   that we should scan for. Otherwise, start states are no different than
   any other state.
5. Move transitions out of State and into one giant transition table,
   which should hopefully improve locality and make better use of the
   cache.
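
To make the flagged state pointers and the single transition table concrete, here is a minimal sketch. It is not the code in this PR: the bit layout, the names (StatePtr, STATE_MATCH, STATE_DEAD, STATE_MAX, Dfa) and the hard-coded DFA for the literal "ab" are all assumptions chosen for illustration, and the loop is not unrolled.

```rust
/// Index of a state, with metadata flags packed into the upper bits.
/// This layout is hypothetical and chosen only for illustration.
type StatePtr = u32;

const STATE_MATCH: StatePtr = 1 << 31;
const STATE_DEAD: StatePtr = 1 << 30;
/// Largest pointer value that carries no flags.
const STATE_MAX: StatePtr = STATE_DEAD - 1;
/// Transitions per state: one for every possible byte value.
const ALPHABET: usize = 256;

pub struct Dfa {
    /// One flat table for all states. The transition for state `s` on
    /// byte `b` lives at `s * ALPHABET + b`.
    table: Vec<StatePtr>,
}

impl Dfa {
    /// Builds an unanchored DFA that finds the literal "ab".
    /// State 0: nothing seen, state 1: saw 'a', state 2: saw "ab".
    pub fn literal_ab() -> Dfa {
        // Every transition defaults to state 0.
        let mut dfa = Dfa { table: vec![0; 3 * ALPHABET] };
        dfa.set(0, b'a', 1);
        dfa.set(1, b'a', 1);
        // The flag travels with the pointer to the match state.
        dfa.set(1, b'b', 2 | STATE_MATCH);
        dfa
    }

    fn set(&mut self, from: usize, byte: u8, to: StatePtr) {
        self.table[from * ALPHABET + byte as usize] = to;
    }

    /// Returns the end offset of the first match, if any.
    pub fn find_end(&self, haystack: &[u8]) -> Option<usize> {
        let mut si: StatePtr = 0;
        for (i, &b) in haystack.iter().enumerate() {
            let next = self.table[si as usize * ALPHABET + b as usize];
            // Common case: a single comparison rules out every flag,
            // so there is no separate "is this a match state?" check.
            if next <= STATE_MAX {
                si = next;
                continue;
            }
            if (next & STATE_MATCH) != 0 {
                return Some(i + 1);
            }
            // The only other flag is STATE_DEAD: no match is possible.
            // (The "ab" DFA above never produces it.)
            return None;
        }
        None
    }
}
```

With this sketch, `Dfa::literal_ab().find_end(b"xxab")` yields `Some(4)`. The point of the layout is that the hot path pays for one table load and one comparison per byte, and only falls into the slower flag inspection when something special happens.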

The use of unsafe is unfortunate, but it significantly reduces the
number of instructions executed in a search. When the DFA spends a lot
of time in the inner loop, eliding the bounds checks leads to better
performance. In most cases, the boost is worth about 5%, but in some
extreme cases (e.g., a match is the entirety of a large haystack), the
boost can be worth nearly 50%.
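
As a rough illustration of what eliding a bounds check looks like, the sketch below validates a transition table once and then reads it with `get_unchecked` in the loop. The `Table` type and its methods are hypothetical and are not the code in this PR; they only show the shape of the safety argument.

```rust
/// A flat transition table with 256 entries per state.
/// Hypothetical type for illustration only.
pub struct Table {
    entries: Vec<u32>,
}

impl Table {
    /// Checks the invariant that makes unchecked indexing sound:
    /// there is at least one state, the table holds exactly 256 entries
    /// per state, and every entry names an existing state.
    pub fn new(entries: Vec<u32>) -> Option<Table> {
        let num_states = entries.len() / 256;
        if num_states == 0 || entries.len() % 256 != 0 {
            return None;
        }
        if entries.iter().any(|&s| (s as usize) >= num_states) {
            return None;
        }
        Some(Table { entries })
    }

    /// Runs the automaton from state 0 and returns the final state.
    pub fn run(&self, haystack: &[u8]) -> u32 {
        let mut si: u32 = 0;
        for &b in haystack {
            let i = si as usize * 256 + b as usize;
            // SAFETY: `si` is either 0 or a value read from `entries`,
            // so `si < num_states` by the check in `new`. `b` is at
            // most 255, hence `i < num_states * 256 == entries.len()`.
            si = unsafe { *self.entries.get_unchecked(i) };
        }
        si
    }
}
```

Replacing the `unsafe` line with `self.entries[i]` gives the safe variant, which performs a bounds check on every byte unless the compiler can prove it unnecessary.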

Here is a comparison between code without unsafe and with unsafe:

$ cargo-benchcmp rust-safe rust-unsafe --threshold 3
name                                     rust-safe ns/iter     rust-unsafe ns/iter     diff ns/iter   diff %
misc::anchored_literal_long_match        29 (13,448 MB/s)      26 (15,000 MB/s)                  -3  -10.34%
misc::anchored_literal_short_match       28 (928 MB/s)         26 (1,000 MB/s)                   -2   -7.14%
misc::easy0_1MB                          49 (21,400,061 MB/s)  42 (24,966,738 MB/s)              -7  -14.29%
misc::easy1_1K                           79 (13,215 MB/s)      76 (13,736 MB/s)                  -3   -3.80%
misc::easy1_32                           79 (658 MB/s)         76 (684 MB/s)                     -3   -3.80%
misc::easy1_32K                          80 (409,850 MB/s)     76 (431,421 MB/s)                 -4   -5.00%
misc::hard_1K                            104 (10,105 MB/s)     100 (10,510 MB/s)                 -4   -3.85%
misc::match_class_unicode                595 (270 MB/s)        571 (281 MB/s)                   -24   -4.03%
misc::medium_1MB                         51 (20,560,862 MB/s)  44 (23,831,909 MB/s)              -7  -13.73%
misc::no_exponential                     378 (264 MB/s)        361 (277 MB/s)                   -17   -4.50%
misc::not_literal                        206 (247 MB/s)        196 (260 MB/s)                   -10   -4.85%
misc::one_pass_long_prefix               116 (224 MB/s)        111 (234 MB/s)                    -5   -4.31%
misc::one_pass_long_prefix_not           116 (224 MB/s)        108 (240 MB/s)                    -8   -6.90%
misc::one_pass_short                     81 (209 MB/s)         76 (223 MB/s)                     -5   -6.17%
misc::one_pass_short_not                 79 (215 MB/s)         75 (226 MB/s)                     -4   -5.06%
misc::reallyhard_1K                      3,796 (276 MB/s)      3,629 (289 MB/s)                -167   -4.40%
misc::reallyhard_1MB                     3,765,536 (278 MB/s)  3,602,215 (291 MB/s)        -163,321   -4.34%
misc::reallyhard_32                      234 (252 MB/s)        222 (265 MB/s)                   -12   -5.13%
misc::reallyhard_32K                     117,917 (278 MB/s)    112,604 (291 MB/s)            -5,313   -4.51%
misc::replace_all                        144                   137                               -7   -4.86%
sherlock::before_holmes                  2,163,856 (274 MB/s)  2,077,792 (286 MB/s)         -86,064   -3.98%
sherlock::everything_greedy              3,641,444 (163 MB/s)  2,578,502 (230 MB/s)      -1,062,942  -29.19%
sherlock::everything_greedy_nl           2,109,164 (282 MB/s)  1,080,933 (550 MB/s)      -1,028,231  -48.75%
sherlock::holmes_coword_watson           1,087,276 (547 MB/s)  1,037,918 (573 MB/s)         -49,358   -4.54%
sherlock::ing_suffix                     2,419,816 (245 MB/s)  2,308,945 (257 MB/s)        -110,871   -4.58%
sherlock::ing_suffix_limited_space       2,360,927 (251 MB/s)  2,259,791 (263 MB/s)        -101,136   -4.28%
sherlock::letters                        27,710,372 (21 MB/s)  25,348,374 (23 MB/s)      -2,361,998   -8.52%
sherlock::letters_lower                  26,888,541 (22 MB/s)  24,759,385 (24 MB/s)      -2,129,156   -7.92%
sherlock::letters_upper                  3,138,611 (189 MB/s)  2,989,327 (199 MB/s)        -149,284   -4.76%
sherlock::line_boundary_sherlock_holmes  2,132,889 (278 MB/s)  2,046,399 (290 MB/s)         -86,490   -4.06%
sherlock::name_alt1                      35,964 (16,542 MB/s)  37,164 (16,008 MB/s)           1,200    3.34%
sherlock::name_whitespace                88,768 (6,702 MB/s)   85,322 (6,972 MB/s)           -3,446   -3.88%
sherlock::quotes                         800,085 (743 MB/s)    769,792 (772 MB/s)           -30,293   -3.79%
sherlock::the_whitespace                 1,315,168 (452 MB/s)  1,238,173 (480 MB/s)         -76,995   -5.85%
sherlock::words                          11,230,278 (52 MB/s)  9,855,296 (60 MB/s)       -1,374,982  -12.24%
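
For reading the table: the diff % column is consistent with the relative change from the first measurement to the second, so negative values mean the second build is faster. A small sketch of that arithmetic follows; the function name is made up for illustration and this is not taken from cargo-benchcmp itself.

```rust
/// Relative change from `old_ns` to `new_ns`, in percent.
/// Negative values mean the new measurement is faster.
pub fn diff_percent(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}
```

For example, misc::anchored_literal_long_match goes from 29 to 26 ns/iter, and (26 - 29) / 29 is roughly -10.34%.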


BurntSushi · 2016-04-13T01:33:38Z

Comparison with current master:

$ cargo-benchcmp ~/rust/regex-master/master rust-unsafe --threshold 3
name                                     master ns/iter        rust-unsafe ns/iter     diff ns/iter   diff %
misc::anchored_literal_long_match        34 (11,470 MB/s)      26 (15,000 MB/s)                  -8  -23.53%
misc::anchored_literal_long_non_match    16 (24,375 MB/s)      15 (26,000 MB/s)                  -1   -6.25%
misc::anchored_literal_short_match       34 (764 MB/s)         26 (1,000 MB/s)                   -8  -23.53%
misc::anchored_literal_short_non_match   16 (1,625 MB/s)       15 (1,733 MB/s)                   -1   -6.25%
misc::easy0_32                           29 (2,034 MB/s)       28 (2,107 MB/s)                   -1   -3.45%
misc::easy0_32K                          29 (1,130,862 MB/s)   28 (1,171,250 MB/s)               -1   -3.45%
misc::easy1_1K                           95 (10,989 MB/s)      76 (13,736 MB/s)                 -19  -20.00%
misc::easy1_1MB                          111 (9,446,810 MB/s)  81 (12,945,629 MB/s)             -30  -27.03%
misc::easy1_32                           95 (547 MB/s)         76 (684 MB/s)                    -19  -20.00%
misc::easy1_32K                          99 (331,191 MB/s)     76 (431,421 MB/s)                -23  -23.23%
misc::hard_1K                            124 (8,475 MB/s)      100 (10,510 MB/s)                -24  -19.35%
misc::hard_1MB                           143 (7,332,888 MB/s)  120 (8,738,358 MB/s)             -23  -16.08%
misc::hard_32                            122 (483 MB/s)        102 (578 MB/s)                   -20  -16.39%
misc::hard_32K                           125 (262,360 MB/s)    101 (324,702 MB/s)               -24  -19.20%
misc::long_needle1                       2,553 (39,169 MB/s)   2,390 (41,841 MB/s)             -163   -6.38%
misc::match_class                        87 (931 MB/s)         83 (975 MB/s)                     -4   -4.60%
misc::match_class_in_range               30 (2,700 MB/s)       29 (2,793 MB/s)                   -1   -3.33%
misc::match_class_unicode                649 (248 MB/s)        571 (281 MB/s)                   -78  -12.02%
misc::medium_1K                          31 (33,935 MB/s)      30 (35,066 MB/s)                  -1   -3.23%
misc::medium_32                          31 (1,935 MB/s)       30 (2,000 MB/s)                   -1   -3.23%
misc::medium_32K                         31 (1,057,935 MB/s)   30 (1,093,200 MB/s)               -1   -3.23%
misc::no_exponential                     401 (249 MB/s)        361 (277 MB/s)                   -40   -9.98%
misc::not_literal                        218 (233 MB/s)        196 (260 MB/s)                   -22  -10.09%
misc::one_pass_long_prefix               134 (194 MB/s)        111 (234 MB/s)                   -23  -17.16%
misc::one_pass_long_prefix_not           128 (203 MB/s)        108 (240 MB/s)                   -20  -15.62%
misc::one_pass_short                     86 (197 MB/s)         76 (223 MB/s)                    -10  -11.63%
misc::one_pass_short_not                 90 (188 MB/s)         75 (226 MB/s)                    -15  -16.67%
misc::reallyhard_1K                      3,986 (263 MB/s)      3,629 (289 MB/s)                -357   -8.96%
misc::reallyhard_1MB                     4,098,021 (255 MB/s)  3,602,215 (291 MB/s)        -495,806  -12.10%
misc::reallyhard_32                      248 (237 MB/s)        222 (265 MB/s)                   -26  -10.48%
misc::reallyhard_32K                     123,380 (265 MB/s)    112,604 (291 MB/s)           -10,776   -8.73%
sherlock::before_holmes                  2,262,521 (262 MB/s)  2,077,792 (286 MB/s)        -184,729   -8.16%
sherlock::everything_greedy              5,835,514 (101 MB/s)  2,578,502 (230 MB/s)      -3,257,012  -55.81%
sherlock::everything_greedy_nl           4,875,405 (122 MB/s)  1,080,933 (550 MB/s)      -3,794,472  -77.83%
sherlock::holmes_cochar_watson           253,745 (2,344 MB/s)  240,479 (2,473 MB/s)         -13,266   -5.23%
sherlock::holmes_coword_watson           1,276,976 (465 MB/s)  1,037,918 (573 MB/s)        -239,058  -18.72%
sherlock::ing_suffix                     2,502,738 (237 MB/s)  2,308,945 (257 MB/s)        -193,793   -7.74%
sherlock::ing_suffix_limited_space       2,443,798 (243 MB/s)  2,259,791 (263 MB/s)        -184,007   -7.53%
sherlock::letters_upper                  3,085,120 (192 MB/s)  2,989,327 (199 MB/s)         -95,793   -3.11%
sherlock::line_boundary_sherlock_holmes  2,229,304 (266 MB/s)  2,046,399 (290 MB/s)        -182,905   -8.20%
sherlock::name_alt4                      244,757 (2,430 MB/s)  234,984 (2,531 MB/s)          -9,773   -3.99%
sherlock::quotes                         822,096 (723 MB/s)    769,792 (772 MB/s)           -52,304   -6.36%
sherlock::repeated_class_negation        89,809,610 (6 MB/s)   94,368,614 (6 MB/s)        4,559,004    5.08%
sherlock::the_whitespace                 1,345,371 (442 MB/s)  1,238,173 (480 MB/s)        -107,198   -7.97%
sherlock::word_ending_n                  56,257,144 (10 MB/s)  60,969,805 (9 MB/s)        4,712,661    8.38%
sherlock::words                          10,859,353 (54 MB/s)  9,855,296 (60 MB/s)       -1,004,057   -9.25%

alexcrichton · 2016-04-13T21:23:16Z

Nice! Everything about regex just keeps getting faster...

I wonder if it'd be worth investigating some fuzzing techniques like afl to test this out a bit? I'm pretty happy with the level of comments and thought here though, so I'd be fine merging at any time :)

BurntSushi · 2016-04-13T21:52:28Z

Fuzzing is definitely a good idea. There is some fuzz testing in regex-syntax, but it doesn't reach beyond the parser. Coming up with good quickcheck tests for regex is hard. AFL, however, might be worth a shot. In the meantime, I will defer fuzzing to #203 and merge. :-) Thanks!

BurntSushi merged commit 01f23d2 into master Apr 13, 2016

BurntSushi mentioned this pull request Apr 13, 2016

decrease memory usage of DFA with variable width delta encoding of instruction pointers #199

Closed

BurntSushi deleted the opt-dfa branch April 13, 2016 22:59
