Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upPerformance issue of defined expressions #350
Comments
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 21, 2017
Member
I'll say this first: I admire your courage to do a regex benchmark, but I think the one you have is really quite insufficient for a number of reasons. It seems fixable with some work. Note that the regex crate itself has its own benchmark harness that is hooked up to many of the regex engines you're testing. For example, here's the PCRE2 wiring: https://github.com/rust-lang/regex/blob/master/bench/src/ffi/pcre2.rs
Your benchmark harness for Rust specifically looks a bit strange to me.
- I wonder why you aren't using regex's C library instead of rolling your own?
- You are unnecessarily paying a UTF-8 validation cost for every single search. You should be using the
regex::bytes::Regexinstead to search on a&[u8]. - It doesn't look like you're enabling the SIMD features of the regex crate. (Which is fair if you're explicitly ruling out a nightly Rust compiler as part of your benchmark, but you will wind up measuring results that are probably slower than they could be.)
Popping up a level and looking at your methodology, there are a couple issues:
- It doesn't look like you're enabling Unicode support in PCRE2, which makes the comparison a bit strange with Rust's regex engine, which has Unicode support enabled by default.
- Your Hyperscan benchmark doesn't appear to find the start of a match, which means it's measuring the wrong thing. (I would, however, still expect Hyperscan to dominate this benchmark.)
- Your benchmarks contain pathological cases for FSM based regex engines (specifically,
[a-q][^u-z]{13}x) but contain no regexes that cause exponential blowup for backtracking based engines. Why? - Your corpus appears to be predominantly ASCII. It seems like you should probably have other corpora for languages other than English.
- I can't actually run your harness. I tried the instructions, but I got build errors, probably because I don't have all the regex engines installed. It really should be possible to run the benchmarks on a subset of regex engines.
As for the specific regexes in question:
[a-q][^u-z]{13}x
I would expect this to be slow because of the large counted repetition, but I am surprised to see it be slower than RE2. It could be that this particular regex falls right on the boundary of whether a DFA is used or not.
∞|✓
I don't know off the top of my head.
(?i)Twain
It appears this is about on par with RE2, but I suspect this will get a speed boost if you used SIMD.
In order for me to say more, I'd actually need to be able to run your harness and that seems difficult to do. From looking at your results, the speed of some regexs is surprising to me, e.g., Tom|Sawyer|Huckleberry|Finn is something the regex crate should do well on.
|
I'll say this first: I admire your courage to do a regex benchmark, but I think the one you have is really quite insufficient for a number of reasons. It seems fixable with some work. Note that the regex crate itself has its own benchmark harness that is hooked up to many of the regex engines you're testing. For example, here's the PCRE2 wiring: https://github.com/rust-lang/regex/blob/master/bench/src/ffi/pcre2.rs Your benchmark harness for Rust specifically looks a bit strange to me.
Popping up a level and looking at your methodology, there are a couple issues:
As for the specific regexes in question:
I would expect this to be slow because of the large counted repetition, but I am surprised to see it be slower than RE2. It could be that this particular regex falls right on the boundary of whether a DFA is used or not.
I don't know off the top of my head.
It appears this is about on par with RE2, but I suspect this will get a speed boost if you used SIMD. In order for me to say more, I'd actually need to be able to run your harness and that seems difficult to do. From looking at your results, the speed of some regexs is surprising to me, e.g., |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 21, 2017
Member
My attempt at building even after I installed all the regex libraries:
[andrew@Serval regex-performance] mkdir build
[andrew@Serval regex-performance] cd build/
[andrew@Serval build] cmake ..
-- The C compiler identification is GNU 6.3.1
-- The CXX compiler identification is GNU 6.3.1
-- Check for working C compiler: /usr/sbin/cc
-- Check for working C compiler: /usr/sbin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/sbin/c++
-- Check for working CXX compiler: /usr/sbin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
fatal: No names found, cannot describe anything.
CMake Error at CMakeLists.txt:37 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
CMake Error at CMakeLists.txt:38 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
CMake Error at CMakeLists.txt:39 (string):
string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
command.
*******************************************************************
* Regex performance tool version ..
* Build type: Release
* Can be changed with: cmake -DCMAKE_BUILD_TYPE=[Release|Debug]
*******************************************************************
-- Found rust packet manager: cargo 0.18.0-nightly (4a3c0a63b 2017-03-12)
-- Configuring incomplete, errors occurred!
See also "/home/andrew/clones/regex-performance/build/CMakeFiles/CMakeOutput.log".
And here's my CMakeOutput.log: https://gist.github.com/anonymous/3315943f580695923cfe1160959fe2e4
|
My attempt at building even after I installed all the regex libraries:
And here's my |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
SchmidtD
Mar 21, 2017
Thank you for your feedback. The project requires cmake 3.0 or newer. I fixed this already in my forked repo: https://github.com/SchmidtD/regex-performance
I will try to incorporate your comments and see whether the results have been changed.
SchmidtD
commented
Mar 21, 2017
|
Thank you for your feedback. The project requires cmake 3.0 or newer. I fixed this already in my forked repo: https://github.com/SchmidtD/regex-performance |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 21, 2017
Member
@SchmidtD I have a recent cmake:
$ cmake --version
cmake version 3.7.2
One other thing I realized: the benchmark containing \b is using a Unicode word boundary in Rust but an ASCII word boundary in every other regex engine. You can use an ASCII word boundary in Rust with (?-u:\b).
|
@SchmidtD I have a recent
One other thing I realized: the benchmark containing |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
mdb256
Mar 21, 2017
I tripped over the same CMake error - and I believe it is due to how git describe --tags is invoked.
This only appears to work if your build dir is a subdir of the git tree. If you are building out-of-tree, then when git is executed it fails with the "not a git repo" error.
That's incorrect - I had to add a tag to the git repo for the git describe --tags to work. I just added git tag v0.0.0 and that part worked. I also had to fix some of the path handling to do with CMAKE_CURRENT_SOURCE_DIR and PROJECT_BINARY_DIR relating to the vendor source dirs.
mdb256
commented
Mar 21, 2017
•
|
I tripped over the same CMake error - and I believe it is due to how
That's incorrect - I had to add a tag to the git repo for the |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
SchmidtD
Mar 22, 2017
Thank you for your feedback and sorry for the circumstances. The rust-leipzig/regex-performance was not in a good state. I fixed the path issues and also the missing tag. The latest commit (92b01eb) is building.
SchmidtD
commented
Mar 22, 2017
|
Thank you for your feedback and sorry for the circumstances. The rust-leipzig/regex-performance was not in a good state. I fixed the path issues and also the missing tag. The latest commit (92b01eb) is building. |
This was referenced Mar 22, 2017
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 22, 2017
Member
@SchmidtD OK, I was finally able to run the benchmark harness. I couldn't observe much difference between removing the UTF-8 check and not, but on my machine, running the benchmark harness as defined shows Rust edge out PCRE2's JIT ever so slightly:
Total Results:
[ pcre] time: 9007 ms, score: 7 points,
[ pcre-dfa] time: 9997 ms, score: 10 points,
[ pcre-jit] time: 794 ms, score: 41 points,
[ re2] time: 671 ms, score: 31 points,
[ onig] time: 1856 ms, score: 7 points,
[ tre] time: 7866 ms, score: 0 points,
[ hscan] time: 108 ms, score: 71 points,
[rust_regex] time: 2637 ms, score: 49 points,
If I disabled the UTF-8 check in Rust:
Total Results:
[ pcre] time: 9190 ms, score: 7 points,
[ pcre-dfa] time: 10287 ms, score: 7 points,
[ pcre-jit] time: 811 ms, score: 47 points,
[ re2] time: 678 ms, score: 31 points,
[ onig] time: 1879 ms, score: 7 points,
[ tre] time: 8070 ms, score: 0 points,
[ hscan] time: 114 ms, score: 71 points,
[rust_regex] time: 2775 ms, score: 50 points,
Interestingly, there is a big swing between Rust and PCRE2 here, which suggests the benchmark is susceptible to noise.
And if I enable SIMD, Rust gets a lot faster on several benchmarks:
Total Results:
[ pcre] time: 9276 ms, score: 4 points,
[ pcre-dfa] time: 10343 ms, score: 6 points,
[ pcre-jit] time: 816 ms, score: 45 points,
[ re2] time: 676 ms, score: 28 points,
[ onig] time: 1891 ms, score: 7 points,
[ tre] time: 8080 ms, score: 0 points,
[ hscan] time: 115 ms, score: 71 points,
[rust_regex] time: 2617 ms, score: 61 points,
Raw CSV data is here: https://gist.github.com/anonymous/e55e13e1b72090c0319f5916385d7a24
|
@SchmidtD OK, I was finally able to run the benchmark harness. I couldn't observe much difference between removing the UTF-8 check and not, but on my machine, running the benchmark harness as defined shows Rust edge out PCRE2's JIT ever so slightly:
If I disabled the UTF-8 check in Rust:
Interestingly, there is a big swing between Rust and PCRE2 here, which suggests the benchmark is susceptible to noise. And if I enable SIMD, Rust gets a lot faster on several benchmarks:
Raw CSV data is here: https://gist.github.com/anonymous/e55e13e1b72090c0319f5916385d7a24 |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 22, 2017
Member
To look at these regexes again:
[a-q][^u-z]{13}x
This one is just going to be slow. And it's OK because it's a good example of a regex that a FSM will struggle with but a backtracking engine can do well. But I do think you should include a benchmark that does well on FSMs and not backtracking.
∞|✓
SIMD fixes this.
(?i)Twain
SIMD fixes this.
|
To look at these regexes again:
This one is just going to be slow. And it's OK because it's a good example of a regex that a FSM will struggle with but a backtracking engine can do well. But I do think you should include a benchmark that does well on FSMs and not backtracking.
SIMD fixes this.
SIMD fixes this. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Mar 22, 2017
Member
FWIW, I'll also add that my observations are roughly consistent with the results produced by your benchmark:
- Rust and PCRE2-JIT are pretty close.
- Rust tends to edge out RE2.
- Hyperscan dominates. (But note that you really should at least be getting the start of the a match, otherwise you aren't benchmarking the same thing even remotely.)
The first two are supported by the benchmark harness in this crate as well. (Hyperscan isn't part of that harness.)
|
FWIW, I'll also add that my observations are roughly consistent with the results produced by your benchmark:
The first two are supported by the benchmark harness in this crate as well. (Hyperscan isn't part of that harness.) |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
SchmidtD
Mar 24, 2017
Thank you for your feedback. I adjusted the benchmark to get the start of a match for hyperscan. Hyperscan still dominates, but the gap is smaller.
Since the limitation of the regex crate regarding backtracking and backtracing is documented, I assume that there is nothing to do. In the case someone else has no additions, the ticket can be closed.
SchmidtD
commented
Mar 24, 2017
|
Thank you for your feedback. I adjusted the benchmark to get the start of a match for hyperscan. Hyperscan still dominates, but the gap is smaller. Since the limitation of the regex crate regarding backtracking and backtracing is documented, I assume that there is nothing to do. In the case someone else has no additions, the ticket can be closed. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
|
This looks good: SchmidtD/regex-performance@4fc5a3d Thanks! |
BurntSushi
closed this
Mar 24, 2017
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
gnzlbg
May 18, 2017
@SchmidtD would it be possible to compare D's regex engine as well (using ldc) ? I see a lots of comments on reddit arguing that D has the fastest regex engine, but I never see proof.
gnzlbg
commented
May 18, 2017
•
|
@SchmidtD would it be possible to compare D's regex engine as well (using ldc) ? I see a lots of comments on reddit arguing that D has the fastest regex engine, but I never see proof. |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
May 18, 2017
Member
@gnzlbg I haven't seen any benchmarks either. I looked into doing a comparison a while back, but I don't know D, and was therefore too likely for me to make a mistake.
If D can expose a C library, then I'd be happy to be shown how and add it to this repo's benchmark suite (which already benchmarks C and C++ libraries).
|
@gnzlbg I haven't seen any benchmarks either. I looked into doing a comparison a while back, but I don't know D, and was therefore too likely for me to make a mistake. If D can expose a C library, then I'd be happy to be shown how and add it to this repo's benchmark suite (which already benchmarks C and C++ libraries). |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
May 18, 2017
Member
A few months ago, this was posted on reddit: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ --- But the article doesn't say anything about the benchmarking methodology. Based on the description of the implementation, I am quite skeptical. A bit parallel NFA might do better than the normal Thompson construction, but I'm pretty skeptical that it can compete with a JIT or a DFA... But, who knows, I'm happy to be wrong!
|
A few months ago, this was posted on reddit: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ --- But the article doesn't say anything about the benchmarking methodology. Based on the description of the implementation, I am quite skeptical. A bit parallel NFA might do better than the normal Thompson construction, but I'm pretty skeptical that it can compete with a JIT or a DFA... But, who knows, I'm happy to be wrong! |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
gnzlbg
Feb 3, 2018
Nice, what's happening with:
(without CTFE, Rust left, D right)
sherlock::line_boundary_sherlock_holmes 931,523 (638 MB/s) 370,706 (1604 MB/s) -560,817 -60.20% x 2.51
(with CTFE, Rust left, D right)
sherlock::line_boundary_sherlock_holmes 931,523 (638 MB/s) 393,608 (1511 MB/s) -537,915 -57.75% x 2.37
It looks like D is 2.5x faster than the Rust implementation there. Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/
gnzlbg
commented
Feb 3, 2018
|
Nice, what's happening with:
It looks like D is 2.5x faster than the Rust implementation there. Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/ |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
BurntSushi
Feb 3, 2018
Member
It looks like D is 2.5x faster than the Rust implementation there.
Dunno. If D does literal extraction, then it might know to look for line boundaries outside the regex engine. Rust's had that optimization at one point, but it had subtle bugs so I threw it out until I can get around to improving literal handling. But this is just a guess. I'm not familiar with D's regex engine and I haven't quite learned how to read D yet.
Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/
Indeed. I don't know D well, so I don't know.
Dunno. If D does literal extraction, then it might know to look for line boundaries outside the regex engine. Rust's had that optimization at one point, but it had subtle bugs so I threw it out until I can get around to improving literal handling. But this is just a guess. I'm not familiar with D's regex engine and I haven't quite learned how to read D yet.
Indeed. I don't know D well, so I don't know. |
SchmidtD commentedMar 21, 2017
I'm working on a comparison of different regex engines including your crate. I discovered some expressions with a slow execution time compared to other engines. I used the test tool from rust-leipzig/regex-performance to measure the performance on a given input file.
The following chart tries to give an overview of the results:

The affected expressions are:
Are there any issues or limitation regarding the given expressions known?
Additionally I attached my results:
results.zip