Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issue of defined expressions #350

Closed
BernhardtD opened this issue Mar 21, 2017 · 17 comments
Closed

Performance issue of defined expressions #350

BernhardtD opened this issue Mar 21, 2017 · 17 comments

Comments

@BernhardtD
Copy link

I'm working on a comparison of different regex engines including your crate. I discovered some expressions with a slow execution time compared to other engines. I used the test tool from rust-leipzig/regex-performance to measure the performance on a given input file.

The following chart tries to give an overview of the results:
rust_regex

The affected expressions are:

  1. [a-q][^u-z]{13}x
  2. ∞|✓
  3. (?i)Twain

Are there any issues or limitation regarding the given expressions known?

Additionally I attached my results:
results.zip

@BurntSushi
Copy link
Member

I'll say this first: I admire your courage to do a regex benchmark, but I think the one you have is really quite insufficient for a number of reasons. It seems fixable with some work. Note that the regex crate itself has its own benchmark harness that is hooked up to many of the regex engines you're testing. For example, here's the PCRE2 wiring: https://github.com/rust-lang/regex/blob/master/bench/src/ffi/pcre2.rs

Your benchmark harness for Rust specifically looks a bit strange to me.

  1. I wonder why you aren't using regex's C library instead of rolling your own?
  2. You are unnecessarily paying a UTF-8 validation cost for every single search. You should be using the regex::bytes::Regex instead to search on a &[u8].
  3. It doesn't look like you're enabling the SIMD features of the regex crate. (Which is fair if you're explicitly ruling out a nightly Rust compiler as part of your benchmark, but you will wind up measuring results that are probably slower than they could be.)

Popping up a level and looking at your methodology, there are a couple issues:

  1. It doesn't look like you're enabling Unicode support in PCRE2, which makes the comparison a bit strange with Rust's regex engine, which has Unicode support enabled by default.
  2. Your Hyperscan benchmark doesn't appear to find the start of a match, which means it's measuring the wrong thing. (I would, however, still expect Hyperscan to dominate this benchmark.)
  3. Your benchmarks contain pathological cases for FSM based regex engines (specifically, [a-q][^u-z]{13}x) but contain no regexes that cause exponential blowup for backtracking based engines. Why?
  4. Your corpus appears to be predominantly ASCII. It seems like you should probably have other corpora for languages other than English.
  5. I can't actually run your harness. I tried the instructions, but I got build errors, probably because I don't have all the regex engines installed. It really should be possible to run the benchmarks on a subset of regex engines.

As for the specific regexes in question:

[a-q][^u-z]{13}x

I would expect this to be slow because of the large counted repetition, but I am surprised to see it be slower than RE2. It could be that this particular regex falls right on the boundary of whether a DFA is used or not.

∞|✓

I don't know off the top of my head.

(?i)Twain

It appears this is about on par with RE2, but I suspect this will get a speed boost if you used SIMD.

In order for me to say more, I'd actually need to be able to run your harness and that seems difficult to do. From looking at your results, the speed of some regexs is surprising to me, e.g., Tom|Sawyer|Huckleberry|Finn is something the regex crate should do well on.

@BurntSushi
Copy link
Member

My attempt at building even after I installed all the regex libraries:

[andrew@Serval regex-performance] mkdir build
[andrew@Serval regex-performance] cd build/
[andrew@Serval build] cmake ..
-- The C compiler identification is GNU 6.3.1
-- The CXX compiler identification is GNU 6.3.1
-- Check for working C compiler: /usr/sbin/cc
-- Check for working C compiler: /usr/sbin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/sbin/c++
-- Check for working CXX compiler: /usr/sbin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
fatal: No names found, cannot describe anything.
CMake Error at CMakeLists.txt:37 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.


CMake Error at CMakeLists.txt:38 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.


CMake Error at CMakeLists.txt:39 (string):
  string sub-command REGEX, mode REPLACE needs at least 6 arguments total to
  command.



*******************************************************************
* Regex performance tool version ..
* Build type: Release
* Can be changed with: cmake -DCMAKE_BUILD_TYPE=[Release|Debug]
*******************************************************************

-- Found rust packet manager: cargo 0.18.0-nightly (4a3c0a63b 2017-03-12)

-- Configuring incomplete, errors occurred!
See also "/home/andrew/clones/regex-performance/build/CMakeFiles/CMakeOutput.log".

And here's my CMakeOutput.log: https://gist.github.com/anonymous/3315943f580695923cfe1160959fe2e4

@BernhardtD
Copy link
Author

Thank you for your feedback. The project requires cmake 3.0 or newer. I fixed this already in my forked repo: https://github.com/SchmidtD/regex-performance
I will try to incorporate your comments and see whether the results have been changed.

@BurntSushi
Copy link
Member

@schmidtd I have a recent cmake:

$ cmake --version
cmake version 3.7.2

One other thing I realized: the benchmark containing \b is using a Unicode word boundary in Rust but an ASCII word boundary in every other regex engine. You can use an ASCII word boundary in Rust with (?-u:\b).

@mdb256
Copy link

mdb256 commented Mar 21, 2017

I tripped over the same CMake error - and I believe it is due to how git describe --tags is invoked.

This only appears to work if your build dir is a subdir of the git tree. If you are building out-of-tree, then when git is executed it fails with the "not a git repo" error.

That's incorrect - I had to add a tag to the git repo for the git describe --tags to work. I just added git tag v0.0.0 and that part worked. I also had to fix some of the path handling to do with CMAKE_CURRENT_SOURCE_DIR and PROJECT_BINARY_DIR relating to the vendor source dirs.

@BernhardtD
Copy link
Author

Thank you for your feedback and sorry for the circumstances. The rust-leipzig/regex-performance was not in a good state. I fixed the path issues and also the missing tag. The latest commit (92b01eb) is building.

@BurntSushi
Copy link
Member

@schmidtd OK, I was finally able to run the benchmark harness. I couldn't observe much difference between removing the UTF-8 check and not, but on my machine, running the benchmark harness as defined shows Rust edge out PCRE2's JIT ever so slightly:

Total Results:
[      pcre] time:    9007 ms, score:      7 points,
[  pcre-dfa] time:    9997 ms, score:     10 points,
[  pcre-jit] time:     794 ms, score:     41 points,
[       re2] time:     671 ms, score:     31 points,
[      onig] time:    1856 ms, score:      7 points,
[       tre] time:    7866 ms, score:      0 points,
[     hscan] time:     108 ms, score:     71 points,
[rust_regex] time:    2637 ms, score:     49 points,

If I disabled the UTF-8 check in Rust:

Total Results:
[      pcre] time:    9190 ms, score:      7 points,
[  pcre-dfa] time:   10287 ms, score:      7 points,
[  pcre-jit] time:     811 ms, score:     47 points,
[       re2] time:     678 ms, score:     31 points,
[      onig] time:    1879 ms, score:      7 points,
[       tre] time:    8070 ms, score:      0 points,
[     hscan] time:     114 ms, score:     71 points,
[rust_regex] time:    2775 ms, score:     50 points,

Interestingly, there is a big swing between Rust and PCRE2 here, which suggests the benchmark is susceptible to noise.

And if I enable SIMD, Rust gets a lot faster on several benchmarks:

Total Results:
[      pcre] time:    9276 ms, score:      4 points,
[  pcre-dfa] time:   10343 ms, score:      6 points,
[  pcre-jit] time:     816 ms, score:     45 points,
[       re2] time:     676 ms, score:     28 points,
[      onig] time:    1891 ms, score:      7 points,
[       tre] time:    8080 ms, score:      0 points,
[     hscan] time:     115 ms, score:     71 points,
[rust_regex] time:    2617 ms, score:     61 points,

Raw CSV data is here: https://gist.github.com/anonymous/e55e13e1b72090c0319f5916385d7a24

@BurntSushi
Copy link
Member

To look at these regexes again:

[a-q][^u-z]{13}x

This one is just going to be slow. And it's OK because it's a good example of a regex that a FSM will struggle with but a backtracking engine can do well. But I do think you should include a benchmark that does well on FSMs and not backtracking.

∞|✓

SIMD fixes this.

(?i)Twain

SIMD fixes this.

@BurntSushi
Copy link
Member

BurntSushi commented Mar 22, 2017

FWIW, I'll also add that my observations are roughly consistent with the results produced by your benchmark:

  • Rust and PCRE2-JIT are pretty close.
  • Rust tends to edge out RE2.
  • Hyperscan dominates. (But note that you really should at least be getting the start of the a match, otherwise you aren't benchmarking the same thing even remotely.)

The first two are supported by the benchmark harness in this crate as well. (Hyperscan isn't part of that harness.)

@BernhardtD
Copy link
Author

Thank you for your feedback. I adjusted the benchmark to get the start of a match for hyperscan. Hyperscan still dominates, but the gap is smaller.

Since the limitation of the regex crate regarding backtracking and backtracing is documented, I assume that there is nothing to do. In the case someone else has no additions, the ticket can be closed.

@BurntSushi
Copy link
Member

This looks good: BernhardtD/regex-performance@4fc5a3d

Thanks!

@gnzlbg
Copy link
Contributor

gnzlbg commented May 18, 2017

@schmidtd would it be possible to compare D's regex engine as well (using ldc) ? I see a lots of comments on reddit arguing that D has the fastest regex engine, but I never see proof.

@BurntSushi
Copy link
Member

@gnzlbg I haven't seen any benchmarks either. I looked into doing a comparison a while back, but I don't know D, and was therefore too likely for me to make a mistake.

If D can expose a C library, then I'd be happy to be shown how and add it to this repo's benchmark suite (which already benchmarks C and C++ libraries).

@BurntSushi
Copy link
Member

BurntSushi commented May 18, 2017

A few months ago, this was posted on reddit: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ --- But the article doesn't say anything about the benchmarking methodology. Based on the description of the implementation, I am quite skeptical. A bit parallel NFA might do better than the normal Thompson construction, but I'm pretty skeptical that it can compete with a JIT or a DFA... But, who knows, I'm happy to be wrong!

@BurntSushi
Copy link
Member

@gnzlbg A D benchmark harness was added in #439 and #441

@gnzlbg
Copy link
Contributor

gnzlbg commented Feb 3, 2018

Nice, what's happening with:

(without CTFE, Rust left, D right)
sherlock::line_boundary_sherlock_holmes 931,523 (638 MB/s) 370,706 (1604 MB/s) -560,817 -60.20% x 2.51

(with CTFE, Rust left, D right)
sherlock::line_boundary_sherlock_holmes 931,523 (638 MB/s) 393,608 (1511 MB/s) -537,915 -57.75% x 2.37

It looks like D is 2.5x faster than the Rust implementation there. Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/

@BurntSushi
Copy link
Member

BurntSushi commented Feb 3, 2018

It looks like D is 2.5x faster than the Rust implementation there.

Dunno. If D does literal extraction, then it might know to look for line boundaries outside the regex engine. Rust's had that optimization at one point, but it had subtle bugs so I threw it out until I can get around to improving literal handling. But this is just a guess. I'm not familiar with D's regex engine and I haven't quite learned how to read D yet.

Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/

Indeed. I don't know D well, so I don't know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants