Performance issue of defined expressions #350
I'll say this first: I admire your courage to do a regex benchmark, but I think the one you have is really quite insufficient for a number of reasons. It seems fixable with some work. Note that the regex crate itself has its own benchmark harness that is hooked up to many of the regex engines you're testing. For example, here's the PCRE2 wiring: https://github.com/rust-lang/regex/blob/master/bench/src/ffi/pcre2.rs Your benchmark harness for Rust specifically looks a bit strange to me.
Popping up a level and looking at your methodology, there are a couple issues:
As for the specific regexes in question:
I would expect this to be slow because of the large counted repetition, but I am surprised to see it be slower than RE2. It could be that this particular regex falls right on the boundary of whether a DFA is used or not.
I don't know off the top of my head.
It appears this is about on par with RE2, but I suspect this will get a speed boost if you use SIMD. In order for me to say more, I'd actually need to be able to run your harness and that seems difficult to do. From looking at your results, the speed of some regexes is surprising to me, e.g., |
My attempt at building even after I installed all the regex libraries:
And here's my |
Thank you for your feedback. The project requires cmake 3.0 or newer. I fixed this already in my forked repo: https://github.com/SchmidtD/regex-performance |
@schmidtd I have a recent
One other thing I realized: the benchmark containing |
I tripped over the same CMake error - and I believe it is due to how
That's incorrect - I had to add a tag to the git repo for the |
Thank you for your feedback and sorry for the circumstances. The rust-leipzig/regex-performance was not in a good state. I fixed the path issues and also the missing tag. The latest commit (92b01eb) is building. |
@schmidtd OK, I was finally able to run the benchmark harness. I couldn't observe much difference between removing the UTF-8 check and not, but on my machine, running the benchmark harness as defined shows Rust edge out PCRE2's JIT ever so slightly:
If I disabled the UTF-8 check in Rust:
Interestingly, there is a big swing between Rust and PCRE2 here, which suggests the benchmark is susceptible to noise. And if I enable SIMD, Rust gets a lot faster on several benchmarks:
Raw CSV data is here: https://gist.github.com/anonymous/e55e13e1b72090c0319f5916385d7a24 |
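Since the swing above suggests noise, here is a minimal sketch of one way to dampen it: time each benchmark several times and report the minimum and median instead of a single run. This is only an illustration under stated assumptions. The names `time_runs` and `median` are hypothetical and not part of the harness, and the closure is a stand-in for an actual regex search.

```rust
use std::time::{Duration, Instant};

/// Median of the samples (sorts the slice in place).
/// For an even count, averages the two middle samples.
fn median(samples: &mut [Duration]) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort();
    let mid = samples.len() / 2;
    if samples.len() % 2 == 1 {
        samples[mid]
    } else {
        (samples[mid - 1] + samples[mid]) / 2
    }
}

/// Run `f` exactly `runs` times and collect the wall-clock time of each run.
fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Hypothetical workload standing in for one regex search over the input.
    let haystack = "a".repeat(1 << 20);
    let mut samples = time_runs(11, || {
        let n = haystack.bytes().filter(|&b| b == b'a').count();
        // Keep the optimizer from discarding the work.
        std::hint::black_box(n);
    });

    let min = *samples.iter().min().unwrap();
    let med = median(&mut samples);
    println!("min = {:?}, median = {:?}", min, med);
}
```

If the minimum and median for two engines overlap across repeated invocations, a small lead in either direction is probably not meaningful.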
To look at these regexes again:
This one is just going to be slow. And it's OK because it's a good example of a regex that an FSM will struggle with but a backtracking engine can do well. But I do think you should include a benchmark that does well on FSMs and not backtracking.
SIMD fixes this.
SIMD fixes this. |
FWIW, I'll also add that my observations are roughly consistent with the results produced by your benchmark:
The first two are supported by the benchmark harness in this crate as well. (Hyperscan isn't part of that harness.) |
Thank you for your feedback. I adjusted the benchmark to get the start of a match for hyperscan. Hyperscan still dominates, but the gap is smaller. Since the limitation of the regex crate regarding backtracking and backtracing is documented, I assume that there is nothing to do. If nobody has anything to add, the ticket can be closed. |
This looks good: BernhardtD/regex-performance@4fc5a3d Thanks! |
@schmidtd would it be possible to compare D's regex engine as well (using ldc)? I see lots of comments on reddit arguing that D has the fastest regex engine, but I never see proof. |
@gnzlbg I haven't seen any benchmarks either. I looked into doing a comparison a while back, but I don't know D, and was therefore too likely to make a mistake. If D can expose a C library, then I'd be happy to be shown how and add it to this repo's benchmark suite (which already benchmarks C and C++ libraries). |
A few months ago, this was posted on reddit: https://dlang.org/blog/2016/11/07/big-performance-improvement-for-std-regex/ But the article doesn't say anything about the benchmarking methodology. Based on the description of the implementation, I am quite skeptical. A bit-parallel NFA might do better than the normal Thompson construction, but I'm pretty skeptical that it can compete with a JIT or a DFA... But, who knows, I'm happy to be wrong! |
Nice, what's happening with:
It looks like D is 2.5x faster than the Rust implementation there. Also, it looks like D's CTFE implementation is slower than the implementation without CTFE, which doesn't make much sense to me :/ |
Dunno. If D does literal extraction, then it might know to look for line boundaries outside the regex engine. Rust had that optimization at one point, but it had subtle bugs so I threw it out until I can get around to improving literal handling. But this is just a guess. I'm not familiar with D's regex engine and I haven't quite learned how to read D yet.
Indeed. I don't know D well, so I don't know. |
I'm working on a comparison of different regex engines including your crate. I discovered some expressions with a slow execution time compared to other engines. I used the test tool from rust-leipzig/regex-performance to measure the performance on a given input file.
The following chart tries to give an overview of the results:
The affected expressions are:
Are there any known issues or limitations regarding the given expressions?
Additionally I attached my results:
results.zip