Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexes appear to be significantly slower than re2 in some cases #16989

Closed
brson opened this issue Sep 4, 2014 · 7 comments
Closed

Regexes appear to be significantly slower than re2 in some cases #16989

brson opened this issue Sep 4, 2014 · 7 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@brson
Copy link
Contributor

brson commented Sep 4, 2014

My understanding is that we believe them to be competitive, but two benchmarks I've seen were not really in the ballpark. The easiest one to test is the shootout's regexdna, which you can see from the upstream shootout is drastically slower than the re2-based C++ implementation.

cc @BurntSushi

@BurntSushi
Copy link
Member

Indeed! RE2's C++ implementation is extremely sophisticated. It has a DFA for very quickly finding matches and the location of an overall match. A full NFA simulation is used when you need submatches (and this is made more performant, I think, by running a DFA first to limit the range of text searched by the NFA). This includes doing wonderfully crazy things like reversing the regex itself. Also, I believe RE2/C++ uses parallelism over DFA states, but I'm not entirely sure how that works.

The Rust implementation uses a full NFA simulation. Full stop.

The regex! macro basically encodes that full NFA simulation, but without heap allocation for keeping track of captures (since it knows the total number of captures at compile time). There are other optimizations, but I suspect the elimination of most heap allocation is a big one.

With that said, the main NFA simulation (not the native code specialization) should be roughly on par with Go's regexp package. (I say this because Go's regexp package only has a full NFA simulation and lacks a DFA.) However, Go's regexp package does have a special "one pass" NFA optimization that makes certain regexes faster. There are probably other optimizations in Go's implementation that I'm missing or have forgotten about.

RE2 for Go and C++ were written by the same person (Russ Cox). The sophistication of Rust's regex crate is much more closely aligned with Go's regexp package than it is C++'s RE2 library.

So yes, Rust's regexes are going to be slower than RE2/C++'s regexes until its implementation becomes more sophisticated.

(I've always tried to be very careful to say "RE2/Go" or "RE2/C++", but I'm sure that got lost in the kerfuffle. RE2 is more readily identified with the C++ library, and I don't think I've ever expected my implementation to be on par with it. RE2/Go? Yeah. On par, or close to it. When you add native regexes to the mix, Rust will outperform Go's implementation I think.)

@brson
Copy link
Contributor Author

brson commented Sep 4, 2014

Thanks for the explanation, @BurntSushi.

@pcwalton
Copy link
Contributor

pcwalton commented Sep 4, 2014

IIRC those regexes are pretty simple. Maybe we can add some hacks to make them (and only them) fast?

@brson
Copy link
Contributor Author

brson commented Sep 5, 2014

@BurntSushi do you have a sense of the scope of a project to add the advanced optimizations from RE2/C++? For example, could a student do it in a month?

@BurntSushi
Copy link
Member

@brson I think it's possible. Another idea might be to start them with the "one pass" NFA optimization. Russ Cox briefly describes it here: http://swtch.com/~rsc/regexp/regexp3.html (Do a CTRL-F for one-pass). That could get them warmed up with the existing implementation before they start on the DFA stuff. But that's just an idea. (The one-pass optimization has two steps: analysis of the regex to determine whether it is a one-pass regex or not and then the actual optimization in the NFA simulation.)

Also, in general, Russ's regexp articles are required reading: http://swtch.com/~rsc/regexp/ --- Rust's regex implementation is based on them.

@steveklabnik
Copy link
Member

I meant to move this to the regex crate :(

@steveklabnik
Copy link
Member

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/regex#24

lnicola pushed a commit to lnicola/rust that referenced this issue May 26, 2024
…m-associated-types, r=Veykril

fix: ensure implied bounds from associated types are considered in autocomplete

closes: rust-lang#16989

rust-analyzer needs to consider implied bounds from associated types in order to get all methods suggestions people expect. A pretty easy way to do that is to keep the `candidate_trait_id`'s receiver if it matches `TyFingerprint::Unnameable`.  When benchmarking this change, I didn't notice a meaningful difference in autocomplete latency.

(`TyFingerprint::Unnameable` corresponds to `TyKind::AssociatedType`, `TyKind::OpaqueType`, `TyKind::FnDef`, `TyKind::Closure`, `TyKind::Coroutine`, and `TyKind::CoroutineWitness`.)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
Development

No branches or pull requests

5 participants