Regexes appear to be significantly slower than re2 in some cases #16989

brson · 2014-09-04T18:15:22Z

My understanding is that we believe them to be competitive, but two benchmarks I've seen were not really in the ballpark. The easiest one to test is the shootout's regexdna, which you can see from the upstream shootout is drastically slower than the re2-based C++ implementation.

cc @BurntSushi

BurntSushi · 2014-09-04T18:33:04Z

Indeed! RE2's C++ implementation is extremely sophisticated. It has a DFA for very quickly finding matches and the location of an overall match. A full NFA simulation is used when you need submatches (and this is made more performant, I think, by running a DFA first to limit the range of text searched by the NFA). This includes doing wonderfully crazy things like reversing the regex itself. Also, I believe RE2/C++ uses parallelism over DFA states, but I'm not entirely sure how that works.

The Rust implementation uses a full NFA simulation. Full stop.

The regex! macro basically encodes that full NFA simulation, but without heap allocation for keeping track of captures (since it knows the total number of captures at compile time). There are other optimizations, but I suspect the elimination of most heap allocation is a big one.

With that said, the main NFA simulation (not the native code specialization) should be roughly on par with Go's regexp package. (I say this because Go's regexp package only has a full NFA simulation and lacks a DFA.) However, Go's regexp package does have a special "one pass" NFA optimization that makes certain regexes faster. There are probably other optimizations in Go's implementation that I'm missing or have forgotten about.

RE2 for Go and C++ were written by the same person (Russ Cox). The sophistication of Rust's regex crate is much more closely aligned with Go's regexp package than it is C++'s RE2 library.

So yes, Rust's regexes are going to be slower than RE2/C++'s regexes until its implementation becomes more sophisticated.

(I've always tried to be very careful to say "RE2/Go" or "RE2/C++", but I'm sure that got lost in the kerfuffle. RE2 is more readily identified with the C++ library, and I don't think I've ever expected my implementation to be on par with it. RE2/Go? Yeah. On par, or close to it. When you add native regexes to the mix, Rust will outperform Go's implementation I think.)

brson · 2014-09-04T20:12:38Z

Thanks for the explanation, @BurntSushi.

pcwalton · 2014-09-04T20:13:27Z

IIRC those regexes are pretty simple. Maybe we can add some hacks to make them (and only them) fast?

brson · 2014-09-05T02:21:34Z

@BurntSushi do you have a sense of the scope of a project to add the advanced optimizations from RE2/C++? For example, could a student do it in a month?

BurntSushi · 2014-09-05T11:06:40Z

@brson I think it's possible. Another idea might be to start them with the "one pass" NFA optimization. Russ Cox briefly describes it here: http://swtch.com/~rsc/regexp/regexp3.html (Do a CTRL-F for one-pass). That could get them warmed up with the existing implementation before they start on the DFA stuff. But that's just an idea. (The one-pass optimization has two steps: analysis of the regex to determine whether it is a one-pass regex or not and then the actual optimization in the NFA simulation.)

Also, in general, Russ's regexp articles are required reading: http://swtch.com/~rsc/regexp/ --- Rust's regex implementation is based on them.

steveklabnik · 2015-01-21T19:39:44Z

I meant to move this to the regex crate :(

steveklabnik · 2015-01-21T19:41:31Z

I'm pulling a massive triage effort to get us ready for 1.0. As part of this, I'm moving stuff that's wishlist-like to the RFCs repo, as that's where major new things should get discussed/prioritized.

This issue has been moved to the RFCs repo: rust-lang/regex#24

…m-associated-types, r=Veykril fix: ensure implied bounds from associated types are considered in autocomplete closes: rust-lang#16989 rust-analyzer needs to consider implied bounds from associated types in order to get all methods suggestions people expect. A pretty easy way to do that is to keep the `candidate_trait_id`'s receiver if it matches `TyFingerprint::Unnameable`. When benchmarking this change, I didn't notice a meaningful difference in autocomplete latency. (`TyFingerprint::Unnameable` corresponds to `TyKind::AssociatedType`, `TyKind::OpaqueType`, `TyKind::FnDef`, `TyKind::Closure`, `TyKind::Coroutine`, and `TyKind::CoroutineWitness`.)

brson added A-libs labels Sep 4, 2014

bstrie added the A-an-interesting-project label Dec 11, 2014

bstrie mentioned this issue Dec 11, 2014

Implement / optimize the shootout benchmarks #18085

Closed

15 tasks

steveklabnik closed this as completed Jan 21, 2015

steveklabnik mentioned this issue Jan 21, 2015

Regexes appear to be significantly slower than re2 in some cases rust-lang/rfcs#663

Closed

steveklabnik mentioned this issue Jan 21, 2015

Regexes appear to be significantly slower than re2 in some cases rust-lang/regex#24

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regexes appear to be significantly slower than re2 in some cases #16989

Regexes appear to be significantly slower than re2 in some cases #16989

brson commented Sep 4, 2014

BurntSushi commented Sep 4, 2014

brson commented Sep 4, 2014

pcwalton commented Sep 4, 2014

brson commented Sep 5, 2014

BurntSushi commented Sep 5, 2014

steveklabnik commented Jan 21, 2015

steveklabnik commented Jan 21, 2015

Regexes appear to be significantly slower than re2 in some cases #16989

Regexes appear to be significantly slower than re2 in some cases #16989

Comments

brson commented Sep 4, 2014

BurntSushi commented Sep 4, 2014

brson commented Sep 4, 2014

pcwalton commented Sep 4, 2014

brson commented Sep 5, 2014

BurntSushi commented Sep 5, 2014

steveklabnik commented Jan 21, 2015

steveklabnik commented Jan 21, 2015