on short strings that request captures, don't run the DFA #348

BurntSushi · 2017-03-15T12:31:12Z

Currently, if a caller requests captures, then we always run the DFA to determine the extent of the match and then we run either the Pike VM or the bounded backtracker on only the extent of the match to fill in the capture locations. This works well when searching long strings because the DFA can save the NFA engines from doing a lot of work. But on short strings, the DFA probably doesn't pay for itself, so we should just run one of the NFA engines if the string is short enough. (More precisely, in cases where the match is roughly the same length as the entire string, then the DFA isn't helping us at all since the NFA engine will still run the length of the match. But, obviously, this case isn't possible to know up front, so we use "short strings" as a likely predictor of that case.)

There should be some experimentation to determine where this boundary lies, so that we can invent a heuristic for when the string is "short enough."

arielb1 · 2017-03-15T15:37:36Z

If a regex is "anchored" (starting with a ^ and ending with a $ in non-multiline mode), then there is never a point in determining the extent of the match. Anchored regexes are quite a common use-case when regexes are used for parsing/validation.

BurntSushi · 2017-03-15T15:47:44Z

@arielb1 That's a good observation! We should add that to the heuristic as well.

One possible downside I just thought of: the DFA is still useful in the non-match scenario, since it can detect a match failure much more quickly than the NFA engines can.

BurntSushi · 2017-05-08T23:21:49Z

For anyone interested in working on this, the relevant code is in the core implementation of the read_captures_at method:

regex/src/exec.rs (line 513 in d813518):

fn read_captures_at(

In particular, right now, it will always run the DFA engine if the DFA engine was selected by the compiler. The optimization in question here would be to add a conditional in the DFA case that checks if a string is below a certain length, and if it is, just skip directly to running the NFA.
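The conditional described above could look roughly like the following. This is only a sketch of the decision, not the code in exec.rs: `CaptureStrategy`, `choose_capture_strategy`, and `SHORT_HAYSTACK_LEN` are hypothetical names, and the value 128 is a placeholder that benchmarking would need to replace.

```rust
/// Which engines to run when the caller asked for capture groups.
/// Hypothetical type for illustration; not part of the regex crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CaptureStrategy {
    /// Find the match extent with the DFA, then run an NFA engine on that span.
    DfaThenNfa,
    /// Skip the DFA and run an NFA engine over the whole input.
    NfaOnly,
}

/// Placeholder cutoff in bytes. This is an assumption, not a measured value.
const SHORT_HAYSTACK_LEN: usize = 128;

/// Pick a strategy from the haystack length alone.
fn choose_capture_strategy(haystack_len: usize) -> CaptureStrategy {
    if haystack_len < SHORT_HAYSTACK_LEN {
        CaptureStrategy::NfaOnly
    } else {
        CaptureStrategy::DfaThenNfa
    }
}
```

Keeping the cutoff in a single constant makes it easy to adjust once benchmark numbers are available.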

To pick the right length, I'd suggest writing a micro benchmark that:

Uses a regex with several capture groups.
Searches strings of varying lengths.

You may find it useful to construct the regex and the search text such that the regex matches the entire text. In particular, there are cases where running both the DFA and the NFA is actually faster than running the NFA alone, because the DFA can determine "not a match" more quickly than the NFA can.
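A bare-bones timing harness for such a benchmark might look like this. It is a sketch using only the standard library: `make_haystack` and `time_search` are hypothetical helpers, the search is passed in as a closure so the harness does not depend on any particular engine, and the timings are coarse compared to a real benchmark framework.

```rust
use std::time::{Duration, Instant};

/// Build a haystack of `len` characters by repeating `unit`.
/// Assumes `unit` is non-empty ASCII, so characters and bytes coincide.
fn make_haystack(unit: &str, len: usize) -> String {
    unit.chars().cycle().take(len).collect()
}

/// Time `search` on haystacks of each requested length.
/// Returns (length, average time per call) pairs in input order.
fn time_search<F: FnMut(&str)>(
    lengths: &[usize],
    unit: &str,
    iters: u32,
    mut search: F,
) -> Vec<(usize, Duration)> {
    let iters = iters.max(1); // avoid dividing by zero below
    lengths
        .iter()
        .map(|&len| {
            let haystack = make_haystack(unit, len);
            let start = Instant::now();
            for _ in 0..iters {
                search(&haystack);
            }
            (len, start.elapsed() / iters)
        })
        .collect()
}
```

With the regex crate available, the closure could call `Regex::captures` on a pattern with several groups that is written to match the whole haystack. Comparing the per-length averages with and without the DFA step would then show where the crossover lies.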

An alternative/additional implementation path is to take @arielb1's advice and check if the regex is completely anchored. You can do that using self.nfa.is_anchored_start and self.nfa.is_anchored_end.
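The anchoring check can be sketched on its own. The field names below mirror the `is_anchored_start` and `is_anchored_end` flags mentioned above, but `Anchors` is a hypothetical stand-in for the compiled program, and `should_skip_dfa` with its `short_len` parameter is an assumed way of combining the check with the length heuristic.

```rust
/// Stand-in for the anchoring flags on the compiled NFA program.
/// Hypothetical struct; only the field names come from the thread.
struct Anchors {
    is_anchored_start: bool,
    is_anchored_end: bool,
}

/// A regex is "completely anchored" when it is anchored at both ends.
fn is_fully_anchored(anchors: &Anchors) -> bool {
    anchors.is_anchored_start && anchors.is_anchored_end
}

/// Skip the DFA when the regex is completely anchored, or when the
/// haystack is shorter than `short_len` (a cutoff chosen by the caller).
fn should_skip_dfa(anchors: &Anchors, haystack_len: usize, short_len: usize) -> bool {
    is_fully_anchored(anchors) || haystack_len < short_len
}
```

Note that skipping the DFA for anchored regexes gives up its fast rejection of non-matching input, which is the trade-off raised earlier in the thread.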

The DFA can't produce captures, but it is still faster than the Pike VM NFA, so the normal approach to finding capture groups is to look for the entire match with the DFA and then run the NFA on the substring of the input that matched. In cases where the regex is anchored, the match always starts at the beginning of the input, so there is never any point in trying the DFA first. The DFA can still be useful for rejecting inputs which are not in the language of the regular expression, but anchored regexes with capture groups are most commonly used in a parsing context, so it seems like a fair trade-off. For a more in-depth discussion, see GitHub issue rust-lang#348.

BurntSushi added enhancement help wanted labels May 8, 2017

ethanpailes mentioned this issue Oct 27, 2017

Issue 348 #410

Closed

BurntSushi closed this as completed in 918d4a0 Dec 30, 2017

mattcollier mentioned this issue Jan 4, 2021

More on regex gannan08/rdf-canonize-rs#19

Open
