on short strings that request captures, don't run the DFA #348

Closed
BurntSushi opened this issue Mar 15, 2017 · 3 comments

Comments

@BurntSushi
Member

BurntSushi commented Mar 15, 2017

Currently, if a caller requests captures, then we always run the DFA to determine the extent of the match and then we run either the Pike VM or the bounded backtracker on only the extent of the match to fill in the capture locations. This works well when searching long strings because the DFA can save the NFA engines from doing a lot of work. But on short strings, the DFA probably doesn't pay for itself, so we should just run one of the NFA engines if the string is short enough. (More precisely, in cases where the match is roughly the same length as the entire string, then the DFA isn't helping us at all since the NFA engine will still run the length of the match. But, obviously, this case isn't possible to know up front, so we use "short strings" as a likely predictor of that case.)

There should be some experimentation to determine where this boundary lies, so that we can invent a heuristic for when the string is "short enough."

@arielb1

arielb1 commented Mar 15, 2017

If a regex is "anchored" (starting with a ^ and ending with a $ in non-multiline mode), then there is never a point in determining the extent of the match. Anchored regexes are quite a common use-case when regexes are used for parsing/validation.

@BurntSushi
Member Author

@arielb1 That's a good observation! We should add that to the heuristic as well.

One possible downside I just thought of: the DFA is still useful in the non-match scenario, since it can detect a match failure much more quickly than the NFA engines can.

@BurntSushi
Member Author

For anyone interested in working on this, the relevant code is in the core implementation of the read_captures_at method:

fn read_captures_at(

In particular, right now, it will always run the DFA engine if the DFA engine was selected by the compiler. The optimization in question here would be to add a conditional in the DFA case that checks if a string is below a certain length, and if it is, just skip directly to running the NFA.
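
A minimal sketch of that conditional, under stated assumptions: `Strategy`, `choose_strategy`, and `SHORT_HAYSTACK_LEN` are hypothetical names, and the cutoff of 64 bytes is a placeholder, not a measured value. None of this is taken from the crate's actual code.

```rust
/// Which engines to run for a captures search (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Strategy {
    /// Run the DFA to find the match bounds, then an NFA engine on that span.
    DfaThenNfa,
    /// Skip the DFA and run an NFA engine over the whole haystack.
    NfaOnly,
}

/// Placeholder cutoff in bytes; the right value has to come from benchmarks.
const SHORT_HAYSTACK_LEN: usize = 64;

/// Picks the strategy from the haystack length alone.
fn choose_strategy(haystack_len: usize) -> Strategy {
    if haystack_len < SHORT_HAYSTACK_LEN {
        Strategy::NfaOnly
    } else {
        Strategy::DfaThenNfa
    }
}
```

With this placeholder cutoff, `choose_strategy(8)` picks `NfaOnly` and `choose_strategy(4096)` picks `DfaThenNfa`. Keeping the cutoff as a named constant makes it easy to adjust once measurements exist.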

To pick the right length, I'd suggest writing a micro benchmark that:

  1. Uses a regex with several capture groups.
  2. Searches strings of varying lengths.
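
A rough sketch of such a harness, using only the standard library. `make_haystack` and `mean_time` are hypothetical helpers, and the `search` closure is a stand-in for the real call under test (for example, something like `re.captures(haystack).is_some()` with a regex that has several capture groups). A proper benchmark framework would give more reliable numbers than this hand-rolled loop.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Builds a haystack of `records` copies of `key=value;`, so that a pattern
/// written to consume whole records can match the entire text.
fn make_haystack(records: usize) -> String {
    "key=value;".repeat(records)
}

/// Calls `search` on `haystack` `iters` times and returns the mean time per call.
/// An `iters` of 0 is treated as 1 to avoid dividing by zero.
fn mean_time<F: FnMut(&str) -> bool>(haystack: &str, iters: u32, mut search: F) -> Duration {
    let iters = iters.max(1);
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the call.
        black_box(search(black_box(haystack)));
    }
    start.elapsed() / iters
}
```

Calling `mean_time` for haystacks built from, say, 1, 2, 4, 8, and more records, once per strategy, should show roughly where the two curves cross.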

You may find it useful to construct the regex and the search text such that the regex matches the entire text. Otherwise the results can be skewed: when there is no match, running the DFA and then the NFA is actually faster, because the DFA can determine "not a match" more quickly than the NFA can.

An alternative/additional implementation path is to take @arielb1's advice and check if the regex is completely anchored. You can do that using self.nfa.is_anchored_start and self.nfa.is_anchored_end.
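
A sketch of how the anchoring check could be combined with a length cutoff. The `Program` struct here is a hypothetical stand-in that carries only the two flags named above, and `is_fully_anchored` and `should_skip_dfa` are illustrative names, not functions from the crate.

```rust
/// Stand-in for the compiled NFA program; only the two anchoring flags.
struct Program {
    is_anchored_start: bool,
    is_anchored_end: bool,
}

/// True when the regex is anchored at both ends, so any match must span the
/// whole haystack and the DFA cannot narrow the range for the NFA.
fn is_fully_anchored(nfa: &Program) -> bool {
    nfa.is_anchored_start && nfa.is_anchored_end
}

/// Skips the DFA when the regex is fully anchored or the haystack is shorter
/// than `short_len` (a cutoff the caller would pick from benchmarks).
fn should_skip_dfa(nfa: &Program, haystack_len: usize, short_len: usize) -> bool {
    is_fully_anchored(nfa) || haystack_len < short_len
}
```

Note the trade-off mentioned earlier in the thread: skipping the DFA gives up its fast rejection of non-matching input.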

ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Oct 27, 2017
The DFA can't produce captures, but is still faster
than the Pike VM NFA, so the normal approach to finding
capture groups is to look for the entire match with the
DFA and then run the NFA on the substring of the input
that matched. In cases where the regex is anchored, the
match always starts at the beginning of the input, so
there is never any point to trying the DFA first.

The DFA can still be useful for rejecting inputs which
are not in the language of the regular expression, but
anchored regexes with capture groups are most commonly
used in a parsing context, so it seems like a fair trade-off.

For a more in-depth discussion, see GitHub issue rust-lang#348.
@ethanpailes ethanpailes mentioned this issue Oct 27, 2017
ethanpailes pushed a commit to ethanpailes/regex that referenced this issue Oct 28, 2017
BurntSushi pushed a commit that referenced this issue Dec 30, 2017
The DFA can't produce captures, but is still faster than the Pike VM
NFA, so the normal approach to finding capture groups is to look for
the entire match with the DFA and then run the NFA on the substring
of the input that matched. In cases where the regex is anchored, the
match always starts at the beginning of the input, so there is never
any point to trying the DFA first.

The DFA can still be useful for rejecting inputs which are not in the
language of the regular expression, but anchored regexes with capture
groups are most commonly used in a parsing context, so it seems like a
fair trade-off.

Fixes #348