Skip to content

Conversation

@BurntSushi
Copy link
Member

This fixes both #264 and #268. It also fixes an unreported bug where the DFA cache size could grow (a lot) bigger than the bound set by the caller.

Typically, when a DFA blows up in size, it happens for two reasons:

  1. It accumulates many states.
  2. Each state accumulates more and more NFA states.

Our previous approximation for the size of the DFA accounted for (1) but
used a constant for the size of (2). This can turn out to result in very
large differences (in the MBs) between the approximate and actual size
of the DFA.

Since computing the actual size is expensive, we compute it as a sum as
states are added.

The end result is that we more stringently respect the memory set by the
caller.
The specific problem here is that our literal search doesn't know about
anchors, so it will try to search all of the detected literals in a regex.
In a regex like `a|^b`, the literal `b` should only be searched for at the
beginning of the haystack and in no other place.

The right way to fix this is probably to make the literal detector smarter,
but the literal detector is already too complex. Instead, this commit
detects whether a regex is partially anchored (that is, when the regex has
at least one matchable sub-expression that is anchored), and if so,
disables the literal engine.

Note that this doesn't disable all literal optimizations, just the
optimization that opts out of regex engines entirely. Both the DFA and the
NFA will still use literal prefixes to search. Namely, if it searches and
finds a literal that needs to be anchored but isn't in the haystack, then
the regex engine rules it out as a false positive.

Fixes rust-lang#268.
If the caller asks for captures, and the DFA runs, and there's a match,
and there are actually captures in the regex, then the haystack sent to
the NFA is shortened to correspond to only the match plus some room at the
end for matching zero-width assertions. This "room at the end" needs to be
big enough to at least fit an UTF-8 encoded Unicode codepoint.

Fixes rust-lang#264.
Docopt uses lazy_static! 2.x, but lazy_static required a new minimum
Rust version in 2.1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant