Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade to aho-corasick 0.7 #566

Merged
merged 4 commits into from Mar 30, 2019
Merged

upgrade to aho-corasick 0.7 #566

merged 4 commits into from Mar 30, 2019

Conversation

BurntSushi
Copy link
Member

We also take this opportunity to add an optimization I've long wanted to add: cause regexes like foo|bar|baz|...|quux to just dispatch to Aho-Corasick instead of using the lazy DFA, which can be quite a bit slower for a large dictionary (such as /usr/share/dict/words). The optimization is fairly brittle, but it's a somewhat common case. We couldn't do this before because Aho-Corasick didn't support leftmost-first match semantics, so it couldn't serve as a drop-in replacement for a regex alternation.

No need to build the benchmark suite 4 times.
This is a "dumb" update in that we retain exactly the same functionality
as before.
This adds a couple new methods on HIR expressions for determining whether
they are literals or not. This is useful for determining whether to apply
optimizations such as Aho-Corasick without re-analyzing the syntax.
Finally, if a regex is just `foo|bar|baz|...|quux`, we will now use plain
old Aho-Corasick. The reason why we weren't doing this before is because
Aho-Corasick didn't support proper leftmost-first match semantics. But
since aho-corasick 0.7, it does, so we can now use it as a drop-in
replacement.

This basically fixes a pretty bad performance bug in a really common case,
but it is otherwise really hacked. First of all, this only happens when a
regex is literally `foo|bar|...|baz`. Even something like
`foo|b(a)r|...|baz` will prevent this optimization from happening, which
is a little silly. Second of all, this optimization only kicks in after
we've compiled the full pattern, which adds quite a bit of overhead. Fixing
this isn't trivial, since we may need the compiled program to resolve
capturing groups. The way to do this is probably to specialize compilation
for certain types of expressions. Maybe.

Anyway, we hack this in for now, and punt on further improvements until
we can really re-think how this should all work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant