Fix #186. #223

Merged
merged 1 commit into from
May 1, 2016

BurntSushi
Member

This enables RegexSets to short-circuit when:

  1. All patterns are anchored to the beginning of the input.
  2. All patterns have either matched or will never match.

We make this happen by checking whether all NFA states in a DFA state
are match states, when a DFA match is observed. If all NFA states are
match states, and since all match states are final states, we know that
the current set of matches will never change. Since we don't care about
reporting location information, we can quit.

N.B. If no matches can be found, then the DFA will short circuit using its
normal mechanism.
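
The condition above can be illustrated with a toy model. This is only a sketch, not the regex crate's DFA: it assumes every pattern is a plain literal anchored at the start of the input, and the name `anchored_set_matches` is hypothetical. A pattern is "live" while it could still match; once every pattern has either matched or died, the set of matches can no longer change, so the scan stops without reading the rest of the input.

```rust
/// Toy model of the short-circuit: every pattern is a literal anchored at
/// the start of `input`. Returns which patterns matched and how many input
/// bytes were examined before the scan stopped.
/// (Hypothetical helper for illustration; not part of the regex crate.)
fn anchored_set_matches(patterns: &[&str], input: &str) -> (Vec<bool>, usize) {
    let input = input.as_bytes();
    let mut matched = vec![false; patterns.len()];
    // live[i] is true while pattern i has neither matched nor failed.
    let mut live = vec![true; patterns.len()];

    // An empty literal matches before any input is read.
    for (i, p) in patterns.iter().enumerate() {
        if p.is_empty() {
            matched[i] = true;
            live[i] = false;
        }
    }

    let mut pos = 0;
    while pos < input.len() {
        // Short-circuit: every pattern has matched or can never match,
        // so the match set is final and the rest of the input is irrelevant.
        if live.iter().all(|&l| !l) {
            break;
        }
        let byte = input[pos];
        for (i, p) in patterns.iter().enumerate() {
            if !live[i] {
                continue;
            }
            let p = p.as_bytes();
            if p[pos] != byte {
                live[i] = false; // mismatch: this pattern is dead
            } else if pos + 1 == p.len() {
                matched[i] = true; // whole literal consumed: final match
                live[i] = false;
            }
        }
        pos += 1;
    }
    (matched, pos)
}
```

With the patterns `["foo", "bar", "fo"]` and an input that starts with `foo`, the scan stops after 3 bytes no matter how long the input is. The toy folds the "no pattern can ever match" case into the same check, whereas the PR leaves that case to the DFA's normal mechanism.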

@BurntSushi
Member Author

cc @dprien @birkenfeld

@birkenfeld

Thanks for the note! You'll be pleased to know that for my use case, the timings dropped from

test test::benches::highlight_html_001x ... bench:   4,413,488 ns/iter (+/- 97,944)

to a nice

test test::benches::highlight_html_001x ... bench:     278,142 ns/iter (+/- 31,189)

Alas, there's still no speedup compared to sequentially matching all expressions, which clocks in at

test test::benches::highlight_html_001x ... bench:     272,105 ns/iter (+/- 10,361)

This is possibly quite dependent on the individual lexer (the number of regexes per state, for example), so I'll definitely continue to try RegexSet with different configurations as I add them.

@BurntSushi
Member Author

That is great news that RegexSet now has at least comparable performance.

And you're right, it is dependent; in particular, on the number of mismatches before finding a match. For example, if you order your regexes from most likely to least likely to match, then depending on the distribution, sequential matching could definitely get comparable performance.
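
To make the ordering point concrete, here is a minimal sketch of sequential matching that counts how many patterns are tried before one matches. It uses plain prefix checks in place of real regexes, and `first_match` is a hypothetical name chosen for illustration.

```rust
/// Tries each pattern in order as an anchored literal (a prefix check stands
/// in for a real regex here). Returns the index of the first pattern that
/// matches, plus the number of patterns tried.
/// (Hypothetical helper for illustration only.)
fn first_match(patterns: &[&str], input: &str) -> (Option<usize>, usize) {
    let mut attempts = 0;
    for (i, p) in patterns.iter().enumerate() {
        attempts += 1;
        if input.starts_with(p) {
            // Stop at the first hit; later patterns are never tried.
            return (Some(i), attempts);
        }
    }
    (None, attempts)
}
```

For the input `"text here"`, the order `["<", "&", "text"]` needs three attempts, while `["text", "<", "&"]` needs one. The cost of the sequential approach is driven by the mismatches that come before the hit.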

In any case, thank you so much for testing out this PR and confirming that it is, in fact, fixed. :-)

@BurntSushi BurntSushi merged commit 4458410 into master May 1, 2016
@BurntSushi BurntSushi deleted the fix-186 branch May 1, 2016 19:04