Use literal scans to speed up .*
#418
Comments
@BurntSushi, bumping this thread as well.
Yeah I saw this but I just don't have time to digest it completely. It might be a bit before I can. Until then, I have some short pieces of general advice:
Thanks for the feedback and the reference! The point you make about the possibility of quadratic runtime is a good one. I need to try to find degenerate cases. The first one that pops out at me is a regex like ".*lit(?:a)" on input like "litblitblitblitblitblitblita". I think it would be OK, but I'll spend more time trying to get the algorithm to touch input repeatedly. I'll definitely implement something along these lines and try to find degenerate cases. Hopefully, it will be possible to escape out of quadratic behavior dynamically if any exists. I'll develop this idea with benchmarks and get back to you.
I've let this sit for too long. I have, however, been working on a prototype implementation. Along the way I've managed to convince myself that it is very fast whenever it gets triggered, but also that it introduces quadratic worst-case running time. Frustratingly, the constant factor speedup is so good that I've had trouble finding test cases that really show it failing (I've mainly tried variations of this, which seems like it should trigger quadratic behavior). There is some noise from the fact that I have other optimizations in place, but I should still be able to force really terrible wall clock time. In any case, my conclusion is that this optimization might be worthwhile for a pure backtracking implementation like pcre, but is definitely the wrong choice for an engine which guarantees linear time like this one or re2.
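To make the quadratic worry concrete, here is a toy cost model in Rust. It is only a sketch under stated assumptions, not the prototype discussed above: the regex `.*a.*b`, the function name `bytes_examined`, and the rule that every hit of the first scan spawns an independent thread running its own forward scan (with no deduplication of threads) are all hypothetical choices made for illustration. It counts how many bytes the scans examine in total.

```rust
/// Toy cost model (an assumption, not the real prototype) for a hypothetical
/// regex `.*a.*b`: the outer scan looks for `a`, and every hit spawns a thread
/// whose own scan for `b`, together with its branch-backs, walks the rest of
/// the input. Returns the total number of bytes examined by all scans.
fn bytes_examined(haystack: &[u8]) -> usize {
    let mut examined = 0;
    for (i, &byte) in haystack.iter().enumerate() {
        // The outer scan for `a` looks at every byte exactly once.
        examined += 1;
        if byte == b'a' {
            // The thread spawned at this hit scans everything after it for `b`.
            examined += haystack.len() - (i + 1);
        }
    }
    examined
}
```

Under this model a haystack of n `a` bytes costs n(n+1)/2 examined bytes, so doubling the input roughly quadruples the work, while a haystack with no `a` at all costs exactly n. Each examined byte is cheap when the scan is a fast literal search, which would be consistent with the difficulty of seeing the blow-up in wall clock time.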
The Pitch
I'm a little worried about the soundness of this idea, but nothing obviously broken jumps out at me. Here goes.
During compilation, we look for regex sub-expressions of the form
.*literal_terminator
and emit a new instruction telling the regex engine to scan forward to find the literal terminator, followed by a split instruction that branches to the next thing in the regex and back to the start of the scan. Greediness could be handled by the precedence of the split in a similar way to how greediness is handled for repetitions in the existing implementation.
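The scan-then-split step can be sketched roughly as follows. This is a minimal illustration, not code from the regex crate: `find_from` and `scan_split_threads` are hypothetical names, the substring search is a naive stand-in for a real literal searcher, threads are collected eagerly into a `Vec` rather than scheduled by a VM, and the sketch ignores the fact that `.` does not match `\n` by default.

```rust
/// Find the first occurrence of `needle` in `haystack` at or after `from`.
/// A naive stand-in for a real substring searcher.
fn find_from(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
    if needle.is_empty() || from > haystack.len() {
        return None;
    }
    haystack[from..]
        .windows(needle.len())
        .position(|w| w == needle)
        .map(|i| from + i)
}

/// Hypothetical "scan then split" step for `.*<lit>`. Every hit of the scan
/// yields a thread positioned just past the literal, then the scan resumes.
/// Returns the input offsets at which threads continue with the rest of the
/// regex, highest-priority thread first.
fn scan_split_threads(haystack: &[u8], lit: &[u8], start: usize, greedy: bool) -> Vec<usize> {
    let mut threads = Vec::new();
    let mut at = start;
    // When the scan finds nothing, no further thread can exist and we stop.
    while let Some(hit) = find_from(haystack, lit, at) {
        // One branch of the split: continue with the next thing in the regex.
        threads.push(hit + lit.len());
        // Other branch: back to the scan, one byte later so that
        // overlapping occurrences of the literal are still found.
        at = hit + 1;
    }
    if greedy {
        // A greedy `.*` prefers to consume as much as possible,
        // so the latest occurrence gets the highest priority.
        threads.reverse();
    }
    threads
}
```

For the haystack `b"litblitblita"` and the literal `b"lit"`, the threads resume at offsets 3, 7 and 11 in non-greedy order, and in the reverse order when `greedy` is true. Only the thread at offset 11 would go on to match a following `a`.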
This would only work for regex backends that can represent NFA threads which are further ahead in the input than one char, which I think cuts us down to just the bounded backtracker. On the other hand I have a prototype PikeVM with support for representing threads further ahead in the input and I was able to get a 4x speedup for friendly regex and haystacks using this optimization (I don't think it is a good idea to extend the real PikeVM the way that I have, 'cause it introduces overhead).
.* is so common that I think this would be worth it.
Some Case Analysis for Sanity Checking
None
we just kill the current thread. OK
I think this covers all the cases.