
Use literal scans to speed up .* #418

Closed

ethanpailes opened this issue Nov 27, 2017 · 4 comments
@ethanpailes
Contributor

ethanpailes commented Nov 27, 2017

The Pitch

I'm a little worried about the soundness of this idea, but nothing obviously broken jumps out at me. Here goes.

During compilation, we look for regex sub-expressions of the form .*literal_terminator and emit a new instruction telling the regex engine to scan forward for the literal terminator, followed by a split instruction that branches to the next thing in the regex and back to the start of the scan.

*-scan-split-next-*
   ^    | |   ^
   ------ -----

Greediness could be handled by the precedence of the split branches, just as it is for repetitions in the existing implementation.
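To make the idea concrete, here is a minimal sketch of what the compiled form could look like. The `Inst::Scan` and `Inst::Split` names and the compiler shape are hypothetical, not this crate's actual internals:

```rust
// Hypothetical instruction set for the proposal; the `Scan`/`Split`
// names and this compiler shape are illustrative only.
#[derive(Debug)]
enum Inst {
    /// Skip the thread forward to just past the next occurrence of
    /// the literal terminator, or kill the thread if there is none.
    Scan(Vec<u8>),
    /// Fork the thread; the first branch has higher precedence.
    Split(usize, usize),
    Match,
}

/// Compile `.*lit` into `scan "lit"; split(...)`.
fn compile_dotstar_literal(prog: &mut Vec<Inst>, lit: &[u8], greedy: bool) {
    let scan = prog.len();
    prog.push(Inst::Scan(lit.to_vec()));
    let next = prog.len() + 1;
    // Greediness falls out of branch precedence, as with ordinary
    // repetitions: greedy prefers looping back to the scan.
    prog.push(if greedy {
        Inst::Split(scan, next)
    } else {
        Inst::Split(next, scan)
    });
}

fn main() {
    let mut prog = Vec::new();
    compile_dotstar_literal(&mut prog, b"lit", true);
    prog.push(Inst::Match); // whatever follows the sub-expression
    for (pc, inst) in prog.iter().enumerate() {
        println!("{:02}: {:?}", pc, inst);
    }
}
```

Running this prints the three-instruction program `scan "lit"; split(0, 2); match`, with the loop-back branch first because the repetition is greedy.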

This would only work for regex backends that can represent NFA threads that are more than one char ahead in the input, which I think cuts us down to just the bounded backtracker. On the other hand, I have a prototype PikeVM that supports representing threads further ahead in the input, and with this optimization I was able to get a 4x speedup on friendly regexes and haystacks (I don't think it is a good idea to extend the real PikeVM the way that I have, because it introduces overhead).
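For illustration, here is a hedged sketch of the representational difference (the type names are made up for this comment): a classic PikeVM advances all threads in lockstep at one shared input position, so a thread cannot sit several chars ahead; giving each thread its own offset lifts that restriction at the cost of per-thread bookkeeping.

```rust
/// Lockstep representation: the haystack offset is global.
struct LockstepVm {
    at: usize,           // one shared haystack offset
    threads: Vec<usize>, // each thread is just a program counter
}

/// Prototype-style representation: the offset is per-thread.
struct AheadThread {
    pc: usize, // program counter
    at: usize, // this thread's own haystack offset
}

fn main() {
    let lockstep = LockstepVm { at: 2, threads: vec![0, 4] };
    let ahead = vec![
        AheadThread { pc: 4, at: 2 },
        AheadThread { pc: 5, at: 17 }, // jumped ahead by a literal scan
    ];
    println!("lockstep: at={}, pcs={:?}", lockstep.at, lockstep.threads);
    for t in &ahead {
        println!("ahead thread: pc={}, at={}", t.pc, t.at);
    }
}
```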

.* is so common that I think this would be worth it.

Some Case Analysis for Sanity Checking

  1. The literal is not in the remaining input. In this case we don't want the current thread to match, so when the scan operation returns None we just kill the current thread. OK.
  2. The literal is in the remaining input, but the input directly after it does not match. In this case the thread that loops back to the scan instruction will try again, and the thread that tries to parse the next expression will die. OK.
  3. The literal is in the remaining input, and the input directly after it matches (greedy). Because we are matching greedily, the thread that loops back to the scan gets higher precedence. There are then two cases to consider: either there is another terminator or there isn't. If there isn't, the scan thread dies and the parse thread finishes the parse (OK). If there is, the scan thread finds the next terminator and we inductively assume correctness (OK).
  4. The literal is in the remaining input, and the input directly after it matches (lazy). The parse thread gets higher precedence and completes the match (OK).

I think this covers all the cases.
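A small sketch of how an engine might execute the hypothetical `Scan` instruction makes case 1 concrete; cases 2–4 are decided by the split that follows the scan. Here `find` stands in for a real substring search (memchr/memmem in practice):

```rust
// `step_scan` is a hypothetical helper, not this crate's API.

fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Returns the thread's new input offset, or `None` to kill it.
fn step_scan(haystack: &[u8], at: usize, lit: &[u8]) -> Option<usize> {
    // Case 1: no terminator left in the input => kill the thread.
    find(&haystack[at..], lit).map(|i| at + i + lit.len())
}

fn main() {
    let hay = b"litblita";
    assert_eq!(step_scan(hay, 0, b"lit"), Some(3)); // jumps past the first "lit"
    assert_eq!(step_scan(hay, 3, b"lit"), Some(7)); // loop-back finds the next one
    assert_eq!(step_scan(hay, 7, b"lit"), None);    // case 1: thread dies
}
```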

@ethanpailes
Contributor Author

@BurntSushi, bumping this thread as well.

@BurntSushi
Member

Yeah, I saw this, but I just don't have time to digest it completely. It might be a bit before I can. Until then, I have some short pieces of general advice:

  1. Every time I've gone down similar optimization routes, I've always been stymied by implementing an algorithm that goes quadratic. For example, if you take a look at the suffix literal optimizations today, there is actually, I think, a heuristic that prevents quadratic complexity. (A toy sketch of the failure mode follows this list.)
  2. There is a fairly recent paper called "Optimizing Unicode Regular Expression Evaluation with Previews" by Howard Chivers, which talks about optimizations using a technique that involves building out "previews" of a regex; it somewhat reminds me of the description in your issue. There is a nice succinct overview of it here: regexp: port RE2's DFA matcher to the regexp package golang/go#11646 (comment) --- Note that I haven't read the paper yet.
  3. Try it! See what happens. :-) Make sure to think carefully about time complexity, and if possible, write a benchmark for the worst possible case and make sure it's reasonable.
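To make the quadratic concern in point 1 concrete, here is a toy model (not this crate's actual suffix optimization) of a naive "find the suffix literal, then verify in reverse" strategy for a pattern like `.*lit`. Every occurrence of the suffix triggers a reverse verification over all preceding bytes, so without a heuristic the total work is quadratic:

```rust
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Count how many bytes the naive loop touches on `haystack`.
fn bytes_touched(haystack: &[u8], lit: &[u8]) -> usize {
    let (mut touched, mut pos) = (0, 0);
    while let Some(i) = find(&haystack[pos..], lit) {
        let end = pos + i + lit.len();
        touched += end; // a reverse scan from `end` revisits `end` bytes
        pos += i + 1;   // resume the forward scan after this candidate
    }
    touched
}

fn main() {
    // Doubling the input roughly quadruples the work:
    for n in [250usize, 500, 1000] {
        let hay = b"lit".repeat(n);
        println!("len = {:5} => bytes touched = {}", hay.len(), bytes_touched(&hay, b"lit"));
    }
}
```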

@ethanpailes
Contributor Author

Thanks for the feedback and the reference!

The point you make about the possibility of quadratic runtime is a good one. I need to try to find degenerate cases. The first one that pops out at me is a regex like ".*lit(?:a)" on input like "litblitblitblitblitblitblita". I think it would be OK, but I'll spend more time trying to construct cases that force the algorithm to touch the same input repeatedly.
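A rough trace of that example under the proposed scan instruction (with `find` again standing in for a real substring search) suggests why this particular input stays linear: the single scan thread advances monotonically and never revisits a byte.

```rust
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).position(|w| w == needle)
}

fn main() {
    // Input in the spirit of the example: "litblitb...blita".
    let hay: Vec<u8> = [b"litb".repeat(6), b"lita".to_vec()].concat();
    let (mut at, mut touched) = (0, 0);
    while let Some(i) = find(&hay[at..], b"lit") {
        touched += i + 3; // bytes this scan step examined
        at += i + 3;      // the thread jumps just past "lit"
        if hay.get(at) == Some(&b'a') {
            println!("match ending at {}", at + 1);
            break;
        }
        // 'a' did not follow, so the loop-back thread rescans from `at`.
    }
    println!("bytes touched: {} (input len: {})", touched, hay.len());
}
```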

I'll definitely implement something along these lines and try to find degenerate cases. Hopefully, it will be possible to escape out of quadratic behavior dynamically if any exists. I'll develop this idea with benchmarks and get back to you.

@ethanpailes
Contributor Author

I've let this sit for too long.

I have, however, been working on a prototype implementation. Along the way I've managed to convince myself that it is both very fast whenever it gets triggered and introduces quadratic worst-case running time. Frustratingly, the constant factor speedup is so good that I've had trouble finding test cases that really show it failing (I've mainly tried variations of this, which seems like it should trigger quadratic behavior). There is some noise from the fact that I have other optimizations in place, but I should still be able to force really terrible wall clock time.
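One way to see where the quadratic worst case can come from (a toy model, not the prototype's code): in an unanchored search, a fresh `.*lit` thread effectively starts at every input position, and if the literal never appears, each of those O(n) threads scans the O(n) bytes remaining after its start position.

```rust
fn main() {
    let hay = vec![b'a'; 1_000]; // contains no "lit" at all
    let mut touched = 0usize;
    for start in 0..hay.len() {
        // The scan spawned at `start` examines everything after it
        // and fails, killing the thread.
        touched += hay[start..].len();
    }
    // Prints ~n^2/2 = 500500 bytes touched for a 1000-byte input.
    println!("input len: {}, bytes touched: {}", hay.len(), touched);
}
```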

In any case, my conclusion is that this optimization might be worthwhile for a pure backtracking implementation like PCRE, but is definitely the wrong choice for an engine that guarantees linear time, like this one or RE2.
