-
Notifications
You must be signed in to change notification settings - Fork 445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
implement one-pass NFA matcher #68
Comments
Can you please elaborate on what exactly you mean by this, and perhaps point to some literature and/or examples? |
I'm not aware of any literature on this topic, and as far as I know, Russ Cox came up with it. He talks about it briefly here: https://swtch.com/~rsc/regexp/regexp3.html (CTRL-F for The definition in my OP is the key: a one-pass NFA can be used to evaluate a regex where, at any point during the evaluation of the machine, at most one transition can lead to a match. So if you know your machine has that property, you don't need to keep track of as much state as the full NFA simulation. The last time I looked at implementing this, the tricky part I think was actually detecting one-pass regexes. It can probably be simplified by starting with a routine that reports a lot of false negatives (false positives cannot be allowed). |
Thanks for the info; what I was missing is that it can only be used for anchored matches, so that we do not have to introduce a new thread of execution for each input character. |
Right. Unanchored regexes have a |
FWIW, the regex-dfa crate makes it trivial to check whether the regex is one-pass: the DFA created by regex-dfa is minimized, and so it's one-pass iff it has no branches. I'm not sure how to go from there to a one-pass NFA, though. Specifically, you'd need to get back the capture groups somehow. |
How does a 1-pass NFA simulation compare with an NFA simulation that does lookahead before spawing new threads at branch points? It seems like the advantage of a 1-pass simulation is that it will never spawn more than 1 thread, because there are no non-deterministic branches. Adding lookahead (or [previews][previews] to generalize even further) is an optimization that other engines sometimes do, and it seems to me like an engine that does lookahead at every branch will never spawn more than 1 thread for a 1-pass NFA. I bring this up because I keep playing with the idea of trying to impliment previews or lookahead (no promises), but this 1-pass NFA thing also seems like a pretty easy optimization to do and I'm wondering if anyone has any insight about whether or not these optimizations are a 1-or-the-other sort of thing. |
Correct. Thereby eliminating the overhead of managing and tracking threads. A one-pass NFA implementation uses this fact to avoid creating threads in the first place.
I don't follow this. A one-pass NFA will never spawn more than one thread, period. Lookahead/preview have nothing to do with that. Adding previews permits you to limit the creation of threads in the general case for a non-one-pass NFA. (I suspect I am misunderstanding your query.) |
I should have said "1-pass regex". You are 100% right that a 1-pass NFA can't spawn new threads. Sorry for the confused communication. To unpack my point a bit more using the examples that Russ Cox provides on his website. |
@ethanpailes Hmm. I'm still having trouble understanding the essence of your question. Note that the determination of whether a regex is one pass or not is made at compile time, and that information is then used to switch to a completely different implementation of the Pike VM that doesn't use threads at all. In that implementation there would be no previews/look-ahead, and in fact, there would be no point at all in using previews/look-ahead for that case since you would only ever be using one thread. |
Oh, now that you mention the PikeVM I realize that the lookahead version would have to do lots of extra work to copy capture group info around because it wouldn't know that it was only ever going to spawn 1 thread. I think my thinking was muddled by thinking more about the bounded backtracker. Thanks! |
So I've been quitely spinning my wheels on this for the last month or so, and after a few embarrassing false starts I think I've gotten something worth talking about in public. I'm a little confused by the terminology surrounding onepass matchers, because it seems to me that they are only applicable when a regex has no non-determinism, in which case a DFA can impliment capture groups just fine, so that's what I implimented. The code I linked is a pretty good proof of concept but there are lots of outstanding issues.
I'm not entirely sure what the execution planner should do with onepass dfas yet. Onepass DFAs are the most useful where we would otherwise have to use an NFA to extract capture groups, but there is a chance that they could do better than the lazy DFA because they should have lower warmup cost. When a regex has no non-determinism, there can never be an exponential blowup in space (no non-determinism means no compound states from the powerset construction), so I just AOT compiled the FSM. Anyway, here's what happens when I ran all the benchmarks containing capture groups.
The bigger versions show less speedup because there is a literal scan optimization happening, so as the input grows, less and less time is spent in the main matching engine. |
I know this is a pretty big wad of code and issues to cough up, so if you are interested in triage: comments about the EmptyLook stuff and pontification on execution planning would probably be the most helpful to me right now. |
That looks like awesome work! I'm not sure when I'll be able to review it, but it might be a good idea to just open a PR when you think it's in a reasonable mergeable state. It is a lot of code though, and unfortunately i won't be able to merge it until I'm confident that I understand it thoroughly. I'm not sure i understand your question about EmptyLook though. |
That's fine. I'll just implement EmptyLooks with an intermediary state. I figured there was about a 50% chance that I wasn't being entirely coherent about flags. I fully expect this to take a long time to merge even after I open the PR. Considering that this is a whole new backend, getting things to a point where you are comfortable with all of the code is very important. A PR is probably a ways out though. |
Sounds good! The way flags are handled in the DFA is incredibly finnicky. They have been a source of bugs. The idea is to build them into the DFA, but the code isn't particularly forthcoming. And any time I need to tweak it for a bug fix or something, I need to re-learn how it works. So... simplifications are welcome. :-) A key part of this will be testing. One potentially problematic area here is that it's not clear how well the existing test suite will cover the new matcher. Today, we kind of get away with things because each backend can handle a very large subset of the tests, and the tests are actually parameterized on the different backends. This might be a little trickier to pull off with a backend that works on fewer regexes. If it's possible to follow the pattern that other backends use, but skip tests that aren't applicable, then that might help. If we end up skipping too many, then we know we need better coverage. |
I did get kinda scared off when I tried to grok that part of the DFA code. I'm glad I'm not the only one who finds it tricky :-)
That's a good point. For now I just set the execution planner to fall back on an existing NFA simulation if the regex to be executed is not onepass, but I've made no effort to figure out how many tests involve onepass regex. I'll definitely try to get a count of that at some point. If not enough test cases light up I think re2 has some automatic testing ideas ripe for stealing, but let's burn that bridge when we get to it. |
@BurntSushi, while I was working on this I began to worry about alternations containing empty branches introducing subtle sources of non-determinism[1], so I wrote a few test cases. To my delight, it seems that they are currently not supported, so I don't have to think about it. I don't want to rely on that remaining the case long term though. It seems like there are two options. (1) commit to just keeping this restriction. The only time you would need empty branches over an optional alternative is if you want to very carfully control the precidence of the empty branch, which seems super niche. (2) Just remain mindful that the onepass matcher (specifically the bit that checks a regex to see if it is onepass) will need some TLC when this restriction gets lifted. I don't think the choice has any impact on my near term actions, so this comment is mostly just a way to air a potential footgun in public. [1]: (e1|e2|)e3 == (e1|e2)?e3. In both cases you have to check for non-determinism between all three of the expressions. |
I've gotten to the point where all tests are green except for this one. It looks like the right answer is a zero-width capture right past the end of the input, but I don't understand what is going on with that |
w.r.t. to empty alternates, it was definitely my intention to allow them with the regex-syntax overhaul. It wasn't until I actually let them through the compiler that I realized the current compiler didn't know how to handle them. But it is definitely possible. I didn't want to spend the time teaching the compiler this, so I just banned them until a refactoring fixes it. Empty alternates are a somewhat niche feature, so if the onepass DFA can't handle them or it's hard to handle them, then simply using that as a criterion to exclude the onepass DFA sounds great to me. Please note though that there are other ways for empty alternates to manifest themselves in a way that the compiler can handle. e.g., I think
I think the specific codepoint is a red herring, but rather, the important bit is to exercise |
As I was migrating the code over from my research branch to make a pretty PR I realized just this and implemented some more principled checks for regex which might accept the emptystring (rather than just special casing an empty branch or a captured empty branch). I think this should mean that adding support for empty alternatives won't cause any trouble. Sorry for the noise. Thanks for unpacking what is going on in that test case for me! Hopefully it will help me when I try to dig into what is going wrong. |
@ethanpailes OK, so I've finally implemented a one-pass engine. Implementation (WIP branch, do not use): https://github.com/BurntSushi/regex-automata/blob/51a69e2520e989d4788eadf848910b58b897e0cd/src/dfa/onepass.rs I want to say that I'm sorry. I think my instincts led you down a very bad path. I think my biggest mistake was pushing you toward an up-front analysis pass on the HIR to determine whether the regex was one-pass or not. The big red flag here was that this resulted in things like The other part I think I was wrong about was using a DFA. We do indeed want a DFA here. Since it's limited to at most the number of states in the NFA, its memory usage shouldn't be too bad. And we can put caps on how much memory it's allowed to use. Moreover, I think it really needs to be a DFA to achieve a compelling speed advantage. Anyway, sorry for steering you down the wrong path. I had started going down that path too, but the more I looked at and started working on the problem the less sense it made. In the end, I ended up with an implementation that is pretty close to RE2's. It's different in some ways (it supports searching for multiple patterns), but the essential shape of the implementation is very similar. |
At this point it's been long enough since I implemented it that I can't remember if I implemented the filter in terms of the HIR or the bytecodes first. I think it makes sense to do everything in terms of the bytecodes so you don't have to worry about utf8 as much. I remember that the biggest difference between my implementation and RE2s was that mine had variable width state entries in the DFA which made it more complicated but let it support more capture groups. Did you wind up doing fixed sized states or variable sized? |
Fixed size. Although I used a |
Yeah I think that approach makes more sense. Getting the variable sized states to work was a pain. I think I was being too completionist about that. |
I usually close tickets on a commit-by-commit basis, but this refactor was so big that it wasn't feasible to do that. So ticket closures are marked here. Closes #244 Closes #259 Closes #476 Closes #644 Closes #675 Closes #824 Closes #961 Closes #68 Closes #510 Closes #787 Closes #891 Closes #429 Closes #517 Closes #579 Closes #779 Closes #850 Closes #921 Closes #976 Closes #1002 Closes #656
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | minor | `1.8.4` -> `1.9.1` | --- ### Release Notes <details> <summary>rust-lang/regex (regex)</summary> ### [`v1.9.1`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#191-2023-07-07) [Compare Source](rust-lang/regex@1.9.0...1.9.1) \================== This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy. Bug fixes: - [BUG #​1027](rust-lang/regex#1027): Change the allocation strategy for the backtracker to be less aggressive. ### [`v1.9.0`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#190-2023-07-05) [Compare Source](rust-lang/regex@1.8.4...1.9.0) \================== This release marks the end of a [years long rewrite of the regex crate internals](rust-lang/regex#656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements as well. In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions: - A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex. - A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`. - `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte. - The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library. - Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate. New features: - [FEATURE #​244](rust-lang/regex#244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`. - [FEATURE #​259](rust-lang/regex#259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`. - [FEATURE #​476](rust-lang/regex#476): `std` is now an optional feature. `regex` may be used with only `alloc`. - [FEATURE #​644](rust-lang/regex#644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave. - [FEATURE #​675](rust-lang/regex#675): Anchored search APIs are now available in `regex-automata 0.3`. - [FEATURE #​824](rust-lang/regex#824): Add new `Captures::extract` method for easier capture group access. - [FEATURE #​961](rust-lang/regex#961): Add `regex-lite` crate with smaller binary sizes and faster compile times. - [FEATURE #​1022](rust-lang/regex#1022): Add `TryFrom` implementations for the `Regex` type. Performance improvements: - [PERF #​68](rust-lang/regex#68): Added a one-pass DFA engine for faster capture group matching. - [PERF #​510](rust-lang/regex#510): Inner literals are now used to accelerate searches, e.g., `\w+@​\w+` will scan for `@`. - [PERF #​787](rust-lang/regex#787), [PERF #​891](rust-lang/regex#891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`. (There are many more performance improvements as well, but not all of them have specific issues devoted to them.) Bug fixes: - [BUG #​429](rust-lang/regex#429): Fix matching bugs related to `\B` and inconsistencies across internal engines. - [BUG #​517](rust-lang/regex#517): Fix matching bug with capture groups. - [BUG #​579](rust-lang/regex#579): Fix matching bug with word boundaries. - [BUG #​779](rust-lang/regex#779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`. - [BUG #​850](rust-lang/regex#850): Fix matching bug inconsistency between NFA and DFA engines. - [BUG #​921](rust-lang/regex#921): Fix matching bug where literal extraction got confused by `$`. - [BUG #​976](rust-lang/regex#976): Add documentation to replacement routines about dealing with fallibility. - [BUG #​1002](rust-lang/regex#1002): Use corpus rejection in fuzz testing. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNi4wLjAiLCJ1cGRhdGVkSW5WZXIiOiIzNi44LjExIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9--> Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1957 Reviewed-by: crapStone <crapstone01@gmail.com> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
It turns out that when a regex is deterministic, the NFA simulation can be implemented much more efficiently because it only needs to keep track of one set of capture groups (instead of a set of capture groups for every thread).
There are two components to adding this:
In terms of code, it would be nice if we could find a way to reuse code with the full NFA simulation.
This should be easier to implement than #66, and should boost the performance of a lot of regexes. (Of course, we should do both a one pass NFA and #66!)
The text was updated successfully, but these errors were encountered: