Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify construction for repeatedly-entered backreferences #2

Open
geofflangdale opened this issue Apr 26, 2019 · 2 comments

Comments

@geofflangdale
Copy link

@geofflangdale geofflangdale commented Apr 26, 2019

For a pattern like /foo(.{1,5}(\w).{1,5}\2)+bar/ the capturing subexpression (\w) is entered and resolved many times, not just once.

This is a simple example, but it is my belief that a capturing subexpression can take on many values (and potentially go down many paths - this loop example is a comparatively simple case - we could have paths where the (\w) is optional in a loop and some other, non-capturing component is encountered instead, meaning that the (\w) used will be from the previous iteration.

Am I confused?

@travisdowns

This comment has been minimized.

Copy link
Owner

@travisdowns travisdowns commented Apr 26, 2019

Yes, there are many paths and what you say can happen (I prefer to talk about states since that's more natural for an NFA, whereas path kind of invokes backtracking - but I think we understand each other here anyways).

It doesn't change anything though. I don't make any assumption that any group is captured only once. A group can be captured many times over the course of the simulation. After, all for your example with 1 backrefence there will be ~1000 states for an input of length 10, or ~1,000,000 states (not all reachable) for an input of length 100, so don't worry there are lots of states to help encode all the possibilities!

Similarly things like some groups not being captured some time around the "loop" don't lose any problem.

The basic invariant is simply that while processing any particular character in the string, any given capture group has been captured zero or more times. If more than zero, we remember the last capture, but it can come from anywhere (the NFA doesn't really care about loops or anything like that). All such states are explored, as long as they are possibly matching. So after you get a few characters into the input of your example (after the foo part, ie a few times around the "loop") you'll have states representing every possible distinct combination of cases where the two {1,5} matched any number of times, and what the () captured most recently (note the former implies the latter).

This is no different than how an plain NFA has to track a lot of states as well, eg for the same regex with the backref removed, the NFA would still be tracking all distinct states regarding the current match. In that case more states would "collapse" (ie not end up distinct) since for example matching 1 char then 2 in the two {1,5} repeats puts you in the same state as 2 then 1, but that's not true for the capturing case since a different character was matched. So yes, the backrefence case needs (many) more states but not exponentially many.

I think the problem may be that it is hard to tell the difference between many (high degree polynomial) and many (exponential) just by eyeballing it through. I think it is worthwhile to read the "implementation assisted" proof sketch in the readme and maybe look at the code and try to find a specific flaw there. Better, it is definitely worthwhile to run your examples against polyregex, since after all, if the construction is wrong, then one of two things must be true: (1) the "counting" in the proof sketch is wrong (exponentially wrong) or (2) the implementation doesn't correctly match some things like your example as it should. Assuming (2) is more likely, it should be easy(ish) to show, right? (No doubt there are bugs, but I expect they could be fixed without invalidating the high level idea and P time).

@travisdowns

This comment has been minimized.

Copy link
Owner

@travisdowns travisdowns commented Apr 30, 2019

@geofflangdale - I have added some debug output to the BackrefMatcher to help illustrate what I try to hand-wave above. I have also added a variant of your regex above as testGeoff2 - the exact regex is:

(?:(?:.|..|...)(.)(?:.|..|...)\\1)+

Basically I replaced the .{1,5} with .|..|... (equivalent to {1,3}) because polyregex doesn't support counted repetition with {}, and changed the \w to . because polyregex doesn't support character classes (easy to add tho), and I don't think it matters here. I removed the foo and bar because they don't seem relevant (the behavior was the same with them, as long as the input started with foo).

If you run testGeoff2 like so:

mvn -DBackrefMatcher.debug=2 -Dtest='io.github.travisdowns.polyregex.BackrefTest#testGeoff2[*BackrefMatcher]' test

Following this line (in order to skip past the first two test cases to the one that matches against 39 x chars):

BRDEBUG: Matching text xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx against pattern (?:(?:.|..|...)(.)(?:.|..|...)\1)+

You see output like this that shows what happens, character by character, as the NFA simulation runs, the first few lines like:

BRDEBUG: Creating SubNFA for captstate [starts={-1},ends={-1},textlen=40]
BRDEBUG: Created starting state list with 3 states (6 visited)
BRDEBUG:   ANY     id= 4 (captures : (-1,-1))
BRDEBUG:   ANY     id=21 (captures : (-1,-1))
BRDEBUG:   ANY     id=19 (captures : (-1,-1))
BRDEBUG: Creating SubNFA for captstate [starts={1},ends={-1},textlen=40]
BRDEBUG: Processed character x at position 0: 3 current states (4 visited)
BRDEBUG:   ANY     id=22 (captures : (-1,-1))
BRDEBUG:   ANY     id=20 (captures : (-1,-1))
BRDEBUG:   ANY     id= 6 (captures : (1,-1))
BRDEBUG: Creating SubNFA for captstate [starts={2},ends={-1},textlen=40]
BRDEBUG: Creating SubNFA for captstate [starts={1},ends={2},textlen=40]
BRDEBUG: Processed character x at position 1: 5 current states (9 visited)
BRDEBUG:   ANY     id=16 (captures : (1,2))
BRDEBUG:   ANY     id=14 (captures : (1,2))
BRDEBUG:   ANY     id=23 (captures : (-1,-1))
BRDEBUG:   ANY     id=10 (captures : (1,2))
BRDEBUG:   ANY     id= 6 (captures : (2,-1))
BRDEBUG: Creating SubNFA for captstate [starts={3},ends={-1},textlen=40]
BRDEBUG: Creating SubNFA for captstate [starts={2},ends={3},textlen=40]
BRDEBUG: Processed character x at position 2: 7 current states (12 visited)
BRDEBUG:   ANY     id=10 (captures : (2,3))
BRDEBUG:   ANY     id= 6 (captures : (3,-1))
BRDEBUG:   CHAR[x] id=-1 (captures : (1,2))
BRDEBUG:   ANY     id=14 (captures : (2,3))
BRDEBUG:   ANY     id=17 (captures : (1,2))
BRDEBUG:   ANY     id=15 (captures : (1,2))
BRDEBUG:   ANY     id=16 (captures : (2,3))

The initial state has 3 ANY nodes, which are the first characters of the three possibilities in the alternation .|..|...1. All of the states have their capture as (-1,-1) since the capture hasn't been entered.

The state after the first character:

BRDEBUG: Processed character x at position 0: 3 current states (4 visited)
BRDEBUG:   ANY     id=22 (captures : (-1,-1))
BRDEBUG:   ANY     id=20 (captures : (-1,-1))
BRDEBUG:   ANY     id= 6 (captures : (1,-1))

Again 3 states, two of which are the second . for the .. and ... part of the alternation (still (-1,-1) capture) and the last one represents the state that started in the . branch of the alternation, and has now moved on to the capture (.): so you see captures (1,-1) which means that in this state the first (only) capture starts at character 1 and does not yet end.

Look at the step next for the next few characters and you see the evolution. The number of states starts to grow pretty quickly since the number of possible states (including captures) fans out. Then something interesting happens: the number of states flatlines at 31 after processing character 11. It doens't grow any more.

So far from being O(n^(2k+2)) i.e., n^4 time, polyregex handles this in linear time.

It's not hard to see why this happens. Consider what happens when the FNA has simulated say 100 characters. The repeated part of the regex .{1,3}(.).{1,3} is at most 7 characters long. What happens in the current and prior repeat is important, because it defines what character was captured, and our current state (whether we have matched 1,2 or 3 characters in each {1,3} is important). What happened before that is irrelevant: it doens't matter what path we took to match the first 90 characters: all that matters is we got there. There is no "memory" except the current position in the regex (base NFA state) and the most recent capture for \1. The (.) group may have matched in many different ways earlier (one of your concerns), but all the states collapse together except for the recent behavior in the last 7(ish) characters.

You can definitely come with worse regexes though. Consider testGeoff3 which uses (?:.*(.+).*\\1)+ as the pattern. We've gotten rid of the {1,3} and made the inner repeats unbounded (and allowed zero-length), and we've also made the match variable size.

Running:

mvn -DBackrefMatcher.debug=2 -Dtest='io.github.travisdowns.polyregex.BackrefTest#testGeoff3[*BackrefMatcher]' test > out

will give you some ~90,000 lines of output, but lets just look at the number of states after processing each character:

$ grep -oP '[0-9]*(?= current states)' out
5
12
22
37
56
81
111
148
191
242
300
367
442
527
621
726
841
968
1106
1257
1420
1597
1787
1992
2211
2446
2696
2963
3246
3547
3865
4202
4557
4932
5326
5741
6176
6633
7111
7612

Note that the number of states grows very quickly! I think this is in at least some sense the "explosion" of possibilities you have been talking about. If you look at some of the states in the last iteration:

BRDEBUG:   CHAR[x] id=-1 (captures : (1,15))
BRDEBUG:   CHAR[x] id=-1 (captures : (14,17))
BRDEBUG:   CHAR[x] id=-1 (captures : (10,25))
BRDEBUG:   ANY     id= 3 (captures : (6,21))
BRDEBUG:   CHAR[x] id=-1 (captures : (30,37))
BRDEBUG:   MATCH   id= 0 (captures : (8,18))
BRDEBUG:   CHAR[x] id=-1 (captures : (12,23))
BRDEBUG:   CHAR[x] id=-1 (captures : (11,33))
BRDEBUG:   CHAR[x] id=-1 (captures : (5,23))
BRDEBUG:   CHAR[x] id=-1 (captures : (3,12))
BRDEBUG:   CHAR[x] id=-1 (captures : (17,19))
BRDEBUG:   CHAR[x] id=-1 (captures : (10,34))
BRDEBUG:   ANY     id= 3 (captures : (11,12))
BRDEBUG:   ANY     id= 9 (captures : (1,5))
BRDEBUG:   CHAR[x] id=-1 (captures : (3,32))
BRDEBUG:   CHAR[x] id=-1 (captures : (10,24))
BRDEBUG:   CHAR[x] id=-1 (captures : (0,31))

You'll see all sorts of capture states, including some like the last one above which goes all the way back to the first character, since the size of the repetition is unbounded.

Still, this isn't exponential. The number of states looks like this:

states-vs-chars

Based on the earlier analysis, we expect at worst O(n^(2k+1)) states, and in fact this is a perfect match for 3-rd degree polynomial in n, as R^2=0.999999... shows. So in fact this regex hits our upper bound, so we can consider that bound tight, at least for the k=1 case.


1 It would be more efficient to write the .|..|... alternation like .(?:..+)+ which should have the same effect but has only 3 ANY states rather than 1 + 2 + 3 = 6 ANY states. I didn't do that 'cause it looks confusing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.