-
-
Notifications
You must be signed in to change notification settings - Fork 31.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Capture behavior depends on the order of an alternation #80040
Comments
I have two regexes: /(a|ab)?b/ and /(ab|a)?b/. I am not actually sure which behavior is correct. Interpretation 1: The (ab|a) clause matches the a, satisfying the (ab|a)*? once, and the engine proceeds to the b and completes. The capture group ends up containing "a". Interpretation 2: The (ab|a) clause matches the a. Since the clause is marked with *, the engine repeats the attempt and finds nothing the second time. It proceeds to the b and completes. Because the second match attempt on (ab|a) found nothing, the capture group ends up empty. The behavior depends on both the order of (ab|a) vs. (a|ab), and the use of the non-greedy quantifier. I cannot see why changing the order of the alternation should have this effect. The change in behavior occurs in the built-in "re" module but not in the competing "regex" module. I have included the confusing-regex-behavior.py file for troubleshooting. Below is the behavior for matches on these and many variants. Regex pattern matched? matched string captured content (08:58:48) jamie@woody ~ $ python3 /tmp/confusing-regex-behavior.py Behavior from re Regex pattern matched? matched string captured content Behavior from regex Regex pattern matched? matched string captured content |
The first regex, r'(a|ab)?b', looks for the first alternative group by matching left-to-right [1] stopping at the first matching alternation "a". Roughly, the regex simplifies to r'(a)?b' giving 'a' in the captured group. The second regex, r'(ab|a)?b', looks for the first alternative group by matching left-to-right [1] stopping at the first matching alternation "ab". Roughly, the regex simplifies to r'(ab)?b' giving '' in the captured group. From there, I'm not clear on how a non-greedy kleene-star works with capturing groups and with the overall span(). A starting point would be to look at the re.DEBUG output for each pattern [2][3]. [1] From the re docs for the alternation operator: [2] re.DEBUG output for r'(a|ab)*?b' [3] re.DEBUG output for r'(ab|a)*?b'
19: branch 3 (to 22) |
Thanks for your thoughts, Raymond. I understand that the alternation has "short-circuit" behavior, but I still find it confusing in this case. Consider these two: Regex pattern matched? matched string captured content In order to satisfy the first "(ab|a)+?" clause the regex engine has to find at least one match for (ab|a), and still match the final "b" clause of the pattern. In this case, although "(ab|a)" will match "ab", doing so would cause the overall pattern to mismatch. So it seems like in order to obtain the match (which it does, see the second column), the regex engine must proceed past the first "ab" into the "a" part of the OR. But then I would expect the capture group to contain "a" and it does not. For what it's worth, I also tried the match /(ab|a)*?b/ in PHP, Perl, Java, Ruby, Go, Rust and Node.js. The other 7 regex engines all matched "ab" and captured "a". Only Python's re module matches with an empty capture -- and even here it disagrees with the Python "regex" module as I noted in my initial post. |
It looks like a bug in re to me. |
I'm not at all clear on how these features should interact (alternation, non-greedy matching, and group capture). Was just pointing that the first-match behavior of alternation is the documented behavior and that re.DEBUG can be used to explore what the regex is actually doing. Whether this is correct, intended, desirable or consistent with other tools is a completely different question. Even if not, there is a question about whether the behavior should be changed or just documented rather than risk breaking an untold number of regexes in already tested and deployed code. |
It matches, and the span is (0, 2). The only way that it can match like that is for the capture group to match the 'a', and the final 'b' to match the 'b'. Therefore, re.search(r'(ab|a)*b', 'ab').groups() should be ('a', ), as it is for the pattern with a greedy repeat. |
You can |02FAC684|02FC7402|MARK 0 In my computer, 02FC7400 points to "ab", 02FC7401 points 'b' in "ab", 02FC7402 points to the end of "ab". This capture group, begin at 02FC7402, end at 02FC7401. This seems a bug, the begin should at 02FC7400. |
For a capture group, state->mark[] array stores it's begin and end: So state->mark[0] is the begin of the first capture group. re.search(r'(ab|a)*?b', 'ab') 01 MARK 0 MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark, while the BRANCH op uses this macro to protect capture groups before trying a branch. So capture group 1 is [MARK-0 at Line-08, MARK-1 at line-12), this is wrong. |
A bug harvest, see PR11756, maybe sre has more bugs. Any ideas from regular expression experts? |
The PR11756 is prepared. 🔴 Step 1, test-cases Show the wrong behaviors before this fix, the corresponding test-case will be updated in next steps. 🔴 Step 2, MARK_PUSH(lastmark) macro bug This bug was reported in this issue. MARK_PUSH(lastmark) macro didn't protect MARK-0 if it was the only available mark. Note that if skip this step and apply Step 3, it happens to pass the test. But this is indeed a bug that should be fixed. 🔴 Step 3, jump JUMP_MIN_UNTIL_3 needs LASTMARK_SAVE() and MARK_PUSH() This bug was triggered by a pattern which provided by Serhiy Storchaka:
>>> re.match(r'(ab?)*?b', 'ab').group(1)
''
Expected result: 'a' This is due to the lack of protection before JUMP. in op SRE_OP_MAX_UNTIL 🔹 (...)* in op SRE_OP_MIN_UNTIL 🔹 (...)*? in op SRE_OP_REPEAT_ONE 🔹 .* in op SRE_OP_MIN_REPEAT_ONE 🔹 .*? in op SRE_OP_BRANCH 🔹 ...|... in op SRE_OP_ASSERT 🔹 (?=...) in op SRE_OP_ASSERT_NOT 🔹 (?!=...) These protections have not been changed since commit caf1c9d (2003-4-27).
Now we found some test-cases to break them, it seems JUMP_MIN_UNTIL_3 should be protected. I tried to summarize a rule of protection: if is_backtrack_point: # may backtrack if the tail fail
LASTMARK_SAVE()
elif is_repeat_body: # is a sub-pattern inside (...)* or (...)*?
LASTMARK_SAVE()
MARK_PUSH() I did some experiments, it seems this rule works. With this rule, JUMP_MAX_UNTIL_3 should LASTMARK_SAVE() as well, but this is not needed. sre uses stack to simulate recursive call, in fact JUMP_MAX_UNTIL_3 is protected by JUMP_MAX_UNTIL_2 from the previous SRE_OP_MAX_UNTIL execution. 🔴 Step 4, JUMP_ASSERT_NOT needs LASTMARK_SAVE() This bug is about negative assertion, initially reported in bpo-725149, but they didn't fix it for some reasons:
IMO we can fix it, here is two test-cases: Case-A, negative assertion: Case-B, negative assertion in a repeat:
PHP 7.3.2 NULL NULL In Case-A, sre looks compatible with Perl. 🔴 Interesting sidelights 1 Found a Perl bug: perl -le 'print defined($1)?"\"$1\"":"undef",",",defined($2)?"\"$2\"":"undef" if "ab" =~ /((ab?)*?b)/' Expected result: "ab","a" All other engines (except unpatched sre) return the correct result. 🔴 Interesting sidelights 2 >>> re.match(r'(((a)|b)*)', 'ab').groups()
('ab', 'b', 'a') Maybe the correct result is ('ab', 'b', None), only Node.js 10.15.1 returns this. |
Found another bug in re: >>> re.match(r'(?:.*?\b(?=(\t)|(x))x)*', 'a\txa\tx').groups()
('\t', 'x') Expected result: (None, 'x') PHP 7.3.2 NULL, "x" This is a very rare bug, can be fixed by adding MARH_PUSH() before JUMP_MIN_REPEAT_ONE. And maybe other JUMPs should MARK_PUSH() as well. I'm impressed with regex module, it never went wrong. ~~~~~~~~~~~~~~~~~~~~~~
I reported to Perl, it's a bug in perl-5.26, and already fixed in perl-5.28.0. |
Thank you for your great work Ma Lin! But it will take a time to make a review of it. Could you please create and run some microbenchmarks to measure possible performance penalty of additional MARH_PUSHes? I am especially interesting in worst cases. If the penalty is significant, it will be a goal of future optimizations. If it is unsignificant, we will not be bothered about this. I am not sure about backporting these changes. This behavior is such old, that there is a chance to break someone's code that depend on it. |
Besides the worst case, I prepared two solutions. Solution_A (PR12288): Solution_B (PR12289): Worst (PR12290): Legend of this table:
JUMP_MAX_UNTIL_1 No No No -> L/Mu JUMP_MIN_UNTIL_1 No No No -> L/Mu JUMP_REPEAT_ONE_1 L -> L/Mr L/Mr -> L/Mu JUMP_MIN_REPEAT_ONE L -> L/Mr L/Mr -> L/Mu JUMP_BRANCH L/Mr L/Mr L/Mr -> L/Mu JUMP_ASSERT No No No -> L/Mu JUMP_ASSERT_NOT No -> L/Mr L/Mr -> L/Mu 🔶 I made a benchmark tool for Python RE engines. re100mb.py was extracted from a real case, process 100MB real data in 16 rounds (with 16 patterns).
test 01 best: 0.647 0.629 0.625 0.635 0.102 There seems no significant changes. 🔶 Below are some micro benchmarks, run t.py with the benchmark tool testit.py SRE_OP_MAX_UNTIL a few repeats SRE_OP_MAX_UNTIL many repeats ------------------ SRE_OP_MIN_UNTIL many repeats ------------------ SRE_OP_REPEAT_ONE long string ------------------ SRE_OP_MIN_REPEAT_ONE long string ------------------ ------------------ SRE_OP_ASSERT_NOT 🔶 reduce the size of match_context struct In stack-consuming scenes, Solution_A and Solution_B are better than unpatched. On 32 bit platform, sizeof(match_context): 36 bytes -> 32 bytes. It brings:
If limit the stack size to 1GB, the max value of n is: re.match(r'(ab)*', n * 'ab') # need to save MARKs re.match(r'(?:ab)*', n * 'ab') # no need to save MARKs 🔶 Future optimizations
Maybe these two questions can be predicted when compiling the pattern:
|
I guess PR12427 is mature enough for review, I have been working on it these days. You may review these commits one by one, commit message is review guide. Maybe you will need two or three days to understand it, and ponder some days.
How about this plan, if no one complains these changes in Python 3.8 before the end of this year, then we backport to 3.7 and 2.7 branches at that time, and document the changes in notable changes section. |
Is there hope to merge to 3.9 branch? |
Do I need to write a detailed review guide? I suppose that after reading it from beginning to end, it will be easy to understand PR 12427, no need to read anything else. Or plan to replace the sre module with the regex module in a future version? |
Since the old behavior in many cases matches the behavior of Perl and Java (which are considered bugs, but still), it was decided to not backport the fix to avoid possible breakage in bugfix releases. Thank you Ma Lin for your contribution. |
Thanks for your review. 3.11 has a more powerful re module, also thank you for rebasing the atomic grouping code. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: