Undocumented feature prevents re module from finding certain matches #67880
Comments
This pattern matches:
But this doesn't:
The only difference is that {2} is replaced by {0,2}. Widening the quantifier's range should not stop the pattern from matching anywhere it matched before. The cause of this misbehavior is a feature designed to protect the re engine from infinite loops, which in some cases prevents patterns from matching where they should. I think this feature should at least be properly documented. By "properly" I mean that it should be possible to reconstruct the exact behavior from the documentation, because the implementation is not particularly easy to understand. |
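To check how a given interpreter treats the different quantifiers, something like the following could be run. It is only a sketch: the two patterns from the report are not reproduced on this page, so the body and tail used here are borrowed from the `{1,2}` pattern quoted later in the thread, and the helper name `matches` is made up for illustration. No expected output is shown because the result depends on the Python version.

```python
import re

# Assumption: the reported patterns share this body and tail and differ only
# in the quantifier; both pieces are taken from the pattern quoted later in
# the thread.
BODY = r'(?:()|(?(1)()|z))'
TAIL = r'(?(2)a|z)'


def matches(quantifier: str, text: str = 'a') -> bool:
    """Return True if BODY + quantifier + TAIL matches at the start of text."""
    return re.match(BODY + quantifier + TAIL, text) is not None


if __name__ == '__main__':
    for quantifier in ('{2}', '{0,2}', '{1,2}'):
        print(quantifier, matches(quantifier))
```

If `{2}` reports a match and `{0,2}` does not, the interpreter still shows the behaviour described above.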
Hi there, if anyone's able to provide any guidance on this issue I'd be happy to take a look into it. Is this a behaviour that is feasible to fix, or should this just be documented in some way as suggested by Evgeny? |
Suppose you had a pattern:
It would advance one character on each iteration of the * until the . failed to match. The text is finite, so it would stop matching eventually. Now suppose you had a pattern:
On each iteration of the * it wouldn't advance, so it would keep matching forever. A way to avoid that is to stop the * if it hasn't advanced.

The example pattern shows that there's still a problem. It advances if a group has matched, but that group doesn't match until the first iteration, after the test, and does not, itself, advance. The * stops because it hasn't advanced, but, in this instance, that doesn't mean it never will.

The solution is for the * to check not only whether it has advanced, but also whether a group has changed. (Strictly speaking, the latter check is needed only if the repeated part tests whether a group also in the repeated part has changed, but it's probably not worth "optimising" for that possibility.)

In the regex module, it increments a "capture changed" counter whenever any group is changed (a group's first match or a change to a group's span). That makes it easier for the * to check. The code needs to save that counter for backtracking and restore it when backtracking.

I've mentioned only the *, but the same remarks apply to + and {...}, except that the {...} should keep repeating until it has reached its prescribed minimum. |
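The two guards described in this comment can be illustrated with a toy loop. This is a sketch under loud assumptions, not the code of `re` or of the regex module: the names `star`, `dot`, `empty` and `group_chain` are invented here, there is no backtracking, and a snapshot comparison of the captures stands in for the "capture changed" counter. `group_chain` is a deliberately simplified model of a repeated part whose progress shows up only in its groups.

```python
from typing import Callable, Dict, Optional, Tuple

Groups = Dict[int, Tuple[int, int]]
# Hypothetical stand-in for the repeated part: takes (text, pos, groups),
# may record captures, and returns the new position or None on failure.
Body = Callable[[str, int, Groups], Optional[int]]


def dot(text: str, pos: int, groups: Groups) -> Optional[int]:
    """Model of `.`: consume one character if one is available."""
    return pos + 1 if pos < len(text) else None


def empty(text: str, pos: int, groups: Groups) -> Optional[int]:
    """Model of an empty repeated part: always matches, never advances."""
    return pos


def group_chain(text: str, pos: int, groups: Groups) -> Optional[int]:
    """Simplified model: capture group 1 if unset, otherwise group 2."""
    target = 1 if 1 not in groups else 2
    groups[target] = (pos, pos)
    return pos


def star(body: Body, text: str, check_captures: bool):
    """Repeat `body` greedily and return (pos, groups, iterations).

    The loop stops when the body fails, or when an iteration neither
    advances nor (if check_captures is set) changes any capture.
    """
    pos, groups, iterations = 0, {}, 0
    while True:
        before = dict(groups)          # snapshot instead of a counter
        new_pos = body(text, pos, groups)
        if new_pos is None:
            break
        iterations += 1
        advanced = new_pos != pos
        captures_changed = groups != before
        pos = new_pos
        if not advanced and not (check_captures and captures_changed):
            break
    return pos, groups, iterations


if __name__ == '__main__':
    print(star(dot, 'abc', False))          # (3, {}, 3)
    print(star(empty, 'abc', False))        # (0, {}, 1)
    print(star(group_chain, 'a', False))    # (0, {1: (0, 0)}, 1)
    print(star(group_chain, 'a', True))     # (0, {1: (0, 0), 2: (0, 0)}, 3)
```

With only the "has it advanced?" test, the loop over `group_chain` stops after one iteration and group 2 is never set. With the extra capture check it runs until an iteration changes nothing, so it still terminates.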
Thanks for the explanation Matthew, I'll take a further look at some point in the coming weeks. |
Hi Matthew, thank you for your suggestions of where to start. |
Matthew referred to the code of the regex module (of which he is the author). |
Ah thank you very much Serhiy, that's super helpful! |
I've got a bit confused and am doubting myself - is the below output expected?
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
>>> m.groups()
('', '')
>>> m = re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
>>> m.groups()
('', None)
The first pattern doesn't behave as I would (probably naively) expect given Matthew's explanation of this bug. Wouldn't the bug cause the match to fail with {1,2} as well? Does anyone have any thoughts? |
It's been many years since I looked at the code, and there have been changes since then, so some of the details might not be correct. As to how it should behave:
re.match('(?:()|(?(1)()|z)){1,2}(?(2)a|z)', 'a')
Iteration 1.
re.match('(?:()|(?(1)()|z)){1,2}(?(1)a|z)', 'a')
Iteration 1. |
Hi Matthew, Serhiy, I tried to identify the right places in re to fix things but have found it a bit difficult. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.