-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lookback with group references incorrect (two issues?) #53425
Comments
from re import compile # these work as expected assert compile('(a)b(?<=b)(c)').match('abc') # but when you add groups, you get bugs assert not compile('(?:(a)|(x))b(?<=(?(2)x|c))c').match('abc') # matches! # but lookahead works as expected assert compile('(?:(a)|(x))b(?=(?(2)x|c))c').match('abc') # these are similar but, in my opinion, shouldn't even compile assert not compile('(a)b(?<=(?(2)x|c))(c)').match('abc') # matches! assert compile('(a)b(?=(?(2)x|c))(c)').match('abc') # this is the error we should see above |
I hope the above is clear enough (you need to stare at the regexps for a time) - basically, lookback with a group conditional is not as expected (it appears to be evaluated as lookahead?). Also, some patterns compile that probably shouldn't. The re package only supports (according to the docs) lookback on expressions whose length is known. So I guess it's also possible that (?(n)pat1|pat2) should always fail that, even when len(pat1) = len(pat2)? Also, the generally excellent unit tests for the re package don't have much coverage for lookback (I am writing my own regexp lib and it passes all the re unit tests but had a similar bug - that's how I found this one...). |
If it's any help, these are the equivalent tests as I think they should be (you'll need to translate engine(parse(... to compile(...) |
Thanks very much for the reports.
Yes, this seems likely to me. Possibly even the compile stage should fail, though I've no idea how feasible it is to make that happen. Unfortunately I'm not sure that any of the currently active Python developers is particularly well versed in the intricacies of the re module. The most realistic option here may be just to document the restrictions on lookbehind assertions more clearly. Unless you're able to provide a patch? |
I thought someone was working on the re module these days? I thought there I'd seen some issues with patches etc? Anyway, short term, sorry - no patch. Medium/long term, yes it's possible, but please don't rely on it. The simplest way to document it is as you suggest, I think - just extend the qualifier on lookback requiring fixed length to exclude references to groups (it does seem to *bind* groups correctly on lookback, so there's no need to exclude them completely). |
Well, there's bpo-2636. It doesn't seem likely that that work will land in core Python any time soon, though. |
Should a regex compile if a group is referenced before it's defined? Consider this:
Other regex implementations permit forward references to groups. BTW, I had a look at the re module, found it too difficult, and so started on my own implementation of the matching engine (available on PyPI). |
Ah good point, thanks. |
Given the comment from Matthew Barnett in msg109399 "...I had a look at the re module, found it too difficult..." can this be closed as "won't fix"? |
Here is a patch which fixes lookbacks with group references and with group conditionals. I have used Andrew's examples as the base for tests. |
The patch also fixes bpo-814253. If there are no objections I'll commit it soon. |
If there are no objections I'm going to commit the patch soon. |
New changeset fac649bf2d10 by Serhiy Storchaka in branch '2.7': New changeset 9fcf4008b626 by Serhiy Storchaka in branch '3.4': New changeset 60fccf0aad83 by Serhiy Storchaka in branch 'default': |
The more I think about it, the more doubt. This patch added a behavior that is incompatible with the regex module. The regex module proceeds lookbehind assertions in the opposite direction, from right to left. This allows it to work with lookbehind assertions of non-fixed length. But the side effect is that in regex group reference in lookbehind assertion can refer only to a group defined right in the same lookbehind assertion (or defined left outside). In re now group reference in lookbehind assertion can refer only to a group defined left. This is likely to change in the future, which brings us to the problem of incompatibility. There are several quick ways to resolve the problem:
What is your decision Benjamin? Here is a patch against 2.7 which implements variant 3. |
New changeset d1f7c3f45ffe by Benjamin Peterson in branch '3.4': New changeset f385bc6e6e09 by Benjamin Peterson in branch 'default': New changeset 8a3807e15a1f by Benjamin Peterson in branch '2.7': |
I just backed out the change. Thanks for brining up the issue. |
What would be the best solution for 2.7? Here is a patch which forbids any group references in lookbehind assertions (they are not work currently and users shouldn't use them). |
On Sun, Nov 30, 2014, at 12:55, Serhiy Storchaka wrote:
You can pick. I just always favor not changing things for release
|
Updated documentation. If there are no objections I'll commit re_forbid_groupref_in_lookbehind-2.7_2.patch to 2.7 and 3.4. For 3.5 I prefer to add support of group references. |
New changeset b78195af96f5 by Serhiy Storchaka in branch 'default': New changeset 5387095b8675 by Serhiy Storchaka in branch '2.7': New changeset e295ad9be16d by Serhiy Storchaka in branch '3.4': |
Only warnings are raised in 2.7 and 3.4, so it will not break third party code that "works" by accident. In 3.5 only references to groups defined outside of lookbehind assertion work, so the behavior is compatible with regex. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: