-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
Implement comparison (x==y and x!=y) for _sre.SRE_Pattern #72913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Attached patch implements rich comparison for _sre.SRE_Pattern objects created by re.compile(). Comparison between patterns is used in the warnings module to not add duplicated filters, see issue bpo-18383: New changeset f57f4e33ba5e by Martin Panter in branch '3.5': For the warnings module, it became a problem in test_warnings since the Python test runner started to clear all caches. When re.purge() is called, re.compile() creates a new object, whereas with the cache it returns the same object and so the two patterns are equal since it's the same object. => see issue bpo-28688 |
Ten subtest in test_re fail with TypeError: unhashable type: '_sre.SRE_Pattern' |
Oops, right. Updated patch implements also hash() on patterns. |
Added comments on Rietveld. |
I hope you make it clear what you mean by 'equal', i.e. that it's comparing the pattern and the flags (AFAICT), so re.compile('(?x)a') != re.compile('(?x) a '). |
Patch version 3:
|
Ok, I hope that it's the last attempt: patch 5
|
There is a problem with locale-depending flags. The same pattern may be compiled with the LOCALE flag to different pattern objects in different locales. Perhaps we should compare the compiled code instead of pattern strings. Or both. |
Serhiy: "There is a problem with locale-depending flags. The same pattern may be compiled with the LOCALE flag to different pattern objects in different locales." Oh, I didn't know and you are right. "Perhaps we should compare the compiled code instead of pattern strings. Or both." PatternObject contains many fields. I used the following two fields which come from re.compile():
I considered that they were enough because pattern_repr() only displays these ones. Other fields:
weakreflist can be skipped, isbytes is already tested in my patch. Would it be possible to only compare code instead of pattern? What are groups, groupindex and indexgroup: should we also compare them? Maybe I can start from pattern_compare-4.patch and only add a test comparing code? |
'[abc]' and '[cba]' are compiled to the same code. Do we want to handle them as equal? This is implementation defined. |
Comparison must be fast. If the comparison is just memcmp(code1, I agree that we must put a similar somewhere to say that some parts |
I see two options:
Since this issue becomes a little ambiguous, I would target the patch to 3.7 only. Maybe we will find other subtle details or will decide to change the meaning of equality of pattern objects before releasing 3.7. |
New approach: patch 5 now compares indexgroup, groupindex and code instead of comparing pattern, to handle correctly two equal pattern strings compiled with the re.LOCALE flag and two different locales. The patch also converts indexgroup list to a tuple to be able to hash it. (It also prevents modification, but this is just a side effect, and groupindex remains a mutable dictionary.) _sre.compile() checks types which helps to identify a bug in unit tests. |
This looks too complicated. groups, indexgroup and groupindex are unambiguously derived from pattern string. If caching works different pattern strings are compiled to different pattern objects. Currently they are not equal, even if their codes are equal. And I don't see large need to consider them equal. |
Back to basis, patch 6:
I removed the @cpython_only unit test and rewrote test_pattern_compare_bytes() to make it easier to follow. re.compile('abc', re.IGNORECASE) is different than re.compile('ABC', re.IGNORECASE), but it's a deliberate choice to not test it. I consider that the behaviour can change depending on the Python implementation and in a future version. |
pattern_compare-6.patch LGTM. |
New changeset 5e8ef1493843 by Victor Stinner in branch '3.6': |
Serhiy Storchaka: "pattern_compare-6.patch LGTM." Thank you very much for your very useful reviews! I pushed the change. |
For stricter checks on _sre.compile() arguments, I created the issue bpo-28765: "_sre.compile(): be more strict on types of indexgroup and groupindex". |
+ && left->codesize && right->codesize); There is a typo. Should be: + && left->codesize == right->codesize); |
While we are here, it perhaps worth to add a fast path for self == other. |
New changeset 6b43d15fd2d7 by Victor Stinner in branch '3.6': New changeset c2cb70c97163 by Victor Stinner in branch '3.6': |
Serhiy Storchaka: "&& left->codesize && right->codesize)" Ooops! Fixed! "While we are here, it perhaps worth to add a fast path for self == other." Done. |
Misc/NEWS
so that it is managed by towncrier #552Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: