bpo-433030: Add support of atomic grouping in regular expressions #31982

serhiy-storchaka · 2022-03-18T19:00:08Z

https://bugs.python.org/issue433030

tim-one

I don't know anything about re's implementation. Does anyone? 😉

Lib/test/test_re.py

Doc/library/re.rst

terryjreedy · 2022-03-18T21:00:57Z

Doc/library/re.rst

+  successful, continues to match the rest of the pattern following it.  If the
+  subsequent pattern fails to match, the stack can only be unwound to a point
+  *before* the ``(?>...)`` because once exited, the expression, known as an
+  :dfn:`Atomic Group`, has thrown away all stack points within itself.  Thus,


Is 'stack point' the same as 'backtrack point'?

Not sure yet.
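Reading "stack point" as a saved backtracking position (an assumption; the thread leaves the question open), the behaviour described in the documentation hunk above can be sketched as follows. The patterns and input are made up for illustration, and the atomic form is guarded because it needs Python 3.11 or later.

```python
import re
import sys

text = "abc"

# An ordinary group keeps its alternatives: after "a" is tried and "c"
# fails to match "b", the engine backtracks into the group and tries "ab".
ordinary = re.fullmatch(r"(?:a|ab)c", text)
print(ordinary is not None)  # True

if sys.version_info >= (3, 11):
    # The atomic group exits after matching "a" and discards the untried
    # "ab" alternative, so the later failure on "c" cannot be repaired.
    atomic = re.fullmatch(r"(?>a|ab)c", text)
    print(atomic)  # None
```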

Lib/test/test_re.py

tim-one · 2022-03-18T21:38:20Z

BTW, I'll add that I think these (atomic groups and possessive quantifiers) are, and by far, the most important things Python's engine is missing among "modern" features. Although, ya, I would rather see MRAB fold his regex engine in, since it has essentially all "modern" features already.

serhiy-storchaka · 2022-03-19T10:18:30Z

Lib/sre_parse.py

+            capture = True
+            atomic = False


We could use a three-state flag (capturing/non-capturing/atomic) instead of two boolean flags. It could even be a tiny bit faster if we used the built-in singletons True/False/None. But the current code may be clearer.

serhiy-storchaka · 2022-03-19T10:34:17Z

I did have doubts about this feature, because it is an advanced feature not supported in older regular expression implementations, and it is difficult to learn.

But some regular expressions (e.g. in textwrap) can be simplified with atomic groups, and we have recently found that atomic groups can help fix some security issues. So there is a need for atomic groups in the stdlib. Gradually they will be used in user code too.

tim-one · 2022-03-19T17:55:19Z

I'd say, to the contrary, that these are easy to learn. Indeed, that's why they're near-universal now among other regexp engines. The possibility for backtracking is rarely needed for lexical analysis, and gimmicks that promise not to backtrack save users from worlds of unintended quadratic, cubic, ..., even exponential time (non-)matching disaster cases.
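A minimal sketch of the kind of non-matching case described above. The pattern and input are hypothetical and not taken from the stdlib; the timings are only printed, since they depend on the machine and the engine version. The atomic variant is guarded because it needs Python 3.11 or later.

```python
import re
import sys
import time

# Hypothetical pattern with nested repetition.
BACKTRACKING = r"(?:a+)+b"
ATOMIC = r"(?>a+)+b"  # same pattern with the inner run made atomic

def timed_search(pattern, text):
    """Return (match_or_None, elapsed_seconds) for re.search."""
    start = time.perf_counter()
    result = re.search(pattern, text)
    return result, time.perf_counter() - start

# There is no "b" in the text, so every attempt has to fail.
text = "a" * 18 + "c"

plain_result, plain_time = timed_search(BACKTRACKING, text)
print("backtracking:", plain_result, f"{plain_time:.4f}s")

if sys.version_info >= (3, 11):
    atomic_result, atomic_time = timed_search(ATOMIC, text)
    print("atomic:      ", atomic_result, f"{atomic_time:.4f}s")
```

Lengthening the run of "a" characters is the simplest way to see how the two timings diverge; keep the increments small, because the backtracking form can slow down very quickly.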

tim-one · 2022-03-20T00:08:49Z

Example: a while back, our fnmatch.py was found to take exponential time in some non-matching cases, a "security hole". This was entirely due to unintended and unwanted backtracking in re. But there's no straightforward way now to stop that from happening. Instead I redid the regexp construction in a highly convoluted way, relying on the fact that matches done inside a lookahead assertion don't backtrack. So do the "real" matching in a lookahead assertion, wrap the guts of the lookahead assertion inside a named group, then do the actual matching of the characters via a backreference to that named group.

But I expect there are only a relative handful of people now who could repair that if another bug turns up, because the workaround was so obscure and technical.

With atomic groups, though, the workaround could be entirely straightforward - indeed, it would be obvious.
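The workaround described above can be sketched roughly as follows. This is an illustration of the general trick, not the actual fnmatch.py code; the helper name `emulate_atomic` and the group name `g0` are made up. The direct atomic form is guarded because it needs Python 3.11 or later.

```python
import re
import sys

def emulate_atomic(inner, name):
    """Hypothetical helper: wrap `inner` so it matches without backtracking.

    The lookahead does the real matching and captures what it matched;
    the backreference then consumes exactly that text.
    """
    return f"(?=(?P<{name}>{inner}))(?P={name})"

# Emulated form of an atomic "a+" followed by the literal "ab".
workaround = re.compile(emulate_atomic(r"a+", "g0") + "ab")
print(workaround.fullmatch("aaab"))  # None

# An ordinary group can give one "a" back, so it matches.
print(re.fullmatch(r"(?:a+)ab", "aaab") is not None)  # True

if sys.version_info >= (3, 11):
    # With atomic groups the same intent is stated directly.
    direct = re.compile(r"(?>a+)ab")
    print(direct.fullmatch("aaab"))  # None
```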

serhiy-storchaka · 2022-03-20T15:58:26Z

This is a really impressive example. I shudder at the thought that I would have to repeat this work in a modified copy of fnmatch.translate(). Now I think I'll just use regex and atomic grouping in that code.

If there are no objections I am going to merge this PR tomorrow. I can't say that I understand this code 100%, but on the other hand I don't see anything openly suspicious, the tests passed successfully, and the core has remained stable for many years.

tim-one · 2022-03-21T02:36:45Z

I agree merging is the best thing to do, and especially since it's early in the release cycle.

Some encouragement: while I haven't studied Python's re engine, as a general principle engines "like this" use much the same implementation code for atomic groups and forward lookahead assertions. Both try to match starting from the current position, and once they find something that matches, they're done, period - if, after getting out of the group, something later fails to match, they do no internal backtracking. The only "real" difference is that the atomic group consumes the substring it matched (advances the current search position), but the assertion does not. That's why I expect Fredrik said (on the bpo report) that most of the code to do atomic groups was already there.

After you merge this, I intend to change fnmatch.py to exploit it. It would give me real joy to rip out the excruciating cleverness I added 😄.
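The "only real difference" mentioned above, consuming versus not consuming, can be seen from the match spans. A small sketch with a made-up input; the atomic form is guarded because it needs Python 3.11 or later.

```python
import re
import sys

text = "aaab"

# A lookahead checks that "a+" can match here but consumes nothing.
lookahead = re.match(r"(?=a+)", text)
print(lookahead.span())  # (0, 0)

if sys.version_info >= (3, 11):
    # An atomic group matches the same run and advances past it.
    atomic = re.match(r"(?>a+)", text)
    print(atomic.span())  # (0, 3)
```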

serhiy-storchaka · 2022-03-21T07:47:45Z

There are some differences in the code for atomic groups and lookahead assertions. I have tested them, and atomic groups seem to work correctly. For example re.fullmatch(r'(?=a+)', 'a') returns None, but re.fullmatch(r'(?>a+)', 'a') returns a match. I'll continue to search further.

But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

serhiy-storchaka · 2022-03-21T08:05:44Z

Another bug: re.findall(r'a*+', 'aab') and re.findall(r'a?+', 'aab') hang.

Well, I wrote the code which allows non-possessive repetition to work correctly in these cases, so I know how to fix it.
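For reference, a sketch of the behaviour these reports expect once possessive repetition works correctly. It assumes a Python release (3.11 or later) that includes the fix, and the helper name `possessive_checks` is made up. Rather than hard-coding the findall() results, it compares each possessive form with its greedy counterpart, which should agree here because nothing follows the quantifier.

```python
import re
import sys

def possessive_checks():
    """Compare possessive quantifiers with their greedy counterparts."""
    results = {}
    # "a++" takes the "a" and never gives it back; "b" is left over,
    # so a full match is impossible.
    results["fullmatch"] = re.fullmatch(r"a++", "ab")
    # Nothing follows the quantifier, so there is nothing to backtrack
    # for: the possessive and greedy forms should find the same matches.
    results["star"] = (re.findall(r"a*+", "aab"), re.findall(r"a*", "aab"))
    results["optional"] = (re.findall(r"a?+", "aab"), re.findall(r"a?", "aab"))
    return results

if sys.version_info >= (3, 11):
    results = possessive_checks()
    print(results["fullmatch"])  # None
    print(results["star"][0] == results["star"][1])  # True
    print(results["optional"][0] == results["optional"][1])  # True
```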

thatbirdguythatuknownot · 2022-03-21T23:16:51Z

> But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

@serhiy-storchaka Isn't that correct behavior rather than a bug? Shouldn't it match a?

tim-one · 2022-03-21T23:39:19Z

@thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.
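The distinction between match() and fullmatch() does not depend on possessive quantifiers, so it can be illustrated with the plain greedy `a+` on any Python 3 release. A minimal sketch:

```python
import re

text = "ab"

prefix = re.match(r"a+", text)     # anchored at the start only
whole = re.fullmatch(r"a+", text)  # must also reach the end of the string

print(prefix.group())  # a
print(whole)           # None
```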

thatbirdguythatuknownot · 2022-03-22T00:11:51Z

> @thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.

I see. Thanks for clarifying.

bpo-433030: Add support of atomic grouping in regular expressions

a4f8ac0

serhiy-storchaka added the type-feature (a feature request or enhancement) label Mar 18, 2022

serhiy-storchaka requested a review from tim-one March 18, 2022 19:00

bedevere-bot added the awaiting core review label Mar 18, 2022

the-knights-who-say-ni added the CLA signed label Mar 18, 2022

tim-one reviewed Mar 18, 2022

terryjreedy reviewed Mar 18, 2022

serhiy-storchaka added 2 commits March 19, 2022 08:34

Polishing.

5066bd1

Update documentation.

8015968

serhiy-storchaka commented Mar 19, 2022

Fix incorrect use of the term role.

599109c

serhiy-storchaka marked this pull request as ready for review March 19, 2022 10:29

serhiy-storchaka added 3 commits March 20, 2022 17:08

Polishing.

711975d

Merge branch 'main' into re-atomic-groups

1ff76c3

Simplify the compliler code.

9d54059

tim-one self-requested a review March 21, 2022 03:40

serhiy-storchaka added 2 commits March 21, 2022 10:55

Fix possesive repetition of 1-character pattern

7e4f64a

Add the original author in Misc/ACKS.

b9e20f0

serhiy-storchaka merged commit 345b390 into python:main Mar 21, 2022

bedevere-bot removed the awaiting core review label Mar 21, 2022

serhiy-storchaka deleted the re-atomic-groups branch March 21, 2022 16:28

hugovk mentioned this pull request Apr 11, 2022

SRE: Atomic Grouping (?>...) is not supported #34627

Closed

ghost mentioned this pull request Apr 17, 2022

Path.rglob fails with *.* on Python 3.11.0a7 #91616

Closed

vxgmichel mentioned this pull request Sep 28, 2022

Make the prevent sync pattern generated from default_pattern.ignore stable Scille/parsec-cloud#3147

Merged

angus-lherrou mentioned this pull request Oct 22, 2022

Add Python 3.11's implementation of atomic groups and possessive quantifiers firasdib/Regex101#1930

Open
