Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-433030: Add support of atomic grouping in regular expressions #31982

Merged
merged 9 commits into from
Mar 21, 2022

Conversation

serhiy-storchaka
Copy link
Member

@serhiy-storchaka serhiy-storchaka commented Mar 18, 2022

Copy link
Member

@tim-one tim-one left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know anything about re's implementation. Does anyone? 😉

Lib/test/test_re.py Outdated Show resolved Hide resolved
Doc/library/re.rst Show resolved Hide resolved
successful, continues to match the rest of the pattern following it. If the
subsequent pattern fails to match, the stack can only be unwound to a point
*before* the ``(?>...)`` because once exited, the expression, known as an
:dfn:`Atomic Group`, has thrown away all stack points within itself. Thus,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 'stack point' the same as 'backtrack point'?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure yet.

Lib/test/test_re.py Outdated Show resolved Hide resolved
@tim-one
Copy link
Member

tim-one commented Mar 18, 2022

BTW, I'll add that I think these (atomic groups and possessive quantifiers) are, and by far, the most important things Python's engine is missing among "modern" features. Although, ya, I would rather see MRAB fold his regex engine in, since it has essentially all "modern" features already.

Comment on lines +696 to +697
capture = True
atomic = False
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use a three-state flag (capturing/non-capturing/atomic) instead of two boolean flags. It could even be a tiny bit faster if use built-in singletons True/False/None. But the current code may be clearer.

@serhiy-storchaka serhiy-storchaka marked this pull request as ready for review March 19, 2022 10:29
@serhiy-storchaka
Copy link
Member Author

I did have doubts about this feature, because it is an advanced feature not supported in older regular expression implementations, and it is difficult to learning.

But some regular expressions (e.g. in textwrap) can be simplified with atomic groups, and we have found recently that atomic groups can allow to fix some security issues. So there is a need of atomic groups in the stdlib. Slowly they will be used in user code too.

@tim-one
Copy link
Member

tim-one commented Mar 19, 2022

I'd say, to the contrary, that these are easy to learn. Indeed, that's why they're near-universal now among other regexp engines. The possibility for backtracking is rarely needed for lexical analysis, and gimmicks that promise not to backtrack save users from worlds of unintended quadratic, cubic, ..., even exponential time (non-)matching disaster cases.

@tim-one
Copy link
Member

tim-one commented Mar 20, 2022

Example: a while back, our fnmatch.py was found to take exponential time in some non-matching cases, a "security hole". This was entirely due to unintended and unwanted backtracking in re. But there's no straightforward way now to stop that from happening. Instead I redid the regexp construction in a highly convoluted way, relying on that matches done inside a lookahead assertion don't support backtracking. So do the "real" matching in a lookahead assertion, wrap the guts of the lookahead assertion inside a named group, then do the actual matching of the characters via a backreference to that named group.

But I expect there are only a relative handful of people now who could repair that if another bug turns up, because the workaround was so obscure and technical.

With atomic groups, though, the workaround could be entirely straightforward - indeed, it would be obvious.

@serhiy-storchaka
Copy link
Member Author

This is a really impressive example. I shudder at the thought that I would have to repeat this work in a modified copy of fnmatch.translate(). Now I think I'll just use regex and atomic grouping in that code.

If there are no objections I am going to merge this PR tomorrow. I can't say that I understand this code 100%, but on the other hand I don't see anything openly suspicious, the tests passed successfully, and the core has remained stable for many years.

@tim-one
Copy link
Member

tim-one commented Mar 21, 2022

I agree merging is the best thing to do, and especially since it's early in the release cycle.

Some encouragement: while I haven't studied Python's re engine, as a general principle engines "like this" use much the same implementation code for atomic groups and forward lookahead assertions. Both try to match starting from the current position, and once they find something that matches, they're done, period - if, after getting out of the group, something later fails to match, they do no internal backtracking. The only "real" difference is that the atomic group consumes the substring it matched (advances the current search position), but the assertion does not. That's why I expect Fredrick said (on the bpo report) that most of the code to do atomic groups was already there.

After you merge this, I intend to change fnmatch.py to exploit it. It would give me real joy to rip out the excruciating cleverness I added 😄.

@tim-one tim-one self-requested a review March 21, 2022 03:40
@serhiy-storchaka
Copy link
Member Author

There are some differences in a code of atomic groups and lookahead assertions. I have tested them, and atomic groups seems work correctly. For example re.fullmatch(r'(?=a+)', 'a') returns None, but re.fullmatch(r'(?>a+)', 'a') returns a match. I'll continue to search further.

But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

@serhiy-storchaka
Copy link
Member Author

Other bug: re.findall(r'a*+', 'aab') and re.findall(r'a?+', 'aab') hang.

Well, I wrote the code which allows non-possessive repetition to work correctly in these cases, so I know how to fix it.

@serhiy-storchaka serhiy-storchaka merged commit 345b390 into python:main Mar 21, 2022
@serhiy-storchaka serhiy-storchaka deleted the re-atomic-groups branch March 21, 2022 16:28
@thatbirdguythatuknownot
Copy link
Contributor

But I have found a bug in possessive repetition. For example, re.fullmatch(r'a++', 'ab') should return None, but returns a match.

@serhiy-storchaka Shouldn't that not be a bug and shouldn't that match a?

@tim-one
Copy link
Member

tim-one commented Mar 21, 2022

@thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.

@thatbirdguythatuknownot
Copy link
Contributor

@thatbirdguythatuknownot, that's a bug. Note that fullmatch() is being used. That requires matching the entire string, not just a proper prefix. The trailing b in the string isn't matched.

I see. Thanks for clarifying.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants