bpo-433030: Add support of atomic grouping in regular expressions (GH-31982)

* Atomic grouping: (?>...).
* Possessive quantifiers: x++, x*+, x?+, x{m,n}+.
  Equivalent to (?>x+), (?>x*), (?>x?), (?>x{m,n}).

Co-authored-by: Jeffrey C. Jacobs <timehorse@users.sourceforge.net>
serhiy-storchaka and Jeffrey C. Jacobs committed Mar 21, 2022
1 parent 2bde682 commit 345b390
Showing 11 changed files with 593 additions and 92 deletions.
54 changes: 54 additions & 0 deletions Doc/library/re.rst
@@ -154,6 +154,30 @@ The special characters are:
    characters as possible will be matched. Using the RE ``<.*?>`` will match
    only ``'<a>'``.

+.. index::
+   single: *+; in regular expressions
+   single: ++; in regular expressions
+   single: ?+; in regular expressions
+
+``*+``, ``++``, ``?+``
+  Like the ``'*'``, ``'+'``, and ``'?'`` qualifiers, those where ``'+'`` is
+  appended also match as many times as possible.
+  However, unlike the true greedy qualifiers, these do not allow
+  backtracking when the expression following them fails to match.
+  These are known as :dfn:`possessive` qualifiers.
+  For example, ``a*a`` will match ``'aaaa'`` because the ``a*`` will match
+  all 4 ``'a'``\ s, but, when the final ``'a'`` finds nothing left, the
+  expression is backtracked so that in the end the ``a*`` ends up matching
+  3 ``'a'``\ s total, and the fourth ``'a'`` is matched by the final ``'a'``.
+  However, when ``a*+a`` is used to match ``'aaaa'``, the ``a*+`` will
+  match all 4 ``'a'``\ s, but when the final ``'a'`` fails to find any more
+  characters to match, the expression cannot be backtracked and will thus
+  fail to match.
+  ``x*+``, ``x++`` and ``x?+`` are equivalent to ``(?>x*)``, ``(?>x+)``
+  and ``(?>x?)`` correspondingly.
+
+  .. versionadded:: 3.11
+
 .. index::
    single: {} (curly brackets); in regular expressions
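
The greedy versus possessive behaviour documented in this hunk can be checked with a short script. This is only a sketch, assuming Python 3.11 or later for the possessive form; the `full_text` helper and the version guard are illustrative and not part of the commit.

```python
import re
import sys

# Possessive quantifiers are only parsed by Python 3.11+ (assumption: older
# interpreters reject the pattern, so the possessive case is skipped there).
HAS_POSSESSIVE = sys.version_info >= (3, 11)


def full_text(pattern, text):
    """Hypothetical helper: return the matched text, or None on failure."""
    m = re.fullmatch(pattern, text)
    return m.group(0) if m else None


# Greedy: a* takes all four characters, then gives one back for the final a.
greedy = full_text(r"a*a", "aaaa")

# Possessive: a*+ keeps all four characters, so the final a has nothing left.
possessive = full_text(r"a*+a", "aaaa") if HAS_POSSESSIVE else None

print(greedy, possessive)
```

On a supporting interpreter the greedy pattern should match the whole string while the possessive one should not match at all.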

@@ -178,6 +202,21 @@ The special characters are:
    6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
    while ``a{3,5}?`` will only match 3 characters.

+``{m,n}+``
+  Causes the resulting RE to match from *m* to *n* repetitions of the
+  preceding RE, attempting to match as many repetitions as possible
+  *without* establishing any backtracking points.
+  This is the possessive version of the qualifier above.
+  For example, on the 6-character string ``'aaaaaa'``, ``a{3,5}+aa``
+  attempts to match 5 ``'a'`` characters, then, requiring 2 more ``'a'``\ s,
+  will need more characters than available and thus fail, while
+  ``a{3,5}aa`` will match with ``a{3,5}`` capturing 5, then 4 ``'a'``\ s
+  by backtracking and then the final 2 ``'a'``\ s are matched by the final
+  ``aa`` in the pattern.
+  ``x{m,n}+`` is equivalent to ``(?>x{m,n})``.
+
+  .. versionadded:: 3.11
+
 .. index:: single: \ (backslash); in regular expressions

 ``\``
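
The ``{m,n}+`` example from this hunk can be reproduced as follows. This is a rough sketch rather than part of the commit: the capturing group around ``a{3,5}`` is added here only to expose how many characters the greedy form ends up with, and the version guard is an assumption.

```python
import re
import sys

HAS_POSSESSIVE = sys.version_info >= (3, 11)  # guard is an assumption
TEXT = "aaaaaa"

# Greedy: a{3,5} first takes 5 characters, then backtracks to 4 so that the
# trailing "aa" can match; the capturing group exposes the final count.
greedy = re.match(r"(a{3,5})aa", TEXT)
greedy_count = len(greedy.group(1))

# Possessive: a{3,5}+ takes 5 characters and keeps them, leaving only one
# 'a' for the trailing "aa", so the match fails.
possessive = re.match(r"a{3,5}+aa", TEXT) if HAS_POSSESSIVE else None

print(greedy_count, possessive)
```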
@@ -336,6 +375,21 @@ The special characters are:
    .. versionchanged:: 3.7
       The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.

+``(?>...)``
+   Attempts to match ``...`` as if it was a separate regular expression, and
+   if successful, continues to match the rest of the pattern following it.
+   If the subsequent pattern fails to match, the stack can only be unwound
+   to a point *before* the ``(?>...)`` because once exited, the expression,
+   known as an :dfn:`atomic group`, has thrown away all stack points within
+   itself.
+   Thus, ``(?>.*).`` would never match anything because first the ``.*``
+   would match all characters possible, then, having nothing left to match,
+   the final ``.`` would fail to match.
+   Since there are no stack points saved in the atomic group, and there is
+   no stack point before it, the entire expression would thus fail to match.
+
+   .. versionadded:: 3.11
+
 .. index:: single: (?P<; in regular expressions

 ``(?P<name>...)``
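
A small sketch of the atomic group behaviour described in this hunk, assuming Python 3.11 or later. The alternation example and the `matches` helper are illustrative additions, not taken from the commit; they show that an atomic group also discards untried alternatives, not only quantifier backtracking points.

```python
import re
import sys

HAS_ATOMIC = sys.version_info >= (3, 11)  # guard is an assumption


def matches(pattern, text):
    """Hypothetical helper: True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


# Ordinary group: after "a" is tried and "c" fails, the engine backtracks
# into the group and tries "ab" instead.
plain = matches(r"(?:a|ab)c", "abc")

# Atomic group: once "a" has matched, the alternative "ab" is discarded,
# so "c" fails against "b" and there is nothing to retry.
atomic = matches(r"(?>a|ab)c", "abc") if HAS_ATOMIC else False

# The documented case: .* consumes everything and the final . cannot match.
never = (re.search(r"(?>.*).", "abc") is not None) if HAS_ATOMIC else False

print(plain, atomic, never)
```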
6 changes: 6 additions & 0 deletions Doc/whatsnew/3.11.rst
@@ -295,6 +295,12 @@ os
   instead of ``CryptGenRandom()`` which is deprecated.
   (Contributed by Dong-hee Na in :issue:`44611`.)

+re
+--
+
+* Atomic grouping (``(?>...)``) and possessive qualifiers (``*+``, ``++``,
+  ``?+``, ``{m,n}+``) are now supported in regular expressions.
+  (Contributed by Jeffrey C. Jacobs and Serhiy Storchaka in :issue:`433030`.)

 shutil
 ------
38 changes: 27 additions & 11 deletions Lib/sre_compile.py
@@ -17,11 +17,16 @@
 assert _sre.MAGIC == MAGIC, "SRE module mismatch"

 _LITERAL_CODES = {LITERAL, NOT_LITERAL}
-_REPEATING_CODES = {REPEAT, MIN_REPEAT, MAX_REPEAT}
 _SUCCESS_CODES = {SUCCESS, FAILURE}
 _ASSERT_CODES = {ASSERT, ASSERT_NOT}
 _UNIT_CODES = _LITERAL_CODES | {ANY, IN}

+_REPEATING_CODES = {
+    MIN_REPEAT: (REPEAT, MIN_UNTIL, MIN_REPEAT_ONE),
+    MAX_REPEAT: (REPEAT, MAX_UNTIL, REPEAT_ONE),
+    POSSESSIVE_REPEAT: (POSSESSIVE_REPEAT, SUCCESS, POSSESSIVE_REPEAT_ONE),
+}
+
 # Sets of lowercase characters which have the same uppercase.
 _equivalences = (
     # LATIN SMALL LETTER I, LATIN SMALL LETTER DOTLESS I
@@ -138,27 +143,21 @@ def _compile(code, pattern, flags):
             if flags & SRE_FLAG_TEMPLATE:
                 raise error("internal: unsupported template operator %r" % (op,))
             if _simple(av[2]):
-                if op is MAX_REPEAT:
-                    emit(REPEAT_ONE)
-                else:
-                    emit(MIN_REPEAT_ONE)
+                emit(REPEATING_CODES[op][2])
                 skip = _len(code); emit(0)
                 emit(av[0])
                 emit(av[1])
                 _compile(code, av[2], flags)
                 emit(SUCCESS)
                 code[skip] = _len(code) - skip
             else:
-                emit(REPEAT)
+                emit(REPEATING_CODES[op][0])
                 skip = _len(code); emit(0)
                 emit(av[0])
                 emit(av[1])
                 _compile(code, av[2], flags)
                 code[skip] = _len(code) - skip
-                if op is MAX_REPEAT:
-                    emit(MAX_UNTIL)
-                else:
-                    emit(MIN_UNTIL)
+                emit(REPEATING_CODES[op][1])
         elif op is SUBPATTERN:
             group, add_flags, del_flags, p = av
             if group:
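
The table-driven opcode selection in this hunk can be illustrated in isolation. The sketch below is not the real compiler: it uses plain strings as stand-ins for the opcode constants, and `emit_repeat` is a hypothetical helper that only mirrors the order of the `emit` calls shown above.

```python
# Stand-in opcode table; the real module maps named constants, not strings.
# Tuple layout: (general opcode, terminator, single-item opcode).
REPEATING_CODES = {
    "MIN_REPEAT": ("REPEAT", "MIN_UNTIL", "MIN_REPEAT_ONE"),
    "MAX_REPEAT": ("REPEAT", "MAX_UNTIL", "REPEAT_ONE"),
    "POSSESSIVE_REPEAT": ("POSSESSIVE_REPEAT", "SUCCESS",
                          "POSSESSIVE_REPEAT_ONE"),
}


def emit_repeat(op, lo, hi, body, simple):
    """Hypothetical helper: build the flat code list for one repeat."""
    general, terminator, single = REPEATING_CODES[op]
    code = [single if simple else general]
    skip = len(code)
    code.append(0)                     # placeholder for the skip offset
    code += [lo, hi]
    code += body                       # already-compiled repeated item
    if simple:
        code.append("SUCCESS")
        code[skip] = len(code) - skip  # offset computed after SUCCESS
    else:
        code[skip] = len(code) - skip  # offset computed before terminator
        code.append(terminator)
    return code


print(emit_repeat("POSSESSIVE_REPEAT", 0, 5, ["LITERAL", 97], simple=True))
print(emit_repeat("MAX_REPEAT", 0, 5, ["LITERAL", 97], simple=False))
```

Indexing one table by the repeat kind replaces the paired `if op is MAX_REPEAT ... else ...` branches, which is what lets the possessive variant be added without another level of branching.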
@@ -169,6 +168,17 @@ def _compile(code, pattern, flags):
             if group:
                 emit(MARK)
                 emit((group-1)*2+1)
+        elif op is ATOMIC_GROUP:
+            # Atomic Groups are handled by starting with an Atomic
+            # Group op code, then putting in the atomic group pattern
+            # and finally a success op code to tell any repeat
+            # operations within the Atomic Group to stop eating and
+            # pop their stack if they reach it
+            emit(ATOMIC_GROUP)
+            skip = _len(code); emit(0)
+            _compile(code, av, flags)
+            emit(SUCCESS)
+            code[skip] = _len(code) - skip
         elif op in SUCCESS_CODES:
             emit(op)
         elif op in ASSERT_CODES:
@@ -709,7 +719,8 @@ def print_2(*args):
                     else:
                         print_(FAILURE)
                 i += 1
-            elif op in (REPEAT, REPEAT_ONE, MIN_REPEAT_ONE):
+            elif op in (REPEAT, REPEAT_ONE, MIN_REPEAT_ONE,
+                        POSSESSIVE_REPEAT, POSSESSIVE_REPEAT_ONE):
                 skip, min, max = code[i: i+3]
                 if max == MAXREPEAT:
                     max = 'MAXREPEAT'
@@ -725,6 +736,11 @@ def print_2(*args):
                 print_(op, skip, arg, to=i+skip)
                 dis_(i+2, i+skip)
                 i += skip
+            elif op is ATOMIC_GROUP:
+                skip = code[i]
+                print_(op, skip, to=i+skip)
+                dis_(i+1, i+skip)
+                i += skip
             elif op is INFO:
                 skip, flags, min, max = code[i: i+4]
                 if max == MAXREPEAT:
5 changes: 4 additions & 1 deletion Lib/sre_constants.py
@@ -13,7 +13,7 @@

 # update when constants are added or removed

-MAGIC = 20171005
+MAGIC = 20220318

 from _sre import MAXREPEAT, MAXGROUPS

@@ -97,6 +97,9 @@ def _makecodes(names):
     REPEAT_ONE
     SUBPATTERN
     MIN_REPEAT_ONE
+    ATOMIC_GROUP
+    POSSESSIVE_REPEAT
+    POSSESSIVE_REPEAT_ONE

     GROUPREF_IGNORE
     IN_IGNORE
32 changes: 26 additions & 6 deletions Lib/sre_parse.py
@@ -25,7 +25,7 @@

 WHITESPACE = frozenset(" \t\n\r\v\f")

-_REPEATCODES = frozenset({MIN_REPEAT, MAX_REPEAT})
+_REPEATCODES = frozenset({MIN_REPEAT, MAX_REPEAT, POSSESSIVE_REPEAT})
 _UNITCODES = frozenset({ANY, RANGE, IN, LITERAL, NOT_LITERAL, CATEGORY})

 ESCAPES = {
@@ -190,6 +190,10 @@ def getwidth(self):
                 i, j = av.getwidth()
                 lo = lo + i
                 hi = hi + j
+            elif op is ATOMIC_GROUP:
+                i, j = av.getwidth()
+                lo = lo + i
+                hi = hi + j
             elif op is SUBPATTERN:
                 i, j = av[-1].getwidth()
                 lo = lo + i
@@ -675,16 +679,22 @@ def _parse(source, state, verbose, nested, first=False):
                 if group is None and not add_flags and not del_flags:
                     item = p
             if sourcematch("?"):
+                # Non-Greedy Match
                 subpattern[-1] = (MIN_REPEAT, (min, max, item))
+            elif sourcematch("+"):
+                # Possessive Match (Always Greedy)
+                subpattern[-1] = (POSSESSIVE_REPEAT, (min, max, item))
             else:
+                # Greedy Match
                 subpattern[-1] = (MAX_REPEAT, (min, max, item))

         elif this == ".":
             subpatternappend((ANY, None))

         elif this == "(":
             start = source.tell() - 1
-            group = True
+            capture = True
+            atomic = False
             name = None
             add_flags = 0
             del_flags = 0
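
The suffix check added to the repeat branch above can be sketched on its own. This is a toy stand-in, not the parser's code: `classify_repeat` is a hypothetical function, the kinds are plain strings, and it works on a string index instead of the tokenizer's `sourcematch`.

```python
def classify_repeat(pattern, pos):
    """Hypothetical helper: classify the quantifier that ends just before pos.

    Returns (kind, new_pos), where new_pos skips any consumed suffix.
    """
    if pattern.startswith("?", pos):
        return "MIN_REPEAT", pos + 1         # non-greedy, e.g. a*?
    if pattern.startswith("+", pos):
        return "POSSESSIVE_REPEAT", pos + 1  # possessive, e.g. a*+
    return "MAX_REPEAT", pos                 # plain greedy, e.g. a*


# Index 2 is the character right after the "*" in each pattern.
print(classify_repeat("a*?b", 2))
print(classify_repeat("a*+b", 2))
print(classify_repeat("a*b", 2))
```

Because the ``?`` check comes first, a suffix such as ``a*?+`` would be read as a non-greedy repeat followed by a further ``+``, which matches the order of the branches in the hunk.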
@@ -726,7 +736,7 @@ def _parse(source, state, verbose, nested, first=False):
                                            len(char) + 2)
                 elif char == ":":
                     # non-capturing group
-                    group = None
+                    capture = False
                 elif char == "#":
                     # comment
                     while True:
@@ -800,6 +810,10 @@ def _parse(source, state, verbose, nested, first=False):
                     subpatternappend((GROUPREF_EXISTS, (condgroup, item_yes, item_no)))
                     continue

+                elif char == ">":
+                    # non-capturing, atomic group
+                    capture = False
+                    atomic = True
                 elif char in FLAGS or char == "-":
                     # flags
                     flags = _parse_flags(source, state, char)
@@ -813,17 +827,19 @@ def _parse(source, state, verbose, nested, first=False):
                         continue

                     add_flags, del_flags = flags
-                    group = None
+                    capture = False
                 else:
                     raise source.error("unknown extension ?" + char,
                                        len(char) + 1)

             # parse group contents
-            if group is not None:
+            if capture:
                 try:
                     group = state.opengroup(name)
                 except error as err:
                     raise source.error(err.msg, len(name) + 1) from None
+            else:
+                group = None
             sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
                            not (del_flags & SRE_FLAG_VERBOSE))
             p = _parse_sub(source, state, sub_verbose, nested + 1)
@@ -832,7 +848,11 @@ def _parse(source, state, verbose, nested, first=False):
                                    source.tell() - start)
             if group is not None:
                 state.closegroup(group, p)
-            subpatternappend((SUBPATTERN, (group, add_flags, del_flags, p)))
+            if atomic:
+                assert group is None
+                subpatternappend((ATOMIC_GROUP, p))
+            else:
+                subpatternappend((SUBPATTERN, (group, add_flags, del_flags, p)))

         elif this == "^":
             subpatternappend((AT, AT_BEGINNING))