Regular expressions with multiple repeat codes #71987
Comments
In the documentation for the “re” module, it says repetition codes like {4} and “*” operate on the preceding regular expression. But even though “a{4}” is a valid expression, the obvious way to apply a “*” repetition to it fails:
>>> re.compile("a{4}*")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/proj/python/cpython/Lib/re.py", line 223, in compile
return _compile(pattern, flags)
File "/home/proj/python/cpython/Lib/re.py", line 292, in _compile
p = sre_compile.compile(pattern, flags)
File "/home/proj/python/cpython/Lib/sre_compile.py", line 555, in compile
p = sre_parse.parse(p, flags)
File "/home/proj/python/cpython/Lib/sre_parse.py", line 792, in parse
p = _parse_sub(source, pattern, 0)
File "/home/proj/python/cpython/Lib/sre_parse.py", line 406, in _parse_sub
itemsappend(_parse(source, state))
File "/home/proj/python/cpython/Lib/sre_parse.py", line 610, in _parse
source.tell() - here + len(this))
sre_constants.error: multiple repeat at position 4
As a workaround, I found I can wrap the inner repetition in (?:...):
>>> re.compile("(?:a{4})*")
re.compile('(?:a{4})*')
The problems with the workaround are (a) it is far from obvious, and (b) it adds more complicated syntax. Either this limitation should be documented, or if there is no good reason for it, it should be lifted. It is not clear if my workaround is entirely valid, or if I just found a way to bypass some sanity check.
My original use case was scanning a base-64 encoding for bpo-27799:
# Without the second level of brackets, this raises a "multiple repeat" error
chunk_re = br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*'
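The report above can be reproduced with a short script. This is a minimal sketch, not part of the original report: the helper name `compiles` and the sample bytes are made up for illustration, and it assumes the spaced-out `chunk_re` pattern is meant to be compiled with `re.VERBOSE` (the original message does not show the flags).

```python
import re

def compiles(pattern, flags=0):
    """Hypothetical helper: report whether `pattern` is accepted by re.compile."""
    try:
        re.compile(pattern, flags)
    except re.error:
        return False
    return True

# The quantifier "*" directly after "{4}" is rejected ("multiple repeat").
bare_ok = compiles("a{4}*")
# Wrapping the inner repetition in a non-capturing group is accepted.
grouped_ok = compiles("(?:a{4})*")

# The base-64 pattern from the report, assuming re.VERBOSE so that the
# spaces inside the pattern are ignored.
chunk_re = re.compile(
    br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*',
    re.VERBOSE,
)

# Illustrative input: two 4-character base-64 groups separated by a newline,
# followed by a stray character that cannot start another full group.
sample = b"QUJD\nREVG!"
matched = chunk_re.match(sample).group()

print(bare_ok, grouped_ok, matched)
```

The outer `*` only ever consumes whole groups of four base-64 characters, which is why the trailing `!` is left unmatched.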
It seems perfectly logical and consistent to me. {4} is a repeat count, as is *. You get the same error if you do 'a?*', and the same bypass if you do '(a?)*' (though I haven't tested if that does anything useful :). You don't need the ?:, as far as I can tell, you just need to have the * modifying a group, making the group the "preceding regular expression".
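A quick check of the claim that any group works, capturing or not. This is an illustrative sketch only; the variable names are invented here. The practical difference is that the capturing form records the last repetition as group 1.

```python
import re

# Both group styles make the group the item that "*" repeats.
capturing = re.compile(r"(a{4})*")
non_capturing = re.compile(r"(?:a{4})*")

text = "aaaaaaaa"  # two repetitions of "aaaa"

cap_match = capturing.fullmatch(text)
noncap_match = non_capturing.fullmatch(text)

# The capturing form exposes one group holding the final repetition;
# the non-capturing form defines no groups at all.
print(capturing.groups, cap_match.group(1))
print(non_capturing.groups, noncap_match.group())

# The optional-item variant mentioned above is also accepted once grouped.
optional_grouped = re.compile(r"(a?)*")
```

If the captured text is not needed, the non-capturing form avoids creating an unused group.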
"*" and the other quantifiers ("+", "?" and "{...}") operate on the preceding _item_, not the entire preceding expression. For example, "ab*" means "a" followed by zero or more repeats of "b".
You're not allowed to use multiple quantifiers together. The proper way is to use the non-capturing "(?:...)".
It's too late to change that because some of them already have a special meaning when used after another quantifier: "a*?" is a lazy quantifier, as are "a+?", "a??" and "a{1,4}?".
Many other regex implementations, including the "regex" module, use an additional "+" to signify a possessive quantifier: "a*+", "a++", "a?+" and "a{1,4}+".
That just leaves the additional "*", which is treated as an error in all the other regex implementations that I'm aware of.
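The two points made here, that a quantifier binds to the preceding item only and that a trailing "?" turns a quantifier lazy, can be seen with the standard "re" module. A small sketch with made-up test strings:

```python
import re

# "ab*" is "a" followed by zero or more "b", not repeats of "ab".
item_only = re.fullmatch(r"ab*", "abbb") is not None
not_whole_expr = re.fullmatch(r"ab*", "abab") is None
# Repeating the whole "ab" needs a group.
grouped = re.fullmatch(r"(?:ab)*", "abab") is not None

# A "?" after another quantifier makes it lazy (match as little as possible).
greedy_star = re.match(r"a*", "aaa").group()      # takes all the "a"s
lazy_star = re.match(r"a*?", "aaa").group()       # takes nothing
lazy_plus = re.match(r"a+?", "aaa").group()       # takes exactly one
lazy_opt = re.match(r"a??", "aaa").group()        # takes nothing
lazy_range = re.match(r"a{1,4}?", "aaa").group()  # takes the minimum, one

print(item_only, not_whole_expr, grouped)
print(repr(greedy_star), repr(lazy_star), repr(lazy_plus),
      repr(lazy_opt), repr(lazy_range))
```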
This appears to be a doc issue to clarify that * cannot directly follow a repetition code. I believe there have been other (non)bug reports like this before.
Okay, so it sounds like my usage is valid if I add the brackets. I will try to come up with a documentation patch at some stage.
The reason why it is not supported without brackets is to maintain a bit of consistency with the question mark (?), which modifies the preceding quantifier, and with the plus sign (+), which is also a modifier in other implementations.
For the record, GNU grep does seem to accept my expression (although POSIX says this is undefined, and neither supports lazy or possessive quantifiers):
$ grep -E -o 'a{2}*' <<< "aaaaa"
aaaa
However pcregrep, which supports lazy (?) and possessive (+) quantifiers, doesn’t like my expression:
$ pcregrep -o 'a{2}*' <<< "aaaaa"
pcregrep: Error in command-line regex at offset 4: nothing to repeat
[Exit 2]
$ pcregrep -o '(?:a{2})*' <<< "aaaaa"
aaaa
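For comparison, a rough Python counterpart of the pcregrep session above. This is a sketch under assumptions: it imitates "-o" by collecting the non-empty matches, which is an approximation of that option rather than an exact emulation, and the variable names are invented here.

```python
import re

text = "aaaaa"

# Like pcregrep, Python's re rejects the ungrouped form.
try:
    re.compile(r"a{2}*")
    bare_error = None
except re.error as exc:
    bare_error = str(exc)

# The grouped form is accepted; keep only non-empty matches, roughly as
# "grep -o" would print them.
grouped_matches = [
    m.group() for m in re.finditer(r"(?:a{2})*", text) if m.group()
]

print(bare_error)
print(grouped_matches)
```

Only four of the five characters are matched because the group always consumes "a" in pairs.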
Here is a patch for the documentation.
LGTM. Thanks Martin.
New changeset 5f7d7e079e39 by Martin Panter in branch '3.5':
New changeset 1f2ca7e4b64e by Martin Panter in branch '3.6':
New changeset 98456ab88ab0 by Martin Panter in branch 'default':
New changeset 94f02193f00f by Martin Panter in branch '2.7':
I committed my patch as it was. I understand Silent Ghost’s objection was mainly that they thought the new paragraph or its positioning wouldn’t be very useful, but hopefully it is better than nothing. Perhaps in the future, the documentation could be restructured with subsections for repetition qualifiers and other kinds of special codes, which may help.