Regular expressions with multiple repeat codes #71987

Closed
vadmium opened this issue Aug 19, 2016 · 9 comments
Assignees
Labels
3.7 (EOL) end of life topic-regex type-bug An unexpected behavior, bug, or error

Comments

@vadmium
Member

vadmium commented Aug 19, 2016

BPO 27800
Nosy @terryjreedy, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka
Files
  • multiple-repeat.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/vadmium'
    closed_at = <Date 2016-10-15.03:24:45.930>
    created_at = <Date 2016-08-19.12:07:00.104>
    labels = ['expert-regex', 'type-bug', '3.7']
    title = 'Regular expressions with multiple repeat codes'
    updated_at = <Date 2016-10-15.03:24:45.928>
    user = 'https://github.com/vadmium'

    bugs.python.org fields:

    activity = <Date 2016-10-15.03:24:45.928>
    actor = 'martin.panter'
    assignee = 'martin.panter'
    closed = True
    closed_date = <Date 2016-10-15.03:24:45.930>
    closer = 'martin.panter'
    components = ['Regular Expressions']
    creation = <Date 2016-08-19.12:07:00.104>
    creator = 'martin.panter'
    dependencies = []
    files = ['44356']
    hgrepos = []
    issue_num = 27800
    keywords = ['patch']
    message_count = 9.0
    messages = ['273107', '273133', '273147', '273148', '273178', '274344', '274347', '278682', '278690']
    nosy_count = 7.0
    nosy_names = ['terry.reedy', 'ezio.melotti', 'mrabarnett', 'r.david.murray', 'python-dev', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue27800'
    versions = ['Python 2.7', 'Python 3.5', 'Python 3.6', 'Python 3.7']

    @vadmium
    Member Author

    vadmium commented Aug 19, 2016

    In the documentation for the “re” module, it says repetition codes like {4} and “*” operate on the preceding regular expression. But even though “a{4}” is a valid expression, the obvious way to apply a “*” repetition to it fails:

    >>> re.compile("a{4}*")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/proj/python/cpython/Lib/re.py", line 223, in compile
        return _compile(pattern, flags)
      File "/home/proj/python/cpython/Lib/re.py", line 292, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/home/proj/python/cpython/Lib/sre_compile.py", line 555, in compile
        p = sre_parse.parse(p, flags)
      File "/home/proj/python/cpython/Lib/sre_parse.py", line 792, in parse
        p = _parse_sub(source, pattern, 0)
      File "/home/proj/python/cpython/Lib/sre_parse.py", line 406, in _parse_sub
        itemsappend(_parse(source, state))
      File "/home/proj/python/cpython/Lib/sre_parse.py", line 610, in _parse
        source.tell() - here + len(this))
    sre_constants.error: multiple repeat at position 4

    As a workaround, I found I can wrap the inner repetition in (?:...):

    >>> re.compile("(?:a{4})*")
    re.compile('(?:a{4})*')

    The problems with the workaround are (a) it is far from obvious, and (b) it adds more complicated syntax. Either this limitation should be documented, or if there is no good reason for it, it should be lifted. It is not clear if my workaround is entirely valid, or if I just found a way to bypass some sanity check.
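A minimal sketch of the behaviour described above, for anyone who wants to reproduce it. The variable names are illustrative only, and the exact wording of the error message may differ between Python versions:

```python
import re

# Stacking "*" directly on "{4}" is rejected when the pattern is compiled.
try:
    re.compile("a{4}*")
except re.error as exc:
    stacked_error = str(exc)  # expected to mention "multiple repeat"
else:
    stacked_error = None

# Wrapping the inner repetition in a non-capturing group compiles fine.
grouped = re.compile("(?:a{4})*")
full = grouped.fullmatch("aaaaaaaa")   # two runs of four "a"s
partial = grouped.fullmatch("aaaaa")   # five is not a multiple of four

print(stacked_error, bool(full), bool(partial))
```

The grouped form matches only strings whose length is a multiple of four, which suggests the workaround really does express "zero or more repetitions of a{4}" rather than bypassing a check.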

    My original use case was scanning a base-64 encoding for bpo-27799:

    # Without the second level of brackets, this raises a "multiple repeat" error
    chunk_re = br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*'
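A hedged sketch of how the pattern above might be used. It assumes the pattern is compiled with re.VERBOSE, since it contains literal spaces for readability; the issue text does not say which flags were used. The sample inputs are made up for illustration:

```python
import re

# Assumption: re.VERBOSE is needed so the spaces in the pattern are ignored.
chunk_re = re.compile(
    br'(?: (?: [^A-Za-z0-9+/=]* [A-Za-z0-9+/=] ){4} )*',
    re.VERBOSE,
)

# Characters outside the base-64 alphabet (here a newline) are skipped
# inside each group of four alphabet characters.
whole = chunk_re.match(b"QUJD\nREVG").group()

# An incomplete trailing group of four is left unmatched.
trailing = chunk_re.match(b"QUJDRE").group()

print(whole, trailing)
```

The outer (?: ... )* repeats whole four-character groups, so the match always ends on a group boundary.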

    @vadmium vadmium added topic-regex type-bug An unexpected behavior, bug, or error labels Aug 19, 2016
    @bitdancer
    Member

    It seems perfectly logical and consistent to me. {4} is a repeat count, as is *. You get the same error if you do 'a?*', and the same bypass if you do '(a?)*' (though I haven't tested if that does anything useful :). You don't need the ?:, as far as I can tell, you just need to have the * modifying a group, making the group the "preceding regular expression".
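To illustrate the point about groups, here is a small sketch comparing a capturing group with a non-capturing one. Both accept the outer repetition; the difference is only whether the group is recorded in the match. Variable names are illustrative:

```python
import re

capturing = re.compile("(a{4})*")
non_capturing = re.compile("(?:a{4})*")

# Both forms accept two runs of four "a"s.
m_cap = capturing.fullmatch("aaaaaaaa")
m_non = non_capturing.fullmatch("aaaaaaaa")

# A repeated capturing group keeps only its last repetition.
last_repetition = m_cap.group(1)

# The non-capturing form defines no groups at all.
group_counts = (capturing.groups, non_capturing.groups)

print(last_repetition, group_counts, m_non.groups())
```

So the ?: is optional for making the pattern compile, but it avoids creating a group that the caller may not want.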

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Aug 19, 2016

    "*" and the other quantifiers ("+", "?" and "{...}") operate on the preceding _item_, not the entire preceding expression. For example, "ab*" means "a" followed by zero or more repeats of "b".

    You're not allowed to use multiple quantifiers together. The proper way is to use the non-capturing "(?:...)".

    It's too late to change that because some of them already have a special meaning when used after another quantifier: "a*?" is a lazy quantifier, as are "a+?", "a??" and "a{1,4}?".

    Many other regex implementations, including the "regex" module, use an additional "+" to signify a possessive quantifier: "a*+", "a++", "a?+" and "a{1,4}+".

    That just leaves the additional "*", which is treated as an error in all the other regex implementations that I'm aware of.
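A short sketch of the two points above that can be checked with the standard "re" module: a quantifier applies to the preceding item only, and a trailing "?" turns a quantifier lazy. Possessive quantifiers are left out because their availability depends on the implementation and version. Variable names are illustrative:

```python
import re

# "ab*" is "a" followed by zero or more "b"; the "*" does not cover "a".
item_scope = re.match("ab*", "abbb").group()
single_a = re.match("ab*", "aab").group()

# To repeat "ab" as a unit, it has to be grouped.
group_scope = re.match("(?:ab)*", "ababa").group()

# Greedy takes as many repetitions as allowed; lazy takes as few.
greedy = re.match("a{1,4}", "aaaaaa").group()
lazy = re.match("a{1,4}?", "aaaaaa").group()

print(item_scope, single_a, group_scope, greedy, lazy)
```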

    @terryjreedy
    Member

    This appears to be a doc issue to clarify that * cannot directly follow a repetition code. I believe there have been other (non)bug reports like this before.

    @vadmium
    Member Author

    vadmium commented Aug 20, 2016

    Okay, so it sounds like my usage is valid if I add the brackets. I will try to come up with a documentation patch at some stage. The reason why it is not supported without brackets is to maintain a bit of consistency with the question mark (?), which modifies the preceding quantifier, and with the plus sign (+), which is also a modifier in other implementations.

    For the record, GNU grep does seem to accept my expression (although POSIX says this is undefined, and neither supports lazy or possessive quantifiers):

    $ grep -E -o 'a{2}*' <<< "aaaaa"
    aaaa

    However pcregrep, which supports lazy (?) and possessive (+) quantifiers, doesn’t like my expression:

    $ pcregrep -o 'a{2}*' <<< "aaaaa"
    pcregrep: Error in command-line regex at offset 4: nothing to repeat
    [Exit 2]
    $ pcregrep -o '(?:a{2})*' <<< "aaaaa"
    aaaa
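For comparison, a rough Python analogue of the pcregrep -o run above, as a sketch only. It filters out the empty matches that "(?:a{2})*" also produces, on the assumption that only the non-empty match is of interest:

```python
import re

pattern = re.compile("(?:a{2})*")

# Collect the non-empty matches in "aaaaa"; the pattern can also match
# the empty string, so those results are dropped here.
matches = [m.group() for m in pattern.finditer("aaaaa") if m.group()]

print(matches)
```

As with pcregrep, four of the five characters are consumed, since only whole pairs can be repeated.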

    @vadmium
    Member Author

    vadmium commented Sep 4, 2016

    Here is a patch for the documentation.

    @serhiy-storchaka
    Member

    LGTM. Thanks Martin.

    @python-dev
    Mannequin

    python-dev mannequin commented Oct 15, 2016

    New changeset 5f7d7e079e39 by Martin Panter in branch '3.5':
    Issue bpo-27800: Document limitation and workaround for multiple RE repetitions
    https://hg.python.org/cpython/rev/5f7d7e079e39

    New changeset 1f2ca7e4b64e by Martin Panter in branch '3.6':
    Issue bpo-27800: Merge RE repetition doc from 3.5 into 3.6
    https://hg.python.org/cpython/rev/1f2ca7e4b64e

    New changeset 98456ab88ab0 by Martin Panter in branch 'default':
    Issue bpo-27800: Merge RE repetition doc from 3.6
    https://hg.python.org/cpython/rev/98456ab88ab0

    New changeset 94f02193f00f by Martin Panter in branch '2.7':
    Issue bpo-27800: Document limitation and workaround for multiple RE repetitions
    https://hg.python.org/cpython/rev/94f02193f00f

    @vadmium
    Member Author

    vadmium commented Oct 15, 2016

    I committed my patch as it was. I understand Silent Ghost’s objection was mainly that they thought the new paragraph or its positioning wouldn’t be very useful, but hopefully it is better than nothing. Perhaps in the future, the documentation could be restructured with subsections for repetition qualifiers and other kinds of special codes, which may help.

    @vadmium vadmium closed this as completed Oct 15, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022