re module: number of named groups is limited to 100 max #66627

Closed
1st1 opened this issue Sep 18, 2014 · 10 comments

Labels
expert-regex, stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@1st1
Member

1st1 commented Sep 18, 2014

BPO 22437
Nosy @pitrou, @vstinner, @ezio-melotti, @bitdancer, @serhiy-storchaka, @1st1
Files
  • re_maxgroups.patch
  • re_maxgroups_dynamic.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2014-09-29.20:15:38.324>
    created_at = <Date 2014-09-18.17:39:41.994>
    labels = ['expert-regex', 'type-feature', 'library']
    title = 're module: number of named groups is limited to 100 max'
    updated_at = <Date 2014-09-29.20:15:38.323>
    user = 'https://github.com/1st1'

    bugs.python.org fields:

    activity = <Date 2014-09-29.20:15:38.323>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2014-09-29.20:15:38.324>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)', 'Regular Expressions']
    creation = <Date 2014-09-18.17:39:41.994>
    creator = 'yselivanov'
    dependencies = []
    files = ['36654', '36682']
    hgrepos = []
    issue_num = 22437
    keywords = ['patch']
    message_count = 10.0
    messages = ['227055', '227058', '227060', '227063', '227064', '227066', '227237', '227635', '227820', '227825']
    nosy_count = 8.0
    nosy_names = ['pitrou', 'vstinner', 'ezio.melotti', 'mrabarnett', 'r.david.murray', 'python-dev', 'serhiy.storchaka', 'yselivanov']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue22437'
    versions = ['Python 3.5']

    @1st1
    Member Author

    1st1 commented Sep 18, 2014

    While writing a lexer for JavaScript, I hit the limit on the number of named groups in a single regexp: 100. The check is in the sre_compile.py:compile() function, and there is even an XXX comment about it.

    Unfortunately, I'm not an expert in this module, so I'm not sure whether this check can be lifted, or at least whether the number can be bumped to 200 or 500 (why is it 100, by the way?)

    Please share your thoughts.
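
A minimal sketch of how a generated lexer pattern runs into this limit, assuming one named group per token type. The token names (`TOK0` … `TOK149`) and the literal token texts are hypothetical and only serve to push the group count past 100.

```python
import re

# Hypothetical token names; a real lexer would use meaningful ones.
token_names = ["TOK{0}".format(i) for i in range(150)]

# One named group per token, joined into a single alternation.
pattern_source = "|".join(
    "(?P<{0}>tok{1}\\b)".format(name, i) for i, name in enumerate(token_names)
)

try:
    lexer_re = re.compile(pattern_source)
except (AssertionError, re.error) as exc:
    # Interpreters that still enforce the group limit reject the pattern here.
    lexer_re = None
    print("compile failed:", exc)
else:
    match = lexer_re.match("tok120")
    # lastgroup names the alternative that matched, i.e. the token type.
    print(lexer_re.groups, match.lastgroup)
```

On an interpreter without the limit, the pattern compiles with 150 groups and `lastgroup` identifies the matching token as `TOK120`.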

    @1st1 added the stdlib, expert-regex, and type-feature labels Sep 18, 2014
    @bitdancer
    Member

    bitdancer commented Sep 18, 2014

    It is 100 to avoid a syntactic ambiguity between numbered group references and octal escapes, if I remember correctly. I can't remember whether that constraint still applies in Python 3, where octal notation was made stricter in general.
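
The ambiguity can be illustrated with a short sketch, assuming the current behavior of the `re` parser: a backslash followed by three octal digits is read as a character code rather than as a group number.

```python
import re

# A short numeric escape is a backreference: \1 repeats what group 1 matched.
backref_match = re.fullmatch(r"(a)\1", "aa")

# Three octal digits form a character escape: \100 is chr(0o100), i.e. "@",
# not a reference to a group numbered 100.
octal_match = re.fullmatch(r"\100", "@")

print(backref_match is not None, octal_match is not None)
```

So a three-digit reference such as `\100` cannot be told apart from an existing octal escape.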

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Sep 18, 2014

    In the regex module, I borrowed the \g<...> escape from .sub's replacement string to provide an alternative way to refer to a group in a pattern, and that let me remove the limit.

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Sep 18, 2014

    There are two reasons for this limitation. The first is the one David mentioned: there is no syntax to backreference a group with a number > 99 (although there is such a syntax for conditional groups and for substitutions). The second is that the current implementation of the regexp engine uses an array of constant size for groups.

    Here is a patch which increases the static limit to 1000 groups. It also allows specifying an arbitrary group number in the form "(?P=number)". This is consistent with the syntax of conditional groups and substitutions.
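
As a hedged illustration of the substitution syntax mentioned above, the sketch below uses `\g<N>` in a replacement template to name a group with a number above 99. It assumes an interpreter that accepts more than 100 groups (the issue targets Python 3.5); the pattern and text are made up for the example.

```python
import re

group_count = 120

# 120 capturing groups, each matching exactly one character.
pattern = re.compile("(.)" * group_count)

# 120 characters cycling through the lowercase alphabet.
text = "".join(chr(ord("a") + i % 26) for i in range(group_count))

# In a replacement template, \g<120> refers to group 120 without any
# octal ambiguity, so the whole match is replaced by its last character.
last_char = pattern.sub(r"\g<120>", text)
print(last_char == text[-1])
```

No comparable numeric form exists inside the pattern itself, which is the gap the proposed "(?P=number)" syntax was meant to fill.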

    @serhiy-storchaka self-assigned this Sep 18, 2014
    @1st1
    Member Author

    1st1 commented Sep 18, 2014

    Serhiy,

    This is awesome!

    Is it possible to split the patch in two, and commit the part that just increases the group limit to 3.4 as well?

    Thank you

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Sep 18, 2014

    This is definitely not a bug fix. Maybe Matthew will commit it to the regex module, and then you could use regex instead of re.

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Sep 21, 2014

    Here is a patch which removes the static limit. It is much more complicated than the first patch, and I would prefer to apply the first patch first. Aren't 1000 groups enough for everyone?

    @1st1
    Member Author

    1st1 commented Sep 26, 2014

    I'm fine with either one, Serhiy. The static one looks good to me.

    @python-dev
    Mannequin

    python-dev mannequin commented Sep 29, 2014

    New changeset 0b85ea4bd1af by Serhiy Storchaka in branch 'default':
    Issue bpo-22437: Number of capturing groups in regular expression is no longer
    https://hg.python.org/cpython/rev/0b85ea4bd1af

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Sep 29, 2014

    Thank you Antoine for your review.

    To avoid a discrepancy between re and regex (and other engines), I have committed only part of the dynamic patch, without adding support for backreferences with an index over 99. This limit is unlikely to be reached in a hand-written regular expression, and in a generated regular expression you can use named groups.

    I found that backreference syntax is one of the most inconsistent things in regular expressions. There are at least 8 different variants (\N, \gN, \g<N>, \g{N}, \k<N>, \k'N', \k{N}, (?P=N)), and \g<N> in Perl has a different meaning.
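
A minimal sketch of the named-group workaround for generated regular expressions, assuming an interpreter with the group limit removed. The group names (`g0` … `g149`) are hypothetical and exist only for this example.

```python
import re

group_count = 150

# Generate 150 named groups, each matching a single "a" or "b".
parts = ["(?P<g{0}>[ab])".format(i) for i in range(group_count)]

# Refer back to the last group (number 150) by name. The numeric form
# "\150" would be parsed as an octal escape, not as a group reference.
parts.append("(?P=g{0})".format(group_count - 1))
generated_re = re.compile("".join(parts))

subject = "a" * (group_count - 1) + "b"

# The backreference must repeat what the last group captured ("b").
matches_repeat = generated_re.fullmatch(subject + "b") is not None
rejects_other = generated_re.fullmatch(subject + "a") is None
print(matches_repeat, rejects_other)
```

Because the reference goes by name, the group's numeric index never appears in the pattern, so the two-digit restriction on `\N` does not come into play.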

    @ezio-melotti transferred this issue from another repository Apr 10, 2022