Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RE parser too loose with {m,n} construct #41987

Closed
smontanaro opened this issue May 15, 2005 · 19 comments
Closed

RE parser too loose with {m,n} construct #41987

smontanaro opened this issue May 15, 2005 · 19 comments

Comments

@smontanaro
Copy link
Contributor

BPO 1202493
Nosy @smontanaro, @birkenfeld, @rhettinger, @josiahcarlson
Files
  • sre-brace-diff-2
  • sre-brace-diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2005-09-15.06:12:02.000>
    created_at = <Date 2005-05-15.21:59:16.000>
    labels = ['expert-regex']
    title = 'RE parser too loose with {m,n} construct'
    updated_at = <Date 2005-09-15.06:12:02.000>
    user = 'https://github.com/smontanaro'

    bugs.python.org fields:

    activity = <Date 2005-09-15.06:12:02.000>
    actor = 'georg.brandl'
    assignee = 'niemeyer'
    closed = True
    closed_date = None
    closer = None
    components = ['Regular Expressions']
    creation = <Date 2005-05-15.21:59:16.000>
    creator = 'skip.montanaro'
    dependencies = []
    files = ['1701', '1702']
    hgrepos = []
    issue_num = 1202493
    keywords = []
    message_count = 19.0
    messages = ['25327', '25328', '25329', '25330', '25331', '25332', '25333', '25334', '25335', '25336', '25337', '25338', '25339', '25340', '25341', '25342', '25343', '25344', '25345']
    nosy_count = 5.0
    nosy_names = ['skip.montanaro', 'georg.brandl', 'rhettinger', 'niemeyer', 'josiahcarlson']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue1202493'
    versions = ['Python 2.5']

    @smontanaro
    Copy link
    Contributor Author

    This seems wrong to me:

    >>> re.match("(UNIX{})", "UNIX{}").groups()
    ('UNIX',)

    With no numbers or commas, "{}" should not be considered
    special in the pattern. The docs identify three numeric
    repetition possibilities: {m}, {m,} and {m,n}. There's no
    description of {} meaning anything. Either the docs should
    say {} implies {1,1}, {} should have no special meaning, or
    an exception should be raised during compilation of the
    regular expression.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    It's interesting what other RE implementations do with this
    ambiguity:
    Perl treats {} as literal in REs, as Skip proposes.
    Ruby does, too, but issues a warning about } being unescaped.
    GNU (e)grep v2.5.1 allows a bare {} only if it is at the
    start of a RE, but matches it literally then.
    GNU sed v4.1.4 does never allow it.
    GNU awk v3.1.4 is gracious and acts like Perl.

    Attached is a patch that fixes this behaviour in the
    appearing "common sense".

    @rhettinger
    Copy link
    Contributor

    Logged In: YES
    user_id=80475

    IMO, the simplest rule is that braces always be considered
    special. This accommodates future extensions, simplifies
    the re compiler, and makes it easier to know what needs to
    be escaped.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    So, should a {} raise an error, or warn like in Ruby?

    @rhettinger
    Copy link
    Contributor

    Logged In: YES
    user_id=80475

    I prefer Skip's third option, raising an exception during
    compilation. This is an re syntax error. Treat it the same
    way that we handle similar situations with regular Python:

    >>> a[]
    SyntaxError: invalid syntax

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Okay. Attaching patch which does that.

    BTW, these things are currently allowed too (treated as
    literals):

    "{"
    "{x"
    "{x}"
    "{x,y}"
    "{1,x}"
    etc.

    The patch changes that, too.

    @smontanaro
    Copy link
    Contributor Author

    Logged In: YES
    user_id=44345

    In the absence of strong technical reasons, I'd vote to do what Perl
    does. I believe the assumption all along has been that most people
    coming to Python who already know how to use regular expressions are
    Perl programmers. It wouldn't seem to make sense to throw little land
    mines in their paths. I realize that explicit is better than implicit, but
    practicality beats purity.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    I just realized that e.g. the string module uses unescaped
    braces, so I think we should not become overly strict as it
    would break much code...

    Perhaps the original patch (sre-brace-diff) is better...

    @smontanaro
    Copy link
    Contributor Author

    Logged In: YES
    user_id=44345

    Can you elaborate? I fail to see what the string module
    has to do with the re module. Can you give an example
    of code that would break?

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Raymond said that braces should always be considered
    special. This includes constructs like "{(?P<braces>.*)}"
    which the string module uses, and which would be a syntax
    error then.

    @rhettinger
    Copy link
    Contributor

    Logged In: YES
    user_id=80475

    Hmm, it looks like they cannot be treated differently
    without breaking backwards compatability.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Then, I think, we should follow Perl's behaviour and treat
    "{}" as a literal, just like every other brace construct
    that isn't a repeat specifier.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Any more objections against treating "{}" as literal?

    The impact on existing code will be minimal, as I presume no
    one will write "{}" in a RE instead of "{1,1}" (well, who
    writes "{1,1}" anyway...).

    @niemeyer
    Copy link
    Mannequin

    niemeyer mannequin commented Aug 31, 2005

    Logged In: YES
    user_id=7887

    I support Skip's opinion on following whatever perl is currently doing, if
    that won't lead to unexpected errors on current running code which was
    considered sane (expecting {} to behave like {1,1} is not sane :-).

    Your original patch looks under-optimal though (look at the tests around
    it). I'll fix it, or if you prefer to do it by yourself, I may apply the
    patch/review it/whatever. :-)

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    No, you're the expert, so you'll get the honor of fixing it. :P

    @niemeyer
    Copy link
    Mannequin

    niemeyer mannequin commented Sep 14, 2005

    Logged In: YES
    user_id=7887

    Fixed in:

    Lib/sre_parse.py: 1.64 -> 1.65
    Lib/test/test_re.py: 1.55 -> 1.56
    Misc/NEWS: 1.1360 -> 1.1361

    Notice that perl will also handle constructs like '{,2}' as
    literals, while Python will consider them as '{0,2}'. I
    think it's too late to change that one though, as this
    behavior may be relied upon in code out there.

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Will you backport the fix?

    @josiahcarlson
    Copy link
    Mannequin

    josiahcarlson mannequin commented Sep 15, 2005

    Logged In: YES
    user_id=341410

    Was it a bug, or was it merely confusing semantics?

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    I would say bug.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants