Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backreferences make case-insensitive regex fail on non-ASCII strings. #60892

Closed
pyos mannequin opened this issue Dec 14, 2012 · 17 comments
Closed

Backreferences make case-insensitive regex fail on non-ASCII strings. #60892

pyos mannequin opened this issue Dec 14, 2012 · 17 comments
Assignees
Labels
easy topic-regex type-bug An unexpected behavior, bug, or error

Comments

@pyos
Copy link
Mannequin

pyos mannequin commented Dec 14, 2012

BPO 16688
Nosy @birkenfeld, @pitrou, @vstinner, @ezio-melotti, @serhiy-storchaka
Files
  • issue16688.patch
  • issue16688#2.patch
  • issue16688#3.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2012-12-29.21:50:41.667>
    created_at = <Date 2012-12-14.22:19:33.947>
    labels = ['expert-regex', 'easy', 'type-bug']
    title = 'Backreferences make case-insensitive regex fail on non-ASCII strings.'
    updated_at = <Date 2012-12-30.09:22:27.096>
    user = 'https://bugs.python.org/pyos'

    bugs.python.org fields:

    activity = <Date 2012-12-30.09:22:27.096>
    actor = 'georg.brandl'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2012-12-29.21:50:41.667>
    closer = 'serhiy.storchaka'
    components = ['Regular Expressions']
    creation = <Date 2012-12-14.22:19:33.947>
    creator = 'pyos'
    dependencies = []
    files = ['28321', '28325', '28332']
    hgrepos = []
    issue_num = 16688
    keywords = ['patch', 'easy']
    message_count = 17.0
    messages = ['177518', '177519', '177523', '177532', '177556', '177572', '177573', '177574', '177576', '177580', '177614', '177616', '177618', '177620', '178540', '178541', '178562']
    nosy_count = 9.0
    nosy_names = ['georg.brandl', 'pitrou', 'vstinner', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'serhiy.storchaka', 'pyos']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue16688'
    versions = ['Python 3.3', 'Python 3.4']

    @pyos
    Copy link
    Mannequin Author

    pyos mannequin commented Dec 14, 2012

    The title says it all: if a regular expression that makes use of backreferences is compiled with re.I flag, it will always fail when matched against a string that contains characters outside of U+0000-U+00FF range. I've been unable to further narrow the bug down.

    A simple example:

        >>> import re
        >>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
        >>> r.findall('aa')  # works as expected
        ['a']
        >>> r.findall('aa bcd')  # still works
        ['a']
        >>> r.findall('aa Ā')  # ord('Ā') == 0x0100
        []

    The same code works as expected in Python 3.2:

        >>> r.findall('aa Ā')
        ['a']

    @pyos pyos mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Dec 14, 2012
    @ezio-melotti
    Copy link
    Member

    It works on 2.7 too, and fails on 3.3/3.x.
    Maybe it's related to PEP-393?

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 15, 2012

    In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

        while (p < e) {
            if (ctx->ptr >= end ||
                SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
                RETURN_FAILURE;
            p += state->charsize;
            ctx->ptr += state->charsize;
        }

    However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

        while (p < e) {
            if (ctx->ptr >= end ||
                state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
                RETURN_FAILURE;
            p++;
            ctx->ptr += state->charsize;
        }

    (In both cases 'p' is of type 'char*'.)

    The problem appears to be that the latter is still using '*p' and 'p++' and is thus always working with chars (it gets and advances 1 byte at a time instead of 1, 2 or 4 bytes for Unicode).

    @serhiy-storchaka
    Copy link
    Member

    Good analysis, Matthew. Are you want to submit a patch?

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 15, 2012

    OK, here's a patch.

    @vstinner
    Copy link
    Member

    Can someone check if there is no other similar regression (introduced
    by the PEP-393)?

    2012/12/15 Serhiy Storchaka <report@bugs.python.org>:

    Changes by Serhiy Storchaka <storchaka@gmail.com>:

    ----------
    stage: needs patch -> patch review


    Python tracker <report@bugs.python.org>
    <http://bugs.python.org/issue16688\>


    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 16, 2012

    I found another bug while looking through the source.

    On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;
    

    where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

    If the byte after the end of the string is 0 then you get this:

    >>> # Good:
    >>> re.search(r"\x00{1,3}", "a\x00\x00").span()
    (1, 3)
    >>> # Bad:
    >>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
    (1, 4)

    I'll keep looking before submitting a patch.

    1 similar comment
    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 16, 2012

    I found another bug while looking through the source.

    On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;
    

    where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

    If the byte after the end of the string is 0 then you get this:

    >>> # Good:
    >>> re.search(r"\x00{1,3}", "a\x00\x00").span()
    (1, 3)
    >>> # Bad:
    >>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
    (1, 4)

    I'll keep looking before submitting a patch.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 16, 2012

    I haven't found any other issues, so here's the second patch.

    @serhiy-storchaka
    Copy link
    Member

    The patches LGTM. How about adding a test?

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 16, 2012

    Here are some tests for the issue.

    @serhiy-storchaka
    Copy link
    Member

    The second test pass on unpatched Python.

    @mrabarnett
    Copy link
    Mannequin

    mrabarnett mannequin commented Dec 16, 2012

    Oops! :-( Now corrected.

    @serhiy-storchaka
    Copy link
    Member

    LGTM.

    Matthew, can you please submit a contributor form?

    http://python.org/psf/contrib/contrib-form/
    http://python.org/psf/contrib/

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 29, 2012
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 29, 2012

    New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
    Issue bpo-16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
    http://hg.python.org/cpython/rev/44a4f9289faa

    New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
    Issue bpo-16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
    http://hg.python.org/cpython/rev/c59ee1ff6f27

    @serhiy-storchaka
    Copy link
    Member

    Fixed. Thank you for a patch, Matthew. I hope to see more your patches.

    @birkenfeld
    Copy link
    Member

    I think you will, Matthew being MRAB on the mailing lists :)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    easy topic-regex type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants