Backreferences make case-insensitive regex fail on non-ASCII strings. #60892

pyos · 2012-12-14T22:19:34Z

BPO	16688
Nosy	@birkenfeld, @pitrou, @vstinner, @ezio-melotti, @serhiy-storchaka
Files	issue16688.patch issue16688#2.patch issue16688#3.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/serhiy-storchaka'
closed_at = <Date 2012-12-29.21:50:41.667>
created_at = <Date 2012-12-14.22:19:33.947>
labels = ['expert-regex', 'easy', 'type-bug']
title = 'Backreferences make case-insensitive regex fail on non-ASCII strings.'
updated_at = <Date 2012-12-30.09:22:27.096>
user = 'https://bugs.python.org/pyos'

bugs.python.org fields:

activity = <Date 2012-12-30.09:22:27.096>
actor = 'georg.brandl'
assignee = 'serhiy.storchaka'
closed = True
closed_date = <Date 2012-12-29.21:50:41.667>
closer = 'serhiy.storchaka'
components = ['Regular Expressions']
creation = <Date 2012-12-14.22:19:33.947>
creator = 'pyos'
dependencies = []
files = ['28321', '28325', '28332']
hgrepos = []
issue_num = 16688
keywords = ['patch', 'easy']
message_count = 17.0
messages = ['177518', '177519', '177523', '177532', '177556', '177572', '177573', '177574', '177576', '177580', '177614', '177616', '177618', '177620', '178540', '178541', '178562']
nosy_count = 9.0
nosy_names = ['georg.brandl', 'pitrou', 'vstinner', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'python-dev', 'serhiy.storchaka', 'pyos']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue16688'
versions = ['Python 3.3', 'Python 3.4']

pyos · 2012-12-14T22:19:34Z

The title says it all: if a regular expression that uses backreferences is compiled with the re.I flag, it always fails when matched against a string that contains characters outside the U+0000-U+00FF range. I've been unable to narrow the bug down further.

A simple example:

    >>> import re
    >>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
    >>> r.findall('aa')  # works as expected
    ['a']
    >>> r.findall('aa bcd')  # still works
    ['a']
    >>> r.findall('aa Ā')  # ord('Ā') == 0x0100
    []

The same code works as expected in Python 3.2:

    >>> r.findall('aa Ā')
    ['a']
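The report can be turned into a small check script. This is a minimal sketch, not part of the original report: the sample strings and the `samples` and `results` names are made up for illustration. Each sample ends in a character with a larger code point, so under PEP 393 the strings are stored with 1, 1, 2 and 4 bytes per character. On an interpreter that includes the fix, every sample should give `['a']`; on an affected 3.3 build, the last two samples would be expected to give `[]`.

```python
import re

# Same pattern as in the report: a captured "a" followed by a
# case-insensitive backreference to it.
pattern = re.compile(r'(a)\1', re.I)

# Illustrative inputs (not from the report). Only the last character
# differs; it determines how many bytes per character the string uses.
samples = [
    ('ASCII only, 1 byte per char', 'aA bcd'),
    ('U+00FF, still 1 byte per char', 'aA \u00ff'),
    ('U+0100, 2 bytes per char', 'aA \u0100'),
    ('U+10000, 4 bytes per char', 'aA \U00010000'),
]

results = [(label, pattern.findall(text)) for label, text in samples]
for label, found in results:
    print('{}: {}'.format(label, found))
```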

ezio-melotti · 2012-12-14T22:30:04Z

It works on 2.7 too, and fails on 3.3/3.x.
Maybe it's related to PEP-393?

mrabarnett · 2012-12-15T00:41:27Z

In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is thus always working with chars (it gets and advances 1 byte at a time instead of 1, 2 or 4 bytes for Unicode).
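The effect of reading `*p` and stepping with `p++` can be reproduced in a few lines of Python. This is only a model of the loop quoted above, not the `_sre` code itself: `encode_fixed_width`, `char_at`, `ascii_lower` and `groupref_ignore` are hypothetical helpers, the buffer is assumed to be little-endian, and lowercasing is simplified to ASCII. The `fixed` flag switches between the byte-wise access from `SRE_OP_GROUPREF_IGNORE` and the character-wise access used in `SRE_OP_GROUPREF`.

```python
def encode_fixed_width(text, charsize):
    """Lay out text as fixed-width little-endian units (assumed layout)."""
    return b''.join(ord(ch).to_bytes(charsize, 'little') for ch in text)

def char_at(buf, offset, charsize):
    """Read one whole character of charsize bytes at a byte offset."""
    return int.from_bytes(buf[offset:offset + charsize], 'little')

def ascii_lower(code):
    """Simplified stand-in for state->lower(): ASCII letters only."""
    return code + 32 if 65 <= code <= 90 else code

def groupref_ignore(buf, p, e, ptr, end, charsize, fixed):
    """Model of the loop: does the text at ptr repeat the group [p, e)?"""
    while p < e:
        if ptr >= end:
            return False
        if fixed:
            group_char, step = char_at(buf, p, charsize), charsize
        else:
            group_char, step = buf[p], 1  # '*p' and 'p++': one byte only
        if ascii_lower(char_at(buf, ptr, charsize)) != ascii_lower(group_char):
            return False
        p += step
        ptr += charsize
    return True

# 'aa Ā' needs 2 bytes per character; group 1 is the first 'a'.
wide = encode_fixed_width('aa \u0100', 2)
buggy_wide = groupref_ignore(wide, 0, 2, 2, len(wide), 2, fixed=False)
fixed_wide = groupref_ignore(wide, 0, 2, 2, len(wide), 2, fixed=True)

# 'aa bcd' needs 1 byte per character, so both variants agree.
narrow = encode_fixed_width('aa bcd', 1)
buggy_narrow = groupref_ignore(narrow, 0, 1, 1, len(narrow), 1, fixed=False)

print(buggy_wide, fixed_wide, buggy_narrow)
```

In the 2-byte case the byte-wise variant compares the high byte of the first 'a' (zero) with the space that follows the second 'a', so the backreference fails even though the text matches. With 1 byte per character a byte and a character coincide, which is consistent with the report that only strings containing characters above U+00FF are affected.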

serhiy-storchaka · 2012-12-15T08:37:27Z

Good analysis, Matthew. Do you want to submit a patch?

mrabarnett · 2012-12-15T20:00:26Z

OK, here's a patch.

vstinner · 2012-12-15T23:01:50Z

Can someone check whether there are other similar regressions (introduced by PEP 393)?

mrabarnett · 2012-12-16T00:24:21Z

I found another bug while looking through the source.

On line 495 in function SRE_COUNT:

    if (maxcount < end - ptr && maxcount != 65535)
        end = ptr + maxcount*state->charsize;

where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters.

If the byte after the end of the string is 0 then you get this:

    >>> # Good:
    >>> re.search(r"\x00{1,3}", "a\x00\x00").span()
    (1, 3)
    >>> # Bad:
    >>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
    (1, 4)

I'll keep looking before submitting a patch.
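The arithmetic behind the two results can be checked with plain integers. This is a sketch under stated assumptions, not the committed patch: `clamp_end_bytes` mirrors the quoted condition, while `clamp_end_chars` is one possible correction that converts the remaining length to characters before comparing; both function names are hypothetical. Offsets are byte offsets into the string buffer, and 65535 is simply the value the quoted condition excludes.

```python
UNBOUNDED = 65535  # the value the quoted condition treats specially

def clamp_end_bytes(ptr, end, maxcount, charsize):
    """Quoted condition: maxcount (characters) vs end - ptr (bytes)."""
    if maxcount < end - ptr and maxcount != UNBOUNDED:
        end = ptr + maxcount * charsize
    return end

def clamp_end_chars(ptr, end, maxcount, charsize):
    """Possible correction: compare against the length in characters."""
    if maxcount < (end - ptr) // charsize and maxcount != UNBOUNDED:
        end = ptr + maxcount * charsize
    return end

# "a\x00\x00": 1 byte per character, scan starts at byte 1, buffer ends at 3.
narrow_end = clamp_end_bytes(ptr=1, end=3, maxcount=3, charsize=1)

# "\u0100\x00\x00": 2 bytes per character, scan starts at byte 2, ends at 6.
wide_end_buggy = clamp_end_bytes(ptr=2, end=6, maxcount=3, charsize=2)
wide_end_fixed = clamp_end_chars(ptr=2, end=6, maxcount=3, charsize=2)

print(narrow_end, wide_end_buggy, wide_end_fixed)
```

In the 2-byte case the byte-based comparison moves `end` to offset 8, one character past the real end of the buffer at offset 6. If the bytes there happen to be zero, the repeat consumes a third `\x00`, which matches the `(1, 4)` span shown above.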

mrabarnett · 2012-12-16T01:04:56Z

I haven't found any other issues, so here's the second patch.

serhiy-storchaka · 2012-12-16T08:29:53Z

The patches LGTM. How about adding a test?

mrabarnett · 2012-12-16T17:33:38Z

Here are some tests for the issue.

serhiy-storchaka · 2012-12-16T18:08:16Z

The second test passes on unpatched Python.

mrabarnett · 2012-12-16T18:48:59Z

Oops! :-( Now corrected.

serhiy-storchaka · 2012-12-16T19:02:31Z

LGTM.

Matthew, can you please submit a contributor form?

http://python.org/psf/contrib/contrib-form/
http://python.org/psf/contrib/

python-dev · 2012-12-29T21:45:31Z

New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3':
Issue bpo-16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/44a4f9289faa

New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default':
Issue bpo-16688: Fix backreferences did make case-insensitive regex fail on non-ASCII strings.
http://hg.python.org/cpython/rev/c59ee1ff6f27

serhiy-storchaka · 2012-12-29T21:50:42Z

Fixed. Thank you for the patch, Matthew. I hope to see more of your patches.

birkenfeld · 2012-12-30T09:22:27Z

I think you will, Matthew being MRAB on the mailing lists :)

pyos mannequin added the topic-regex and type-bug labels Dec 14, 2012

serhiy-storchaka added the easy label Dec 15, 2012

serhiy-storchaka self-assigned this Dec 29, 2012

serhiy-storchaka closed this as completed Dec 29, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
