Backreferences make case-insensitive regex fail on non-ASCII strings. #60892
The title says it all: a regular expression that makes use of backreferences and is compiled with re.I fails to match when the input string contains a non-ASCII character. A simple example:

    >>> import re
    >>> r = re.compile(r'(a)\1', re.I)  # should match "aa", "aA", "Aa", or "AA"
    >>> r.findall('aa')  # works as expected
    ['a']
    >>> r.findall('aa bcd')  # still works
    ['a']
    >>> r.findall('aa Ā')  # ord('Ā') == 0x0100
    []

The same code works as expected in Python 3.2:

    >>> r.findall('aa Ā')
    ['a']
|
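The report above can be turned into a small self-checking script. This is only a sketch, not the test that was later attached to the issue; the names `PATTERN`, `CASES` and `check_backreference` are made up for illustration. On an interpreter that includes the fix, every case should report a match.

```python
import re

# Same pattern as in the report: group 1 followed by a backreference to it,
# compiled case-insensitively.
PATTERN = re.compile(r'(a)\1', re.I)

# Hypothetical case list: ASCII-only inputs plus the non-ASCII input that
# failed on the affected version.
CASES = ['aa', 'aA', 'Aa', 'AA', 'aa bcd', 'aa \u0100']


def check_backreference(text):
    """Return True if the doubled letter is found somewhere in text."""
    return bool(PATTERN.findall(text))


if __name__ == '__main__':
    for text in CASES:
        status = 'ok' if check_backreference(text) else 'FAILED'
        print(ascii(text), status)
```

On the affected 3.3 build described in the report, the last case would be expected to print FAILED.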
It works on 2.7 too, and fails on 3.3/3.x. |
In function SRE_MATCH, the code for SRE_OP_GROUPREF (line 1290) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            SRE_CHARGET(state, ctx->ptr, 0) != SRE_CHARGET(state, p, 0))
            RETURN_FAILURE;
        p += state->charsize;
        ctx->ptr += state->charsize;
    }

However, the code for SRE_OP_GROUPREF_IGNORE (line 1316) contains this:

    while (p < e) {
        if (ctx->ptr >= end ||
            state->lower(SRE_CHARGET(state, ctx->ptr, 0)) != state->lower(*p))
            RETURN_FAILURE;
        p++;
        ctx->ptr += state->charsize;
    }

(In both cases 'p' is of type 'char*'.)

The problem appears to be that the latter is still using '*p' and 'p++' and is thus always working with chars (it gets and advances 1 byte at a time instead of 1, 2 or 4 bytes for Unicode). |
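To make the effect of the one-byte step concrete, here is a rough Python model of the two loops. It is a simulation under stated assumptions, not the actual C code or the submitted patch: characters are packed as little-endian fixed-width units, `lower` handles ASCII only, and every function name (`encode`, `charget`, `groupref_ignore`, `try_backref`) is hypothetical. The `buggy=False` branch mirrors the SRE_OP_GROUPREF loop, which is one plausible shape for a correction.

```python
import struct

# Assumption: little-endian fixed-width code units, standing in for the
# 1-, 2- or 4-byte string storage that state->charsize describes.
_FORMATS = {1: '<B', 2: '<H', 4: '<I'}


def encode(text, charsize):
    """Pack each code point of text into charsize bytes."""
    return b''.join(struct.pack(_FORMATS[charsize], ord(c)) for c in text)


def charget(buf, pos, charsize):
    """Read one charsize-wide character at byte offset pos."""
    return struct.unpack_from(_FORMATS[charsize], buf, pos)[0]


def lower(code):
    """ASCII-only lowercasing; a simplification of state->lower."""
    return code + 32 if 65 <= code <= 90 else code


def groupref_ignore(buf, charsize, p, e, ptr, end, buggy):
    """Compare the group buf[p:e] with the text at ptr, ignoring case.

    All positions are byte offsets. Returns the new ptr on success,
    or None on failure.
    """
    while p < e:
        if ptr >= end:
            return None
        if buggy:
            # Models '*p' and 'p++': one byte read, one byte advanced.
            group_char, step = buf[p], 1
        else:
            # Models the GROUPREF loop: a whole character each time.
            group_char, step = charget(buf, p, charsize), charsize
        if lower(charget(buf, ptr, charsize)) != lower(group_char):
            return None
        p += step
        ptr += charsize
    return ptr


def try_backref(text, charsize, buggy):
    """Treat the first character as group 1, then match a backreference."""
    buf = encode(text, charsize)
    return groupref_ignore(buf, charsize, 0, charsize, charsize,
                           len(buf), buggy)
```

With one-byte storage (`try_backref('aa bcd', 1, True)`) the byte-wise loop still succeeds, because a byte and a character coincide. With two-byte storage (`try_backref('aa \u0100', 2, True)`) the second iteration compares the high byte of the group's 'a' against the following space and the match fails, while the character-wise variant succeeds.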
Good analysis, Matthew. Do you want to submit a patch? |
OK, here's a patch. |
Can someone check that there is no other similar regression (introduced …
2012/12/15 Serhiy Storchaka <report@bugs.python.org>:
|
I found another bug while looking through the source. On line 495 in function SRE_COUNT:
where 'end' and 'ptr' are of type 'char*'. That means that 'end - ptr' is the length in _bytes_, not characters. If the byte after the end of the string is 0 then you get this:

    >>> # Good:
    >>> re.search(r"\x00{1,3}", "a\x00\x00").span()
    (1, 3)
    >>> # Bad:
    >>> re.search(r"\x00{1,3}", "\u0100\x00\x00").span()
    (1, 4)

I'll keep looking before submitting a patch. |
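The byte-versus-character mismatch described here can be illustrated with a little arithmetic. The sketch below is not the SRE_COUNT code; `remaining` and `nul_run_span` are hypothetical helpers, and `charsize` is an assumed storage width (1 byte for the ASCII string, 2 bytes for the string containing U+0100).

```python
import re


def remaining(text, pos, charsize):
    """Return (characters left, bytes left) from index pos onward.

    charsize is the assumed storage width per character, in bytes.
    """
    chars_left = len(text) - pos
    return chars_left, chars_left * charsize


def nul_run_span(text):
    """Span of the first run of one to three NUL characters in text."""
    return re.search(r"\x00{1,3}", text).span()


if __name__ == '__main__':
    print(remaining("a\x00\x00", 1, 1))        # 1-byte storage
    print(remaining("\u0100\x00\x00", 1, 2))   # 2-byte storage
    print(nul_run_span("\u0100\x00\x00"))
```

With one-byte storage both counts are 2, so it makes no difference which one a repeat limit is compared against. With two-byte storage there are 2 characters but 4 bytes left, so a limit of 3 measured against the byte count would appear to fit, which is consistent with the reported span running past the end of the string. On a fixed interpreter the span should never exceed `len(text)`.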
I haven't found any other issues, so here's the second patch. |
The patches LGTM. How about adding a test? |
Here are some tests for the issue. |
The second test passes on unpatched Python. |
Oops! :-( Now corrected. |
LGTM. Matthew, can you please submit a contributor form? http://python.org/psf/contrib/contrib-form/ |
New changeset 44a4f9289faa by Serhiy Storchaka in branch '3.3'.
New changeset c59ee1ff6f27 by Serhiy Storchaka in branch 'default'. |
Fixed. Thank you for the patch, Matthew. I hope to see more of your patches. |
I think you will, Matthew being MRAB on the mailing lists :) |