lib re cannot match non-BMP ranges (all versions, all builds) #56958
Comments
On neither narrow nor wide builds does this UTF8-encoded bit run without raising an exception:
    if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
        print("match 1 passed")
    else:
        print("match 2 failed")
The best you can possibly do is to use both a wide build *and* symbolic literals, in which case it will pass. But remove either or both of those conditions and you fail. This is too restrictive for full Unicode use. There should never be any situation where [a-z] fails to match c when a < c < z, and neither a nor z is something special in a character class. There is, or perhaps should be, no difference at all between "[a-z]" and "[𝒜-𝒵]", just as there is, or at least should be, no difference between "c" and "𝒞". You can’t have second-class citizens like this that can't be used. And no, this one is *not* fixed by Matthew Barnett's regex library. There is some dumb UCS-2 assumption lurking deep in Python somewhere that makes this break, even on wide builds, which is incomprehensible to me. |
On a wide 2.7 and 3.3 all the 3 tests pass. On a narrow 3.2 I get
match 1 passed
Traceback (most recent call last):
  File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
    result = cache[key]
KeyError: (<class 'str'>, '[𝒜-𝒵]', 32)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "bigrange.py", line 16, in <module>
    if re.search("[𝒜-𝒵]", "𝒞", flags):
  File "/home/wolf/dev/py/3.2/Lib/re.py", line 158, in search
    return _compile(pattern, flags).search(string)
  File "/home/wolf/dev/py/3.2/Lib/re.py", line 255, in _compile
    return _compile_typed(type(pattern), pattern, flags)
  File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
    result = user_function(*args, **kwds)
  File "/home/wolf/dev/py/3.2/Lib/re.py", line 267, in _compile_typed
    return sre_compile.compile(pattern, flags)
  File "/home/wolf/dev/py/3.2/Lib/sre_compile.py", line 491, in compile
    p = sre_parse.parse(p, flags)
  File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 692, in parse
    p = _parse_sub(source, pattern, 0)
  File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 315, in _parse_sub
    itemsappend(_parse(source, state))
  File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 461, in _parse
    raise error("bad character range")
sre_constants.error: bad character range |
On wide 3.2 it passes too, so the failure is limited to narrow builds (are you sure that it fails on wide builds for you?). On a narrow 2.7 I get a slightly different error though:
match 1 passed
Traceback (most recent call last):
  File "bigrange.py", line 16, in <module>
    if re.search("[𝒜-𝒵]", "𝒞", flags):
  File "/home/wolf/dev/py/2.7/Lib/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/home/wolf/dev/py/2.7/Lib/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range |
I haven't looked at the code, but I think that the re module is just trying to calculate the range between the low surrogate of 𝒜 and the high surrogate of 𝒵. Also note that re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'), u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE) |
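The surrogate hypothesis above can be checked by hand. This is a minimal sketch, assuming that a narrow build stores each non-BMP character as a UTF-16 surrogate pair and that the pattern parser walks those code units one at a time. The helper name `utf16_code_units` is made up for illustration and is not part of `re`.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units a narrow build would store for text."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# "[𝒜-𝒵]" written with escapes so the source file encoding does not matter.
pattern = "[\U0001D49C-\U0001D4B5]"
units = utf16_code_units(pattern)
print([hex(u) for u in units])

# A parser that sees one code unit at a time reads the class as the single
# unit 0xD835, then the range 0xDC9C-0xD835, then the single unit 0xDCB5.
range_start, range_end = units[2], units[4]
print(hex(range_start), hex(range_end), range_start > range_end)
```

Under that assumption the range starts at the low surrogate of 𝒜 (0xDC9C) and ends at the high surrogate of 𝒵 (0xD835). The start is larger than the end, which is consistent with the "bad character range" error.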
The error on 3.2 comes from the lru_cache; here's a minimal testcase to reproduce it:
>>> from functools import lru_cache
>>> @lru_cache()
... def func(arg): raise ValueError()
...
>>> func(3)
Traceback (most recent call last):
  File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
    result = cache[key]
KeyError: (3,)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
    result = user_function(*args, **kwds)
  File "<stdin>", line 2, in func
ValueError
Raymond, is this expected or should I open another issue? |
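The doubled traceback looks like ordinary Python 3 exception chaining. The sketch below assumes the 3.2 wrapper called the user function from inside an `except KeyError` handler, which the two `functools.py` frames suggest but which I have not verified against the source. `cached_call` is a hypothetical stand-in, not the `functools` implementation.

```python
def cached_call(cache, key, user_function):
    """Hypothetical cache wrapper; not the functools source."""
    try:
        return cache[key]
    except KeyError:
        # Anything raised inside this handler gets the KeyError attached
        # as its __context__, so both tracebacks are printed.
        result = user_function(key)
        cache[key] = result
        return result

def func(arg):
    raise ValueError()

try:
    cached_call({}, 3, func)
except ValueError as exc:
    context = exc.__context__
    print(type(context).__name__, context.args)
```

Calling the user function after the `try`/`except` block has finished, rather than inside the handler, would avoid attaching the `KeyError` as context.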
You're right: my wide build is not Python3, just Python2. In fact, I have private builds that are 2.7 and 3.2, but those are both narrow. I'm remembering why I removed Python2 from my Unicode talk, because --tom |
And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3.
What is worse? FWIW on my system the default
3.3 is the version in development, not released yet. If you have an HG clone of Python you can make a wide build of 3.x with ./configure --with-wide-unicode and of 2.7 using ./configure --enable-unicode=ucs4.
I'm not sure what you are referring to here. |
I don't know if you *should*. But you can make one easily by passing |
Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 17:15:52 -0000:
Perhaps I am doing something wrong?
linux% python bigrange.py
match 1 passed
Traceback (most recent call last):
  File "bigrange.py", line 16, in <module>
    if re.search("[𝒜-𝒵]", "𝒞", flags):
  File "/usr/lib64/python2.6/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/lib64/python2.6/re.py", line 245, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range
I meant that it was running 2.6 not 2.7.
And Antoine Pitrou <pitrou@free.fr> wrote:
Oh good. I need to read configure --help more carefully next time. Is there a way to easily have these co-exist on the same system? I'm sure Variant Perl builds can coexist on the same system with some directories
There seem to be many more things to get wrong with Unicode in v2 than in v3. I don't know how much of this is just my slowness at ramping up the learning Python2:
Python3: re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
                   "\N{MATHEMATICAL SCRIPT CAPITAL C}", re.UNICODE)
The Python2 version is *much* noisier.
(1) You have to keep remembering to u"..." everything because neither
(2) You have to manually encode every string, which is utterly bizarre to me.
(3) Plus you then have to turn around and tell re, "Hey by the way, you know those
It's a very awkward model. Compare Perl's
    "\N{MATHEMATICAL SCRIPT CAPITAL C}" =~ /[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]/
That's the kind of thing I'm used to. It knows those are Unicode pattern matches on
FWIW, I give Python major kudos for having \N{⋯} available so that people
* Requiring explicitly coded callouts to a library are at best tedious and
annoying. ICU4J's UCharacter and JDK7's Character classes both have
String getName(int codePoint)
but JDK7 has nothing that goes the other way around; for that, ICU4J has
int getCharFromName(String name)
and ICU4C has
UChar32 u_charFromName ( UCharNameChoice nameChoice,
const char * name,
UErrorCode * pErrorCode
)
Anybody can see how deathly unwieldy and of that. ICU4C's regex library admits \N{⋯} just as Perl and Python do, but that
As far as I know, nothing but Perl and Python allows \N{⋯} in interpolated
One question: If one really must use code point numbers in strings, does Python
You should somehow be able to specify only as many hex digits as you actually need. It's just a lot easier, which is why I miss it from regular Python strings. It
Perl only uses \x, not \x AND \u AND \U the way Python does, because
Thanks for all your generous help and kindly patience. --tom |
On a narrow build, "\N{MATHEMATICAL SCRIPT CAPITAL A}" is stored as 2 code units, and neither re nor regex recombine them when compiling a regex or looking for a match. regex supports \xNN, \uNNNN and \UNNNNNNNN and \N{XYZ} itself, so they can be used in a raw string literal, but it doesn't recombine code units. I could add recombination to regex at some point if time has passed and no further progress has been made in the language's support for Unicode. |
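The recombination mentioned here could look roughly like the sketch below. It is an illustration of the idea only, not code from `re` or `regex`, and `recombine_surrogates` is a hypothetical name. It takes a list of UTF-16 code units and merges each high/low surrogate pair into one code point.

```python
def recombine_surrogates(units):
    """Merge UTF-16 surrogate pairs in a list of code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding: 10 bits from each half, plus 0x10000.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP characters and lone surrogates pass through unchanged.
            code_points.append(unit)
            i += 1
    return code_points

# Code units of "[𝒜-𝒵]" as a narrow build would store them.
narrow_units = [0x5B, 0xD835, 0xDC9C, 0x2D, 0xD835, 0xDCB5, 0x5D]
recombined = recombine_surrogates(narrow_units)
print([hex(cp) for cp in recombined])
```

After recombination the class reads as `[`, U+1D49C, `-`, U+1D4B5, `]`, so the range endpoints are in ascending order again.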
That's weird, I tried on a wide Python 2.6.6 too and it works even there. Maybe a bug that got fixed between 2.6.2 and 2.6.6? Or maybe something else?
Here I have different HG clones, one for each release (2.7, 3.2, 3.3), and I run ./configure (--with-wide-unicode) && make -j2. Then I just run ./python from there without installing it in the system.
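To confirm which kind of build a given ./python actually is, `sys.maxunicode` can be inspected. A small sketch follows; the function name `build_kind` and its return strings are made up for illustration.

```python
import sys

def build_kind():
    """Classify the running interpreter by its largest supported code point."""
    if sys.maxunicode == 0xFFFF:
        return "narrow"
    # 0x10FFFF: a wide build of 2.x/3.2, or any build from 3.3 onwards.
    return "wide or 3.3+"

print(hex(sys.maxunicode), build_kind())
```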
Yes, Python 3 fixed many of these things and it's a much "cleaner" language.
Before Unicode, Python only had plain (byte) strings; when Unicode strings were introduced, the u"..." syntax was chosen to distinguish them. On Python 3, "..." is a Unicode string, whereas b"..." is used for bytes.
re works with both bytes and Unicode strings, on both Python 2 and Python 3. I was encoding them to see if it was able to handle the range when it was in a UTF-8 encoded string, rather than a Unicode string. Even if it didn't fail with an exception, it failed with a wrong result (and that's even worse).
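The "wrong result" for UTF-8 encoded input can be reproduced on Python 3 with a bytes pattern. This sketch is an approximation of the Python 2 experiment, not a rerun of it: the re.UNICODE flag is left out because Python 3 rejects it for bytes patterns.

```python
import re

# UTF-8 bytes of "[𝒜-𝒵]"; re treats every byte as a separate character.
pattern = "[\U0001D49C-\U0001D4B5]".encode("utf-8")
print(pattern)

# The "match" against 𝒞 is only its first byte, not the whole character.
hit = re.search(pattern, "\U0001D49E".encode("utf-8"))
print(hit.group(), hit.span())

# The accidental byte range \x9c-\xf0 also matches the lead byte of "é".
false_hit = re.search(pattern, "\u00e9".encode("utf-8"))
print(false_hit.group())
```

No exception is raised, but the character class has become a set of individual bytes plus the byte range \x9c-\xf0, so it matches pieces of unrelated characters.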
The re.UNICODE flag affects the behavior of e.g. \w and \d; it's not telling re that we are passing Unicode strings rather than bytes. By default on Python 2 those only match ASCII letters and digits. This is also fixed on Python 3, where by default they match non-ASCII letters and digits (unless you pass re.ASCII).
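A short Python 3 sketch of the difference; the sample strings are arbitrary choices for illustration.

```python
import re

word = "caf\u00e9"        # "café", ends in a non-ASCII letter
digits = "\u0663\u0664"   # ARABIC-INDIC DIGIT THREE and FOUR

# Default on Python 3 str patterns: Unicode-aware \w and \d.
unicode_word = re.fullmatch(r"\w+", word)
unicode_digits = re.fullmatch(r"\d+", digits)

# re.ASCII restricts \w and \d to the ASCII ranges.
ascii_word = re.fullmatch(r"\w+", word, re.ASCII)
ascii_digits = re.fullmatch(r"\d+", digits, re.ASCII)

print(bool(unicode_word), bool(unicode_digits))
print(bool(ascii_word), bool(ascii_digits))
```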
FWIW we have unicodedata.lookup('SNOWMAN')
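For reference, a small sketch of the two directions in `unicodedata`: `lookup` goes from name to character and `name` goes from character to name.

```python
import unicodedata

# Name to character.
snowman = unicodedata.lookup("SNOWMAN")
script_c = unicodedata.lookup("MATHEMATICAL SCRIPT CAPITAL C")

# Character to name.
snowman_name = unicodedata.name(snowman)

print(hex(ord(snowman)), snowman_name)
print(script_c == "\N{MATHEMATICAL SCRIPT CAPITAL C}")
```

An unknown name makes `unicodedata.lookup` raise `KeyError`.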
Nope. OTOH it doesn't happen too often that one needs those (especially the \U version), so I'm not sure that it's worth adding something else just to save a few chars (also \x{12345} is only one char less than \U00012345). |
BTW, you can find more information about the one-dir-per-clone setup (and other useful info) here: http://docs.python.org/devguide/committing.html#using-several-working-copies |
We should at least get this fixed in 3.3. Then we can discuss the benefits of backporting the fixes to 2.7 and 3.2 (though it sounds to me like the backports will fix more than they will break, since it is pretty much impossible to do the right thing in those versions today). |
I tried bigrange.py on 3.3/3.4 and I got: PEP-393 probably fixed this issue. |
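A quick way to check this on 3.3 or later is sketched below. It is not the original bigrange.py, only an equivalent of its range test written with symbolic literals.

```python
import re

pattern = "[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]"
target = "\N{MATHEMATICAL SCRIPT CAPITAL C}"

# With PEP 393 strings, a non-BMP character is a single item of the string,
# so the range endpoints are whole code points.
match = re.search(pattern, target)
print(len(target), match is not None)
```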
New changeset 489cfa062442 by Ezio Melotti in branch '3.3': New changeset c3a09c535001 by Ezio Melotti in branch 'default': |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.