lib re cannot match non-BMP ranges (all versions, all builds) #56958

Closed
tchrist mannequin opened this issue Aug 14, 2011 · 15 comments
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@tchrist
Mannequin

tchrist mannequin commented Aug 14, 2011

BPO 12749
Nosy @gvanrossum, @rhettinger, @terryjreedy, @pitrou, @jkloth, @ezio-melotti, @bitdancer
Files
  • bigrange.py: demo that python is buggily stuck with UCS-2 even on wide builds
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2013-02-23.06:29:30.600>
    created_at = <Date 2011-08-14.15:47:38.638>
    labels = ['expert-regex', 'type-bug']
    title = 'lib re cannot match non-BMP ranges (all versions, all builds)'
    updated_at = <Date 2013-02-23.06:40:52.851>
    user = 'https://bugs.python.org/tchrist'

    bugs.python.org fields:

    activity = <Date 2013-02-23.06:40:52.851>
    actor = 'python-dev'
    assignee = 'none'
    closed = True
    closed_date = <Date 2013-02-23.06:29:30.600>
    closer = 'ezio.melotti'
    components = ['Regular Expressions']
    creation = <Date 2011-08-14.15:47:38.638>
    creator = 'tchrist'
    dependencies = []
    files = ['22897']
    hgrepos = []
    issue_num = 12749
    keywords = []
    message_count = 15.0
    messages = ['142058', '142059', '142060', '142061', '142063', '142065', '142067', '142068', '142075', '142077', '142078', '142080', '143037', '182718', '182719']
    nosy_count = 11.0
    nosy_names = ['gvanrossum', 'rhettinger', 'terry.reedy', 'pitrou', 'jkloth', 'ezio.melotti', 'mrabarnett', 'Arfrever', 'r.david.murray', 'python-dev', 'tchrist']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12749'
    versions = ['Python 3.2']

    @tchrist
    Mannequin Author

    tchrist mannequin commented Aug 14, 2011

    On neither narrow nor wide builds does this UTF8-encoded bit run without raising an exception:

       if re.search("[𝒜-𝒵]", "𝒞", re.UNICODE):
           print("match 2 passed")
       else:
           print("match 2 failed")

    The best you can possibly do is to use both a wide build *and* symbolic literals, in which case it will pass. But remove either or both of those conditions and you fail. This is too restrictive for full Unicode use.

    There should never be any situation where [a-z] fails to match c when a < c < z, and neither a nor z is something special in a character class. There is, or perhaps should be, no difference at all between "[a-z]" and "[𝒜-𝒵]", just as there is, or at least should be, no difference between "c" and "𝒞". You can’t have second-class citizens like this that can't be used.

    And no, this one is *not* fixed by Matthew Barnett's regex library. There is some dumb UCS-2 assumption lurking deep in Python somewhere that makes this break, even on wide builds, which is incomprehensible to me.
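For reference, here is a minimal sketch of the property being asked for. It assumes an interpreter where each non-BMP character is a single code point (Python 3.3 or later, or a wide build of an earlier version); the variable names are illustrative only.

```python
import re

low, mid, high = "\U0001D49C", "\U0001D49E", "\U0001D4B5"  # 𝒜, 𝒞, 𝒵

# The three characters are ordered by code point, just like a < c < z.
ordered = ord(low) < ord(mid) < ord(high)

# The same character class, spelled with literal characters and with escapes.
literal_match = re.search("[𝒜-𝒵]", "𝒞")
escaped_match = re.search("[" + low + "-" + high + "]", mid)

print(ordered, bool(literal_match), bool(escaped_match))
```

On a narrow build the `ord()` calls themselves would fail, because each of these strings has length 2 there.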

    @tchrist tchrist mannequin added topic-regex type-bug An unexpected behavior, bug, or error labels Aug 14, 2011
    @ezio-melotti
    Member

    On a wide 2.7 and 3.3 all the 3 tests pass.

    On a narrow 3.2 I get 
    match 1 passed
    Traceback (most recent call last):
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
        result = cache[key]
    KeyError: (<class 'str'>, '[𝒜-𝒵]', 32)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags): 
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 158, in search
        return _compile(pattern, flags).search(string)
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 255, in _compile
        return _compile_typed(type(pattern), pattern, flags)
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
        result = user_function(*args, **kwds)
      File "/home/wolf/dev/py/3.2/Lib/re.py", line 267, in _compile_typed
        return sre_compile.compile(pattern, flags)
      File "/home/wolf/dev/py/3.2/Lib/sre_compile.py", line 491, in compile
        p = sre_parse.parse(p, flags)
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 692, in parse
        p = _parse_sub(source, pattern, 0)
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 315, in _parse_sub
        itemsappend(_parse(source, state))
      File "/home/wolf/dev/py/3.2/Lib/sre_parse.py", line 461, in _parse
        raise error("bad character range")
    sre_constants.error: bad character range

    @ezio-melotti
    Member

    On wide 3.2 it passes too, so the failure is limited to narrow builds (are you sure that it fails on wide builds for you?).

    On a narrow 2.7 I get a slightly different error though:

    match 1 passed
    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags): 
      File "/home/wolf/dev/py/2.7/Lib/re.py", line 142, in search
        return _compile(pattern, flags).search(string)
      File "/home/wolf/dev/py/2.7/Lib/re.py", line 244, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range

    @ezio-melotti
    Member

    I haven't looked at the code, but I think that the re module is just trying to calculate the range between the low surrogate of 𝒜 and the high surrogate of 𝒵.
    If this is the case, this is the "usual bug" that narrow builds have.
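The surrogate guess above can be checked by hand. The sketch below runs on a modern Python and only simulates what a narrow build would store, by encoding the pattern to UTF-16; `utf16_code_units` is a hypothetical helper, and the idea that the parser rejects a range whose start is greater than its end is an assumption that fits the traceback, not something read from the sre_parse source.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units a narrow build would store for text."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

pattern_units = utf16_code_units("[\U0001D49C-\U0001D4B5]")
print([hex(u) for u in pattern_units])

# Layout: '[' high(𝒜) low(𝒜) '-' high(𝒵) low(𝒵) ']'
range_start = pattern_units[2]  # low surrogate of 𝒜, left of the hyphen
range_end = pattern_units[4]    # high surrogate of 𝒵, right of the hyphen
print(hex(range_start), hex(range_end), range_start > range_end)
```

Low surrogates (0xDC00 to 0xDFFF) always sort above high surrogates (0xD800 to 0xDBFF), so a range built this way is reversed for any pair of non-BMP endpoints.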

    Also note that re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'), u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)
    "works", but it returns a wrong result.

    @ezio-melotti
    Member

    The error on 3.2 comes from the lru_cache, here's a minimal testcase to reproduce it:
    >>> from functools import lru_cache
    >>> @lru_cache()
    ... def func(arg): raise ValueError()
    ... 
    >>> func(3)
    Traceback (most recent call last):
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 176, in wrapper
        result = cache[key]
    KeyError: (3,)
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/wolf/dev/py/3.2/Lib/functools.py", line 180, in wrapper
        result = user_function(*args, **kwds)
      File "<stdin>", line 2, in func
    ValueError

    Raymond, is this expected or should I open another issue?
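The double traceback is what implicit exception chaining produces whenever the wrapped function is called from inside an `except KeyError:` block. A minimal sketch, using two hypothetical memoizing decorators (neither is the actual functools code), shows the difference between calling inside and outside the handler:

```python
import functools

def memoize_chained(func):
    """Hypothetical cache that calls func inside the except block."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args):
        try:
            return cache[args]
        except KeyError:
            # Anything raised here gets the KeyError as its __context__.
            result = cache[args] = func(*args)
            return result
    return wrapper

def memoize_clean(func):
    """Hypothetical cache that calls func after the lookup has finished."""
    cache = {}
    missing = object()
    @functools.wraps(func)
    def wrapper(*args):
        result = cache.get(args, missing)
        if result is missing:
            result = cache[args] = func(*args)
        return result
    return wrapper

def fail(arg):
    raise ValueError(arg)

def context_of(decorated):
    """Call decorated(3) and return the __context__ of the ValueError it raises."""
    try:
        decorated(3)
    except ValueError as exc:
        return exc.__context__

chained_context = context_of(memoize_chained(fail))
clean_context = context_of(memoize_clean(fail))
print(type(chained_context).__name__, clean_context)
```

Whether a particular functools release behaves like the first or the second variant is not something this sketch establishes.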

    @tchrist
    Mannequin Author

    tchrist mannequin commented Aug 14, 2011

    Ezio Melotti <ezio.melotti@gmail.com> added the comment:

    > On wide 3.2 it passes too, so the failure is limited to narrow builds (are
    > you sure that it fails on wide builds for you?).

    You're right: my wide build is not Python3, just Python2. In fact,
    it's even worse, because it's the stock build on Linux, which seems
    on this machine to be 2.6 not 2.7.

    I have private builds that are 2.7 and 3.2, but those are both narrow.
    I do not have a 3.3 build. Should I?

    I'm remembering why I removed Python2 from my Unicode talk, because
    of how it made me pull my hair out. People at the talk wanted to know
    what I meant, but I didn't have time to go into it. I think this
    gets added to the hairpulling list.

    --tom

    @ezio-melotti
    Member

    > You're right: my wide build is not Python3, just Python2.

    And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3.

    > In fact, it's even worse, because it's the stock build on Linux,
    > which seems on this machine to be 2.6 not 2.7.

    What is worse? FWIW on my system the default python is a 2.7 wide. python3 is a 3.2 wide.

    > I have private builds that are 2.7 and 3.2, but those are both narrow.
    > I do not have a 3.3 build. Should I?

    3.3 is the version in development, not released yet. If you have an HG clone of Python you can make a wide build of 3.x with ./configure --with-wide-unicode and of 2.7 using ./configure --enable-unicode=ucs4.

    > I'm remembering why I removed Python2 from my Unicode talk, because
    > of how it made me pull my hair out. People at the talk wanted to know
    > what I meant, but I didn't have time to go into it. I think this
    > gets added to the hairpulling list.

    I'm not sure what you are referring to here.

    @pitrou
    Member

    pitrou commented Aug 14, 2011

    > I have private builds that are 2.7 and 3.2, but those are both narrow.
    > I do not have a 3.3 build. Should I?

    I don't know if you *should*. But you can make one easily by passing
    "--with-wide-unicode" to ./configure.
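To tell which kind of build a given interpreter is, checking `sys.maxunicode` is enough. A small sketch follows; `build_kind` is a hypothetical helper name, and the labels it returns are only for illustration.

```python
import sys

def build_kind():
    """Classify the running interpreter by its largest code point."""
    if sys.maxunicode == 0xFFFF:
        return "narrow"        # UTF-16 code units; only possible before 3.3
    return "wide or 3.3+"      # sys.maxunicode == 0x10FFFF

print(sys.version_info[:2], hex(sys.maxunicode), build_kind())
print(len("\U0001D49E"))  # 1, unless the build is narrow, where it is 2
```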

    @tchrist
    Mannequin Author

    tchrist mannequin commented Aug 14, 2011

    Ezio Melotti <report@bugs.python.org> wrote on Sun, 14 Aug 2011 17:15:52 -0000:

    > You're right: my wide build is not Python3, just Python2.

    And is it failing? Here the tests pass on the wide builds, on both Python 2 and 3.

    Perhaps I am doing something wrong?

    linux% python --version
    Python 2.6.2
    
    linux% python -c 'import sys; print sys.maxunicode'
    1114111
    
    linux% cat -n bigrange.py
     1	#!/usr/bin/env python
     2	# -*- coding: UTF-8 -*-
     3	
     4	from __future__ import print_function
     5	from __future__ import unicode_literals
     6	
     7	import re
     8	
     9	flags = re.UNICODE
    10	
    11	if re.search("[a-z]", "c", flags): 
    12	    print("match 1 passed")
    13	else:
    14	    print("match 1 failed")
    15	
    16	if re.search("[𝒜-𝒵]", "𝒞", flags): 
    17	    print("match 2 passed")
    18	else:
    19	    print("match 2 failed")
    20	
    21	if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags): 
    22	    print("match 3 passed")
    23	else:
    24	    print("match 3 failed")
    25	
    26	if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
    27	    "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags): 
    28	    print("match 4 passed")
    29	else:
    30	    print("match 4 failed")
    
    linux% python bigrange.py
    match 1 passed
    Traceback (most recent call last):
      File "bigrange.py", line 16, in <module>
        if re.search("[𝒜-𝒵]", "𝒞", flags):
      File "/usr/lib64/python2.6/re.py", line 142, in search
        return _compile(pattern, flags).search(string)
      File "/usr/lib64/python2.6/re.py", line 245, in _compile
        raise error, v # invalid expression
    sre_constants.error: bad character range

    > In fact, it's even worse, because it's the stock build on Linux,
    > which seems on this machine to be 2.6 not 2.7.

    What is worse? FWIW on my system the default python is a 2.7 wide. python3 is a 3.2 wide.

    I meant that it was running 2.6 not 2.7.

    > I have private builds that are 2.7 and 3.2, but those are both narrow.
    > I do not have a 3.3 build. Should I?

    3.3 is the version in development, not released yet. If you have an
    HG clone of Python you can make a wide build of 3.x with ./configure
    --with-wide-unicode and of 2.7 using ./configure
    --enable-unicode=ucs4.

    And Antoine Pitrou <pitrou@free.fr> wrote:

    > I have private builds that are 2.7 and 3.2, but those are both narrow.
    > I do not have a 3.3 build. Should I?

    I don't know if you *should*. But you can make one easily by passing
    "--with-wide-unicode" to ./configure.

    Oh good. I need to read configure --help more carefully next time.
    I have to do some Lucene work this afternoon, so I can let several builds
    chug along.

    Is there a way to easily have these co-exist on the same system? I'm sure
    I have to rebuild all C extensions for the new builds, but I wonder what to
    do about (for example) /usr/local/lib/python3.2 being able to be only one of
    narrow or wide. Probably I just need to go read the configure stuff better
    for alternate paths. Unsure.

    Variant Perl builds can coexist on the same system with some directories
    shared and others not, but I often find other systems aren't quite that
    flexible, usually requiring their own dedicated trees. Manpaths can get
    tricky, too.

    > I'm remembering why I removed Python2 from my Unicode talk, because
    > of how it made me pull my hair out. People at the talk wanted to know
    > what I meant, but I didn't have time to go into it. I think this
    > gets added to the hairpulling list.

    I'm not sure what you are referring to here.

    There seem to be many more things to get wrong with Unicode in v2 than in v3.

    I don't know how much of this is just my slowness at ramping up the learning
    curve, how much is due to historical defaults that don't work well for
    Unicode, and how much is

    Python2:

    re.search(u"[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode('utf-8'), 
               u"\N{MATHEMATICAL SCRIPT CAPITAL C}".encode('utf-8'), re.UNICODE)
    

    Python3:

        re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
                   "\N{MATHEMATICAL SCRIPT CAPITAL C}", re.UNICODE)

    The Python2 version is *much* noisier.

    (1) You have to keep remembering to u"..." everything because neither
    # -*- coding: UTF-8 -*-
    nor even
    from __future__ import unicode_literals
    suffices.

    (2) You have to manually encode every string, which is utterly bizarre to me.

    (3) Plus you then have to turn around and tell re, "Hey by the way, you know those
    Unicode strings I just passed you? Those are Unicode strings, you know."
    Like it couldn't tell that already by realizing it got Unicode not byte
    strings. So weird.

    It's a very awkward model. Compare Perl's

    "\N{MATHEMATICAL SCRIPT CAPITAL C}" =~ /[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]/

    That's the kind of thing I'm used to. It knows those are Unicode pattern matches on
    Unicode strings with Unicode semantics. After all, the \N{⋯} always evaluates to
    Unicode strings, so the regex engine of course is Unicodey then without being begged.
    To do bytewise processing, I would have to manually do all that encoding rigmarole
    like you show for Python2. And I never want to do that, because looking for code units
    is way beneath the level of abstraction I strongly prefer to work with. Code points
    are as low as I go, and often not even there, since I often need graphemes or
    sometimes even linguistic collating elements (n-graphs), like the <ch> or <ll>
    digraphs in traditional Spanish or <dd> or <rh> in Welsh, or heaven help us the
    <dz> or <sz> digraphs, the <dzs> or <tty> trigraphs, or the <ddsz> tetragraph
    in Hungarian.

    (Yes, only Hungarian alone has a tetragraph, and there are no pentagraphs;
     small solace that, though.)
    

    FWIW, I give Python major kudos for having \N{⋯} available so that people
    no longer have to embed non-ASCII or magic numbers or ugly function
    calls all over their code.

    • Non-ASCII sometimes has the advantages of legibility but paradoxically
      also sometimes has the disadvantage of illegibility, bizarre as that
      sounds. It is too easy to be tricked by lookalikey font issues.

      16 if re.search("[𝒜-𝒵]", "𝒞", flags):

    • Magic numbers quite simply suck. Nobody knows what they are.

      21 if re.search("[\U0001D49C-\U0001D4B5]", "\U0001D49E", flags):

    • Requiring explicitly coded callouts to a library is at best tedious and
      annoying. ICU4J's UCharacter and JDK7's Character classes both have
          String  getName(int codePoint)
      but JDK7 has nothing that goes the other way around; for that, ICU4J has
          int     getCharFromName(String name)
      and ICU4C has
          UChar32 u_charFromName(UCharNameChoice nameChoice,
                                 const char     *name,
                                 UErrorCode     *pErrorCode)
      Anybody can see how deathly unwieldy all of that is.

    ICU4C's regex library admits \N{⋯} just as Perl and Python do, but that
    class is not available in ICU4J, so you have to JNI to it as Android does.
    This is really much cleaner and clearer for maintenance:

        26	if re.search("[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]",
        27	    "\N{MATHEMATICAL SCRIPT CAPITAL C}", flags): 
    

    As far as I know, nothing but Perl and Python allows \N{⋯} in interpolated
    literals, even for those few languages that *have* interpolated literals.

    One question: If one really must use code point numbers in strings, does Python
    have any clean uniform way to enter them besides having to choose the clunky \uHHHH
    vs \UHHHHHHHH thing? The goal is to be able to specify any (legal) number of hex
    digits without having to zero-pad them, which is especially annoying with Python's \U, since
    you usually only need 5 hex digits and only very rarely 6, but the dumb thing makes
    you type all 8 of them every time anyway.

    You should somehow be able to specify only as many hex digits as you actually need.
    Ruby, and now also recent Unicode tech reports like current tr18, tend to use \u{⋯}
    for that. The \x{⋯} flavor is used by Perl strings and regexes, plus also the regexes
    in ICU, JDK7, and Matthew's regex library for Python.

    It's just a lot easier, which is why I miss it from regular Python strings. It
    occurs to me that you could add it completely backwards compatibly, since it is
    currently a syntax error:

    % python3.2 -c 'print("\x65")'
    e
    
    % python3.2 -c 'print("\u0065")'
    e
    
    % python3.2 -c 'print("\u03B1")'
    α
    
    % python3.2 -c 'print("\U0001D4E9")'
    𝓩
    
    % python3.2 -c 'print("\u{1D4E9}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \uXXXX escape
    Exit 1
    
    % python3.2 -c 'print("\x{1D4E9}")'
      File "<string>", line 1
    SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \xXX escape
    Exit 1
    

    Perl only uses \x, not \x AND \u AND \U the way Python does, because
    ahem, it seems a bit silly to have three different ways to do it. :)

    % perl -le 'print "\x9"' | cat -t
    ^I
    % perl -le 'print "\x65"'
    e
    
    % perl -le 'print "\x{9}"' | cat -t
    ^I
    % perl -le 'print "\x{65}"'
    e
    % perl -le 'print "\x{3B1}"'
    α
    % perl -le 'print "\x{FB03}"'
    ffi
    % perl -le 'print "\x{1D4E9}"'
    𝓩
    % perl -le 'print "\x{1FFFF}"   lt "\x{100000}"'
    1
    % perl -le 'print "\x{10_FFFF}" gt "\x{01_FFFF}"'
    1
    

    Thanks for all your generous help and kind patience.

    --tom

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Aug 14, 2011

    On a narrow build, "\N{MATHEMATICAL SCRIPT CAPITAL A}" is stored as 2 code units, and neither re nor regex recombine them when compiling a regex or looking for a match.

    regex supports \xNN, \uNNNN and \UNNNNNNNN and \N{XYZ} itself, so they can be used in a raw string literal, but it doesn't recombine code units.

    I could add recombination to regex at some point if time has passed and no further progress has been made in the language's support for Unicode.
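To make "recombination" concrete, here is a minimal sketch of joining surrogate pairs back into code points. It is not how re or regex is implemented; `combine_surrogates` is a hypothetical helper that works on a plain list of UTF-16 code units.

```python
def combine_surrogates(units):
    """Join UTF-16 surrogate pairs in a list of code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Standard UTF-16 decoding of a high/low pair.
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1  # skip the low surrogate that was just consumed
        code_points.append(unit)
        i += 1
    return code_points

# "[𝒜-𝒵]" as a narrow build would store it: seven code units.
narrow_pattern = [0x5B, 0xD835, 0xDC9C, 0x2D, 0xD835, 0xDCB5, 0x5D]
recombined = combine_surrogates(narrow_pattern)
print([hex(cp) for cp in recombined])
```

After recombination the class has five items and the range runs from U+1D49C to U+1D4B5, which is the range the pattern author meant. Unpaired surrogates are passed through unchanged.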

    @ezio-melotti
    Member

    > Perhaps I am doing something wrong?

    That's weird, I tried on a wide Python 2.6.6 too and it works even there. Maybe a bug that got fixed between 2.6.2 and 2.6.6? Or maybe something else?

    > Is there a way to easily have these co-exist on the same system?

    Here I have different HG clones, one for each release (2.7, 3.2, 3.3), and I run ./configure (--with-wide-unicode) && make -j2. Then I just run ./python from there without installing it in the system.
    You might do the same or look at "make altinstall". If you run "make install" it will install it as the default Python, so that's probably not what you want. Another option is to use virtualenv.

    > The Python2 version is *much* noisier.

    Yes, Python 3 fixed many of these things and it's a much "cleaner" language.

    > (1) You have to keep remembering to u"..." everything because neither
    > # -*- coding: UTF-8 -*-
    > nor even
    > from __future__ import unicode_literals
    > suffices.

    Before Unicode, Python only had plain (byte)strings; when Unicode strings were introduced, the u"..." syntax was chosen to distinguish them. On Python 3, "..." is a Unicode string, whereas b"..." is used for bytes.
    "# -*- coding: UTF-8 -*-" is only about the encoding used to save the file, and doesn't affect other things. Also this is the default on Python 3 so it's not necessary anymore (it's ASCII (or iso-8859-1?) on Python2).
    "from __future__ import unicode_literals" allows you to use "..." and b"..." instead of u"..." and "..." on Python 2. In my example I used u"..." to be explicit and because I was running from the terminal without using unicode_literals.

    > (2) You have to manually encode every string, which is utterly
    > bizarre to me.

    re works with both bytes and Unicode strings, on both Python 2 and Python 3. I was encoding them to see if it was able to handle the range when it was in a UTF-8 encoded string, rather than a Unicode string. Even if it didn't fail with an exception, it failed with a wrong result (and that's even worse).
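The "wrong result" with UTF-8 encoded input can be reproduced on Python 3 as well. This is a sketch rather than the original test: the re.UNICODE flag is left out because Python 3 rejects it for bytes patterns, and the second subject string is my own choice of unrelated text.

```python
import re

pattern = "[\N{MATHEMATICAL SCRIPT CAPITAL A}-\N{MATHEMATICAL SCRIPT CAPITAL Z}]".encode("utf-8")
subject = "\N{MATHEMATICAL SCRIPT CAPITAL C}".encode("utf-8")

# As bytes, the class holds f0, 9d, 92, the byte range 9c-f0, then 9d, 92, b5.
match = re.search(pattern, subject)
print(match.group())  # one byte of the character, not the whole character

# Unrelated text matches too, because its bytes fall inside 9c-f0.
false_positive = re.search(pattern, "\u00e9".encode("utf-8"))
print(false_positive is not None)
```

So the search "succeeds", but on a single byte, and it also accepts input that has nothing to do with the intended range.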

    > (3) Plus you then have to turn around and tell re, "Hey by the way, you
    > know those Unicode strings I just passed you? Those are Unicode
    > strings, you know."
    > Like it couldn't tell that already by realizing it got Unicode not
    > byte strings. So weird.

    The re.UNICODE flag affects the behavior of e.g. \w and \d; it's not telling re that we are passing Unicode strings rather than bytes. By default on Python 2 those only match ASCII letters and digits. This is also fixed on Python 3, where by default they match non-ASCII letters and digits (unless you pass re.ASCII).
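A short Python 3 sketch of that difference, using a sample string of my own choosing (a word with an accented letter followed by three Arabic-Indic digits):

```python
import re

text = "caf\u00e9 \u0661\u0662\u0663"  # "café ١٢٣"

unicode_words = re.findall(r"\w+", text)             # default for str patterns
ascii_words = re.findall(r"\w+", text, re.ASCII)     # \w limited to [a-zA-Z0-9_]
unicode_digits = re.findall(r"\d+", text)            # any Unicode decimal digit
ascii_digits = re.findall(r"\d+", text, re.ASCII)    # \d limited to [0-9]

print(unicode_words, ascii_words, unicode_digits, ascii_digits)
```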

    > • Requiring explicitly coded callouts to a library is at best
    >   tedious and annoying. ICU4J's UCharacter and JDK7's Character
    >   classes both have
    >   String getName(int codePoint)

    FWIW we have unicodedata.lookup('SNOWMAN')
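For completeness, a small sketch of going both ways between names and characters with the standard unicodedata module (run on Python 3.3 or later, so that `ord()` accepts the non-BMP character):

```python
import unicodedata

# Name to character.
snowman = unicodedata.lookup("SNOWMAN")
script_c = unicodedata.lookup("MATHEMATICAL SCRIPT CAPITAL C")

# Character back to name.
print(snowman, hex(ord(snowman)), unicodedata.name(snowman))
print(script_c, hex(ord(script_c)), unicodedata.name(script_c))
```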

    > One question: If one really must use code point numbers in strings,
    > does Python have any clean uniform way to enter them besides having
    > to choose the clunky \uHHHH vs \UHHHHHHHH thing?

    Nope. OTOH it doesn't happen too often to use those (especially the \U version), so I'm not sure that it's worth adding something else just to save a few chars (also \x{12345} is only one char less than \U00012345).
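Outside of string literals there are ways around the padding. The sketch below shows `chr()` with an unpadded hex number, plus a hypothetical helper, `expand_braced_hex`, that expands Perl-style \x{...} escapes in ordinary text. The helper is an illustration only, not an existing Python feature, and it accepts underscores inside the braces the way the Perl examples do.

```python
import re

# One or more hex digits, optionally separated by underscores, inside \x{...}.
_BRACED_HEX = re.compile(r"\\x\{([0-9A-Fa-f][0-9A-Fa-f_]*)\}")

def expand_braced_hex(text):
    """Replace each \\x{H...} in text with the character at that code point."""
    def replace(match):
        digits = match.group(1).replace("_", "")
        return chr(int(digits, 16))  # chr() raises ValueError above 0x10FFFF
    return _BRACED_HEX.sub(replace, text)

script_z = chr(0x1D4E9)  # no zero padding needed
expanded = expand_braced_hex(r"\x{3B1} \x{1D4E9} \x{10_FFFF}")
print(script_z == "\U0001D4E9", expanded == "\u03b1 \U0001D4E9 \U0010FFFF")
```

The input has to be a raw string (or come from a file), since a normal literal containing `\x{` is a syntax error, as shown earlier in the thread.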

    @ezio-melotti
    Member

    BTW, you can find more information about the one-dir-per-clone setup (and other useful info) here: http://docs.python.org/devguide/committing.html#using-several-working-copies

    @gvanrossum
    Member

    We should at least get this fixed in 3.3. Then we can discuss the benefits of backporting the fixes to 2.7 and 3.2 (though it sounds to me like the backports will fix more than they will break, since it is pretty much impossible to do the right thing in those versions today).

    @ezio-melotti
    Member

    I tried bigrange.py on 3.3/3.4 and I got:
    match 1 passed
    match 2 passed
    match 3 passed

    PEP-393 probably fixed this issue.
    I don't think it's worth attempting to backport this on 2.7/3.2, so I'm closing this issue.

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 23, 2013

    New changeset 489cfa062442 by Ezio Melotti in branch '3.3':
    bpo-12749: add a test for non-BMP ranges in character classes.
    http://hg.python.org/cpython/rev/489cfa062442

    New changeset c3a09c535001 by Ezio Melotti in branch 'default':
    bpo-12749: merge with 3.3.
    http://hg.python.org/cpython/rev/c3a09c535001
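A regression test for this could look roughly like the sketch below. It is a hypothetical test written for illustration, not the one added in the changesets above; the class and method names are my own, and it assumes Python 3.4 or later for `fullmatch`.

```python
import re
import unittest

class NonBmpRangeTest(unittest.TestCase):
    """Hypothetical test; not the one added in the changeset."""

    pattern = re.compile("[\U0001D49C-\U0001D4B5]")

    def test_inside_range(self):
        # Both endpoints and a character in between must match.
        for ch in ("\U0001D49C", "\U0001D49E", "\U0001D4B5"):
            self.assertIsNotNone(self.pattern.fullmatch(ch))

    def test_outside_range(self):
        # Neighbours just outside the range, ASCII, and a lone surrogate must not.
        for ch in ("\U0001D49B", "\U0001D4B6", "c", "\ud835"):
            self.assertIsNone(self.pattern.search(ch))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(NonBmpRangeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```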

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022