
b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') #55512

Closed
abalkin opened this issue Feb 23, 2011 · 61 comments
Assignees
Labels
performance Performance or resource usage

Comments

@abalkin
Member

abalkin commented Feb 23, 2011

BPO 11303
Nosy @malemburg, @rhettinger, @jcea, @abalkin, @pitrou, @vstinner, @ezio-melotti, @merwok
Superseder
  • bpo-11322: encoding package's normalize_encoding() function is too slow
Files
  • latin1.diff
  • issue11303.diff
  • aggressive_normalization.patch
  • issue11303.diff: Proof of concept that implements Steffen's algorithm.
  • more_aggressive_normalization.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/abalkin'
    closed_at = <Date 2011-02-25.23:00:39.837>
    created_at = <Date 2011-02-23.22:34:55.125>
    labels = ['performance']
    title = "b'x'.decode('latin1')\tis\tmuch\tslower\tthan\tb'x'.decode('latin-1')"
    updated_at = <Date 2011-03-04.17:56:32.919>
    user = 'https://github.com/abalkin'

    bugs.python.org fields:

    activity = <Date 2011-03-04.17:56:32.919>
    actor = 'lemburg'
    assignee = 'belopolsky'
    closed = True
    closed_date = <Date 2011-02-25.23:00:39.837>
    closer = 'lemburg'
    components = []
    creation = <Date 2011-02-23.22:34:55.125>
    creator = 'belopolsky'
    dependencies = []
    files = ['20871', '20872', '20876', '20878', '20880']
    hgrepos = []
    issue_num = 11303
    keywords = ['patch']
    message_count = 61.0
    messages = ['129227', '129232', '129234', '129253', '129259', '129261', '129270', '129271', '129272', '129273', '129274', '129275', '129276', '129278', '129279', '129280', '129281', '129282', '129283', '129284', '129285', '129286', '129287', '129288', '129289', '129290', '129291', '129292', '129293', '129294', '129295', '129296', '129306', '129308', '129309', '129322', '129323', '129360', '129383', '129385', '129387', '129404', '129452', '129454', '129456', '129457', '129461', '129464', '129465', '129466', '129485', '129486', '129488', '129490', '129491', '129492', '129493', '129494', '129537', '129539', '130065']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'rhettinger', 'jcea', 'belopolsky', 'pitrou', 'vstinner', 'ezio.melotti', 'eric.araujo', 'sdaoden']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = '11322'
    type = 'performance'
    url = 'https://bugs.python.org/issue11303'
    versions = ['Python 3.3']

    @abalkin
    Member Author

    abalkin commented Feb 23, 2011

    $ ./python.exe -m timeit "b'x'.decode('latin1')"
    100000 loops, best of 3: 2.57 usec per loop
    $ ./python.exe -m timeit "b'x'.decode('latin-1')"
    1000000 loops, best of 3: 0.336 usec per loop

    The reason for this behavior is that 'latin-1' is short-circuited in the C code, while 'latin1' has to be looked up in aliases.py. The attached patch fixes this issue.
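
    For illustration, here is a rough Python model of the two code paths being compared (an approximation added for clarity, not the actual CPython source; the helper fast_c_decode() and the exact set of shortcut names are assumptions):

    import codecs

    # Hypothetical stand-in for the C-level decoders reached by the shortcut.
    def fast_c_decode(data, encoding):
        return str(data, encoding)

    _SHORTCUTS = {'utf-8', 'latin-1', 'ascii'}   # assumed subset of the C fast paths

    def decode(data, encoding):
        normalized = encoding.lower().replace('_', '-')   # analogue of normalize_encoding()
        if normalized in _SHORTCUTS:
            return fast_c_decode(data, normalized)
        # Anything else goes through the codec registry: a Python-level lookup
        # that consults encodings/aliases.py and adds call overhead.
        return codecs.lookup(normalized).decode(data)[0]

    decode(b'x', 'latin-1')   # hits the shortcut
    decode(b'x', 'latin1')    # falls back to the registry / aliases.py path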

    @abalkin abalkin added the performance Performance or resource usage label Feb 23, 2011
    @abalkin
    Member Author

    abalkin commented Feb 24, 2011

    In bpo-11303.diff, I add a similar optimization for encode('latin1') and for the 'utf8' variant of utf-8. I don't think the dash-less variants of utf-16 and utf-32 are common enough to justify special-casing.

    @merwok
    Member

    merwok commented Feb 24, 2011

    +1 for the patch.

    @malemburg
    Member

    Alexander Belopolsky wrote:

    Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

    In bpo-11303.diff, I add a similar optimization for encode('latin1') and for the 'utf8' variant of utf-8. I don't think the dash-less variants of utf-16 and utf-32 are common enough to justify special-casing.

    Looks good.

    Given that we are starting to have a whole set of such aliases
    in the C code, I wonder whether it would be better to make the
    string comparisons more efficient, e.g.
    if "utf" matches, the checks could then continue with "8" or "-8"
    instead of trying to match "utf" again and again.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    I wonder what this normalize_encoding() does! Here is a pretty standard version of mine which is a bit more expensive but catches many more cases! This is stripped, of course, and can be rewritten very easily to Python's needs (i.e. using char[32] instead of char[11]).

    • If a character is either ::s_char_is_space() or ::s_char_is_punct():
      • Replace it with an ASCII space (0x20).
      • Squeeze adjacent spaces into a single one.
    • Else, if a character is ::s_char_is_alnum():
      • ::s_char_to_lower() the character.
      • Separate groups of alphas and digits with an ASCII space (0x20).
    • Else, discard the character.

    E.g. "ISO_8859---1" becomes "iso 8859 1",
    and "ISO8859-1" also becomes "iso 8859 1".
    s_CString *
    s_textcodec_normalize_name(s_CString *_name) {
        enum { C_NONE, C_WS, C_ALPHA, C_DIGIT } c_type = C_NONE;
        char *name, c;
        auto s_CString input;

        s_cstring_swap(s_cstring_init(&input), _name);
        _name = s_cstring_reserve(_name, 31, s_FAL0);
        name = s_cstring_cstr(&input);

        while ((c = *(name++)) != s_NUL) {
            s_si8 sep = s_FAL0;

            if (s_char_is_space(c) || s_char_is_punct(c)) {
                if (c_type == C_WS)
                    continue;
                c_type = C_WS;
                c = ' ';
            } else if (s_char_is_alpha(c)) {
                sep = (c_type == C_DIGIT);
                c_type = C_ALPHA;
                c = s_char_to_lower(c);
            } else if (s_char_is_digit(c)) {
                sep = (c_type == C_ALPHA);
                c_type = C_DIGIT;
            } else
                continue;

            /* Emit a separating space first when an alpha/digit boundary was
             * crossed, then the character itself. */
            do
                _name = s_cstring_append_char(_name, (sep ? ' ' : c));
            while (--sep >= s_FAL0);
        }

        s_cstring_destroy(&input);
        return _name;
    }
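
    For readers who prefer Python, here is an illustrative rendering of the same algorithm (my own translation of the sketch above, not part of any attached patch; it only aims at ASCII encoding names):

    def normalize_name(name):
        out = []
        prev = None   # one of None, 'space', 'alpha', 'digit'
        for c in name:
            if c.isalpha():
                if prev == 'digit':
                    out.append(' ')        # separate digit -> alpha groups
                prev = 'alpha'
                out.append(c.lower())
            elif c.isdigit():
                if prev == 'alpha':
                    out.append(' ')        # separate alpha -> digit groups
                prev = 'digit'
                out.append(c)
            else:                          # treat space/punctuation as a separator
                if prev == 'space':
                    continue               # squeeze runs of separators
                prev = 'space'
                out.append(' ')
        return ''.join(out).strip()

    # normalize_name("ISO_8859---1") == normalize_name("ISO8859-1") == "iso 8859 1"
    # normalize_name("LATIN1") == "latin 1"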

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    (That is to say, i would do it. But not if _cpython is thrown to the trash ,-); i.e. not unless there is at least a slight chance that it actually gets patched in, because this performance issue probably doesn't mean a thing in real life. You know, i'm a slow programmer; i would need *at least* two hours to rewrite that in plain C in a way that could serve as a replacement for normalize_encoding().)

    @ezio-melotti
    Member

    See also discussion on bpo-5902.

    Steffen, your normalization function looks similar to encodings.normalize_encoding, with just a few differences (it uses spaces instead of dashes, it divides alpha chars from digits).

    If it doesn't slow down the normal cases (i.e. 'utf-8', 'utf8', 'latin-1', etc.), a more flexible normalization done earlier might be a valid alternative.

    @abalkin
    Member Author

    abalkin commented Feb 24, 2011

    On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti <report@bugs.python.org> wrote:
    ..

    See also discussion on bpo-5902.

    Mark has closed bpo-5902 and indeed the discussion of how to efficiently
    normalize encoding names (without changing what is accepted) is beyond
    the scope of that or the current issue. Can someone open a separate
    issue to see if we can improve the current situation? I don't think
    having three slightly different normalize functions is optimal. See
    msg129248.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    .. i didn't actually invent this algorithm (but don't ask me where i got the idea from years ago), i've just implemented the function you see. The algorithm itself avoids some pitfalls with respect to combining numerics and significantly reduces the number of possible normalization cases:

        "ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1"
        (+ think of additional misspellings)

    all become "iso 8859 1" and "latin 1" in the end.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    (Everything else is beyond my scope. But normalizing _ to - is possibly a bad idea, as far as i can remember the situation from three years ago.)

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    P.P.S.: separating alphanumerics is a win for things like UTF-16BE: it becomes 'utf 16 be' - think about the possible misspellings here and you see this algorithm is a good thing....

    @malemburg
    Member

    Alexander Belopolsky wrote:

    Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

    On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti <report@bugs.python.org> wrote:
    ..
    > See also discussion on bpo-5902.

    Mark has closed bpo-5902 and indeed the discussion of how to efficiently
    normalize encoding names (without changing what is accepted) is beyond
    the scope of that or the current issue. Can someone open a separate
    issue to see if we can improve the current situation? I don't think
    having three slightly different normalize functions is optimal. See
    msg129248.

    Please see my reply on this ticket: those three functions have
    different application areas.

    On this ticket, we're discussing just one application area: that
    of the builtin short cuts.

    To have more encoding name variants benefit from the optimization,
    we might want to enhance that particular normalization function
    to avoid having to compare against "utf8" and "utf-8" in the
    encode/decode functions.

    @malemburg
    Member

    Steffen Daode Nurpmeso wrote:

    Steffen Daode Nurpmeso <sdaoden@googlemail.com> added the comment:

    .. i didn't actually invent this algorithm (but don't ask me where i got the idea from years ago), i've just implemented the function you see. The algorithm itself avoids some pitfalls with respect to combining numerics and significantly reduces the number of possible normalization cases:

        "ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1"
        (+ think of additional misspellings)

    all become "iso 8859 1" and "latin 1" in the end.

    Please don't forget that the shortcuts in question are *optimizations*.

    Programmers who don't use the encoding names triggering those
    optimizations will still have a running program, it'll only be
    a bit slower and that's perfectly fine.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 24, 2011

    So, well, a-ha, i will boot my laptop this evening and (try to) write a patch for normalize_encoding(), which will match the standard-conforming LATIN1 and also continue to support the illegal latin-1, without actually changing the two users PyUnicode_Decode() and PyUnicode_AsEncodedString(), from which i'd better keep my hands off. But i'm slow, it may take until tomorrow...

    @ezio-melotti
    Member

    If the first normalization function is flexible enough to match most of the spellings of the optimized encodings, they will all benefit from the optimization without having to go through the long path.

    (If the normalized encoding name is then passed through, the following normalization functions will also have to do less work, but this is out of the scope of this issue.)

    @vstinner
    Member

    I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character other than a-z, A-Z and 0-9. Something like:

    >>> import re
    >>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
    ... 
    >>> normalize("UTF-8")
    'utf8'
    >>> normalize("ISO-8859-1")
    'iso88591'
    >>> normalize("latin1")
    'latin1'

    So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.

    I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)

    @vstinner
    Member

    Patch implementing my suggestion.

    @ezio-melotti
    Member

    That will also accept invalid names like 'iso88591' that are not valid now; 'iso 8859 1' is already accepted.

    @abalkin
    Member Author

    abalkin commented Feb 24, 2011

    On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:
    ..

    On this ticket, we're discussing just one application area: that
    of the builtin short cuts.

    Fair enough. I was hoping to close this ticket by simply committing
    the posted patch, but it looks like people want to do more. I don't
    think we'll get measurable performance gains but may improve code
    understandability.

    To have more encoding name variants benefit from the optimization,
    we might want to enhance that particular normalization function
    to avoid having to compare against "utf8" and "utf-8" in the
    encode/decode functions.

    Which function are you talking about?

    1. normalize_encoding() in unicodeobject.c
    2. normalizestring() in codecs.c

    The first is s.lower().replace('-', '_') and the second is
    s.lower().replace(' ', '_'). (Note space vs. dash difference.)

    Why do we need both? And why should they be different?

    @malemburg
    Member

    As promised, here's the list of places where the wrong Latin-1 encoding spelling is used:

    Lib//test/test_cmd_line.py:
    --         for encoding in ('ascii', 'latin1', 'utf8'):
    Lib//test/test_codecs.py:
    --         ef = codecs.EncodedFile(f, 'utf-8', 'latin1')
    Lib//test/test_shelve.py:
    --         shelve.Shelf(d, keyencoding='latin1')[key] = [1]
    --         self.assertIn(key.encode('latin1'), d)
    Lib//test/test_uuid.py:
    --             os.write(fds[1], value.hex.encode('latin1'))
    --             child_value = os.read(fds[0], 100).decode('latin1')
    Lib//test/test_xml_etree.py:
    --     >>> ET.tostring(ET.PI('test', '<testing&>\xe3'), 'latin1')
    --     b"<?xml version='1.0' encoding='latin1'?>\\n<?test <testing&>\\xe3?>"
    Lib//urllib/request.py:
    --             data = base64.decodebytes(data.encode('ascii')).decode('latin1')
    Lib//asynchat.py:
    --     encoding                = 'latin1'
    Lib//sre_parse.py:
    --         encode = lambda x: x.encode('latin1')
    Lib//distutils/command/bdist_wininst.py:
    --             # convert back to bytes. "latin1" simply avoids any possible
    --                 encoding="latin1") as script:
    --                 script_data = script.read().encode("latin1")
    Lib//test/test_bigmem.py:
    --         return s.encode("latin1")
    --         return bytearray(s.encode("latin1"))
    Lib//test/test_bytes.py:
    --         self.assertRaises(UnicodeEncodeError, self.type2test, sample, "latin1")
    --         b = self.type2test(sample, "latin1", "ignore")
    --         b = self.type2test(sample, "latin1")
    Lib//test/test_codecs.py:
    --         self.assertEqual("\udce4\udceb\udcef\udcf6\udcfc".encode("latin1", "surrogateescape"),
    Lib//test/test_io.py:
    --     with open(__file__, "r", encoding="latin1") as f:
    --         t.__init__(b, encoding="latin1", newline="\r\n")
    --         self.assertEqual(t.encoding, "latin1")
    --             for enc in "ascii", "latin1", "utf8" :# , "utf-16-be", "utf-16-le":
    Lib//ftplib.py:
    --     encoding = "latin1"

    I'll fix those later today or tomorrow.

    @malemburg
    Member

    STINNER Victor wrote:
    > 
    > STINNER Victor <victor.stinner@haypocalc.com> added the comment:
    > 
    > I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character different than a-z, A-Z and 0-9. Something like:
    > 
    >>>> import re
    >>>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
    > ... 
    >>>> normalize("UTF-8")
    > 'utf8'
    >>>> normalize("ISO-8859-1")
    > 'iso88591'
    >>>> normalize("latin1")
    > 'latin1'
    > 
    > So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.
    > 
    > I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)

    I think rather than removing any hyphens, spaces, etc. the
    function should additionally:

    • add hyphens whenever (they are missing and) there's a switch
      from [a-z] to [0-9]

    That way you end up with the correct names for the given set of
    optimized encoding names.
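
    A minimal sketch of that idea (my own illustration, assuming the existing lowercasing/underscore handling stays as it is; not code from any attached patch):

    import re

    def normalize(name):
        name = name.lower().replace('_', '-')              # existing behaviour (assumed)
        return re.sub(r'([a-z])([0-9])', r'\1-\2', name)   # add hyphen at letter->digit switch

    # normalize('latin1')    -> 'latin-1'
    # normalize('UTF8')      -> 'utf-8'
    # normalize('ISO8859-1') -> 'iso-8859-1'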

    @malemburg
    Member

    Alexander Belopolsky wrote:

    Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:

    On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:
    ..
    > On this ticket, we're discussing just one application area: that
    > of the builtin short cuts.
    >
    Fair enough. I was hoping to close this ticket by simply committing
    the posted patch, but it looks like people want to do more. I don't
    think we'll get measurable performance gains but may improve code
    understandability.

    > To have more encoding name variants benefit from the optimization,
    > we might want to enhance that particular normalization function
    > to avoid having to compare against "utf8" and "utf-8" in the
    > encode/decode functions.

    > Which function are you talking about?
    >
    > 1. normalize_encoding() in unicodeobject.c
    > 2. normalizestring() in codecs.c

    The first one, since that's being used by the shortcuts.

    > The first is s.lower().replace('-', '_') and the second is

    It does this: s.lower().replace('_', '-')

    > s.lower().replace(' ', '_'). (Note space vs. dash difference.)
    >
    > Why do we need both? And why should they be different?

    Because the first is specifically used for the shortcuts
    (which can do more without breaking anything, since it's
    only used internally) and the second prepares the encoding
    names for lookup in the codec registry (which has a PEP-100
    defined behavior we cannot easily change).
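
    In short (a compact restatement of the two behaviours as described in this exchange, not the C source itself):

    def normalize_encoding(name):    # unicodeobject.c, used only by the C shortcuts
        return name.lower().replace('_', '-')

    def normalizestring(name):       # codecs.c, prepares names for the codec registry (PEP 100)
        return name.lower().replace(' ', '_')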

    @vstinner
    Member

    Ooops, I attached the wrong patch. Here is the new fixed patch.

    Without the patch:

    >>> import timeit
    >>> timeit.Timer("'a'.encode('latin1')").timeit()
    3.8540711402893066
    >>> timeit.Timer("'a'.encode('latin-1')").timeit()
    1.4946870803833008

    With the patch:

    >>> import timeit
    >>> timeit.Timer("'a'.encode('latin1')").timeit()
    1.4461820125579834
    >>> timeit.Timer("'a'.encode('latin-1')").timeit()
    1.463456153869629
    
    >>> timeit.Timer("'a'.encode('UTF-8')").timeit()
    0.9479248523712158
    >>> timeit.Timer("'a'.encode('UTF8')").timeit()
    0.9208409786224365

    @malemburg
    Member

    Marc-Andre Lemburg wrote:

    I don't know who changed the encodings package's normalize_encoding() function (wasn't me), but it's a really slow implementation.

    The original version used the .translate() method which is a lot faster.

    I guess that's one of the reasons why Alexander found such a dramatic
    difference between the shortcut variant of the names and the ones
    going through the registry.

    I'll open a new issue for that part.

    bpo-11322

    @abalkin
    Member Author

    abalkin commented Feb 25, 2011

    Committed bpo-11303.diff and doc change in revision 88602.

    I think the remaining ideas are best addressed in bpo-11322.

    Given that we are starting to have a whole set of such aliases
    in the C code, I wonder whether it would be better to make the
    string comparisons more efficient, e.g.

    I don't think we can do much better than a string of strcmp()s. Even if a more efficient algorithm can be found, it will certainly be less readable. Moving the strcmp()s before normalize_encoding() (and either forgoing optimization for alternative capitalizations or using case-insensitive comparison) may be a more promising optimization strategy. In any case, all these micro-optimizations are dwarfed by the gain from bypassing the Python calls and are probably not worth pursuing.

    @abalkin abalkin self-assigned this Feb 25, 2011
    @vstinner
    Member

    r88586: Normalized the encoding names for Latin-1 and UTF-8 to
    'latin-1' and 'utf-8' in the stdlib.

    Why did you do that? We are trying to find a solution together, and you changed the code directly without any review. Your commit doesn't solve this issue.

    Your commit is now useless; can you please revert it?

    @rhettinger
    Contributor

    What's wrong with Marc's commit? He's using the standard names.

    @malemburg
    Copy link
    Member

    STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > r88586: Normalized the encoding names for Latin-1 and UTF-8 to
    > 'latin-1' and 'utf-8' in the stdlib.

    Why did you do that? We are trying to find a solution together, and you changed the code directly without any review. Your commit doesn't solve this issue.

    As discussed on python-dev, the stdlib should use Python's
    default names for encodings and that's what I changed.

    Your commit is now useless; can you please revert it?

    This ticket was mainly discussing use cases in
    3rd party applications, not code that we have control over
    in the stdlib - we can easily fix that and that's what I did
    with the above checkin.

    @malemburg
    Member

    Closing the ticket again.

    The problem in question is solved.

    @pitrou
    Member

    pitrou commented Feb 25, 2011

    What's wrong with Marc's commit? He's using the standard names.

    That's a pretty useless commit and it will make applying patches and backports more tedious, for no obvious benefit.
    Of course that concern will be removed if Marc-André also backports it to 3.2 and 2.7.

    @malemburg
    Member

    I guess you could regard the wrong encoding name use as a bug - it
    slows down several stdlib modules for no apparent reason.

    If you agree, Raymond, I'll backport the patch.

    @ezio-melotti
    Member

    +1 on the backport.

    @malemburg
    Member

    Marc-Andre Lemburg wrote:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:

    I guess you could regard the wrong encoding name use as a bug - it
    slows down several stdlib modules for no apparent reason.

    If you agree, Raymond, I'll backport the patch.

    We might actually backport Alexander's patch as well - for much
    the same reason.

    @rhettinger
    Contributor

    If you agree, Raymond, I'll backport the patch.

    Yes. That will address Antoine's legitimate concern about making other backports harder, and it will get all the Pythons to use the canonical spelling.

    For other spellings like "utf8" or "latin1", I wonder if it would be useful to emit a warning/suggestion to use the standard spelling.

    @merwok
    Member

    merwok commented Feb 26, 2011

    Such warnings about performance seem to me to be the domain of code analysis or lint tools, not the interpreter.

    @pitrou
    Member

    pitrou commented Feb 26, 2011

    For other spellings like "utf8" or "latin1", I wonder if it would be
    useful to emit a warning/suggestion to use the standard spelling.

    No, it would be a useless annoyance.

    @vstinner
    Member

    For other spellings like "utf8" or "latin1", I wonder
    if it would be useful to emit a warning/suggestion to use
    the standard spelling.

    Why do you want to emit a warning? utf8 is now as fast as utf-8.

    @ezio-melotti
    Member

    For other spellings like "utf8" or "latin1", I wonder if it would be
    useful to emit a warning/suggestion to use the standard spelling.

    I would prefer the note added by Alexander in the doc to mention *only* the preferred spellings (i.e. 'utf-8' and 'iso-8859-1') rather than all the variants that are actually optimized. One of the reasons that led me to open bpo-5902 is that I didn't like the inconsistencies in the encoding names (utf-8 vs utf8 vs UTF8 etc.). Suggesting only one spelling per encoding will fix the problem.

    FWIW, the correct spelling is 'latin1', not 'latin-1', but I still prefer 'iso-8859-1' over the two.

    (The note could also use some more 'markup' for the encoding names.)

    @abalkin
    Member Author

    abalkin commented Feb 26, 2011

    On Fri, Feb 25, 2011 at 8:29 PM, Antoine Pitrou <report@bugs.python.org> wrote:
    ..

    > For other spellings like "utf8" or "latin1", I wonder if it would be
    > useful to emit a warning/suggestion to use the standard spelling.

    No, it would be a useless annoyance.

    If we ever decide to get rid of codec aliases in the core and require
    users to translate names found in various internet standards to
    canonical Python spellings, we will have to issue deprecation warnings
    before that.

    As long as we recommend using, say, XML encoding metadata as-is, we
    cannot standardize on Python spellings because they differ from the XML
    standard. (For example, Python uses "latin-1" and proper XML only
    accepts "latin1". Of course, we can ask everyone to use iso-8859-1
    instead, but how many users can remember that name?)

    @pitrou
    Member

    pitrou commented Feb 26, 2011

    If we ever decide to get rid of codec aliases in the core

    "If".

    @abalkin
    Member Author

    abalkin commented Feb 26, 2011

    On Fri, Feb 25, 2011 at 8:39 PM, Ezio Melotti <report@bugs.python.org> wrote:
    ..

    I would prefer the note added by Alexander in the doc to mention *only* the preferred spellings
    (i.e. 'utf-8' and 'iso-8859-1') rather than all the variants that are actually optimized. One of the reasons
    that led me to open bpo-5902 is that I didn't like the inconsistencies in the encoding names (utf-8 vs utf8 vs
    UTF8 etc.). Suggesting only one spelling per encoding will fix the problem.

    I am fine with trimming the list. In fact I deliberately did not
    mention, say, the UTF-8 variant even though it is also optimized.
    Unfortunately, I don't think we have a choice between 'latin-1',
    'latin1', and 'iso-8859-1'. I don't think we should recommend
    'latin-1' because this may cause people to add '-' to a very popular
    and IANA-registered 'latin1' variant, and while 'iso-8859-1' is the
    most pedantically correct spelling, it is very user-unfriendly.

    @sdaoden
    Mannequin

    sdaoden mannequin commented Feb 26, 2011

    On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:

    r88586: Normalized the encoding names for Latin-1 and UTF-8 to
    'latin-1' and 'utf-8' in the stdlib.

    Even though - or maybe exactly because - i'm a newbie, i really
    want to add another message after all this biting is over.
    I've just read PEP-100 and msg129257 (on bpo-5902), and i feel
    a bit confused.

    Marc-Andre Lemburg <mal@egenix.com> added the comment:
    It turns out that there are three "normalize" functions that are
    successively applied to the encoding name during evaluation of
    str.encode/str.decode.

    1. normalize_encoding() in unicodeobject.c

    This was added to have the few shortcuts we have in the C code
    for commonly used codecs match more encoding aliases.

    The shortcuts completely bypass the codec registry and also
    bypass the function call overhead incurred by codecs
    run via the codec registry.

    The thing that i don't understand the most is that illegal
    (according to IANA standards) names are good on the one hand
    (latin-1, utf-16-be), but bad on the other, i.e. in my
    group-preserving code or haypo's very fast but name-joining patch
    (the first): a *local* change in unicodeobject.c, whose result is
    *only* used by the two users PyUnicode_Decode() and
    PyUnicode_AsEncodedString(). However:

    Marc-Andre Lemburg <mal@egenix.com> added the comment:
    Programmers who don't use the encoding names triggering those
    optimizations will still have a running program, it'll only be
    a bit slower and that's perfectly fine.

    Marc-Andre Lemburg <mal@egenix.com> added the comment:
    I think rather than removing any hyphens, spaces, etc. the
    function should additionally:

    • add hyphens whenever (they are missing and) there's a switch
      from [a-z] to [0-9]

    That way you end up with the correct names for the given set
    of optimized encoding names.

    haypo's patch can easily be adjusted to reflect this, resulting in
    much cleaner code in the two mentioned users, because
    normalize_encoding() did the job it was meant for.
    (Hmmm, and my own code could also be adjusted to match Python
    semantics (using a hyphen instead of a space as the group separator),
    so that an end-user has the choice between *all* IANA standard
    names (e.g. "ISO-8859-1", "ISO8859-1", "ISO_8859-1", "LATIN1"),
    and would gain the full optimization benefit of using latin-1,
    which seems to be pretty useful for limburger.)

    Ezio Melotti wrote:
    Marc-Andre Lemburg wrote:
    > That won't work, Victor, since it makes invalid encoding
    > names valid, e.g. 'utf(=)-8'.

    That already works in Python (thanks to encodings.normalize_encoding)

    *However*: in PEP-100 Python has decided to go its own way
    a decade ago.

    Marc-Andre Lemburg <mal@egenix.com> added the comment:
    2. normalizestring() in codecs.c

    This is the normalization applied by the codec registry. See PEP-100
    for details:

    """
    Search functions are expected to take one argument,
    the encoding name in all lower case letters and with hyphens
    and spaces converted to underscores, ...
    """

    3. normalize_encoding() in encodings/__init__.py

    This is part of the stdlib encodings package's codec search function.

    First: *i* go for haypo:

    It's funny: we have 3 functions to normalize an encoding name, and
    each function does something else :-)

    (that's bpo-11322:)

    We should first implement the same algorithm of the 3 normalization
    functions and add tests for them

    And *i* don't understand anything else (i do have *my* - now
    further optimized, thanks - s_textcodec_normalize_name()).
    However, two different ones (a very fast one which is enough for
    unicodeobject.c and a global one for anything else) may also do.
    Isn't anything else a maintenance mess? Where is that database,
    are there any known dependencies which are exposed to end-users?
    Or the like.

    I'm much too loud, and have a nice weekend.

    @malemburg
    Member

    Raymond Hettinger wrote:

    Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:

    > If you agree, Raymond, I'll backport the patch.

    Yes. That will address Antoine's legitimate concern about making other backports harder, and it will get all the Pythons to use the canonical spelling.

    Ok, I'll backport both the normalization and Alexander's patch.

    For other spellings like "utf8" or "latin1", I wonder if it would be useful to emit a warning/suggestion to use the standard spelling.

    While it would make sense for Python programs, it would not for
    cases where the encoding is read from some other source, e.g.
    an XML encoding declaration.

    However, perhaps we could have a warning which is disabled
    by default and can be enabled using the -W option.

    @malemburg
    Member

    M.-A. Lemburg wrote:
    > Raymond Hettinger wrote:
    >>
    >> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
    >>
    >>> If you agree, Raymond, I'll backport the patch.
    >>
    >> Yes.  That will address Antoine's legitimate concern about making other backports harder, and it will get all the Pythons to use the canonical spelling.
    > 
    > Ok, I'll backport both the normalization and Alexander's patch.

    Hmm, I wanted to start working on this just now and then saw
    Georg's mail about the hg transition today, so I guess the
    backport will have to wait until Monday... will be interesting
    to see whether hg is really so much better than svn ;-)

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022