b'x'.decode('latin1') is much slower than b'x'.decode('latin-1') #55512
Comments
$ ./python.exe -m timeit "b'x'.decode('latin1')"
100000 loops, best of 3: 2.57 usec per loop
$ ./python.exe -m timeit "b'x'.decode('latin-1')"
1000000 loops, best of 3: 0.336 usec per loop

The reason for this behavior is that 'latin-1' is short-circuited in C code while 'latin1' has to be looked up in aliases.py. The attached patch fixes this issue.
In bpo-11303.diff, I add a similar optimization for encode('latin1') and for the 'utf8' variant of utf-8. I don't think the dash-less variants of utf-16 and utf-32 are common enough to justify special-casing.
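For context, the short-circuit in question is a chain of strcmp() calls in unicodeobject.c. A rough sketch of the idea (the wrapper function is mine for illustration; in CPython the comparisons are inlined in PyUnicode_Decode(), and exact details vary by version):

#include <Python.h>
#include <string.h>

/* "lower" is the encoding name already lower-cased by the internal
 * normalize_encoding() helper. */
static PyObject *
decode_shortcut_sketch(const char *lower, const char *s,
                       Py_ssize_t size, const char *errors)
{
    if (strcmp(lower, "utf-8") == 0 ||
        strcmp(lower, "utf8") == 0)              /* added by the patch */
        return PyUnicode_DecodeUTF8(s, size, errors);
    if (strcmp(lower, "latin-1") == 0 ||
        strcmp(lower, "latin1") == 0 ||          /* added by the patch */
        strcmp(lower, "iso-8859-1") == 0)
        return PyUnicode_DecodeLatin1(s, size, errors);
    return NULL;  /* fall through to the full codec lookup (aliases.py) */
}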
+1 for the patch.
Alexander Belopolsky wrote:
Looks good. Given that we are starting to have a whole set of such aliases […]
I wonder what this normalize_encoding() does! Here is a pretty standard version of mine, which is a bit more expensive but catches many more cases. This is stripped, of course, and can be rewritten very easily to Python's needs (i.e., using char[32] instead of char[11]).
static s_CString *
s_textcodec_normalize_name(s_CString *_name) {
   enum { C_NONE, C_WS, C_ALPHA, C_DIGIT } c_type = C_NONE;
   /* s_cstring_cstr(), s_cstring_create(), s_cstring_append_char() and
    * s_cstring_assign() are assumed helpers from the same private
    * string library as the other s_* calls */
   char *name = s_cstring_cstr(_name), c;
   auto s_CString input;

   s_cstring_create(&input);
   while ((c = *(name++)) != s_NUL) {
      s_si8 sep = s_FAL0;
      if (s_char_is_space(c) || s_char_is_punct(c)) {
         if (c_type == C_WS)
            continue;                  /* collapse runs of separators */
         c_type = C_WS;
         c = ' ';
      } else if (s_char_is_alpha(c)) {
         sep = (c_type == C_DIGIT);    /* digit -> alpha boundary */
         c_type = C_ALPHA;
         c = s_char_to_lower(c);
      } else if (s_char_is_digit(c)) {
         sep = (c_type == C_ALPHA);    /* alpha -> digit boundary */
         c_type = C_DIGIT;
      } else
         continue;                     /* drop any other character */
      if (sep)
         s_cstring_append_char(&input, ' ');
      s_cstring_append_char(&input, c);
   }
   s_cstring_assign(_name, &input);
   s_cstring_destroy(&input);
   return _name;
}
(That is to say, I would do it. But not if _cpython is thrown in the trash ,-); i.e., not if there is not a slight chance that it actually gets patched in, because this performance issue probably doesn't mean a thing in real life. You know, I'm a slow programmer; I would need *at least* two hours to rewrite that in plain C in a way that can make it a replacement for normalize_encoding().)
See also the discussion on bpo-5902. Steffen, your normalization function looks similar to encodings.normalize_encoding, with just a few differences (it uses spaces instead of dashes, and it divides alpha chars from digits). If it doesn't slow down the normal cases (i.e. 'utf-8', 'utf8', 'latin-1', etc.), a more flexible normalization done earlier might be a valid alternative.
On Thu, Feb 24, 2011 at 10:30 AM, Ezio Melotti <report@bugs.python.org> wrote:
Mark has closed bpo-5902, and indeed the discussion of how to efficiently […]
I didn't actually invent this algorithm (but don't ask me where I got the idea from years ago); I've just implemented the function you see. The algorithm itself avoids some pitfalls with respect to combining numerics and significantly reduces the number of possible normalization cases: many different spellings all become the same normalized name (e.g., 'UTF-8', 'utf8', and 'Utf_8' all become 'utf 8').
(Everything else is beyond my scope. But normalizing _ to - is possibly a bad idea, as far as I can remember the situation from three years ago.)
P.P.S.: separating alphas from digits is a win for things like UTF-16BE: it becomes 'utf 16 be'. Think about the possible misspellings here and you see why this algorithm is a good thing.
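For illustration, a hypothetical call of the function above (everything here uses the same private s_* string library, so the constructor name is an assumption):

/* Hypothetical usage; s_cstring_create_cstr() is assumed to exist in
 * the same private library as the other s_* helpers. */
s_CString name;
s_cstring_create_cstr(&name, "UTF-16BE");
s_textcodec_normalize_name(&name);   /* name now holds "utf 16 be" */
s_cstring_destroy(&name);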
Alexander Belopolsky wrote:
Please see my reply on this ticket: those three functions have […] On this ticket, we're discussing just one application area: that […] To have more encoding name variants benefit from the optimization, […]
Steffen Daode Nurpmeso wrote:
Please don't forget that the shortcuts in question are *optimizations*. Programmers who don't use the encoding names triggering those […]
So, well, a-ha, I will boot my laptop this evening and (try to) write a patch for normalize_encoding() which will match the standard-conforming LATIN1 and will also continue to support the illegal latin-1, without actually changing its two users, PyUnicode_Decode() and PyUnicode_AsEncodedString(), from which I'd better keep my hands off. But I'm slow; it may take until tomorrow...
If the first normalization function is flexible enough to match most of the spellings of the optimized encodings, they will all benefit from the optimization without having to go through the long path. (If the normalized encoding name is then passed through, the following normalization functions will also have less work to do, but that is out of the scope of this issue.)
I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character other than a-z, A-Z, and 0-9. Something like:

>>> import re
>>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
...
>>> normalize("UTF-8")
'utf8'
>>> normalize("ISO-8859-1")
'iso88591'
>>> normalize("latin1")
'latin1'

So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1, and utf8. I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)
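In C, this proposal might look like the following sketch (my illustration, not the actual patch; the signature and buffer handling are assumptions):

#include <ctype.h>
#include <stddef.h>

/* Sketch: keep only ASCII alphanumerics, lower-cased, so "ISO-8859-1",
 * "ISO_8859-1" and "iso 8859 1" would all normalize to "iso88591".
 * Returns 0 if the result does not fit into the output buffer. */
static int
normalize_encoding(const char *encoding, char *lower, size_t lower_len)
{
    size_t i = 0;
    for (; *encoding != '\0'; encoding++) {
        unsigned char c = (unsigned char)*encoding;
        if (!isalnum(c))
            continue;            /* drop '-', '_', spaces, dots, ... */
        if (i + 1 >= lower_len)
            return 0;            /* encoding name too long */
        lower[i++] = (char)tolower(c);
    }
    lower[i] = '\0';
    return 1;
}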
Patch implementing my suggestion.
That will also accept invalid names like 'iso88591' that are not valid now; 'iso 8859 1' is already accepted.
On Thu, Feb 24, 2011 at 11:01 AM, Marc-Andre Lemburg wrote:
Fair enough. I was hoping to close this ticket by simply committing […]
Which function are you talking about?
The first is s.lower().replace('-', '_') and the second is […] Why do we need both? And why should they be different?
As promised, here's the list of places where the wrong Latin-1 encoding spelling is used:

Lib//test/test_cmd_line.py:
-- for encoding in ('ascii', 'latin1', 'utf8'):
Lib//test/test_codecs.py:
-- ef = codecs.EncodedFile(f, 'utf-8', 'latin1')
Lib//test/test_shelve.py:
-- shelve.Shelf(d, keyencoding='latin1')[key] = [1]
-- self.assertIn(key.encode('latin1'), d)
Lib//test/test_uuid.py:
-- os.write(fds[1], value.hex.encode('latin1'))
-- child_value = os.read(fds[0], 100).decode('latin1')
Lib//test/test_xml_etree.py:
-- >>> ET.tostring(ET.PI('test', '<testing&>\xe3'), 'latin1')
-- b"<?xml version='1.0' encoding='latin1'?>\\n<?test <testing&>\\xe3?>"
Lib//urllib/request.py:
-- data = base64.decodebytes(data.encode('ascii')).decode('latin1')
Lib//asynchat.py:
-- encoding = 'latin1'
Lib//sre_parse.py:
-- encode = lambda x: x.encode('latin1')
Lib//distutils/command/bdist_wininst.py:
-- # convert back to bytes. "latin1" simply avoids any possible
-- encoding="latin1") as script:
-- script_data = script.read().encode("latin1")
Lib//test/test_bigmem.py:
-- return s.encode("latin1")
-- return bytearray(s.encode("latin1"))
Lib//test/test_bytes.py:
-- self.assertRaises(UnicodeEncodeError, self.type2test, sample, "latin1")
-- b = self.type2test(sample, "latin1", "ignore")
-- b = self.type2test(sample, "latin1")
Lib//test/test_codecs.py:
-- self.assertEqual("\udce4\udceb\udcef\udcf6\udcfc".encode("latin1", "surrogateescape"),
Lib//test/test_io.py:
-- with open(__file__, "r", encoding="latin1") as f:
-- t.__init__(b, encoding="latin1", newline="\r\n")
-- self.assertEqual(t.encoding, "latin1")
-- for enc in "ascii", "latin1", "utf8" :# , "utf-16-be", "utf-16-le":
Lib//ftplib.py:
-- encoding = "latin1" I'll fix those later today or tomorrow. |
STINNER Victor wrote:
>
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> I think that the normalization function in unicodeobject.c (only used for internal functions) can skip any character different than a-z, A-Z and 0-9. Something like:
>
>>>> import re
>>>> def normalize(name): return re.sub("[^a-z0-9]", "", name.lower())
> ...
>>>> normalize("UTF-8")
> 'utf8'
>>>> normalize("ISO-8859-1")
> 'iso88591'
>>>> normalize("latin1")
> 'latin1'
>
> So ISO-8859-1, ISO8859-1, LATIN-1, latin1, UTF-8, utf8, etc. will be normalized to iso88591, latin1 and utf8.
>
> I don't know any encoding name where a character outside a-z, A-Z, 0-9 means anything special. But I don't know all encoding names! :-)

I think rather than removing any hyphens, spaces, etc., the […]
That way you end up with the correct names for the given set of […]
Alexander Belopolsky wrote:
The first one, since that's being used by the shortcuts.
It does this: s.lower().replace('_', '-')
Because the first is specifically used for the shortcuts […]
Oops, I attached the wrong patch. Here is the new fixed patch. Without the patch:

>>> import timeit
>>> timeit.Timer("'a'.encode('latin1')").timeit()
3.8540711402893066
>>> timeit.Timer("'a'.encode('latin-1')").timeit()
1.4946870803833008

With the patch:

>>> import timeit
>>> timeit.Timer("'a'.encode('latin1')").timeit()
1.4461820125579834
>>> timeit.Timer("'a'.encode('latin-1')").timeit()
1.463456153869629
>>> timeit.Timer("'a'.encode('UTF-8')").timeit()
0.9479248523712158
>>> timeit.Timer("'a'.encode('UTF8')").timeit()
0.9208409786224365
Marc-Andre Lemburg wrote:
I guess that's one of the reasons why Alexander found such a dramatic […]
Committed bpo-11303.diff and doc change in revision 88602. I think the remaining ideas are best addressed in bpo-11322.
I don't think we can do much better than a chain of strcmp()s. Even if a more efficient algorithm can be found, it will certainly be less readable. Moving the strcmp()s before normalize_encoding() (and either forgoing optimization for alternative capitalizations or using a case-insensitive comparison) may be a more promising strategy. In any case, all these micro-optimizations are dwarfed by the gain from bypassing Python calls and are probably not worth pursuing.
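The "strcmp()s before normalize_encoding()" idea could be sketched like this (an assumption of mine, not committed code):

#include <ctype.h>

/* Sketch: compare a raw encoding name case-insensitively against a
 * lower-case reference spelling, so a shortcut could run before any
 * normalization.  Assumes ASCII-only encoding names. */
static int
shortcut_match(const char *enc, const char *ref)
{
    for (; *enc != '\0' && *ref != '\0'; enc++, ref++) {
        if (tolower((unsigned char)*enc) != *ref)
            return 0;
    }
    return *enc == '\0' && *ref == '\0';
}

/* shortcut_match(encoding, "latin-1") would match "Latin-1", "LATIN-1", ... */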
Why did you do that? We are trying to find a solution together, and you changed the code directly without any review. Your commit doesn't solve this issue; it is now useless. Can you please revert it?
What's wrong with Marc's commit? He's using the standard names. |
STINNER Victor wrote:
As discussed on python-dev, the stdlib should use Python's […]
This ticket was mainly discussing use cases in […]
Closing the ticket again. The problem in question is solved. |
That's a pretty useless commit and it will make applying patches and backports more tedious, for no obvious benefit. |
I guess you could regard the wrong encoding name use as a bug - it […] If you agree, Raymond, I'll backport the patch.
+1 on the backport.
Marc-Andre Lemburg wrote:
We might actually backport Alexander's patch as well - for much […]
Yes. That will address Antoine's legitimate concern about making other backports harder, and it will get all the Pythons to use the canonical spelling. For other spellings like "utf8" or "latin1", I wonder if it would be useful to emit a warning/suggestion to use the standard spelling.
Such warnings about performance seem to me to be the domain of code analysis or lint tools, not the interpreter. |
No, it would be a useless annoyance.
Why do you want to emit a warning? utf8 is now as fast as utf-8. |
I would prefer to see the note added by Alexander in the doc mention *only* the preferred spellings (i.e. 'utf-8' and 'iso-8859-1') rather than all the variants that are actually optimized. One of the reasons that led me to open bpo-5902 is that I didn't like the inconsistencies in the encoding names (utf-8 vs utf8 vs UTF8, etc.). Suggesting only one spelling per encoding will fix the problem. FWIW, the correct spelling is 'latin1', not 'latin-1', but I still prefer 'iso-8859-1' over the two. (The note could also use some more […])
On Fri, Feb 25, 2011 at 8:29 PM, Antoine Pitrou <report@bugs.python.org> wrote:
If we ever decide to get rid of codec aliases in the core and require […] As long as we recommend using, say, XML encoding metadata as is, we […]
"If". |
On Fri, Feb 25, 2011 at 8:39 PM, Ezio Melotti <report@bugs.python.org> wrote:
I am fine with trimming the list. In fact, I deliberately did not […]
On Fri, Feb 25, 2011 at 03:43:06PM +0000, Marc-Andre Lemburg wrote:
Even though - or maybe exactly because - I'm a newbie, I really […]
The thing that I don't understand the most is that illegal […]
haypo's patch can easily be adjusted to reflect this, resulting in […]
*However*: in PEP 100 Python has decided to go its own way […]
First: *I* go for haypo:
(that's bpo-11322:)
And *I* don't understand anything else (I do have *my* […]). Now I'm much too loud; have a nice weekend.
Raymond Hettinger wrote:
Ok, I'll backport both the normalization and Alexander's patch.
While it would make sense for Python programs, it would not for […] However, perhaps we could have a warning which is disabled […]
M.-A. Lemburg wrote:
> Raymond Hettinger wrote:
>>
>> Raymond Hettinger <rhettinger@users.sourceforge.net> added the comment:
>>
>>> If you agree, Raymond, I'll backport the patch.
>>
>> Yes. That will address Antoine's legitimate concern about making other backports harder, and it will get all the Pythons to use the canonical spelling.
>
> Ok, I'll backport both the normalization and Alexander's patch.

Hmm, I wanted to start working on this just now and then saw […]