Optimize ascii and latin1 decoder with surrogateescape and surrogatepass error handlers #69058
surrogateescape is the recommended way to mix binary data into the str protocol. One concrete problem is PyMySQL/PyMySQL#366: surrogateescape is slow because the error handler is called with a UnicodeError object. surrogateescape is ordinarily used with the ASCII and UTF-8 encodings. I want Python 3.4 and Python 3.5 to fix this issue, since it is a critical problem.
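For context, a minimal sketch (stdlib only) of what the surrogateescape handler from PEP 383 does: undecodable bytes are smuggled through str as lone surrogates U+DC80..U+DCFF and restored losslessly on encoding.

```python
# Sketch: round-tripping arbitrary binary data through str with
# the "surrogateescape" error handler (PEP 383).
data = bytes(range(256))                        # arbitrary binary data
text = data.decode('ascii', 'surrogateescape')  # bytes >= 0x80 -> U+DC80..U+DCFF
assert text[0xFF] == '\udcff'                   # byte 0xFF became a lone surrogate
assert text.encode('ascii', 'surrogateescape') == data  # lossless round trip
```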
On a MacBook Pro (Core i5, 2.6 GHz), decoding 1 MB of data with surrogateescape takes about 250 ms:

```python
In [1]: bs = bytes(range(256)) * (4 * 1024)
In [2]: len(bs)
In [3]: %timeit x = bs.decode('ascii', 'surrogateescape')
```
Why are bytes being escaped in a binary blob? The reason to use surrogateescape is when you have data that is mostly text, should be processed as text, but can have occasional binary data. That wouldn't seem to apply to a database binary blob.

But that aside, if you want to submit a patch to speed up surrogateescape without changing its functionality, I'm sure it would be considered. It would certainly be useful for the email library, which currently does do the stupid thing of encoding binary message attachments using surrogateescape (and I'm guessing the reason pymysql does it is something similar to why email does it: the code would need to be significantly reorganized to do things right).
A few months ago I wrote a patch that drastically speeds up encoding and decoding with the surrogateescape and surrogatepass error handlers. However, it causes a 25% regression in decoding some UTF-8 data (U+0100-U+07FF, if I remember correctly) with the strict error handler, so it needs some work. I hope that it is possible to rewrite the UTF-8 decoder so as to avoid the regression. The patch was postponed until 3.5 is released. In any case, the patch is large and complex enough to count as a new feature, which can appear only in 3.6.
Serhiy: maybe we can start with ascii?
Oh. I restored the old title because I replied by email to an old email.
I've stripped Serhiy's patch down to ascii. Here is the benchmark result: Is there any chance of applying this patch to 3.5.1?
Because SQL may contain binary data:

```python
data = bytes(range(256))
cursor.execute(u"INSERT INTO t (blob_col) values (%(data)s)", {"data": data})
```

The DB driver should escape the data properly for SQL syntax, then decode it with surrogateescape so the % operator can be used. Since bytes supports the % operator in Python 3.5, I can use it instead of unicode %.
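A hedged sketch of that alternative, using the bytes % operator from PEP 461 (Python 3.5+) so the query never leaves the bytes domain; the escaping shown is a placeholder for illustration, not any real driver's escaping.

```python
# Sketch: building the query as bytes with % (PEP 461), avoiding
# the decode/encode round trip entirely.
data = bytes(range(256))
escaped = data.replace(b"'", b"''")  # placeholder escaping; real drivers do more
query = b"INSERT INTO t (blob_col) values ('%s')" % (escaped,)
assert escaped in query
```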
Since you already have to rewrite the string to do the escaping, I would judge it worth the extra effort to piece the string together as binary, but I can understand wanting to use % notation. The performance issue seems to prevent that, though, and there's no guarantee the proposed optimization will get applied to 3.4, or even 3.5.
I rebased the patch because faster-decode-ascii-surrogateescape.patch was generated in git format, and the git format is not accepted by Rietveld (the Review button).
New changeset 3c430259873e by Victor Stinner in branch 'default': |
I pushed a change to optimize the ASCII decoder. Attached bench.py script: a microbenchmark on the ASCII decoder. My results follow. (Benchmark tables omitted: common platform, platform of campaign before, platform of campaign after.)
New changeset 2cf85e2834c2 by Victor Stinner in branch 'default':
New changeset aa247150a8b1 by Victor Stinner in branch 'default':
Ok, I prepared the code for the UTF-8 optimization. @serhiy: would you like to rebase your patch faster_surrogates_hadling.patch? Attached utf8.patch is a less optimal implementation which only changes PyUnicode_DecodeUTF8Stateful(). Maybe it's enough? I would like to see a benchmark here to choose a good compromise between performance and code complexity.
New changeset 8317796ca004 by Victor Stinner in branch 'default': |
I pushed utf8.patch by mistake :-/ The advantage is that the buildbots found bugs. Attached utf8-2.patch fixes them. The bug was in how the "s" variable was set in the error handler; it is now set with:

    s = starts + endinpos;

Bugs found by the buildbots:

```
======================================================================
Traceback (most recent call last):
  File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_unicode.py", line 1897, in test_invalid_cb_for_3bytes_seq
    'invalid continuation byte')
  File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_unicode.py", line 1772, in assertCorrectUTF8Decoding
    self.assertEqual(seq.decode('utf-8', 'replace'), res)
AssertionError: '��\x00' != '�\x00'
- ��
?  -
+ �
======================================================================
Traceback (most recent call last):
  File "/opt/python/3.x.langa-ubuntu/build/Lib/test/test_urllib.py", line 1016, in test_unquote_with_unicode
    "using unquote(): %r != %r" % (expect, result))
AssertionError: '�' != '��'
- �
+ ��
?  +
 : using unquote(): '�' != '��'
```
Ok, here is a patch which optimizes surrogatepass too. Result of bench_utf8.py. (Benchmark tables omitted: common platform, platform of campaign before, platform of campaign after.)
I created the issue bpo-25227 to optimize the ASCII and Latin1 *encoders* for surrogateescape. |
I worked on the UTF-16 and UTF-32 encoders, but right now I'm away from my development computer. I'll provide an updated patch soon. I think that only the "surrogateescape" and "surrogatepass" error handlers need optimization, because they are used to interoperate with other programs, including old Python versions. "strict" stops processing, so no optimization is needed there. All other error handlers lose information and can't be used per se for transcoding bytes as string or string as bytes. They are used together with other slow code (for example, to encode a string for XML or HTML you first need to escape '&', '<' and quotes). It would be easy to add fast handling for 'ignore' and 'replace', but these error handlers are used largely to produce human-readable output, and adding it could slow down the common case (no errors). That is why I limit my patch to "surrogateescape" and "surrogatepass" only.
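A small illustration of the distinction drawn above, using one invalid UTF-8 byte: ignore and replace discard information, while surrogateescape preserves it and round-trips.

```python
raw = b'caf\xc3\xa9 \xff'  # valid UTF-8 followed by one invalid byte

# Lossy handlers: the original bytes cannot be recovered.
assert raw.decode('utf-8', 'ignore') == 'caf\xe9 '         # invalid byte dropped
assert raw.decode('utf-8', 'replace') == 'caf\xe9 \ufffd'  # replaced with U+FFFD

# surrogateescape keeps the information and round-trips.
text = raw.decode('utf-8', 'surrogateescape')
assert text.encode('utf-8', 'surrogateescape') == raw
```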
Serhiy wrote: "All other error handlers lose information and can't be used per se for transcoding bytes as string or string as bytes." Well, it was very simple to implement replace and ignore in the decoders, and I believe those error handlers are commonly used. "(...) adding it can slow down common case (no errors). That is why I limit my patch for "surrogateescape" and "surrogatepass" only." We can start with benchmarks and see whether modifying Objects/stringlib/ has a real impact on performance, or whether modifying the "slower" decoder in Objects/unicodeobject.c is enough. IMHO it's fine to implement many error handlers in Objects/unicodeobject.c: it's the "slow" path taken when at least one error occurred, so it doesn't impact the path that decodes valid UTF-8 strings.
I just pushed my patch to optimize the UTF-8 encoder with error handlers: see issue bpo-25267. It's up to 70 times as fast. The patch was based on Serhiy's work: faster_surrogates_hadling.patch, attached to this issue.
I created issue bpo-25301: "Optimize UTF-8 decoder with error handlers".
Short summary: Ok, I optimized the ASCII, Latin1 and UTF-8 codecs (encoders and decoders) for the most common error handlers.
The code handling other error handlers in the encoders has also been optimized. surrogateescape now has an efficient implementation for the ASCII, Latin1 and UTF-8 encoders and decoders.
INADA Naoki: "I want to Python 3.4 and Python 3.5 solve this issue since it's critical problem for some people." On microbenchmarks, the optimizations that I just implemented in Python 3.6 are impressive. The problem is that the implementation is quite complex. If I understood correctly, you are asking to optimize the decoders and encoders for ASCII and UTF-8 (modifying 4 functions) for the surrogateescape error handler. Is that right? Would UTF-8 alone be enough or not? I don't like backporting optimizations which are not yet well tested. To optimize the encoders, I wrote a whole new _PyBytesWriter API. We cannot backport this new API, even though it's private. So the backport may be more complex than the code in the default branch.
UTF-8 and Latin1 are the typical encodings for a MySQL query.

```python
# Decode binary data
x = data.decode('ascii', 'surrogateescape')
# %-format query
psql = sql % (escape(x),)  # sql is unicode
# Encode sql to the connection encoding (latin1 or utf8)
query_bytes = psql.encode('utf-8', 'surrogateescape')
```

So a decoder speedup is required only for ascii, but an encoder speedup is required for utf-8 and latin1. I'll consider other approaches too, including creating a speedup module and registering it on PyPI.
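A self-contained sketch of that flow; escape() here is a hypothetical stand-in for the driver's real SQL string escaping, and the final assertion only checks that the binary payload survives the round trip.

```python
def escape(s):
    # Hypothetical stand-in for the driver's SQL escaping.
    return s.replace("'", "''")

data = bytes(range(256))
x = data.decode('ascii', 'surrogateescape')      # decode binary data
psql = "INSERT INTO t (blob_col) values ('%s')" % (escape(x),)
query_bytes = psql.encode('utf-8', 'surrogateescape')
# Undoing the quote doubling recovers the original payload inside the query.
assert data in query_bytes.replace(b"''", b"'")
```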
FYI, I found a workaround.

```python
_table = [chr(i) for i in range(128)] + [chr(i) for i in range(0xdc80, 0xdd00)]

def decode_surroundescape(s):
    return s.decode('latin1').translate(_table)
```

```python
In [15]: data = b'\xff' * 1024 * 1024
In [16]: data.decode('ascii', 'surrogateescape') == decode_surroundescape(data)
In [17]: %timeit data.decode('ascii', 'surrogateescape')
In [18]: %timeit decode_surroundescape(data)
```
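The same comparison can be reproduced outside IPython with the stdlib timeit module; absolute numbers will of course vary by machine, so no figures are claimed here.

```python
import timeit

# Sketch: timing the plain decoder against the latin1+translate workaround.
setup = (
    "_table = [chr(i) for i in range(128)]"
    " + [chr(i) for i in range(0xdc80, 0xdd00)]\n"
    "data = b'\\xff' * 1024 * 1024\n"
)
slow = timeit.timeit("data.decode('ascii', 'surrogateescape')", setup, number=3)
fast = timeit.timeit("data.decode('latin1').translate(_table)", setup, number=3)
print('surrogateescape: %.3fs  workaround: %.3fs' % (slow, fast))
```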
Cool! Good job. |