Amazingly faster UTF-8 decoding #58943

serhiy-storchaka · 2012-05-06T18:00:54Z

BPO	14738
Nosy	@loewis, @jcea, @ronaldoussoren, @mdickinson, @pitrou, @vstinner, @ned-deily, @ezio-melotti, @serhiy-storchaka
Files	decode_utf8_4.patch decode_utf8_5.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2012-05-10.14:38:47.809>
created_at = <Date 2012-05-06.18:00:54.170>
labels = ['interpreter-core', 'expert-unicode', 'performance']
title = 'Amazingly faster UTF-8 decoding'
updated_at = <Date 2012-05-12.07:09:09.219>
user = 'https://github.com/serhiy-storchaka'

bugs.python.org fields:

activity = <Date 2012-05-12.07:09:09.219>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2012-05-10.14:38:47.809>
closer = 'pitrou'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2012-05-06.18:00:54.170>
creator = 'serhiy.storchaka'
dependencies = []
files = ['25484', '25485']
hgrepos = []
issue_num = 14738
keywords = ['patch']
message_count = 15.0
messages = ['160103', '160107', '160110', '160112', '160305', '160306', '160307', '160308', '160309', '160311', '160312', '160346', '160347', '160447', '160462']
nosy_count = 12.0
nosy_names = ['loewis', 'jcea', 'ronaldoussoren', 'mark.dickinson', 'janssen', 'pitrou', 'vstinner', 'ned.deily', 'ezio.melotti', 'Arfrever', 'python-dev', 'serhiy.storchaka']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue14738'
versions = ['Python 3.3']

serhiy-storchaka · 2012-05-06T18:00:51Z

I propose a complex patch which significantly speeds up UTF-8 decoding. The decoder is now faster even than the decoder in 3.2 (except in a few unrealistic pathological cases).

The decoder code has also been reduced and simplified (previously the decoding code was repeated in at least three places).

As a side effect, ASCII decoding is now faster on some platforms (bpo-14419).

Related issues:
[bpo-4868] Faster utf-8 decoding
[bpo-13417] faster utf-8 decoding
[bpo-14419] Faster ascii decoding
[bpo-14624] Faster utf-16 decoder
[bpo-14625] Faster utf-32 decoder
[bpo-14654] Faster utf-8 decoding

Here are the benchmark results (numbers are speeds in MB/s; each percentage in parentheses is the change of the patched build relative to that column).

On 32-bit Linux, AMD Athlon 64 X2 4600+ @ 2.4GHz:

                                      3.2           3.3(vanilla)  patched

utf-8 'A'*10000                       1199 (+69%)   1721 (+18%)   2032
utf-8 'A'*9999+'\x80'                 1189 (+25%)   996 (+49%)    1488
utf-8 'A'*9999+'\u0100'               1192 (-25%)   887 (+1%)     894
utf-8 'A'*9999+'\u8000'               1178 (-24%)   888 (+0%)     890
utf-8 'A'*9999+'\U00010000'           1177 (-29%)   872 (-4%)     837
utf-8 '\x80'*10000                    220 (+74%)    172 (+122%)   382
utf-8 '\x80'+'A'*9999                 1192 (+5%)    376 (+232%)   1250
utf-8 '\x80'*9999+'\u0100'            220 (+54%)    160 (+112%)   339
utf-8 '\x80'*9999+'\u8000'            220 (+54%)    160 (+112%)   339
utf-8 '\x80'*9999+'\U00010000'        221 (+49%)    176 (+88%)    330
utf-8 '\u0100'*10000                  220 (+74%)    163 (+134%)   382
utf-8 '\u0100'+'A'*9999               1177 (+4%)    382 (+219%)   1220
utf-8 '\u0100'+'\x80'*9999            220 (+74%)    163 (+134%)   382
utf-8 '\u0100'*9999+'\u8000'          220 (+74%)    163 (+134%)   382
utf-8 '\u0100'*9999+'\U00010000'      220 (+50%)    180 (+83%)    330
utf-8 '\u8000'*10000                  261 (+66%)    191 (+126%)   432
utf-8 '\u8000'+'A'*9999               1197 (+1%)    384 (+216%)   1212
utf-8 '\u8000'+'\x80'*9999            216 (+77%)    163 (+134%)   382
utf-8 '\u8000'+'\u0100'*9999          215 (+77%)    164 (+132%)   381
utf-8 '\u8000'*9999+'\U00010000'      261 (+46%)    201 (+89%)    380
utf-8 '\U00010000'*10000              248 (+44%)    198 (+80%)    357
utf-8 '\U00010000'+'A'*9999           1192 (-5%)    383 (+196%)   1135
utf-8 '\U00010000'+'\x80'*9999        220 (+73%)    180 (+111%)   380
utf-8 '\U00010000'+'\u0100'*9999      220 (+73%)    180 (+111%)   380
utf-8 '\U00010000'+'\u8000'*9999      261 (+54%)    201 (+100%)   403

ascii 'A'*10000                       233 (+971%)   1876 (+33%)   2496

On 32-bit Linux, Intel Atom N570 @ 1.66GHz:

                                      3.2           3.3(vanilla)  patched

utf-8 'A'*10000                       345 (+81%)    596 (+5%)     623
utf-8 'A'*9999+'\x80'                 335 (+41%)    303 (+56%)    474
utf-8 'A'*9999+'\u0100'               336 (-23%)    123 (+110%)   258
utf-8 'A'*9999+'\u8000'               337 (-24%)    123 (+108%)   256
utf-8 'A'*9999+'\U00010000'           336 (-24%)    261 (-3%)     254
utf-8 '\x80'*10000                    88 (+66%)     65 (+125%)    146
utf-8 '\x80'+'A'*9999                 334 (+8%)     124 (+190%)   360
utf-8 '\x80'*9999+'\u0100'            88 (+43%)     65 (+94%)     126
utf-8 '\x80'*9999+'\u8000'            88 (+43%)     65 (+94%)     126
utf-8 '\x80'*9999+'\U00010000'        89 (+40%)     65 (+92%)     125
utf-8 '\u0100'*10000                  88 (+85%)     65 (+151%)    163
utf-8 '\u0100'+'A'*9999               336 (+2%)     77 (+345%)    343
utf-8 '\u0100'+'\x80'*9999            88 (+86%)     65 (+152%)    164
utf-8 '\u0100'*9999+'\u8000'          88 (+86%)     65 (+152%)    164
utf-8 '\u0100'*9999+'\U00010000'      88 (+57%)     65 (+112%)    138
utf-8 '\u8000'*10000                  98 (+79%)     69 (+154%)    175
utf-8 '\u8000'+'A'*9999               339 (+3%)     77 (+353%)    349
utf-8 '\u8000'+'\x80'*9999            89 (+84%)     66 (+148%)    164
utf-8 '\u8000'+'\u0100'*9999          88 (+86%)     65 (+152%)    164
utf-8 '\u8000'*9999+'\U00010000'      98 (+58%)     69 (+125%)    155
utf-8 '\U00010000'*10000              104 (+46%)    79 (+92%)     152
utf-8 '\U00010000'+'A'*9999           339 (-5%)     124 (+160%)   323
utf-8 '\U00010000'+'\x80'*9999        88 (+84%)     68 (+138%)    162
utf-8 '\U00010000'+'\u0100'*9999      88 (+83%)     68 (+137%)    161
utf-8 '\U00010000'+'\u8000'*9999      98 (+63%)     72 (+122%)    160

ascii 'A'*10000                       132 (+499%)   758 (+4%)     791
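
The parenthesised percentages in the tables above can be reproduced from the speed columns. The sketch below is inferred from the published figures rather than taken from the benchmark tool: it assumes each percentage is the patched speed relative to that column, rounded to the nearest integer. The helper name `relative_change` is hypothetical. Because the speeds in the tables are already rounded, a recomputed value may differ from the printed one by one point.

```python
def relative_change(patched: float, baseline: float) -> int:
    """Return how much faster (+) or slower (-) `patched` is than `baseline`, in percent.

    Assumption: the tables round to the nearest integer percent.
    """
    return round((patched / baseline - 1) * 100)


if __name__ == "__main__":
    # Rows taken from the AMD Athlon 64 X2 table: (baseline speed, patched speed)
    print(relative_change(2032, 1199))  # utf-8 'A'*10000 against 3.2
    print(relative_change(894, 1192))   # utf-8 'A'*9999+'\u0100' against 3.2
    print(relative_change(2496, 233))   # ascii 'A'*10000 against 3.2
```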

pitrou · 2012-05-06T20:01:02Z

64-bit Linux, Intel Core i5 2500K:

                                      3.2           3.3           patched

utf-8 'A'*10000                       2550 (+198%)  6828 (+11%)   7607
utf-8 'A'*9999+'\x80'                 2501 (+118%)  2415 (+126%)  5456
utf-8 'A'*9999+'\u0100'               2501 (-20%)   2297 (-13%)   1996
utf-8 'A'*9999+'\u8000'               2494 (-14%)   2291 (-7%)    2133
utf-8 'A'*9999+'\U00010000'           2494 (-11%)   2293 (-3%)    2219
utf-8 '\x80'*10000                    422 (+135%)   517 (+92%)    991
utf-8 '\x80'+'A'*9999                 2513 (+12%)   860 (+228%)   2820
utf-8 '\x80'*9999+'\u0100'            426 (+102%)   525 (+64%)    862
utf-8 '\x80'*9999+'\u8000'            426 (+104%)   538 (+62%)    871
utf-8 '\x80'*9999+'\U00010000'        428 (+105%)   523 (+68%)    878
utf-8 '\u0100'*10000                  425 (+140%)   517 (+97%)    1019
utf-8 '\u0100'+'A'*9999               2488 (+2%)    820 (+211%)   2549
utf-8 '\u0100'+'\x80'*9999            426 (+139%)   517 (+97%)    1019
utf-8 '\u0100'*9999+'\u8000'          426 (+139%)   529 (+93%)    1019
utf-8 '\u0100'*9999+'\U00010000'      426 (+106%)   509 (+72%)    876
utf-8 '\u8000'*10000                  573 (+28%)    490 (+50%)    733
utf-8 '\u8000'+'A'*9999               2500 (+1%)    822 (+208%)   2528
utf-8 '\u8000'+'\x80'*9999            426 (+139%)   530 (+92%)    1018
utf-8 '\u8000'+'\u0100'*9999          428 (+138%)   509 (+100%)   1018
utf-8 '\u8000'*9999+'\U00010000'      573 (+17%)    447 (+51%)    673
utf-8 '\U00010000'*10000              562 (+24%)    552 (+26%)    696
utf-8 '\U00010000'+'A'*9999           2512 (+3%)    939 (+175%)   2584
utf-8 '\U00010000'+'\x80'*9999        423 (+140%)   553 (+84%)    1017
utf-8 '\U00010000'+'\u0100'*9999      426 (+139%)   549 (+85%)    1017
utf-8 '\U00010000'+'\u8000'*9999      572 (+18%)    479 (+41%)    674
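
For readers unfamiliar with the test inputs, the sketch below shows why these five characters make useful benchmark cases: they differ both in UTF-8 encoded length and in the per-character storage width that Python 3.3 strings use (PEP 393). The helper names `utf8_length` and `storage_width` are hypothetical and only mirror the documented rules; they are not part of the patch or the benchmark tool.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the UTF-8 encoding of `ch` occupies."""
    return len(ch.encode("utf-8"))


def storage_width(text: str) -> int:
    """Bytes per character a PEP 393 string needs, based on the widest code point."""
    widest = max(map(ord, text), default=0)
    if widest < 0x100:
        return 1  # ASCII or Latin-1
    if widest < 0x10000:
        return 2  # UCS-2
    return 4      # UCS-4


if __name__ == "__main__":
    for ch in ("A", "\x80", "\u0100", "\u8000", "\U00010000"):
        print(f"U+{ord(ch):04X}: {utf8_length(ch)} UTF-8 byte(s), "
              f"{storage_width(ch)} byte(s) per character in memory")
```

A mixed input such as `'A'*9999+'\u0100'` is mostly ASCII on the wire, but the single trailing character forces the whole result into the 2-byte representation.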

serhiy-storchaka · 2012-05-06T21:48:11Z

Thank you, Antoine. Finally Intel Core is defeated!

If someone wants to repeat tests, see benchmark tools in bpo-14624.
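
For a quick approximation without the tools from bpo-14624, a measurement along these lines could be used. This is a minimal sketch, not the benchmark script behind the tables: the function name `decode_speed_mb_per_s`, the repeat counts, and the choice of 1 MB = 10**6 bytes are all assumptions.

```python
import timeit


def decode_speed_mb_per_s(text: str, encoding: str = "utf-8",
                          repeat: int = 5, number: int = 1000) -> float:
    """Estimate decoding throughput for `text` in MB/s (1 MB = 10**6 bytes, an assumption)."""
    data = text.encode(encoding)
    # Take the best of several runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: data.decode(encoding),
                             repeat=repeat, number=number))
    return len(data) * number / best / 1e6


if __name__ == "__main__":
    cases = {
        "'A'*10000": "A" * 10000,
        "'A'*9999+'\\u0100'": "A" * 9999 + "\u0100",
        "'\\u8000'*10000": "\u8000" * 10000,
    }
    for label, text in cases.items():
        print(f"utf-8 {label}: {decode_speed_mb_per_s(text):.0f} MB/s")
```

Absolute numbers depend heavily on the machine and build options, so only ratios between interpreters measured on the same host are meaningful.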

serhiy-storchaka · 2012-05-06T22:11:07Z

The patch has been updated in accordance with Antoine's cosmetic comments.

pitrou · 2012-05-09T16:50:51Z

There's a Mac-specific portion in the patch, it would be nice if someone could check that it works.

serhiy-storchaka · 2012-05-09T18:05:09Z

It would be good if someone checked that command-line arguments work on Macs, including arguments that are not valid UTF-8. The difficulty is that you need to check on Macs with both 16-bit and 32-bit wchar_t.
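
As an illustration of what "not valid UTF-8" means for such arguments, the sketch below uses the `surrogateescape` error handler from PEP 383 at the Python level. It assumes that the C-level command-line decoding follows the same convention; the sample bytes and helper names are hypothetical.

```python
def decode_arg(raw: bytes) -> str:
    """Decode bytes as UTF-8, mapping undecodable bytes to lone surrogates (PEP 383)."""
    return raw.decode("utf-8", "surrogateescape")


def encode_arg(text: str) -> bytes:
    """Reverse `decode_arg`, restoring the original bytes exactly."""
    return text.encode("utf-8", "surrogateescape")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9-\xff"  # valid UTF-8 followed by the invalid byte 0xFF
    text = decode_arg(raw)
    print(ascii(text))              # the invalid byte shows up as a lone surrogate
    print(encode_arg(text) == raw)  # round trip restores the original bytes
    try:
        raw.decode("utf-8")         # strict decoding rejects the same input
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc.reason)
```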

serhiy-storchaka · 2012-05-09T18:32:10Z

bpo-4388 is related to this Mac-specific portion of the patch.

pitrou · 2012-05-09T18:41:16Z

> It would be good if someone checked that command-line arguments work
> on Macs, including arguments that are not valid UTF-8. The difficulty
> is that you need to check on Macs with both 16-bit and 32-bit wchar_t.

Actually, it should be enough to run the test suite, since we should
have tests for this.
As for different wchar_t widths, that's the kind of thing we can leave
to the buildbots (assuming our OS X buildbots come back alive some
day :-)).

serhiy-storchaka · 2012-05-09T19:29:54Z

I hacked the code (commented out "#if __APPLE__" in
Objects/unicodeobject.c and Modules/python.c) to exercise this branch on
Linux and ran the test (test_cmd_line) with the C locale. It passed. Then I
deliberately broke the decoder and ran the test again to check that it
fails. I can now confirm that the code works correctly on a platform with
a 32-bit wchar_t.

mdickinson · 2012-05-09T20:13:57Z

> Actually, it should be enough to run the test suite, since we should
> have tests for this.

I just ran the test suite ("python -m test") on OS X 10.6.8 with 'decode_utf8_5.patch' applied. (64-bit --with-pydebug build of Python.) No test failures.

test header:

== CPython 3.3.0a3+ (default:840cb46d0395+, May 9 2012, 20:55:18) [GCC 4.2.1 (Apple Inc. build 5664)]
== Darwin-10.8.0-i386-64bit little-endian
== /Users/mdickinson/Python/cpython/build/test_python_39794

Fragment of configure output relevant to wchar looked like this:

checking wchar.h usability... yes
checking wchar.h presence... yes
checking for wchar.h... yes
checking size of wchar_t... 4
checking for UCS-4 tcl... no
checking whether wchar_t is signed... yes
no usable wchar_t found

vstinner · 2012-05-09T20:18:21Z

> The difficulty is that you need to check on both Macs
> with 16-bit and with 32-bit wchar_t.

I don't think that the size of wchar_t is configurable: it should always be 32 bits on Mac OS X.
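
The width can also be checked from a running interpreter on any platform. The sketch below uses `ctypes` and is independent of the patch; the helper name `wchar_bits` is hypothetical.

```python
import ctypes


def wchar_bits() -> int:
    """Width of the platform's C wchar_t in bits, as seen by ctypes."""
    return ctypes.sizeof(ctypes.c_wchar) * 8


if __name__ == "__main__":
    print(f"wchar_t is {wchar_bits()} bits wide on this platform")
```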

python-dev · 2012-05-10T14:38:11Z

New changeset e08c3791f035 by Antoine Pitrou in branch 'default':
Issue bpo-14738: Speed-up UTF-8 decoding on non-ASCII data. Patch by Serhiy Storchaka.
http://hg.python.org/cpython/rev/e08c3791f035

pitrou · 2012-05-10T14:38:48Z

The patch is now committed. Well done and thanks for your contribution.

serhiy-storchaka · 2012-05-11T19:45:44Z

Thanks, Martin, for the review, which allowed me to produce a quality patch, and for encouraging further research. Thanks, Antoine, for the review, the benchmarks, the commit, and the original optimization, which served as the basis for my patch.

vstinner · 2012-05-12T07:09:09Z

If the commit makes Python 3.3 faster than Python 3.2, it is an
optimisation that should be documented in the What's New in Python 3.3
document.

serhiy-storchaka added the interpreter-core and performance labels May 6, 2012

ezio-melotti added the topic-unicode label May 6, 2012

pitrou closed this as completed May 10, 2012

ezio-melotti transferred this issue from another repository Apr 10, 2022
