
Faster utf-8 decoding #49118

Closed
pitrou opened this issue Jan 7, 2009 · 14 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@pitrou
Member

pitrou commented Jan 7, 2009

BPO 4868
Nosy @malemburg, @loewis, @amauryfa, @pitrou, @ezio-melotti
Files
  • utf8decode3.patch
  • utf8decode4.patch
  • decode5.patch
  • decode6.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2009-01-10.15:46:28.154>
    created_at = <Date 2009-01-07.15:25:02.749>
    labels = ['interpreter-core', 'performance']
    title = 'Faster utf-8 decoding'
    updated_at = <Date 2010-04-04.03:26:17.054>
    user = 'https://github.com/pitrou'

    bugs.python.org fields:

    activity = <Date 2010-04-04.03:26:17.054>
    actor = 'ezio.melotti'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-01-10.15:46:28.154>
    closer = 'pitrou'
    components = ['Interpreter Core']
    creation = <Date 2009-01-07.15:25:02.749>
    creator = 'pitrou'
    dependencies = []
    files = ['12645', '12650', '12655', '12656']
    hgrepos = []
    issue_num = 4868
    keywords = ['patch']
    message_count = 14.0
    messages = ['79338', '79353', '79356', '79358', '79360', '79397', '79409', '79416', '79430', '79431', '79432', '79434', '79447', '79549']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'pitrou', 'kevinwatters', 'ezio.melotti']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'patch review'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue4868'
    versions = ['Python 3.1']

    @pitrou
    Member Author

    pitrou commented Jan 7, 2009

Here is a patch to speed up utf-8 decoding. On a 64-bit build, the maximum
speedup is around 30%, and on a 32-bit build around 15%. (*)

The patch may look disturbingly trivial, and I haven't studied the
assembler output, but I think the speedup is explained by the fact that
having a separate loop counter breaks the register dependencies (when
the 's' pointer was incremented, other operations had to wait for the
increment to be committed).
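
A minimal sketch of the shape of the change, with illustrative names
('s' and 'e' bounding the input bytes, 'p' the Py_UNICODE output
pointer); this is not the actual patch:

    /* Before: every dereference of 's' waits on the previous
       increment of 's' being committed. */
    while (s < e)
        *p++ = (Py_UNICODE)*s++;

    /* After: a separate loop counter indexes off a base pointer that
       stays constant across iterations, breaking the write-to-read
       dependency on 's'. */
    Py_ssize_t i, n = e - s;
    for (i = 0; i < n; i++)
        p[i] = (Py_UNICODE)s[i];
    p += n;
    s += n;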

    [side note: utf8 encoding is still much faster than decoding, but it may
    be because it allocates a smaller object, regardless of the iteration count]

The same principle can probably be applied to the other decoding
functions in unicodeobject.c, but first I wanted to know whether the
principle is OK to apply. Marc-André, what is your take?

    (*) the benchmark I used is:

./python -m timeit -s "import codecs;c=codecs.utf_8_decode;s=b'abcde'*1000" "c(s)"

    More complex input also gets a speedup, albeit a smaller one (~10%):

./python -m timeit -s "import codecs;c=codecs.utf_8_decode;s=b'\xc3\xa9\xe7\xb4\xa2'*1000" "c(s)"

    @pitrou pitrou added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 7, 2009
    @loewis
    Mannequin

    loewis mannequin commented Jan 7, 2009

    Can you please upload it to Rietveld?

    I'm skeptical about changes that merely rely on the compiler's register
    allocator to do a better job. This kind of change tends to pessimize the
    code for other compilers, and also may pessimize it for future versions
    of the same compiler.

    @pitrou
    Member Author

    pitrou commented Jan 7, 2009

As I said, I don't think it's due to register allocation, but simply
to avoiding register write-to-read dependencies by using separate
variables for the loop count and the pointer. I'm gonna try under
Windows (in a virtual machine, but it shouldn't make much difference
since the workload is CPU-bound).

I've opened a Rietveld issue here: http://codereview.appspot.com/11681

    @pitrou
    Member Author

    pitrou commented Jan 7, 2009

    Ha, the patch makes things slower on MSVC. The patch can probably be
    rejected, then.

    (and interestingly, MSVC produces 40% faster code than gcc on my
    mini-bench, despite the virtual machine overhead)

    @pitrou pitrou closed this as completed Jan 7, 2009
@malemburg
Member

On 2009-01-07 16:25, Antoine Pitrou wrote:

> Here is a patch to speed up utf-8 decoding. On a 64-bit build, the
> maximum speedup is around 30%, and on a 32-bit build around 15%.
> [...]
> The same principle can probably be applied to the other decoding
> functions in unicodeobject.c, but first I wanted to know whether the
> principle is OK to apply. Marc-André, what is your take?

    I'm +1 on anything that makes codecs faster :-)

    However, the patch should be checked with some other compilers
    as well, e.g. using MS VC++.


    @pitrou
    Member Author

    pitrou commented Jan 8, 2009

Reopening and attaching a more ambitious patch, based on optimizing
runs of ASCII characters. This time the speedup is much more
impressive: up to 75% faster on pure-ASCII input -- actually faster
than latin1.

The worst case (tight interleaving of ASCII and non-ASCII chars) shows
an 8% slowdown.

    (performance measured with gcc and MSVC)
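
In rough outline, the fast path works like this (a sketch with
illustrative names; the real patch also has to deal with pointer
alignment and the tail of the input):

    #if SIZEOF_LONG == 8
    # define ASCII_CHAR_MASK 0x8080808080808080UL
    #else
    # define ASCII_CHAR_MASK 0x80808080UL
    #endif

    /* While a whole word of input has no byte with its high bit set,
       the word is pure ASCII: widen and copy its bytes with no
       per-byte decoding logic at all. */
    while (s + SIZEOF_LONG <= e) {
        unsigned long data = *(const unsigned long *)s;
        int i;
        if (data & ASCII_CHAR_MASK)
            break;               /* some byte is non-ASCII: slow path */
        for (i = 0; i < SIZEOF_LONG; i++)
            p[i] = (Py_UNICODE)s[i];
        p += SIZEOF_LONG;
        s += SIZEOF_LONG;
    }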

    @pitrou pitrou reopened this Jan 8, 2009
    @amauryfa
    Member

    amauryfa commented Jan 8, 2009

Very nice! It seems you can go slightly faster by not copying the
initial char first: 's' is often already aligned at the beginning of
the string, but not after the first copy... Attached patch
(utf8decode4.patch) changes this and may enter the fast loop on the
first character.
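
Roughly, the reordering amounts to this (a sketch; 'fast_loop' is a
hypothetical label standing for the word-at-a-time loop above, and
LONG_PTR_MASK an assumed helper mask):

    #define LONG_PTR_MASK ((size_t)(SIZEOF_LONG - 1))

    /* Copy bytes one at a time only while 's' is misaligned; the
       moment it is aligned, enter the word-at-a-time ASCII loop. If
       's' is already aligned on entry -- the common case at the start
       of a string -- no preliminary byte copy happens at all. */
    while (s < e && (unsigned char)*s < 0x80) {
        if (!((size_t)s & LONG_PTR_MASK))
            goto fast_loop;          /* aligned: enter the word loop */
        *p++ = (Py_UNICODE)*s++;     /* misaligned: copy one byte, retest */
    }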

    Does this idea apply to the encode function as well?

    @pitrou
    Member Author

    pitrou commented Jan 8, 2009

> Attached patch (utf8decode4.patch) changes this and may enter the
> fast loop on the first character.

Thanks!

> Does this idea apply to the encode function as well?

Probably, although with less efficiency (a long can hold 1, 2 or 4
unicode characters depending on the build).

The unrolling part also applies to simple codecs such as latin1.
Unrolling PyUnicode_DecodeLatin1 a bit (4 copies per iteration) makes
it twice as fast on non-tiny strings. I'll experiment with utf16.
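
The kind of unrolling meant here, sketched with illustrative names
('s' the latin1 input, 'p' the Py_UNICODE output, 'size' the remaining
byte count); not the committed code:

    /* Four copies per iteration amortize the loop overhead; a
       trailing byte-at-a-time loop handles the remaining 0-3 bytes.
       For latin1 each byte value is already the code point. */
    while (size >= 4) {
        p[0] = (Py_UNICODE)s[0];
        p[1] = (Py_UNICODE)s[1];
        p[2] = (Py_UNICODE)s[2];
        p[3] = (Py_UNICODE)s[3];
        p += 4; s += 4; size -= 4;
    }
    while (size-- > 0)
        *p++ = (Py_UNICODE)*s++;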

    @pitrou
    Member Author

    pitrou commented Jan 8, 2009

    Attached patch adds acceleration for latin1 and utf16 decoding as well.

All three codecs (utf8, utf16, latin1) are now in the same ballpark
performance-wise on favorable input: on my machine, they are able to
decode at almost 1 GB/s.

(unpatched, it is between 150 and 500 MB/s, depending on the codec)

    @pitrou
    Member Author

    pitrou commented Jan 8, 2009

(PS: performance measured on UCS-2 and UCS-4 builds with gcc, and
under Windows with MSVC)

    @malemburg
    Member

Antoine Pitrou wrote:

> Attached patch adds acceleration for latin1 and utf16 decoding as well.
> [...]
> Added file: http://bugs.python.org/file12655/decode5.patch

A few style comments:

• please use indented #pre-processor directives whenever possible, e.g.

    #if
    # define
    #else
    # define
    #endif

• the conditions should only accept SIZEOF_LONG == 4 or 8 and fail
  with an #error for any other value

• you should use unsigned longs instead of signed ones

• please use spaces around arithmetic operators, e.g. not a+b, but
  a + b

• when calling functions with lots of parameters, put each parameter
  on a new line (e.g. for unicode_decode_call_errorhandler())

    Please also add a comment somewhere to the bit masks explaining what
    they do and how they are used. Verbose comments are always good to
    have when doing optimizations such as these. Have a look at the
    dictionary implementation for what I mean by this.
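
For instance, the word-size masks might be written like this under
those rules (a sketch, not the patch itself):

    /* ASCII_CHAR_MASK sets the high bit of every byte in a long; a
       word ANDed with it is zero iff every byte is < 0x80, i.e. pure
       ASCII. */
    #if SIZEOF_LONG == 8
    # define ASCII_CHAR_MASK 0x8080808080808080UL
    #elif SIZEOF_LONG == 4
    # define ASCII_CHAR_MASK 0x80808080UL
    #else
    # error "SIZEOF_LONG must be 4 or 8"
    #endif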

Thanks,

Marc-Andre Lemburg

    @pitrou
    Member Author

    pitrou commented Jan 8, 2009

    Marc-Andre, this patch should address your comments.

    @malemburg
    Member

Antoine Pitrou wrote:

> Marc-Andre, this patch should address your comments.
>
> Added file: http://bugs.python.org/file12656/decode6.patch

Thanks. Much better!

    BTW: I'd also change the variable name "word" to something
    different, e.g. bitmap or just data. It looks too much like
    a reserved word (which it isn't) ;-)

    @pitrou
    Member Author

    pitrou commented Jan 10, 2009

    I committed the patch with the last suggested change (word -> data) in
    py3k (r68483). I don't intend to backport it to trunk, but I suppose it
    wouldn't be too much work to do.

    @pitrou pitrou closed this as completed Jan 10, 2009
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022