Speed-up oparg decoding on little-endian machines #70009

Closed
rhettinger opened this issue Dec 8, 2015 · 21 comments
Assignees
Labels
interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@rhettinger
Contributor

rhettinger commented Dec 8, 2015

BPO 25823
Nosy @tim-one, @arigo, @rhettinger, @mdickinson, @pitrou, @vstinner, @skrah, @serhiy-storchaka, @serprex
Superseder
  • bpo-27097: ceval: Wordcode follow up, explicit unsigned short read

Files
  • improve_arg_decoding.diff
  • improve_arg_decoding2.diff: Post-increment version
  • improve_arg_decoding3.diff
  • exercise_oparg.py: Simple timing

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2016-05-25.17:07:00.606>
    created_at = <Date 2015-12-08.09:34:11.591>
    labels = ['interpreter-core', 'performance']
    title = 'Speed-up oparg decoding on little-endian machines'
    updated_at = <Date 2016-05-25.17:07:00.589>
    user = 'https://github.com/rhettinger'

    bugs.python.org fields:

    activity = <Date 2016-05-25.17:07:00.589>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2016-05-25.17:07:00.606>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2015-12-08.09:34:11.591>
    creator = 'rhettinger'
    dependencies = []
    files = ['41268', '41269', '41274', '41289']
    hgrepos = []
    issue_num = 25823
    keywords = ['patch']
    message_count = 21.0
    messages = ['256106', '256108', '256114', '256117', '256118', '256120', '256146', '256149', '256151', '256152', '256236', '256275', '256276', '256277', '256283', '256338', '264221', '264254', '264255', '264263', '266376']
    nosy_count = 9.0
    nosy_names = ['tim.peters', 'arigo', 'rhettinger', 'mark.dickinson', 'pitrou', 'vstinner', 'skrah', 'serhiy.storchaka', 'Demur Rumed']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = '27097'
    type = 'performance'
    url = 'https://bugs.python.org/issue25823'
    versions = ['Python 3.6']

    @rhettinger
    Contributor Author

    rhettinger commented Dec 8, 2015

    On little-endian machines, the decoding of an oparg can be sped up by using a single 16-bit pointer dereference.

    Current decoding:
    leaq 2(%rcx), %rbp
    movzbl -1(%rbp), %eax
    movzbl -2(%rbp), %r14d
    sall $8, %eax
    addl %eax, %r14d

    New decoding:
    leaq 2(%rdx), %r12
    movzwl -2(%r12), %r8d

    The patch uses (unsigned short *) like the struct module does, but it could use uint16_t if necessary.

    If next_instr can be advanced after the lookup rather than before, the generated code would be tighter still (removing the data dependency and shortening the movzwl instruction to drop the offset byte):

    movzwl  (%rdx), %r8d
    leaq    2(%rdx), %rbp
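
    For illustration, the two strategies above correspond roughly to the following C. This is a sketch with hypothetical function names, not the actual ceval.c code:

    /* Current decoding: two byte loads combined with a shift and an add.
     * The oparg is stored little-endian as the two bytes after the opcode. */
    static unsigned int
    decode_oparg_bytewise(const unsigned char *next_instr)
    {
        return next_instr[0] | (next_instr[1] << 8);
    }

    /* Proposed decoding on little-endian machines: a single 16-bit load.
     * As written this ignores the alignment and strict-aliasing questions
     * raised in the comments below. */
    static unsigned int
    decode_oparg_16bit(const unsigned char *next_instr)
    {
        return *(const unsigned short *)next_instr;
    }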
    

    @rhettinger rhettinger added interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Dec 8, 2015
    @serhiy-storchaka
    Member

    serhiy-storchaka commented Dec 8, 2015

    You forgot to attach a patch.

    @rhettinger
    Contributor Author

    rhettinger commented Dec 8, 2015

    Added patch.

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Dec 8, 2015

    In about half of the cases, next_instr points to an unaligned 16-bit value. Not all platforms allow access to unaligned data. We need another test in addition to PY_LITTLE_ENDIAN to allow this optimization.
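
    Concretely, the kind of guard being described might look like the sketch below; HAVE_UNALIGNED_ACCESS is an illustrative placeholder, not a macro CPython actually defines:

    /* Hypothetical guard: take the 16-bit fast path only when the byte order
     * matches and the platform is known to tolerate unaligned 16-bit loads.
     * HAVE_UNALIGNED_ACCESS is a made-up name used for illustration. */
    static unsigned int
    fetch_oparg(const unsigned char *next_instr)
    {
    #if defined(PY_LITTLE_ENDIAN) && PY_LITTLE_ENDIAN && defined(HAVE_UNALIGNED_ACCESS)
        return *(const unsigned short *)next_instr;   /* one 16-bit load */
    #else
        return next_instr[0] | (next_instr[1] << 8);  /* portable byte-wise read */
    #endif
    }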

    @mdickinson
    Member

    mdickinson commented Dec 8, 2015

    Hmm. Doesn't this introduce undefined behaviour? The new code looks like a violation of the strict aliasing rule. (Or do we compile with -fno-strict-aliasing or the like?)

    @skrah
    Mannequin

    skrah mannequin commented Dec 8, 2015

    Usually unaligned accesses are believed to carry a big performance
    penalty, though rumor has it that for the latest generation of CPUs
    this is no longer an issue.

    @mdickinson
    Member

    mdickinson commented Dec 9, 2015

    The latest patch still replaces valid C with code that has undefined behaviour. That's not good! Introducing undefined behaviour is something we should avoid without a really good reason.

    Benchmarks showing dramatic real-world speed improvements from this change might count as a good reason, of course. :-)

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Dec 9, 2015

    I think the following patch doesn't introduce undefined behavior. With this patch, GCC on 32-bit i386 Linux produces the same code as for a plain unsigned short read.

    I don't know whether the benefit is worth such complication. I don't know whether the patch can cause a performance regression on other platforms or compilers.

    @mdickinson
    Member

    mdickinson commented Dec 9, 2015

    I think the following patch doesn't introduce undefined behavior.

    Agreed. As I recall, the memcpy trick is a fairly standard way around this issue, for compilers that are smart enough to compile away the actual memcpy call.

    I don't know whether the patch can cause a performance regression on other platforms or compilers.

    Me neither.
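
    For reference, the memcpy trick being discussed looks roughly like this (a sketch, not the actual patch):

    #include <string.h>

    /* The "memcpy trick": copy the two bytes into a correctly typed local
     * instead of casting the pointer.  There is no strict-aliasing violation
     * and no unaligned dereference, and GCC/Clang typically optimize the
     * fixed-size memcpy into a single 16-bit load.  The result is the
     * native-endian interpretation of the two bytes, so it only matches the
     * bytecode layout on little-endian machines, hence the PY_LITTLE_ENDIAN
     * guard in the patch. */
    static unsigned int
    fetch_oparg_memcpy(const unsigned char *next_instr)
    {
        unsigned short oparg;
        memcpy(&oparg, next_instr, sizeof(oparg));
        return oparg;
    }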

    @mdickinson
    Member

    mdickinson commented Dec 9, 2015

    Benchmarks would help here: if none of the patches gives a significant real-world speedup, then the whole UB discussion is moot.

    @rhettinger
    Contributor Author

    rhettinger commented Dec 11, 2015

    I verified that Clang and GCC both give the expected disassembly with Serhiy's patch. We ought to restrict the #if to just the compilers that are known to optimize away the memcpy.

    Clang (for 'BUILD_LIST_UNPACK')
    -------------------------------
    .loc 10 2525 9 ## Python/ceval.c:2525:9
    movzwl (%r13), %r9d
    addq $2, %r13
    Ltmp2042:
    ##DEBUG_VALUE: PyEval_EvalFrameEx:next_instr <- R13

    GCC (for 'BUILD_LIST_UNPACK')
    -----------------------------
    LM1275:
    movzwl (%rdx), %r8d
    LVL1147:
    leaq 2(%rdx), %rbp

    [Mark]

    Benchmarks showing dramatic real-world speed improvements ...

    Much of the doubling of speed for core Python that has occurred over the last ten years has come one little step at a time, none of them individually "dramatic". In general, if we have a chance to reduce the workload in the ceval inner loop, we should take it.

    A simple benchmark on Clang shows a roughly 10+% speedup in code exercising simple and common opcodes that have an oparg (there is no point in benchmarking the effect on opcodes like IMPORT_NAME where the total eval-loop overhead is already an insignificant proportion of the total work).

    Baseline version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
    $ ./python.exe exercise_oparg.py
    0.22484053499647416
    $ ./python.exe exercise_oparg.py
    0.22687773499637842
    $ ./python.exe exercise_oparg.py
    0.22026274001109414

    Patched version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
    $ ./python.exe exercise_oparg.py
    0.19516360601119231
    $ ./python.exe exercise_oparg.py
    0.20087355599389412
    $ ./python.exe exercise_oparg.py
    0.1980393300036667

    To better isolate the effect, I suppose you could enable the READ_TIMESTAMP macros to precisely measure the effect of replacing five sequentially dependent instructions with two independent instructions, but likely all it would show you is that the two are cheaper than the five.

    @arigo
    Mannequin

    arigo mannequin commented Dec 12, 2015

    Fwiw, I made a trivial benchmark in C that loads aligned and misaligned shorts ( http://paste.pound-python.org/show/HwnbCI3Pqsj8bx25Yfwp/ ). It shows that the memcpy() version takes only 65% of the time taken by the two-bytes-loaded version on a 2010 laptop. It takes 75% of the time on a modern server. On a recent little-endian PowerPC machine, 96%. On aarch64, only 45% faster (i.e. more than twice faster). This is all with gcc. It seems that using memcpy() is definitely a win nowadays.
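
    (The paste link above may no longer be live. A rough sketch of that kind of microbenchmark, comparing a "two bytes loaded" read with a memcpy read at misaligned addresses, might look like the following; this is illustrative only, not Armin's original code.)

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define N    (16 * 1024 * 1024)   /* bytes of fake "bytecode" */
    #define REPS 20                   /* passes per timing run */

    static unsigned char buf[N + 2];

    /* "Two bytes loaded" version: combine two byte loads with shift-and-or,
     * starting at offset 1 so every read is misaligned. */
    static unsigned long sum_bytewise(void)
    {
        unsigned long total = 0;
        for (size_t i = 1; i + 2 <= N; i += 2)
            total += (unsigned int)(buf[i] | (buf[i + 1] << 8));
        return total;
    }

    /* memcpy version: compilers typically turn the fixed-size memcpy into a
     * single (possibly unaligned) 16-bit load.  On a little-endian machine
     * the result matches the byte-wise version above. */
    static unsigned long sum_memcpy(void)
    {
        unsigned long total = 0;
        for (size_t i = 1; i + 2 <= N; i += 2) {
            unsigned short v;
            memcpy(&v, buf + i, sizeof(v));
            total += v;
        }
        return total;
    }

    static void time_it(const char *name, unsigned long (*fn)(void))
    {
        clock_t t0 = clock();
        unsigned long total = 0;
        for (int r = 0; r < REPS; r++)
            total += fn();
        clock_t t1 = clock();
        printf("%-8s sum=%lu  %.3f s\n", name, total,
               (double)(t1 - t0) / CLOCKS_PER_SEC);
    }

    int main(void)
    {
        for (size_t i = 0; i < sizeof(buf); i++)
            buf[i] = (unsigned char)i;
        time_it("bytewise", sum_bytewise);
        time_it("memcpy", sum_memcpy);
        return 0;
    }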

    @arigo
    Mannequin

    arigo mannequin commented Dec 12, 2015

    (Typo: "only 45% faster" should be "only 45% of the time")

    @pitrou
    Member

    pitrou commented Dec 12, 2015

    Or just bite the bullet and make all opcodes 16-bit (and make sure the bytecode string is aligned ;-)).

    Still, someone ought to run the benchmarks suite on this. Tuple-unpacking nano-benchmarks are not very enlightening ;-)

    @mdickinson
    Member

    mdickinson commented Dec 12, 2015

    Raymond: I only used the word "dramatic" in the context of deliberately introducing new *undefined behaviour* into the core of CPython, which seems like something that should be avoided without a really good reason.

    I have no objections to Serhiy's patch, which doesn't introduce undefined behaviour.

    @vstinner
    Member

    vstinner commented Dec 13, 2015

    I verified that Clang and GCC both give the expected disassembly with Serhiy's patch. We ought to restrict the #if to just the compilers that are known to optimize away the memcpy.

    Yes, it makes sense.

    Note: Python already has Py_MEMCPY() to work around slow memcpy() for small copies with Visual Studio.

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Apr 26, 2016

    If there are no objections, I'm inclined to push the patch in the hope that this will make the Wordcode patch (bpo-26647) simpler and more efficient (yes, this will add more work for Demur for synchronization).

    @vstinner
    Member

    vstinner commented Apr 26, 2016

    I dislike the use of a union to fetch 16 bits but only on little-endian. I would prefer to modify PyCodeObject to ensure that co_code is aligned to 16 bits and use a uint16_t* pointer in ceval.c. It would be simpler, no? In the worst case, we should overallocate 1 null byte in PyCodeObject to align the bytecode to 16 bits.
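
    As a rough illustration of that idea, a minimal sketch assuming co_code is guaranteed to be 2-byte aligned; the names first_instr and fetch_word are hypothetical, not CPython API:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: with co_code padded to an even length and aligned,
     * the eval loop could walk the bytecode through a plain uint16_t pointer. */
    static unsigned int
    fetch_word(const uint16_t *first_instr, size_t index)
    {
        /* One aligned 16-bit load per instruction word; splitting it into
         * opcode and oparg would still depend on byte order. */
        return first_instr[index];
    }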

    @vstinner
    Member

    vstinner commented Apr 26, 2016

    "I there are no objections I'm inclined to push the patch in hope that this will make the Wordcode patch (bpo-26647) simpler and more efficient (yes, this will add more work for Demur for synchronization)."

    I would prefer to be kind to Demur and wait until his patch is merged (to not introduce conflicts). The wordcode change is quite big, whereas this one is short. Raymond already approved his patch on the python-dev mailing list; I am waiting for feedback from Yury and Serhiy until Sunday before pushing the wordcode change.

    IMHO it will be simpler to optimize the (oparg, opvalue) fetch in ceval.c after wordcode is merged.

    What do you think?

    @serhiy-storchaka
    Member

    serhiy-storchaka commented Apr 26, 2016

    I agree that implementing a fast 16-bit fetch is simpler with wordcodes.

    @serhiy-storchaka
    Member

    serhiy-storchaka commented May 25, 2016

    The corresponding patch was committed in bpo-27097.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022