Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

with should be as fast as try/finally #46432

Closed
jyasskin mannequin opened this issue Feb 24, 2008 · 10 comments
Closed

with should be as fast as try/finally #46432

jyasskin mannequin opened this issue Feb 24, 2008 · 10 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@jyasskin
Copy link
Mannequin

jyasskin mannequin commented Feb 24, 2008

BPO 2179
Nosy @rhettinger, @jcea, @amauryfa, @ncoghlan, @abalkin, @tiran, @benjaminp
Files
  • faster_with.patch: proof of concept
  • faster_with.patch: complete
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-03-07.14:17:51.663>
    created_at = <Date 2008-02-24.23:10:40.175>
    labels = ['interpreter-core', 'type-feature']
    title = 'with should be as fast as try/finally'
    updated_at = <Date 2008-03-09.17:03:40.472>
    user = 'https://bugs.python.org/jyasskin'

    bugs.python.org fields:

    activity = <Date 2008-03-09.17:03:40.472>
    actor = 'jyasskin'
    assignee = 'none'
    closed = True
    closed_date = <Date 2008-03-07.14:17:51.663>
    closer = 'ncoghlan'
    components = ['Interpreter Core']
    creation = <Date 2008-02-24.23:10:40.175>
    creator = 'jyasskin'
    dependencies = []
    files = ['9589', '9590']
    hgrepos = []
    issue_num = 2179
    keywords = ['patch']
    message_count = 10.0
    messages = ['62951', '63165', '63167', '63197', '63200', '63323', '63353', '63355', '63356', '63425']
    nosy_count = 10.0
    nosy_names = ['nnorwitz', 'collinwinter', 'rhettinger', 'jcea', 'amaury.forgeotdarc', 'ncoghlan', 'belopolsky', 'christian.heimes', 'jyasskin', 'benjamin.peterson']
    pr_nums = []
    priority = 'normal'
    resolution = 'accepted'
    stage = None
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue2179'
    versions = ['Python 2.6']

    @jyasskin
    Copy link
    Mannequin Author

    jyasskin mannequin commented Feb 24, 2008

    Currently, 'with' costs about .2us over try/finally:

    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
    lock.release()'
    1000000 loops, best of 3: 0.617 usec per loop
    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'with lock: pass'
    1000000 loops, best of 3: 0.774 usec per loop

    Since it's doing the same thing (and calling the same C functions to do
    the lock manipulation), it shouldn't take more time. The bytecodes
    associated with the two options look like:

    1. with lock:
    2.  pass
      

    2 0 LOAD_GLOBAL 0 (lock)
    3 DUP_TOP
    4 LOAD_ATTR 1 (exit)
    7 STORE_FAST 0 (_[1])
    10 LOAD_ATTR 2 (enter)
    13 CALL_FUNCTION 0
    16 POP_TOP
    17 SETUP_FINALLY 4 (to 24)

    3 20 POP_BLOCK
    21 LOAD_CONST 0 (None)
    >> 24 LOAD_FAST 0 ([1])
    27 DELETE_FAST 0 (
    [1])
    30 WITH_CLEANUP
    31 END_FINALLY
    32 LOAD_CONST 0 (None)
    35 RETURN_VALUE

    1. lock.acquire()
    2. try:
    3.  pass
      
    4. finally:
    5. lock.release()
      

    6 0 LOAD_GLOBAL 0 (lock)
    3 LOAD_ATTR 1 (acquire)
    6 CALL_FUNCTION 0
    9 POP_TOP

    7 10 SETUP_FINALLY 4 (to 17)

    8 13 POP_BLOCK
    14 LOAD_CONST 0 (None)

    10 >> 17 LOAD_GLOBAL 0 (lock)
    20 LOAD_ATTR 2 (release)
    23 CALL_FUNCTION 0
    26 POP_TOP
    27 END_FINALLY
    28 LOAD_CONST 0 (None)
    31 RETURN_VALUE

    The major difference I see is the extra local variable (_[1]) used by
    with. It looks like the compiler ought to be able to save that on the
    stack instead, and save 3 opcodes. Neal Norwitz suggested that, if that
    optimization is impossible, WITH_CLEANUP could be modified to take the
    variable as a parameter, which would let it absorb the LOAD_FAST and
    DELETE_FAST instructions.

    I've added everyone on the previous bug to the nosy list. Sorry if you
    don't care. :)

    @jyasskin jyasskin mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Feb 24, 2008
    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 1, 2008

    A closer approximation of what the with statement is doing would be:

    exit = lock.release()
    lock.acquire()
    try:
        pass
    finally:
        exit()

    The problem with trying to store the result of the retrieval of __exit__
    on the stack is that we need to leave the context manager itself on top
    of the stack for the next LOAD_ATTR opcode (when we're retrieving the
    __enter__ method).

    However, changing WITH_CLEANUP to take an argument indicating which
    local variable holds the bound __exit__ method sounds like it might work.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 1, 2008

    Scratch the parentheses on that first line of sample code in my previous
    comment (isn't copy and paste wonderful?)

    @jyasskin
    Copy link
    Mannequin Author

    jyasskin mannequin commented Mar 3, 2008

    Here's a proof-of-concept patch that keeps the __exit__ method on the
    stack. It uses ROT_TWO to stuff it under the context object instead of
    storing it into a temporary. (Thanks Nick for pointing out that problem
    before I had to waste time on it.) test_with passes, although I need to
    update several more things and maybe fix a refleak.

    The patch changes the compilation of:

        def with_(l):
            with l:
                pass

    from

    4 0 LOAD_FAST 0 (l)
    3 DUP_TOP
    4 LOAD_ATTR 0 (exit)
    7 STORE_FAST 1 (_[1])
    10 LOAD_ATTR 1 (enter)
    13 CALL_FUNCTION 0
    16 POP_TOP
    17 SETUP_FINALLY 4 (to 24)

    5 20 POP_BLOCK
    21 LOAD_CONST 0 (None)
    >> 24 LOAD_FAST 1 ([1])
    27 DELETE_FAST 1 (
    [1])
    30 WITH_CLEANUP
    31 END_FINALLY
    32 LOAD_CONST 0 (None)
    35 RETURN_VALUE

    to

    4 0 LOAD_FAST 0 (l)
    3 DUP_TOP
    4 LOAD_ATTR 0 (exit)
    7 ROT_TWO
    8 LOAD_ATTR 1 (enter)
    11 CALL_FUNCTION 0
    14 POP_TOP
    15 SETUP_FINALLY 4 (to 22)

    5 18 POP_BLOCK
    19 LOAD_CONST 0 (None)
    >> 22 WITH_CLEANUP
    23 END_FINALLY
    24 LOAD_CONST 0 (None)
    27 RETURN_VALUE

    And speeds it up from:

    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'with lock: pass'
    1000000 loops, best of 3: 0.832 usec per loop

    to:

    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'with lock: pass'
    1000000 loops, best of 3: 0.762 usec per loop

    That's only half of the way to parity with try/finally:

    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
    lock.release()'
    1000000 loops, best of 3: 0.638 usec per loop

    What's strange is that calling __enter__ and __exit__ in a try/finally
    block brings the speed back to the faster 'with' speed, even though they
    call the same C functions:

    $ ./python.exe -m timeit -s 'import thread; lock =
    thread.allocate_lock()' 'lock.__enter__()' 'try: pass' 'finally:
    lock.__exit__()'
    1000000 loops, best of 3: 0.754 usec per loop

    Any ideas?

    @jyasskin
    Copy link
    Mannequin Author

    jyasskin mannequin commented Mar 3, 2008

    Now with documentation, a working test_compile, and one less refleak.

    @amauryfa
    Copy link
    Member

    amauryfa commented Mar 6, 2008

    What's strange is that calling __enter__ and __exit__ in a
    try/finally block brings the speed back to the faster 'with' speed,
    even though they call the same C functions

    Looking carefully at the code, there are two reasons for this:

    • LockType has no methods! try "dir(thread.LockType)". Instead, LockType
      defines a tp_getattr which does a *linear* search in a PyMethodDef
      array. First items are served faster...
    • After converting this to use the usual tp_methods slot, there is still
      a performance difference, due to the fact that release() is a
      METH_NOARGS method, while __exit__() uses METH_VARARGS.

    Phew.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 7, 2008

    Patch applied cleanly for me and all tests pass.

    It also looked good on a visual scan over the diff text.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Mar 7, 2008

    I went ahead and committed the change to the bytecode generation as r61290.

    The deficiencies in the lock implementation should probably be raised as
    a separate issue, so I'm closing this one.

    @ncoghlan ncoghlan closed this as completed Mar 7, 2008
    @amauryfa
    Copy link
    Member

    amauryfa commented Mar 7, 2008

    Hm, my tests do not see any speedup with this patch.
    I used VS2005 on win2K, and VS2008 on winXP.
    Timings are very similar before and after this patch.

    Maybe the optimization is only useful with gcc?

    @jyasskin
    Copy link
    Mannequin Author

    jyasskin mannequin commented Mar 9, 2008

    Thanks Nick and Amaury!

    Amaury, what times are you seeing? It could be a just-gcc speedup, but I
    wouldn't have thought so since the patch eliminates 2 times around the
    eval loop. It could be that the overhead of WITH_CLEANUP is high enough
    to drown out those iterations. You had mentioned optimizing the
    PyObject_CallFunctionObjArgs() call before. If you're still seeing with
    significantly slower than try, that's probably the right place to look.

    Here are my current timings. To avoid the lock issues, I wrote
    simple_cm.py containing:

    class CM(object):
        def __enter__(self):
            pass
        def __exit__(self, *args):
            pass
    $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
    'with cm: pass'
    1000000 loops, best of 3: 0.885 usec per loop
    $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
    'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
    1000000 loops, best of 3: 0.858 usec per loop

    If __exit__ doesn't contain *args (making it not a context manager), the
    try/finally time goes down to:
    $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
    'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
    1000000 loops, best of 3: 0.619 usec per loop

    I think in theory, with could be slightly faster than finally with the
    same argument list, but it's pretty close now.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants