with should be as fast as try/finally #46432

jyasskin · 2008-02-24T23:10:40Z

BPO	2179
Nosy	@rhettinger, @jcea, @amauryfa, @ncoghlan, @abalkin, @tiran, @benjaminp
Files	faster_with.patch: proof of concept faster_with.patch: complete

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2008-03-07.14:17:51.663>
created_at = <Date 2008-02-24.23:10:40.175>
labels = ['interpreter-core', 'type-feature']
title = 'with should be as fast as try/finally'
updated_at = <Date 2008-03-09.17:03:40.472>
user = 'https://bugs.python.org/jyasskin'

bugs.python.org fields:

activity = <Date 2008-03-09.17:03:40.472>
actor = 'jyasskin'
assignee = 'none'
closed = True
closed_date = <Date 2008-03-07.14:17:51.663>
closer = 'ncoghlan'
components = ['Interpreter Core']
creation = <Date 2008-02-24.23:10:40.175>
creator = 'jyasskin'
dependencies = []
files = ['9589', '9590']
hgrepos = []
issue_num = 2179
keywords = ['patch']
message_count = 10.0
messages = ['62951', '63165', '63167', '63197', '63200', '63323', '63353', '63355', '63356', '63425']
nosy_count = 10.0
nosy_names = ['nnorwitz', 'collinwinter', 'rhettinger', 'jcea', 'amaury.forgeotdarc', 'ncoghlan', 'belopolsky', 'christian.heimes', 'jyasskin', 'benjamin.peterson']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = None
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue2179'
versions = ['Python 2.6']

jyasskin · 2008-02-24T23:10:39Z

Currently, 'with' costs about .2us over try/finally:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
lock.release()'
1000000 loops, best of 3: 0.617 usec per loop
$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.774 usec per loop

Since it's doing the same thing (and calling the same C functions to do
the lock manipulation), it shouldn't take more time. The bytecodes
associated with the two options look like:

with lock:
```
 pass
```

2 0 LOAD_GLOBAL 0 (lock)
3 DUP_TOP
4 LOAD_ATTR 1 (exit)
7 STORE_FAST 0 (_[1])
10 LOAD_ATTR 2 (enter)
13 CALL_FUNCTION 0
16 POP_TOP
17 SETUP_FINALLY 4 (to 24)

3 20 POP_BLOCK
21 LOAD_CONST 0 (None)
>> 24 LOAD_FAST 0 ([1])
27 DELETE_FAST 0 ([1])
30 WITH_CLEANUP
31 END_FINALLY
32 LOAD_CONST 0 (None)
35 RETURN_VALUE

lock.acquire()
try:
```
 pass
```
finally:
```
lock.release()
```

6 0 LOAD_GLOBAL 0 (lock)
3 LOAD_ATTR 1 (acquire)
6 CALL_FUNCTION 0
9 POP_TOP

7 10 SETUP_FINALLY 4 (to 17)

8 13 POP_BLOCK
14 LOAD_CONST 0 (None)

10 >> 17 LOAD_GLOBAL 0 (lock)
20 LOAD_ATTR 2 (release)
23 CALL_FUNCTION 0
26 POP_TOP
27 END_FINALLY
28 LOAD_CONST 0 (None)
31 RETURN_VALUE

The major difference I see is the extra local variable (_[1]) used by
with. It looks like the compiler ought to be able to save that on the
stack instead, and save 3 opcodes. Neal Norwitz suggested that, if that
optimization is impossible, WITH_CLEANUP could be modified to take the
variable as a parameter, which would let it absorb the LOAD_FAST and
DELETE_FAST instructions.

I've added everyone on the previous bug to the nosy list. Sorry if you
don't care. :)

ncoghlan · 2008-03-01T16:24:17Z

A closer approximation of what the with statement is doing would be:

exit = lock.release()
lock.acquire()
try:
    pass
finally:
    exit()

The problem with trying to store the result of the retrieval of __exit__
on the stack is that we need to leave the context manager itself on top
of the stack for the next LOAD_ATTR opcode (when we're retrieving the
__enter__ method).

However, changing WITH_CLEANUP to take an argument indicating which
local variable holds the bound __exit__ method sounds like it might work.

ncoghlan · 2008-03-01T16:34:03Z

Scratch the parentheses on that first line of sample code in my previous
comment (isn't copy and paste wonderful?)

jyasskin · 2008-03-03T00:29:43Z

Here's a proof-of-concept patch that keeps the __exit__ method on the
stack. It uses ROT_TWO to stuff it under the context object instead of
storing it into a temporary. (Thanks Nick for pointing out that problem
before I had to waste time on it.) test_with passes, although I need to
update several more things and maybe fix a refleak.

The patch changes the compilation of:

    def with_(l):
        with l:
            pass

from

4 0 LOAD_FAST 0 (l)
3 DUP_TOP
4 LOAD_ATTR 0 (exit)
7 STORE_FAST 1 (_[1])
10 LOAD_ATTR 1 (enter)
13 CALL_FUNCTION 0
16 POP_TOP
17 SETUP_FINALLY 4 (to 24)

5 20 POP_BLOCK
21 LOAD_CONST 0 (None)
>> 24 LOAD_FAST 1 ([1])
27 DELETE_FAST 1 ([1])
30 WITH_CLEANUP
31 END_FINALLY
32 LOAD_CONST 0 (None)
35 RETURN_VALUE

to

4 0 LOAD_FAST 0 (l)
3 DUP_TOP
4 LOAD_ATTR 0 (exit)
7 ROT_TWO
8 LOAD_ATTR 1 (enter)
11 CALL_FUNCTION 0
14 POP_TOP
15 SETUP_FINALLY 4 (to 22)

5 18 POP_BLOCK
19 LOAD_CONST 0 (None)
>> 22 WITH_CLEANUP
23 END_FINALLY
24 LOAD_CONST 0 (None)
27 RETURN_VALUE

And speeds it up from:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.832 usec per loop

to:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.762 usec per loop

That's only half of the way to parity with try/finally:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
lock.release()'
1000000 loops, best of 3: 0.638 usec per loop

What's strange is that calling __enter__ and __exit__ in a try/finally
block brings the speed back to the faster 'with' speed, even though they
call the same C functions:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.__enter__()' 'try: pass' 'finally:
lock.__exit__()'
1000000 loops, best of 3: 0.754 usec per loop

Any ideas?

jyasskin · 2008-03-03T02:34:09Z

Now with documentation, a working test_compile, and one less refleak.

amauryfa · 2008-03-06T15:54:00Z

What's strange is that calling __enter__ and __exit__ in a
try/finally block brings the speed back to the faster 'with' speed,
even though they call the same C functions

Looking carefully at the code, there are two reasons for this:

LockType has no methods! try "dir(thread.LockType)". Instead, LockType
defines a tp_getattr which does a *linear* search in a PyMethodDef
array. First items are served faster...
After converting this to use the usual tp_methods slot, there is still
a performance difference, due to the fact that release() is a
METH_NOARGS method, while __exit__() uses METH_VARARGS.

Phew.

ncoghlan · 2008-03-07T12:01:48Z

Patch applied cleanly for me and all tests pass.

It also looked good on a visual scan over the diff text.

ncoghlan · 2008-03-07T14:17:51Z

I went ahead and committed the change to the bytecode generation as r61290.

The deficiencies in the lock implementation should probably be raised as
a separate issue, so I'm closing this one.

amauryfa · 2008-03-07T14:43:02Z

Hm, my tests do not see any speedup with this patch.
I used VS2005 on win2K, and VS2008 on winXP.
Timings are very similar before and after this patch.

Maybe the optimization is only useful with gcc?

jyasskin · 2008-03-09T17:03:39Z

Thanks Nick and Amaury!

Amaury, what times are you seeing? It could be a just-gcc speedup, but I
wouldn't have thought so since the patch eliminates 2 times around the
eval loop. It could be that the overhead of WITH_CLEANUP is high enough
to drown out those iterations. You had mentioned optimizing the
PyObject_CallFunctionObjArgs() call before. If you're still seeing with
significantly slower than try, that's probably the right place to look.

Here are my current timings. To avoid the lock issues, I wrote
simple_cm.py containing:

class CM(object):
    def __enter__(self):
        pass
    def __exit__(self, *args):
        pass

$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'with cm: pass'
1000000 loops, best of 3: 0.885 usec per loop
$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
1000000 loops, best of 3: 0.858 usec per loop

If __exit__ doesn't contain *args (making it not a context manager), the
try/finally time goes down to:
$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
1000000 loops, best of 3: 0.619 usec per loop

I think in theory, with could be slightly faster than finally with the
same argument list, but it's pretty close now.

jyasskin mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Feb 24, 2008

ncoghlan closed this as completed Mar 7, 2008

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

with should be as fast as try/finally #46432

with should be as fast as try/finally #46432

jyasskin mannequin commented Feb 24, 2008

jyasskin mannequin commented Feb 24, 2008

ncoghlan commented Mar 1, 2008

ncoghlan commented Mar 1, 2008

jyasskin mannequin commented Mar 3, 2008

jyasskin mannequin commented Mar 3, 2008

amauryfa commented Mar 6, 2008

ncoghlan commented Mar 7, 2008

ncoghlan commented Mar 7, 2008

amauryfa commented Mar 7, 2008

jyasskin mannequin commented Mar 9, 2008

with should be as fast as try/finally #46432

with should be as fast as try/finally #46432

Comments

jyasskin mannequin commented Feb 24, 2008

jyasskin mannequin commented Feb 24, 2008

ncoghlan commented Mar 1, 2008

ncoghlan commented Mar 1, 2008

jyasskin mannequin commented Mar 3, 2008

jyasskin mannequin commented Mar 3, 2008

amauryfa commented Mar 6, 2008

ncoghlan commented Mar 7, 2008

ncoghlan commented Mar 7, 2008

amauryfa commented Mar 7, 2008

jyasskin mannequin commented Mar 9, 2008