Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. #61811
"Pseudo exceptions" (return, break and continue) are currently handled in the interpreter. This makes the interpreter loop more complex and slower than it needs to be. This change moves the handling of pseudo exceptions into the compiler. The net effects of this patch are:
- Simpler interpreter loop: no 'pseudo-exceptions', fewer bytecodes and some simplified bytecodes.
- Elimination of the 'why_code' state variable in the interpreter. Execution is always in the 'normal' state except during explicit exception handling.
- Small increase in size and complexity of the compiler.
- Speedup of 1.5% (Intel i7); this should be considered a happy side-effect rather than a motivation for the change. |
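To make the term concrete, here is a minimal illustration (my own example, not code from the patch) of the statements that raise such pseudo exceptions: return, break and continue leaving try/finally and loop blocks, which the pre-patch interpreter unwound at run time via the why_code mechanism.

    def first_even(items):
        for item in items:              # loop block
            try:
                if item % 2 == 0:
                    return item         # 'return' has to unwind the try/finally and the loop
                if item < 0:
                    break               # 'break' has to unwind the try/finally
            finally:
                print("checked", item)  # must run on every exit path
        return None

With the patch, the unwinding for these exits is emitted by the compiler at each exit point instead of being driven through the interpreter's why_code state variable.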
Simplifying the eval loop is a rather good thing. Could you post examples of bytecode disassembly before and after the patch? |
The change seems to slow down the pi float benchmark. To reproduce, run:
./python Modules/_decimal/tests/bench.py
I've run the benchmark seven times and the average for float is |
Stefan,

With patch:
Without patch:

I'm somewhat surprised by the slowdown on your machine. There should be no difference in the bytecodes executed or their implementation in the pi_float loop. |
Antoine, for the following function:

    >>> def f(x):
    ...     try:
    ...         return x
    ...     except:
    ...         pass

Without the patch:

      3           3 LOAD_FAST                0 (x)
      4     >>   11 POP_TOP
      5          14 POP_EXCEPT

With the patch:

      3           3 LOAD_FAST                0 (x)
      4     >>   12 POP_TOP
      5          15 POP_EXCEPT

The difference is the 'POP_BLOCK' inserted at offset 7 *before* the 'RETURN_VALUE', so that the RETURN_VALUE bytecode does not have to do any unwinding of the block stack.

For loops, the SETUP_LOOP and the matching POP_BLOCK are eliminated:

    >>> def f(x):
    ...     for y in x:
    ...         g(x)

Without the patch:

      3          13 LOAD_GLOBAL              0 (g)

With the patch:

      3          10 LOAD_GLOBAL              0 (g)

END_ITER is a synonym for JUMP_ABSOLUTE. It is needed so that frame.set_lineno() can identify loop blocks. |
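The listings above show only a few instructions each; to inspect the full bytecode of such functions on any given build, the standard dis module can be used (output varies across CPython versions and patches):

    import dis

    def f(x):
        try:
            return x
        except:
            pass

    dis.dis(f)  # prints, per instruction: source line, offset, opcode name, argument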
I'm surprised, too, but after a couple of retries the results stay the same. With an increased loop count the difference is also the same (10%).

(Core2 Duo, 3.16GHz, Linux 64-bit)
patched:
unpatched:

(i7, 2.80GHz, Linux 64-bit)
patched:
unpatched: |
BTW, there's no general slowdown on the Core2 Duo: In the patched version, |
It's customary for even innocent changes in ceval to produce small variations in benchmark results. |
Antoine Pitrou <report@bugs.python.org> wrote:
Yes, the float benchmark appears to be particularly sensitive. In the ...
I'm trying to keep track of changes because in 2.7 the (float) benchmark runs ... |
Here is an updated patch keeping up with recent changes in the source tree. |
I was thinking about this. Isn't it enough to consider all backwards edges as ending a loop block? |
Ah, no, I guess "continue" would also create a backwards edge. Nevermind, sorry. |
I like the idea. I have not made a fault-finding review yet, but one thing is questionable to me. Is it necessary to get rid of the END_FINALLY opcode? Currently Python code
is compiled to:

    L1: // stmt2

With the patch it is compiled to:

    L1: // stmt2

The finally body is duplicated. For a large finally body this increases the size of the bytecode, line numbers are duplicated in the disassembly listing (which is confusing), and I am afraid that, due to the increased bytecode size, some relative jumps could become too long. If it is absolutely necessary to remove the END_FINALLY opcode, I suggest generating the following bytecode:

    L1: // stmt2

But it looks to me that a special END_FINALLY opcode which combines POP_JUMP_IF_FALSE/RERAISE would be better for performance and compiled code size. |
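A neutral way to check how a given interpreter build compiles try/finally, and whether the finally body ends up duplicated, is again the dis module (my example; do_work and cleanup are placeholder helpers, not names from the patch):

    import dis

    def do_work():
        pass

    def cleanup():
        pass

    def g():
        try:
            do_work()
        finally:
            cleanup()    # with duplication, the call sequence for this body appears twice

    dis.dis(g)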
FYI The issue bpo-26107 changed code.co_lnotab to support negative line number deltas. |
Java duplicates finally bytecode too: http://cliffhacks.blogspot.ca/2008/02/java-6-tryfinally-compilation-without.html Though they avoid the jsr instruction because it causes non-trivial bytecode validation. Still, they would have had to conclude that finally blocks are often small enough to be acceptable to inline. |
The patch no longer applies cleanly. Mark, can you please update your patch? |
This looks to be a good idea and a good time to merge it now the bytecode has changed to 16-bit. The increase in complexity to compile.c is not bad and reducing the complexity of the eval loop is worth it, IMHO. |
I updated the patch for git master. Some cleanup is in order. |
So, the approach of duplicating finally blocks tends to lead to a non-trivial bytecode increase. There is even a potential combinatorial explosion with nested "try..finally" blocks:

    def f():
        try:
            ...
        finally:
            try:
                ...
            finally:
                # etc.

Such a chain of N nested "finally"s will emit O(2**N) opcodes. I'm not sure it's a practical concern, but I find it detrimental to the readability of the bytecode (which is a nice thing to have when doing low-level debugging or tweaking). I think we can massage the PR to remove the "finally" block duplication while keeping the predictable stack effect. |
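To get a feel for the growth being described, here is a rough sketch (my own, not from any of the PRs) that builds functions with increasing try/finally nesting depth and prints the size of the compiled bytecode; the absolute numbers depend entirely on the interpreter build and on whether finally bodies are duplicated:

    def nested_finally_source(depth):
        """Build the source of 'def f():' containing `depth` nested try/finally blocks."""
        lines = ["def f():"]
        for level in range(depth):
            indent = "    " * (level + 1)
            lines.append(indent + "try:")
            lines.append(indent + "    x = 1")
            lines.append(indent + "finally:")
        lines.append("    " * (depth + 1) + "x = 2")
        return "\n".join(lines)

    for depth in (1, 2, 4, 8):
        module = compile(nested_finally_source(depth), "<generated>", "exec")
        func_code = next(c for c in module.co_consts if hasattr(c, "co_code"))
        print(depth, len(func_code.co_code))  # bytecode size of f, in bytes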
For example we could generate the following bytecode:
    L1:

JUMP_FINALLY would push 3 Nones on the stack. RERAISE would raise only if non-None values are popped. |
Ok, I've pushed an update which removes the finally block duplication. |
Great! I tried to update the patch myself, but there were too many conflicts. Thank you very much Antoine for taking this!

Seems there are bugs in frame_setlineno() related to code duplication, and deduplicating the 'finally' blocks doesn't solve all of them. See comments on GitHub.

I don't like the new END_ITER instruction. It complicates (and slows down) the evaluation loop and the peepholer, and it also adds a limitation on the peepholer (END_ITER can't be optimized out). We can use another way of determining the end of the 'for' block: the argument of the FOR_ITER instruction points to the end of the 'for' block, so just scan all addresses from 0 to max_addr and count all FOR_ITER instructions and their targets in the range between min_addr and max_addr. |
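A rough Python-level sketch of the kind of scan being described, using the dis module (the real check would live in C in frame_setlineno(); this is only an illustration, not code from any PR):

    import dis

    def for_block_ranges(code):
        """Return (start, end) bytecode offsets for each 'for' block,
        where 'end' is the offset that FOR_ITER jumps to on exhaustion."""
        ranges = []
        for instr in dis.get_instructions(code):
            if instr.opname == "FOR_ITER":
                # For relative jumps, dis resolves argval to the absolute target offset.
                ranges.append((instr.offset, instr.argval))
        return ranges

A jump between two offsets could then be checked against these ranges to tell whether it enters or leaves a loop block.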
I don't think it slows down anything in the eval loop. I agree it adds a bit of complexity.
It's an unconditional backedge, how could you optimize it? |
Ok, I've now gotten rid of END_ITER. |
The attached pyperformance report compares cpython:master to nascheme:unwind_stack. If I've done the tests correctly, the unwind_stack version is slightly faster. |
I like this change, and think we should go ahead with it, but just wanted to note that I suspect it may make the "Badly timed signals may lead to __exit__ being skipped even after __enter__ succeeded" problem slightly easier to hit: https://bugs.python.org/issue29988 That's not a new problem though, and this change should make it easier to apply the conventional solutions (i.e. only checking for signals when execution jumps backwards within a function, as well as in function pre-or-post-ambles) |
At this point in the release cycle, it would be nice to get this PR applied soon so that there is more time to see if there are any minor follow-on issues. |
On 29/12/2017 at 12:48, Mark Shannon wrote:
The original patch had a number of issues which needed to be fixed (in ...
Actually, when looking at your original patch ...
Speed: same, or insignificantly faster. That said, if you want to submit an updated version of the code ... |
When I said "significant", I meant it statistically, not as a judgement meaning "useful" or "worthwhile". The code duplication approach is significantly faster in the tests. Whether the small speed difference is worth worrying about is a different matter. Also, the comparisons were with each other, not with the current interpreter; both approaches are better than the current implementation. One point I didn't cover is jumping to a new line in the debugger. If we are going to use the JSR approach, we should do it properly. |
I don't think so. Though one could run the benchmarks suite to get an idea of the impact on real-world code. |
I apologize if my extra PR is causing confusion. My original intention was to merely forward-port Antoine's changes so they could compile with the 'master' version of Python. Over time I have made some fixes to it. I have kept it open because I'm not sure which approach is better (JSR or duplication).

I did a quick test on the .pyc code size increase. It is actually tiny for the Python code in Lib/.

PR 5006: 21259894 bytes

Running Serhiy's microbenchmarks:

    for i in a:
        try: pass
        finally: pass

PR 4682: Mean +- std dev: 11.0 us +- 0.1 us

    for i in a:
        try: continue
        finally: pass

PR 4682: Mean +- std dev: 9.46 us +- 0.85 us

    for i in a:
        while True:
            try: break
            finally: pass

PR 4682: Mean +- std dev: 9.20 us +- 0.09 us |
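The figures above look like pyperf output; for a quick local approximation of the first microbenchmark one could use the stdlib timeit module (illustrative only, with an arbitrarily sized iterable, and not directly comparable to the numbers quoted):

    import timeit

    stmt = "\n".join([
        "for i in a:",
        "    try: pass",
        "    finally: pass",
    ])
    # 'a' is just a convenient iterable; the size is arbitrary.
    seconds = timeit.timeit(stmt, setup="a = range(10000)", number=1000)
    print(f"{seconds / 1000 * 1e6:.1f} us per execution of the loop")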
I consider PR 5006 as just a first step. I had two goals:
Maybe eventually we will come to duplicating the code. But this will |
Another point in favour of the JSR approach is that it should make it ...
I also like Serhiy's proposed strategy of separating the initial ... |
So we now have three competing PRs... each of which is not trivial. It would be nice if somehow there could be some conciliation, or at least a synthetic comparison. I don't think I want to review all three proposals (and I already spent time reviewing previous versions of Serhiy's). |
It shouldn't be a surprise that I have submitted a PR. The original patch was my work, and a month ago I said I would update it over the Christmas break. To avoid "competing" PRs in the future, simply asking before appropriating other people's work would help. I've signed the CLA, so I don't think there are any legal issues with Serhiy's or Neil's PRs containing my code, but it does seem a bit presumptuous. Having said that, if someone can convince me that Serhiy's PR is better, then I have no problem with it being merged. |
New opcodes in PR 5071 look interesting, but I think it is better to leave this experiment for a separate issue.

Complexities of the PRs (ignoring tests, documentation, and generated files):

PR 4682:
PR 5006:
PR 5071:

PR 4682: +4 opcodes, -4 opcodes, and changed numerical values of 2 opcodes

It looks to me that PR 5006 has the lowest complexity and is the easiest to review. I have been working on it since taking over Antoine's PR 5 months ago. |
I took a quick look at Mark's PR. The main thing going for it IMHO is that it produces simpler bytecode: it removes the complicated with/finally-related opcodes, while Serhiy's has non-trivial END_FINALLY and POP_FINALLY. It would be nice to know if it's able to fix the stack size issues as in the following tests: |
I just worked on it. See bpo-24340. |
I meant Mark's PR :-) |
On 01/01/18 17:54, Antoine Pitrou wrote:
Yes, it can. The key to determining a correct stack size is that each instruction has a fixed stack delta. With fixed deltas, it is not too hard to construct an algorithm to compute the maximum stack depth. An outline algorithm would be: |
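A minimal sketch of such a fixed-delta stack-size computation, assuming a simple worklist walk over the control-flow graph (my illustration, not code from any of the PRs; instructions are modelled abstractly as (delta, jump_target, falls_through) tuples, and conditional jumps are simplified to use a single delta for both successors):

    from collections import deque

    def max_stack_depth(instructions):
        """instructions[i] = (stack_delta, jump_target_index_or_None, falls_through)."""
        best = [None] * len(instructions)   # largest entry depth seen per instruction
        work = deque([(0, 0)])              # (instruction index, stack depth on entry)
        maximum = 0
        while work:
            index, depth = work.popleft()
            if index >= len(instructions):
                continue
            if best[index] is not None and best[index] >= depth:
                continue                    # already analysed with an equal or deeper entry
            best[index] = depth
            delta, target, falls_through = instructions[index]
            depth += delta
            maximum = max(maximum, depth)
            if target is not None:
                work.append((target, depth))
            if falls_through:
                work.append((index + 1, depth))
        return maximum

    # e.g. push, conditional-jump-and-pop, push, return-and-pop -> max depth 1
    print(max_stack_depth([(1, None, True), (-1, 3, True), (1, None, True), (-1, None, False)]))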
Any chance you can put that patch somewhere to get an idea of the additional complexity? I'd like to gauge the total complexity of each approach including the stack size computation fix (since fixing that issue is one of the end goals). Also, that would help compare your PR and Serhiy's on more equal terms. |
I've added a commit to compute the stack size. |
I failed to account for the extra stuff left on the stack when handling exceptions and jumping out of a finally block due to a ...

I'm now convinced that code duplication is the only way to do this properly. The only real advantage of JSR-style finally blocks is the ease of implementation of ...

Rather than go around in circles any more, I suggest that we merge Serhiy's PR (PR 5006). I can then cherry-pick my commits to clean up the implementation of the ...

I then plan to implement ... |
On 02/01/2018 at 18:13, Mark Shannon wrote:
Fair enough. I'm making a final review round on Serhiy's PR and ... Also looking forward to the cleanup follow-up. |
On 2017-12-29, Mark Shannon wrote:
Could we virtually execute the code, forwards or backwards, ...?
I started implementing this as a proof of concept but didn't finish. |
PR 5143 is yet another alternative. It is a variant of PR 5006 that restores SETUP_EXCEPT and gets rid of the new opcode BEGIN_FINALLY; SETUP_FINALLY, SETUP_WITH and SETUP_ASYNC_WITH now push NULL on the stack instead. PR 5143 removes and adds fewer opcodes than PR 5006, but changes the behavior of more opcodes. I don't know which of them looks simpler. |
Can we stick to the one PR, please? Also, I don't like leaving NULLs on the stack; they are an invitation to segfaults. PR 5143 leaves more NULLs on the stack for longer. |
Please let's stick with PR 5006. |
What should be done for PR 5006? Changes in this PR have been minimized for easier review, but many interesting things can be done after merging it. For example, simplifying the 'with'-related opcodes, or implementing static blocks (this will make the 'try' statement zero-overhead, make checks in the frame.f_lineno setter more reliable, and unblock some bytecode optimizations). It would be better to have more time before the feature freeze. |
Apparently PR 5006 has a conflict. Otherwise, is it ready for review? |
Simplifying "with"-related opcodes (and perhaps "finally" ones) sounds nice. "try" blocks are already zero-overhead for practical purposes (i.e. I don't think I've met any workload where moving a "try" around did improve performance significantly). |
Finally it is merged, and it seems successful. Thank you to all involved, and first of all Mark, the original author. This unblocks further changes, bpo-32489 and bpo-32949, and we can try to implement other ideas. Changes in the f_lineno setter fixed crashes in bpo-17288 and bpo-28883, but while writing tests for it I found other crashers; I will open new issues soon. |