Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. #61811

Closed
markshannon opened this issue Apr 1, 2013 · 81 comments
Closed
Labels
3.8 interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@markshannon
Copy link
Member

markshannon commented Apr 1, 2013

BPO 17611
Nosy @nascheme, @rhettinger, @mdickinson, @ncoghlan, @pitrou, @tiran, @benjaminp, @tpn, @markshannon, @serhiy-storchaka, @serprex
PRs
  • [WIP] bpo-17611: Move unwinding of stack from interpreter to compiler #2827
  • bpo-17611: Move unwinding of stack from interpreter to compiler #4682
  • bpo-17611. Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. #5006
  • bpo-17611. Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. #5071
  • bpo-17611. Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. (alt) #5143
  • Dependencies
  • bpo-24340: co_stacksize estimate can be highly off
  • bpo-32416: Refactor and add new tests for the f_lineno setter
  • Files
  • b16527f84774.diff
  • unwinding.patch
  • perf_unwind_stack.txt
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-02-25.14:02:36.230>
    created_at = <Date 2013-04-01.16:31:20.149>
    labels = ['interpreter-core', 'type-feature', '3.8']
    title = 'Move unwinding of stack for "pseudo exceptions" from interpreter to compiler.'
    updated_at = <Date 2018-02-25.14:02:36.229>
    user = 'https://github.com/markshannon'

    bugs.python.org fields:

    activity = <Date 2018-02-25.14:02:36.229>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-02-25.14:02:36.230>
    closer = 'serhiy.storchaka'
    components = ['Interpreter Core']
    creation = <Date 2013-04-01.16:31:20.149>
    creator = 'Mark.Shannon'
    dependencies = ['24340', '32416']
    files = ['29646', '35094', '47253']
    hgrepos = ['181']
    issue_num = 17611
    keywords = ['patch']
    message_count = 81.0
    messages = ['185741', '185909', '185916', '185946', '185950', '185951', '185952', '185954', '185955', '217534', '228816', '228817', '231250', '258673', '267329', '267418', '268456', '298910', '298934', '298935', '298943', '298949', '298960', '298972', '305684', '307477', '307478', '307479', '307497', '307548', '307549', '307555', '307558', '307564', '307570', '307572', '307577', '307582', '307584', '307585', '307656', '307671', '307680', '307687', '307827', '308928', '308939', '308969', '309015', '309138', '309139', '309141', '309161', '309163', '309169', '309170', '309181', '309182', '309185', '309202', '309211', '309333', '309335', '309339', '309341', '309348', '309349', '309363', '309366', '309372', '309385', '309388', '309424', '309740', '309742', '309771', '310043', '310044', '310046', '312596', '312815']
    nosy_count = 11.0
    nosy_names = ['nascheme', 'rhettinger', 'mark.dickinson', 'ncoghlan', 'pitrou', 'christian.heimes', 'benjamin.peterson', 'trent', 'Mark.Shannon', 'serhiy.storchaka', 'Demur Rumed']
    pr_nums = ['2827', '4682', '5006', '5071', '5143']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue17611'
    versions = ['Python 3.8']

    @markshannon
    Copy link
    Member Author

    markshannon commented Apr 1, 2013

    The handling of "pseudo exceptions" (return, break and continue) are currently handled in the interpreter. This make the interpreter loop more complex and slower than it needs to be. This change moves the handling of pseudo exceptions into the compiler.

    The net effects of this patch are:

    Simpler interpreter loop: no 'psuedo-exceptions', fewer bytecodes and some simplifed bytecodes.

    Eliminate the 'why_code' state variable in the interpreter. Execution is always in the 'normal' state except during explicit exception handling.

    Small increase in size and complexity of compiler.

    Speedup of 1.5% (Intel i7); this should be considered a happy side-effect rather than a motivation for the change.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 3, 2013

    Simplifying the eval loop is a rather good thing. Could you post examples of bytecode disassembly before and after the patch?

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 3, 2013

    The change seems to slow down the pi float benchmark. Run this and hit
    Ctrl-C after the first decimal result appears:

    ./python Modules/_decimal/tests/bench.py

    I've run the benchmark seven times and the average for float is
    something like 8-10% slower (64-bit Linux, Core 2 Duo). Decimal
    is not affected.

    @markshannon
    Copy link
    Member Author

    markshannon commented Apr 3, 2013

    Stefan,
    I tried running your pi_bench (increasing the numer of iterations by x10 to reduce jitter). The run to run variation exceed the difference between the two versions. Here are the fastest of 7 runs on my machine.
    i3-2370M CPU @ 2.40GHz × 4. linux 64bit.

    With patch:
    1.154152s

    Without patch:
    1.157730s

    I'm somewhat surprised by the slowdown on your machine. There should be no difference in the bytecodes executed or their implementation in the pi_float loop.

    @markshannon
    Copy link
    Member Author

    markshannon commented Apr 3, 2013

    Antoine,
    Two bytecode examples.

    For the following function:
    >>> def f(x):
    ...     try:
    ...         return x
    ...     except:
    ...         pass

    Without the patch:
    2 0 SETUP_EXCEPT 8 (to 11)

    3 3 LOAD_FAST 0 (x)
    6 RETURN_VALUE
    7 POP_BLOCK
    8 JUMP_FORWARD 8 (to 19)

    4 >> 11 POP_TOP
    12 POP_TOP
    13 POP_TOP

    5 14 POP_EXCEPT
    15 JUMP_FORWARD 1 (to 19)
    18 END_FINALLY
    >> 19 LOAD_CONST 0 (None)
    22 RETURN_VALUE

    With the patch:
    2 0 SETUP_EXCEPT 9 (to 12)

    3 3 LOAD_FAST 0 (x)
    6 POP_BLOCK
    7 RETURN_VALUE
    8 POP_BLOCK
    9 JUMP_FORWARD 8 (to 20)

    4 >> 12 POP_TOP
    13 POP_TOP
    14 POP_TOP

    5 15 POP_EXCEPT
    16 JUMP_FORWARD 1 (to 20)
    19 RERAISE
    >> 20 LOAD_CONST 0 (None)
    23 RETURN_VALUE

    The difference is the 'POP_BLOCK' inserted at offset 7 *before* the 'RETURN_VALUE', so that the RETURN_VALUE bytecode does not have to do any unwinding of the block stack.

    For loops the SETUP_LOOP and the matching POP_BLOCK are eliminated:

    >>> def f(x):
    ...    for y in x:
    ...        g(x)

    Without the patch:
    2 0 SETUP_LOOP 24 (to 27)
    3 LOAD_FAST 0 (x)
    6 GET_ITER
    >> 7 FOR_ITER 16 (to 26)
    10 STORE_FAST 1 (y)

    3 13 LOAD_GLOBAL 0 (g)
    16 LOAD_FAST 1 (y)
    19 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
    22 POP_TOP
    23 JUMP_ABSOLUTE 7
    >> 26 POP_BLOCK
    >> 27 LOAD_CONST 0 (None)
    30 RETURN_VALUE

    With the patch:
    2 0 LOAD_FAST 0 (x)
    3 GET_ITER
    >> 4 FOR_ITER 16 (to 23)
    7 STORE_FAST 1 (y)

    3 10 LOAD_GLOBAL 0 (g)
    13 LOAD_FAST 0 (x)
    16 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
    19 POP_TOP
    20 END_ITER 4
    >> 23 LOAD_CONST 0 (None)
    26 RETURN_VALUE

    END_ITER is a synonym for JUMP_ABSOLUTE. It is needed so that frame.set_lineno() can identify loop blocks.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 3, 2013

    I'm surprised, too, but after a couple of retries the results stay the
    same. On an i7 there's also a difference, but not quite as large. I'm
    using b16527f84774.diff, which applies cleanly apart from importlib.h
    (which I just regenerated).

    With an increased loop count the difference is also the same (10% on
    the Core2 Duo).

    Core2 Duo, 3.16GHz, Linux 64-bit
    ================================

    patched:
    --------
    (0.10855+0.10901+0.1168+0.1119+0.1095+0.1090+0.1084+0.1130+0.1235+0.1183) / 10 = 0.113

    unpatched:
    ----------
    (0.0995+0.1049+0.1009+0.1021+0.0986+0.1005+0.0988+0.10171+0.0987+0.0992) / 10 = 0.10

    (i7, 2.80GHz, Linux 64-bit)
    ============================

    patched:
    --------
    (0.1021+0.1078+0.1072+0.1058+0.1091+0.1067+0.1064+0.1071+0.1069+0.1067) / 10 = 0.107

    unpatched:
    ----------
    (0.1029+0.1033+0.1023+0.1029+0.1050+0.1020+0.1031+0.1024+0.10149+0.10247) / 10 = 0.103

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 3, 2013

    BTW, there's no general slowdown on the Core2 Duo: In the patched version,
    the telco benchmark is consistently 4% faster.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 3, 2013

    BTW, there's no general slowdown on the Core2 Duo: In the patched version,
    the telco benchmark is consistently 4% faster.

    It's customary for even innocent changes in ceval to produce small
    changes in some benchmarks. It's only really important to consider the
    global average.

    @skrah
    Copy link
    Mannequin

    skrah mannequin commented Apr 3, 2013

    Antoine Pitrou <report@bugs.python.org> wrote:

    It's customary for even innocent changes in ceval to produce small
    changes in some benchmarks. It's only really important to consider the
    global average.

    Yes, the float benchmark appears to be particularly sensitive. In the
    3.3.0 release it also runs 8% slower than it does now (unpatched).

    I'm trying to keep track of changes because in 2.7 the (float) benchmark runs
    something like 25% faster than in 3.3.0.

    @pitrou
    Copy link
    Member

    pitrou commented Apr 29, 2014

    Here is an updated patch keeping up with recent changes in the source tree.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 8, 2014

    END_ITER is a synonym for JUMP_ABSOLUTE. It is needed so that frame.set_lineno() can identify loop blocks.

    I was thinking about this. Isn't it enough to consider all backwards edges as ending a loop block?

    @pitrou
    Copy link
    Member

    pitrou commented Oct 8, 2014

    Ah, no, I guess "continue" would also create a backwards edge. Nevermind, sorry.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Nov 16, 2014

    I like the idea. I have not make faultfinding review yet, but one thing is questionable to me.

    Is it necessary to get rid of the END_FINALLY opcode?

    Currently Python code

    try:
        stmt1
    finally:
        stmt2
    

    is compiled to:

    SETUP_FINALLY L1
    // stmt1
    POP_BLOCK
    LOAD_CONST None
    

    L1: // stmt2
    END_FINALLY

    With the patch it is compiled to:

    SETUP_FINALLY L1
    // stmt1
    POP_BLOCK
    // stmt2
    JUMP_ABSOLUTE L2
    

    L1: // stmt2
    RERAISE
    L2:

    Finally body is duplicated. In case of large finally body this increases the size of the bytecode. Line numbers are duplicated in disassembly listing, this is confused. And I afraid that due to increasing the size of the bytecode, some relative jumps can became too long.

    If it is absolutely necessary to remove the END_FINALLY opcode, I suggest to generate following bytecode:

    SETUP_FINALLY L1
    // stmt1
    POP_BLOCK
    LOAD_CONST None
    

    L1: // stmt2
    POP_JUMP_IF_FALSE L2
    RERAISE
    L2:

    But it looks to me that special END_FINALLY opcode which combines POP_JUMP_IF_FALSE/RERAISE would be better for performance and compiled code size.

    @serhiy-storchaka serhiy-storchaka added interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 16, 2014
    @vstinner
    Copy link
    Member

    vstinner commented Jan 20, 2016

    FYI The issue bpo-26107 changed code.co_lnotab to support negative line number deltas.

    @serprex
    Copy link
    Mannequin

    serprex mannequin commented Jun 4, 2016

    Java duplicates finally bytecode too: http://cliffhacks.blogspot.ca/2008/02/java-6-tryfinally-compilation-without.html

    Tho they avoid jsr instruction because it causes non trivial bytecode validation. Still, they would've had to concluded that finally blocks are often small enough to be alright to inline

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jun 5, 2016

    The patch no longer applied cleanly. Mark, can you please update your patch?

    @nascheme
    Copy link
    Member

    nascheme commented Jun 13, 2016

    This looks to be a good idea and a good time to merge it now the bytecode has changed to 16-bit. The increase in complexity to compile.c is not bad and reducing the complexity of the eval loop is worth it, IMHO.

    @pitrou pitrou added the 3.7 label Jul 22, 2017
    @pitrou
    Copy link
    Member

    pitrou commented Jul 23, 2017

    I updated the patch for git master. Some cleanup is in order.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 24, 2017

    So, the approach of duplicating finally blocks tends to lead to a non-trivial bytecode increase. There is even a potential combinatorial explosion with nested "try..finally" block:

    def f():
        try:
            ...
        finally:
            try:
                ...
            finally:
                # etc.

    Such a chain of N nested "finally"s will emit O(2**N) opcodes.

    I'm not sure it's a pratical concern but I find it is detrimental to the readibility of bytecode (which is a nice thing to have when doing low-level debugging or tweaking). I think we can massage the PR to remove the "finally" block duplication while keeping the predictable stack effect.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 24, 2017

    For example we could generate the following bytecode:

    SETUP_FINALLY L1
    // ... stmt1
    POP_BLOCK
    JUMP_FINALLY  L1
    

    L1:
    // ... stmt2
    RERAISE
    JUMP_FORWARD L2
    L2:

    JUMP_FINALLY would push 3 Nones on the stack. RERAISE would raise only if non-None values are popped.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 24, 2017

    Ok, I've pushed an update with removes the finally block duplication.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jul 24, 2017

    Great! I tried to update the patch myself, but there were too much conflicts. Thank you very much Antoine for taking this!

    Seems there are bugs in frame_setlineno() related to code duplications. And deduplicating the 'finally' blocks doesn't solve all of them. See comments on GitHub.

    I don't like the new END_ITER instruction. This complicates (and slows down) the evaluation loop and the peepholer. This also adds a limitation on the peepholer (END_ITER can't be optimized out). We can use other way for determaining the end of the 'for' block. The argument of the FOR_ITER instruction points to the end of the 'for' block. Just scan all addresses from 0 to max_addr and count all FOR_ITER instructions and their targets in the range between min_addr and max_addr.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 24, 2017

    I don't like the new END_ITER instruction. This complicates (and slows down) the evaluation loop and the peepholer.

    I don't think it slows down anything in the eval loop. I agree it adds a bit of complexity.

    This also adds a limitation on the peepholer (END_ITER can't be optimized out).

    It's an unconditional backedge, how could you optimize it?

    @pitrou
    Copy link
    Member

    pitrou commented Jul 24, 2017

    Ok, I've now gotten rid of END_ITER.

    @nascheme
    Copy link
    Member

    nascheme commented Nov 6, 2017

    The attached pyperformance report compares cpython:master to nascheme:unwind_stack. If I've done the tests correctly, the unwind_stack version is slightly faster.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Dec 3, 2017

    I like this change, and think we should go ahead with it, but just wanted to note that I suspect it may make the "Badly timed signals may lead to __exit__ being skipped even after __enter__ succeeded" problem slightly easier to hit: https://bugs.python.org/issue29988

    That's not a new problem though, and this change should make it easier to apply the conventional solutions (i.e. only checking for signals when execution jumps backwards within a function, as well as in function pre-or-post-ambles)

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Dec 3, 2017

    At this point in the release cycle, it would be nice to get this PR applied soon so that there is more time see if there are any minor follow-on issues.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2017

    Le 29/12/2017 à 12:48, Mark Shannon a écrit :

    Why all these competing pull requests? It does feel like my original patch has been hijacked.

    The original patch had a number of issues which needed to be fixed (in
    addition to forward-port it to the latest git master), such as not
    fixing some cases of stack consumption calculation.

    Also, before any more PRs, we need to decide whether to use subroutines or code duplication for finally blocks.

    Here is my attempt at an objective comparison of the two approaches.

                       JSR style           Code duplication
    

    Speed Slower Faster
    Interpreter More complex Simpler
    Bytecode generation Simpler More complex
    Bytecode size ~ +0.1% ~ +0.4%

    Actually, when looking at your original patch
    (https://bugs.python.org/review/17611/) vs. Serhiy's
    #5006, it looks like interpreter
    complexity is the same. Speed also seems to not be significantly
    different. So I would rephrase it as (compared to legacy bytecode
    generation):

                      Subroutine              Code duplication
    

    Speed Same Insignificantly faster
    Interpreter Simpler Simpler
    Bytecode generation Slightly more complex Slightly more complex
    Bytecode size ~ +0.1% ~ +0.4%

    That said, if you want to submit an updated version of the code
    duplication approach (with the same amount of added tests and
    debugging), we could compare on more equal grounds.

    @markshannon
    Copy link
    Member Author

    markshannon commented Dec 29, 2017

    When I said "significant", I meant from a statistically, not as a judgement meaning "useful" or "worthwhile". The code duplication approach is significantly faster in the tests. Whether the small speed difference is worth worrying about is a different matter.

    Also, the comparisons were with each other, not the current interpreter. Both approaches are better than the current implementation.

    One point I didn't cover is jumping to a new line in the debugger.
    Implementing that reliably for finally blocks with code duplication is tricky and would mean adding a number of marker bytecodes.
    Which is a big point in favour of the JSR style.

    If we are going to use the JSR approach, we should do it properly.
    PR 5006 still has elements of the current design such as the overly complex END_FINALLY, WITH_CLEANUP_START and WITH_CLEANUP_FINISH bytecodes. None of those should be necessary.

    @pitrou
    Copy link
    Member

    pitrou commented Dec 29, 2017

    Whether the small speed difference is worth worrying about is a different matter.

    I don't think so. Though one could run the benchmarks suite to get an idea of the impact on real-world code.

    @nascheme
    Copy link
    Member

    nascheme commented Dec 29, 2017

    I apologize if my extra PR is causing confusion. My original intention was to merely forward port Antoine changes so they could compile with the 'master' version of Python. Over time I have made some fixes to it. I have kept it open because I'm not sure which approach is better (JSR or duplication).

    I did a quick test on the .pyc code size increase. It is actually tiny for the Python code in Lib/.

    PR 5006: 21259894 bytes
    PR 4682: 21263081 bytes (+0.015%)

    Running Serhiy's microbenchmarks:

    for i in a:
        try: pass
        finally: pass

    PR 4682: Mean +- std dev: 11.0 us +- 0.1 us
    PR 5006: Mean +- std dev: 10.9 us +- 0.6 us

    for i in a:
        try: continue
        finally: pass

    PR 4682: Mean +- std dev: 9.46 us +- 0.85 us
    PR 5006: Mean +- std dev: 14.3 us +- 0.3 us

    for i in a:
        while True:
            try: break
            finally: pass

    PR 4682: Mean +- std dev: 9.20 us +- 0.09 us
    PR 5006: Mean +- std dev: 14.3 us +- 0.6 us

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Dec 29, 2017

    I consider PR 5006 as just a first step. I had two goals:

    1. Fix all regressions in PR 2827. First at all the regression in the f_lineno
      setter.

    2. Simplify the patch. This is critically important for getting the review.
      And it decreases the chance of regressions. All changes not directly related
      to the main goal (declared in the title of this issue) were removed. Other
      nice improvements can be made in separate PRs.

    Maybe finally we will came to the duplication of the code. But this will
    require making other small steps which can have their own value.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Dec 30, 2017

    Another point in favour of the JSR approach is that it should make it
    easier for tools like coverage.py to continue mapping opcode coverage to
    line coverage.

    I also like Serhiy's proposed strategy of separating the initial
    introduction of the compiler based exception handling from the subsequent
    simplification of the exception handling opcodes.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 1, 2018

    So we now have three competing PRs... each of which is not trivial. It would be nice if somehow there could be some conciliation, or at least a synthetic comparison. I don't think I want to review all three proposals (and I already spent time reviewing previous versions of Serhiy's).

    @markshannon
    Copy link
    Member Author

    markshannon commented Jan 1, 2018

    It shouldn't be a surprise that I have submitted a PR. The original patch was my work and a month ago I said I would update it over the Christmas break.

    To avoid "competing" PRs in future, simply asking before appropriating other peoples work would help. I've signed the CLA, so I don't think there are any legal issues with Serhiy's or Neil's PRs containing my code, but it does seem a bit presumptive.

    Having said that, if someone can convince me that Serhiy's PR is better, then I have no problem with it being merged.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 1, 2018

    New opcodes in PR 5071 look interesting, but I think it is better to leave this experiment for a separate issue.

    Complexities of PRs (ignoring tests, documentation, and generated files):

    PR 4682:
    Include/opcode.h | 12 +-
    Lib/opcode.py | 21 +-
    Objects/frameobject.c | 241 ++++++++-----
    Objects/genobject.c | 10 +-
    Python/ceval.c | 426 +++++++++++------------
    Python/compile.c | 884 +++++++++++++++++++++++++++++++++---------------
    Python/peephole.c | 18 +-
    7 files changed, 979 insertions(+), 633 deletions(-)

    PR 5006:
    Include/opcode.h | 8 +-
    Lib/opcode.py | 11 +-
    Objects/frameobject.c | 203 ++++++++-------------
    Objects/genobject.c | 10 +-
    Python/ceval.c | 344 ++++++++++++++++-------------------
    Python/compile.c | 474 +++++++++++++++++++++++++++++-------------------
    Python/peephole.c | 33 ++--
    7 files changed, 561 insertions(+), 522 deletions(-)

    PR 5071:
    Include/frameobject.h | 5 +-
    Include/opcode.h | 18 +-
    Lib/opcode.py | 25 +-
    Objects/frameobject.c | 224 ++++++---------
    Objects/genobject.c | 10 +-
    Python/ceval.c | 507 ++++++++++++---------------------
    Python/compile.c | 681 +++++++++++++++++++++++++++++++-------------
    Python/peephole.c | 49 +---
    Python/wordcode_helpers.h | 33 ++-
    9 files changed, 827 insertions(+), 725 deletions(-)

    PR 4682: +4 opcodes, -4 opcodes, and changed numerical values of 2 opcodes
    PR 5006: +4 opcodes, -4 opcodes
    PR 5071: +8 opcodes, -8 opcodes, and changed numerical value of 1 opcode

    It looks to me that PR 5006 has the lowest complexity and is the most easy for reviewing.

    I'm working on it since taking Antoine's PR 5 months ago.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 1, 2018

    I took a quick look at Mark's PR. The main thing going for it IMHO is that it produces simpler bytecode: it removes the complicated with/finally-related opcodes, while Serhiy's has non-trivial END_FINALLY and POP_FINALLY.

    It would be nice to know if it's able to fix the stack size issues as in the following tests:
    https://github.com/python/cpython/pull/5006/files?diff=unified#diff-dae68b96e8fdcb924e1ea46c31f51aec

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 1, 2018

    It would be nice to know if it's able to fix the stack size issues as in the following tests:

    I just worked on it. See bpo-24340.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 1, 2018

    I just worked on it.

    I meant Mark's PR :-)

    @markshannon
    Copy link
    Member Author

    markshannon commented Jan 2, 2018

    On 01/01/18 17:54, Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    I took a quick look at Mark's PR. The main thing going for it IMHO is that it produces simpler bytecode: it removes the complicated with/finally-related opcodes, while Serhiy's has non-trivial END_FINALLY and POP_FINALLY.

    It would be nice to know if it's able to fix the stack size issues as in the following tests:
    https://github.com/python/cpython/pull/5006/files?diff=unified#diff-dae68b96e8fdcb924e1ea46c31f51aec

    Yes, it can. The key to determining a correct stack size is that each
    bytecode has a fixed stack delta for each exiting edge.
    For example, FOR_ITER is good because it has a delta of 0 when entering
    the body and a delta of -1 when leaving the loop.
    But (the old) WITH_CLEANUP_START is bad because it has a variable delta.

    With fixed deltas, it is not too hard to construct an algorithm to
    determine the correct stack size. The presence of JSR-style finally
    blocks complicates things a little, but not that much.

    An outline algorithm would be:
    Find all "finally" subgraphs, so as to distinguish them from the
    main CFG - It might be easiest to do this by marking the basic blocks
    where generating the code.
    For each subgraph compute max-depth, by keeping a running total of
    depth. Jumps to finally blocks should not alter the depth, but
    would potentially increase the max-depth.
    The max-depth for the main CFG is the stack depth for the function.

    ----------


    Python tracker <report@bugs.python.org>
    <https://bugs.python.org/issue17611\>


    @pitrou
    Copy link
    Member

    pitrou commented Jan 2, 2018

    With fixed deltas, it is not too hard to construct an algorithm to
    determine the correct stack size. The presence of JSR-style finally
    blocks complicates things a little, but not that much.

    Any chance you can put that patch somewhere to get an idea of the additional complexity?

    I'd like to gauge the total complexity of each approach including the stack size computation fix (since fixing that issue is one of the end goals). Also, that would help compare your PR and Serhiy's on more equal terms.

    @markshannon
    Copy link
    Member Author

    markshannon commented Jan 2, 2018

    I've added a commit to compute the stack size.
    It should compute the exact stack size in all cases, and thus fix bpo-24340.

    @markshannon
    Copy link
    Member Author

    markshannon commented Jan 2, 2018

    It looks like the build core dumped on Travis-CI

    I failed to account for the extra stuff left on the stack when handling exceptions and jumping out of a finally block due to a break. Caught by Serhiy's tests in #5078.
    I could add a work around, but it would be a bit hacky.

    I'm now convinced that code duplication is the only way to do this properly. The only real advantage of JSR-style finally blocks is the ease of implementation of frame.set_lineno().

    Rather than go around in circles any more, I suggest that we merge Serhiy's PR (PR 5006). I can then cherry-pick my commits to clean up the implementation of the with statement (as a new PR).

    I then plan to implement try-finally using code duplication, but that might not make it into 3.7 and we shouldn't hold up this issue any longer.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 2, 2018

    Le 02/01/2018 à 18:13, Mark Shannon a écrit :

    Rather than go around in circles any more, I suggest that we merge Serhiy's PR (PR 5006). I can then cherry-pick my commits to clean up the implementation of the with statement (as a new PR).

    Fair enough. I'm making a final review round on Serhiy's PR and
    hopefully this will be in.

    Also looking forward to the cleanup followup.

    @nascheme
    Copy link
    Member

    nascheme commented Jan 3, 2018

    On 2017-12-29, Mark Shannon wrote:

    One point I didn't cover is jumping to a new line in the debugger.
    Implementing that reliably for finally blocks with code
    duplication is tricky and would mean adding a number of marker
    bytecodes. Which is a big point in favour of the JSR style.

    Could we virtually execute the code, forwards or backwards,
    starting from f_lasti? For conditional jumps, we would follow both
    branches. Stop when we find the line number we want. If we hit an
    opcode that causes a fblock push or pop, abort.

    I started implementing this as a proof of concept but didn't finish.
    It seems to fit most naturally as an exported function of
    Python/peephole.c. That file already does a lot of similar stuff.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 9, 2018

    PR 5143 is yet one alternative. It is a variant of PR 5006 that restores SETUP_EXCEPT and gets rid of the new opcode BEGIN_FINALLY. SETUP_FINALLY, SETUP_WITH and SETUP_ASYNC_WITH now push NULL on the stack instead.

    PR 5143 removes and add fewer opcodes than PR 5006, but changes the behavior of more opcodes. I don't know what of them looks simpler.

    @markshannon
    Copy link
    Member Author

    markshannon commented Jan 9, 2018

    Can we stick to the one PR, please?
    Once it is merged then we can improve things.

    Also, I don't like leaving NULLs on the stack, they are an invitation to seg-fault. PR 5143 leaves more NULLs on the stack for longer.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 10, 2018

    Please let's stick with PR 5006.

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Jan 16, 2018

    What should be done for PR 5006? Changes of this PR have been minimized for easier review. But many interesting things can made after merging it. For example simplify 'with'-related opcodes. Or implement static blocks (this will make the 'try' statement zero-overhead, make checks in frame.f_lineno setter more reliable, and unblock some bytecode optimizations). It would be better to have more time before a feature freeze.

    @pitrou
    Copy link
    Member

    pitrou commented Jan 16, 2018

    Apparently PR 5006 has a conflict. Otherwise, is it ready for review?

    @pitrou
    Copy link
    Member

    pitrou commented Jan 16, 2018

    Simplifying "with"-related opcodes (and perhaps "finally" ones) sounds nice.

    "try" blocks are already zero-overhead for practical purposes (i.e. I don't think I've met any workload where moving a "try" around did improve performance significantly).

    @serhiy-storchaka serhiy-storchaka added 3.8 and removed 3.7 labels Feb 1, 2018
    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Feb 22, 2018

    New changeset 520b7ae by Serhiy Storchaka in branch 'master':
    bpo-17611. Move unwinding of stack for "pseudo exceptions" from interpreter to compiler. (GH-5006)
    520b7ae

    @serhiy-storchaka
    Copy link
    Member

    serhiy-storchaka commented Feb 25, 2018

    Finally it is merged, and seems successfully. Thank you all involved, and first at all Mark, the original author.

    This unblocked further changes, bpo-32489 and bpo-32949. And we can try implement other ideas.

    Changes in the f_lineno setter fixed crashes in bpo-17288 and bpo-28883. But while writing tests for it I found other crashers, will open new issues soon.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 interpreter-core Interpreter core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants