Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify the VM's stack manipulations #90686

Closed
brandtbucher opened this issue Jan 26, 2022 · 15 comments
Closed

Simplify the VM's stack manipulations #90686

brandtbucher opened this issue Jan 26, 2022 · 15 comments
Assignees
Labels
3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage

Comments

@brandtbucher
Copy link
Member

BPO 46528
Nosy @warsaw, @markshannon, @brandtbucher, @sweeneyde, @Fidget-Spinner
PRs
  • bpo-46528: Simplify the VM's stack manipulations #30902
  • bpo-46528: Attempt SWAPs at compile-time #30970
  • bpo-46528: Check PyMem_Malloc call for NULL #30998
  • bpo-46528: Simplify BUILD_TUPLE/UNPACK_SEQUENCE folding #31039
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/brandtbucher'
    closed_at = <Date 2022-02-09.23:16:39.087>
    created_at = <Date 2022-01-26.01:39:59.759>
    labels = ['interpreter-core', '3.11', 'performance']
    title = "Simplify the VM's stack manipulations"
    updated_at = <Date 2022-02-09.23:16:39.087>
    user = 'https://github.com/brandtbucher'

    bugs.python.org fields:

    activity = <Date 2022-02-09.23:16:39.087>
    actor = 'brandtbucher'
    assignee = 'brandtbucher'
    closed = True
    closed_date = <Date 2022-02-09.23:16:39.087>
    closer = 'brandtbucher'
    components = ['Interpreter Core']
    creation = <Date 2022-01-26.01:39:59.759>
    creator = 'brandtbucher'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46528
    keywords = ['patch']
    message_count = 15.0
    messages = ['411697', '411698', '411700', '411791', '411937', '411938', '411940', '411941', '411952', '411986', '412024', '412030', '412314', '412939', '412954']
    nosy_count = 5.0
    nosy_names = ['barry', 'Mark.Shannon', 'brandtbucher', 'Dennis Sweeney', 'kj']
    pr_nums = ['30902', '30970', '30998', '31039']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue46528'
    versions = ['Python 3.11']

    @brandtbucher
    Copy link
    Member Author

    ...as discussed in faster-cpython/ideas#228.

    We can dramatically simplify our stack manipulations by getting rid of the DUP_TOP* and ROT_* families of instructions:

    • Replace DUP_TOP with COPY(1).
    • Replace DUP_TOP_TWO with COPY(2), COPY(2).
    • Introduce a new SWAP instruction.
      • Replace ROT_TWO with SWAP(2).
      • Replace ROT_THREE with SWAP(3), SWAP(2).
      • Remove ROT_FOUR.
      • Replace ROT_N(n) with SWAP(n), SWAP(n - 1), ..., SWAP(2).

    It then becomes much simpler for the peephole optimizer to reason about code sequences involving these instructions (for example, it's pretty straightforward to truly *optimize* an arbitrary sequence of swaps).

    @brandtbucher brandtbucher added the 3.11 only security fixes label Jan 26, 2022
    @brandtbucher brandtbucher self-assigned this Jan 26, 2022
    @brandtbucher brandtbucher added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 26, 2022
    @brandtbucher brandtbucher self-assigned this Jan 26, 2022
    @brandtbucher brandtbucher added interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Jan 26, 2022
    @warsaw
    Copy link
    Member

    warsaw commented Jan 26, 2022

    What are the (unoptimized) performance implications of replacing one instruction with multiple instructions?

    @brandtbucher
    Copy link
    Member Author

    In practice, pretty much the only thing that will do more work now is augmented subscription:

    >> a[b] += c

    main:

    1 2 LOAD_NAME 0 (a)
    4 LOAD_NAME 1 (b)
    6 DUP_TOP_TWO
    8 BINARY_SUBSCR
    10 LOAD_NAME 2 (c)
    12 BINARY_OP 13 (+=)
    14 ROT_THREE
    16 STORE_SUBSCR

    patched:

    1 2 LOAD_NAME 0 (a)
    4 LOAD_NAME 1 (b)
    6 COPY 2
    8 COPY 2
    10 BINARY_SUBSCR
    12 LOAD_NAME 2 (c)
    14 BINARY_OP 13 (+=)
    16 SWAP 3
    18 SWAP 2
    20 STORE_SUBSCR

    Pattern matching is the only place where we use ROT_N, and frankly it's *misused* there... often, it makes the cost of storing names quadratic at runtime. Even though we emit *more* instructions now, the peephole optimizer actually cuts them down significantly, and the result is more efficient than before. For example:

    >>> match x:
    ...     case (a, b, c, None):
    ...         pass

    main:

    1 2 LOAD_NAME 0 (x)

    2 4 MATCH_SEQUENCE
    6 POP_JUMP_IF_FALSE 20 (to 40)
    8 GET_LEN
    10 LOAD_CONST 0 (4)
    12 COMPARE_OP 2 (==)
    14 POP_JUMP_IF_FALSE 20 (to 40)
    16 UNPACK_SEQUENCE 4
    18 ROT_FOUR
    20 ROT_FOUR
    22 ROT_FOUR
    24 POP_JUMP_IF_NOT_NONE 18 (to 36)
    26 STORE_NAME 1 (a)
    28 STORE_NAME 2 (b)
    30 STORE_NAME 3 (c)

    3 32 LOAD_CONST 1 (None)
    34 RETURN_VALUE

    2 >> 36 POP_TOP
    38 POP_TOP
    >> 40 POP_TOP
    42 LOAD_CONST 1 (None)
    44 RETURN_VALUE

    patched:

    1 2 LOAD_NAME 0 (x)

    2 4 MATCH_SEQUENCE
    6 POP_JUMP_IF_FALSE 20 (to 40)
    8 GET_LEN
    10 LOAD_CONST 0 (4)
    12 COMPARE_OP 2 (==)
    14 POP_JUMP_IF_FALSE 20 (to 40)
    16 UNPACK_SEQUENCE 4
    18 SWAP 2
    20 SWAP 3
    22 SWAP 4
    24 POP_JUMP_IF_NOT_NONE 18 (to 36)
    26 STORE_NAME 1 (a)
    28 STORE_NAME 2 (b)
    30 STORE_NAME 3 (c)

    3 32 LOAD_CONST 1 (None)
    34 RETURN_VALUE

    2 >> 36 POP_TOP
    38 POP_TOP
    >> 40 POP_TOP
    42 LOAD_CONST 1 (None)
    44 RETURN_VALUE

    Replacing the ROT_FOURs with SWAPs may seem minor, but it ends up being a *lot* less work at runtime.

    @brandtbucher
    Copy link
    Member Author

    New changeset 8548366 by Brandt Bucher in branch 'main':
    bpo-46528: Simplify the VM's stack manipulations (GH-30902)
    8548366

    @warsaw
    Copy link
    Member

    warsaw commented Jan 27, 2022

    Very interesting. Do we have any current (or even cutting edge, i.e. 3.11) timings for individual instructions? Or a breakdown of how frequently different instructions are invoked? I remember Carl Shapiro presenting his findings here several years ago (I think in the penultimate in-person Language Summit).

    @brandtbucher
    Copy link
    Member Author

    In a typical run of the pyperformance benchmark suite, rotations account for a bit over 1% of all instructions executed.

    I don't have timings for individual instructions, unfortunately.

    @markshannon
    Copy link
    Member

    Timings for individual instructions are a bit meaningless, as out-of-order execution and speculation on modern CPUs makes it hard to pin down the timing of anything.

    I did an experiment to double the number of instructions. It slowed things down by ~10%, so increasing the number of instructions by 1% would be expected to result in a slowdown of 0.1%.

    In other words, this is going to make little or no difference to performance. It does make things cleaner and simpler though, which has its own benefits.

    @brandtbucher
    Copy link
    Member Author

    I did an experiment to double the number of instructions.

    Were the extra instructions just NOPs, or were they actually doing any work?

    If they were NOPs, then presumably those numbers tell us more about the overhead of dispatch and cache pressure than anything else.

    @warsaw
    Copy link
    Member

    warsaw commented Jan 28, 2022

    IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top. I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.

    @Fidget-Spinner
    Copy link
    Member

    IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top. I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus.

    AFAIK, on nix systems with PGO, the order doesn't matter. Brandt did some interesting research on this earlier faster-cpython/ideas#96.

    On Windows, it might be a different story. There are reports of up to 5% speedups in pyperformance just by moving LOAD_FAST and LOAD_GLOBAL out of the loop (to the top). See the last few messages in https://bugs.python.org/issue45116. I've heard from some people that MSVC's PGO doesn't deal with gigantic switch-case statements very well, though I can't confirm the veracity of that info.

    @sweeneyde
    Copy link
    Member

    Minor nit: I think swaptimize() should check the for PyMem_Malloc returning NULL here.

        // Create an array with elements {0, 1, 2, ..., depth - 1}:
        int *stack = PyMem_Malloc(depth * sizeof(int));
        for (int i = 0; i < depth; i++) {
            stack[i] = i;
        }

    @brandtbucher
    Copy link
    Member Author

    Hm, yeah. Bummer that this needs error handling now.

    I'll have a fix up soon.

    @brandtbucher
    Copy link
    Member Author

    New changeset a0e55a5 by Brandt Bucher in branch 'main':
    bpo-46528: Simplify BUILD_TUPLE/UNPACK_SEQUENCE folding (GH-31039)
    a0e55a5

    @brandtbucher
    Copy link
    Member Author

    New changeset 46328d8 by Brandt Bucher in branch 'main':
    bpo-46528: Check PyMem_Malloc for NULL (GH-30998)
    46328d8

    @brandtbucher
    Copy link
    Member Author

    New changeset 78ae4cc by Brandt Bucher in branch 'main':
    bpo-46528: Attempt SWAPs at compile-time (GH-30970)
    78ae4cc

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants