-
-
Notifications
You must be signed in to change notification settings - Fork 30.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify the VM's stack manipulations #90686
Comments
...as discussed in faster-cpython/ideas#228. We can dramatically simplify our stack manipulations by getting rid of the
It then becomes much simpler for the peephole optimizer to reason about code sequences involving these instructions (for example, it's pretty straightforward to truly *optimize* an arbitrary sequence of swaps). |
What are the (unoptimized) performance implications of replacing one instruction with multiple instructions? |
In practice, pretty much the only thing that will do more work now is augmented subscription:
main: 1 2 LOAD_NAME 0 (a) patched: 1 2 LOAD_NAME 0 (a) Pattern matching is the only place where we use ROT_N, and frankly it's *misused* there... often, it makes the cost of storing names quadratic at runtime. Even though we emit *more* instructions now, the peephole optimizer actually cuts them down significantly, and the result is more efficient than before. For example: >>> match x:
... case (a, b, c, None):
... pass main: 1 2 LOAD_NAME 0 (x) 2 4 MATCH_SEQUENCE 3 32 LOAD_CONST 1 (None) 2 >> 36 POP_TOP patched: 1 2 LOAD_NAME 0 (x) 2 4 MATCH_SEQUENCE 3 32 LOAD_CONST 1 (None) 2 >> 36 POP_TOP Replacing the ROT_FOURs with SWAPs may seem minor, but it ends up being a *lot* less work at runtime. |
Very interesting. Do we have any current (or even cutting edge, i.e. 3.11) timings for individual instructions? Or a breakdown of how frequently different instructions are invoked? I remember Carl Shapiro presenting his findings here several years ago (I think in the penultimate in-person Language Summit). |
In a typical run of the pyperformance benchmark suite, rotations account for a bit over 1% of all instructions executed. I don't have timings for individual instructions, unfortunately. |
Timings for individual instructions are a bit meaningless, as out-of-order execution and speculation on modern CPUs makes it hard to pin down the timing of anything. I did an experiment to double the number of instructions. It slowed things down by ~10%, so increasing the number of instructions by 1% would be expected to result in a slowdown of 0.1%. In other words, this is going to make little or no difference to performance. It does make things cleaner and simpler though, which has its own benefits. |
Were the extra instructions just NOPs, or were they actually doing any work? If they were NOPs, then presumably those numbers tell us more about the overhead of dispatch and cache pressure than anything else. |
IIRC, Carl got a lot of benefit out of reordering the opcodes in the main loop to put the most common ones at the top. I don't know if that is still relevant or whether computed gotos, when enabled, change that calculus. |
AFAIK, on nix systems with PGO, the order doesn't matter. Brandt did some interesting research on this earlier faster-cpython/ideas#96. On Windows, it might be a different story. There are reports of up to 5% speedups in pyperformance just by moving LOAD_FAST and LOAD_GLOBAL out of the loop (to the top). See the last few messages in https://bugs.python.org/issue45116. I've heard from some people that MSVC's PGO doesn't deal with gigantic switch-case statements very well, though I can't confirm the veracity of that info. |
Minor nit: I think swaptimize() should check the for PyMem_Malloc returning NULL here. // Create an array with elements {0, 1, 2, ..., depth - 1}:
int *stack = PyMem_Malloc(depth * sizeof(int));
for (int i = 0; i < depth; i++) {
stack[i] = i;
} |
Hm, yeah. Bummer that this needs error handling now. I'll have a fix up soon. |
SWAP
s at compile-time #30970PyMem_Malloc
call forNULL
#30998BUILD_TUPLE
/UNPACK_SEQUENCE
folding #31039Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: