GH-135379: Top of stack caching for the JIT. #135465
base: main
Conversation
This is really cool. I'll do a full review soon enough.
78489ea to 2850d72
Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-20-16-03-59.gh-issue-135379.eDg89T.rst
I need to review the cases generator later.
Misc/NEWS.d/next/Core_and_Builtins/2025-06-13-13-32-16.gh-issue-135379.pAxZgy.rst
if (_PyUop_Caching[base_opcode].exit_depth_is_output) {
    return input + _PyUop_Caching[base_opcode].delta;
What does this do?
Returns the output stack depth, which is the input stack depth + the delta.
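For illustration, here is a rough Python restatement of that logic (the table and field names mirror the C snippet above, but the dataclass and standalone function are assumptions made for this sketch, not the PR's code):

from dataclasses import dataclass

@dataclass
class UopCaching:
    exit_depth_is_output: bool
    delta: int  # net stack effect of the uop

def exit_depth(caching: UopCaching, input_depth: int) -> int | None:
    # When a uop's exit stack depth follows its output, the depth after
    # the uop is simply the depth before it plus its net stack effect.
    if caching.exit_depth_is_output:
        return input_depth + caching.delta
    return None  # the other cases are handled elsewhere in the real code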
Tools/cases_generator/analyzer.py
if ideal_inputs > 3:
    ideal_inputs = 3
if ideal_outputs > 3:
    ideal_outputs = 3
Can you move the magic number 3 to a global constant so that we can play around with increasing/decreasing register counts in the future?
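A minimal sketch of what that could look like (the constant name MAX_TOS_CACHE_DEPTH and the standalone helper are assumptions for illustration, not code from this PR):

MAX_TOS_CACHE_DEPTH = 3  # number of top-of-stack cache "registers"

def clamp_cache_depths(ideal_inputs: int, ideal_outputs: int) -> tuple[int, int]:
    # Clamp both depths to the configurable register count so the limit
    # can be tuned in a single place.
    return (min(ideal_inputs, MAX_TOS_CACHE_DEPTH),
            min(ideal_outputs, MAX_TOS_CACHE_DEPTH))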
Stylistic nits
def is_large(uop: Uop) -> bool:
    return len(list(uop.body.tokens())) > 80


def get_uop_cache_depths(uop: Uop) -> Iterator[tuple[int, int, int]]:
I don't like using generators here. There's no benefit as it's not like the uop sequence is huge or something. The code might be clearer without the yields and with just a list append instead. This is an optional review though.
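As a toy illustration of the trade-off (the values are placeholders and the real body of get_uop_cache_depths is not shown in this excerpt; this only contrasts the yield style with the list-append style):

from typing import Iterator

def depths_generator() -> Iterator[tuple[int, int, int]]:
    # Generator style: lazily yields each triple.
    for i in range(3):
        yield (i, i + 1, 0)

def depths_list() -> list[tuple[int, int, int]]:
    # List-append style: builds the same values eagerly.
    result: list[tuple[int, int, int]] = []
    for i in range(3):
        result.append((i, i + 1, 0))
    return result

assert list(depths_generator()) == depths_list()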
I think using a generator here is fine.
Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.
I was discussing this with Brandt the other day, and I concluded that we need this to implement decref elimination with a lower maintenance burden.
For decref elimination, there are two ways:
- Manually handwrite (or have the cases generator emit) a version of the instruction that has no PyStackRef_CLOSE/DECREF_INPUTS.
- Specialize for POP_TOP.
Option 2 is a lot more maintainable in the long run and doesn't involve any more cases generator magic like option 1 does. It's also less likely to introduce a bug, again because there's less cases generator magic. So I'm more inclined towards option 2.
Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed, otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases generator magic again 😉.
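To make option 2 concrete, here is a very rough, hypothetical sketch of such a pass over a uop trace (the _POP_TOP_NOP name, the trace encoding, and the whole function are invented for illustration and are not this PR's code):

def specialize_pop_tops(trace: list[tuple[str, bool]]) -> list[str]:
    # Each entry is (uop_name, tos_needs_no_decref). When the optimizer
    # knows the value being popped needs no refcount update, the generic
    # _POP_TOP can be replaced with a cheaper specialized variant.
    out: list[str] = []
    for uop, no_decref_needed in trace:
        if uop == "_POP_TOP" and no_decref_needed:
            out.append("_POP_TOP_NOP")  # hypothetical decref-free pop
        else:
            out.append(uop)
    return out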
I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without a more certain payoff. Regarding the "13%-18% faster" number: I understand that this should be faster, but the numbers don't show that it is, generally. At least not enough to justify the additional complexity.
I'm not sure that's the case. Yesterday, we benchmarked this approach together:
So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.
@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all for the whole interpreter unless you let me go ham with the DSL.
But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).
I can't say for sure whether the effect is the same.
One suspicion I have from examining the nbody traces is that decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so we spill regardless of whether we decref-eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So decref elimination is not doing anything for TOS caching at the moment. However, it is also not actually increasing any of the spills: the maximum number of spills should stay the same with or without decref elimination. So the benchmark results are a little suspect to me.
One more reason why I'm a little suspicious of our benchmarks: JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates by 2% weekly (https://github.com/facebookexperimental/free-threading-benchmarking); on the MS runner it's slightly better, at 1% weekly (https://github.com/faster-cpython/benchmarking-public). However, we're basing our decision on these numbers, which I can't say I trust fully.
@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:
All of them are standard benchmarking builds, i.e. PGO, LTO, JIT. These are the results for nbody:
So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
The stats need fixing and the generated tables could be more compact, but it works.