
GH-135379: Top of stack caching for the JIT. #135465


Open · wants to merge 23 commits into main

Conversation

@markshannon (Member) commented Jun 13, 2025

The stats need fixing and the generated tables could be more compact, but it works.

@Fidget-Spinner (Member) left a comment

This is really cool. I'll do a full review soon enough.

@markshannon (Member, Author)

Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall.

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

@Fidget-Spinner (Member)

> The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup. I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.

@markshannon marked this pull request as ready for review on June 20, 2025, 15:04
@Fidget-Spinner (Member) left a comment

I need to review the cases generator later.

Comment on lines +1028 to +1029
if (_PyUop_Caching[base_opcode].exit_depth_is_output) {
    return input + _PyUop_Caching[base_opcode].delta;
Member:

What does this do?

@markshannon (Member, Author):

Returns the output stack depth, which is the input stack depth + the delta.
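
For illustration, here is a rough Python mirror of that C logic. The field names exit_depth_is_output and delta come from the diff above; the dataclass and function are hypothetical and exist only to spell out the computation, and the branch not shown in the excerpt is left out.

from dataclasses import dataclass

@dataclass
class CachingEntry:
    exit_depth_is_output: bool  # does the exit depth equal the output depth?
    delta: int                  # net change in stack depth for this uop

def output_stack_depth(entry: CachingEntry, input_depth: int) -> int:
    if entry.exit_depth_is_output:
        # Output stack depth = input stack depth + the per-uop delta.
        return input_depth + entry.delta
    # The other branch is not shown in the excerpt above.
    raise NotImplementedError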

Comment on lines 1259 to 1262
if ideal_inputs > 3:
    ideal_inputs = 3
if ideal_outputs > 3:
    ideal_outputs = 3
Member:

Can you move the magic number 3 into a named global constant, so that we can play around with increasing or decreasing the register count in the future?
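
A minimal sketch of what that could look like; the constant name MAX_CACHED_REGISTERS is made up here, and the real cases generator may clamp the values differently.

# Hypothetical named constant replacing the magic number 3.
MAX_CACHED_REGISTERS = 3

def clamp_cached_depths(ideal_inputs: int, ideal_outputs: int) -> tuple[int, int]:
    # Equivalent to the two if-statements above, with the limit named once
    # so the register count can be tuned in a single place.
    return (min(ideal_inputs, MAX_CACHED_REGISTERS),
            min(ideal_outputs, MAX_CACHED_REGISTERS))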

@Fidget-Spinner (Member) left a comment

Stylistic nits

def is_large(uop: Uop) -> bool:
    return len(list(uop.body.tokens())) > 80

def get_uop_cache_depths(uop: Uop) -> Iterator[tuple[int, int, int]]:
Member:

I don't like using generators here. There's no benefit, as the uop sequence isn't huge. The code might be clearer without the yields, using a list append instead. This is an optional review comment, though.
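
To illustrate the style point only (the names and the tuple layout here are invented, not the real get_uop_cache_depths logic), the two shapes being compared are roughly:

from typing import Iterator

def depths_lazy(entries: list[tuple[int, int]]) -> Iterator[tuple[int, int, int]]:
    # Generator form: tuples are produced lazily with yield.
    for offset, (inputs, outputs) in enumerate(entries):
        yield inputs, outputs, offset

def depths_eager(entries: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    # List form: the same tuples, collected eagerly with append().
    result: list[tuple[int, int, int]] = []
    for offset, (inputs, outputs) in enumerate(entries):
        result.append((inputs, outputs, offset))
    return result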

@markshannon (Member, Author):

I think using a generator here is fine.

@markshannon (Member, Author)

Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.

@Fidget-Spinner (Member) left a comment

I was discussing this with Brandt the other day, and concluded that we need this PR to implement decref elimination with a lower maintenance burden.

There are two ways to do decref elimination:

  1. Manually handwrite (or have the cases generator emit) a version of the instruction that has no PyStackref_CLOSE/DECREF_INPUTS.
  2. Specialize for POP_TOP.

Option 2 is a lot more maintainable in the long run and doesn't involve any more cases-generator magic, unlike option 1. It's also less likely to introduce a bug, again because there is less cases-generator magic involved. So I'm more inclined towards option 2.

Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases-generator magic again 😉.

@brandtbucher (Member)

I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without more certain payoff. Personally, "13%-18% faster nbody on most platforms, and everything else in the noise" really just doesn't feel like enough. If this is really a good approach to register allocation, we should be able to see a general pattern of improvement for multiple JIT-heavy benchmarks. That could mean more tweaking of magic numbers, or a change in approach.

I understand that this should be faster. But the numbers don't show that it is, generally. At least not enough to justify the additional complexity.

> Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations.

I'm not sure that's the case. Yesterday, we benchmarked this approach together:

  • Just decref elimination: 7% faster nbody
  • This PR: 13% faster nbody
  • Both PRs together: 9% faster nbody

So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.

@Fidget-Spinner (Member)

@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all to the whole interpreter unless you let me go ham with the DSL.

@brandtbucher (Member)

But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).

@Fidget-Spinner (Member)

I can't say for sure whether the effect is the same.

@Fidget-Spinner (Member) commented Jun 26, 2025

One suspicion I have from examining the nbody traces is that the decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so it spills regardless of whether we decref eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So the decref elimination is not doing anything to the TOS caching at the moment.

The puzzling thing, however, is that we are not actually increasing the number of spills: the maximum number of spills should stay the same with or without decref elimination. So the benchmark results are a little suspect to me.

@Fidget-Spinner (Member)

One more reason why I'm a little suspicious of our benchmarks: JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates by 2% weekly: https://github.com/facebookexperimental/free-threading-benchmarking

On the MS runner, it's slightly better, at 1% weekly: https://github.com/faster-cpython/benchmarking-public . However, we're basing our decision on these numbers, which I can't say I fully trust.

@Fidget-Spinner (Member)

@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:

  1. Base bda1218 (the reference)
  2. Decref elimination
  3. TOS caching
  4. TOS caching + decref elimination

All of them are standard benchmarking builds, i.e. PGO, LTO, and the JIT enabled.

These are the results for nbody:

Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-decref-bda121862] 102 ms +- 1 ms: 1.03x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-bda121862] 91.1 ms +- 0.5 ms: 1.16x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-decref-bda121862] 85.8 ms +- 0.4 ms: 1.23x faster

So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
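
As a quick sanity check, the speed-up factors can be recomputed from the displayed means; pyperf derives its factors from unrounded data, so the last digit can differ slightly from the printed 1.03x/1.16x/1.23x.

base, decref, tos, tos_decref = 105.0, 102.0, 91.1, 85.8  # ms, from the output above

for name, mean in [("decref", decref),
                   ("tos-caching", tos),
                   ("tos-caching + decref", tos_decref)]:
    print(f"{name}: {base / mean:.2f}x faster than base")
# With the rounded means this prints roughly 1.03x, 1.15x and 1.22x.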
