
GH-135379: Top of stack caching for the JIT. #135465


Open · wants to merge 23 commits into main

Conversation

@markshannon (Member) commented Jun 13, 2025

The stats need fixing and the generated tables could be more compact, but it works.

@Fidget-Spinner (Member) left a comment

This is really cool. I'll do a full review soon enough.

@markshannon (Member, Author)

Performance is in the noise, but we would need a really big speedup of jitted code for it to be more than noise overall.

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

@Fidget-Spinner (Member)

> The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup. I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.

@markshannon marked this pull request as ready for review on June 20, 2025, 15:04
@Fidget-Spinner (Member) left a comment

I need to review the cases generator later.

Comment on lines +1028 to +1029
if (_PyUop_Caching[base_opcode].exit_depth_is_output) {
    return input + _PyUop_Caching[base_opcode].delta;
Member:

What does this do?

@markshannon (Member, Author):

Returns the output stack depth, which is the input stack depth + the delta.
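
For illustration, here is a rough Python mirror of that C logic. The field names exit_depth_is_output and delta come from the diff above; the dataclass and function are hypothetical and exist only to spell out the computation, and the branch not shown in the excerpt is left out.

from dataclasses import dataclass

@dataclass
class CachingEntry:
    exit_depth_is_output: bool  # does the exit depth equal the output depth?
    delta: int                  # net change in stack depth for this uop

def output_stack_depth(entry: CachingEntry, input_depth: int) -> int:
    if entry.exit_depth_is_output:
        # Output stack depth = input stack depth + the per-uop delta.
        return input_depth + entry.delta
    # The other branch is not shown in the excerpt above.
    raise NotImplementedError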

Comment on lines 1259 to 1262
if ideal_inputs > 3:
    ideal_inputs = 3
if ideal_outputs > 3:
    ideal_outputs = 3
Member:

Can you move the magic number 3 into a named global constant, so that we can play around with increasing or decreasing the register count in the future?
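
A minimal sketch of what that could look like; the constant name MAX_CACHED_REGISTERS is made up here, and the real cases generator may clamp the values differently.

# Hypothetical named constant replacing the magic number 3.
MAX_CACHED_REGISTERS = 3

def clamp_cached_depths(ideal_inputs: int, ideal_outputs: int) -> tuple[int, int]:
    # Equivalent to the two if-statements above, with the limit named once
    # so the register count can be tuned in a single place.
    return (min(ideal_inputs, MAX_CACHED_REGISTERS),
            min(ideal_outputs, MAX_CACHED_REGISTERS))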

@Fidget-Spinner (Member) left a comment

Stylistic nits

def is_large(uop: Uop) -> bool:
    return len(list(uop.body.tokens())) > 80

def get_uop_cache_depths(uop: Uop) -> Iterator[tuple[int, int, int]]:
Member:

I don't like using generators here. There's no benefit, as the uop sequence isn't huge. The code might be clearer without the yields, using a list append instead. This is an optional review comment, though.
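
To illustrate the style point only (the names and the tuple layout here are invented, not the real get_uop_cache_depths logic), the two shapes being compared are roughly:

from typing import Iterator

def depths_lazy(entries: list[tuple[int, int]]) -> Iterator[tuple[int, int, int]]:
    # Generator form: tuples are produced lazily with yield.
    for offset, (inputs, outputs) in enumerate(entries):
        yield inputs, outputs, offset

def depths_eager(entries: list[tuple[int, int]]) -> list[tuple[int, int, int]]:
    # List form: the same tuples, collected eagerly with append().
    result: list[tuple[int, int, int]] = []
    for offset, (inputs, outputs) in enumerate(entries):
        result.append((inputs, outputs, offset))
    return result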

@markshannon (Member, Author):

I think using a generator here is fine.

@markshannon (Member, Author)

Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.

@Fidget-Spinner (Member) left a comment

I was discussing this with Brandt the other day, and concluded that we need this PR to implement decref elimination with a lower maintenance burden.

There are two ways to do decref elimination:

  1. Manually handwrite (or have the cases generator emit) a version of the instruction that has no PyStackref_CLOSE/DECREF_INPUTS.
  2. Specialize for POP_TOP.

Option 2 is a lot more maintainable in the long run and doesn't involve any more cases-generator magic, unlike option 1. It's also less likely to introduce a bug, again because there is less cases-generator magic involved. So I'm more inclined towards option 2.

Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases-generator magic again 😉.

@brandtbucher (Member)

I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without more certain payoff. Personally, "13%-18% faster nbody on most platforms, and everything else in the noise" really just doesn't feel like enough. If this is really a good approach to register allocation, we should be able to see a general pattern of improvement for multiple JIT-heavy benchmarks. That could mean more tweaking of magic numbers, or a change in approach.

I understand that this should be faster. But the numbers don't show that it is, generally. At least not enough to justify the additional complexity.

> Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed; otherwise we're blocking other optimizations.

I'm not sure that's the case. Yesterday, we benchmarked this approach together:

  • Just decref elimination: 7% faster nbody
  • This PR: 13% faster nbody
  • Both PRs together: 9% faster nbody

So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.

@Fidget-Spinner (Member)

@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all to the whole interpreter unless you let me go ham with the DSL.

@brandtbucher (Member)

But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).

@Fidget-Spinner (Member)

I can't say for sure whether the effect is the same.

@Fidget-Spinner (Member) commented Jun 26, 2025

One suspicion I have from examining the nbody traces is that the decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so it spills regardless of whether we decref eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So the decref elimination is not doing anything to the TOS caching at the moment.

The puzzling thing, however, is that we are not actually increasing the number of spills: the maximum number of spills should stay the same with or without decref elimination. So the benchmark results are a little suspect to me.

@Fidget-Spinner (Member)

One more reason why I'm a little suspicious of our benchmarks: JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates by 2% weekly: https://github.com/facebookexperimental/free-threading-benchmarking

On the MS runner, it's slightly better, at 1% weekly: https://github.com/faster-cpython/benchmarking-public . However, we're basing our decision on these numbers, which I can't say I fully trust.

@Fidget-Spinner (Member)

@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:

  1. Base bda1218 (the reference)
  2. Decref elimination
  3. TOS caching
  4. TOS caching + decref elimination

All of them are standard benchmarking builds, i.e. PGO, LTO, and the JIT enabled.

These are the results for nbody:

Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-decref-bda121862] 102 ms +- 1 ms: 1.03x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-bda121862] 91.1 ms +- 0.5 ms: 1.16x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-decref-bda121862] 85.8 ms +- 0.4 ms: 1.23x faster

So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
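
As a quick sanity check, the speed-up factors can be recomputed from the displayed means; pyperf derives its factors from unrounded data, so the last digit can differ slightly from the printed 1.03x/1.16x/1.23x.

base, decref, tos, tos_decref = 105.0, 102.0, 91.1, 85.8  # ms, from the output above

for name, mean in [("decref", decref),
                   ("tos-caching", tos),
                   ("tos-caching + decref", tos_decref)]:
    print(f"{name}: {base / mean:.2f}x faster than base")
# With the rounded means this prints roughly 1.03x, 1.15x and 1.22x.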
