Description
Feature or enhancement
Proposal:
We should get tail calling interpreter support for MSVC.
The latest figures for the tail calling interpreter are:
- 1-3% pyperformance faster on Ubuntu x64
- 4-5% pyperformance faster on macOS AArch64
- The last benchmarks for the tail calling interpreter on Windows MSVC reported a 17% speedup on pyperformance.
On Windows, performance isn't easy to measure because pyperf system tune doesn't work there. However, on a best-effort quiet system, running some benchmarks from pyperformance, these are my results:
Mean +- std dev: [spectralnorm_tc_no] 146 ms +- 1 ms -> [spectralnorm_tc] 98.3 ms +- 1.1 ms: 1.48x faster
Mean +- std dev: [nbody_tc_no] 145 ms +- 2 ms -> [nbody_tc] 107 ms +- 2 ms: 1.35x faster
Mean +- std dev: [bm_django_template_tc_no] 26.9 ms +- 0.5 ms -> [bm_django_template_tc] 22.8 ms +- 0.4 ms: 1.18x faster
Mean +- std dev: [xdsl_tc_no] 64.2 ms +- 1.6 ms -> [xdsl_tc] 56.1 ms +- 1.5 ms: 1.14x faster
Note that I had to apply a 2-line patch to fix xDSL for Python 3.15.
Nbody and spectralnorm are toy shootout benchmarks. xDSL is a newly added benchmark: an MLIR DSL library written almost entirely in pure Python. I consider it a large pure-Python library not too far off from mypy (though it's obviously smaller). 14% faster for xDSL is massive. It's roughly half of the speedup we got in 3.11 with the specializing interpreter, and that was thousands of lines of code. This change, in contrast, will be just one PR. Django templating shows a big speedup too -- 18% faster.
This could not have been possible without help from the MSVC team. Specifically, I'd like to give a shoutout to Hulon Jenkins and the MSVC team in the patch notes when I land this.
I've discussed this during the Wednesday CHIPS meetings, and there was no opposition to the new set of changes required to get this working. The changes are minimal and mostly just involve adding correct restrict qualifiers and scoping so that MSVC can tell a local variable doesn't escape. This should also benefit GCC and Clang in some fashion.
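To make the "restrict and scoping" idea concrete, here is a minimal sketch of the two patterns in isolation. It is an illustration of the general technique, not the actual CPython change: the NOALIAS macro and every function name below are hypothetical.

```c
#include <stddef.h>

/* Hypothetical portability macro: MSVC spells the qualifier __restrict,
   while C99 and later compilers accept the restrict keyword. */
#if defined(_MSC_VER)
#  define NOALIAS __restrict
#else
#  define NOALIAS restrict
#endif

/* dst and src are promised not to overlap, so the compiler does not have
   to assume that a store to dst changes what src points at. */
void scale_into(double *NOALIAS dst, const double *NOALIAS src,
                size_t n, double k)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = k * src[i];
    }
}

/* Hypothetical helper that writes its result through a pointer. */
static void read_operand(const int *pc, int *out)
{
    *out = pc[1];
}

/* Hypothetical "next handler" that just returns the accumulator. */
static long finish(const int *pc, long acc)
{
    (void)pc;
    return acc;
}

long step_add_operand(const int *pc, long acc)
{
    {
        int operand;                /* address is taken only inside this block */
        read_operand(pc, &operand);
        acc += operand;
    }                               /* operand's lifetime ends here */

    /* The call is in tail position, and no local whose address was taken
       is still in scope when it happens. */
    return finish(pc + 2, acc);
}
```

The inner block in step_add_operand is the scoping part: the only local whose address is handed out is dead before the final call, which is the kind of property a compiler needs to see before it can reuse the current frame for that call.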
We require working CI before this is merged, so I am now waiting on GitHub Actions or some other CI to get this working.
Very special shoutout as well to @chris-eibl who has been helping me with this on Windows.
Some other benefits: the TC actually correctly "resets" the inlining heuristic on MSVC. This means that once we distribute builds with tail calling, we can eventually get rid of all the macros and ugly hacks we have to make the current interpreter faster on MSVC. An example of such a hack, where we use macros over static inlines on MSVC just because the interpreter loop breaks MSVC's inlining heuristics: #121263
Possibility of a compiler bug this time
I doubt it, as MSVC has no computed goto, so it can't have the same bug as Clang that we bumped into the previous time.
Where are the perf gains coming from?
It's mainly the better inlining that we get from the tail calling interpreter, and elimination of double jumps vs the switch-case interpreter.
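As a rough illustration of the two dispatch styles, here is a toy stack machine written both ways. This is a sketch under my own assumptions, not CPython's interpreter: the opcodes, the TAIL_RETURN macro, and all function names are hypothetical. The macro only requests a guaranteed tail call on Clang, where the musttail attribute exists; elsewhere it falls back to a plain return, which still runs correctly but leaves the tail-call optimization up to the compiler.

```c
#include <stddef.h>

/* Hypothetical macro: force a tail call on Clang, plain return elsewhere. */
#if defined(__clang__) && defined(__has_attribute)
#  if __has_attribute(musttail)
#    define TAIL_RETURN __attribute__((musttail)) return
#  endif
#endif
#ifndef TAIL_RETURN
#  define TAIL_RETURN return
#endif

enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT, OP_COUNT };

/* Switch-case interpreter: each instruction jumps back to the top of the
   loop and then jumps again through the switch (the "double jump"). */
long run_switch(const int *code)
{
    long stack[16];
    long *sp = stack;
    const int *pc = code;
    for (;;) {
        switch (*pc) {
        case OP_PUSH: *sp++ = pc[1]; pc += 2; break;
        case OP_ADD:  sp[-2] += sp[-1]; sp--; pc++; break;
        case OP_MUL:  sp[-2] *= sp[-1]; sp--; pc++; break;
        case OP_HALT: return sp[-1];
        default:      return -1;   /* unknown opcode */
        }
    }
}

/* Tail-calling interpreter: one small function per instruction, each
   ending with a call straight into the next instruction's handler. */
typedef long (*handler_fn)(const int *pc, long *sp);

static long op_push(const int *pc, long *sp);
static long op_add(const int *pc, long *sp);
static long op_mul(const int *pc, long *sp);
static long op_halt(const int *pc, long *sp);

static const handler_fn dispatch_table[OP_COUNT] = {
    op_push, op_add, op_mul, op_halt
};

#define DISPATCH() TAIL_RETURN dispatch_table[*pc](pc, sp)

static long op_push(const int *pc, long *sp)
{
    *sp++ = pc[1];
    pc += 2;
    DISPATCH();
}

static long op_add(const int *pc, long *sp)
{
    sp[-2] += sp[-1];
    sp--;
    pc++;
    DISPATCH();
}

static long op_mul(const int *pc, long *sp)
{
    sp[-2] *= sp[-1];
    sp--;
    pc++;
    DISPATCH();
}

static long op_halt(const int *pc, long *sp)
{
    (void)pc;
    return sp[-1];   /* result is on top of the stack */
}

long run_tail_calls(const int *code)
{
    long stack[16];
    return dispatch_table[code[0]](code, stack);
}
```

Both entry points compute the same result for a program such as {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT}, which evaluates (2 + 3) * 4. Because every handler is a small standalone function, the compiler makes its inlining decisions per handler instead of for one giant loop body.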
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response