Supporting stack unwinding in the JIT compiler #126910
Comments
Just to clarify for anyone reading, it is sufficient to only enable frame pointers in JIT code, not necessarily the entire interpreter. After spending a week with @pablogsal digging deep into this issue and prototyping and evaluating the different options available to us (including doing nothing), my personal opinion is that this issue is fixing a behavioral change, not adding a feature. It's also my personal opinion that the two-line diff above and the 2% hit are a good compromise that unblocks adoption of the JIT, and I think it's the best path forward. We have explored strategies to emit DWARF as well; copy-and-patch makes harvesting and emitting correct DWARF not-too-difficult, but I also see it as an optional, lower priority nice-to-have. Frame pointers are a convenient escape hatch; for many tools that still don't work, we can at least point at frame pointers and argue that what the tools want should be possible using them. Currently, what they want is impossible without disabling the JIT... and we want people to turn it on! Turning on frame pointers in the naive way above also does not preclude the possibility of improving the way we emit frame pointer code in the future should we decide it's worth clawing back that 2% in exchange for additional implementation complexity; such a patch can stand on its own merits. It also shouldn't preclude a runtime option to opt in or out of them using an environment variable (again, though, I'm not sure this is worth additional complexity right now to compile two versions of every stencil to make this work, but it's certainly possible). I'd just like to thank @pablogsal for patiently educating me on this deep, dark, expansive area (and the people who rely on it) and working with me to explore solutions and compromise on something workable. |
Also, it's kind of neat that we've pretty much followed through with what PEP 744 had to say about this issue:
|
Benchmarks on JIT frame pointers are in:
Looks like a 2-3% performance hit, with the clear outlier of |
According to our profiling jitted code is only ~14% of the execution time, so a 3% slowdown is a ~20% slowdown in the jitted code. I think that's too slow. |
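The estimate above comes from dividing the whole-program slowdown by the share of time spent in jitted code. A quick check of that arithmetic, assuming the entire overhead is confined to jitted code and using only the figures quoted in this thread (the function name is illustrative):

```python
def implied_jit_slowdown(overall_slowdown: float, jit_time_fraction: float) -> float:
    """Slowdown of the jitted portion alone, if it causes all of the overall slowdown."""
    if not 0 < jit_time_fraction <= 1:
        raise ValueError("jit_time_fraction must be in (0, 1]")
    return overall_slowdown / jit_time_fraction


# Figures from the thread: a 2-3% overall hit, ~14% of time in jitted code.
for overall in (0.02, 0.03):
    in_jit = implied_jit_slowdown(overall, 0.14)
    print(f"{overall:.0%} overall -> {in_jit:.0%} in jitted code")
```

With a 3% overall hit this gives roughly 21%, which matches the "~20%" quoted above; with a 2% hit it gives roughly 14%.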
Then what do you propose? I don't think we have an alternative here. This is just the cost of not breaking existing tools and honestly I think we are extremely lucky to have a way to fix it that's just 2 lines with 2% hit. |
Amazing research work for a super important feature. Thank you @pablogsal and @brandtbucher for this effort! I'm a strong +1. |
In the text of the issue you state:
and later in the comments:
which contradicts what you said earlier. Also that outlier sounds very odd to me as all the AArch64 platforms should behave similarly. |
The original sentence was based on some test runs I did initially, but the second is from a full pyperformance run. I also agree it looks off, because macOS is using the same instruction set. We need to run it again and confirm, because it makes no sense to me and also contradicts the small tests I did separately. |
Yeah, the run had some wider min/max values than I'm used to seeing, so it might have been a fluke: https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20241114-3.14.0a1%2B-b1f0a4e-JIT/bm-20241114-arminc-aarch64-brandtbucher-justin_frame_pointer-3.14.0a1%2B-b1f0a4e-vs-base.svg |
I'm +1 on this. Frame pointers are basically the only viable way to handle stack unwinding here, and it's absolutely worth the cost. At the very least, we would need a runtime option to enable them (i.e. without recompiling CPython) because there are going to be essential scenarios that need them. FWIW, MSVC doesn't allow enabling frame pointers for x64 or ARM64, but the function unwinding tables allow specifying them. I'm not sure what LLVM does, but it should be possible to have unwinding via a frame pointer work on these platforms too. |
The idea is to have them activated by default (safety/debug should be the default), with maybe an environment variable or some other way to opt out if you don't care, but this should be set by the end user with knowledge of what the consequences are. Another reason we want this by default is that we don't want to make it impossible for users to send us backtraces when the interpreter or a C extension crashes or hangs, because reproducing this can be quite challenging (all of this is covered in the text). |
In what way is using frame pointers safer? |
The interpreter is compiled without using the frame pointer. So unwinding clearly doesn't need frame pointers. |
Because it allows debuggers to work. |
What do you mean by "safety" in this context? |
A GitHub issue isn't good for real-time back-and-forth chat. |
That if your application crashes, hangs, or generates a core, you can actually use a debugger on it and not get wrong stacks because the JIT makes debuggers choke. |
It needs frame pointers if you have a JIT compiler in the middle because the JIT doesn't have DWARF (debug information). This is explained in the issue.
You cannot because the JIT doesn't have DWARF and a backing elf file that unwinders can use. It's just a random string of bytes. |
Agreed, let's chat on Wednesday. |
Some good news: with this fix (that I plan to upstream) for LLVM's existing "reserved frame pointers" functionality...
diff --git a/llvm/lib/Target/X86/X86RegisterInfo.cpp b/llvm/lib/Target/X86/X86RegisterInfo.cpp
index 50db211c99d8..9b8652b7e302 100644
--- a/llvm/lib/Target/X86/X86RegisterInfo.cpp
+++ b/llvm/lib/Target/X86/X86RegisterInfo.cpp
@@ -563,7 +563,7 @@ BitVector X86RegisterInfo::getReservedRegs(const MachineFunction &MF) const {
Reserved.set(SubReg);
// Set the frame-pointer register and its aliases as reserved if needed.
- if (TFI->hasFP(MF)) {
+ if (TFI->hasFP(MF) || MF.getTarget().Options.FramePointerIsReserved(MF)) {
if (MF.getInfo<X86MachineFunctionInfo>()->getFPClobberedByInvoke())
MF.getContext().reportError(
SMLoc(),
...and this change to CPython (replacing the 2-line change @pablogsal suggests above)...
diff --git a/Tools/jit/_targets.py b/Tools/jit/_targets.py
index d8dce0a905c..4e898a86f86 100644
--- a/Tools/jit/_targets.py
+++ b/Tools/jit/_targets.py
@@ -121,6 +121,8 @@ async def _compile(
f"-I{CPYTHON / 'Python'}",
f"-I{CPYTHON / 'Tools' / 'jit'}",
"-O3",
+ "-Xclang",
+ f"-mframe-pointer={'all' if opname == 'shim' else 'reserved'}",
"-c",
# This debug info isn't necessary, and bloats out the JIT'ed code.
# We *may* be able to re-enable this, process it, and JIT it for a
...frame-pointer-based unwinding works, with 0% slowdown on benchmarks!
More context: There are two reasons why frame pointers are slow: you lose a register, and you must save and restore your caller's frame pointer state at the beginning and end of each function. It's the latter that's causing a slowdown for us; we compile each uop as its own function (which tail-calls into the next) and concatenate the bodies. So incrementing a fast local
With frame pointers, this instead becomes:
All of the frame pointer shuffling obviously isn't necessary; we really only need to do it once at the beginning, and once at the end. But some templates have multiple "returns", so finding and removing all of these is hard. And if we compile without frame pointers, the compiler uses the frame pointer register as scratch space, and clobbers whatever value we put there manually. However, LLVM does have (broken on
Whenever we call into JIT code, we already push a "shim" frame between the interpreter and the JIT code, to convert between the platform calling convention and the one used for the JIT's tail calls. We can compile just this with frame pointers, since it's very cheap to do so, and compile all of the other JIT code with the frame pointer register reserved. So when the shim calls into the JIT code, the frame pointer register remains valid, and the two frames appear as one to unwinders. |
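The per-stencil choice described above (a real frame for the shim, a merely reserved register for everything else) can be sketched as a small helper. This mirrors the expression in the `Tools/jit/_targets.py` diff; the function name `frame_pointer_flags` is hypothetical, since the diff builds the arguments inline, and `_LOAD_FAST` is used only as an example of a non-shim stencil name.

```python
def frame_pointer_flags(opname: str) -> list[str]:
    """Return the clang arguments selecting a frame-pointer mode for one stencil."""
    # Only the shim sets up a full frame ("all"). Every other stencil just
    # leaves the frame-pointer register untouched ("reserved"), so the value
    # the shim stored there stays valid while JIT code runs.
    mode = "all" if opname == "shim" else "reserved"
    return ["-Xclang", f"-mframe-pointer={mode}"]


for opname in ("shim", "_LOAD_FAST"):
    print(opname, frame_pointer_flags(opname))
```

Keeping the decision in one place makes it easy to see that exactly one stencil pays the save/restore cost per entry into JIT code.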
This may not be the case on Windows (I trust you on other platforms), where the IP is used to look up a table embedded in the executable (or registered dynamically) to decide how to unwind. It looks like the shim function is compiled once, which means it gets its own entry that will unwind correctly, but the jitted code still needs a way to unwind. Unless you generate/copy the shim as part of the rest of the function and it's all contiguous, then both frames really would just be a single frame (and you can probably also just copy the function entry from the shim to apply to the rest of the code, since the unwinding procedure will be identical). There's some relevant code in https://github.com/microsoft/python-etwtrace/blob/main/src/etwtrace/_etwtrace.c that does this for x64 and ARM64, if you prefer a real example. Note that |
Yeah, we haven't checked either approach on platforms other than Linux yet.
It's sort of a mix of the two. We JIT a copy of the shim for each trace, but if one trace jumps into another trace the shim frame from the first remains above it, and the second trace's shim is never used. So if trace A side-exits to B which side-exits to C, then the actual stack will be Not sure if that helps at all? |
Based on our experiments, this seems to be how other unwinding tools work on Linux (@pablogsal can correct me if I'm wrong). If DWARF unwind info for the IP has been registered with the tool (either from loading the executable or through explicit runtime APIs), it will use that. Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers. |
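A toy model of the fallback described above may help. It is only a sketch: the stack is a plain dictionary, the addresses are made up, no DWARF is actually interpreted, and every frame in the toy stack happens to keep a frame pointer so that the walk can continue. The label on each frame records which mechanism a tool would pick for that instruction pointer.

```python
from typing import Callable

WORD = 8  # bytes per stack slot on a 64-bit target (assumption)


def unwind(
    ip: int,
    fp: int,
    memory: dict[int, int],
    has_unwind_info: Callable[[int], bool],
    max_frames: int = 64,
) -> list[tuple[int, str]]:
    """Return (instruction pointer, mechanism) pairs, innermost frame first."""
    frames: list[tuple[int, str]] = []
    while len(frames) < max_frames:
        # Prefer registered unwind info; otherwise fall back to the frame pointer.
        mechanism = "dwarf" if has_unwind_info(ip) else "frame-pointer"
        frames.append((ip, mechanism))
        # Conventional layout: [fp] holds the caller's saved frame pointer,
        # [fp + WORD] holds the return address into the caller.
        saved_fp = memory.get(fp)
        return_address = memory.get(fp + WORD)
        if not saved_fp or not return_address:
            break  # reached the outermost frame (or unreadable memory)
        ip, fp = return_address, saved_fp
    return frames


# Made-up stack: JIT code (no unwind info) called from two interpreter frames.
stack = {
    0x1000: 0x1040, 0x1008: 0x400100,
    0x1040: 0x1080, 0x1048: 0x400200,
    0x1080: 0, 0x1088: 0,
}
in_executable = lambda ip: 0x400000 <= ip < 0x500000  # "has DWARF" in this toy

frames = unwind(ip=0x70000010, fp=0x1000, memory=stack, has_unwind_info=in_executable)
for ip, mechanism in frames:
    print(hex(ip), mechanism)
```

The point of the sketch is the first frame: the JIT address has no registered unwind info, so the only way past it is the saved frame pointer and return address.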
This is fine, I'm sure. It makes it a little harder to figure out exactly which Python code led to the code being executed, but stack unwinding should work (and unwinding is far more important).
I'm 95% sure Windows doesn't fall back to frame pointers, because as I mentioned earlier they're not even enabled by the system compilers. I'm only 50% sure that registering (through those runtime APIs) that a function does use a frame pointer will even work, but at least in that case I'm prepared to report it to the OS as a bug and get it fixed. |
I spent some time investigating windows and trying stuff. Seems that there are two options:
So very similar to |
Our plan now is:
|
For whatever horrible reason, (And it's probably not safe to actually grow a table, since you can't just append, you have to sort the table yourself. But you can provide the entire table and then never grow it. It's just that this is the only API that actually tells anyone that you added a table.) |
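The table-based lookup discussed in the comments above (instruction pointer to table entry to unwind procedure) depends on the table being sorted by start address, which is why appending an entry is not enough. A minimal sketch of that idea follows; `Entry` and `FunctionTable` are hypothetical names for illustration and are not the Windows API.

```python
import bisect
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass(order=True, frozen=True)
class Entry:
    begin: int                                   # first address covered
    end: int                                     # one past the last address covered
    unwind_info: str = field(compare=False)      # placeholder for real unwind data


class FunctionTable:
    """An immutable table, sorted once up front and never grown afterwards."""

    def __init__(self, entries: Iterable[Entry]) -> None:
        self._entries = sorted(entries)
        self._begins = [e.begin for e in self._entries]

    def lookup(self, ip: int) -> Optional[Entry]:
        # Binary search for the last entry starting at or before ip.
        i = bisect.bisect_right(self._begins, ip) - 1
        if i >= 0 and ip < self._entries[i].end:
            return self._entries[i]
        return None


# Entries are supplied out of order on purpose; the constructor sorts them.
table = FunctionTable([
    Entry(0x2000, 0x2100, "trace"),
    Entry(0x1000, 0x1080, "shim"),
])
print(table.lookup(0x1010))
print(table.lookup(0x1080))
```

Providing the whole table at construction time and never mutating it matches the "provide the entire table and then never grow it" approach mentioned above.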
TLDR
This is a lot of text because the issue is complex but if you want the gist:
To avoid breaking a lot of tools that rely on unwinding, I propose to fix this by compiling the JIT stencils with frame pointers, which has a trivial maintenance cost (2 lines) and only involves a 2% hit on speed when using the JIT, while making almost all existing debuggers and profilers just work™ in the presence of the JIT.
Although this sadly doesn't fix everything, I think it is the best compromise I can find (and we are quite lucky to have it, as normally fixing this is a nightmare - as in thousands of complex lines of code and 50% slowdown nightmare).
The issue + proposal
CPython's JIT compiler must provide robust stack unwinding support to maintain compatibility with the Python ecosystem's debugging and profiling tools. This requirement is particularly critical given Python's reliance on native extensions written in C, C++, and Rust. As more performance-critical code moves to these native implementations, the ability to properly unwind through mixed Python and native frames becomes fundamental for effective debugging and profiling. This capability is even more crucial with the introduction of both JIT compilation and free-threaded modes, which significantly increase the complexity of runtime behavior. Without proper unwinding support, debugging the issues that can appear in the presence of these new modes becomes extremely challenging – developers would be unable to generate meaningful stack traces, hampering their ability to understand where and why their programs failed. Consider debugging a deadlock involving free-threaded code and a Rust extension, or investigating a crash in native code called from JIT-compiled Python – without proper unwinding support, error reports would be incomplete or misleading, making production issues significantly harder to diagnose and fix. The ability to get complete, accurate stack traces in these scenarios is not just a convenience; it's essential for maintaining production applications where Python increasingly interacts with native code through multiple execution modes.
Profiling is equally affected – modern performance analysis tools like py-spy, Austin, or eBPF-based solutions need to understand the complete call stack, including JIT-compiled code, native extensions, and regular Python frames, to provide accurate performance insights.
Some of the popular tools that use native unwinding in some way or form:
This doesn't include the considerable amount of custom tools that are not open source either in the debugging or profiling space.
Unwinding Libraries and Their Capabilities
Several libraries provide stack unwinding capabilities, each with its own strengths and limitations.
_U_dyn_register
__register_frame
backtrace()
__jit_debug_register_code
Implementation Plan for Stack Unwinding Support
After extensive experimentation and analysis (believe me, this was a lot of very difficult research work), @brandtbucher and I found that since we have preserve_none and LLVM 19, we can just compile the JIT stencils with frame pointers, and that makes most tools just work. Based on this, I propose implementing frame pointer support as the primary strategy for stack unwinding in CPython's JIT. This approach has proven to be the most pragmatic and effective solution, providing broad compatibility with existing tools while maintaining reasonable performance characteristics. I do think we got very lucky that this fixes most of the tools, since adding unwinding support for JITs is otherwise very challenging (see the following sections).
Frame pointer support can be enabled through a minimal change to the JIT compiler flags:
This simple change enables compatibility with a wide range of tools:
Most of the popular tools will just work with this change, instead of us having to implement support for many different kinds of unwinding information (which is a nightmare).
Performance Impact
https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20241114-3.14.0a1%2B-925b70b-JIT/bm-20241114-linux-x86_64-brandtbucher-justin_frame_pointer-3.14.0a1%2B-925b70b-vs-base.md
The performance impact of this change has been carefully measured:
This modest performance cost is significantly outweighed by the expected JIT performance improvements. Moreover, the maintenance burden is minimal compared to alternative approaches like implementing custom unwinding support for each tool.
Additional Optimization
We can further enhance compatibility by generating eh-frames in the JIT stencils and calling __register_frame_table (a bulk version of __register_frame). Benchmarks show this additional feature has neutral performance impact while enabling native unwinder support. This makes the solution even more robust without additional overhead.
Given these considerations - the minimal implementation complexity, broad tool compatibility, reasonable performance characteristics, and the ability to extend support to native unwinders - frame pointer support represents the optimal path forward for CPython's JIT implementation.
Why this must be activated by default
The decision to enable frame pointers by default is driven by two critical production requirements.
This must be enabled by default because most Python users don't compile Python themselves – they install it through package managers or tools like uv. Unless frame pointers are enabled by default, these critical debugging and profiling capabilities won't be available in most Python installations, severely impacting the quality of bug reports to both CPython and C extensions. Given the minimal performance impact of about 2%, frame pointers should be enabled by default, following the principle of being "safe by default." While we should provide the option to disable them, this should be a conscious decision made by end users who understand they're trading away the ability to properly profile and debug their applications in production – not a choice made by intermediate distributors or package managers.
Additional information
Using frame pointers aligns with a broader industry trend where major tech companies and Linux distributions are reverting previous optimization decisions and re-enabling frame pointers. Companies like Meta and Google now compile their entire software stack, including their Python installations, with frame pointers enabled. This shift is driven by the recognition that observability and debugging capabilities in production environments far outweigh the minor performance impact of frame pointers. Ubuntu has also adopted this approach, now compiling virtually all packages with frame pointers enabled (with Python being a notable exception we aim to address). This industry-wide move reflects a fundamental reality: in modern production environments, the ability to profile, monitor, and debug applications effectively is crucial, and frame pointers provide the most reliable and universal way to achieve this. The performance cost, typically around 1-5%, is normally considered to be a worthwhile trade-off for the improved observability they enable. This is particularly relevant for server workloads where understanding performance characteristics and debugging production issues is far more valuable than the small overhead frame pointers introduce. The fact that major tech companies maintain this policy even for performance-critical services demonstrates that the benefits of comprehensive profiling and debugging support outweigh the minimal performance impact. Some links:
https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-by-default
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
https://www.polarsignals.com/blog/posts/2023/12/13/embracing-frame-pointers-in-ubuntu-24-04-lts
Background on Stack Unwinding Support Requirements
Adding JIT compilation to CPython requires careful consideration of stack unwinding support. Without proper unwinding capabilities, we risk breaking compatibility with essential development tools that Python developers rely on daily. To understand the scope of this requirement, we need to examine the landscape of tools that depend on stack unwinding.
Tools Requiring Stack Unwinding
Stack unwinding is a critical capability used by three major categories of tools in the Python ecosystem:
Debuggers (GDB, LLDB, pystack) rely on unwinding to show developers where their program is currently executing. These tools deal with remote processes or core files.
Profilers use unwinding to understand program performance.
C++ Exception Handling depends on unwinding information to:
Unwinding Mechanisms
Stack unwinding occurs in two fundamentally different contexts:
In-process unwinding:
Out-of-process unwinding:
Types of Analysis Tools
The tools that perform unwinding can be categorized by their data collection method:
Tracing Profilers:
Statistical Profilers:
Unwinding Libraries and Their Capabilities
Several libraries provide stack unwinding capabilities, each with its own strengths and limitations. The strategies to support unwinding for JIT compiled code are:
Allow frame pointers: makes libdw, gdb, lldb and libunwind work. libunwind and libdw work in remote mode, including core files.
Per unwinder support for dynamically generated code using DWARF (only local mode and only if it supports it - libdw has no support):
Any other library has no support.
Other JITs
Most JIT compilers not only use frame pointers to make profiling and debugging possible, but also implement either C++ exception unwinding via eh_frames or even deeper support for gdb and other debuggers. Some links:
LuaJIT: https://github.com/LuaJIT/LuaJIT/blob/fe71d0fb54ceadfb5b5f3b6baf29e486d97f6059/src/lj_err.c#L620
DotNet: https://github.com/dotnet/coreclr/pull/468/files
DotNet remote unwinding: https://github.com/dotnet/runtime/blob/d9855f27f7d5d5138452009e9f0dd7a81c5c8b74/src/coreclr/pal/src/exception/remote-unwind.cpp
Julia (frame pointers and __register_frame): https://github.com/JuliaLang/julia/blob/caa2f7d52b430f50c8038a7f6766edba28a3fb65/src/debuginfo.cpp#L554
LLVM's ORC jit (frame pointers, __register_frame and much more): https://github.com/llvm/llvm-project/blob/935d753c6dca0cd9bc5ea14fde5b00386ebcc5be/compiler-rt/lib/orc/elfnix_platform.cpp#L191
V8: Frame pointers, __register_frame and much more: https://www.kvakil.me/posts/2022-10-17-ustackjs-unwind-javascript-stacks-in-ebpf.html, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/gdb-jit.cc#L1674, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/unwinder.cc#L100, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/eh-frame.cc