Supporting stack unwinding in the JIT compiler #126910

Open
pablogsal opened this issue Nov 16, 2024 · 28 comments
Assignees
Labels
3.14 new features, bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-JIT

Comments

@pablogsal
Member

pablogsal commented Nov 16, 2024

TLDR

This is a lot of text because the issue is complex but if you want the gist:

To avoid breaking a lot of tools that rely on unwinding, I propose to fix this by compiling the JIT stencils with frame pointers, which has a trivial maintenance cost (2 lines) and only involves a 2% hit on speed when using the JIT, while making almost all existing debuggers and profilers just work™ in the presence of the JIT.

Although this sadly doesn't fix everything, I think it is the best compromise that I can find (and we are quite lucky to have it, as fixing this is normally a nightmare: thousands of complex lines of code and a 50% slowdown).


The issue + proposal

CPython's JIT compiler must provide robust stack unwinding support to maintain compatibility with the Python ecosystem's debugging and profiling tools. This requirement is particularly critical given Python's reliance on native extensions written in C, C++, and Rust. As more performance-critical code moves to these native implementations, the ability to properly unwind through mixed Python and native frames becomes fundamental for effective debugging and profiling. This capability is even more crucial with the introduction of both JIT compilation and free-threaded modes, which significantly increase the complexity of runtime behavior. Without proper unwinding support, debugging the issues that can appear in the presence of these new modes becomes extremely challenging – developers would be unable to generate meaningful stack traces, hampering their ability to understand where and why their programs failed. Consider debugging a deadlock between free-threaded code and a Rust extension, or investigating a crash in native code called from JIT-compiled Python – without proper unwinding support, error reports would be incomplete or misleading, making production issues significantly harder to diagnose and fix. The ability to get complete, accurate stack traces in these scenarios is not just a convenience; it's essential for maintaining production applications where Python increasingly interacts with native code through multiple execution modes.

Profiling is equally affected – modern performance analysis tools like py-spy, Austin, or eBPF-based solutions need to understand the complete call stack, including JIT-compiled code, native extensions, and regular Python frames, to provide accurate performance insights.

Some of the popular tools that use native unwinding in some way or form:

  • Austin (statistical profiler): uses libunwind
  • py-spy (statistical profiler): uses libunwind
  • memray: uses the C++ unwinder on macOS and libunwind on Linux
  • pystack: uses elfutils (as it is the only one supporting core files)
  • eBPF unwinders: normally rely on frame pointers or on the C++ unwinder (https://israelo.io/blog/ehframes/, https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-based-stack-walking-using-ebpf)
  • Additionally, most JIT compilers implement not only frame pointers to make it possible to profile and debug, but also either C++ exception unwinder support via eh_frames or even more hardcore support for gdb and other debuggers
  • GDB and hdb-helpers: a combination of libdw, libunwind or custom
  • perf: frame pointers, libunwind or libdw
  • LLDB: libunwind or custom
  • valgrind: uses a custom unwinder based on libdw

This doesn't include the considerable amount of custom tools that are not open source either in the debugging or profiling space.

Unwinding Libraries and Their Capabilities

Several libraries provide stack unwinding capabilities, each with its own strengths and limitations.

| Library | Platforms | Local Unwinding | Remote Unwinding | Core Files | Registration API | Remote Registration Support | Main Users |
|---|---|---|---|---|---|---|---|
| libunwind | Linux only | Yes (fast) | Yes | No | _U_dyn_register | No | Profilers, debuggers |
| Native Unwinder (LLVM/GDB) | Linux, MacOS | Yes (optimized) | No | No | __register_frame | No | C++ exceptions, backtrace() |
| libdw (elfutils) | Linux only | Yes | Yes | Yes | None (ELF parsing) | No | GDB, debuggers like pystack |
| GDB/LLDB built-in | Cross-platform | Yes | Yes | Yes | __jit_debug_register_code | No | Debug tools |

Implementation Plan for Stack Unwinding Support

After extensive experimentation and analysis (believe me, this was a lot of very difficult research work), @brandtbucher and I found that, since we have preserve_none and LLVM 19, we can just compile the JIT stencils with frame pointers, and that makes most tools just work. Based on this, I propose implementing frame pointer support as the primary strategy for stack unwinding in CPython's JIT. This approach has proven to be the most pragmatic and effective solution, providing broad compatibility with existing tools while maintaining reasonable performance characteristics. I do think we got very lucky that this fixes most of the tools, since adding unwinding support for JITs is otherwise very challenging (see the following sections).

Frame pointer support can be enabled through a minimal change to the JIT compiler flags:

diff --git a/Tools/jit/_targets.py b/Tools/jit/_targets.py
index d8dce0a905c..7c3b2e5aab7 100644
--- a/Tools/jit/_targets.py
+++ b/Tools/jit/_targets.py
@@ -135,6 +135,8 @@ async def _compile(
             # Don't call stack-smashing canaries that we can't find or patch:
             "-fno-stack-protector",
             "-std=c11",
+            "-fno-omit-frame-pointer",
+            "-mno-omit-leaf-frame-pointer",
             "-o",
             f"{o}",
             f"{c}",

This simple change enables compatibility with a wide range of tools:

  • libunwind (both local and remote unwinding)
  • libdw (supporting core file analysis, remote and local unwinding)
  • GDB and LLDB
  • eBPF-based unwinders

Most of the popular tools will just work with this change, instead of us having to implement support for tons of different kinds of unwinding information (which is a nightmare).
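To illustrate why frame pointers alone are enough for many unwinders, here is a minimal sketch of a frame-pointer walk over a simulated stack. It assumes the usual x86-64 layout (caller's frame pointer stored at `[fp]`, return address at `[fp + 8]`); the addresses and the `walk_frame_pointers` helper are hypothetical and only stand in for what a real unwinder reads from the target process or a core file.

```python
WORD = 8  # bytes per stack slot on a 64-bit target


def walk_frame_pointers(memory, fp, max_frames=64):
    """Return the return addresses found by following the saved-fp chain.

    `memory` maps an address to the 64-bit value stored there. Each frame
    stores the caller's frame pointer at [fp] and the return address at
    [fp + WORD]; a saved frame pointer of 0 marks the outermost frame.
    """
    return_addresses = []
    while fp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[fp + WORD])
        fp = memory[fp]  # hop to the caller's frame
    return return_addresses


# Hypothetical three-frame stack (all addresses are made up).
memory = {
    0x7000: 0x7100, 0x7008: 0x401000,  # innermost frame
    0x7100: 0x7200, 0x7108: 0x402000,  # middle frame
    0x7200: 0x0,    0x7208: 0x403000,  # outermost frame
}
frames = walk_frame_pointers(memory, fp=0x7000)
print([hex(addr) for addr in frames])
```

Nothing in the walk needs DWARF or an ELF file for the code being walked through, which is why it keeps working when some of the frames belong to JIT-generated code.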

Performance Impact

https://github.com/faster-cpython/benchmarking-public/blob/main/results/bm-20241114-3.14.0a1%2B-925b70b-JIT/bm-20241114-linux-x86_64-brandtbucher-justin_frame_pointer-3.14.0a1%2B-925b70b-vs-base.md

The performance impact of this change has been carefully measured:

  • AMD64: Approximately 2% overhead, as shown in recent benchmarks
  • ARM64 and macOS: Even lower overhead due to the presence of dedicated link registers

This modest performance cost is significantly outweighed by the expected JIT performance improvements. Moreover, the maintenance burden is minimal compared to alternative approaches like implementing custom unwinding support for each tool.

Additional Optimization

We can further enhance compatibility by generating eh-frames in the JIT stencils and calling __register_frame_table (a bulk version of __register_frame). Benchmarks show this additional feature has neutral performance impact while enabling native unwinder support. This makes the solution even more robust without additional overhead.

Given these considerations - the minimal implementation complexity, broad tool compatibility, reasonable performance characteristics, and the ability to extend support to native unwinders - frame pointer support represents the optimal path forward for CPython's JIT implementation.

Why this must be activated by default

The decision to enable frame pointers by default is driven by two critical production requirements.

  • When applications crash or hang, tools like pystack need to analyze program state without the luxury of reproduction – you cannot simply "run it again with debug options enabled." You need to unwind the stack either remotely or from a core file.
  • Performance profiling in production requires safely inspecting program state from another process or the kernel. Tools like Austin, py-spy, and eBPF-based profilers need to periodically sample stack traces without interrupting the target process. Both scenarios require frame pointers to work reliably.

This must be enabled by default because most Python users don't compile Python themselves – they install it through package managers or tools like uv. Unless frame pointers are enabled by default, these critical debugging and profiling capabilities won't be available in most Python installations, severely impacting the quality of bug reports to both CPython and C extensions. Given the minimal performance impact of 2%, frame pointers should be enabled by default, following the principle of being "safe by default." While we should provide the option to disable them, this should be a conscious decision made by end users who understand they're trading away the ability to properly profile and debug their applications in production – not a choice made by intermediate distributors or package managers.

Additional information

Using frame pointers aligns with a broader industry trend where major tech companies and Linux distributions are reverting previous optimization decisions and re-enabling frame pointers. Companies like Meta and Google now compile their entire software stack, including their Python installations, with frame pointers enabled. This shift is driven by the recognition that observability and debugging capabilities in production environments far outweigh the minor performance impact of frame pointers. Ubuntu has also adopted this approach, now compiling virtually all packages with frame pointers enabled (with Python being a notable exception we aim to address). This industry-wide move reflects a fundamental reality: in modern production environments, the ability to profile, monitor, and debug applications effectively is crucial, and frame pointers provide the most reliable and universal way to achieve this. The performance cost, typically around 1-5%, is normally considered to be a worthwhile trade-off for the improved observability they enable. This is particularly relevant for server workloads where understanding performance characteristics and debugging production issues is far more valuable than the small overhead frame pointers introduce. The fact that major tech companies maintain this policy even for performance-critical services demonstrates that the benefits of comprehensive profiling and debugging support outweigh the minimal performance impact. Some links:

https://ubuntu.com/blog/ubuntu-performance-engineering-with-frame-pointers-by-default
https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html
https://www.polarsignals.com/blog/posts/2023/12/13/embracing-frame-pointers-in-ubuntu-24-04-lts


Background on Stack Unwinding Support Requirements

Adding JIT compilation to CPython requires careful consideration of stack unwinding support. Without proper unwinding capabilities, we risk breaking compatibility with essential development tools that Python developers rely on daily. To understand the scope of this requirement, we need to examine the landscape of tools that depend on stack unwinding.

Tools Requiring Stack Unwinding

Stack unwinding is a critical capability used by three major categories of tools in the Python ecosystem:

  • Debuggers (GDB, LLDB, pystack) rely on unwinding to show developers where their program is currently executing. These tools deal with remote processes or core files.

  • Profilers use unwinding to understand program performance.

  • C++ Exception Handling depends on unwinding information to:

    • Propagate exceptions through the stack
    • Clean up resources during stack unwinding

Unwinding Mechanisms

Stack unwinding occurs in two fundamentally different contexts:

  • In-process unwinding:

    • Code examines its own stack
    • Used by tracing profilers and exception handlers
    • Has direct access to program memory and CPU registers
    • Primarily used for live analysis
  • Out-of-process unwinding:

    • External program examines another program's stack
    • Used by debuggers and statistical profilers
    • Includes core dump analysis

Types of Analysis Tools

The tools that perform unwinding can be categorized by their data collection method:

  • Tracing Profilers:

    • Run inside the target process
    • Instrument function entries and exits
    • Collect detailed execution information
    • Higher overhead but more precise
    • Rely on local unwinding capabilities
    • Examples: cProfile, memray
  • Statistical Profilers:

    • Run outside the target process
    • Sample program state periodically
    • Much lower overhead
    • Suitable for production use
    • Growing eBPF ecosystem:
      • Enables efficient system-wide analysis
      • Becoming standard for production monitoring
      • Requires reliable stack unwinding support
      • Examples: perf, py-spy, bcc tools
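As a rough illustration of the "instrument function entries and exits" approach used by tracing profilers, here is a minimal in-process sketch built on the standard `sys.setprofile` hook. The `trace_calls` helper and the `fib` workload are hypothetical names chosen for the example; real tools such as cProfile record much more than call counts.

```python
import sys
from collections import Counter


def trace_calls(func, *args):
    """Run func(*args) and count Python-level function entries by name."""
    counts = Counter()

    def hook(frame, event, arg):
        # "call" fires on entry to a Python function; "return" fires on exit.
        if event == "call":
            counts[frame.f_code.co_name] += 1

    sys.setprofile(hook)
    try:
        func(*args)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return counts


def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)


counts = trace_calls(fib, 5)
print(counts["fib"])
```

A statistical profiler does the opposite: it leaves the target untouched and periodically unwinds its stack from outside, which is where native unwinding through JIT frames becomes a hard requirement.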

Unwinding Libraries and Their Capabilities

Several libraries provide stack unwinding capabilities, each with its own strengths and limitations. The strategies to support unwinding for JIT compiled code are:

  • Allow frame pointers: makes libdw, gdb, lldb and libunwind work. libunwind and libdw work in remote mode, including core files.

  • Per unwinder support for dynamically generated code using DWARF (only local mode and only if it supports it - libdw has no support):

    • libunwind: Construct ELF .debug_frame data structures, and table_entry data structures, and a unw_dyn_table_info data structure, and a unw_dyn_info_t structure, then call _U_dyn_register.
    • C++ exception unwinder: Construct ELF .eh_frame data structures, then call __register_frame
    • GDB: Construct a full in-memory ELF object, manually maintain a doubly-linked list of all such objects in a global variable called __jit_debug_descriptor, and call a global function called __jit_debug_register_code when this list is changed

No other library has any support for this.

Other JITs

Most JIT compilers implement not only frame pointers to make it possible to profile and debug, but also implement either C++ exception unwinder via eh_frames or even more hardcore support for gdb and other debuggers. Some links:

LuaJIT: https://github.com/LuaJIT/LuaJIT/blob/fe71d0fb54ceadfb5b5f3b6baf29e486d97f6059/src/lj_err.c#L620
DotNet: https://github.com/dotnet/coreclr/pull/468/files
DotNet remote unwinding: https://github.com/dotnet/runtime/blob/d9855f27f7d5d5138452009e9f0dd7a81c5c8b74/src/coreclr/pal/src/exception/remote-unwind.cpp
Julia (frame pointers and __register_frame): https://github.com/JuliaLang/julia/blob/caa2f7d52b430f50c8038a7f6766edba28a3fb65/src/debuginfo.cpp#L554
LLVM's ORC jit (frame pointers, __register_frame and much more): https://github.com/llvm/llvm-project/blob/935d753c6dca0cd9bc5ea14fde5b00386ebcc5be/compiler-rt/lib/orc/elfnix_platform.cpp#L191
V8: Frame pointers, __register_frame and much more: https://www.kvakil.me/posts/2022-10-17-ustackjs-unwind-javascript-stacks-in-ebpf.html, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/gdb-jit.cc#L1674, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/unwinder.cc#L100, https://github.com/v8/v8/blob/b70457462cb22753a011096c9c9be20275dc4437/src/diagnostics/eh-frame.cc

@Eclips4 Eclips4 added topic-JIT interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement labels Nov 16, 2024
@pablogsal pablogsal removed the type-feature A feature request or enhancement label Nov 16, 2024
@brandtbucher
Member

brandtbucher commented Nov 16, 2024

Just to clarify for anyone reading, it is sufficient to only enable frame pointers in JIT code, not necessarily the entire interpreter.

After spending a week with @pablogsal digging deep into this issue and prototyping and evaluating the different options available to us (including doing nothing), my personal opinion is that this issue is fixing a behavioral change, not adding a feature. It's also my personal opinion that the two-line diff above and the 2% hit are a good compromise that unblocks adoption of the JIT, and I think it's the best path forward. We have explored strategies to emit DWARF as well; copy-and-patch makes harvesting and emitting correct DWARF not-too-difficult, but I also see it as an optional, lower priority nice-to-have.

Frame pointers are a convenient escape hatch; for many tools that still don't work, we can at least point at frame pointers and argue that what the tools want should be possible using them. Currently, what they want is impossible without disabling the JIT... and we want people to turn it on!

Turning on frame pointers in the naive way above also does not preclude the possibility of improving the way we emit frame pointer code in the future should we decide it's worth clawing back that 2% in exchange for additional implementation complexity; such a patch can stand on its own merits. It also shouldn't preclude a runtime option to opt in or out of them using an environment variable (again, though, I'm not sure this is worth additional complexity right now to compile two versions of every stencil to make this work, but it's certainly possible).

I'd just like to thank @pablogsal for patiently educating me on this deep, dark, expansive area (and the people who rely on it) and working with me to explore solutions and compromise on something workable.

@brandtbucher brandtbucher added the 3.14 new features, bugs and security fixes label Nov 16, 2024
@brandtbucher
Member

Also, it's kind of neat that we've pretty much followed through with what PEP 744 had to say about this issue:

Since the code templates emitted by the JIT are compiled by Clang, it may be possible to allow JIT frames to be traced through by simply modifying the compiler flags to use frame pointers more carefully. It may also be possible to harvest and emit the debugging information produced by Clang. Neither of these ideas have been explored very deeply.

While this is an issue that should be fixed, fixing it is not a particularly high priority at this time. This is probably a problem best explored by somebody with more domain expertise in collaboration with those maintaining the JIT, who have little experience with the inner workings of these tools.

@pablogsal
Member Author

pablogsal commented Nov 16, 2024

Also, it's kind of neat that we've pretty much followed through with what PEP 744 had to say about this issue:

Maybe except this part 😆 :

Neither of these ideas have been explored very deeply.

Now all of these ideas have been explored deeply. We have been through some stuff 😨

@python python deleted a comment from maleycl Nov 16, 2024
@brandtbucher
Member

Benchmarks on JIT frame pointers are in:

  • aarch64-apple-darwin: 2.1% slower
  • aarch64-unknown-linux-gnu: 6.1% slower
  • x86_64-unknown-linux-gnu: 2.1-3.1% slower
  • x86_64-pc-windows-msvc: 2.6% slower
  • i686-pc-windows-msvc: 2.2% slower

Looks like a 2-3% performance hit, with the clear outlier of aarch64-unknown-linux-gnu at around a 6% hit.

@python python deleted a comment from maleycl Nov 16, 2024
@markshannon
Member

According to our profiling, jitted code is only ~14% of the execution time, so a 3% slowdown is a ~20% slowdown in the jitted code.
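The arithmetic behind that estimate can be sketched as follows, assuming the whole overall slowdown is incurred inside JIT code (the function name is hypothetical; the 3% and 14% figures are the ones quoted above).

```python
def slowdown_in_jit_code(overall_slowdown, jit_time_fraction):
    """Slowdown of the JIT code alone, if it accounts for the entire overall slowdown."""
    return overall_slowdown / jit_time_fraction


estimate = slowdown_in_jit_code(overall_slowdown=0.03, jit_time_fraction=0.14)
print(f"{estimate:.1%}")
```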

I think that's too slow.

@pablogsal
Member Author

pablogsal commented Nov 18, 2024

Then what do you propose? I don't think we have an alternative here. This is just the cost of not breaking existing tools and honestly I think we are extremely lucky to have a way to fix it that's just 2 lines with 2% hit.

@cfbolz
Contributor

cfbolz commented Nov 18, 2024

Amazing research work for a super important feature. Thank you @pablogsal and @brandtbucher for this effort! I'm a strong +1.

@diegorusso
Contributor

In the text of the issue you state:

ARM64 and macOS: Even lower overhead due to the presence of dedicated link registers

and later in the comments:

aarch64-apple-darwin: 2.1% slower
aarch64-unknown-linux-gnu: 6.1% slower

which contradicts what you said earlier. Also that outlier sounds very odd to me as all the AArch64 platforms should behave similarly.

@pablogsal
Member Author

which contradicts what you said earlier. Also that outlier sounds very odd to me as all the AArch64 platforms should behave similarly.

The original sentence was based on some test runs I did originally, but the second is a full pyperformance run. I also agree it looks off because macOS is using the same instruction set.

We need to run again and confirm, because it makes no sense to me and also contradicts the small tests I did separately.

@brandtbucher
Member

@zooba
Member

zooba commented Nov 18, 2024

I'm +1 on this. Frame pointers are basically the only viable way to handle stack unwinding here, and it's absolutely worth the cost. At the very least, we would need a runtime option to enable them (i.e. without recompiling CPython) because there are going to be essential scenarios that need them.

FWIW, MSVC doesn't allow enabling frame pointers for x64 or ARM64, but the function unwinding tables allow specifying them. I'm not sure what LLVM does, but it should be possible to have unwinding via a frame pointer work on these platforms too.

@pablogsal
Member Author

At the very least, we would need a runtime option to enable them (i.e. without recompiling CPython) because there are going to be essential scenarios that need them.

The idea is to have them activated by default (safety/debug should be the default), with maybe an env variable or some other way to opt out if you don't care, but this should be set by the end user with knowledge of what the consequences are. Another reason we want this by default is that we don't want to make it impossible for users to send us backtraces when the interpreter or a C extension crashes or hangs, because reproducing this can be quite challenging (all of this is covered in the text).

@markshannon
Member

(safety/debug should be the default)

In what way is using frame pointers safer?

@markshannon
Member

Then what do you propose?

The interpreter is compiled without using the frame pointer. So unwinding clearly doesn't need frame pointers.
So use whatever is used to unwind through the interpreter.

@pablogsal
Member Author

(safety/debug should be the default)

In what way is using frame pointers safer?

Because it allows debuggers to work.

@markshannon
Member

What do you mean by "safety" in this context?

@brandtbucher
Member

A GitHub issue isn't good for real-time back-and-forth chat.

@pablogsal
Member Author

pablogsal commented Nov 18, 2024

What do you mean by "safety" in this context?

That if your application crashes, hangs or generates a core, you can actually use a debugger on it without getting wrong stacks because the JIT makes debuggers choke.

@pablogsal
Member Author

pablogsal commented Nov 18, 2024

Then what do you propose?

The interpreter is compiled without using the frame pointer. So unwinding clearly doesn't need frame pointers.

It needs frame pointers if you have a JIT compiler in the middle because the JIT doesn't have DWARF (debug information). This is explained in the issue.

So use whatever is used to unwind through the interpreter.

You cannot, because the JIT doesn't have DWARF and a backing ELF file that unwinders can use. It's just a random string of bytes.

@pablogsal
Member Author

pablogsal commented Nov 18, 2024

A GitHub issue isn't good for real-time back-and-forth chat.

Agreed, let's chat on Wednesday.

@pablogsal pablogsal changed the title Unwinding support for the JIT compiler Supporting stack unwinding in the JIT compiler Nov 19, 2024
@brandtbucher
Member

brandtbucher commented Nov 20, 2024

Some good news: with this fix (that I plan to upstream) for LLVM's existing "reserved frame pointers" functionality...

diff --git a/llvm/lib/Target/X86/X86RegisterInfo.cpp b/llvm/lib/Target/X86/X86RegisterInfo.cpp
index 50db211c99d8..9b8652b7e302 100644
--- a/llvm/lib/Target/X86/X86RegisterInfo.cpp
+++ b/llvm/lib/Target/X86/X86RegisterInfo.cpp
@@ -563,7 +563,7 @@ BitVector X86RegisterInfo::getReservedRegs(const MachineFunction &MF) const {
     Reserved.set(SubReg);
 
   // Set the frame-pointer register and its aliases as reserved if needed.
-  if (TFI->hasFP(MF)) {
+  if (TFI->hasFP(MF) || MF.getTarget().Options.FramePointerIsReserved(MF)) {
     if (MF.getInfo<X86MachineFunctionInfo>()->getFPClobberedByInvoke())
       MF.getContext().reportError(
           SMLoc(),

...and this change to CPython (replacing the 2-line change @pablogsal suggests above)...

diff --git a/Tools/jit/_targets.py b/Tools/jit/_targets.py
index d8dce0a905c..4e898a86f86 100644
--- a/Tools/jit/_targets.py
+++ b/Tools/jit/_targets.py
@@ -121,6 +121,8 @@ async def _compile(
             f"-I{CPYTHON / 'Python'}",
             f"-I{CPYTHON / 'Tools' / 'jit'}",
             "-O3",
+            "-Xclang",
+            f"-mframe-pointer={'all' if opname == 'shim' else 'reserved'}",
             "-c",
             # This debug info isn't necessary, and bloats out the JIT'ed code.
             # We *may* be able to re-enable this, process it, and JIT it for a

...frame-pointer-based unwinding works, with 0% slowdown on benchmarks!


More context:

There are two reasons why frame pointers are slow: you lose a register, and you must save and restore your caller's frame pointer state at the beginning and end of each function. It's the latter that's causing a slowdown for us; we compile each uop as its own function (which tail-calls into the next) and concatenate the bodies. So incrementing a fast local, x += 1, should look something like this:

  • push and incref x
  • push 1
  • guard that x is an int
  • add them
  • decref and store x

With frame pointers, this instead becomes:

  • save and set rbp
  • push and incref x
  • restore rbp
  • save and set rbp
  • push 1
  • restore rbp
  • save and set rbp
  • guard that x is an int
  • restore rbp
  • save and set rbp
  • add them
  • restore rbp
  • save and set rbp
  • decref and store x
  • restore rbp

All of the frame pointer shuffling obviously isn't necessary; we really only need to do it once at the beginning, and once at the end. But some templates have multiple "returns", so finding and removing all of these is hard. And if we compile without frame pointers, the compiler uses the frame pointer register as scratch space, and clobbers whatever value we put there manually.

However, LLVM does have (broken on main, seemingly fixable with the change I found above) functionality to "reserve" the frame pointer register, meaning it pretends it isn't even there. This is perfect, since it means that as long as we know the frame pointer register is a valid value on entry to our concatenated sequence of code, it will remain valid throughout.

Whenever we call into JIT code, we already push a "shim" frame between the interpreter and the JIT code, to convert between the platform calling convention and the one used for the JIT's tail calls. We can compile just this with frame pointers, since it's very cheap to do so, and compile all of the other JIT code with the frame pointer register reserved. So when the shim calls into the JIT code, the frame pointer register remains valid, and the two frames appear as one to unwinders.
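The difference between the naive approach and the shim-plus-reserved-register approach described above can be sketched by counting frame pointer operations for the x += 1 example. This is only a toy model: the step names come from the lists above, and the two helper functions are hypothetical.

```python
UOPS = [
    "push and incref x",
    "push 1",
    "guard that x is an int",
    "add them",
    "decref and store x",
]


def naive_frame_pointers(uops):
    """Each uop is compiled as its own function, so each saves and restores rbp."""
    steps = []
    for uop in uops:
        steps += ["save and set rbp", uop, "restore rbp"]
    return steps


def reserved_frame_pointer(uops):
    """Only the shim touches rbp; uop bodies treat the register as reserved."""
    return ["save and set rbp (shim)", *uops, "restore rbp (shim)"]


def frame_pointer_overhead(steps, uops):
    """Number of steps that exist only to maintain the frame pointer."""
    return len(steps) - len(uops)


naive = frame_pointer_overhead(naive_frame_pointers(UOPS), UOPS)
reserved = frame_pointer_overhead(reserved_frame_pointer(UOPS), UOPS)
print(naive, reserved)
```

In the naive version the overhead grows with the length of the trace, while with a reserved register it stays constant per entry into JIT code, which would be consistent with the benchmark result reported above.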

@zooba
Member

zooba commented Nov 20, 2024

and the two frames appear as one to unwinders.

This may not be the case on Windows (I trust you on other platforms), where the IP is used to look up a table embedded in the executable (or registered dynamically) to decide how to unwind. It looks like the shim function is compiled once, which means it gets its own entry that will unwind correctly, but the jitted code still needs a way to unwind.

Unless you generate/copy the shim as part of the rest of the function and it's all contiguous, then both frames really would just be a single frame (and you can probably also just copy the function entry from the shim to apply to the rest of the code, since the unwinding procedure will be identical).

There's some relevant code in https://github.com/microsoft/python-etwtrace/blob/main/src/etwtrace/_etwtrace.c that does this for x64 and ARM64, if you prefer a real example. Note that _thunk is never called directly by this code - that's the most not-obvious part.

@brandtbucher
Member

Yeah, we haven't checked either approach on platforms other than Linux yet.

It looks like the shim function is compiled once, which means it gets its own entry that will unwind correctly, but the jitted code still needs a way to unwind.

Unless you generate/copy the shim as part of the rest of the function and it's all contiguous, then both frames really would just be a single frame (and you can probably also just copy the function entry from the shim to apply to the rest of the code, since the unwinding procedure will be identical).

It's sort of a mix of the two. We JIT a copy of the shim for each trace, but if one trace jumps into another trace the shim frame from the first remains above it, and the second trace's shim is never used.

So if trace A side-exits to B which side-exits to C, then the actual stack will be [..., _PyEval_EvalFrameDefault, <shim A>, <trace C>, ...]. With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].
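A toy model of that scenario, purely for illustration (the `JitStack` class is hypothetical and is not how the JIT tracks anything internally):

```python
class JitStack:
    """Models the native stack as a list of frame labels, outermost first."""

    def __init__(self):
        self.frames = ["_PyEval_EvalFrameDefault"]

    def enter(self, trace):
        """The interpreter calls into JIT code: push that trace's shim, then the trace."""
        self.frames += [f"<shim {trace}>", f"<trace {trace}>"]

    def side_exit(self, trace):
        """One trace jumps into another: the top entry is replaced, no new shim is pushed."""
        self.frames[-1] = f"<trace {trace}>"

    def unwinder_view(self):
        """Trace code runs on the shim's frame pointer, so it shows up as part of the shim."""
        return [f for f in self.frames if not f.startswith("<trace")]


stack = JitStack()
stack.enter("A")
stack.side_exit("B")
stack.side_exit("C")
print(stack.frames)
print(stack.unwinder_view())
```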

Not sure if that helps at all?

@brandtbucher
Member

brandtbucher commented Nov 20, 2024

This may not be the case on Windows (I trust you on other platforms), where the IP is used to look up a table embedded in the executable (or registered dynamically) to decide how to unwind.

Based on our experiments, this seems to be how other unwinding tools work on Linux (@pablogsal can correct me if I'm wrong). If DWARF unwind info for the IP has been registered with the tool (either from loading the executable or through explicit runtime APIs), it will use that. Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.
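The fallback behaviour described above can be sketched like this. It is a simplified model, not the logic of any particular tool: the `UnwindTable` class, the address ranges, and the strategy names are all hypothetical, and registered ranges are assumed not to overlap.

```python
import bisect


class UnwindTable:
    """Address ranges for which unwind info has been registered with the tool."""

    def __init__(self, ranges):
        # ranges: (start, end) pairs, end exclusive, assumed non-overlapping
        self.ranges = sorted(ranges)
        self.starts = [start for start, _ in self.ranges]

    def strategy_for(self, ip):
        """Pick how to unwind the frame whose instruction pointer is `ip`."""
        i = bisect.bisect_right(self.starts, ip) - 1
        if i >= 0 and ip < self.ranges[i][1]:
            return "dwarf"  # unwind info is known for this address
        return "frame-pointer"  # no info: fall back to the frame pointer chain


table = UnwindTable([(0x400000, 0x500000)])  # e.g. the interpreter's own code
in_interpreter = table.strategy_for(0x401234)
in_jit_code = table.strategy_for(0x7F0000)  # JIT memory, nothing registered
print(in_interpreter, in_jit_code)
```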

@zooba
Member

zooba commented Nov 20, 2024

With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].

This is fine, I'm sure. It makes it a little harder to figure out exactly which Python code led to the code being executed, but stack unwinding should work (and unwinding is far more important).

Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.

I'm 95% sure Windows doesn't fall back to frame pointers, because as I mentioned earlier they're not even enabled by the system compilers. I'm only 50% sure that registering (through those runtime APIs) that a function does use a frame pointer will even work, but at least in that case I'm prepared to report it to the OS as a bug and get it fixed.

@pablogsal
Member Author

pablogsal commented Nov 20, 2024

With the approach in my comment, an unwinder would just see <shim A>'s frame pointer in rbp upon unwinding to <trace C>. Which means it would see [..., _PyEval_EvalFrameDefault, <shim A>, ...].

This is fine, I'm sure. It makes it a little harder to figure out exactly which Python code led to the code being executed, but stack unwinding should work (and unwinding is far more important).

Otherwise, many will attempt to use frame pointers as a fallback just to get through that frame, which is the entire reason why this fix works, even when the rest of the interpreter doesn't have frame pointers.

I'm 95% sure Windows doesn't fall back to frame pointers, because as I mentioned earlier they're not even enabled by the system compilers. I'm only 50% sure that registering (through those runtime APIs) that a function does use a frame pointer will even work, but at least in that case I'm prepared to report it to the OS as a bug and get it fixed.

I spent some time investigating windows and trying stuff. Seems that there are two options:

So very similar to __register_frame in Linux.

@brandtbucher
Member

brandtbucher commented Nov 20, 2024

Our plan now is:

  • I'll open an issue about the reserved frame pointers fix for LLVM's x86 backend, just so we can get an idea of the timeline / likelihood of a fix.
  • We'll test that the reserved frame pointers work on AArch64 Linux with an unmodified LLVM 19. If not, we'll dig into why. If they do work (expected), we'll benchmark it and open a PR with a test (assuming performance doesn't take a hit).
  • In the meantime, don't make any deep changes to the JIT build that may break frame-pointer-based unwinding, since it's untested currently.

@zooba
Member

zooba commented Nov 20, 2024

I spent some time investigating windows and trying stuff. Seems that there are two options:

For whatever horrible reason, RtlAddGrowableFunctionTable (and the associated RtlGrowFunctionTable) is actually the only one that works. It was reported over a year ago (after I spent way too much time figuring it out), so it might get fixed, but I'd start with those APIs.

(And it's probably not safe to actually grow a table, since you can't just append, you have to sort the table yourself. But you can provide the entire table and then never grow it. It's just that this is the only API that actually tells anyone that you added a table.)
