-
-
Notifications
You must be signed in to change notification settings - Fork 32.3k
gh-136459: Add perf trampoline support for macOS #136461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cc @pablogsal. I went ahead and submitted this PR for #136459 but please let me know what you think! |
A running process may create a file in the ``/tmp`` directory, which contains entries | ||
that can map a section of executable code to a name. This interface is described in the | ||
profiling tool (such as `perf <https://perf.wiki.kernel.org/index.php/Main_Page>`_ or | ||
`samply <https://github.com/mstange/samply/>`_). A running process may create a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a shameless plug to samply 😄 Disclaimer: I'm not the maintainer of the project, but the maintainer is my colleague. But it doesn't change the fact that it's an awesome profiler! But I can revert it if you prefer not to include :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy with the plug, but this docs are going to need much more than this then. If samply
is the main way to use this on macOS then we will need to update https://docs.python.org/3/howto/perf_profiling.html with full instructions for samply :)
f0887a1
to
f663627
Compare
Python/perf_jit_trampoline.c
Outdated
|
||
/* These constants are defined inside <elf.h>, which we can't use outside of linux. */ | ||
#if !defined(__linux__) | ||
# define EM_386 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to define the variable for different platforms. like this https://github.com/python/cpython/blob/main/Python/perf_jit_trampoline.c#L126-L135
Not define all the possible variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I just changed them to be defined depending on the platform. I actually copy pasted this from the old code but your suggestion is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, the trampoline depens on the mmap.
But it has record about the mmap performance issue on macOS.
I think we need a benchmark here
FYI
Edit: ah I think you are referring to the mmap to tell Perf about the maps file no? Sorry I thought you were referring to the jit trampoline compiler that itself uses mmap to get the chunks. |
@canova i am currently on vacation until EOW I will answer more when I have some time later today or tomorrow but I would love to get macOS support. I have some questions I think we need to answer first to ensure we provide a good UX if we are adding more profilers. |
On macOS, we don't need to call mmap because samply has already detected the file path during the call to `open` before (it interposes `open` with a preloaded library), and because the mmap call can be slow.
Ah I think you are right. I actually looked for this and only found one mmap inside the cpython/Python/perf_jit_trampoline.c Lines 945 to 961 in e697f5e
I updated the PR to remove this on macOS, let me know what you think!
Thanks! I'm glad that you would like to support it on macOS. And I'm happy to discuss the details further and update the code later. (Enjoy your vacation!) |
@@ -1,13 +1,21 @@ | |||
.text | |||
#if defined(__APPLE__) | |||
.globl __Py_trampoline_func_start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, Apple and their symbol choices...:S
@canova One thing I am concerned is that for perf I had to add jit support because nothing compiles with frame pointers in the wild which means that unless people compile Python themselves (and take a big-ish hit unfortunately) the frame-pointer version is suboptimal. What's the status of this for samply? Does it also support the perf jit interface? Can you ping here the samply maintainer if the answer is nuanced? |
Samply maintainer here - I don't understand the question, I'm afraid. What do you mean by "add jit support", and what do you mean by "the perf jit interface"? Samply supports unwinding with DWARF information if available, but only for regular code mapped from a binary / shared library, not for JIT code. When it encounters JIT code during unwinding, it falls back to using frame pointers for that frame. It does not make use of any |
Perf supports two modes to deal with JIT compilers which is what we are leveraging to make it work with Python via the trampolines:
We do implement both. The reason we implement both is that the simple maps interface works as long as perf can unwind through the JIT frames. Perf uses Unfortunately, nobody compiles Python with frame pointers because it's very slow. See #96174. Not only that, even if you compile with frame pointers most of the time wheels and binary packages are not compiled with frame pointers so this is sort of not very useful in the wild unless you have control over your full ecosystem. To deal with this we implement the much more complex JIT interface. This allows us to provide to perf DWARF for the trampolines so we can pass the I hope this gives some background for the question. I really want to know if simply is going to be able cope with the general case in general in Linux and Mac for all arches that we support. I am asking this among other things because we already had problems with code that doesn't have frame pointers around the trampolines. See #130856. |
Also, we probably need a buildbot and tests for this to ensure this doesn't break in the future |
Thanks for the pointer! Wow, I feel slightly nauseous after reading the part about having to to space out the trampolines by the size of the unwind info... Anyway: samply supports both perf.map file and jitdump. But as I said above, it does not respect the JIT unwind info. Instead, when finding the return address for a JIT frame, samply always uses framepointer unwinding. Looking at the actual assembly for the trampolines, I can see the following:
For samply's current approach to work on x86_64, the x86_64 trampoline just needs to set up a framepointer - add a |
Thanks a lot @mstange, that makes perfect sense and really clarifies the situation! If we’re going to add samply support, we need to ensure it works reliably on both aarch64 and x86_64 across both macOS and Linux. Based on your analysis, we should definitely add frame pointers to the x86_64 trampoline in this PR to ensure consistent behavior across architectures. I can help set up some buildbot infrastructure to ensure all of this works properly with samply once the PR is ready for more comprehensive testing. I also think we should have full end-to-end documentation that covers how to use this with Python, similar to our existing perf docs. We should probably make the docs more generic to cover both perf and samply workflows, since users will likely want guidance on both tools depending on their platform and preferences. @canova would you be willing to add the frame pointer setup to the x86_64 trampoline and the DWARF as part of this PR? It seems like the right time to ensure sample works everywhere and we can advertise it properly |
I feel you: I was certainly nauseous when I had to debug that nonsense for hours and hours and even more when I had to implement the hack to "fix" it. |
Yes, I mean maps file here. Sorry for confusing |
SGTM |
@@ -1166,7 +1191,11 @@ static void perf_map_jit_write_entry(void *state, const void *code_addr, | |||
ev.base.size = sizeof(ev) + (name_length+1) + size; | |||
ev.base.time_stamp = get_current_monotonic_ticks(); | |||
ev.process_id = getpid(); | |||
#if defined(__APPLE__) | |||
pthread_threadid_np(NULL, &ev.thread_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just use this implementation in here?
Lines 193 to 255 in b44316a
_Py_ThreadId(void) | |
{ | |
uintptr_t tid; | |
#if defined(_MSC_VER) && defined(_M_X64) | |
tid = __readgsqword(48); | |
#elif defined(_MSC_VER) && defined(_M_IX86) | |
tid = __readfsdword(24); | |
#elif defined(_MSC_VER) && defined(_M_ARM64) | |
tid = __getReg(18); | |
#elif defined(__MINGW32__) && defined(_M_X64) | |
tid = __readgsqword(48); | |
#elif defined(__MINGW32__) && defined(_M_IX86) | |
tid = __readfsdword(24); | |
#elif defined(__MINGW32__) && defined(_M_ARM64) | |
tid = __getReg(18); | |
#elif defined(__i386__) | |
__asm__("movl %%gs:0, %0" : "=r" (tid)); // 32-bit always uses GS | |
#elif defined(__MACH__) && defined(__x86_64__) | |
__asm__("movq %%gs:0, %0" : "=r" (tid)); // x86_64 macOSX uses GS | |
#elif defined(__x86_64__) | |
__asm__("movq %%fs:0, %0" : "=r" (tid)); // x86_64 Linux, BSD uses FS | |
#elif defined(__arm__) && __ARM_ARCH >= 7 | |
__asm__ ("mrc p15, 0, %0, c13, c0, 3\nbic %0, %0, #3" : "=r" (tid)); | |
#elif defined(__aarch64__) && defined(__APPLE__) | |
__asm__ ("mrs %0, tpidrro_el0" : "=r" (tid)); | |
#elif defined(__aarch64__) | |
__asm__ ("mrs %0, tpidr_el0" : "=r" (tid)); | |
#elif defined(__powerpc64__) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// r13 is reserved for use as system thread ID by the Power 64-bit ABI. | |
register uintptr_t tp __asm__ ("r13"); | |
__asm__("" : "=r" (tp)); | |
tid = tp; | |
#endif | |
#elif defined(__powerpc__) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// r2 is reserved for use as system thread ID by the Power 32-bit ABI. | |
register uintptr_t tp __asm__ ("r2"); | |
__asm__ ("" : "=r" (tp)); | |
tid = tp; | |
#endif | |
#elif defined(__s390__) && defined(__GNUC__) | |
// Both GCC and Clang have supported __builtin_thread_pointer | |
// for s390 from long time ago. | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#elif defined(__riscv) | |
#if defined(__clang__) && _Py__has_builtin(__builtin_thread_pointer) | |
tid = (uintptr_t)__builtin_thread_pointer(); | |
#else | |
// tp is Thread Pointer provided by the RISC-V ABI. | |
__asm__ ("mv %0, tp" : "=r" (tid)); | |
#endif | |
#else | |
// Fallback to a portable implementation if we do not have a faster | |
// platform-specific implementation. | |
tid = _Py_GetThreadLocal_Addr(); | |
#endif | |
return tid; | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! It looks like _Py_ThreadId
is defined only when Py_GIL_DISABLED
is defined and Py_LIMITED_API
is not defined:
Line 189 in d754f75
#if defined(Py_GIL_DISABLED) && !defined(Py_LIMITED_API) |
It doesn't compile because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I meant just for inline assembly because it was for low over head at free threading, maybe you can just pick it for apple implementation.
@mstange @pablogsal Thanks for the investigation!
Sure that sounds good to me, I'll update the PR soon with this and add some more tests. Would you like me to update the documentation in this PR as well? |
This should do the trick, but needs checking and a fix for CET:
|
Yep please. |
Just tested your patch using samply. I can verify that it fixes the stack walking! Here are before and after profiles: |
It's a samply bug, yes - or rather a workaround that's not necessary for Python because there's only a single native function for each Python function (just the trampoline itself). The duplication was initially added for JS JITs which can compile the same function multiple times with different JIT tiers, and it was useful to both have a single call node per JS function and to have a separate call node for each tier. |
@pablogsal I just added two new commits for documentation and testing. For documentation, I made the perf profiling page more generic and added a section for samply. Started small, so I can shape it along the way with the feedback I get. For the test, we already have some test coverage already from previous tests for perf. But now I added some |
e2252bf
to
8b03dc1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hugovk @AA-Turner I would love if you could help us a bit with the docs. The previous docs are assuming that only the perf profiler exist and is centered on it but seems we want to move to a world when perf and other perf-like profilers can exist. I want this to be a excellent place to learn about this so I would love if you could guide @canova on how to ensure there is enough advice on how to use "simply" that users can have a good experience and doesn't feel out of place.
@@ -148,6 +149,26 @@ Instead, if we run the same experiment with ``perf`` support enabled we get: | |||
|
|||
|
|||
|
|||
Using ``samply`` profiler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to need a bit more here. For example, simply supports both perf modes so we need clarification on when tho use them and what are the recommendations. How to read the flamegraphs etc etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to break these discussions out into a separate PR? It doesn't seem useful to delay landing trampoline support for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem useful to delay landing trampoline support for this.
Is there any rush? This will go into 3.15 anyway and that's going to be released October 2026. We still need to figure out the buildbot situation which will take some time...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to separate this into a different PR, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was hoping that we could maybe enable it on 3.14. Considering that the code was there since 3.12, and it's mostly putting lots of ifdefs here and there (minus samply and documentation part). I suspect that updating the documentation will take longer. But I'm not familiar with the release process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I was hoping that we could maybe enable it on 3.14.
No way unfortunately as we are 3 betas past beta freeze. It's up to the release manager to decide (CC @hugovk) but we have a strict policy for this I am afraid and no new features can be added past beta freeze.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hugovk Checking just in case although I assume the answer is "no" but would you consider adding this to 3.14 given that this is a new platform and the code is gated by ifdefs? This will allow people on macOS to profile their code using a native profiler, which would be very useful for investigating performance in Python+compiled code.
e36f2af
to
8b03dc1
Compare
This PR adds perf trampoline support for macOS. Previously it was Linux-only, which is not actually needed it to be. We can share the implementation of Linux and macOS with some minor changes done in this PR.
For example, here's a before and after profile of this PR, captured by
samply
(for the same dummy script):Before / After
As you can see, now we have blue frames that mentions the real Python function names, which makes it a lot more useful to understand this profile. Without it, it's just native frames that's really difficult to understand.
This is my first PR in this project, so please let me know if there is anything I didn't do or should do. Thanks!
📚 Documentation preview 📚: https://cpython-previews--136461.org.readthedocs.build/