Async Call-Stack Reconstruction #91048
Profiling tools that use the call-stack (i.e. all of them) paint an incomplete picture of what’s really going on in async-heavy codebases. They can only show the stack of the currently executing task; they miss the chain of awaitables that are transitively waiting on the current task. To remedy this, we have added support in Cinder to expose the async call-stack. This consists of the call stack for the currently executing task, followed by the chain of awaitables that are transitively reachable from the currently executing task. See below for a clarifying example.
When retrieved from f4, the two different stacks (top-of-stack last) are:

We’d like to merge our implementation into CPython so that other heavy users of asyncio can benefit. This will consist of a few parts:
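The original example did not survive the migration. Below is a hypothetical reconstruction (the `f1`..`f4` names and the exact structure are inferred from the later discussion, not the original code): a program where one coroutine chain waits on a separate Task, so the ordinary call stack seen from the innermost function stops at the task boundary.

```python
import asyncio
import inspect

# Hypothetical reconstruction: f2 waits on a *separate Task* running f3,
# so the ordinary call stack seen from inside f4 stops at the task boundary.
sync_stack = []

async def f4():
    # Walk the ordinary Python call stack from inside f4.
    frame = inspect.currentframe()
    while frame is not None:
        sync_stack.append(frame.f_code.co_name)
        frame = frame.f_back
    await asyncio.sleep(0)

async def f3():
    await f4()

async def f2():
    # The task boundary: f2 is suspended waiting on the f3 task, so it
    # does not appear on f4's ordinary call stack while f4 is running.
    await asyncio.ensure_future(f3())

async def f1():
    await f2()

asyncio.run(f1())
```

The plain stack collected in `f4` contains `f4` and `f3` but not `f2` or `f1`; the async call stack proposed in this issue would additionally include the transitively waiting `f2` and `f1`.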
Could you provide a link first, please?
Sorry for the confusion, I'm working on a PR. I filed the BPO to gauge interest in the feature.
I've recently dabbled a bit in some new primitives for asyncio, and based on that experience I think this would be very useful. IIRC Trio does this (presumably at considerable cost) in userland.
The idea looks interesting.
Somewhat related discussion (where this feature might have been helpful): https://discuss.python.org/t/can-i-get-a-second-opinion-on-an-asyncio-issue/18471. This is the Cinder 3.10 implementation of the
If someone wants to move this forward, let them propose a design here and (if they're confident enough) submit a PR.
@mpage Are you still interested in working on this? I am still interested in having this as a feature in CPython! |
@gvanrossum Yes, still interested! Just haven't found the time yet to start working on it.
You might be interested in the existence of Task.get_stack(). Apparently it was part of the original asyncio code (I'd forgotten about it). I'm not sure if it addresses this problem in general (it's part of asyncio), or if it is fast enough, or if it even works.
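For reference, a small experiment (helper names here are hypothetical) suggests why `Task.get_stack()` may not address this problem: for a suspended task it reports only the outermost suspended frame, not the chain of coroutines that frame is awaiting.

```python
import asyncio

async def inner():
    await asyncio.sleep(0.05)

async def outer():
    await inner()

async def main():
    task = asyncio.create_task(outer())
    await asyncio.sleep(0)  # let the task start and suspend
    names = [f.f_code.co_name for f in task.get_stack()]
    task.cancel()
    return names

names = asyncio.run(main())
# get_stack() reports the suspended `outer` frame, but not the
# awaited `inner` coroutine below it.
```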
For example, the code from the initial comment is codified in https://gist.github.com/mpage/584a02fc986d32b11b290c7032700369. Unfortunately you need Cinder in order to run it. When using
Oh, you're right. Looking through the Cinder code it seems that this requires a fair amount of C code (which is maybe why you haven't submitted your PR yet?). Is that fundamental or an optimization? How would pure Python code in 3.11 go about finding the awaiter of a suspended coroutine? Is there just no way? What if we know it's an asyncio task?
I just created a little proof-of-concept that gets the await chain for a task, unless that task is the current task. To get around that, you can create a dummy task and call it from there. Here's the basic code (sans dummy task hack):

```python
def get_stack(task):
    # Walk the cr_await chain from the task's outermost coroutine,
    # collecting each suspended coroutine's frame.
    coro = task.get_coro()
    frames = []
    while coro is not None and hasattr(coro, "cr_frame"):
        frame = coro.cr_frame
        frames.append(frame)
        coro = coro.cr_await
    return frames
```

It probably needs improvements. Maybe that's enough if you need this in a debugging context? Or is this something where performance is important? Or am I missing something? (Maybe you need this when an uncaught exception is raised?)
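As a usage sketch (the `outer`/`leaf` helpers are hypothetical), calling this from a sibling coroutine on a task that has suspended shows the whole await chain within that one task:

```python
import asyncio

def get_stack(task):
    # Same chain-walking idea: follow cr_await links from the task's
    # outermost coroutine down to the innermost suspended coroutine.
    coro = task.get_coro()
    frames = []
    while coro is not None and hasattr(coro, "cr_frame"):
        frames.append(coro.cr_frame)
        coro = coro.cr_await
    return frames

async def leaf():
    await asyncio.sleep(0.05)

async def outer():
    await leaf()

async def main():
    task = asyncio.create_task(outer())
    await asyncio.sleep(0)  # let the task run until it suspends
    names = [f.f_code.co_name for f in get_stack(task)]
    task.cancel()
    return names

names = asyncio.run(main())
# The chain covers the suspended task top to bottom:
# outer -> leaf -> sleep
```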
Talking to myself here, my above code doesn't appear to be able to cross task boundaries -- I can get it to produce [f3, f4] for your example program (by calling it from a dummy helper task), but it's missing [f1, f2]. It looks like the iterator used by the Future object (either the Python or the C version) is impenetrable. Thoughts? (Gist here.)
I have a better version now. It relies on

This leaves a lot to be desired if you're in a coroutine world other than asyncio, and it's pretty inefficient, since it must traverse the

My question for you at this point is: do you really need to do this purely at the coroutine level, or is it acceptable for your use case(s) if this is essentially a
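To illustrate the inefficiency being described (this is an assumed reconstruction — the exact mechanism in the comment was lost in migration): an outside-in search has no direct link from a coroutine to its awaiter, so it must walk the await chain of every live task.

```python
import asyncio

def awaiting_chain(task):
    # Follow cr_await links from the task's outermost coroutine inward.
    chain = []
    coro = task.get_coro()
    while coro is not None and hasattr(coro, "cr_frame"):
        chain.append(coro)
        coro = coro.cr_await
    return chain

def tasks_waiting_on(code_name):
    # Cost is proportional to the total await-chain depth of *all*
    # tasks, which is the inefficiency noted above.
    return [t for t in asyncio.all_tasks()
            if any(c.cr_code.co_name == code_name
                   for c in awaiting_chain(t))]

async def leaf():
    await asyncio.sleep(0.05)

async def outer():
    await leaf()

async def main():
    task = asyncio.create_task(outer())
    await asyncio.sleep(0)  # let the task suspend inside leaf()
    hits = tasks_waiting_on("leaf")
    task.cancel()
    return hits == [task]

ok = asyncio.run(main())
```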
I just realized that f4 and f5 were missing from the output. A little tweak to the logic handles that -- the coro stack should be collected from the helper task as well. I've updated the gist, and the output is now:
Our primary use case for this feature is an always-on sampling profiler (roughly equivalent to GWP at Google). Periodically, for each machine in the fleet, the profiler will wake up, grab the await stack for the currently running task, and send the await stack to a service that performs aggregation. The primary use of this data is to provide CPU-time profiling for applications.

Performance is important for us. Since this is always on and pauses request processing, we don't want the implementation to negatively impact performance of in-flight requests. We typically have a large number of active tasks (I'm somewhat embarrassed to admit that I don't have a number handy here), so I'm not sure the approach you've taken here would work for us. The logic for collecting the await stack is currently implemented as an eBPF probe, and the awaiter implementation in Cinder simplifies that: the probe "only" has to walk the await stack backwards from the coroutine for the current task. The approach you've taken is neat, but unfortunately I don't think it can cross
Yeah, I can see that a sampling profiler would need some support from the C-level data structures, both for speed and to make the eBPF code for collecting the await stack simpler (or possible at all). The problem with GatheringFuture seems to be shared with anything that doesn't wait for a task -- using a TaskGroup has the same problem. I'm sure I could work around it (though I'd probably have to modify a few spots in asyncio), but I agree that as a general approach this isn't going to work. So I'm looking forward to the PR you promised!
Hoping to have some time in the next couple of weeks to put an initial PR up :) |
I know the Cinder folks haven't flagged this, but having an easier way to walk "async frames" when handling an exception would make some things easier in Django's

The decorator works by attaching a bunch of locals to a wrapper function: Then looking for that wrapper function in

If we can find the wrapper function, then the error reporter hides the listed sensitive variables when generating the debug 500 page. (I apologize that this requires a bit of reading and isn't in Gist form, but I thought it might be useful to see the utility for this in a practical example.)

This doesn't work for async functions because

I've tried reading https://docs.python.org/3/library/inspect.html but the docs are a little sparse on how these pieces fit together, and again I don't really know what I'm talking about / what most of the discussion in this issue is about.

Anyway, I'm not entirely sure if adding

I can file a separate issue / post on a discussion forum if that would be more appropriate for this.
@njsmith I'm curious if you have any feedback on this proposed feature from the perspective of Trio. Guido mentioned above that maybe Trio goes to some lengths to make complete logical async stacks available? See also the PR and @markshannon's comment there proposing that instead of building this feature into coroutines, the async frameworks should all agree on how Tasks-or-equivalent can link to their awaiting Task-or-equivalent, allowing stitching their stacks together. (I'm not sure how feasible it is for different frameworks to represent this in the same way so that profilers don't have to treat them all as special cases, or how feasible it is for an out-of-process profiler to find the Task objects and do this stitching regardless.)
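A purely hypothetical sketch of that cross-framework idea: if every framework's task object exposed a link to whatever task is waiting on it (the `awaited_by` attribute name is invented here, not any agreed-upon API), a profiler could stitch per-task stacks together without framework-specific code.

```python
class TaskLike:
    """Stand-in for any framework's task object (hypothetical)."""
    def __init__(self, name, awaited_by=None):
        self.name = name
        self.awaited_by = awaited_by  # link to the waiting task, if any

def stitched_chain(task):
    # Walk outward from the given task to the root waiter,
    # collecting task names along the way.
    chain = []
    while task is not None:
        chain.append(task.name)
        task = task.awaited_by
    return chain

root = TaskLike("serve_request")
child = TaskLike("fetch_backend", awaited_by=root)
```

Here `stitched_chain(child)` yields `["fetch_backend", "serve_request"]`; each framework would only need to maintain the link, and the stitching logic stays generic.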
If I'm understanding correctly, this is solving a problem that Trio never has in the first place, because we don't have a concept of "awaitable objects". We do struggle with call stack introspection in a few ways. I think the main ones are:
Currently we work around these issues with deeply messy hacks involving calling

For the problem of attributing nursery/task-groups to parent task frames, I think there actually is a neat solution we can implement. You may recall that we want to add some feature to prevent
@njsmith I think that this feature would solve the problem you describe as "the big one"? So it seems like it does solve a problem that Trio has? (EDIT: on second thought I'm not sure it does; need more study. Thanks for the summary of Trio's issues in this area.) |
I don't think it does, but I could be wrong -- here's a concrete example to make sure we're on the same page :-)

```python
async def parent1():
    await parent2()

async def parent2():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(child1)
        await parent3()

async def parent3():
    await trio.sleep(10)

async def child1():
    await child2()

async def child2():
    await trio.sleep(10)
```

From this, we want to reconstruct the child's call stack as:

(And to clarify a few edge cases: (1) it doesn't matter where
Per #103976 (comment), can this wait for 3.13? |
My understanding is that this feature would support tracing async stacks "inside-out": starting with a particular task or coroutine, trace outward to find who's waiting on it. There is no way to do that currently, except maybe via

The semantics are decidedly funky, though. If you do

then the stack of

Trio doesn't allow a task to directly wait on another task (except indirectly via synchronization primitives like

Not everything in an async call stack is a coroutine object. Some of the challenging ones I ran into:

There are also regular generators (for

The logic in stackscope for locating the context managers on a certain frame's stack (which is a prerequisite for drawing the nursery/task tree accurately) is by far the most convoluted/brittle part. There are two implementations, one that uses

I would recommend that anyone working on async call stack introspection review stackscope's implementation; it's a pretty good synthesis of all the dark corners I've found in ~3 years of poking at this. There are a lot of dark corners. It would be great if we could make CPython changes that result in fewer dark corners. I don't know how much
Unfortunately @mpage has not had time to look at this recently, so I'm having a try at addressing the comments on his original PR #103976, and this issue.

To start with, I've made a completely new implementation here: jbower-fb@3918379. This isn't PR-ready yet -- it only has a Python implementation, is lacking proper tests, and almost certainly needs some public iteration. So, I wanted to get it up early for comment.

My version follows @markshannon's and @njsmith's suggestions on #103976, and rather than linking coroutines instead links

In addition to the new await dependency data, I've added a function

Finally, there is a straightforward

Most of the implementation is now in the

Note the problems @njsmith mentioned with

Looking forward to feedback.