Raise at next Checkpoint if Non-awaited coroutine found. #176
Conversation
Hey, this is really cool! Thanks for pushing on this. Some general thoughts:

Hopefully, bpo-30491 will mean that this is something we can leave enabled all the time in py37+, and we'll eventually need to structure the code to handle the three cases of (a) 3.5/3.6 with this feature disabled, (b) 3.7+ with access to the unawaited coroutine count (but no information on which coroutines they are), and (c) any version with the coroutine wrapper installed to give full details. If we structure it this way from the start (using the coroutine wrapper to polyfill the proposal in bpo-30491), then that will let us make sure that bpo-30491 actually does what we need it to do.

It's not totally clear to me what the public API should look like. I'm fine with not providing any way to disable checking for individual coroutines, at least in the initial version: I really don't think we ever need to turn it off. And I figure, either I'm right and we can keep the code simple and remove temptation for people to write weird things, or else I'm wrong and making it impossible will force us to understand the problem better :-). Obviously we do need some way to avoid it triggering for actual coroutine spawns. Probably the simplest way is to make it track coroutine objects that have been created but not yet awaited.

When we detect a problem, signalling it is not entirely trivial. Trying to inject an exception into the task where the problem actually happened is problematic, as you've noted, because it might already be gone. I suppose technically even for a finished task we could change its result. One option is to do the same thing we do for other internal errors; another option would be to just raise the exception out of the run loop.

Regarding the actual error we report: in "fast" mode, where we only have the count (either b/c we're on 3.7, or because we're using a polyfill to pretend to be), then the thing to do is probably to collect a stack trace from the last coroutine that ran, and present that to the user with some text saying "here's where we detected the problem, it actually happened just before this". It's unfortunate that there's no way to get more details at that point, but I think it's unlikely we can convince @1st1 that cpython should provide a way to find all unawaited coroutine objects due to the runtime overhead. (I guess this could be pretty cheap if we use an intrusive list in coroutine objects...?) Hooking the warning machinery too might help, since in many (but not all) cases the unawaited coroutine will eventually be GC'ed and emit a RuntimeWarning.

In "slow" mode, where we use a wrapper and keep track of individual coroutines, then maybe we don't want to rely on tracemalloc and should save the information ourselves? At this point the user is explicitly running in slow mode to get detailed information, quite possibly because they just got a not-as-helpful error message from the "fast" mode machinery, so we definitely want to provide the detailed information, and asking them to enable tracemalloc as well seems unfriendly. Or maybe we could enable it ourselves? And tracemalloc has quite a bit of overhead because it saves data on every Python object, not just coroutines. ...This detail is something we can fine-tune later, though. |
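The three-case structure above hinges on one primitive: being able to tell whether a given coroutine object has ever been started. On current Pythons (where sys.set_coroutine_wrapper has since been removed) the same idea can be illustrated with inspect.getcoroutinestate, which reports CORO_CREATED for a coroutine that was instantiated but never awaited — a hedged sketch of the detection primitive, not trio's actual implementation:

```python
import inspect

async def f():
    return 1

# A coroutine object that was created but never awaited stays in CORO_CREATED.
unawaited = f()
assert inspect.getcoroutinestate(unawaited) == "CORO_CREATED"

# Driving a coroutine to completion (which is what `await` does under the
# hood) moves it through to CORO_CLOSED.
awaited = f()
try:
    awaited.send(None)
except StopIteration as exc:
    assert exc.value == 1
assert inspect.getcoroutinestate(awaited) == "CORO_CLOSED"

# Close the leftover so the interpreter doesn't emit its own RuntimeWarning.
unawaited.close()
```

A checker running at every checkpoint would then scan recently created coroutines and flag any still in CORO_CREATED.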
No worries, I got some questions post-PyCon, as some people told me that the "getting started with asyncio" talk was not exactly what they were expecting. In particular, the Chess Master working on multiple games felt to them like (thinking == CPU intensive), so I decided to write a toy runner for myself, and wrote this piece just for me to play with, so it was easy to port here :-)
I think I'm starting to agree with "no context manager": if you use it and inadvertently await a function inside the context manager, then you also remove the protection for the awaited coroutine... which may itself not await anything. I still think we should have an explicit way to opt out on a per-coroutine basis, mostly for interoperability with frameworks that will use trio and may do some crazy things. I'm fine leaving it in hazmat with an unstable API. It seems reasonable, but I have way less experience with async than you, so I'll trust you.
Yes, agreed. I think it should be slow mode by default, otherwise no one will use it. And that goes hand in hand with "make it work, make it right, make it fast".
I'll reread that later. My brain is tired from the last few days; I follow each point, but the whole paragraph does not yet click in my head.
It might not be that hard in fact. I may give it a try.
Yeah, I partially went this route, but had the weird bug that the
Agreed.
Yeah, the "just before" is tricky. It's definitely not how "usual" stack traces work, especially since the "just before" can be "just before, in any of the higher frames". This makes me think that an actual context manager/decorator
I'm unsure how. If it's user-defined coroutines, or even another library that has the bug, you can't really track that, can you? Or do you think using |
Though... on further thought... maybe it is reasonable to ask the interpreter to keep a record of all unawaited coroutines? The implementation I'm imagining is: each coroutine object has 2 extra pointer values embedded in it. In py36, the minimal size of a coroutine is at least:

>>> import sys
>>> async def f():
...     pass
...
>>> coro = f()
>>> sys.getsizeof(coro) + sys.getsizeof(coro.cr_frame) + sys.getsizeof(coro.cr_frame.f_locals)
736

so adding 2 pointers would only increase the size by ~2% at most. We use the two pointers to maintain a doubly-linked list, protected by the GIL: there's a global/thread-local holding the list head, and when a coroutine is created, it inserts itself into the list, and when it's awaited, it removes itself from the list. For an intrusive doubly-linked list these are both O(1) and very cheap, just a few pointer assignments.

The list references could be either strong or weak. Implementation-wise, strong refs would mean adding some refcount bookkeeping. Semantics-wise, the difference would be: say we have a piece of code like trio or pytest-asyncio that wants to start tracking unawaited coroutines. With strong refs, the tracking would have to be disabled by default (to prevent memory leaks), so our code would have to be careful about turning it on and off again at the right places, but in exchange it would get exact and reliable tracking of unawaited coroutines. With weak refs, the tracking could be on by default, but we'd also need to hook the warnings module to get full tracking, because sometimes a coroutine might get allocated and then GC'ed in between checks. The advantage of this is that after things go wrong we get full access to the coroutine objects to decide what to do with them – in particular we can get the coroutine name, but can also filter based on attributes, ask tracemalloc if it knows where the object was allocated, etc. Yet, until things go wrong, it's still very cheap -- on the same order of magnitude as a simple counter. |
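The O(1) insert/remove being proposed can be sketched in pure Python (the real thing would be a few pointer assignments in C inside the coroutine object; all names here are invented for illustration):

```python
class Node:
    """Stand-in for a coroutine object carrying two intrusive pointers."""
    def __init__(self, name):
        self.name = name
        self.prev = None
        self.next = None

class UnawaitedList:
    """Intrusive doubly-linked list: the head is the only external state."""
    def __init__(self):
        self.head = None

    def insert(self, node):
        # Called when a coroutine is created: push onto the front, O(1).
        node.next = self.head
        node.prev = None
        if self.head is not None:
            self.head.prev = node
        self.head = node

    def remove(self, node):
        # Called when the coroutine is awaited: unlink in place, O(1),
        # no allocation and no traversal.
        if node.prev is not None:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next is not None:
            node.next.prev = node.prev
        node.prev = node.next = None

    def names(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(cur.name)
            cur = cur.next
        return out
```

Since each node carries its own links, removal needs no search, which is what keeps the "awaited" path as cheap as a counter bump.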
Hmm, there is |
Hmm, interoperability is actually an important point: if we want to provide some way to support running "legacy" asyncio libraries under trio (#171) then we'll need a way to dial this back. But I don't think we can ask people who want to use, say, asyncpg, to go add some decorator to every internal function in asyncpg. And ideally we wouldn't disable the checking for all code in the process, but would restrict it to the "legacy" code. Of course, this might or might not be possible depending on the interoperability strategy. But the two main ones I've thought of so far could support that. (The two are: (1) run asyncio in a thread, where our tracking would automatically be disabled b/c it's thread-local, and (2) have an "asyncio-flavored" task in trio that holds the asyncio loop -- see #171 for details -- so the run loop could know which tasks are asyncio-flavored and ignore their unawaited coroutines when it does its check.) And in any case, this is something we can worry about after the basics are working :-).
Well, the reason I'm thinking about wacky fallbacks like "what if we only had a count of unawaited coroutines but didn't know what they were", or "how can the interpreter track unawaited coroutines really efficiently" is that I want to be able to turn this on by default, but I'm worried that
Sorry, probably it is just me being too telegraphic :-). The problem I'm worrying about is: when you actually spawn some tasks, like:

async with trio.open_nursery() as nursery:
    nursery.spawn(f1)
    nursery.spawn(f2)

then trio creates two new coroutine objects that are deliberately not awaited at the call site.
Yeah, that makes sense. But ultimately, as long as we're cleaning up tasks on the way out, then there are going to be a whole bunch of checkpoints in between when we detect the problem and when the user actually sees it, so this doesn't seem like a big deal? Obviously we would want to capture information about where the problem actually occurred in our message, though, e.g. by taking a stack trace from the just-run coroutine immediately after we detect the problem.
Yeah, maybe...
Yeah. It can even be inside a lower frame that exited before the next checkpoint! But unfortunately, there's no way to get a prompt error without help from the interpreter, which doesn't look like it will be forthcoming in 3.7, and there's no way to reconstruct the correct source information after the fact without doing expensive data recording for every single coroutine, so in "default mode" we just have to put up with confusing error messages I guess :-(.
Why? That's what tracemalloc is doing, except it's getting the frames on every call to Python's low-level memory allocation functions :-). It'd be something like |
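The snippet this comment ended with was lost in formatting; the tracemalloc API being alluded to can be sketched like this (a hedged illustration of recovering a coroutine's allocation site, not trio's code — tracemalloc must be started before the object is created):

```python
import tracemalloc

tracemalloc.start()  # must be running *before* the coroutine is created

async def forgot_await():
    pass

coro = forgot_await()

# tracemalloc can report where any traced object was allocated:
tb = tracemalloc.get_object_traceback(coro)
print(tb.format()[-1])  # file/line of the `forgot_await()` call above

coro.close()  # avoid the interpreter's "never awaited" RuntimeWarning
tracemalloc.stop()
```

This is exactly the "expensive data recording for every single coroutine" trade-off discussed above: tracemalloc records a traceback for every allocation, not just coroutines.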
Python already keeps track of all live threads via threading.enumerate(). |
@ncoghlan: Interesting point! The requirement to detect unawaited coroutines makes this a bit complicated; a list of every live coroutine is less useful, because we can't afford to iterate over the list checking if each one is unawaited after each scheduling step -- that would be effectively quadratic. A possible refinement would be to track a list of all coroutines + a counter of unawaited coroutines, so we use the counter to detect the problem and then scan the list to diagnose it. There still are some possible issues with coroutines that get garbage collected before we notice the problem, and with cases where there are independent coroutine schedulers running in different threads... I'd have to think about that more. |
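The refinement described — an O(1) counter for detection, with the full list scanned only for diagnosis — might look roughly like this in Python (names invented; the real mechanism would live inside the interpreter):

```python
import inspect

class CoroTracker:
    """Hypothetical tracker: cheap counter for detection, scan for diagnosis."""
    def __init__(self):
        self.live = []        # the "list of all coroutines"
        self.unawaited = 0    # the cheap counter

    def created(self, coro):
        self.live.append(coro)
        self.unawaited += 1

    def awaited(self, coro):
        self.unawaited -= 1

    def find_offenders(self):
        # Fast path after each scheduling step: a single integer test.
        if self.unawaited == 0:
            return []
        # Slow path, only once a problem is detected: scan to name culprits.
        return [c for c in self.live
                if inspect.getcoroutinestate(c) == "CORO_CREATED"]

async def fine():
    pass

async def forgotten():
    pass

tracker = CoroTracker()

good = fine()
tracker.created(good)
tracker.awaited(good)   # the caller awaits this one...
try:
    good.send(None)
except StopIteration:
    pass

bad = forgotten()
tracker.created(bad)    # ...but never awaits this one

offenders = tracker.find_offenders()
bad.close()             # suppress the "never awaited" RuntimeWarning
```

This avoids the quadratic scan-per-step cost, at the price of the GC and multi-thread caveats mentioned above.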
@njsmith Yeah, I think having the full list would be a complement to your "unawaited counter" idea. Just pointing out that it would be a useful feature in its own right, and has precedent in terms of how the standard library handles threads and how CPython handles subinterpreters. |
Keep 4 linked lists of coroutines, one for each state they can be in? Then it's easy to get all the coroutines in a specific state. That would not increase the memory cost per coroutine much, as you still have only 2 pointers, and it's still easy to get them all. With the set_coro_wrapper slow path, you wouldn't have issues with GC'd coros, as we would keep a reference so they can't be GC'd :-) I think saying that some cases are only handled by the slow path is reasonable. If you really want and CPython accepts, you could also have a 5th linked list of zombie coroutines -- never awaited (never closed?) -- that got GC'd. But that seems extreme. |
What do you think should be the raised exceptions in the following case:
A MultiError as well, in both cases? Seems a bit weird, but I'm not familiar yet with multierrors. |
In general, non-awaited coroutine = buggy program; it's not going to do what the user wants regardless of how we signal the error. So we have a lot of reasonable options for delivering the exception, including things that corrupt application state. Regarding your second example: I guess, lacking any strong intuition, my default would be to do the same thing we do in the other cases. |
On further thought: There is a strong argument for delivering the exception "locally" to the offending task: imagine a WSGI-like web server, where one handler has a missing-await bug. However, there is a problem: suppose the code looks like this:

async def wsgi_like_server_internal_handler():
    while True:
        request = await get_request_from_network()
        try:
            await buggy_user_handler(request)
        except Exception as exc:
            log_and_ignore_exception()

async def buggy_user_handler(request):
    sleep(0)  # missing await!
    return

Here the next checkpoint after the buggy sleep(0) is the get_request_from_network() call on the next loop iteration -- which is outside the try/except that was supposed to contain the handler's errors.

I don't have any great solution here. It's very hard to give errors that are useful and reliable when Python is trying this hard to stop us :-(. Arguably we should still deliver the errors as locally as we can, on the theory that this way we at least might deliver them to the right place. I guess in the HTTP server case it's unlikely that we'd actually escape out to the next loop iteration before passing through a checkpoint, since sending a response requires executing a checkpoint? But it might happen that, say, the handler raises an error (perhaps due to shrapnel from the missing await).

It's all very ugly, but I don't have a better idea. Implementation-wise, I think this would look like:
|
Would a hypothetical hook on coroutine frame termination potentially help here? If so, you might be able to emulate it with a suitable coroutine wrapper that ensures there's always a loop checkpoint between the coroutine terminating and the calling await resuming execution. |
That's what I have currently locally but need to cleanup.
do you mean
In the form of a context manager, would you expect it to keep a reference to unawaited coroutines created in the CM?
Do you mean use set_coro_wrapper and have it return |
Well, that goes out the window because trio expects native coroutines, and you can't do that in
@Carreau I hadn't fully thought through the suggestion, but taking a closer look at https://docs.python.org/3/library/sys.html#sys.set_coroutine_wrapper, I think the kind of wrapper I had in mind would look like:
|
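The wrapper snippet itself did not survive in this copy of the thread. As a hedged guess at the shape under discussion — an async pass-through that gets a chance to run checks after the wrapped coroutine finishes, before the caller's await resumes — it might have looked something like:

```python
# Hedged reconstruction: the original snippet was lost in formatting.
async def checkpoint_wrapper(coro):
    value = await coro
    # ...a real implementation would check for unawaited coroutines here,
    # before the calling frame's own `await` resumes...
    return value

async def work():
    return 42

# Drive the wrapped coroutine by hand, the way an event loop would:
wrapped = checkpoint_wrapper(work())
result = None
try:
    wrapped.send(None)
except StopIteration as exc:
    result = exc.value
```

Note that installing an async def wrapper via sys.set_coroutine_wrapper runs into the recursion problem discussed in this thread: calling the wrapper creates another coroutine, which would itself get wrapped.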
Well, that's what I thought at first too, but:
So you can't. It does raise a RuntimeError if you try. I tried with a
|
Well, that will work :-). Though it could also be a synchronous context manager that only runs the raise-if-unawaited-coroutines-detected logic. In any case it's probably worth having a dedicated API for this so that the user's intention is clear. I guess one way to do it would be to provide a
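The dedicated API floated here — a synchronous context manager that only runs the raise-if-unawaited-coroutines logic on exit — could be sketched like this (all names are hypothetical, and the module-level `created` list stands in for trio's real coroutine-creation tracking):

```python
import contextlib
import inspect

created = []  # hypothetical stand-in for trio's record of created coroutines

@contextlib.contextmanager
def fail_if_unawaited():
    """On exit, raise if coroutines created inside the block were never awaited."""
    mark = len(created)
    yield
    leftovers = [c for c in created[mark:]
                 if inspect.getcoroutinestate(c) == "CORO_CREATED"]
    for c in leftovers:
        c.close()  # suppress the interpreter's own RuntimeWarning
    if leftovers:
        names = ", ".join(c.__qualname__ for c in leftovers)
        raise RuntimeError("unawaited coroutines detected: " + names)

async def oops():
    pass

caught = None
try:
    with fail_if_unawaited():
        created.append(oops())  # created but never awaited
except RuntimeError as exc:
    caught = exc
```

Being synchronous, the CM introduces no checkpoint of its own, so it makes the user's intention clear without changing trio's scheduling semantics.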
Well, the obvious objection is that we can't automagically insert checkpoints everywhere, because the location of checkpoints is part of trio's semantics. But this is easily handled by only checking for unawaited coroutines rather than executing a full checkpoint.

It's a funny suggestion in some sense because (AFAICT) async frame teardown doesn't form any kind of intrinsic boundary with respect to catch-all exception handlers... it's certainly possible to accidentally write code like:

try:
    await do_something()
    sleep(1)  # missing await
except Exception as exc:
    log_and_discard(exc)

(I am assuming here that the only time you would intentionally catch one of these errors is in a catch-all handler, which I think is fair. I can't think of any other cases where people would catch this exception.)

OTOH, I am concerned about the cost of this. Frame teardown happens a lot, and inserting a Python function call in there every time seems like it could add substantial overhead. Of course, I don't actually know for sure, and unfortunately trio doesn't have any substantial "real" programs yet that could be used for realistic benchmarking. But it makes me hesitant; I really want our baseline protection to be something that people can leave running in production. Of course, this is assuming that trio would implement this by providing a pure-Python function that the interpreter would call at every frame teardown. If Python itself added a mode that checked for unawaited coroutines on, say, every frame teardown or yield, then that would be very different. (From inside the interpreter, it could be extremely cheap, just a single counter check.) |
@Carreau Huh, I think you may have found a docs bug in |
Haha, no, the doc is definitely right, but not user friendly. It explicitly states that the example given will **not** work. Which is a bit weird, as it screams to the user HERE IS AN EXAMPLE, and whispers "that does not work".
|
@Carreau Heh, apparently I already knew this a couple of years ago: https://bugs.python.org/issue24342#msg244621 I filed https://bugs.python.org/issue30578 to cover the docs readability problem. |
Codecov Report
@@ Coverage Diff @@
## master #176 +/- ##
=========================================
- Coverage 99.13% 99.03% -0.1%
=========================================
Files 67 68 +1
Lines 9213 9464 +251
Branches 663 682 +19
=========================================
+ Hits 9133 9373 +240
- Misses 62 71 +9
- Partials 18 20 +2
Continue to review full report at Codecov.
|
Not adding an exception to the tree? (If so, not sure where; I'm fine with both.)
Sure, but then there's one thing I don't understand: what else can raise or re-raise a MultiError? I would prefer to find and understand a case where this can happen, to write it the proper way. I'm pretty sure I'm going to realize a case where this can happen shortly after sending this, in which case I'll respond to myself. |
Well, sure enough, it took less than 10 minutes:

import trio

async def task(exception):
    raise exception('just because')

def handler(exc):
    return exc

async def run():
    with trio.MultiError.catch(handler):
        try:
            async with trio.open_nursery() as n:
                n.spawn(task, ValueError)
                n.spawn(task, OSError)
        except trio.MultiError as e:
            trio.sleep(0)
            print('hey')

trio.run(run) |
Yeah, in general you need to use |
I believe this should be ready (minus bikeshedding on the naming). I moved the coroutine wrapper into its own module, which exposes a single global instance that can track coroutine creation, and this instance is shared between the wrapper and the run loop.

One thing I'm unsure about is whether I should check regularly that we are still the installed hook on set_coro_wrapper, and/or whether we are still regularly called -- but we can probably push that on the user: if you are setting a coroutine wrapper, then you are responsible for still calling us. I could check by actually creating an unawaited coroutine and verifying that the coro_wrapper has intercepted it, but that might be for later and is not obviously necessary.

The error messages could be better, especially when the unawaited coroutine is not followed by a checkpoint before task exit (teach except_hook to skip some frames?).

And one more thing I'm unsure about is the refactor of the tutorial to remove the "Don't forget await" section. It is a really nice section, but inaccurate once this goes in. It's also a bit tough to explain how this works currently, as you kinda need to understand what a checkpoint is (which is explained later in the doc). I made a small attempt at writing part of it. |
@jonparrott just landed a useful restructure of https://packaging.python.org/, which separates out "Tutorials" (the essential "Get Started" pieces applicable to essentially everyone), "Guides" (opinionated walk-throughs of specific tasks that will only be applicable to a subset of users), and "Discussions" (topics where multiple valid options are available, and the goal is to enable people to assess the trade-offs for themselves, not make the decision for them). Assuming this change makes "Don't forget an await" less critical for new Trio users, it may be suitable for moving out to a guide aimed specifically at debugging cases of the new exception. |
General comments:
I think I want to make the call that initially, let's merge this as an experimental feature that's disabled by default, so we can experiment with it, get a sense of how expensive it is, etc., before deciding how to expose it to users and worrying about docs etc. Maybe key it off an env-var for now?
The other place besides MultiError.catch that we do a catch-all exception handler in trio is in Result.capture and Result.acapture. Those should also have a check for pending unawaited coroutine exceptions, similar to the one that we do when exiting a task.
For that matter, I guess there's an argument that when spawning a task we should always do something like
async def task_wrapper(async_fn, args):
    try:
        return await async_fn(*args)
    finally:
        raise_if_unawaited_coroutines()

# Then in the spawn code:
coro = Result.acapture(task_wrapper, async_fn, args)
# And then in the task runner we just wait for StopIteration
# No other exceptions are possible
...though I guess that creates complications both for KeyboardInterrupt protection and for debug tracebacks, so maybe it's not a great idea. Just thinking out loud. But conceptually these should be the same.
if origin_coro_wrapper:
    self._wrapped_coro_wrapper = origin_coro_wrapper
else:
    self._wrapped_protector = None
Can't we just do self._previous_coro_wrapper = sys.get_coroutine_wrapper(), and then later sys.set_coroutine_wrapper(self._previous_coro_wrapper), since these two functions agree on their interpretation of None?
trio/_core/_run.py (outdated)

@@ -354,7 +355,7 @@ def reap_and_unwrap(self, task):
         raise mexc

     def __del__(self):
-        assert not self.children and not self.zombies
+        assert not self.children and not self.zombies, "Children: {} and Zombies : {}".format(self.children, self.zombies)
79 character line limit
trio/_core/_run.py (outdated)

except NonAwaitedCoroutines as e:
    task.result = Error(e)
else:
    task.result = result
We don't want to chain off of the StopIteration, we want to chain off the exception in the result, if any :-).
I guess something like:
if task._unawaited_coros:
    exc = protector.make_non_awaited_coroutines_error(task._unawaited_coros)
    if type(result) is Error:
        exc.__context__ = result.error
    task.result = Error(exc)
else:
    task.result = result
trio/_core/_multierror.py (outdated)

if exc is not None:
    filtered_exc = MultiError.filter(self._handler, exc)
-   if filtered_exc is exc:
+   if filtered_exc is exc and not force_raise:
I think this would be clearer if rewritten like:
original_exc = exc
if protector.has_unawaited_coroutines():
    exc = ...
if exc is not None:
    exc = MultiError.filter(...)
if exc is original_exc:
    ...
Oh, and I realized there's a tricky bug here: in |
Done. Though I'm wondering how often we should check for unawaited coroutines. |
Since we got down to this being the only outstanding PR, I went ahead and merged the other PR to normalize formatting using yapf (#221). Then I took the liberty of rebasing this branch and fixing its formatting too. Hopefully I didn't mess anything up in the process. |
We install a coroutine wrapper (using sys.set_coroutine_wrapper) to keep track of created coroutines. Then every time we are in the trio task runner, we look at the state of the coroutines created since then, and if any are non-awaited, we inject an error into the task:

- If the task is not finished, we just raise at the next checkpoint.
- If the task is finished, we attempt to change its Result into an Error.
- We also check for non-awaited coroutines in MultiError.check.

In all cases a NonAwaitedCoroutines error is thrown, and we attempt to give as much information as possible on where the coroutine comes from. If `tracemalloc.start()` has been called earlier, we attempt to find the file and line number where this coroutine originated.

Debugging is still a bit weird, as you need to find the checkpoint in the middle of the stack trace (I guess we can improve that), even worse if the task has finished. One question is whether there is a way to get the previous checkpoint of the current task to narrow things down.

For internal use we provide an `await_later` function which is a pass-through for the given coroutine, but instructs trio to ignore it if it has not been awaited.
Now if a coro-wrapper is already set, we wrap it, and restore it on uninstall. We also assert on uninstall that we are the installed coro_wrapper.
Whitespace only commit. Just ran 'yapf -rip setup.py trio'.
No worries, I've been overwhelmed recently and I am planning on coming back to that at some point. Thanks! |
Hey Matthias, haven't heard from you in a while, hope everything's good :-). It'd be neat to look at this again some time, but I'm trying to clear out old PRs, and this is pretty stale, so closing for now. |
Right now that works only if the task is not finished. Not sure what to do if the task is done, as you have no occasion to throw in it. Should we still return the final_result as a Result, or make it an Error()?
Debugging is still a bit weird, as you need to find the checkpoint in the middle of the stack trace (I guess we can improve that). Other question: is there a way to get the previous checkpoint of the current task to narrow things down?
It's still hard to debug when the non-awaited coroutine is not at the same stack level as the schedule point. We may be able to do better by inspecting unawaited coro frames, maybe?
The await_later and context managers to relax/enforce await are not
exposed, and I'm unsure whether we want to have custom CoroProtectors
(likely yes for testing). We may also want to list all the unawaited
coroutines in the error message, and so far I have not tried with many
tasks, but the internals of Trio are still unfamiliar.
Docs and Tests are still missing.