
Raise at next Checkpoint if Non-awaited coroutine found. #176

Closed
Carreau wants to merge 10 commits into master from coro-protector

Conversation

@Carreau (Contributor) commented May 27, 2017

Right now that works only if the task is not finished. Not sure what to do if the task is done, as you have no opportunity to throw into it. Should we still return the final_result as a Result, or make it an Error()?

Debugging is still a bit weird as you need to find the checkpoint in the middle of the stacktrace (I guess we can improve that). Other question would be: is there a way to get the previous checkpoint of the current task to narrow things down?

It's still hard to debug when the non-awaited coroutine is not at the same stack level as the schedule point. We may be able to do better by inspecting the unawaited coro frames, maybe?

The await_later and context managers to relax/enforce await are not exposed, and I'm unsure whether we want to have custom CoroProtectors (likely yes for testing). We may also want to list all the unawaited coroutines in the error message. So far I have not tried with many tasks, and the internals of Trio are still unfamiliar to me.

Docs and Tests are still missing.

@njsmith (Member) commented May 28, 2017

Hey, this is really cool! Thanks for pushing on this.

Some general thoughts:

Hopefully, bpo-30491 will mean that this is something we can leave enabled all the time in py37+, and we'll eventually need to structure the code to handle the three cases of (a) 3.5/3.6 with this feature disabled, (b) 3.7+ with access to the unawaited coroutine count (but no information on which coroutines they are), and (c) any version with the coroutine wrapper installed to give full details. If we structure it this way from the start (using the coroutine wrapper to polyfill the proposal in bpo-30491), then that will let us make sure that bpo-30491 actually does what we need it to do.
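
As a very rough sketch, a 3.5/3.6-era polyfill of the bpo-30491 counter could look like the following; the wrapper class, the counter name, and the "claim on first use" logic are illustrative assumptions rather than this PR's code, and it glosses over the inspect.iscoroutine() issue that comes up later in the thread:

import sys

unawaited_count = 0  # stand-in for the counter proposed in bpo-30491

class _TrackedCoro:
    """Counts a coroutine as unawaited from creation until it is first driven."""

    def __init__(self, coro):
        global unawaited_count
        self._coro = coro
        self._pending = True
        unawaited_count += 1

    def _claim(self):
        global unawaited_count
        if self._pending:
            self._pending = False
            unawaited_count -= 1

    def __await__(self):
        self._claim()
        return self._coro.__await__()

    def send(self, value):
        self._claim()
        return self._coro.send(value)

    def throw(self, *exc_info):
        self._claim()
        return self._coro.throw(*exc_info)

    def close(self):
        self._claim()
        return self._coro.close()

# Every "async def" call site now goes through _TrackedCoro; the run loop can
# compare unawaited_count before and after each task step to detect a problem.
sys.set_coroutine_wrapper(_TrackedCoro)  # 3.5/3.6-era hook, deprecated/removed later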

It's not totally clear to me what the public API should look like... a with block might make sense as a convenient internal implementation, but I don't think it's what we want to expose to the user. Eventually trio.run should do some of this automatically and unconditionally, and for turning on the full-fledged debug mode then the main use cases are (a) test runner enabling for all tests, (b) user wants to re-run a program with this enabled. So I think we want something more like an environment variable?

I'm fine with not providing any way to disable checking for individual coroutines, at least in the initial version: I really don't think we ever need to turn it off. And I figure, either I'm right and we can keep the code simple and remove temptation for people to write weird things, or else I'm wrong and making it impossible will force us to understand the problem better :-).

Obviously we do need some way to avoid it triggering for actual coroutine spawns. Probably the simplest way is to make it track coroutine objects that have been created but for which coro.__await__() hasn't been called, and then make Runner.spawn_impl call __await__ on new coroutine objects. That's already slightly different than what I wrote in bpo-30491 (there I suggested that we should track from construction until the first next/send/throw), so this exercise is already being useful in that respect :-).

When we detect a problem, signalling it is not entirely trivial, because the run loop is a delicate place. Fortunately, once we've detected a problem we're 100% certain that the user has a blatant and easy-to-fix bug that has already caused their program to go off the rails, so we have a lot of latitude in how we signal it.

Trying to inject an exception into the task where the problem actually happened is problematic, as you've noted, because it might already be gone. I suppose technically even for a finished task we could change its Result to be an exception so that the parent task sees it when it unpacks it, but this seems like it might be more complex than it's worth.

One option is to do the same thing we do for KeyboardInterrupts that we can't immediately deliver: we inject them into the main task, and also check for undelivered ones on our way out of the run loop. The "is a KeyboardInterrupt pending" flag is Runner.ki_pending, and when it arrives we try to wake up the main task via _deliver_ki_cb which calls main_task._attempt_delivery_of_pending_ki(), and if it's already running then the next time it goes to sleep then run_impl also checks the ki_pending flag and calls _attempt_delivery_of_pending_ki(). Basically grep for ki_pending to see all the machinery :-). This could be generalized to allow for other weird "blow up the world" exceptions.

Another option would be to just raise the exception out of run_impl directly. This would have the effect of abandoning all running tasks (i.e. they'd just never get scheduled again, and wouldn't have a chance to release any resources or do any graceful shutdowns, etc.). So I don't like this idea much; in every other case we do guarantee that cleanups run. It makes more sense in this case than in pretty much any other one, so it's worth considering - but I'm leaning towards not.

Regarding the actual error we report: in "fast" mode, where we only have the count (either b/c we're on 3.7, or because we're using a polyfill to pretend to be), then the thing to do is probably to collect a stack trace from the last coroutine that ran, and present that to the user with some text saying "here's where we detected the problem, it actually happened just before this". It's unfortunate that there's no way to get more details at that point, but I think it's unlikely we can convince @1st1 that cpython should provide a way to find all unawaited coroutine objects due to the runtime overhead. (I guess this could be pretty cheap if we use an intrusive list in coroutine objects...?) Hooking the warning machinery too might help, since in many (but not all) cases the Warning: coroutine '...' was never awaited message will have fired.

In "slow" mode, where we use a wrapper and keep track of individual coroutines, then maybe we don't want to rely on tracemalloc and should save the information ourselves? A this point the user is explicitly running in slow mode to get detailed information, quite possibly because they just got a not-as-helpful error message from the "fast" mode machinery, so we definitely want to provide the detailed information, and asking them to enable tracemalloc as well seems unfriendly. Or maybe we could enable it ourselves? And tracemalloc has quite a bit of overhead because it saves data on every Python object, not just coroutines. ...This detail is something we can fine-tune later, though.

@Carreau (Contributor, Author) commented May 28, 2017

Hey, this is really cool! Thanks for pushing on this.

No worries, I got some questions post-PyCon, as some people told me that the "getting started with asyncio" talk was not exactly what they were expecting. In particular the Chess Master working on multiple games felt to them like (thinking == CPU intensive), so I decided to write a toy runner for myself, and wrote this piece just for me to play with, so it was easy to port here :-)

It's not totally clear to me what the public API should look like... a with block might make sense as a convenient internal implementation, but I don't think it's what we want to expose to the user.

I think I'm starting to agree with "no context manager": if you use it and inadvertently await a function inside the context manager, then you also remove the protection for the awaited coroutine... which may itself not await anything.

I still think we should have an explicit way on a per-coroutine basis, mostly for interoperability with frameworks that will use trio and may do some crazy things. I'm fine leaving it in hazmat with an unstable API. It seems reasonable, but I have way less experience with async than you, so I'll trust you.

Env var and test

Yes, agreed. I think it should be slow mode by default. Otherwise no-one will use it. And that goes hand in hand with "make it work make it right make it fast".

Obviously we do need some way to avoid it triggering for actual coroutine spawns. Probably the simplest way is to make it track coroutine objects that have been created but for which coro.__await__() hasn't been called, and then make Runner.spawn_impl call __await__ on new coroutine objects. That's already slightly different than what I wrote in bpo-30491 (there I suggested that we should track from construction until the first next/send/throw), so this exercise is already being useful in that respect :-).

I'll reread that later. My brain is tired from the last few days; I follow each point, but the whole paragraph does not yet click in my head.

Trying to inject an exception into the task where the problem actually happened is problematic, as you've noted, because it might already be gone. I suppose technically even for a finished task we could change its Result to be an exception so that the parent task sees it when it unpacks it, but this seems like it might be more complex than it's worth.

It might not be that hard in fact. I may give it a try.

One option is to do the same thing we do for KeyboardInterrupts that we can't immediately deliver: we inject them into the main task, and also check for undelivered ones on our way out of the run loop. The "is a KeyboardInterrupt pending" flag is Runner.ki_pending, and when it arrives we try to wake up the main task via _deliver_ki_cb which calls main_task._attempt_delivery_of_pending_ki(), and if it's already running then the next time it goes to sleep then run_impl also checks the ki_pending flag and calls _attempt_delivery_of_pending_ki(). Basically grep for ki_pending to see all the machinery :-). This could be generalized to allow for other weird "blow up the world" exceptions.

Yeah, I partially went this route, but had the weird bug that the NonAwaitedCoroutine exception was delivered at checkpoint n+1, and a Cancelled exception at checkpoint n+2. I still kinda think that a non-awaited coroutine is more a bug than a cancellation and may be less respectful of context, but I'll give that route another chance.

Another option would be to just raise the exception out of run_impl directly. This would have the effect of abandoning all running tasks (i.e. they'd just never get scheduled again, and wouldn't have a chance to release any resources or do any graceful shutdowns, etc.). So I don't like this idea much; in every other case we do guarantee that cleanups run. It makes more sense in this case than in pretty much any other one, so it's worth considering - but I'm leaning towards not.

Agreed.

Regarding the actual error we report: in "fast" mode, where we only have the count (either b/c we're on 3.7, or because we're using a polyfill to pretend to be), then the thing to do is probably to collect a stack trace from the last coroutine that ran, and present that to the user with some text saying "here's where we detected the problem, it actually happened just before this"

Yeah, the "Just before" is tricky. It's definitively not how "usual" stack trace work especially if the "just before" can be "Just before in any of the higher frames".

This makes me think that an actual context manager/decorator, like with track_unawaited_coro: provided by trio, might help with debugging.

In "slow" mode, where we use a wrapper and keep track of individual coroutines, then maybe we don't want to rely on tracemalloc and should save the information ourselves?

I'm unsure how. If it's user-defined coroutines, or even another library, that have the bug, you can't really track that, can you? Or do you think using set_coro_wrapper and sys.get_frames? That seems crazy to me.

@njsmith (Member) commented May 28, 2017

Though... on further thought... maybe it is reasonable to ask the interpreter to keep a record of all unawaited coroutines?

The implementation I'm imagining is: each coroutine object has 2 extra pointer values embedded in it. In py36, the minimal size of a coroutine is at least:

>>> import sys
>>> async def f():
...     pass
... 
>>> coro = f()
>>> sys.getsizeof(coro) + sys.getsizeof(coro.cr_frame) + sys.getsizeof(coro.cr_frame.f_locals)
736

so adding 2 pointers would only increase the size by ~2% at most. We use the two pointers to maintain a doubly-linked list, protected by the GIL: there's a global/thread-local holding the list head, and when a coroutine is created, it inserts itself into the list, and when it's awaited, it removes itself from the list. For an intrusive doubly-linked list these are both O(1) and very cheap, just a few pointer assignments: e.g. to remove you do something like self->prev->next = self->next; self->next->prev = self->prev; self->next = self->prev = NULL. And then there'd be some object in sys or somewhere that let one examine and clear the list.

The list references could be either strong or weak. Implementation-wise, strong refs would mean adding some Py_INCREF and Py_DECREF calls when adding/removing from the list. For weak refs, we'd leave out the Py_{INC,DEC}REF and make tp_del unlink. (Note that while these refs would have weak-ref semantics, because they'd be built-in to the coroutine object there wouldn't be any reason to involve the official weakref machinery and __weakrefs__ and all that.)

Semantics-wise, the difference would be: say we have a piece of code like trio or pytest-asyncio that wants to start tracking unawaited coroutines. With strong refs, the tracking would have to be disabled by default (to prevent memory leaks), so our code would have to be careful about turning it on and off again at the right places, but in exchange it would get exact and reliable tracking of unawaited coroutines. With weak refs, the tracking could be on by default, but we'd also need to hook the warnings module to get full tracking, because sometimes a coroutine might get allocated and then GC'ed in between checks.

The advantage of this is that after things go wrong we get full access to the coroutine objects to decide what to do with them – in particular we can get the coroutine name, but can also filter based on attributes, ask tracemalloc if it knows where the object was allocated, etc. Yet, until things go wrong, it's still very cheap -- on the same order of magnitude as a simple counter.

@njsmith (Member) commented May 28, 2017

Hmm, there is gc.get_objects(), which I guess we could iterate over to find unawaited coroutines. This seems a bit ridiculous though...
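
For what it's worth, the "ridiculous" version is only a few lines: scan the GC heap for coroutines still sitting in the CORO_CREATED state (with the caveat, discussed below, that freshly spawned but not-yet-scheduled tasks are also in that state):

import gc
import inspect

def find_unstarted_coroutines():
    """Return all live coroutine objects that were created but never started."""
    return [
        obj
        for obj in gc.get_objects()
        if inspect.iscoroutine(obj)
        and inspect.getcoroutinestate(obj) == "CORO_CREATED"
    ]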

@njsmith (Member) commented May 28, 2017

I still think we should have an explicit way [to opt out] on a per-coroutine basis, mostly for interoperability with frameworks that will use trio and may do some crazy things.

Hmm, interoperability is actually an important point: if we want to provide some way to support running "legacy" asyncio libraries under trio (#171) then we'll need a way to dial this back. But I don't think we can ask people who want to use, say, asyncpg, to go add some decorator every internal function in asyncpg. And ideally we wouldn't disable the checking for all code in the process, but would restrict it to the "legacy" code. Of course, this might or might not be possible depending on the interoperability strategy. But the two main ones I've thought of so far could support that. (The two are: (1) run asyncio in a thread, where our tracking would automatically be disabled b/c it's thread-local, and (2) have an "asyncio-flavored" task in trio that holds the asyncio loop -- see #171 for details -- so the run loop could know which tasks are asyncio-flavored and ignore their unawaited coroutines when it does its check.) And in any case, this is something we can worry about after the basics are working :-).

Yes, agreed. I think it should be slow mode by default. Otherwise no-one will use it. And that goes hand in hand with "make it work make it right make it fast".

Well, the reason I'm thinking about wacky fallbacks like "what if we only had a count of unawaited coroutines but didn't know what they were", or "how can the interpreter track unawaited coroutines really efficiently" is that I want to be able to turn this on by default, but I'm worried that set_coroutine_wrapper is too expensive to use by default.

Obviously we do need some way to avoid it triggering for actual coroutine spawns [...]

[...] paragraph does not yet click in my head.

Sorry, probably it is just me being too telegraphic :-).

The problem I'm worrying about is: when you actually spawn some tasks, like:

async with trio.open_nursery() as nursery:
    nursery.spawn(f1)
    nursery.spawn(f2)

then trio creates two new Task objects, each of which holds a coroutine object, and sticks the new tasks on the run queue. But until the Tasks actually get scheduled and run, their coroutines are sitting in the CORO_CREATED state. (And we have to create the coroutine object immediately inside spawn, because we want spawn to fail synchronously if you pass the wrong arguments to the function. I guess this isn't super critical, but it's certainly nice.) So I think with how this PR is currently written, it would complain about my little code sample above, because at least one checkpoint will be executed between the two calls to spawn and when the two tasks start running.

await f() expands to: awaitable = f(); gen = awaitable.__await__(); yield from gen. For coroutines, coro.__await__() just does return self, so the second step is a no-op. My thought was that the __await__ call is a convenient place for an eventual built-in feature like bpo-30491 to key off of, so we might want to start using it now as the "stop worrying about this coroutine" signal, inside the wrapper and inside spawn.
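
A minimal sketch of that "claim on __await__" idea, reusing the kind of tracking wrapper described above; _pending, _TrackedCoro, and spawn_impl are illustrative names, not trio's real internals:

import sys

_pending = set()  # coroutines created but not yet claimed by an await or a spawn

class _TrackedCoro:
    def __init__(self, coro):
        self._coro = coro
        _pending.add(self)

    def __await__(self):
        _pending.discard(self)          # first __await__ means "someone is using this"
        return self._coro.__await__()

sys.set_coroutine_wrapper(_TrackedCoro)  # 3.5/3.6-era hook

def spawn_impl(async_fn, *args):
    coro = async_fn(*args)              # create eagerly so bad arguments fail synchronously
    if isinstance(coro, _TrackedCoro):
        coro.__await__()                # claim it; this does not start the coroutine
    # ... wrap `coro` in a Task and put it on the run queue ...
    return coro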

Yeah, I partially went this route, but had the weird bug that the NonAwaitedCoroutine exception was delivered at checkpoint n+1, and a Cancelled exception at checkpoint n+2.

Yeah, that makes sense. But ultimately, as long as we're cleaning up tasks on the way out, then there are going to be a whole bunch of checkpoints in between when we detect the problem and when the user actually sees it, so this doesn't seem like a big deal? Obviously we would want to capture information about where the problem actually occurred in our message, though, e.g. by taking a stack trace from the just-run coroutine immediately after we detect the problem.

I still kinda think that a non-awaited coroutine is more a bug than a cancellation and may be less respectful of context, but I'll give that route another chance.

Yeah, maybe...

Yeah, the "Just before" is tricky. It's definitively not how "usual" stack trace work especially if the "just before" can be "Just before in any of the higher frames".

Yeah. It can even be inside a lower frame that exited before the next checkpoint! But unfortunately, there's no way to get a prompt error without help from the interpreter, which doesn't look like it will be forthcoming in 3.7, and there's no way to reconstruct the correct source information after the fact without doing expensive data recording for every single coroutine, so in "default mode" we just have to put up with confusing error messages I guess :-(.

Or do you think using set_coro_wrapper and sys.get_frames? That seems crazy to me.

Why? That's what tracemalloc is doing, except it's getting the frames on every call to Python's low-level memory allocation functions :-). It'd be something like traceback.StackSummary.extract(traceback.walk_stack(), lookup_lines=False).
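
Concretely, the cheap "remember where this coroutine came from" capture might look like this (lookup_lines=False defers reading source files until the traceback is actually rendered):

import sys
import traceback

def capture_creation_site():
    """Snapshot the caller's stack without loading source lines."""
    frame = sys._getframe(1)  # skip this helper's own frame
    return traceback.StackSummary.extract(
        traceback.walk_stack(frame), lookup_lines=False
    )

# A coroutine wrapper could stash capture_creation_site() per coroutine and only
# call .format() on it later, if that coroutine turns out never to be awaited.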

@ncoghlan commented:

Python already keeps track of all live threads via list(threading.enumerate()) (as well as of all live interpreters if you're using CPython's subinterpreter support), so requesting a comparable level of runtime accounting for native coroutines seems reasonable. Independently of your use case here, such a mechanism would be useful for doing system state dumps in tools like faulthandler, since it would make it straightforward to dump suspended coroutines in addition to running ones.

@njsmith (Member) commented May 28, 2017

@ncoghlan: Interesting point! The requirement to detect unawaited coroutines makes this a bit complicated; a list of every live coroutine is less useful, because we can't afford to iterate over the list checking if each one is unawaited after each scheduling step -- that would be effectively quadratic. A possible refinement would be to track a list of all coroutines + a counter of unawaited coroutines, so we use the counter to detect the problem and then scan the list to diagnose it. There still are some possible issues with coroutines that get garbage collected before we notice the problem, and with cases where there are independent coroutine schedulers running in different threads... I'd have to think about that more.

@ncoghlan commented:

@njsmith Yeah, I think having the full list would be a complement to your "unawaited counter" idea. Just pointing out that it would be a useful feature in its own right, and has precedent in terms of how the standard library handles threads and how CPython handles subinterpreters.

@Carreau (Contributor, Author) commented May 28, 2017

Keep 4 linked lists of coroutines, one for each state they can be in? Then it's easy to get all the coroutines in a specific state. That would not increase the memory cost per coroutine much, as you still have only 2 pointers. And it's still easy to get them all.

With the set_coro_wrapper slow path, you wouldn't have issues with GC'd coroutines, as we would keep a reference so they can't be GC'd :-) I think saying that some cases are only handled by the slow path is reasonable.

If you really want it and CPython accepts, you could also have a 5th linked list of zombie coroutines, never awaited (never closed?), that got GC'd. But that seems extreme.

@Carreau (Contributor, Author) commented Jun 4, 2017

What do you think the raised exceptions should be in the following cases:

  • Cancelled and non-awaited coroutines
  • A raise and unawaited coroutines:
async def foo():
    sleep(0)
    raise ValueError(...)

MultiError as well in both cases? Seems a bit weird, but I'm not familiar yet with multierrors.
It seems reasonable (to me) that if an exception is raised then you may have an unawaited coroutine. Or should we (try to) propagate the unawaited coro to the parent task?

@njsmith (Member) commented Jun 4, 2017

In general non-awaited coroutine = buggy program; it's not going to do what the user wants regardless of how we signal the error. So we have a lot of reasonable options for delivering the exception, including things that corrupt application state. (KeyboardInterrupt can also blindly overwrite other exceptions.) Unfortunately having a lot of options means that I'm not sure what the best choice is :-)

Regarding your second example (raise), I don't think it's an issue – the soonest we could detect the unawaited coroutine is when it passes through the next checkpoint, and raise isn't a checkpoint.

I guess lacking any strong intuition, my default would be to do the same thing we do for KeyboardInterrupt, because it's (a) plausible, and (b) we already know how to do it. So that would mean that we raise a special exception inside the main task as if it were cancelled (and I guess this exception should have higher priority than delivering a Cancelled error). (Speaking of which, it looks like currently Cancelled has higher priority than KeyboardInterrupt... I guess that's a bug. #184)

@njsmith (Member) commented Jun 4, 2017

On further thought:

There is a strong argument for delivering the exception "locally" to the offending task: imagine a WSGI-like web server, where one handler has a missing-await bug. Even in my fantasy version of Python where an unawaited async function call immediately raises TypeError, this TypeError would probably get caught and logged by a generic except Exception: handler (+ trigger a 500 Internal Error response), rather than crashing the server. I guess the exception could be class BelatedTypeError(TypeError) and if it has to overwrite another exception then we can use implicit chaining so that the original exception remains visible.

However, there is a problem: suppose the code looks like this:

async def wsgi_like_server_internal_handler():
    while True:
        request = await get_request_from_network()
        try:
            await buggy_user_handler(request)
        except Exception as exc:
            log_and_ignore_exception()

async def buggy_user_handler(request):
    sleep(0)
    return

Here the next checkpoint after the buggy sleep(0) doesn't occur until the next pass through the loop, in get_request_from_network -- i.e., the error can "leak out" of the buggy code into some other code and then crash the whole program, because "the next checkpoint" is a dynamic, linear property that doesn't respect things like try block nesting.

I don't have any great solution here. It's very hard to give errors that are useful and reliable when Python is trying this hard to stop us :-(.

Arguably we should still deliver the errors as locally as we can, on the theory that this way we at least might deliver them to the right place. I guess in the HTTP server case it's unlikely that we'd actually escape out to the next loop iteration before passing through a checkpoint, since sending a response requires executing a checkpoint? But it might happen that, say, the handler raises an error (perhaps due to shrapnel from the missing await), so we end up in the catch-all except block and try to send a 500 response from there, and then this attempt to send is what notices the problem and raises BelatedTypeError. (Or the handler could finish sending its response and then call the unawaited coroutine.) I'm uncomfortable with the idea that we might mostly deliver them to the right place, but not always. We could supplement this with some sort of explicit check-for-unawaited-coroutines function and hope that people remember to call it before leaving the catch-all handler. (Probably expressed as a context manager, since the check should effectively always be inside a finally block.)
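
Such a containment helper might look something like this; contain_unawaited_coroutines, the tracker object, and the placeholder exception class are all assumed names sketching the shape, not this PR's actual API:

from contextlib import contextmanager

class NonAwaitedCoroutines(Exception):
    """Placeholder standing in for the error type this PR introduces."""

@contextmanager
def contain_unawaited_coroutines(tracker):
    """Raise at block exit if any coroutine created inside the block went unawaited."""
    try:
        yield
    finally:
        leftovers = tracker.drain_unawaited()  # assumed tracker method
        if leftovers:
            # Raising here while another exception is in flight sets __context__
            # automatically, so the original error stays visible via chaining.
            raise NonAwaitedCoroutines(leftovers)

# Intended use, wrapped around a catch-all handler:
#
#     with contain_unawaited_coroutines(tracker):
#         try:
#             await buggy_user_handler(request)
#         except Exception as exc:
#             log_and_ignore_exception(exc)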

It's all very ugly, but I don't have a better idea.

Implementation-wise, I think this would look like:

  • Add an attribute to Task objects that holds a list of unawaited coroutines (or whatever information about them we have)

  • After running each task step, check for unawaited coroutines and if found add them to the task's list

  • In _attempt_delivery_of_any_pending_cancel, check to see if the list is non-empty and if so inject the error (and clear the list)

  • in the task cleanup code, if the list is non-empty, replace the task's Result with an error

  • in yield_if_cancelled, yield if the list is non-empty
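
A very rough shape of the second and third bullets, with assumed names throughout (this is not trio's actual scheduler code):

def after_task_step(runner, task):
    # bullet 2: sweep up anything created-but-unawaited during this step
    newly_unawaited = runner.protector.pop_new_unawaited()  # assumed tracker API
    task._unawaited_coros.extend(newly_unawaited)

def attempt_delivery_of_any_pending_cancel(runner, task):
    # bullet 3: an unawaited-coroutine error outranks an ordinary Cancelled delivery
    if task._unawaited_coros:
        exc = runner.protector.make_error(task._unawaited_coros)  # assumed helper
        task._unawaited_coros.clear()
        runner.reschedule(task, Error(exc))  # assumed delivery mechanism
        return
    # ... otherwise fall through to the normal Cancelled handling ...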

@ncoghlan commented Jun 4, 2017

Would a hypothetical hook on coroutine frame termination potentially help here? If so, you might be able to emulate it with a suitable coroutine wrapper that ensures there's always a loop checkpoint between the coroutine terminating and the calling await resuming execution.

@Carreau (Contributor, Author) commented Jun 5, 2017

Add an attribute to Task objects that holds a list of unawaited coroutines (or whatever information about them we have)

After running each task step, check for unawaited coroutines and if found add them to the task's list

In _attempt_delivery_of_any_pending_cancel, check to see if the list is non-empty and if so inject the error (and clear the list)

in the task cleanup code, if the list is non-empty, replace the task's Result with an error

That's what I currently have locally but need to clean up.

some sort of explicit check-for-unawaited-coroutines function

do you mean sleep(0) 😄 ?

(Probably expressed as a context manager, since the check should effectively always be inside a finally block.)

In the form of a context manager, would you expect it to keep a reference to unawaited coroutines created in the CM?

Would a hypothetical hook on coroutine frame termination potentially help here? If so, you might be able to emulate it with a suitable coroutine wrapper that ensures there's always a loop checkpoint between the coroutine terminating and the calling await resuming execution.

Do you mean use set_coro_wrapper and have it return @types.coroutine wrappers that sleep(0)? I'm going to see if I can do that.

@Carreau (Contributor, Author) commented Jun 5, 2017

Do you mean use set_coro_wrapper and have it return @types.coroutine wrappers that sleep(0)? I'm going to see if I can do that.

Well, that goes out the window, because trio expects native coroutines and you can't do that in set_coro_wrapper.

@ncoghlan commented Jun 5, 2017

@Carreau I hadn't fully thought through the suggestion, but taking a closer look at https://docs.python.org/3/library/sys.html#sys.set_coroutine_wrapper, I think the kind of wrapper I had in mind would look like:

async def checkpointed_wrapper(coro):
    try:
        result = await coro
    finally:
        await sleep(0) # Check for unawaited coroutines
    return result

def ensure_terminal_checkpoint(coro):
    return checkpointed_wrapper(coro)

sys.set_coroutine_wrapper(ensure_terminal_checkpoint)

@Carreau (Contributor, Author) commented Jun 5, 2017

@Carreau I hadn't fully thought through the suggestion, but taking a closer look at https://docs.python.org/3/library/sys.html#sys.set_coroutine_wrapper, I think the kind of wrapper I had in mind would look like:

Well, that's what I thought at first too, but:

The wrapper callable cannot define new coroutines directly or indirectly:

So you can't. It does raise a RuntimeError if you try. I tried with a types.coroutine, but that fails as well because trio expects a native coroutine:

if not inspect.iscoroutine(coro):
    raise TypeError("spawn expected an async function")

@njsmith (Member) commented Jun 5, 2017

do you mean sleep(0) 😄 ?

Well, that will work :-). Though it could also be a synchronous context manager that only runs the raise-if-unawaited-coroutines-detected logic. In any case it's probably worth having a dedicated API for this so that the user's intention is clear.

I guess one way to do it would be to provide a with catch_all(logging_fn): ... context manager, which provides the catch-all semantics and the unawaited-coroutine containment semantics in a single level of indentation.

Would a hypothetical hook on coroutine frame termination potentially help here? If so, you might be able to emulate it with a suitable coroutine wrapper that ensures there's always a loop checkpoint between the coroutine terminating and the calling await resuming execution.

Well, the obvious objection is that we can't automagically insert checkpoints everywhere, because the location of checkpoints is part of trio's semantics. But this is easily handled by only checking for unawaited coroutines rather than executing a full checkpoint. And that inspect.iscoroutine check that @Carreau ran into can easily be loosened. So, hmm, let's assume these trivial matters are handled and think about the actual issues.

It's a funny suggestion in some sense because (AFAICT) async frame teardown doesn't form any kind of intrinsic boundary with respect to catch-all exception handlers... it's certainly possible to accidentally write code like:

try:
    await do_something()
    sleep(1)
except Exception as exc:
    log_and_discard(exc)

(I am assuming here that the only time you would intentionally catch one of these errors is in a catch-all handler, which I think is fair. I can't think of any other cases where people catch the NameError from slepe(0) or the TypeError from sleep(0, 0), and this is similar.) So I think we should think of this as just a way to insert a lot more checks willy-nilly throughout the program. Which is definitely useful – it increases the probability of catching these errors promptly and close to the source. But since it's not 100% guaranteed, I think we'd still need to provide and document the context manager for containing unawaited-coroutine errors.

OTOH, I am concerned about the cost of this. Frame teardown happens a lot, and inserting a Python function call in there every time seems like it could add substantial overhead. Of course, I don't actually know for sure, and unfortunately trio doesn't have any substantial "real" programs yet that could be used for realistic benchmarking. But it makes me hesitant; I really want our baseline protection to be something that people can leave running in production.

Of course, this is assuming that trio would implement this by providing a pure-Python function that the interpreter would call at every frame teardown. If Python itself added a mode that checked for unawaited-coroutines on, say, every frame teardown or yield, then that would be very different. (From inside the interpreter, it could be extremely cheap, just a single if (unlikely(unawaited_coro_list_head != NULL)) { ... } at each place.) But this might be reaching a level of intrusiveness that Yury et al would be unwilling to go along with... I'm not even sure he's willing to track unawaited coroutines :-).

@ncoghlan commented Jun 5, 2017

@Carreau Huh, I think you may have found a docs bug in sys.set_coroutine_wrapper, as the example given is going to encounter the recursion problem you describe.

@Carreau (Contributor, Author) commented Jun 5, 2017 via email

Carreau force-pushed the coro-protector branch 2 times, most recently from 3d66569 to 78de594 on June 6, 2017
@ncoghlan commented Jun 6, 2017

@Carreau Heh, apparently I already knew this a couple of years ago: https://bugs.python.org/issue24342#msg244621

I filed https://bugs.python.org/issue30578 to cover the docs readability problem.

Carreau force-pushed the coro-protector branch 2 times, most recently from dbac435 to 076ab9c on June 7, 2017
codecov bot commented Jun 7, 2017

Codecov Report

Merging #176 into master will decrease coverage by 0.09%.
The diff coverage is 96.11%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #176     +/-   ##
=========================================
- Coverage   99.13%   99.03%   -0.1%     
=========================================
  Files          67       68      +1     
  Lines        9213     9464    +251     
  Branches      663      682     +19     
=========================================
+ Hits         9133     9373    +240     
- Misses         62       71      +9     
- Partials       18       20      +2
Impacted Files Coverage Δ
trio/_core/_exceptions.py 100% <100%> (ø) ⬆️
trio/_core/tests/test_run.py 100% <100%> (ø) ⬆️
trio/_core/_multierror.py 100% <100%> (ø) ⬆️
trio/_core/_result.py 90.56% <70.58%> (-9.44%) ⬇️
trio/_core/tests/test_result.py 97.03% <89.47%> (-2.97%) ⬇️
trio/_core/_run.py 99.84% <97.36%> (-0.16%) ⬇️
trio/_core/_non_awaited_coroutines.py 98.07% <98.07%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 027d374...079b7b7. Read the comment docs.

@Carreau (Contributor, Author) commented Jun 13, 2017

overwriting the incoming exception (if any) with the new one, and then passing the new one to MultiError.filter.

Not adding an exception to the tree? (If so, not sure where; I'm fine with both.)

Not sure what you mean here... MultiError.catch can be called in all kinds of places, not just around task exit?

Sure, but then one thing I don't understand: what else can raise or re-raise a MultiError?
What's the point of checking for unawaited coroutines in MultiError.catch, if they have already been transformed into NonAwaitedExceptions? I am likely the one missing something here, but I don't see how an unawaited coroutine can be created between the raising of a MultiError and it being caught.

I would prefer to find and understand a case where this can happen, so I can write it the proper way. I'm pretty sure I'm going to realize a case where this can happen shortly after sending this, in which case I'll respond to myself.

@Carreau (Contributor, Author) commented Jun 13, 2017

Well sure enough it took less than 10 minutes:

import trio

async def task(exception):
    raise exception('just because')

def handler(exc):
    return exc

async def run():
    with trio.MultiError.catch(handler):
        try:
            async with trio.open_nursery() as n:
                n.spawn(task, ValueError)
                n.spawn(task, OSError)
        except trio.MultiError as e:
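            # note: missing "await" below -- this is what creates the unawaited coroutine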
            trio.sleep(0)
        print('hey')
trio.run(run)

@njsmith (Member) commented Jun 13, 2017

Yeah, in general you need to use MultiError.catch if you want to accurately handle exceptions coming out of an arbitrary block of code that might contain sub-tasks, but that doesn't mean that the unawaited coroutine is guaranteed to happen inside one of those subtasks; it could be directly inside the MultiError.catch's body.

Carreau force-pushed the coro-protector branch 4 times, most recently from 79eac9e to e6952c4 on June 16, 2017
@Carreau (Contributor, Author) commented Jun 16, 2017

I believe this should be ready (minus bikeshedding on the naming).

I moved the coroutine wrapper into its own module, which exposes a single global instance that can track coroutine creation. This instance is shared between the MultiError.catch logic and trio.run.
trio.run (when configured to disallow unawaited coroutines) will install a coroutine_wrapper and uninstall it on exit; if a coro_wrapper is already set, it will wrap it and restore it on exit.

One thing I'm unsure about is whether I should check regularly that we are still the installed hook on set_coro_wrapper, and/or whether we are still being called regularly – but we can probably push that on the user: if you are setting a coroutine wrapper, then you are responsible for still calling us.

I can do that by actually creating an unawaited coroutine and checking that the coro_wrapper has intercepted it, but that might be for later and not obviously necessary.

The error messages could be better, especially when the unawaited coroutine is not followed by a checkpoint before task exit (teach excepthook to skip some frames?).

And one more thing I'm unsure about is the refactor of the tutorial to remove the "Don't forget await" section. It is a really nice section, but inaccurate once this goes in. It's also a bit tough to explain how this works currently as you kinda need to understand what a checkpoint is (which is explained later in the doc). I made a small attempt at writing part of it.

@ncoghlan commented:

@jonparrott just landed a useful restructure of https://packaging.python.org/, which separates out "Tutorials" (the essential "Get Started" pieces applicable to essentially everyone), "Guides" (opinionated walk-throughs of specific tasks that will only be applicable to a subset of users), and "Discussions" (topics where multiple valid options are available, and the goal is to enable people to assess the trade-offs for themselves, not make the decision for them).

Assuming this change makes "Don't forget an await" less critical for new Trio users, it may be suitable for moving out to a guide aimed specifically at debugging cases of the new exception.

@njsmith (Member) left a review:

General comments:

I think I want to make the call that initially, let's merge this as an experimental feature that's disabled by default, so we can experiment with it, get a sense of how expensive it is, etc., before deciding how to expose it to users and worrying about docs etc. Maybe key it off an env-var for now?

The other place besides MultiError.catch that we do a catch-all exception handler in trio is in Result.capture and Result.acapture. Those should also have a check for pending unawaited coroutine exceptions, similar to the one that we do when exiting a task.

For that matter, I guess there's an argument that when spawning a task we should always do something like

async def task_wrapper(async_fn, args):
    try:
        return await async_fn(*args)
    finally:
        raise_if_unawaited_coroutines()

# Then in the spawn code:
coro = Result.acapture(task_wrapper, async_fn, args)

# And then in the task runner we just wait for StopIteration
# No other exceptions are possible

...though I guess that creates complications both for KeyboardInterrupt protection and for debug tracebacks, so maybe it's not a great idea. Just thinking out loud. But conceptually these should be the same.

if origin_coro_wrapper:
    self._wrapped_coro_wrapper = origin_coro_wrapper
else:
    self._wrapped_protector = None
@njsmith (Member):

Can't we just do self._previous_coro_wrapper = sys.get_coroutine_wrapper(), and then later sys.set_coroutine_wrapper(self._previous_coro_wrapper), because these two functions agree on their interpretation of None?
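
i.e. something like the following, since get/set agree that None means "no wrapper installed" (a sketch; CoroProtector's tracking details are assumed):

import sys

class CoroProtector:
    def install(self):
        self._previous_coro_wrapper = sys.get_coroutine_wrapper()  # may be None
        sys.set_coroutine_wrapper(self._wrap)

    def uninstall(self):
        sys.set_coroutine_wrapper(self._previous_coro_wrapper)  # None is fine here too

    def _wrap(self, coro):
        # Chain to whatever wrapper was installed before us, then do our own tracking.
        if self._previous_coro_wrapper is not None:
            coro = self._previous_coro_wrapper(coro)
        self._register(coro)  # assumed: record it as created-but-not-yet-awaited
        return coro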

@@ -354,7 +355,7 @@ def reap_and_unwrap(self, task):
         raise mexc
 
     def __del__(self):
-        assert not self.children and not self.zombies
+        assert not self.children and not self.zombies, "Children: {} and Zombies : {}".format(self.children, self.zombies)
@njsmith (Member):

79 character line limit

except NonAwaitedCoroutines as e:
    task.result = Error(e)
else:
    task.result = result
@njsmith (Member):

We don't want to chain off of the StopIteration, we want to chain off the exception in the result, if any :-).

I guess something like:

if task._unawaited_coros:
    exc = protector.make_non_awaited_coroutines_error(task._unawaited_coros)
    if type(result) is Error:
        exc.__context__ = result.error
    task.result = Error(exc)
else:
    task.result = result

 if exc is not None:
     filtered_exc = MultiError.filter(self._handler, exc)
-    if filtered_exc is exc:
+    if filtered_exc is exc and not force_raise:
@njsmith (Member):

I think this would be clearer if rewritten like:

original_exc = exc
if protector.has_unawaited_coroutines():
    exc = ...
if exc is not None:
    exc = MultiError.filter(...)
if exc is original_exc:
    ...

@njsmith (Member) commented Jun 17, 2017

Oh, and I realized there's a tricky bug here: in _attempt_delivery_of_any_pending_cancel, the only unawaited coroutines we care about are the ones attached to the task object. But in all the other functions – yield_if_cancelled, MultiError.catch, Result.(a)capture – we actually want to trigger the exception logic if there are any unawaited coroutines attached to the task or newly created since the last pass through the scheduler and not yet attached to any task. Because those are the coroutines that will get attached to this task if/when we yield.

@Carreau (Contributor, Author) commented Jun 18, 2017

But in all the other functions – yield_if_cancelled, MultiError.catch, Result.(a)capture – we actually want to trigger the exception logic if there are any unawaited coroutines attached to the task or newly created since the last pass through the scheduler and not yet attached to any task. Because those are the coroutines that will get attached to this task if/when we yield.

Done. Though I'm wondering how often we should check for unawaited coroutines. yield_if_cancelled is reasonable; MultiError.catch and Result.(a)capture seem less so. At least I would have felt OK letting unawaited coroutines be caught later. It's just unclear to me where the limit is.

@njsmith (Member) commented Aug 4, 2017

Since we got down to this being the only outstanding PR, I went ahead and merged the other PR to normalize formatting using yapf (#221). Then I took the liberty of rebasing this branch and fixing its formatting too. Hopefully I didn't mess anything up in the process.

Carreau and others added 10 commits August 3, 2017 21:46
We install a coroutine wrapper (using sys.set_coroutine_wrapper) to keep
track of created coroutines. Then every time we are in the trio task
runner, we look at the state of the coroutines created since then,
and if any are non-awaited, we inject an error into the task.

  - If the task is not finished we just raise at the next checkpoint.
  - If the task is finished we attempt to change its Result into an
  Error.
  - We also check for non-awaited coroutines in MultiError.catch.

In all cases a NonAwaitedCoroutines error is thrown, and we attempt to
give as much information as possible about where the coroutine came from.
If `tracemalloc.start()` has been called earlier, we attempt to find the
file and line number where this coroutine originated.

Debugging is still a bit weird as you need to find the checkpoint in the
middle of the stacktrace (I guess we can improve that), even worse if
the task has finished. One question would be: is there a way to get the
previous checkpoint of the current task to narrow things down?

For internal use we provide an `await_later` function which is a
pass-through for the given coroutine, but instructs trio to ignore it if
it has not been awaited.
Now if a coro-wrapper is already set, we wrap it, and restore it on
uninstall. We also assert on uninstall that we are the installed
coro_wrapper.
Whitespace only commit. Just ran 'yapf -rip setup.py trio'.
@Carreau (Contributor, Author) commented Aug 10, 2017

Since we got down to this being the only outstanding PR, I went ahead and merged the other PR to normalize formatting using yapf (#221). Then I took the liberty of rebasing this branch and fixing its formatting too. Hopefully I didn't mess anything up in the process.

No worries, I've been overwhelmed recently and I am planning on coming back to this at some point. Thanks!

@njsmith (Member) commented Jul 31, 2018

Hey Matthias, haven't heard from you in a while, hope everything's good :-). It'd be neat to look at this again some time, but I'm trying to clear out old PRs, and this is pretty stale, so closing for now.

@njsmith closed this Jul 31, 2018