Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider the behavior of nursery.start when the target nursery is cancelled #1431

Open
njsmith opened this issue Mar 10, 2020 · 2 comments
Open

Comments

@njsmith
Copy link
Member

njsmith commented Mar 10, 2020

There was a discussion today in chat about the subtle interactions of nursery.start and cancellation, involving a somewhat complicated example: https://gitter.im/python-trio/general?at=5e6684aea2897318a999fe4e

In the example, @catern has an inner nursery and an outer nursery. A task in the outer nursery tries to spawn a task, using await inner_nursery.start(...). In the mean time, another task inside the inner nursery crashes. And unluckily, the new task's startup involved waiting for the crashed task. This led to a deadlock, because:

  • The inner nursery is trying to unwind due to the crashed task, but it can't until all tasks have completed
  • So it cancels everything inside it, to expedite that
  • But when a task is in nursery.start, and hasn't yet called task_status.started(), then it's in a weird state where it keeps the target nursery open, but it's still in the original task's cancel scope.
  • So the inner nursery can't finish unwinding until the new child calls task_status.started(), the new child is stuck and should be cancelled, it won't be cancelled until the exception propagates out of the inner nursery → deadlock.

I'm not sure how common this particular situation is, but it does seem kinda problematic that we have a state where a nursery is trying to crash but it can't actually cancel everything inside it.

There are two features that are nice on their own, but play a role in creating this situation:

Feature 1: We want task_status.started() to be infallible. To achieve this, nursery.start has to keep the target nursery open while the task is starting; otherwise, the nursery might disappear while the task is starting up (e.g. to all its other tasks exiting).

Feature 2: If the new child raises an exception before calling task_status.started() (e.g., imagine you passed the wrong number of arguments to the function), then that exception is propagated out of the call to nursery.start, not into the nursery itself.

And, as a general rule, we always have to keep our exception propagation hierarchy and our cancel scope hierarchy lined up, so we can't have wild Cancelled exceptions propagating outside of the associated cancel scope. Put together, these mean that the new child has to run inside the caller's cancel scope until task_status.started() is called, and then the call to started transitions it to running inside the target nursery's cancel scope.

And, this is also kind of a nice feature in its own right, because e.g. it means if you want to put a timeout on a task starting up, that's easy to do:

  with move_on_after(60) as cscope:
      await nursery.start(server_fn)
  if cscope.cancelled_caught:
      # server failed to start up in reasonable time; handle the problem

So that's how we end up with code that's holding the nursery open, but isn't inside its cancel scope.

Thinking about it, I think we could potentially relax the first feature... we could make it so nursery.start didn't keep the nursery open, and if the nursery was closed when we reached the call to task_status.started(), then it could raise RuntimeError. This is already what nursery.start_soon and nursery.start do if the nursery is closed when you call them. But right now, we only check for this once, before the task starts running; the new thing would be that the task could get through its startup phase and then fail due to the missing nursery.

This could potentially cause new failures, if code isn't being careful to keep the nursery open until the task has finished starting up? But I think any code that could hit that failure was already race-y, because if the nursery could spontaneously close while nursery.start is running, then it could also spontaneously close just before nursery.start is called, right? Unless the startup phase code itself does something to trigger the other tasks to exit, but that seems pretty perverse.

OTOH, this wouldn't actually help with @catern's particular situation (because their code is deadlocking before it reaches task_status.started()). And it doesn't really feel like it addresses the core theoretical problem, that a nursery can be trying to tear itself down while there's still code "running inside it" that hasn't been informed. To do that, we need to actually propagate cancellation into the new task.

But, because of "Feature 2", we can't simply go ahead and put the child task directly into the target nursery's cancel scope.

One possibility that came up in chat: we could potentially put the child task into both cancel scopes simultaneously, but with a twist: if the task terminates with a Cancelled exception before calling task_status.started(), and the nursery.start scope is cancelled, then the Cancelled is allowed to propagate into the caller, like it does now. But if this Cancelled came from the nursery's cancel scope, then we can't allow it to propagate out into the caller, because that would become a wild Cancelled that might never be caught. So instead, nursery.start would need to detect this situation, and convert it into some other exception, like a NurseryWasCancelledError or something.

This seems potentially complicated to implement, and nursery.start is already pretty complicated. (Though @oremanj's cancel scope refactoring made it much better; maybe this wouldn't actually be too hard? idk) But it does seem like it might be The Right Solution? I'm not sure.

@catern
Copy link
Contributor

catern commented Mar 13, 2020

Let me give a little more context and color on my scenario (since I think this scenario will be relatively common in an async codebase of sufficient size).

Specifically in our scenario, our code is running in some nursery A, with some child nursery B. We launch a OS process P in nursery B with B.start_soon; the function to launch it waits until it exits and throws if it exits uncleanly. Then also in B we start up a function M with B.start to monitor process P (in a process-specific way; it essentially reads structured logging information from disk); this monitoring task returns a monitoring interface with task_status.started that we query for state changes in P. It doesn't return the monitoring interface until it's able to read the first structured logging information from disk, indicating that P has come up successfully.

The problem is if P exits uncleanly during startup, before it starts writing any structured logging, then the task monitoring P will throw an exception, which will cancel B - but monitoring function M will never get any structured logging information, so it will just hang forever.

This kind of pattern is pretty common in our codebase - it might be less common for others since we've been developing complex process management code with trio long before trio natively supported subprocesses.

@arthur-tacca
Copy link
Contributor

arthur-tacca commented Apr 26, 2024

I came across this today while looking at what AnyIO does in a similar situation (as part of agronholm/anyio#717). I was just running abstract tests rather than encountering it in the wild though (and the nurseries were siblings rather than parent/child).

I've got to say, I was very surprised that a cancelled nursery could be held up by a task that has no indication that it should be cancelling.

I considered the same two fixes as njs: allow task_started() to raise an exception if the target nursery closes, so it doesn't need to wait for starting tasks to finish, or allow a starting task to immediately cancel if either its current or its destination nursery is cancelled. My instinct was that the first one is best, but given that it doesn't fix the only reported (that I know of) case in the wild, and has the issue that it could cause a nursery to close prematurely in a non cancelled situation, it seems like the doible-cancelability option is best after all. I agree it's complicated and a bit weird but better than the current situation and any other option I can see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants