Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested TaskGroup can silently swallow cancellation request from parent TaskGroup #116720

Closed
arthur-tacca opened this issue Mar 13, 2024 · 28 comments
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@arthur-tacca
Copy link

arthur-tacca commented Mar 13, 2024

Bug report

Bug description:

In the following code snippet, I start an asyncio.TaskGroup called outer_tg then start another one within it called inner_tg. The inner task group is wrapped in an except* block to catch a specific type of exception. A background task in each task group raises a different exception at roughly the same time, so the inner one ends up (correctly) raising the inner exception in an ExceptionGroup, which is caught and discarded by the except* block. However, this means that there is now no longer any exception bubbling up through the call stack, and the main body of the outer task group just continues on, even allowing waiting on more routines (but not creating more tasks in the outer group, as it is still shutting down).

import asyncio

class ExceptionOuter(Exception):
    pass
class ExceptionInner(Exception):
    pass

async def raise_after(t, e):
    await asyncio.sleep(t)
    print(f"Raising {e}")
    raise e()

async def my_main():
    try:
        async with asyncio.TaskGroup() as outer_tg:
            try:
                async with asyncio.TaskGroup() as inner_tg:
                    inner_tg.create_task(raise_after(1, ExceptionInner))
                    outer_tg.create_task(raise_after(1, ExceptionOuter))
            except* ExceptionInner:
                print("Got inner exception")
            print("should not get here")
            await asyncio.sleep(0.2)
            print("waited")
            # outer_tg.create_task(asyncio.sleep(0.2)) # raises RuntimeError("TaskGroup shutting down")
    except* ExceptionOuter:
        print("Got outer exception")
    print("Done")

asyncio.run(my_main())

Expected vs observed behaviour:

Observed behaviour:

Raising <class '__main__.ExceptionInner'>
Raising <class '__main__.ExceptionOuter'>
Got inner exception
should not get here
waited
Got outer exception
Done

Expected behaviour: the rest of the main body of the outer task group is skipped, so we just see:

Raising <class '__main__.ExceptionInner'>
Raising <class '__main__.ExceptionOuter'>
Got inner exception
Got outer exception
Done

Variations:

Making either of these changes (or both) still gives the same issue:

  • Add a third line within the inner task group await asyncio.sleep(10). This means that the main body of the inner task group finishes with asyncio.CancelledError rather than just finishing before any exceptions are raised by tasks.

  • Replace inner_tg.create_task(raise_after(1, ExceptionInner)) with inner_tg.create_task(raise_in_cancel()), where:

async def raise_in_cancel():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        raise ExceptionInner()

It's fair to dismiss this second case because it violates the rule about not suppressing CancelledError).

Root cause:

I think the problem is that TaskGroup.__aexit__() never includes CancelledError in the ExceptionGroup it raises if there are any other types of exception, even if a parent task group is shutting down. That appears to be due to this code in TaskGroup.__aexit__() (specifically the and not self._errors part, because self._errors is the list of all non-cancellation exceptions):

# Propagate CancelledError if there is one, except if there
# are other errors -- those have priority.
if propagate_cancellation_error and not self._errors:
    raise propagate_cancellation_error

I'm not sure of the right solution. Here are a couple of possibilities that come to mind, but I'm not sure if either would work or what the wider implications would be.

  • Perhaps, after all child tasks are completed, TaskGroup.__aexit__(), after needs to walk up the stack of parent task groups and see if any of those are shutting down, and if so include a CancelledError in the ExceptionGroup.
  • Maybe CancelledError needs to include some metadata about what task group caused it to be raised and then it's only suppressed by the task group it's associated with (that's what Trio and AnyIO do I believe).

CPython versions tested on:

3.11, 3.12

Operating systems tested on:

Windows

Linked PRs

@arthur-tacca arthur-tacca added the type-bug An unexpected behavior, bug, or error label Mar 13, 2024
@gvanrossum
Copy link
Member

Thanks for the very careful bug report. I can repro this in 3.13 (main) as well.

Including metadata in CancelledError is not an easy thing. CancelledError actually has a metadata field but it's reserved for the application, and we don't like it much, because there is code (probably even in asyncio itself) that catches CancelledError and then raises a different instance of CancelledError. Adding more metadata would suffer the same fate. This is why we came up with the awkward uncancel machinery when task groups were introduced.

This same problem makes it problematic to reverse the relationship (having the task group that tracks the CancelledError instance) -- we can't trust the identity of the instance.

I'm also not excited about your other idea, keeping track of the nesting of task groups, but honestly I haven't thought much about it yet -- it just sounds tricky to keep track of.

I will have to mull this over for a bit.

Meanwhile if you want to throw out more ideas please go ahead! I have found that for this kind of issues, putting a debugging breakpoint (or using print()) in the asyncio code (e.g. in __aexit__()) is sometimes quite helpful, and surprisingly doesn't disturb the sequencing of events too much, since asyncio is single-threaded. (You do have to be careful with sleeps though.)

@arthur-tacca
Copy link
Author

Thanks Guido.

Not sure if relevant, but I've now I was wrong about Trio including metadata in cancellation exceptions about which scope it belongs to. It used to do that but changed to use the other idea (scopes decide at the point of receiving a cancellation exception whether they are the outermost active scope) in Trio 0.11.0 (2019-02-09); the release note describes it as:

trio.Cancelled exceptions now always propagate until they reach the outermost unshielded cancelled scope, even if more cancellations occur or shielding is changed between when the Cancelled is delivered and when it is caught. (#860)

I suspect that asyncio's edge-based cancellation nature means that metadata in the exception could never work.

@gvanrossum
Copy link
Member

No worries, I tend to ignore comparisons with Trio anyways -- they chose a different model, and good for them, but asyncio has to steer its own course.

I haven't made much progress with understanding what exactly happens, but I think that, regardless of context, in this snippet:

            try:
                async with asyncio.TaskGroup() as inner_tg:
                    inner_tg.create_task(raise_after(1, ExceptionInner))
                    outer_tg.create_task(raise_after(1, ExceptionOuter))
            except* ExceptionInner:
                print("Got inner exception")
            print("should not get here")

that last print("should not get here") should actually be reached -- the inner task group has one task, which raises an exception, which is the only exception bubbled out of the exception group, and then is caught by the except* block, after which normal execution continues.

So it seems that things go wrong right after that, here:

            await asyncio.sleep(0.2)

This await should be interrupted because the outer task group has a pending error to process, and that doesn't seem to happen.

@gvanrossum
Copy link
Member

Hm, I may have to disagree with myself. The __aexit__ call that ends the inner async with should know that the task is also cancelled on behalf of the outer async with. And the task attribute _num_cancels_requested (which is returned by task.cancelling()) is supposed to track this. It's quite possible that a modified version of your first proposed mitigation can look at that.

@gvanrossum
Copy link
Member

It turns out propagate_cancellation_error already takes _num_cancels_requested into account (it is only set when uncancel() on the parent task returns nonzero). Unfortunately, just adding it to self._errors when it is not getting raised causes various test failure. And this makes me realize the problem: we don't include CancelledError in the exception group because the contract for cancellation requires CancelledError to be a "bare" exception -- large amounts of asyncio infrastructure and applications would stop working if it could be wrapped in an exception group.

I don't know if there is any way out then without breaking other scenarios. Even putting the inner task group in a separate task doesn't help. Sorry.

@arthur-tacca
Copy link
Author

I see, that's a pity. Thanks for looking into it.

@gvanrossum
Copy link
Member

gvanrossum commented Mar 17, 2024

Hm, there's hope. If I insert

            if self._parent_task.uncancel() > 0:
                self._parent_task.cancel()

after if self._errors:, the await asyncio.sleep(0.2) in your test program gets interrupted. Even better yet, the asyncio test suite still passes! Your test program's output is now:

Raising <class '__main__.ExceptionInner'>
Raising <class '__main__.ExceptionOuter'>
Got inner exception
should not get here
Got outer exception
Done

Note waited is no longer printed. The should not get here is still printed, but I believe this is the best we can do -- the parent task is in a cancelled state, but it will first carry out other activities until it hits the next await.

@arthur-tacca Are you interested in contributing a PR for this?

@gvanrossum
Copy link
Member

@arthur-tacca Ping? "No" is a perfectly fine answer too. :-)

@arthur-tacca
Copy link
Author

arthur-tacca commented Mar 21, 2024

Sorry I didn't reply earlier. Partly I wanted to look into a couple of other test cases first and partly I wanted to think about making the pull request myself. Sadly it's a "no" to me doing the pull request - I'd like to but it's just not realistic at the moment.

Your comment with a suggested fix is really good news, it's very promising, and not a disaster that it runs that extra bit of sync code that I contentiously labelled "should not get here". I've found the fix is not quite complete but I do have a suggestion based on it that I think could work.

(Side note: I could be totally wrong here... but I don't think TaskGroup is actually using cancellation counts at the moment. This sounds silly but actually task groups can usually get away with the rule "if any task raises an exception, propagate that, otherwise if there's a cancellation error it must be external so raise that" (as implemented in the lines "if propagate_cancellation_error and not self._errors: raise propagate_cancellation_error), so you don't need the count. This only breaks down if both child exception and genuine parent task cancellation are happening at the same time, which we've established isn't working. So lucky you added cancellation counts!)

In addition to the test case in my original post, I wanted to look at a variant: instead of two nested task groups, both shut down at the same time, consider a single task group within a task that is cancelled in the usual old-school way of asyncio.cancel(). This is the test mentioned in the comment of TaskGroup._on_task_done(), except that it doesn't mention the possibility of both types of cancellation (external .cancel() and also due to child task raising exception) happening at the same time. Test code is in the fold below.

Test code for cancel task containing task group
import asyncio
from asyncio import TaskGroup

async def raise_after(t, e):
    await asyncio.sleep(t)
    print(f"Raising {e}")
    raise e()

async def foo():
    print("foo(): starting")
    try:
        async with TaskGroup() as g:
            g.create_task(raise_after(0.5, RuntimeError))
            await asyncio.sleep(1)  # <- this needs to be canceled
                            #    by the TaskGroup, e.g.
                            #    foo() needs to be cancelled
    except Exception as e:
        # Ignore any exceptions raised in the TaskGroup
        print(f"foo() ex 1: {e!r}")
    except BaseException as be:
        print(f"foo() base ex 1: {be!r}")
        raise
    print("foo(): before second sleep")
    try:
        await asyncio.sleep(1)     # this line has to be called
                                # after TaskGroup is finished.
        print("foo(): before third sleep - FAIL")
        await asyncio.sleep(1)
        print("foo(): after third sleep - FAIL")
    except Exception as e:
        print(f"foo() ex 2: {e!r}")
    except BaseException as be:
        print(f"foo() base ex 2: {be!r}")
    print("foo(): finish")

async def call_foo():
    print("call_foo(): starting")
    t = asyncio.create_task(foo())
    await asyncio.sleep(0.5)
    print("call_foo(): cancelling")
    t.cancel()
    try:
        await t
    except BaseException as be:
        print(f"call_foo(): foo raised {be!r}")
    print("call_foo(): finished\n\n")

asyncio.run(call_foo())

I found that this was broken both before and after applying your fix. Looking more closely, the problem is that this:

            if self._parent_task.uncancel() > 0:
                self._parent_task.cancel()

Is equivalent (I believe) to this:

            if self._parent_task.cancelling() > 1:
                self._parent_task.uncancel()
                self._parent_task.cancel()

But I think we want to re-raise cancel even when self._parent_task.cancelling() == 1: we have already passed the only call to self._parent_task.uncancel() in this function so any outstanding cancelling() must surely be from some external source. So I tried out the simple modification and changed it to this:

            if self._parent_task.cancelling() > 0:
                self._parent_task.uncancel()
                self._parent_task.cancel()

But when I tried this I got spurious cancels propagating out of task groups that ought to catch it. That's because that one call to self._parent_task.uncancel() that I mentioned in TaskGroup.__aexit__() happens before looping over the futures of the child tasks, one of which could call self._parent_task.cancel(). In other words, TaskGroup has a separate bug of not always calling self._parent_task.uncancel() when it should! How could it ever work at all then? I think it's because of the side note above: it's not actually using cancellation counts in most cases.

So I further modified the TaskGroup to remember whether it had called uncancel at the start of __aexit__(), then check self._parent_cancel_requested again after the child tasks had finished and call uncancel there if necessary. I suspect that actually the early check could be removed altogether in favour of the later one, which would simplify the code a bit, but I wasn't sure of the implications.

The revisions of this gist show the process I went through.

In the end, this is a task group implementation that fixes the two currently-broken cases and didn't break any existing working cases that I tried. But I haven't tried the whole Python test suite.

@arthur-tacca
Copy link
Author

By the way:

Another test case I tried is where we await task in a task group and then call task.cancel() in the old-school way. It seems hopeless that this would work (at least, in general) with task groups and sure enough the cancellation would be swallowed (with the existing TaskGroup implementation and with your and my modified versions). Code in the fold:

Unsurprisingly broken example mixing task groups with cancellation propagation using await task
import asyncio

async def raise_after(t, e):
    await asyncio.sleep(t)
    print(f"Raising {e}")
    raise e()

async def safe_await(task: asyncio.Task):
    try:
        return await task
    except asyncio.CancelledError:
        if task.cancelled():
            asyncio.current_task().cancel()
        raise

async def upward_propagation_test():
    print("upward_propagation_test(): starting")
    try:
        async with asyncio.TaskGroup() as g:
            g.create_task(raise_after(0.5, RuntimeError))
            try:
                await asyncio.sleep(10)
            finally:
                t = asyncio.create_task(asyncio.sleep(10))
                await asyncio.sleep(0.1)
                t.cancel()
                await asyncio.sleep(0.1)
                print("upward_propagation_test(): waiting for cancelled task...")
                await t # fixed by using this instead: await safe_await(t)
                print("upward_propagation_test(): should not get here (t was cancelled)")
    except Exception as e:
        # Ignore any exceptions raised in the TaskGroup
        print(f"upward_propagation_test() ex 1: {e!r}")
    except BaseException as be:
        print(f"upward_propagation_test() base ex 1: {be!r}")
        raise
    print("upward_propagation_test(): before second sleep")
    try:
        await asyncio.sleep(1)     # this line has to be called
                                # after TaskGroup is finished.
        print("upward_propagation_test(): before third sleep - FAIL")
        await asyncio.sleep(1)
        print("upward_propagation_test(): after third sleep - FAIL")
    except Exception as e:
        print(f"upward_propagation_test() ex 2: {e!r}")
    except BaseException as be:
        print(f"upward_propagation_test() base ex 2: {be!r}")
    print("upward_propagation_test(): finish")


async def upward_wrapper():
    try:
        await upward_propagation_test()
    except:
        pass

asyncio.run(upward_wrapper())

The reason for this is that the code:

ret = await task

Will raise asyncio.CancelledError if task was cancelled, but it does not increment cancelled() on the current task. That is impossible to change retrospectively (it was even when cancellation counts were introduced) because older code may be catching asyncio.CancelledError to suppress cancellation propagation without calling uncancel() (despite the warning in the current docs - which is misleading because it actually would be wrong to call in this case).

Instead it works if you wait for a child task like this:

try:
    ret = await task
except asyncio.CancelledError:
    if task.cancelled():
        asyncio.current_task().cancel()
    raise

@gvanrossum
Copy link
Member

Thanks, it will take me some time to process all this research. Much appreciated!

@arthur-tacca
Copy link
Author

Thanks 😊

Sorry to add even more to the wall of text:

  • Here's a concrete example illustrating the "TaskGroup not calling uncancel" bug (assumes earlier raise_after() function is defined). This ends by printing "After try/except*; have cancelling=5"
async def uncancel_test():
    for i in range(5):
        try:
            async with asyncio.TaskGroup() as tg:
                tg.create_task(raise_after(0.2, RuntimeError))
                print("task group about to wait")
        except* RuntimeError:
            print("Ignoring exception")
        print(f"After try/except*, have cancelling={asyncio.current_task().cancelling()}")
  • I'm now not so confident that you should call asyncio.current_task().cancel() if a child task raises asyncio.CancelledError. That makes some sense if the child task was cancelled for some reason unrelated to the current (outer) task. But if the current task is cancelled then that will cause the child task to be cancelled but we should not increase the current task's cancellation count. Maybe it's possible to come up with some more complex safe strategy involving shielding the child task, but I suspect it is safest to just never mix task groups with directly awaiting child tasks.

@gvanrossum
Copy link
Member

I finally found some time to think about this some more.

(Somewhat off-topic: I take it that you're not that comfortable with Git? I recommend that you spend some time learning about forks and branches -- what you put in a Gist really belongs on a branch in your own fork of the cpython repo. I promise it will pay off. You should also be able to clone the repo, build Python from source, and run parts of the test suite, e.g. ./python -m test test_asyncio.test_taskgroups will run the taskgroup tests. See the devguide.)

It sounds like we're hitting a dead end by just considering individual examples and test cases, and instead it looks useful to try to reason about tasks and task groups as little state machines. We're also reaching the limits of "programming by random modification", something we're all likely to do when we're trying to put off the inevitable deep thinking. (It's how I came up with that two line "fix" before. :-)

You noticed an important distinction. A task receiving CancelledError and a task being cancelled are separate things. And the invariants around cancel(), uncancel(), cancelling() and cancelled() are confusing. I also now think that the docs for uncancel() are not clear. Let me try to unpack things a bit (as much for my own benefit as for yours!).

  • cancelled() is one of the three "final" states of a task or future (the others are exception() and result(); collectively done()). Once a task or future is in a final state, its state won't change.
  • cancel() does not cause the task to transition to the cancelled() state -- instead, it either calls cancel() on the task/future it is waiting for, or (if it is not waiting) sets a flag that will cause CancelledError to be thrown into the thread's coroutine. It also increments the "cancel count" (returned by cancelling()).
  • I think that the only reason a task can be not waiting for another task/future is that it is ready to run, or it is already running. When it is ready to run but not yet running, its state can still be changed, because several other tasks (or callbacks) may call cancel() on it (or do other things -- including calling cancel() on the task/future it is waiting for).
  • It is a little difficult for a task to discover whether it has been cancelled. Typically, await raises CancelledError, but that is not enough -- you must also check cancelling(). And you might already be handling a cancellation, so you ought to check cancelling() before await as well as after.
  • I was not sufficiently on top of all this when I incorporated task groups into asyncio -- I mostly copied the code from uvloop, trusting Yury's experience.

Now I have to really understand your examples, which will take me more time.

@gvanrossum
Copy link
Member

I just found a really interesting thing that you have hinted at: The sequence

t.cancel()
t.uncancel()

is not always a no-op. If initially t is not being cancelled, afterwards t.cancelling() is 0, but t will still be cancelled. That is because t._must_cancel is not cleared by t.uncancel().

Worse, there is no unittest that checks for this behavior: If I add

            if self._num_cancels_requested == 0:
                self._must_cancel = False

to uncancel() (after it decrements _num_cancels_requested), no tests fail.

Given that all this is kind of subtle regarding backwards compatibility, I wouldn't want to change this, but it may make it harder to fix task groups.

There is also a corner case where a task is cancelled but await doesn't raise CancelledError: If the task cancels itself and then awaits something that doesn't actually block. Example:

import asyncio

async def main():
    f = asyncio.Future()
    f.set_result(None)
    this = asyncio.current_task()
    this.cancel()
    print("BEFORE", this.cancelling())
    await f  # Already done
    print("AFTER", this.cancelling())
    await asyncio.sleep(0)  # <-- CancelledError raised here

asyncio.run(main())

This prints

BEFORE 1
AFTER 1
Traceback (most recent call last):
...

@gvanrossum
Copy link
Member

However, the latter corner case is only an issue if the task cancels itself (which is rather uncommon, although it is in fact what I proposed to fix the original problem). If it is cancelled by something else (another task or a callback) while blocked in await, that await will always raise CancelledError. Maybe we can use this to our advantage!

@gvanrossum
Copy link
Member

(This is just me brainstorming, sorry.)

I have a feeling we need to change things in TaskGroup.__aexit__() to not assume that CancelledError implies cancel() was called, unless cancelling() is nonzero. Both upon entry to __aexit__() and at await self._on_completed_fut (for the latter we can compare cancelling() before and after).

I fear we will end up rewriting a fair amount of __aexit__(). Fortunately I think the rest is solid -- e.g., _on_task_done() looks good, and the logic around _parent_cancel_requested looks unimpeachable. It is also good that there is only a single await in the entire file (other than in comments :-). It is also helpful that we have so many unit tests already to alert us to expected existing behavior (it is not impossible that we may have to renege some of these, but unlikely).

I have one concert about _abort(). This calls t.cancel() for all tasks that are still alive, but if a task catches and ignores CancelledError, the entire task group will hang until that task exits voluntarily, and cancelling the task group isn't going to help (since _abort() is only ever called once). We don't have to fix this at the same time though. Eventually we may want to address this by having a mechanism to re-cancel such tasks, perhaps if the parent task is re-cancelled we should check for these. We may also want to consider looking at the offending task's cancelling() -- if the task calls uncancel() on itself that's signals that it needs to be re-cancelled when the parent task is re-cancelled; if it doesn't, we may assume it is still aware of the cancellation and it may yet re-raise it. (Hm, this looks like it may be a useful general rule.)

@gvanrossum
Copy link
Member

Final thought of the night: all problems seem related to _propagate_cancellation_error. Maybe we should see if we can't replace that with something more systematic.

@gvanrossum
Copy link
Member

Case analysis at the point we enter __aexit__().

  • The parent task could be cancelled "from outside" (possibly multiple times)
  • It could be cancelled by _on_task_done() (at most once, and we know whether this happened)
  • There could be a pending cancellation that has not yet been delivered (or more)
  • Two or more of the above bullets could have happened

We can't even tell the difference between a cancellation that was delivered and then caught and ignored, and one that is still pending (to be delivered at the next await) without looking in a private task variable (_must_cancel) -- in both cases cancelling() returns nonzero. I propose to deal with this particular wrinkle by insisting that when catching and ignoring CancelledError one should call uncancel(). Thus:

async with asyncio.TaskGroup() as tg:
    ...
    try:
        await <something>
    except asyncio.CancelledError:
        asyncio.current_task.uncancel()  # Indicate we're suppressing the CancelledError
    ...

However, while this works if the CancelledError is the result of the parent task being cancelled; but it's wrong if <something> was cancelled. So maybe the advice should be just not to suppress CancelledError and instead always re-raise it. That makes it the problem of __aexit__. :-)

(I'm still just blabbering on to expose my thought process. Jump in any time, or you can wait until I start proposing some code. :-)

@gvanrossum
Copy link
Member

Have a look at this: #117407

I think this addresses all your examples.

It would be great if you could contribute tests.

@gvanrossum
Copy link
Member

(And by "contributing tests" I meant to ask if you wanted to turn your example programs into unittests that can be added to test_taskgroups.py.)

FWIW, Here's why I think the changes I proposed are the right ones.

The first commit in the PR moves this block:

        if self._parent_cancel_requested:
            # If this flag is set we *must* call uncancel().
            if self._parent_task.uncancel() == 0:
                # If there are no pending cancellations left,
                # don't propagate CancelledError.
                propagate_cancellation_error = None

from its original location near the top of __aexit__() to below the while loop that waits for all tasks to finish. The reason it should go in the latter place is that the while loop contains an await which may set self._parent_cancel_requested and call self._parent_task.cancel(). Also, the except block in the while loop may set propagate_cancellation_error and it is appropriate that whatever value it is set to there is nullified here if uncancel() indicates that the only pending cancellation was our "self-cancel".

The next commit inserts this code:

            # If the parent task is being cancelled from the outside,
            # re-cancel it, while keeping the cancel count stable.
            if self._parent_task.cancelling():
                self._parent_task.uncancel()
                self._parent_task.cancel()

in the block that constructs and raises the exception group. This is basically your version of my earlier partial fix.

Finally, to keep one test in test_timeouts.py working now that task groups properly maintain the cancellation count, I had to modify uncancel() to reset _must_cancel if _num_cancels_requested goes back down to zero. The reason here is my earlier observation that this edge case left _must_cancel inconsistent. (Note that this must be done in the C accelerator too.)

There's one more issue with task groups: If one of the child tasks ignores cancellation, the whole task group will wait until that task voluntarily exits, and re-cancelling the parent doesn't help (it doesn't get propagated to the children). I propose to leave this for another time -- nobody has complained about it yet (and there is a strong recommendation against doing this).

@Tinche
Copy link
Contributor

Tinche commented Apr 2, 2024

Ooph, interesting. I spent an hour playing with the OP's example to try to understand the problem.

I tried playing out the example in my head to see where reality would diverge. On the face of it, the main issue is that the asyncio.sleep(0.2) line doesn't raise CancelledError. The actual problem seems to originate in the inner TG's __aexit__.

I agree we uncancel too early in __aexit__. The example sequence is:

  1. __aenter__
  2. create_task
  3. Enter __aexit__
  4. Try uncancelling - no luck, we haven't been cancelled yet
  5. Block waiting for our task
  6. The task cancels us, but we've already uncancelled. We should uncancel but we won't

Here's a very simple test that fails:

async def main():
    async def raise_after(t, e):
        await sleep(t)
        raise e()

    try:
        async with TaskGroup() as tg:
            tg.create_task(raise_after(0.1, Exception))
    except ExceptionGroup:
        pass
    assert current_task().cancelling() == 0

(If you insert an await sleep(1) after the create_task it succeeds because it will delay entering __aexit__.)

I think doing the uncancel after the tasks are handled is the right move. Is that enough to solve the OP issue? It should cause the inner __aexit__ to propagate the CancelledError to the outer TG.

@gvanrossum
Copy link
Member

Thanks for confirming that. (And I may use that as a unit test. :-)

At least one of the OP's examples also requires the other thing I changed in taskgroups.py -- if the parent task is cancelled but we also have errors for the exception group, re-cancel the parent, so that it will be cancelled at the next await. This is a bit inconsistent, as when there are no errors to report, we just propagate the CancelledError exception right out of __aexit__, but we're constrained by backwards compatibility and except* semantics -- we can't add the CancelledError to the exception group without huge backwards compatibility issues (we'd have to tell everyone to use except* CancelledError).

Finally, after all was said and done, one existing unittest failed, and resetting _must_cancel in uncancel() feels like the right fix for that.

@Tinche
Copy link
Contributor

Tinche commented Apr 3, 2024

if the parent task is cancelled but we also have errors for the exception group, re-cancel the parent, so that it will be cancelled at the next await

I'm trying to figure out if this is what we want. Without thinking about it too much, my gut feeling is if the parent is cancelled, there should not be a CancelledError on 'next await'. The inner __aexit__ should raise CancelledError. If I understand this correctly, your opinion is that the inner __aexit__ should raise an ExceptionGroup, and re-cancel the parent task so the next await will raise CancelledError?

@gvanrossum
Copy link
Member

I want my cake and eat it too. :-)

The reason we want the parent cancellation to bubble up (if it came from the outside -- not from self._on_task_done()) is that otherwise the outer task group will not get its cancellation (assuming that is the "outside" source of cancellation).

And indeed if there are no errors from the child group, __aexit__ will re-raise the cancellation error:

        if propagate_cancellation_error is not None and not self._errors:
            raise propagate_cancellation_error

The problem occurs when there are also errors from the child task group. Certainly, without a pending "outside" cancellation, we want to raise those errors, grouped together in an exception group, so that a try/except* outside the task group can handle them, and so that the code immediately after the task group doesn't continue as if nothing happened.

But when there are errors to be reported and a cancellation from the parent, what are we supposed to do? The hack I came up with is that in this case only we cancel the parent, which mostly has the intended effect: a try/except* is able to handle exceptions from child tasks associated with the task group, and the parent cancellation isn't lost (only delayed).

This isn't perfect, but it's better than before, IMO.

@Tinche
Copy link
Contributor

Tinche commented Apr 4, 2024

Right. Just to be clear, I'm not arguing we should do nothing. I'm also good with your solution, and will try to mimic our ultimate approach in Quattro. I'm trying to collaboratively brainstorm out loud before we commit.

So in __aexit__ we want to raise 2 things: somebody else's CancelledError and an ExceptionGroup with child errors. But we can't raise two things, so either we swallow one or do your cancellation trick where we raise the ExceptionGroup and mark the task for later cancellation.

I wrote a larger reply but I think I'm good with your approach here. My initial gut feeling was that it was weird to allow an additional event loop step in this case, but this is a race condition so the impact should be minimal. It also helps us not let errors pass silently.

@arthur-tacca
Copy link
Author

arthur-tacca commented Apr 7, 2024

In addition to what I wrote on the merge request, here is some Docs broader discussion

I think many users (it would be all of them from my point of view!) would deal with tasks exclusively by using task groups. They would never ever call asyncio.create_task(), or use await t on any task t, or access t.result() for a task unless they were sure that it hadn't been cancelled (and didn't finish with any other exception for that matter - since that is already going to be raised into an ExceptionGroup elsewhere). (That last one is easier than it sounds - in the lines of code immediately after a task group you can be sure of that. - actually that's not true of cancellation, since you can cancel an individual task started in a task group.) Those users don't need to know anything about cancellation counts and whatnot, and would be safe from a cancellation in one place accidentally being re-raised somewhere else (possibly with incorrect cancelling() count). IMO, the docs would benefit from a little discussion about working to this strategy. It would perhaps be usefully supplemented by a safe way to wait for a task to finish - something like task.wait_done() as I suggested in the issue about a safe way to wait for a task to be completely cancelled. Or they could just be advised to use other synchronisation mechanisms between tasks (e.g. asyncio.queue).

Trio doesn't have any way to wait for a task to finish or get its result - I know that Trio isn't asyncio and vice-versa but I still think that's interesting. I actually wrote a very small library aioresult for Trio / AnyIO to allow getting the result of a task in a nursery (this is the "something else I planned to spend my time on" I mentioned in my other comment). I used exactly that strategy of having a separate await task.wait_done() method. The task.result() method avoids confusion about exceptions by unconditionally wrapping task exceptions in wrapper exception type, though I see that it's too late for asyncio to use that idea (given that its task.result() method already exists and does something else).

@gvanrossum
Copy link
Member

Thanks for the deeper feedback!

You are absolutely right that the docs around cancellation are still pretty vague and confusing if you're looking for a precise spec, and probably give incorrect guidance to beginners. (Even the simple "don't try to catch CancelledError is muddled by the existence of finally blocks and the possibility of accidentally raising another exception there.)

Unfortunately I think I am going to have to give the job of improving those docs to someone else (or possibly myself in the distant future) -- there just are too many other things to do.

There are probably also still things that could be improved in the implementation of cancellation. But it is in the nature of Python's development process to iterate, rather than to attempt to come up with the perfect solution in one go.

In the meantime, I hope you come back and help us find some more subtle issues. You've been quite helpful.

gvanrossum added a commit that referenced this issue Apr 9, 2024
This prevents external cancellations of a task group's parent task to
be dropped when an internal cancellation happens at the same time.
Also strengthen the semantics of uncancel() to clear self._must_cancel
when the cancellation count reaches zero.

Co-Authored-By: Tin Tvrtković <tinchester@gmail.com>
Co-Authored-By: Arthur Tacca
@arthur-tacca
Copy link
Author

@gvanrossum Thanks for the nice comment, I hope so too. And thanks for taking the time to fix this issue.

diegorusso pushed a commit to diegorusso/cpython that referenced this issue Apr 17, 2024
This prevents external cancellations of a task group's parent task to
be dropped when an internal cancellation happens at the same time.
Also strengthen the semantics of uncancel() to clear self._must_cancel
when the cancellation count reaches zero.

Co-Authored-By: Tin Tvrtković <tinchester@gmail.com>
Co-Authored-By: Arthur Tacca
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-bug An unexpected behavior, bug, or error
Projects
Status: Done
Development

No branches or pull requests

4 participants