Safe synchronous cancellation in asyncio #103486

Open
evgeny-osipenko opened this issue Apr 12, 2023 · 10 comments
Labels
topic-asyncio type-feature A feature request or enhancement

Comments

@evgeny-osipenko

evgeny-osipenko commented Apr 12, 2023

Feature or enhancement

I propose adding a function to asyncio to cancel a task and then safely wait for the cancellation to complete.

The main difficulty with this is deciding whether to swallow or re-raise the CancelledError once we catch one. If we call the task that we're waiting on the "dependency", and the task doing the waiting the "controller", then there are two distinct possibilities:

  1. The "dependency" has finished its cleanup, passed its own CancelledError all the way up its own stack, and then further up through the await in the "controller". In this case, the CancelledError should be swallowed, and the "controller" can continue its work normally.
  2. The "controller" itself was cancelled while it was suspended on the await. In this case, the CancelledError is the signal to cancel the "controller" itself, and it should either be re-raised further up, or its swallowing should be accompanied by a call to uncancel (a sketch of that variant appears after the corrected example below).

The code example in the documentation for asyncio.Task.cancel, in fact, does not make this decision correctly, and would be a bug in code where the caller can itself be cancelled. Copying the example, and changing the names to match this issue's terminology:

async def dependency():
    print('dependency(): start')
    try:
        await asyncio.sleep(3600)  # stands in for a long operation or an infinite loop
    except asyncio.CancelledError:
        print('dependency(): cancel')
        raise
    finally:
        print('dependency(): cleanup')

async def controller():
    dependency_task = asyncio.create_task(dependency())
    await asyncio.sleep(1)
    dependency_task.cancel()
    try:
        await dependency_task # controller itself may be cancelled at this moment
    except asyncio.CancelledError:
        print("controller(): dependency is cancelled now")
        # CancelledError swallowed unconditionally

If I'm not missing anything, the correct procedure would look like this:

async def controller():
    dependency_task = asyncio.create_task(dependency())
    await asyncio.sleep(1)
    dependency_task.cancel()
    try:
        await dependency_task
    except asyncio.CancelledError:
        print("controller(): dependency is cancelled now")
        if asyncio.current_task().cancelling() > 0:
            raise
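
For completeness, the swallow-plus-uncancel route mentioned in possibility 2 would look roughly like the sketch below, continuing the example above (controller_swallowing is a made-up name, not anything from the documentation):

async def controller_swallowing():
    dependency_task = asyncio.create_task(dependency())
    await asyncio.sleep(1)
    dependency_task.cancel()
    try:
        await dependency_task
    except asyncio.CancelledError:
        print("controller(): dependency is cancelled now")
        if asyncio.current_task().cancelling() > 0:
            # The controller itself was also cancelled; instead of re-raising,
            # acknowledge the request with uncancel() so that task groups and
            # timeouts, which rely on cancelling()/uncancel() bookkeeping,
            # keep working.
            asyncio.current_task().uncancel()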

Thus, I propose to make these changes:

  1. Introduce a function to asyncio or Task:

    async def cancel_and_wait(task, msg=None):
        task.cancel(msg)
        try:
            await task
        except asyncio.CancelledError:
            if asyncio.current_task().cancelling() == 0:
                raise
            else:
                return # this is the only non-exceptional return
        else:
            raise RuntimeError("Cancelled task did not end with an exception")

    Having a specialized function would reduce the chance of someone making this mistake in their code (like the author of the example probably did :) ), and would allow the implementation to be changed or improved in the future. One possible enhancement, for example, could be adding a repeat parameter that instructs the function, in case the task uncancels itself, to keep cancelling it again in a loop (a rough sketch of such a variant follows after this list).

  2. In the documentation for asyncio, add a warning for this kind of mistake, in the "Task Cancellation" section or in the description of asyncio.Task.cancel.

  3. Change the code example for asyncio.Task.cancel to account for cancellation of main. I know that, in this specific snippet, it is impossible for main itself to be cancelled; but a developer unaware of this issue may copy the example into a situation where the controller is indeed cancellable, and end up with a bug in their code.
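
For illustration, such a repeat variant might look roughly like this (the name and the retry_interval parameter are hypothetical, not an existing or proposed API; asyncio.wait() is used because it never raises the awaited task's own exception, while a cancellation of the calling task still propagates out of the await):

import asyncio

async def cancel_and_wait_repeating(task, msg=None, retry_interval=1.0):
    # Keep requesting cancellation until the task actually finishes, in case
    # it absorbs a request via uncancel() and continues running.
    while not task.done():
        task.cancel(msg)
        # Never raises the awaited task's own outcome; a cancellation of the
        # calling task still propagates from this await.
        await asyncio.wait({task}, timeout=retry_interval)
    if not task.cancelled():
        exc = task.exception()
        if exc is not None:
            raise exc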

evgeny-osipenko added the type-feature (A feature request or enhancement) label on Apr 12, 2023
@gvanrossum
Member

This feels very similar to gh-102847, which was just closed. Since that has much more discussion we should probably keep that one open instead?

@achimnol
Contributor

achimnol commented Apr 14, 2023

I think this issue shares some context with gh-102847, but it targets a more specific use case: safe synchronous cancellation. Could we reopen it? Then it may be better to collect the discussion there. I'd love to see some improvements on the "canceller"-side APIs and ergonomics, as the OP says.

@gvanrossum
Member

I don't think that adding a new API is going to help much. Realistic use cases of cancellation (e.g. as used in task groups and the timeout context manager) wouldn't be able to use it. Most use cases are better off using something like timeout anyway.
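
For instance, a minimal sketch of the timeout-based approach, reusing the dependency() coroutine from the issue description (illustrative only; asyncio.timeout() requires Python 3.11+):

async def controller():
    try:
        # asyncio.timeout() cancels the enclosed work when the deadline
        # expires and converts the resulting CancelledError into TimeoutError,
        # while a cancellation of controller() itself still propagates.
        async with asyncio.timeout(1):
            await dependency()
    except TimeoutError:
        print("controller(): dependency timed out and was cancelled")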

You may have made a case for updating the example, but I resent the suggestion that the author of the example "probably made a mistake". As you mention, in the example there is no way main() itself can be cancelled. But if you submit a PR that updates the example and explains (in not too many words) why the cancelling() check is needed, it will probably be accepted. Alternatively, you could update the docs for cancelling() to include the improved example, and just plant a link there in the text accompanying the basic cancel() example. If you go that way, the cancel_me() function should probably be reduced to just await sleep(3600), since the rest of the code is just there to demonstrate the execution flow through a coroutine that catches CancelledError. (That's also true for main() BTW -- so it's probably better to update the cancelling() docs than the cancel() docs.)

@evgeny-osipenko
Author

Let me describe the context where I encountered this question.

I am building a system composed of multiple services that run in parallel in separate Python instances. To provide synchronization between the services where necessary, I use a shared database with a "locks" table, where the lock's name field, constructed from a resource's global id, is covered with a unique index. When a service takes a lock, it tries to create a row in this table, and to release the lock, the row is deleted.

In addition, each lock is equipped with a watchdog timer, so that if a service bugs out, crashes or "forgets" to unlock the resource in any other way, the lock would time out by itself. When a service genuinely needs to hold the lock for a long time, however, it can periodically reset the watchdog, thus preventing it from timing out for as long as the service is actually functional.

And so, what I wanted to have is a context manager that takes care of both creating/removing the row and resetting the watchdog timer, by spawning an asyncio task for the duration of the "managed" code. An important caveat is that the "resetter" task must be stopped and joined before the row is removed: if the "resetter" discovers that the row has been removed and there is nothing left to update, it treats that as the lock having timed out; the resource may already be in the middle of another process and it is no longer safe to continue, so it raises an exception and interrupts the parent task.

In the end, though, I implemented this behavior by constructing a TaskGroup from within the lock manager, calling its __aenter__ and __aexit__ manually, and putting the "resetter" task under its control:

class LockConflict(RuntimeError):
    pass

class LockLost(RuntimeError):
    pass

class Lock:
    def __init__(self, name):
        self.name = name
        self.nonce = random.randbytes(8)
        self.task_group = None
        self.resetter_task = None

    async def take(self):
        created_successfully = await db.transaction(
            lambda c: db_try_create_lock(c, self.nonce, self.name),
        )
        if not created_successfully:
            raise LockConflict(
                f"Lock {self.name} is already taken. Are you trying to run "
                f"multiple instances of a non-multiplicable service at once?"
            )

    async def touch(self):
        updated_successfully = await db.transaction(
            lambda c: db_update_lock_watchdog(c, self.nonce, self.name),
        )
        if not updated_successfully:
            raise LockLost(f"Lock {self.name} timed out")

    async def release(self):
        await db.transaction(
            lambda c: db_release_lock(c, self.nonce, self.name),
        )

    async def async_touch_loop(self):
        while True:
            await asyncio.sleep(settings.service_watchdog_timer_seconds // 2)
            await self.touch()

    async def __aenter__(self):
        if self.task_group is not None:
            raise RuntimeError('Lock context manager entered twice')
        try:
            await self.take()
            self.task_group = await asyncio.TaskGroup().__aenter__()
            self.resetter_task = self.task_group.create_task(self.async_touch_loop())
            return self
        except:
            # handle KeyboardInterrupt during initialization
            await self.release()
            raise

    async def __aexit__(self, *exc_info):
        try:
            self.resetter_task.cancel()
            await self.task_group.__aexit__(None, None, None)
        finally:
            await self.release()

# example usage:
async with Lock(f"record:{my_record.id}"):
    await process_record(my_record)

So, in the end, it might indeed be better to just add a note to the cancel example saying stuff like "check cancelling before swallowing CancelledError" and "actually, just use a TaskGroup, it already has all that covered for you".
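
A minimal sketch of what that TaskGroup-based note could point to, again reusing the dependency() coroutine from the issue description (illustrative only):

async def controller():
    async with asyncio.TaskGroup() as tg:
        dependency_task = tg.create_task(dependency())
        await asyncio.sleep(1)
        dependency_task.cancel()
    # __aexit__ has waited for the cancelled child to finish and ignores its
    # CancelledError; a cancellation of controller() itself is still honoured.
    print("controller(): dependency is cancelled now")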

@achimnol
Contributor

achimnol commented May 1, 2023

The same suggestion from another person: https://discuss.python.org/t/asyncio-cancel-a-cancellation-utility-as-a-coroutine-this-time-with-feeling/26304

@fried
Contributor

fried commented May 1, 2023

My talk at PyCon US 2022 https://youtu.be/XW7yv6HuWTE brought this up as an issue I see a lot at Meta: people screwing up cancelling tasks. I even wrote my own utility for this in later: https://github.com/facebookincubator/later/blob/main/later/task.py#L59

Though looking at my implementation, you can't cancel it if the task it's trying to cancel won't finish :P So yeah, this is a work in progress. It would be nice to have this in the stdlib. And I can't just tell developers to use task groups, especially if they are interfacing with 3rd party code.

@gvanrossum
Member

I suspect this will need a PEP, since so many people have broken their heads over this (even @fried admits his solution isn't perfect :-) ). See https://discuss.python.org/t/asyncio-cancel-a-cancellation-utility-as-a-coroutine-this-time-with-feeling/26304/3 (I'm not sure where we should discuss this -- here on GitHub, or there on Discourse.)

@xitop

xitop commented Jan 15, 2024

Speaking from my experience (yes, it's limited), this is an issue that should be solved while dealing with the whole task life-cycle, i.e. on a level above asyncio, not inside asyncio.

Very closely related to cancellation are other problems I've encountered in my work: task monitoring (essential tasks must not exit, or else the whole app should shut down) and shutdown management itself (task cancelling order, cleanup actions, etc.). I have solved these "well enough" for my projects, i.e. far from being suitable for general use, but that work is the reason I'm now quite sure that a fully fledged tool for managing tasks and taking care of all the details, edge cases, etc. would amount to a complete asyncio add-on library, and that this particular issue belongs there.

@arthur-tacca

The root problem here is that await task could raise either that task's exception or the current task's own CancelledError. But you're only calling it to wait until the task is done – you can check the result later by looking at task.cancelled(). In fact, you probably don't even care if it ends that way (rather than e.g. catching CancelledError and returning).

So what you all really need is a way to just wait for the task to finish. Something like await task.wait_done(). Then your synchronous cancellation is two steps:

task.cancel()
await task.wait_done()

await task.wait_done() would never raise regardless of how task finished, but (like any await) would raise CancelledError if the current task is cancelled. This is safe even if the task is already cancelled or already done. await task.wait_done() would also be a lot more generally useful than something tailored to cancellation.
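
With current asyncio, something close to this can already be approximated (a sketch; wait_done here is a free function, not an existing Task method):

import asyncio

async def wait_done(task):
    # asyncio.wait() never raises the awaited task's exception or its
    # CancelledError, but (like any await) it propagates a cancellation of
    # the current task.
    await asyncio.wait({task})

# task.cancel()
# await wait_done(task)
# if task.cancelled(): ...   # inspect the outcome afterwards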

For the record, I don't have any need for this. This is just some drive-by API design 😊

@arthur-tacca

arthur-tacca commented Mar 18, 2024

async def cancel_and_wait(task, msg=None):
    task.cancel(msg)
    try:
        await task
    except asyncio.CancelledError:
        if asyncio.current_task().cancelling() == 0:
            raise
        else:
            return # this is the only non-exceptional return
    else:
        raise RuntimeError("Cancelled task did not end with an exception")

Isn't the if condition the wrong way round? If asyncio.current_task().cancelling() == 0, that means the outer task is NOT being cancelled, so the cancellation came from the inner task (await task), and you would want to just return in that case.

[edit: there was a second half to this comment that had a stupid idea which I have now deleted]
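
For reference, the same helper with the condition flipped as described above (a sketch, matching the check in the corrected controller() example at the top of the issue):

async def cancel_and_wait(task, msg=None):
    task.cancel(msg)
    try:
        await task
    except asyncio.CancelledError:
        # Re-raise only if the *current* task has a pending cancellation of
        # its own; otherwise the error came from the awaited task.
        if asyncio.current_task().cancelling() > 0:
            raise
        return  # the awaited task finished cancelling; swallow
    else:
        raise RuntimeError("Cancelled task did not end with an exception")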
