Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AsyncIO's wait_for can hide cancellation in a rare race condition #86296

Closed
tvoinarovskyi mannequin opened this issue Oct 23, 2020 · 22 comments · Fixed by #98518
Closed

AsyncIO's wait_for can hide cancellation in a rare race condition #86296

tvoinarovskyi mannequin opened this issue Oct 23, 2020 · 22 comments · Fixed by #98518
Labels
3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes topic-asyncio type-bug An unexpected behavior, bug, or error

Comments

@tvoinarovskyi
Copy link
Mannequin

tvoinarovskyi mannequin commented Oct 23, 2020

BPO 42130
Nosy @ods, @asvetlov, @cjerdonek, @1st1, @tvoinarovskyi, @Dreamsorcerer, @aaliddell, @crowbarz, @WJCFerguson
PRs
  • bpo-42130: Fix swallowing of cancellation by wait_for #26097
  • bpo-42130: Fix for explicit suppressing of cancellations in wait_for() #28149
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2020-10-23.16:10:28.990>
    labels = ['type-bug', '3.8', '3.9', '3.10', 'expert-asyncio']
    title = "AsyncIO's wait_for can hide cancellation in a rare race condition"
    updated_at = <Date 2021-12-02.09:55:13.171>
    user = 'https://github.com/tvoinarovskyi'

    bugs.python.org fields:

    activity = <Date 2021-12-02.09:55:13.171>
    actor = 'dreamsorcerer'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['asyncio']
    creation = <Date 2020-10-23.16:10:28.990>
    creator = 'tvoinarovskyi'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 42130
    keywords = ['patch']
    message_count = 6.0
    messages = ['379454', '379541', '379577', '393336', '407489', '407521']
    nosy_count = 10.0
    nosy_names = ['ods', 'asvetlov', 'chris.jerdonek', 'yselivanov', 'tvoinarovskyi', 'dreamsorcerer', 'aaliddell', 'nmatravolgyi', 'crowbarz', 'jamesf']
    pr_nums = ['26097', '28149']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue42130'
    versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

    Linked PRs

    @tvoinarovskyi
    Copy link
    Mannequin Author

    tvoinarovskyi mannequin commented Oct 23, 2020

    Hi, during migration to Python 3.8.6 we encountered a behavior change from previous versions: wait_for ignored the request to cancellation and returned instead. After investigation, it seems to be related to the update in bpo-32751 and is only reproduced if the waited task is finished when cancellation of wait_for happens (code mistakes external CancelledError for a timeout).

    The following example can reproduce the behavior on both 3.8.6 and 3.9.0 for me:

    
    import asyncio
    
    
    async def inner():
        return
    
    async def with_for_coro():
        await asyncio.wait_for(inner(), timeout=100)
        await asyncio.sleep(1)
        print('End of with_for_coro. Should not be reached!')
    
    async def main():
        task = asyncio.create_task(with_for_coro())
        await asyncio.sleep(0)
        assert not task.done()
        task.cancel()
        print('Called task.cancel()')
        await task  # -> You would expect a CancelledError to be raised.
    
    
    asyncio.run(main())
    

    Changing the wait time before cancellation slightly will return the correct behavior and CancelledError will be raised.

    @tvoinarovskyi tvoinarovskyi mannequin added 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes topic-asyncio type-bug An unexpected behavior, bug, or error labels Oct 23, 2020
    @cjerdonek
    Copy link
    Member

    It looks like bpo-37658 might be the relevant change rather.

    Here is the new logic it introduced:

    if fut.done():
    return fut.result()
    else:
    fut.remove_done_callback(cb)
    fut.cancel()
    raise

    (via #21894 )

    @tvoinarovskyi
    Copy link
    Mannequin Author

    tvoinarovskyi mannequin commented Oct 25, 2020

    Hi Chris,
    Yes, I do believe that is the respectful change, if we look the CancelledError is not checked to be external or originate from the timer.
    Best regards,
    Taras Voinarovskyi

    @nmatravolgyi
    Copy link
    Mannequin

    nmatravolgyi mannequin commented May 9, 2021

    I've also found this deficiency of asyncio.wait_for by debugging an obscure hang in an application. Back then I've quickly made an issue about it: https://bugs.python.org/issue43389

    I've just closed it as duplicate, since this issue covers the same bug and has been around longer.

    I'm surprised this issue has not got more attention. This definitely needs a fix. Ignoring a CancellationError is something that is heavily discouraged by the documentation in general. Silently ignoring/eating a cancellation makes this construct unreliable without good workarounds, aside from not using it.

    For a specific application, this was so broken, that I had to come up with a fix in the short term. I made an asyncio.wait_for variant available as a library that fixes the problem: https://github.com/Traktormaster/wait-for2

    The repository has a detailed description of the issue and the way it fixes it. It also has test cases to assert the behaviour of the builtin and the fixed wait_for construct from the library.

    @wjcferguson
    Copy link
    Mannequin

    wjcferguson mannequin commented Dec 1, 2021

    Echoing nmatravolgyi's comments - I got here after hitting this bug and I too am amazed it's been around so long and hasn't been addressed.

    It makes cancellation in my application very unreliable, and the reason I need well-controlled cancellation semantics is to allow emergency stop of physical movement.

    @Dreamsorcerer
    Copy link
    Mannequin

    Dreamsorcerer mannequin commented Dec 2, 2021

    It has been addressed, PR should be merged this week: #28149

    Like most open source projects, it just needed someone to actually propose a fix.

    @graingert
    Copy link
    Contributor

    graingert commented Apr 28, 2022

    can this be fixed by implementing wait_for with asyncio.timeout?:

    from . import timeouts
    
    async def wait_for(aw, timeout):
        async with timeouts.timeout(timeout):
            return await aw

    @twisteroidambassador
    Copy link
    Contributor

    I have just recently found out that one of my libraries had always been affected by this bug.

    Here's a script that reliably reproduces the bug on Python 3.9 - 3.11, without relying on timing coincidence:

    import asyncio
    
    
    async def left(tasks: list[asyncio.Task], event: asyncio.Event):
        me = asyncio.current_task()
        assert tasks[0] is me
        await asyncio.sleep(0.1)
        event.set()
        tasks[1].cancel()
        print('cancelled right')
    
    
    async def right(tasks: list[asyncio.Task], event: asyncio.Event, use_wait_for: bool):
        try:
            me = asyncio.current_task()
            assert tasks[1] is me
            if use_wait_for:
                await asyncio.wait_for(event.wait(), 1)
            else:
                await event.wait()
            print('right done, was not cancelled')
        except asyncio.CancelledError:
            print('right got cancelled')
            raise
    
    
    async def left_right(use_wait_for: bool):
        tasks = []
        event = asyncio.Event()
        tasks.append(asyncio.create_task(left(tasks, event)))
        tasks.append(asyncio.create_task(right(tasks, event, use_wait_for)))
        await asyncio.wait(tasks)
    
    
    if __name__ == '__main__':
        print('Not using wait_for():')
        asyncio.run(left_right(False))
        print('\nUsing wait_for():')
        asyncio.run(left_right(True))
    
    

    Actual output:

    Not using wait_for():
    cancelled right
    right got cancelled
    
    Using wait_for():
    cancelled right
    right done, was not cancelled
    

    Using asyncio.timeout instead of asyncio.wait_for fixes the problem for me.

    @gvanrossum gvanrossum added the 3.11 only security fixes label Oct 18, 2022
    @gvanrossum
    Copy link
    Member

    Wow, this has certainly been reported enough times -- a fix was even reported but apparently it's still not fixed.

    My head dazzles trying to understand the code of wait_for(). This may have to wait until we have more asyncio experts -- it's too late to fix in 3.11.0 anyways, though possibly we could fix it for 3.11.1 (if there's a PR).

    twisteroidambassador added a commit to twisteroidambassador/cpython that referenced this issue Oct 18, 2022
    One of the new tests may fail intermittently due to python#86296.
    twisteroidambassador added a commit to twisteroidambassador/cpython that referenced this issue Oct 18, 2022
    One of the new tests may fail intermittently due to python#86296.
    @gvanrossum
    Copy link
    Member

    @twisteroidambassador If you want a fix, can you first try @graingert's suggestion from #86296 (comment)? If that works, it'd be great!

    @twisteroidambassador
    Copy link
    Contributor

    Yes, as far as I can see replacing wait_for with timeout solves the issue.

    @twisteroidambassador
    Copy link
    Contributor

    On Python < 3.11 I think using wait instead of timeout is a solution. See #98431.

    @twisteroidambassador
    Copy link
    Contributor

    twisteroidambassador commented Oct 19, 2022

    @gvanrossum said:

    I wonder if the key problem here isn't the one that caused so much trouble for the Semaphore class as well -- you can await a future, and catch CancelledError, but that doesn't mean the future was cancelled, it could mean the current task was cancelled after the future already completed. Maybe there's still a path through the code that doesn't account for that, somehow?

    I think this is the crux of the issue, and an important decision to make for the behavior of wait_for:

    When wait_for is being cancelled from the outside, while simultaneously the inner fut is done, should wait_for:

    1. swallow the cancellation and return fut's result / exception, or
    2. accept the cancellation, and thus swallow fut's result / exception?

    The current code's behavior is 1: See Lines 470 - 472 (EDIT: and I believe it's intentional, since nothing could be cancelling waiter, so this CancelledError must come from the outside)

    try:
    # wait until the future completes or the timeout
    try:
    await waiter
    except exceptions.CancelledError:
    if fut.done():
    return fut.result()
    else:
    fut.remove_done_callback(cb)
    # We must ensure that the task is not running
    # after wait_for() returns.
    # See https://bugs.python.org/issue32751
    await _cancel_and_wait(fut, loop=loop)
    raise

    while my PR #98431 does 2.

    @twisteroidambassador
    Copy link
    Contributor

    A much smaller-scale fix can also achieve behavior 2 above: #98432

    @gvanrossum
    Copy link
    Member

    I've been lying awake thinking this through. In regards to your question about what should happen when wait_for() is cancelled from the outside, (1) or (2), may I propose

    1. If the result is an exception, raise it; if the result is a "successful" return, re-raise the CancelledError.

    My reason is derived from what @grangert convinced me was the correct outcome in some other cancellation edge case (I think it was for either TaskGroup or timeout()). His point was that if a coroutine has an except CancelledError clause but then hits a bug in its exception handling code, thus raising a new exception, that exception should not be suppressed, since it may be the only indication that there's a bug in the app. OTOH if the exception handling for CancelledError merely returns (instead of re-raising it) then we should continue propagating the cancellation outward (where it may be expected by some other code).

    But wait, there's more! In 3.11 there's another wrinkle, uncancel(). When you call task.cancel() and it returns True, a counter on the task is incremented. (And it may take several event loop iterations before the task actually completes.) Later when inspecting the result of the task, you are supposed to call task.uncancel() to decrement that counter again. Technically this is even the right thing to do if the task's state is not "cancelled", but only if your earlier task.cancel() call returned True.

    The cancel-uncancel dance is intended to make it possibly to reliably distinguish between an external cancellation and a cancellation you initiated, even if both occurred simultaneously on the same object. You have to set a flag that remembers the result of your cancel() call, and later, when the task is done, if that flag is set, you call uncancel(), and if it returns zero the cancellation was yours, otherwise it was external.

    I have a feeling that much of the complexity in wait_for() is because it was trying to work around the lack of such a mechanism.

    With all that, I still haven't fully grokked either of your proposed solutions, for which I apologize (all the above was just mental preparation). I also notice that you haven't added a new test yet that demonstrates the problem.

    Finally, when I try the naive version of @graingert's ideas use timeout(), I get failing tests in test_waitfor.py, so either the tests are over-constraining a certain undesirable behavior, or there's a bug in timeout(), or wait_for() has some historical behavior that is just different but which we would need to preserve.

    @twisteroidambassador
    Copy link
    Contributor

    twisteroidambassador commented Oct 21, 2022

    1. If the result is an exception, raise it; if the result is a "successful" return, re-raise the CancelledError.

    My reason is derived from what @grangert convinced me was the correct outcome in some other cancellation edge case (I think it was for either TaskGroup or timeout()). His point was that if a coroutine has an except CancelledError clause but then hits a bug in its exception handling code, thus raising a new exception, that exception should not be suppressed, since it may be the only indication that there's a bug in the app. OTOH if the exception handling for CancelledError merely returns (instead of re-raising it) then we should continue propagating the cancellation outward (where it may be expected by some other code).

    I like this proposal. It avoids the most unexpected behavior of (1), in which wait_for is cancelled from the outside, but then it returns normally like the cancellation never happened. And it gives the programmer a bit more information than (2).

    If we go with this, then I think we have a complete specification for wait_for.

    There are 3 things that can happen once wait_for has been entered: being cancelled externally, exceeding the timeout, or the inner awaitable becomes done. Let's assume all 3 could be happening simultaneously.

    Existing comments in wait_for describes expected behavior:

    We must ensure that the task is not running after wait_for() returns. See https://bugs.python.org/issue32751

    In case task cancellation failed with some exception, we should re-raise it. See https://bugs.python.org/issue40607

    The proposed behavior is:

    • If being cancelled externally:
      • If awaitable is done:
        • if awaitable is cancelled or has an exception, raise it
        • otherwise, raise the external cancellation
      • Otherwise (awaitable is not done), cancel it and wait until it's done, then do the above
        • (Does it matter if awaitable's cancellation is ours?)
      • (Note that whether timeout is exceeded does not factor into these decisions)
    • If timeout exceeded:
      • If awaitable is done, return / raise its result
      • Otherwise (awaitable is not done), cancel it and wait until it's done
        • If awaitable is cancelled and the cancellation is ours, raise TimeoutError
        • Otherwise, return / raise awaitable's result
    • awaitable is done. Return / raise its result

    Does this look reasonable?

    But wait, there's more! In 3.11 there's another wrinkle, uncancel(). When you call task.cancel() and it returns True, a counter on the task is incremented. (And it may take several event loop iterations before the task actually completes.) Later when inspecting the result of the task, you are supposed to call task.uncancel() to decrement that counter again. Technically this is even the right thing to do if the task's state is not "cancelled", but only if your earlier task.cancel() call returned True.

    The cancel-uncancel dance is intended to make it possibly to reliably distinguish between an external cancellation and a cancellation you initiated, even if both occurred simultaneously on the same object. You have to set a flag that remembers the result of your cancel() call, and later, when the task is done, if that flag is set, you call uncancel(), and if it returns zero the cancellation was yours, otherwise it was external.

    I wonder whether the following sequence of events is possible:

    • A calls some_task.cancel()
    • B calls some_task.cancel()
    • A calls some_task.uncancel(), and decides it's not responsible for the cancellation
    • B calls some_task.uncancel() and decides it's responsible for the cancellation

    If it's possible, does it matter?

    @gvanrossum
    Copy link
    Member

    Your proposal looks reasonable. There is some devil is in the details. How can we tell we're being cancelled externally? I suppose the current wait_for() code can tell because it has an outer future and an inner task, and we consider only cancellation of the outer future to be external. I sort of wish we didn't have to have both a future and a task, but in the end timeout() uses that too, so I should just be okay with that.

    Re: the sequence of cance/uncancel calls, it's certain that we may not be able to tell whether A or B cancelled first, but we do know in which order they are handled, and the convention is that whoever handles last should get to decide. Everyone who handles prior to that will see uncancel() return > 0, and by convention has to propagate the outer cancellation.

    There are still some odd corner cases though, e.g. when both A and B cancel, then the coroutine catches the cancellation and returns None. The first handler (typically the last one to cancel, so B) ought to still call uncancel(), will find it > 0, and then what? Raise a fresh CancelledError? Or do whatever it normally does when the coroutine returns None?

    FWIW on how to proceed, let's fix it in a way that's compatible with 3.10 first (so no uncancel() calls), and then in a later PR get the uncancel() calls in, possibly by using timeout(), e.g. as in Kumar's PR (#98518).

    PS. A final wrinkle with uncancel is that it may not exist -- Future doesn't define it, and 3rd party event loops may create tasks that don't have it yet.

    @twisteroidambassador
    Copy link
    Contributor

    While writing tests for the proposed behavior, I realized that it will make await wait_for(fut, timeout=None) and await fut behave differently. If fut becomes done at the same time await fut is cancelled, the await will always raise CancelledError no matter what result / exception fut has, while the proposed wait_for() will act differently depending on fut.

    This also means that to pass these tests, the "easy path" of wait_for() where timeout is None cannot be a simple return await fut anymore.

    Does this still sound like a good idea?

    @gvanrossum
    Copy link
    Member

    Hmm, that does sound like a problem. I've been see-sawing on this quite a bit. At this point I believe Kumar's PR is the gold standard for behavior, and it follows what await fut does.

    (Note that I'm currently severely backed up and it may take a few days to dig myself out, until then I'm not sure you should believe anything I say.)

    @twisteroidambassador
    Copy link
    Contributor

    I'll write down a new set of proposed behavior at the bottom. These will always raise CancelledError when cancelled externally, discarding any result / exception of the inner awaitable if any.

    This makes await wait_for(fut, timeout=None) and await fut behaves the same, and is also the minimal change from existing behavior that fixes the current bug.

    I have also changed my latest PR to match. If we end up preferring the previous behavior, we can always revert the one commit in the PR.

    New proposed behavior (changes are italicized):

    • If being cancelled externally:
      • If awaitable is done: continue with the external cancellation (discarding awaitable's result / exception if any)
      • Otherwise (awaitable is not done), cancel it and wait until it's done, then do the above
    • If timeout exceeded:
      • If awaitable is done, return / raise its result
      • Otherwise (awaitable is not done), cancel it and wait until it's done
        • If awaitable is cancelled and the cancellation is ours, raise TimeoutError
        • Otherwise, return / raise awaitable's result
    • awaitable is done. Return / raise its result

    @kumaraditya303 kumaraditya303 added 3.12 bugs and security fixes and removed 3.9 only security fixes 3.8 only security fixes labels Oct 26, 2022
    @laker-93
    Copy link

    I believe using asyncio.shield() can offer a work around to this in at least Python 3.8.6 (I've only tested it in this version). Modifying the test script that reproduces the issue given here as follows (wrapping the coroutine in the wait_for() call in asyncio.shield()):

    import asyncio
    from typing import List
    
    
    async def left(tasks: List[asyncio.Task], event: asyncio.Event):
        me = asyncio.current_task()
        assert tasks[0] is me
        await asyncio.sleep(0.1)
        event.set()
        tasks[1].cancel()
        print('cancelled right')
    
    
    async def right(tasks: List[asyncio.Task], event: asyncio.Event, use_wait_for: bool):
        try:
            me = asyncio.current_task()
            assert tasks[1] is me
            if use_wait_for:
                await asyncio.wait_for(asyncio.shield(event.wait()), 1)
            else:
                await event.wait()
            print('right done, was not cancelled')
        except asyncio.CancelledError:
            print('right got cancelled')
            raise
    
    
    async def left_right(use_wait_for: bool):
        tasks = []
        event = asyncio.Event()
        tasks.append(asyncio.create_task(left(tasks, event)))
        tasks.append(asyncio.create_task(right(tasks, event, use_wait_for)))
        await asyncio.wait(tasks)
    
    
    if __name__ == '__main__':
        print('Not using wait_for():')
        asyncio.run(left_right(False))
        print('\nUsing wait_for():')
        asyncio.run(left_right(True))
    

    Gives the desired output:

    Not using wait_for():
    cancelled right
    right got cancelled
    
    Using wait_for():
    cancelled right
    right got cancelled
    

    This suggests the issue is to do with the exception being swallowed somewhere deep in the call stack.

    twisteroidambassador added a commit to twisteroidambassador/async_stagger that referenced this issue Nov 17, 2022
    This test fails, but not very reliably.
    The cause of failure is that `asyncio.wait_for` sometimes swallows cancellation:
    python/cpython#86296
    @gvanrossum
    Copy link
    Member

    And fixed in 3.12 by gh-96764. (We're not 100.00% sure that the fix doesn't disturb some workaround, so we're not backporting the fix to 3.11 even though that has timeout(), and we would have to devise a totally different fix for 3.10, which we're not inclined to do, sorry. But going forward it should be fixed.

    SomberNight added a commit to SomberNight/electrum that referenced this issue Aug 4, 2023
    wasted some time because asyncio.wait_for() was suppressing cancellations. [0][1][2]
    deja vu... [3]
    
    Looks like this is finally getting fixed in cpython 3.12 [4]
    So far away...
    In attempt to avoid encountering this again, let's try using
    asyncio.timeout in 3.11, which is how upstream reimplemented wait_for in 3.12 [4], and
    aiorpcx.timeout_after in 3.8-3.10.
    
    [0] python/cpython#86296
    [1] https://bugs.python.org/issue42130
    [2] https://bugs.python.org/issue45098
    [3] kyuupichan/aiorpcX#44
    [4] python/cpython#98518
    SomberNight added a commit to spesmilo/electrum that referenced this issue Aug 4, 2023
    wasted some time because asyncio.wait_for() was suppressing cancellations. [0][1][2]
    deja vu... [3]
    
    Looks like this is finally getting fixed in cpython 3.12 [4]
    So far away...
    In attempt to avoid encountering this again, let's try using
    asyncio.timeout in 3.11, which is how upstream reimplemented wait_for in 3.12 [4], and
    aiorpcx.timeout_after in 3.8-3.10.
    
    [0] python/cpython#86296
    [1] https://bugs.python.org/issue42130
    [2] https://bugs.python.org/issue45098
    [3] kyuupichan/aiorpcX#44
    [4] python/cpython#98518
    twisteroidambassador added a commit to twisteroidambassador/async_stagger that referenced this issue Apr 2, 2024
    Before 0ab7bfe, this test fails, but not very reliably.
    The cause of failure is that `asyncio.wait_for` sometimes swallows cancellation:
    python/cpython#86296
    
    Fixed by switching to `asyncio.timeout`.
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.10 only security fixes 3.11 only security fixes 3.12 bugs and security fixes topic-asyncio type-bug An unexpected behavior, bug, or error
    Projects
    Status: Done
    Development

    Successfully merging a pull request may close this issue.

    6 participants