-
-
Notifications
You must be signed in to change notification settings - Fork 31.7k
asyncio: Cancelling wait() after notification leaves Condition in an inconsistent state #67159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
If a task that is waiting on an asyncio.Condition is cancelled after notification but before having successfully reacquired the associated lock, the acquire() will be cancelled causing wait() to return without the lock held (violating wait()'s contract and leaving the program in an inconsistent state). This can be reproduced in cases where there's some contention on the Condition's lock. For example: import asyncio
loop = asyncio.get_event_loop()
cond = asyncio.Condition()
@asyncio.coroutine
def cond_wait_timeout(condition, timeout):
wait_task = asyncio.async(condition.wait())
loop.call_later(timeout, wait_task.cancel)
try:
yield from wait_task
return True
except asyncio.CancelledError:
print("Timeout (locked: {0})".format(condition.locked()))
return False
@asyncio.coroutine
def waiter():
yield from cond.acquire()
try:
print("Wait")
if (yield from cond_wait_timeout(cond, 1)):
# Cause some lock contention
print("Do work")
yield from asyncio.sleep(1)
finally:
cond.release()
@asyncio.coroutine
def notifier():
# Yield to the waiters
yield from asyncio.sleep(0.1)
The most straightforward fix appears to be just to have wait() retry to acquire the lock, effectively ignoring cancellation at this point (since the condition has already finished waiting and just trying to reacquire the lock before returning). |
I don't like the idea of ignoring exceptions (CancelledError). An option may be to store the latest exception and reraise it when the condition is acquired. I'm not sure that it's safe or correct to "retry" to acquire the condition. I don't know what I should think about this issue. I don't see any obvious and simple fix, but I agree that inconsistencies in lock primitives is a severe issue. |
threading.Condition.wait() implementation is very similar to asyncio.Condition.wait(), and the threading code only call acquire() once, it doesn't loop or ignore exceptions. Does it mean that threading.Condition.wait() has the same issue? |
Hi Victor, (Sorry for the delay, I think GMail ate the notification) The main issue is that Condition.wait MUST reacquire the associated lock released on entry to wait() before returning or else the caller's assumption that it holds a lock (such as `with lock:`) will be incorrect. With the threading module, it's typically not possible to interrupt or cancel lock acquisition (no Thread.interrupt for example). Even on Python 3.2 where lock acquisition can be interrupted by POSIX signals, Condition.wait's implementation uses the non-interruptable lock._acquire_restore (calls rlock_acquire_restore → PyThread_acquire_lock → PyThread_acquire_lock → PyThread_acquire_lock_timed with intr=0 [Python/thread_pthread.h] which does the uninterruptable retry loop). So, not a problem for threading.Condition.wait(). As for re-raising the CancelledError after the lock is reacquired, I'm not sure it serves any useful purpose to the caller since by this point we're either in the process of waking up after a notify() or already raising an exception (such as a CancelledError if someone called task.cancel() _before_ the condition was notified). Making the caller also handle re-acquisition failure would be quite messy. For example, one idea would be to have the caller check cond.locked() after a CancelledError and have the caller reacquire, but it turns in asyncio you can't tell whom owns the lock (could have been grabbed by another task). |
Just for context, the reason for using a custom wait function (cond_wait_timeout) rather than the more idiomatic asyncio.wait_for(cond.wait(), timeout) is that the combination introduces another subtle, but nasty locking inconsistency (see https://groups.google.com/forum/#!topic/python-tulip/eSm7rZAe9LM ). Should probably open a ticket for that issue too. |
This issue can still be reproduced on Python 3.5.0a1. Please let me know if there's any further steps you'd like me to take here. |
Is anyone going to try and fix this for 3.5.0? Or should we "kick the can down the road" and reassign it to 3.6? |
asyncio keeps its "provisional API" status in the Python 3.5 cycle, so this issue must not block the 3.5.0 release. It can be fixed later. This issue is complex, I'm not convinced that asyncio-fix-wait-cancellation-race.patch is the best fix. |
Reassigning to 3.6. |
(It could be fixed in 3.5.1, since asyncio is still provisional in 3.5.) |
Is this still a Release Blocker for 3.6? |
Presumably. :-( |
Since I reported this, I haven't seen any proposed solutions other other than a retry loop to ensure that the lock is guaranteed to be reacquired when the sleeping coroutine is woken. The four possibilities of cancelling the waiting coroutine are:
Cancellation is quite analogous to EINTER of system calls interrupted by signals - you'll notice that the CPython runtime will retry acquiring the lock forever https://hg.python.org/cpython/file/default/Python/thread_pthread.h#l355 . This is acceptable behaviour acording the the POSIX spec http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html , in particular: "If a signal is delivered to a thread waiting for a condition variable, upon return from the signal handler the thread resumes waiting for the condition variable as if it was not interrupted, or it shall return zero due to spurious wakeup." "Upon successful return, the mutex shall have been locked and shall be owned by the calling thread." "These functions shall not return an error code of [EINTR]." Thus, other than throwing something like a RuntimeError, there's no way to legally return from a Condition.wait without having that lock reacquired. Thus the only option is to retry if the lock reacquire is interrupted. |
Yury can you take this on? |
OK, I'll see what I can do (this is some code that I don't know/use). |
Hi David, thanks for working on this issue. Your patch and reasoning behind it looks correct, but it's hard to say that definitively without tests and full code review. Could you please submit a PR to https://github.com/python/asyncio (upstream for asyncio) with tests, and I'll do my best to review it before 3.5.2 happens. |
Hi Yury, Sure - I'll create a PR along with a test that reproduces the issue. Should be able to get that done this weekend. |
Please find the PR including a test to reproduce the issue here: python/asyncio#346 |
New changeset 0bda9bc443ce by Yury Selivanov in branch '3.5': New changeset 00a9de0f3fdc by Yury Selivanov in branch 'default': |
I'm merging David's patch. I concur with David that the only viable way to fix this is to loop through the CancelledError until the lock is acquired (and Condition is in a right state). Thanks, David. It should make it to 3.5.2 now. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: