-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
queue.Queue() is not reentrant, so signals and GC can cause deadlocks #59181
Comments
Python 2.7.3 (default, Apr 20 2012, 22:39:59) If a signal handler calls Queue.PriorityQueue.put and a second signal is received while one is already being processed, neither of the calls to put will terminate. Highly dependent on timing so it might be difficult to reproduce. |
Start queue_deadlock.py in one terminal and note the PID it prints. Compile queue_sendint.c in another terminal and execute it with previous PID as only argument. If the bug is triggered, nothing will print in the python terminal window when you press Enter. To terminate the application you can press Ctrl-\ |
I don't think there is anything special about PriorityQueue. There is a similar concerning the use of the Python implementation of RLock in signal handlers -- see http://bugs.python.org/issue13697. Maybe the signal handler should temporarily mask or ignore SIGINT while it runs. |
I did read some more on the subject and it seems like using locks with interrupts is in general a very difficult problem and not limited to Python. Don't know if this is considered common knowledge among programmers or if it would be useful with at least a warning on the signal manual page. |
This is not specifically a signal issue; it can happen with garbage collection as well if you have a Queue.put that runs in __del__ or a weakref callback function. This can happen in real code. In my case, a thread that reads log messages from a Queue and writes them to disk. The thread putting log messages into the Queue can deadlock if GC happens to cause a log message to be written right after Queue.put() acquired the lock. |
SQLAlchemy suffered from this issue long ago as we use a Queue for connections, which can be collected via weakref callback and sent back to put(), which we observed can occur via gc. For many years (like since 2007 or so) we've packaged a complete copy of Queue.Queue with an RLock inside of it to work around it. I'm just noticing this issue because I'm writing a new connection pool implementation that tries to deal with this issue a little differently (not using Queue.Queue though). |
Guido opposes having us going down the path of using an RLock and altering the queue module to cope with reentrancy. In the case of mixing signals with threads, we all agree with Johan Aires Rastén that "using locks with interrupts is in general a very difficult problem and not limited to Python", nor is it even limited to the queue module. I believe that the usual technique for mixing signals and threads is to have the signal handler do the minimal work necessary to record the event without interacting with the rest of the program. Later, another thread can act on the recorded signal but can do so with normal use of locks so that code invariants are not violated. This design makes reasoning about the program more tractable. |
New changeset 137806ca59ce by Gregory P. Smith in branch '3.5': New changeset 4e15e7618715 by Gregory P. Smith in branch 'default': |
yep, that's what im doing in my approach. though longer term thing, I noticed it's very hard to find documentation on exactly when gc might run. E.g. would it ever run if I did something innocuous, like "self.thread_id = None" (probably not). Just wasn't easy to get that answer. |
New changeset 8c00cbbd3ff9 by Raymond Hettinger in branch '3.5': |
This bug was closed on the basis that signals + threads don't interact well. Which is a good point. Unfortunately this bug can happen in cases that have nothing to do with signals. If you look at the title and some of the comments it also happens as a result of garbage collection: in Specifically, garbage collection can happen on any bytecode boundary and cause reentrancy problems with Queue. The attached file is an attempt to demonstrate this: it runs and runs until GC happens and then deadlocks for me (on Python 3.6). I.e. it prints GC! and after that no further output is printed and no CPU usage reported. Please re-open this bug: if you don't want to fix signal case that's fine, but the GC case is still an issue. |
Itamar wrote up a post describing the GC variant of this problem in more detail: https://codewithoutrules.com/2017/08/16/concurrency-python/ In particular, he highlighted a particularly nasty action-at-a-distance variant of the deadlock where:
As far as I can see, there's no application level way of resolving that short of "Only register logging.handlers.QueueHandler with a logger you completely control and hence can ensure is never used in a __del__ method or weakref callback", which doesn't feel like a reasonable restriction to place on the safe use of a standard library logging handler. |
An idea from the blog post: if we rewrite queue in C it will use the GIL as a lock which will fix this particular bug. I can make a patch. |
Using any kind of potentially-blocking synchronization primitive from __del__ or weakref callback is indeed a bug waiting for happen. I agree non-trivial cases can be hard to debug, especially when people don't expect that kind of cause. It would be ok to submit a patch solving this issue using C code IMHO. Note the maxsize argument complicates things even though most uses of Queue don't use maxsize. Of course, the general issue isn't only about Queue: other synchronization primitives are widely used. See tornadoweb/tornado#1876 for an example. |
Here is a pure Python PoC patch that allows unbounded Queue and LifoQueue to have reentrant put(). |
Are all changes are necessary for this issue or some of them are just an optimization? The patch changes the semantic of public attribute unfinished_tasks. Seems that old unfinished_tasks is new unfinished_tasks + _qsize(). |
The change is necessary: you cannot increment unfinished_tasks in reentrant put(), since incrementing in pure Python is not atomic. So the incrementation is moved to get(), which probably cannot be made reentrant at all. If keeping the visible semantics of the |
per http://bugs.python.org/msg275377 guido does not want an RLock here. |
I'll believe it when Guido chimes in and actually says so himself. Otherwise, I am skeptical Guido cares a lot about whether the internal implementation of Queue uses a Lock, a RLock or whatever else. |
Guido, apparently you are opposed to the Queue implementation using a RLock instead of a Lock, according to Raymond in https://bugs.python.org/issue14976#msg275377. I find this a bit surprising, so could you confirm it yourself? |
Also, to elaborate a bit, I don't think we should aim to make Queue fully reentrant, as that would be extremely involved. I think we can settle on the simpler goal of making put() reentrant for unbounded LIFO and FIFO queues, which is what most people care about (and which is incidentally what the posted patch claims to do). |
Is it guaranteed that the GC will happen in the same thread that is holding the lock? IOW will RLock help with all GC/del deadlocking scenarios? |
+1 for treating Queue.put() specifically as the case to be handled, as that's the mechanism that can be used to *avoid* running complex operations directly in __del__ methods and weakref callbacks. For testing purposes, the current deadlock can be reliably reproduced with sys.settrace:
|
Le 18/08/2017 à 23:26, Guido van Rossum a écrit :
Yes.
Currently it gives a decent message on any bounded queue, even if not |
Oh and: Le 18/08/2017 à 23:26, Guido van Rossum a écrit :
Regular Locks don't have the notion of an owning thread so, while we We can also detect reentrancy in get() and raise an error. |
Would it be feasible to change the behaviour of non-reentrant locks such that:
Then they could sensibly expose the same "_is_locked()" API as RLock, while still disallowing reentrancy by default. |
Le 19/08/2017 à 12:09, Nick Coghlan a écrit :
Yes.
No. It's not a deadlock, since you can release a Lock from another thread. |
After experimenting a bit more with this approach, I now realize that the case where a get() is waiting and gets interrupted by a put() call is not handled properly: there is no obvious way for the get() call to realize (when the interruption finishes) that the queue is now non-empty and can be popped from. So perhaps we need C code after all. |
[Antoine Pitrou]
This matches my experience with functools.lru_cache() where I used an RLock() to handle reentrancy. That by itself was insufficient. I also had to make otherwise unnecessary variable assignments to hang onto object references to avoid a decref triggering arbitrary Python code from reentering before the links were all in a consistent state. Further, I had to create a key wrapper to make sure a potentially reentrant __hash__() call wouldn't be made before the state was fully updated. Even then, a potentially reentrant __eq__() call couldn't be avoided, so I had to re-order the operations to make sure this was the last call after the other state updates. This defended against all normal code, but all these measures still could not defend against signals or a GC invocation of __del__, either of which can happen at any time. On the plus side, we now have a C version of functools.lru_cache() that is protected somewhat by the GIL. On the minus side, it was hard to get right. Even with the pure python code as a model, the person who wrote the C code didn't fully think through all sources of reentrancy and wrote buggy code that shipped in 3.5 and 3.6 (resulting in normal code code triggering hard-to-reproduce reentrancy bugs). The lesson here is that while the C code can be written correctly, it isn't easy to do and it is hard to notice when it is incorrect. One other thought: Given that __del__() can be invoked at almost any time and can potentially call any other piece of Python code, we should consider turning every lock into an rlock. Also, there should be some guidance on __del__() advising considerable restraint on what gets called. The world is likely full of pure Python code that can't defend itself against arbitrary re-entrancy. |
Just a random thought: if there was a SimpleQueue class with very basic functionality (only FIFO, only get(), put() and empty(), no size limit, no task management), it would be easier to make it reentrant using C. (FTR, multiprocessing also has a light-weight SimpleQueue) |
Agreed; the Queue class has a bunch of rarely used functionality rolled Why was task management ever added? |
[Guido]
Raymond published a "joinable" queue class as a recipe here: http://code.activestate.com/recipes/475160-taskqueue/ and later folded it into the standard Python queue. So the usual answer applies: "it was easy to add and sometimes useful, so it seemed like a good idea at the time" ;-) |
[Guido]
See http://bugs.python.org/issue1455676 Problem being solved: How can a requestor of a service get notified when that service is complete given that the work is being done by a daemon thread that never returns. |
Oh well. While it is undoubtedly useful I wish we had had more experience and factored the API differently. Ditto for the maxsize=N feature. So, while it's not too late, perhaps we should indeed follow Antoine's advice and implement a different queue that has fewer features but is guaranteed to be usable by signal handlers and GC callbacks (including __del__). The nice part here is that a queue is mostly a wrapper around a deque anyways, and deque itself is reentrant. (At least one would hope so -- Antoine's patch to Queue assumes this too, and I can't think of a reason why deque would need to release the GIL.) |
Yes, in deque.append(item) and deque.popleft() are atomic. Likewise list.append() and list.pop() are atomic. The heappush() and heappop() operations aren't as fortunate since they call __lt__() on arbitrary Python objects.
Is the idea to use a regular non-reentrant lock but write the whole thing in C to avoid running any pure python instructions (any of which could allow a signal handler to run)? If so, it seems like the only feature that needs to be dropped is subclassing. The maxsize feature and unfinished tasks tracking could still be supported. Also, I suppose the guarantees would have be marked as CPython implementation details, leaving the other implementations to fend for themselves. Or were you thinking about a pure python simplified queue class? If so, an RLock() will likely be required (the act of assigning deque.popleft's return value to a pure python variable can allow a pending signal to be handled before the lock could be released). One other thought: Would it make sense get() and put() to add gc.disable() and gc.enable() whenever GC is already enabled? That would eliminate a source of reentrancy. |
That doesn't sound very nice if some thread is waiting on a get() for a very long time (which is reasonable if you have a thread pool that's only used sporadically, for example). Also, using gc.disable() and gc.enable() in multi-threaded programs is a bit delicate. |
I don't mind someone to reimplement a full-fledged Queue in C. As for me, I am currently implementing a SimpleQueue in C that's reentrant and has only the most basic functionality. |
FYI - here is an appropriately licensed (WTFPL) very simple lock-free queue implemented in C: https://github.com/darkautism/lfqueue |
Le 05/09/2017 à 23:21, Gregory P. Smith a écrit :
Looks like it uses busy-waiting inside its get() equivalent. |
yeah, there are others such as https://www.liblfds.org/ that seem better from that standpoint (gcc, and microsoft compiler support - I'm sure clang is fine as well - anything gcc can do they can do). Ensuring they're supported across build target platforms (all the hardware technically has the ability) is the fun part if we're going to go that route. |
One tangential note about a potential full-fledged Queue in C: the pure Python implementation is fair towards consumers, as it uses a threading.Condition which is itself fair. Achieving the same thing in C may significantly complicate the implementation. Raymond would have to decide whether it's an important property to keep. The SimpleQueue class, being a separate API, is not tied by this constraint. |
To handle the logging.handlers.QueueHandler case, the SimpleQueue needs a put_nowait() method (see line 1959 of Lib/logging/handlers.py. [Mike Bayer]
The answer I got from the other coredevs is that GC can trigger whenever a new GCable object is allocated (pretty much any container or any pure python object). |
Just for the record, Guido explained his aversion to RLocks. Roughly: 1) RLocks are slower and more complex 2) It is difficult to correctly write code that can survive reentrancy, so it is easy fool yourself into believing you've written correct code 3) Getting a deadlock with a regular lock is a better outcome than having the invariants subtly violated, 4) Deadlocks have a more clear-cut failure mode and are easier to debug. Guido also explained that he favored some sort of minimal queue class even if it might not handle all possible gc/weakref/signal/del induced issues. Essentially, he believes there is some value in harm reduction by minimizing the likelihood of a failure. The will help sysadmins who already have a practice of doing monitoring and occasional restarts as a way of mitigating transient issues. |
Hi all, The PR has been ready for quite some time now. Raymond posted some review comments back in September, which I addressed by making the requested changes. If someone wants to add their comments, now is the time. Otherwise I plan to merge the PR sometimes soon, so that it gets in before 3.7 beta. |
Ok, there has been enough reviews in the last four months :-) This is now merged. |
Catalin has been examining code... switching concurrent.futures.thread to use SimpleQueue instead of Queue is probably a good idea as the queues in there get used from weakref callbacks. |
Could you open a new issue for it? |
https://bugs.python.org/issue32576 filed for that |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: