Locks in the standard library should be sanitized on fork #50970
The python logging module uses a lock to surround many operations. This causes deadlocks in programs that use logging, threading and fork() at the same time: a child process forked while another thread holds the logging lock inherits the lock in its taken state.
The deadlock is more likely to happen on a highly loaded system. A demo of the problem, simplified into one file, is attached. The Python standard library should not be the cause of these deadlocks. One fix: (A) acquire all locks before forking, release them immediately after. Code was added to call some cleanups after forking, but rather than having to manually add after-fork code hooks into every file, a generic mechanism is preferable. |
I've started a project to patch this and similar messes for Python: http://code.google.com/p/python-atfork/ I'd like to take ideas or implementations from it when possible. |
bpo-6923 has been opened to provide a C API for an atfork mechanism. |
Rather than having a kind of global module registry, locks could keep track of this state themselves. |
Locks can't blindly release themselves just because they find themselves in a new process. If anything, a lock that is held and finds itself running in a new process indicates a potential bug. I'm also not sure a PID check is good enough: old Linux using linuxthreads gave threads different PIDs, so a PID change does not necessarily mean a fork happened. |
I was suggesting "reinitialize", rather than "release". That is, create a fresh lock in the child process. |
No need for that. The problem is that the locks are held by a thread that no longer exists in the child. Example: if you fork while another thread is in the middle of logging, the logging lock is held in the child by a thread that is gone. Locks are not shared between processes, so reinitializing them with a new lock in the child is safe. |
I'm not sure that releasing the mutex is enough: it can still lead to a segfault, as is probably the case in this issue. Quoting the pthread_atfork man page:

    To understand the purpose of pthread_atfork, recall that fork duplicates the whole memory space, including mutexes in their current locking state, but only the calling thread: other threads are not running in the child process. The mutexes are not usable after the fork and must be initialized with pthread_mutex_init in the child process. This is a limitation of the current implementation and might or might not be present in future versions.

    To avoid this, install handlers with pthread_atfork as follows: have the prepare handler lock the mutexes (in locking order), and the parent handler unlock the mutexes. The child handler should reset the mutexes using pthread_mutex_init, as well as any other synchronization objects such as condition variables.

    Locking the global mutexes before the fork ensures that all other threads are locked out of the critical regions of code protected by those mutexes. Thus when fork takes a snapshot of the parent's address space, that snapshot will copy valid, stable data. Resetting the synchronization objects in the child process will ensure they are properly cleansed of any artifacts from the threading subsystem of the parent process. For example, a mutex may inherit a wait queue of threads waiting for the lock; this wait queue makes no sense in the child process. Initializing the mutex takes care of this.

pthread_atfork might be worth looking into. |
FWIW, http://bugs.python.org/issue6643 recently fixed an issue where a mutex was being released instead of reinitialized after a fork; more such fixes are likely needed. Are you suggesting we use pthread_atfork to call pthread_mutex_init on all mutexes created by Python in the child after a fork? I'll have to ponder that some more. Given that the mutexes are all useless post-fork, it does not strike me as a bad idea. |
I don't really understand. It's quite similar to the idea you shot down in msg94135. Or am I missing something? |
Yeah, I'm trying to figure out what I was thinking then or if I was just plain wrong. :) I was clearly wrong about a release being done in the child being the right thing to do (bpo-6643 proved that, the state held by a lock is not usable to another process on all platforms such that release even works). Part of it looks like I wanted a way to detect it was happening as any lock that is held during a fork indicates a _potential_ bug (the lock wasn't registered anywhere to be released before the fork) but not everything needs to care about that. |
Yeah, apparently OS X is one of them; the reporter in bpo-11148 is seeing a crash there.
Yes, that's what I was thinking: instead of scattering after-fork fixups all over the place.
|
I encountered this issue while debugging some multiprocessing code; fork() would be called from one thread while sys.stdout was in use in another thread (simply because of a couple of debugging statements). As a result the IO lock would be already "taken" in the child process and any operation on sys.stdout would deadlock. This is definitely something that can happen more easily than I thought. |
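The scenario can be reproduced without any I/O at all. The sketch below (POSIX-only; names are mine) forks while a worker thread holds a plain threading.Lock, and demonstrates that the child inherits the lock in the taken state even though the owning thread no longer exists there:

```python
import os
import threading
import time

lock = threading.Lock()
started = threading.Event()

def holder():
    with lock:
        started.set()
        time.sleep(1.0)  # keep holding the lock while the main thread forks

threading.Thread(target=holder).start()
started.wait()  # make sure the lock really is held before forking

pid = os.fork()
if pid == 0:
    # The holder thread does not exist in the child, yet the lock is still
    # marked as taken: a blocking acquire() here would hang forever.
    os._exit(0 if not lock.acquire(blocking=False) else 1)
_, status = os.waitpid(pid, 0)
```

The child exits with status 0, confirming the lock was inherited taken; with sys.stdout's internal lock instead of this explicit one, the same situation turns into the deadlock described above.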
Here is a patch with tests for the issue (some of which fail of course). |
Those tests make sense to me. |
"# A lock taken from the current thread should stay taken in the child process." Note that I'm not sure how to implement this.

Note that this means that even the current code allocating new locks after fork (in Lib/threading.py, _after_fork and _reset_internal_locks) is unsafe, because the old locks will be deallocated, and lock deallocation tries to acquire and release the lock before destroying it (in bpo-11148 the OP experienced a segfault on OS X when locking a mutex, but I'm not sure of the exact context).

Also, this would imply keeping track of the thread currently owning the lock, and doesn't match the typical pthread_atfork idiom (acquire locks just before fork, release just after in parent and child, or just reinit them in the child process).

Finally, IMHO, forking while holding a lock and expecting it to be usable after fork doesn't make much sense, since a lock is acquired by a thread, and this thread doesn't exist in the child process. It's explicitly described as "undefined" by POSIX, see http://pubs.opengroup.org/onlinepubs/007908799/xsh/sem_init.html

So I'm not sure whether it's feasible/wise to provide such a guarantee. |
Yes, we would need to keep track of the thread id and process id inside the lock. Synopsis:

    def _reinit_if_needed(self):
        # Call this before each acquire() or release()
        if self.pid != getpid():
            sem_init(self.sem, 0, 1)
            if self.taken:
                if self.tid == main_thread_id_after_fork:
                    # Lock was taken in forked thread, re-take it
                    sem_wait(self.sem)
                else:
                    # It's now released
                    self.taken = False
            self.pid = getpid()
            self.tid = current_thread_id()
Well, I fail to understand how that idiom can help us. We're not a C library. |
A couple remarks:

    P1
        lock.acquire()
        fork() -> P2
                  start_new_thread T2
                  T1            T2
                                lock.acquire()

The acquisition of the lock by T2 will cause the lock's reinitialization: what happens to the lock's state then?
Yes, but in that case, you don't have to reacquire the locks after fork. |
Oops, for linuxthreads, you should of course read "different PIDs", not "same PID". |
Ouch, then the approach I'm proposing is probably doomed.
Well, this means indeed that *some* locks can be handled, but not all of them (do note that library code can call arbitrary third-party code, by the way). |
Well, it works on Linux with NPTL, but I'm not sure at all that it holds on other platforms.
When a lock object is allocated in Modules/threadmodule.c |
Please disregard my comment on PyEval_ReInitThreads and _after_fork. |
Thanks for the explanations. This sounds like an interesting path.
Actually, I think the issue is POSIX-specific: Windows has no fork().
Well, the big difference between Python locks and POSIX mutexes is that a Python lock can be released from a thread other than the one that acquired it. But even though we might not be "fixing" arbitrary Python code, we can at least fix the stdlib (we could also imagine that the creator of the lock decides whether it should be reinitialized after fork). |
Hi,

There seem to be two alternatives for atfork handlers:
(1) acquire the locks in the prepare handler, then release them in both the parent and child handlers;
(2) reinitialize the locks in the child after the fork.

http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_atfork.html

Option (2) makes sense but is probably not always applicable. Initializing locks in the child after fork without acquiring them before the fork may result in corrupted program state and so is probably not a good idea.

On a positive note, if I understand correctly, Python signal handler functions are actually run in the regular interpreter loop (as pending calls) after the signal has been handled, and so os.fork() atfork handlers will not be restricted to async-signal-safe operations (since a Python fork is never done in a signal handler). http://pubs.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_04.html

Opinion by Butenhof, who was involved in the standardization effort of POSIX threads: ...so how can we establish a correct (cross-library) locking order during the prepare stage?

Nir |
@Nir Aides: *thanks* for this link.
That sounds like a lost battle, if it requires the libraries' cooperation. |
Hello Nir,
There are indeed a couple problems with 1:
So, we would have to:
I think this is going to be very complicated.
This is perfectly valid with the current lock implementation. For all those reasons, I don't think that this approach is reasonable.
Yes, but in practice, I think that this shouldn't be too much of a problem.
That's correct. In short, I think that we could first try to avoid the common deadlocks by reinitializing the locks held in the child process. Attached is a first draft of such a patch (with tests).
Notes:
This fixes the common deadlocks with threading.Lock. |
While having to deal with this bug for a while, I have written a small library for it. It allows registering atfork hooks (similar to the ones available by now) and frees the stdout/stderr locks as well as manually provided IO locks. I guess it uses some hacky ways to get the job done, but it resolved the issue for me and has been working without problems for some weeks now. |
I think we should somehow move forward on this, at least for the logging locks, which can be quite an annoyance. There are two possible approaches: a generic mechanism that resets locks in the child after fork, or special-casing the known problematic locks (such as logging's).
What do you think? |
Oh, I forgot that IO buffered objects also have a lock. So we would have to special-case those as well, unless we take the generic approach... A problem with the generic approach is that it would leave higher-level synchronization objects such as RLock, Event etc. in an inconsistent state. Not to mention the case where the lock is taken by the thread calling fork()... |
logging is pretty easy to deal with, so I created a PR. bufferedio.c is a little more work, as we either need to use the posixmodule.c os.register_at_fork API or expose it as an internal C API so we can call it to add acquires and releases around the buffer's self->lock member when non-NULL. Either way, that needs to be written safely so that it doesn't crash if a fork happens after a buffered io struct is freed (unregister the atfork handlers when freeing it? messy). |
Actually, we already have a doubly-linked list of buffered IO objects, so it could be used for that. |
FWIW, I encountered the same kind of issue when using the mkstemp() function: under the hood, it calls gettempdir() and this one is protected by a lock too. Current thread 0x00007ff10231f700 (most recent call first): |
It seems like this change caused a regression in the Anaconda installer of Fedora, but we are not sure at this point. I have to investigate to understand exactly what is happening. |
I suspect 3b69993 is causing a hang in libreswan's kvmrunner.py on Fedora.

Looking at the Fedora RPMs: python3-3.7.0-9.fc29.x86_64 didn't contain the fix and worked.

I believe the hang looks like:

Traceback (most recent call last):
File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 389, in _process_test
test_domains = _boot_test_domains(logger, test, domain_prefix, boot_executor)
File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 203, in _boot_test_domains
TestDomain.boot_and_login)
File "/home/build/libreswan-web/master/testing/utils/fab/runner.py", line 150, in submit_job_for_domain
logger.debug("scheduled %s on %s", job, domain)
  File "/usr/lib64/python3.7/logging/__init__.py", line 1724, in debug
  File "/usr/lib64/python3.7/logging/__init__.py", line 1768, in log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1449, in log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1519, in _log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1529, in handle
  File "/usr/lib64/python3.7/logging/__init__.py", line 1591, in callHandlers
  File "/usr/lib64/python3.7/logging/__init__.py", line 905, in handle
  File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163, in emit
    stream_handler.emit(record)
  File "/usr/lib64/python3.7/logging/__init__.py", line 1038, in emit
  File "/usr/lib64/python3.7/logging/__init__.py", line 1015, in flush
  File "/usr/lib64/python3.7/logging/__init__.py", line 854, in acquire
KeyboardInterrupt |
We need a small test case that can reproduce your problem. I believe 3b69993 to be correct: acquiring locks before fork in the thread doing the forking and releasing them afterwards is always the safe thing to do. Example possibility: does your code use any C code that forks on its own without properly calling the CPython PyOS_BeforeFork(), PyOS_AfterFork_Parent(), and PyOS_AfterFork_Child() APIs? |
No. Is there a web page explaining how to pull a python backtrace from all the threads running within a daemon? |
I'd start with faulthandler.register with all_threads=True and see if that gives you what you need. |
It's also an easy way to cause a deadlock:
If a thread were to grab its logging lock before the global lock then it would deadlock. I'm guessing this isn't allowed; however, I didn't see any comments to this effect. Can I suggest documenting this, and also merging the two callbacks so that the ordering of these two acquires is made explicit?
If a thread were to acquire two per-logger locks in a different order then things would deadlock. |
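A consistent global acquisition order is the standard cure for this kind of deadlock. A minimal sketch (the helper names are mine, and ordering by id() is just one possible stable key):

```python
import threading

def acquire_in_order(*locks):
    # Take the locks in a fixed process-wide order (here: sorted by id())
    # so that two threads can never take the same pair in opposite orders.
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    return ordered

def release_all(ordered):
    # Release in reverse acquisition order.
    for lock in reversed(ordered):
        lock.release()

a = threading.Lock()
b = threading.Lock()
# Callers may pass the locks in any order; the acquisition order is the same.
release_all(acquire_in_order(b, a))
```

The same idea applies to a fork() handler that must grab many per-logger locks: as long as every code path uses one total order, the opposite-order deadlock cannot occur.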
Below is a backtrace from the deadlock. It happens because the logging code is trying to acquire two per-logger locks, and in an order different to the random order used by the fork() handler.

The code in question has a custom class DebugHandler(logging.Handler). The default logging.Handler.handle() method grabs its lock and calls .emit():

    if rv:
        self.acquire()
        try:
            self.emit(record)
        finally:
            self.release()

The custom .emit() then sends the record to a sub-logger stream:

    def emit(self, record):
        for stream_handler in self.stream_handlers:
            stream_handler.emit(record)
        if _DEBUG_STREAM:
            _DEBUG_STREAM.emit(record)

and one of these emit() functions calls flush(), which tries to acquire a further lock.

Thread 0x00007f976b7fe700 (most recent call first):
  File "/usr/lib64/python3.7/logging/__init__.py", line 1015 in flush
      self.acquire()   <---
  File "/usr/lib64/python3.7/logging/__init__.py", line 1038 in emit
  File "/home/build/libreswan-web/master/testing/utils/fab/logutil.py", line 163 in emit
      stream_handler.emit(record)   <---
  File "/usr/lib64/python3.7/logging/__init__.py", line 905 in handle
  File "/usr/lib64/python3.7/logging/__init__.py", line 1591 in callHandlers
  File "/usr/lib64/python3.7/logging/__init__.py", line 1529 in handle
  File "/usr/lib64/python3.7/logging/__init__.py", line 1519 in _log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1449 in log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1768 in log
  File "/usr/lib64/python3.7/logging/__init__.py", line 1724 in debug
  File "/home/build/libreswan-web/master/testing/utils/fab/shell.py", line 110 in write |
Thanks for the debugging details! I've filed https://bugs.python.org/issue36533 to specifically track this potential regression in the 3.7 stable branch. lets carry on there where the discussion thread isn't too long for bug tracker sanity. |
I created bpo-40089: Add _at_fork_reinit() method to locks. |
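With that method (a private API added in Python 3.9), the child-side fork handler reduces to swapping in fresh lock state. A hedged sketch, guarded with hasattr since the method is version-dependent:

```python
import os
import threading

lock = threading.Lock()

def _reinit_locks_in_child():
    # _at_fork_reinit() (private, Python 3.9+) replaces the underlying OS
    # lock with a fresh, unlocked one, no matter who held it at fork time.
    lock._at_fork_reinit()

if hasattr(lock, "_at_fork_reinit"):  # guard: method absent before 3.9
    os.register_at_fork(after_in_child=_reinit_locks_in_child)
```

Unlike release(), reinitialization never touches the possibly corrupted state inherited from the parent, which is exactly what the pthread_atfork man page quoted earlier recommends for the child handler.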
Related issue: |
https://bugs.python.org/issue40442 is a fresh instance of this, entirely self-inflicted. |
See also bpo-25920: PyOS_AfterFork should reset socketmodule's lock. |
While it's true that "Locks in the standard library should be sanitized on fork", IMO having such a "meta-issue" to track the problem across the 300+ stdlib modules is a bad idea, since it's hard to track how many modules got fixed and how many still need fixing. Multiple modules have been fixed. I suggest opening more specific issues for the remaining ones. I'm closing this issue. Thanks to everyone who was involved in fixing these issues, and good luck to the people volunteering to fix the remaining ones :-) Also, avoid fork without exec: it's no longer supported on macOS, it was never supported on Windows, and it causes tons of very complex bugs on Linux :-) |