-
-
Notifications
You must be signed in to change notification settings - Fork 30.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Daemon thread is crashing in PyEval_RestoreThread() while the main thread is exiting the process #84058
Comments
Sometimes, test_multiprocessing_spawn does crash in PyEval_RestoreThread() on FreeBSD with a coredump. This issue should be the root cause of bpo-39088: "test_concurrent_futures crashed with python.core core dump on AMD64 FreeBSD Shared 3.x", where the second comment is a test_multiprocessing_spawn failure with "... After: ['python.core'] ..." # Thread 1 (gdb) frame (gdb) info threads
# Thread 2 (gdb) thread 2 (gdb) where The problem is that Python already freed the memory of all PyThreadState structures, whereas PyEval_RestoreThread(tstate) dereferences tstate to get the _PyRuntimeState structure: void
PyEval_RestoreThread(PyThreadState *tstate)
{
assert(tstate != NULL);
struct _ceval_runtime_state *ceval = &runtime->ceval;
assert(gil_created(&ceval->gil));
int err = errno;
take_gil(ceval, tstate);
exit_thread_if_finalizing(tstate);
errno = err;
} I modified PyEval_RestoreThread(tstate) to get runtime from tstate: commit 01b1cc1. Extract of the change: diff --git a/Python/ceval.c b/Python/ceval.c
index 9f4b43615e..ee13fd1ad7 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -384,7 +386,7 @@ PyEval_SaveThread(void)
void
PyEval_RestoreThread(PyThreadState *tstate)
{
- _PyRuntimeState *runtime = &_PyRuntime;
+ _PyRuntimeState *runtime = tstate->interp->runtime;
struct _ceval_runtime_state *ceval = &runtime->ceval;
if (tstate == NULL) {
@@ -394,7 +396,7 @@ PyEval_RestoreThread(PyThreadState *tstate)
int err = errno;
take_gil(ceval, tstate);
- exit_thread_if_finalizing(runtime, tstate);
+ exit_thread_if_finalizing(tstate);
errno = err;
_PyThreadState_Swap(&runtime->gilstate, tstate); By the way, exit_thread_if_finalizing(tstate) also get runtime from state. Before 01b1cc1, it was possible to call PyEval_RestoreThread() from a daemon thread even if tstate was a dangling pointer, since tstate wasn't dereferenced: _PyRuntime variable was accessed directly. -- One simple fix is to access directly _PyRuntime in PyEval_RestoreThread() with a comment explaining why runtime is not get from tstate. I'm concerned by the fact that only FreeBSD buildbot spotted the crash. test_multiprocessing_spawn seems to silently ignore crashes. The bug was only spotted because Python produced a coredump in the current directory. My Fedora 31 doesn't write coredump files in the current files, and so the issue is silently ignored even when using --fail-env-changed. IMHO the most reliable solution is to drop support for daemon threads: they are dangerous by design. But that would be an incompatible change. Maybe we should at least deprecate daemon threads. Python 3.9 now denies spawning a daemon thread in a Python subinterpreter: bpo-37266. |
To reproduce the bug on Linux:
Example: $ ./python daemon_threads_exit.py
Start 100 daemon threads
Exit Python main thread
Erreur de segmentation (core dumped) |
I marked bpo-39088 as a duplicate of this issue. |
Patch to get a cleaner error if the bug occurs, but also to make the bug more reliable: diff --git a/Python/ceval.c b/Python/ceval.c
index ef4aac2f9a..8bf1e4766d 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -382,7 +382,8 @@ PyEval_SaveThread(void)
void
PyEval_RestoreThread(PyThreadState *tstate)
{
- assert(tstate != NULL);
+ assert(!_PyMem_IsPtrFreed(tstate));
+ assert(!_PyMem_IsPtrFreed(tstate->interp));
_PyRuntimeState *runtime = tstate->interp->runtime;
struct _ceval_runtime_state *ceval = &runtime->ceval; |
Ok, it should now be fixed. Let me try to revisit the old bpo-19466 issue. |
I reopen the issue, my change introduced a *new* issue. While trying to fix bpo-19466, work on PR 18848, I noticed that my commit eb4e2ae introduced a race condition :-( The problem is that while the main thread is executing Py_FinalizeEx(), daemon threads can be waiting in take_gil(). Py_FinalizeEx() calls _PyRuntimeState_SetFinalizing(runtime, tstate). Later, Py_FinalizeEx() executes arbitrary Python code in _PyImport_Cleanup(tstate) which releases the GIL to give a chance to other threads to execute: if (_Py_atomic_load_relaxed(&ceval->gil_drop_request)) {
/* Give another thread a chance */
if (_PyThreadState_Swap(&runtime->gilstate, NULL) != tstate) {
Py_FatalError("tstate mix-up");
}
drop_gil(ceval, tstate);
if (_PyThreadState_Swap(&runtime->gilstate, tstate) != NULL) {
Py_FatalError("orphan tstate");
}
} At this point, one daemon thread manages to get the GIL: take_gil() completes... even if runtime->finalizing is not NULL. I expected that exit_thread_if_finalizing() would exit the thread, but exit_thread_if_finalizing() is now called *after* take_gil(). -- It's unclear to me when the GIL (the lock object) is destroyed and how we are supposed to destroy it, if an unknown number of daemon threads are waiting for it. The GIL lock is created by PyEval_InitThreads(): create_gil() with MUTEX_INIT(gil->mutex). The GIL lock is destroyed by _PyEval_FiniThreads(): destroy_gil() with MUTEX_FINI(gil->mutex). |
Funny/not funny, bpo-36818 added a similar bug with commit 396e0a8 in June 2019: bpo-37135. I reverted the change to fix the issue. Hopefully, it should now be fixed and the rationale for accessing directly _PyRuntime should now be better documented. |
I tested (run multiple times) daemon_threads_exit.py with slow_exit.patch: no crash. I also tested (run multiple times) stress.py + sleep_at_exit.patch of bpo-37135: no crash. And I tested asyncio_gc.py of bpo-19466: no crash neither. Python finalization now looks reliable. I'm not sure if it's "more" reliable than previously, but at least, I cannot get a crash anymore, even after bpo-19466 has been fixed (clear Python thread states of daemon threads earlier). |
I created bpo-39941 "multiprocessing: Process.join() should emit a warning if the process is killed by a signal" which may help to detect the issue earlier on any platform, not only on FreeBSD. |
The initial issue is now fixed. I close the issue. take_gil() only checks if the thread must exit once the GIL is acquired. Maybe it would be able to exit earlier, but I took the safe approach. If we must exit, drop the GIL and then exit. That's basically Python 3.8 behavior. If someone wants to optimize/enhance take_gil() for daemon thread, I suggest to open a new issue. Note: Supporting daemon threads require tricky code in take_gil(). I would prefer to deprecate daemon threads and prepare their removal. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: