Deadlock in pool_in_threads.py (test_multiprocessing_pool_circular_import) in free-threaded build #119369

Closed
colesbury opened this issue May 21, 2024 · 1 comment
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes topic-free-threading type-bug An unexpected behavior, bug, or error

colesbury commented May 21, 2024

Bug report

Running pool_in_threads.py in a loop occasionally deadlocks, even with #118745 applied.

Here's a summary of the state I observed:

Thread 28

Thread 28 holds the GIL and is blocked on the QSBR shared mutex:

#6  0x000055989dd7d8a6 in _PySemaphore_PlatformWait (sema=sema@entry=0x7f8ba4ff8c60, timeout=timeout@entry=-1) at Python/parking_lot.c:142
#7  0x000055989dd7d9dc in _PySemaphore_Wait (sema=sema@entry=0x7f8ba4ff8c60, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:213
#8  0x000055989dd7db65 in _PyParkingLot_Park (addr=addr@entry=0x55989e0db1c8 <_PyRuntime+136840>, expected=expected@entry=0x7f8ba4ff8cf7, size=size@entry=1, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x7f8ba4ff8d00, detach=detach@entry=1) at Python/parking_lot.c:316
#9  0x000055989dd7494d in _PyMutex_LockTimed (m=m@entry=0x55989e0db1c8 <_PyRuntime+136840>, timeout=timeout@entry=-1, flags=flags@entry=_PY_LOCK_DETACH) at Python/lock.c:112
#10 0x000055989dd74a4e in _PyMutex_LockSlow (m=m@entry=0x55989e0db1c8 <_PyRuntime+136840>) at Python/lock.c:53
#11 0x000055989dd9c8e1 in PyMutex_Lock (m=0x55989e0db1c8 <_PyRuntime+136840>) at ./Include/internal/pycore_lock.h:75
#12 _Py_qsbr_unregister (tstate=tstate@entry=0x7f8bec00bad0) at Python/qsbr.c:239
#13 0x000055989dd92142 in tstate_delete_common (tstate=tstate@entry=0x7f8bec00bad0) at Python/pystate.c:1797
#14 0x000055989dd92831 in _PyThreadState_DeleteCurrent (tstate=tstate@entry=0x7f8bec00bad0) at Python/pystate.c:1845

The blocking call is this lock acquisition in _Py_qsbr_unregister:

PyMutex_Lock(&shared->mutex);

Normally, PyMutex_Lock would release the GIL while blocking, but we clear the current thread state before calling tstate_delete_common:

cpython/Python/pystate.c, lines 1844 to 1845 in 9fa206a:

current_fast_clear(tstate->interp->runtime);
tstate_delete_common(tstate);

We only detach (and release the GIL) if we both have a thread state and it's currently attached:

if (tstate && _Py_atomic_load_int_relaxed(&tstate->state) ==
                  _Py_THREAD_ATTACHED) {
    // Only detach if we are attached
    PyEval_ReleaseThread(tstate);
}

Note that the PyThreadState in this case is actually attached, just not visible from _PyThreadState_GET().
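
The interaction between clearing the current thread state and the detach check can be illustrated with a small model. This is only a sketch: `toy_tstate`, `toy_thread`, `toy_clear_current` and `toy_detach_before_blocking` are hypothetical stand-ins invented for illustration, not CPython types or functions, and the real code paths do considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only; none of these are CPython's real types. */
typedef struct {
    bool attached;        /* models tstate->state == _Py_THREAD_ATTACHED */
} toy_tstate;

typedef struct {
    toy_tstate *current;  /* models what _PyThreadState_GET() would return */
    bool gil_held;        /* models whether this thread holds the GIL */
} toy_thread;

/* Models current_fast_clear(): hides the thread state without detaching it. */
void toy_clear_current(toy_thread *t)
{
    assert(t != NULL);
    t->current = NULL;
}

/* Models the check made before blocking: detach (and release the GIL) only
   if a current thread state is visible AND it is attached.
   Returns true if the GIL was released. */
bool toy_detach_before_blocking(toy_thread *t)
{
    assert(t != NULL);
    toy_tstate *ts = t->current;
    if (ts != NULL && ts->attached) {
        ts->attached = false;
        t->gil_held = false;
        return true;
    }
    return false;  /* blocks while still holding the GIL */
}
```

In this model, calling `toy_clear_current` before `toy_detach_before_blocking` leaves `gil_held` set even though the thread state is still attached, which mirrors the situation described for Thread 28.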

Thread 4

Thread 4 holds the shared mutex and is blocked trying to acquire the GIL:

#6  take_gil (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:331
#7  0x000055989dd4c93c in _PyEval_AcquireLock (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:585
#8  0x000055989dd92ca3 in _PyThreadState_Attach (tstate=tstate@entry=0x55989eeb6e00) at Python/pystate.c:2071
#9  0x000055989dd4c9bb in PyEval_AcquireThread (tstate=tstate@entry=0x55989eeb6e00) at Python/ceval_gil.c:602
#10 0x000055989dd7d9eb in _PySemaphore_Wait (sema=sema@entry=0x7f8bf3ffe240, timeout=timeout@entry=-1, detach=detach@entry=1) at Python/parking_lot.c:215
#11 0x000055989dd7db65 in _PyParkingLot_Park (addr=addr@entry=0x55989e0c0108 <_PyRuntime+26056>, expected=expected@entry=0x7f8bf3ffe2c8, size=size@entry=8, timeout_ns=timeout_ns@entry=-1, park_arg=park_arg@entry=0x0, detach=detach@entry=1) at Python/parking_lot.c:316
#12 0x000055989dd747ce in rwmutex_set_parked_and_wait (rwmutex=rwmutex@entry=0x55989e0c0108 <_PyRuntime+26056>, bits=bits@entry=1) at Python/lock.c:386
#13 0x000055989dd74e38 in _PyRWMutex_RLock (rwmutex=0x55989e0c0108 <_PyRuntime+26056>) at Python/lock.c:404
#14 0x000055989dd9357a in stop_the_world (stw=stw@entry=0x55989e0db190 <_PyRuntime+136784>) at Python/pystate.c:2234
#15 0x000055989dd936d3 in _PyEval_StopTheWorld (interp=interp@entry=0x55989e0d8780 <_PyRuntime+126016>) at Python/pystate.c:2331
#16 0x000055989dd9c761 in _Py_qsbr_reserve (interp=interp@entry=0x55989e0d8780 <_PyRuntime+126016>) at Python/qsbr.c:201
#17 0x000055989dd908d4 in new_threadstate (interp=0x55989e0d8780 <_PyRuntime+126016>, whence=whence@entry=2) at Python/pystate.c:1543
#18 0x000055989dd92769 in _PyThreadState_New (interp=<optimized out>, whence=whence@entry=2) at Python/pystate.c:1624
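
Taken together, the two backtraces describe a cycle: Thread 28 holds the GIL and waits for the QSBR shared mutex, while Thread 4 holds that mutex and waits for the GIL. The sketch below models this as a tiny wait-for graph and checks for a cycle. All names (`toy_lock_state`, `toy_observed_state`, `toy_has_deadlock`) are hypothetical, and thread indices 0 and 1 stand in for Thread 28 and Thread 4; it is an illustration of the lock ordering, not a debugging tool for CPython.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_NO_OWNER = -1, TOY_NOT_WAITING = -1 };
enum { TOY_GIL = 0, TOY_QSBR_MUTEX = 1, TOY_NUM_LOCKS = 2 };
enum { TOY_MAX_THREADS = 8 };

typedef struct {
    int owner[TOY_NUM_LOCKS];        /* thread index holding each lock, or TOY_NO_OWNER */
    int waits_for[TOY_MAX_THREADS];  /* lock each thread is blocked on, or TOY_NOT_WAITING */
    int num_threads;
} toy_lock_state;

/* State as summarized in the report: index 0 models Thread 28, index 1 models Thread 4. */
toy_lock_state toy_observed_state(void)
{
    toy_lock_state s = {
        .owner = { 0, 1 },                          /* GIL -> thread 0, QSBR mutex -> thread 1 */
        .waits_for = { TOY_QSBR_MUTEX, TOY_GIL },   /* thread 0 waits on mutex, thread 1 on GIL */
        .num_threads = 2,
    };
    return s;
}

/* Follow "thread -> lock it waits for -> owner of that lock".
   Arriving back at the starting thread means it is part of a deadlock cycle. */
bool toy_has_deadlock(const toy_lock_state *s, int start)
{
    assert(start >= 0 && start < s->num_threads);
    int t = start;
    for (int steps = 0; steps < s->num_threads; steps++) {
        int lock = s->waits_for[t];
        if (lock == TOY_NOT_WAITING) {
            return false;
        }
        int owner = s->owner[lock];
        if (owner == TOY_NO_OWNER) {
            return false;
        }
        if (owner == start) {
            return true;
        }
        t = owner;
    }
    return false;
}
```

If the GIL owner is cleared in the model (as it would be had Thread 28 released the GIL before blocking), the cycle disappears.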

Linked PRs

@colesbury colesbury added type-bug An unexpected behavior, bug, or error topic-free-threading 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels May 21, 2024
@colesbury
Contributor Author

Two possible strategies:

  1. Make sure there is a valid PyThreadState when _Py_qsbr_unregister() is called, so that if it blocks, the GIL is released.
  2. Or make sure _Py_qsbr_unregister() is called after the GIL is released.
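
The second strategy amounts to reordering two teardown steps. The following sketch contrasts the two orderings in a toy model. Every name here (`toy_thread`, `toy_release_gil`, `toy_unregister`, `toy_delete_current_old`, `toy_delete_current_new`) is hypothetical and chosen for illustration; it does not reproduce the actual change in CPython.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    bool gil_held;            /* models whether this thread holds the GIL */
    bool registered;          /* models being registered with the QSBR table */
    bool may_block_with_gil;  /* set if the unregister step ran while the GIL was held */
} toy_thread;

void toy_release_gil(toy_thread *t)
{
    assert(t->gil_held);
    t->gil_held = false;
}

/* Stand-in for a step that may block on a mutex. */
void toy_unregister(toy_thread *t)
{
    if (t->gil_held) {
        t->may_block_with_gil = true;  /* the hazardous combination */
    }
    t->registered = false;
}

/* Ordering that matches the reported state: unregister while the GIL is still held. */
void toy_delete_current_old(toy_thread *t)
{
    toy_unregister(t);
    toy_release_gil(t);
}

/* Strategy 2: release the GIL first, then unregister. */
void toy_delete_current_new(toy_thread *t)
{
    toy_release_gil(t);
    toy_unregister(t);
}
```

With the second ordering, a blocked unregister step can no longer keep other threads from acquiring the GIL, so a thread that holds the mutex and needs the GIL can make progress.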

colesbury added a commit to colesbury/cpython that referenced this issue May 24, 2024
Release the GIL before calling `_Py_qsbr_unregister`.

The deadlock could occur when the GIL was enabled at runtime. The
`_Py_qsbr_unregister` call might block while holding the GIL because the
thread state was not active, but the GIL was still held.
colesbury added a commit that referenced this issue May 31, 2024
…19528)
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 31, 2024
…ld (pythonGH-119528)
(cherry picked from commit 078b8c8)

Co-authored-by: Sam Gross <colesbury@gmail.com>
colesbury added a commit that referenced this issue May 31, 2024
…ild (GH-119528) (#119868)
(cherry picked from commit 078b8c8)

Co-authored-by: Sam Gross <colesbury@gmail.com>