Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The main thread hang at drop_gil due to FORCE_SWITCHING after daemon thread exited #96387

Closed
milizao opened this issue Aug 29, 2022 · 12 comments
Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@milizao
Copy link

milizao commented Aug 29, 2022

Bug report

Main thread hang at drop_gil after daemon thread exited, my Python version is 3.9.2, but I think the same issue is exists in the Python 3.10 newest version.

(gdb) info threads
  Id   Target Id                                   Frame 
* 1    Thread 0xffff909dd010 (LWP 226769) "python" 0x0000ffff909355c0 in futex_wait_cancelable (private=<optimized out>, expected=0, 
    futex_word=0xaaaadf99e8d4 <_PyRuntime+524>) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) bt
#0  0x0000ffff909355c0 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0xaaaadf99e8d4 <_PyRuntime+524>)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0xaaaadf99e8d8 <_PyRuntime+528>, cond=0xaaaadf99e8a8 <_PyRuntime+480>) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=cond@entry=0xaaaadf99e8a8 <_PyRuntime+480>, mutex=mutex@entry=0xaaaadf99e8d8 <_PyRuntime+528>) at pthread_cond_wait.c:638
#3  0x0000aaaadf35d184 in drop_gil (ceval=0xaaaadf99e820 <_PyRuntime+344>, ceval@entry=0xaaaadf300274 <PyObject_Hash+116>, ceval2=ceval2@entry=0xaaaafbc61ff0, 
    tstate=tstate@entry=0xaaaafbc64a00) at Python/ceval_gil.h:178
#4  0x0000aaaadf2c05c4 in eval_frame_handle_pending (tstate=0xaaaafbc64a00) at Python/ceval.c:882
#5  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=0xffff83929d40, throwflag=<optimized out>) at Python/ceval.c:1402
#6  0x0000aaaadf2bee1c in _PyEval_EvalFrame (throwflag=0, f=0xffff83929d40, tstate=0xaaaafbc64a00) at ./Include/internal/pycore_ceval.h:40
#7  function_code_fastcall (tstate=0xaaaafbc64a00, co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Objects/call.c:330
#8  0x0000aaaadf31ac04 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, 
    tstate=<optimized out>) at ./Include/cpython/abstract.h:118
#9  _PyObject_CallNoArg (func=<optimized out>) at ./Include/cpython/abstract.h:172
#10 call_unbound_noarg (self=<optimized out>, func=0xffff83691ee0, unbound=<optimized out>) at Objects/typeobject.c:1515
#11 slot_tp_finalize (self=<optimized out>) at Objects/typeobject.c:6992
#12 0x0000aaaadf30011c in PyObject_CallFinalizer (self=self@entry=0xffff837ea6d0) at Objects/object.c:195
#13 0x0000aaaadf302694 in PyObject_CallFinalizerFromDealloc (self=self@entry=0xffff837ea6d0) at Objects/object.c:213
#14 0x0000aaaadf314b1c in subtype_dealloc (self=0xffff837ea6d0) at Objects/typeobject.c:1265
#15 0x0000aaaadf2f3f4c in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#16 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#17 insertdict (mp=mp@entry=0xffff8383de80, key=0xffff837450b0, hash=-5519671433175608536, value=value@entry=0xaaaadf94be68 <_Py_NoneStruct>)
    at Objects/dictobject.c:1123
#18 0x0000aaaadf2f52d4 in PyDict_SetItem (op=op@entry=0xffff8383de80, key=<optimized out>, value=value@entry=0xaaaadf94be68 <_Py_NoneStruct>)
    at Objects/dictobject.c:1573
#19 0x0000aaaadf2ffa48 in _PyModule_ClearDict (d=0xffff8383de80) at Objects/moduleobject.c:612
#20 0x0000aaaadf2ffc84 in _PyModule_Clear (m=m@entry=0xffff837f1a40) at Objects/moduleobject.c:560
#21 0x0000aaaadf3798e8 in _PyImport_Cleanup (tstate=tstate@entry=0xaaaafbc64a00) at Python/import.c:606
#22 0x0000aaaadf38e63c in Py_FinalizeEx () at Python/pylifecycle.c:1426
#23 0x0000aaaadf2c9110 in Py_RunMain () at Modules/main.c:679
#24 0x0000aaaadf2c99f8 in pymain_main (args=args@entry=0xffffd3754d88) at Modules/main.c:707
#25 0x0000aaaadf2c9ad4 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:731
#26 0x0000ffff906c1f80 in __libc_start_main (main=0x0, argc=0, argv=0x0, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=<optimized out>) at ../csu/libc-start.c:308
#27 0x0000aaaadf2c87bc in _start ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) py-bt
Traceback (most recent call first):
  File "/home/ossadm/zhao/runtime.py", line 8, in __del__
    def __del__(self):
(gdb) 

Rough analysis

In Python version 3.9.2, the take_gil is implemented as following:

static void
take_gil(PyThreadState *tstate)
{
...
    while (_Py_atomic_load_relaxed(&gil->locked)) {
        unsigned long saved_switchnum = gil->switch_number;

        unsigned long interval = (gil->interval >= 1 ? gil->interval : 1);
        int timed_out = 0;
        COND_TIMED_WAIT(gil->cond, gil->mutex, interval, timed_out);

        /* If we timed out and no switch occurred in the meantime, it is time
           to ask the GIL-holding thread to drop it. */
        if (timed_out &&
            _Py_atomic_load_relaxed(&gil->locked) &&
            gil->switch_number == saved_switchnum)
        {
            if (tstate_must_exit(tstate)) {
                MUTEX_UNLOCK(gil->mutex);
                PyThread_exit_thread();   // [2]
            }
            assert(is_tstate_valid(tstate));

            SET_GIL_DROP_REQUEST(interp);  // [1]
        }
    }
...
}

A daemon thread may call SET_GIL_DROP_REQUEST and then exits from PyThread_exit_thread, so this thread will not get GIL really. This happens if the main thread is on the finalizing procedure and it calls _PyRuntimeState_SetFinalizing in Py_FinalizeEx before the daemon thread gets GIL (The daemon thread wait 5ms at most before exit).

After that, the flag gil_drop_request is set, but only the main thread is left. Then, if the main thread runs into drop_gil again, it will hang at FORCE_SWITCHING:

COND_WAIT(gil->switch_cond, gil->switch_mutex);

Reproducible example

This issue could be reproduced by:

$ python -m http.server 5678 &
$ for ((i=0;;i++)); do echo $i; python -W all service.py; done

And the scripts used are:

runtime.py

import socket

class TestUnclosedSocket:
    def __init__(self):
        self.s = socket.socket()
    def __del__(self):
        self.s.connect(('127.0.0.1', 5678))

ins = TestUnclosedSocket()

service.py

import threading
import runtime
import sys

sys.setswitchinterval(0.001)

def calc():
    sum = 0
    for i in range(100000):
        sum += i * i

def daemon_func():
    while True:
        calc()

if __name__ == '__main__':
    threading.Thread(target=daemon_func, name="daemon", daemon=True).start()
    calc()

I change GIL switch interval to 1ms to increase the probability here.

@milizao milizao added the type-bug An unexpected behavior, bug, or error label Aug 29, 2022
@113xiaoji
Copy link

Reproducible example

This issue could be reproduced by:

No need to__del__ Call self.s.connect(('127.0.0.1', 5678)), only initialize in init

call self.s = socket.socket()

And the scripts used are:

runtime.py

import socket

class TestUnclosedSocket:
    def __init__(self):
        self.s = socket.socket()
ins = TestUnclosedSocket()

service.py

import threading
import runtime
import sys

sys.setswitchinterval(0.001)

def calc():
    sum = 0
    for i in range(100000):
        sum += i * i

def daemon_func():
    while True:
        calc()

if __name__ == '__main__':
    threading.Thread(target=daemon_func, name="daemon", daemon=True).start()
    calc()

change GIL switch interval to 1ms to increase the probability here.

gdb trace is

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f2266928740 (LWP 192050) "python3.9" 0x00007f22662ce995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

(gdb) bt
#0  0x00007f22662ce995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004e9246 in drop_gil (ceval=0x993c38 <_PyRuntime+344>, ceval2=<optimized out>, tstate=<optimized out>) at Python/ceval_gil.h:178
#2  0x00000000004206d6 in eval_frame_handle_pending (tstate=0x23b5a80) at Python/ceval.c:882
#3  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:1402
#4  0x000000000041eb18 in _PyEval_EvalFrame (throwflag=0, f=0x7f22668d9610, tstate=0x23b5a80) at ./Include/internal/pycore_ceval.h:40
#5  function_code_fastcall (tstate=0x23b5a80, co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Objects/call.c:330
#6  0x000000000048ecf1 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>)
    at ./Include/cpython/abstract.h:118
#7  _PyObject_CallNoArg (func=<optimized out>) at ./Include/cpython/abstract.h:172
#8  call_unbound_noarg (self=<optimized out>, func=0x7f225e808af0, unbound=<optimized out>) at Objects/typeobject.c:1515
#9  slot_tp_repr (self=<optimized out>) at Objects/typeobject.c:6630
#10 0x000000000047151d in PyObject_Repr (v=0x7f22667b7ac0) at Objects/object.c:420
#11 0x00000000004cea36 in unicode_fromformat_arg (vargs=0x7fff4339f510, f=0x7f225ec6e1b8 "R", writer=0x7fff4339f530) at Objects/unicodeobject.c:2989
#12 PyUnicode_FromFormatV (format=format@entry=0x7f225ec6e1ae "unclosed %R", vargs=vargs@entry=0x7fff4339f5b8) at Objects/unicodeobject.c:3053
#13 0x00000000004e7628 in _PyErr_WarnFormatV (vargs=0x7fff4339f5b8, format=0x7f225ec6e1ae "unclosed %R", stack_level=1, category=0x942be0 <_PyExc_ResourceWarning>, source=0x7f22667b7ac0)
    at Python/_warnings.c:1124
#14 PyErr_ResourceWarning (source=source@entry=0x7f22667b7ac0, stack_level=stack_level@entry=1, format=format@entry=0x7f225ec6e1ae "unclosed %R") at Python/_warnings.c:1179
#15 0x00007f225ec62aea in sock_finalize (s=0x7f22667b7ac0) at /opt/cph/Python-3.9.2/Modules/socketmodule.c:5044
#16 0x0000000000475b7a in PyObject_CallFinalizer (self=0x7f22667b7ac0) at Objects/object.c:195
#17 PyObject_CallFinalizerFromDealloc (self=self@entry=0x7f22667b7ac0) at Objects/object.c:213
#18 0x0000000000486afa in subtype_dealloc (self=0x7f22667b7ac0) at Objects/typeobject.c:1265
#19 0x000000000045ded5 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#20 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#21 dict_dealloc (mp=0x7f226676fd80) at Objects/dictobject.c:2018
#22 0x0000000000486df2 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#23 subtype_dealloc (self=0x7f2266796f70) at Objects/typeobject.c:1321
#24 0x0000000000463a75 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#25 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#26 insertdict (value=0x94e5e0 <_Py_NoneStruct>, hash=-5400324744021460460, key=0x7f226679a870, mp=0x7f2266794400) at Objects/dictobject.c:1123
#27 PyDict_SetItem (op=op@entry=0x7f2266794400, key=0x7f226679a870, value=0x94e5e0 <_Py_NoneStruct>) at Objects/dictobject.c:1573
#28 0x0000000000470ca6 in _PyModule_ClearDict (d=0x7f2266794400) at Objects/moduleobject.c:612
#29 0x0000000000470e1e in _PyModule_Clear (m=m@entry=0x7f2266798ea0) at Objects/moduleobject.c:560
#30 0x000000000050c8bf in _PyImport_Cleanup (tstate=tstate@entry=0x23b5a80) at Python/import.c:606
#31 0x000000000052a8ca in Py_FinalizeEx () at Python/pylifecycle.c:1426
#32 0x000000000052d3e5 in Py_FinalizeEx () at Python/pylifecycle.c:1510
#33 0x000000000042a230 in Py_RunMain () at Modules/main.c:679
#34 0x000000000042abef in pymain_main (args=0x7fff4339fa90) at Modules/main.c:707
#35 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:731
#36 0x00007f226580b505 in __libc_start_main () from /lib64/libc.so.6
#37 0x00000000004299b5 in _start ()

referance https://bugs.python.org/issue39877

#84058 #84058
@vstinner

@ronaldoussoren ronaldoussoren added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label Aug 30, 2022
@gpshead
Copy link
Member

gpshead commented Sep 13, 2022

This looks like a duplicate of #91414.

@gpshead
Copy link
Member

gpshead commented Sep 13, 2022

Semi related to #87135 given it concerns daemon threads and finalizing.

@113xiaoji
Copy link

Semi related to #87135 given it concerns daemon threads and finalizing.

Excuse me, is this a question similar to # 87135? Is there a solution plan for this problem? If there is no solution plan in the short term, how can users avoid this problem. By directly calling os Exit () or change the thread into a coroutine, instead of using multithreading capability or other methods. Please give us a way that you think is best to avoid this problem

@vstinner
Copy link
Member

See also issue #95820.

@vstinner
Copy link
Member

You should avoid daemon threads. There are other ways to implement a similar feature in a safer approach:

  • Use regular threads and make sure that there is a way to stop them in a timely manner at Python exit
  • Use a process and kill the process at exit

Daemon threads cause various very complicated issues at Python exit.

mod_wsgi is trying a different approach: don't attempt to shutdown Python cleanly, just exit the process immediately: GrahamDumpleton/mod_wsgi#730

@vstinner
Copy link
Member

(gdb) bt
#0  0x00007f22662ce995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004e9246 in drop_gil (ceval=0x993c38 <_PyRuntime+344>, ceval2=<optimized out>, tstate=<optimized out>) at Python/ceval_gil.h:178
...
#13 0x00000000004e7628 in _PyErr_WarnFormatV (vargs=0x7fff4339f5b8, format=0x7f225ec6e1ae "unclosed %R", stack_level=1, category=0x942be0 <_PyExc_ResourceWarning>, source=0x7f22667b7ac0)
    at Python/_warnings.c:1124
...
#30 0x000000000050c8bf in _PyImport_Cleanup (tstate=tstate@entry=0x23b5a80) at Python/import.c:606
#31 0x000000000052a8ca in Py_FinalizeEx () at Python/pylifecycle.c:1426
...
#33 0x000000000042a230 in Py_RunMain () at Modules/main.c:679
...

In Python 3.12, first we call _PyRuntimeState_SetFinalizing():

    /* Remaining daemon threads will automatically exit
       when they attempt to take the GIL (ex: PyEval_RestoreThread()). */
    _PyRuntimeState_SetFinalizing(runtime, tstate);
    runtime->initialized = 0;
    runtime->core_initialized = 0;

Then we delete modules:

    /* Destroy all modules */
    finalize_modules(tstate);

my Python version is 3.9.2

Python finalization evolves a lot in last years to attempt to make it more reliable. Please test Python 3.12, or at least Python 3.11 or 3.10.

My notes about Python finalization: https://pythondev.readthedocs.io/finalization.html

@vstinner
Copy link
Member

I fail to reproduce your issue with Python 3.9, 3.11 or 3.12. Can you please explain how to reproduce it? What is your OS? Is the hang more likely if the system is busy?

@113xiaoji
Copy link

I fail to reproduce your issue with Python 3.9, 3.11 or 3.12. Can you please explain how to reproduce it? What is your OS? Is the hang more likely if the system is busy?

wait a miniute please.I'll sort out the reproduced environmental information

@113xiaoji
Copy link

113xiaoji commented Sep 16, 2022

I fail to reproduce your issue with Python 3.9, 3.11 or 3.12. Can you please explain how to reproduce it? What is your OS? Is the hang more likely if the system is busy?

Your environment

  • CPython versions tested on: 3.9.2 or 3.9.6
  • Operating system and architecture: linux / x86_64

Reproducible example

There are three files, test.sh,service.py,runtime.py

test.sh call the service.py, service.py import runtime

And the scripts used are:

test.sh

for i in `seq 1 100000`;
do
    ../target2/bin/python3.9 service.py
    echo call cpython$i
done

runtime.py

import socket

class TestUnclosedSocket:
    def __init__(self):
        self.s = socket.socket()
ins = TestUnclosedSocket()

service.py

import threading
import runtime
import sys

sys.setswitchinterval(0.001)

def calc():
    sum = 0
    for i in range(100000):
        sum += i * i

def daemon_func():
    while True:
        calc()

if __name__ == '__main__':
    threading.Thread(target=daemon_func, name="daemon", daemon=True).start()
    calc()

change GIL switch interval to 1ms to increase the probability here.

The phenomenon is that the script printing is stuck,my attempts is stopped at 20th.

call cpython1
call cpython2
call cpython3
call cpython4
call cpython5
call cpython6
call cpython7
call cpython8
call cpython9
call cpython10
call cpython11
call cpython12
call cpython13
call cpython14
call cpython15
call cpython16
call cpython17
call cpython18
call cpython19
call cpython20

The trace is :

(gdb) bt
#0  0x00007f355ccda995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00000000004e9246 in drop_gil (ceval=0x993c38 <_PyRuntime+344>, ceval2=<optimized out>, tstate=<optimized out>) at Python/ceval_gil.h:178
#2  0x00000000004206d6 in eval_frame_handle_pending (tstate=0x2720a80) at Python/ceval.c:882
#3  _PyEval_EvalFrameDefault (tstate=<optimized out>, f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:1402
#4  0x000000000041eb18 in _PyEval_EvalFrame (throwflag=0, f=0x7f355d2e5610, tstate=0x2720a80) at ./Include/internal/pycore_ceval.h:40
#5  function_code_fastcall (tstate=0x2720a80, co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Objects/call.c:330
#6  0x000000000048ecf1 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>)
    at ./Include/cpython/abstract.h:118
#7  _PyObject_CallNoArg (func=<optimized out>) at ./Include/cpython/abstract.h:172
#8  call_unbound_noarg (self=<optimized out>, func=0x7f3555213af0, unbound=<optimized out>) at Objects/typeobject.c:1515
#9  slot_tp_repr (self=<optimized out>) at Objects/typeobject.c:6630
#10 0x000000000047151d in PyObject_Repr (v=0x7f355d1c3ac0) at Objects/object.c:420
#11 0x00000000004cea36 in unicode_fromformat_arg (vargs=0x7ffe40c77b30, f=0x7f355567a1b8 "R", writer=0x7ffe40c77b50) at Objects/unicodeobject.c:2989
#12 PyUnicode_FromFormatV (format=format@entry=0x7f355567a1ae "unclosed %R", vargs=vargs@entry=0x7ffe40c77bd8) at Objects/unicodeobject.c:3053
#13 0x00000000004e7628 in _PyErr_WarnFormatV (vargs=0x7ffe40c77bd8, format=0x7f355567a1ae "unclosed %R", stack_level=1, category=0x942be0 <_PyExc_ResourceWarning>, source=0x7f355d1c3ac0)
    at Python/_warnings.c:1124
#14 PyErr_ResourceWarning (source=source@entry=0x7f355d1c3ac0, stack_level=stack_level@entry=1, format=format@entry=0x7f355567a1ae "unclosed %R") at Python/_warnings.c:1179
#15 0x00007f355566eaea in sock_finalize (s=0x7f355d1c3ac0) at /opt/cph/Python-3.9.2/Modules/socketmodule.c:5044
#16 0x0000000000475b7a in PyObject_CallFinalizer (self=0x7f355d1c3ac0) at Objects/object.c:195
#17 PyObject_CallFinalizerFromDealloc (self=self@entry=0x7f355d1c3ac0) at Objects/object.c:213
#18 0x0000000000486afa in subtype_dealloc (self=0x7f355d1c3ac0) at Objects/typeobject.c:1265
#19 0x000000000045ded5 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#20 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#21 dict_dealloc (mp=0x7f355d1a1980) at Objects/dictobject.c:2018
#22 0x0000000000486df2 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#23 subtype_dealloc (self=0x7f355d1a2f70) at Objects/typeobject.c:1321
#24 0x0000000000463a75 in _Py_DECREF (op=<optimized out>) at ./Include/object.h:430
#25 _Py_XDECREF (op=<optimized out>) at ./Include/object.h:497
#26 insertdict (value=0x94e5e0 <_Py_NoneStruct>, hash=5127288570824266835, key=0x7f355d1a6770, mp=0x7f355d1a1340) at Objects/dictobject.c:1123
#27 PyDict_SetItem (op=op@entry=0x7f355d1a1340, key=0x7f355d1a6770, value=0x94e5e0 <_Py_NoneStruct>) at Objects/dictobject.c:1573
#28 0x0000000000470ca6 in _PyModule_ClearDict (d=0x7f355d1a1340) at Objects/moduleobject.c:612
#29 0x0000000000470e1e in _PyModule_Clear (m=m@entry=0x7f355d1ab950) at Objects/moduleobject.c:560
#30 0x000000000050c8bf in _PyImport_Cleanup (tstate=tstate@entry=0x2720a80) at Python/import.c:606
#31 0x000000000052a8ca in Py_FinalizeEx () at Python/pylifecycle.c:1426
#32 0x000000000052d3e5 in Py_FinalizeEx () at Python/pylifecycle.c:1510
#33 0x000000000042a230 in Py_RunMain () at Modules/main.c:679
#34 0x000000000042abef in pymain_main (args=0x7ffe40c780b0) at Modules/main.c:707
#35 Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:731
#36 0x00007f355c217505 in __libc_start_main () from /lib64/libc.so.6
#37 0x00000000004299b5 in _start ()
(gdb)

It will reappear after hundreds of attempts. I reproduce 5 miniutes ago
@vstinner
The system is not busy, and it is not found that the system is busy, which will increase the probability of problem recurrence.

@vstinner
Copy link
Member

Thanks for test.sh, I managed to reproduce the hang by running this script 6x times in parallel (in 6 terminal tabs) and doing other things on my computer. With the bug, the hang occurs in less than a minute. Less than 250 iterations if I recall correctly, on a Python debug build (Python 3.12).

(...) After that, the flag gil_drop_request is set, but only the main thread is left. Then, if the main thread runs into drop_gil again, it will hang at FORCE_SWITCHING: (...)

Oh, you made an accurate and correct analysis, well done!

I wrote PR #96869 to fix the issue. Can you please check that my change fix your hang using test.sh? Tell me if you need help to test the fix. Also, please review my PR, I'm not sure if I wrote the correct name in the changelog ;-)

@milizao
Copy link
Author

milizao commented Sep 19, 2022

I think that's would fix my issue, and the name is corret, thank you :-)

vstinner added a commit that referenced this issue Sep 19, 2022
At Python exit, sometimes a thread holding the GIL can wait forever
for a thread (usually a daemon thread) which requested to drop the
GIL, whereas the thread already exited. To fix the race condition,
the thread which requested the GIL drop now resets its request before
exiting.

take_gil() now calls RESET_GIL_DROP_REQUEST() before
PyThread_exit_thread() if it called SET_GIL_DROP_REQUEST to fix a
race condition with drop_gil().

Issue discovered and analyzed by Mingliang ZHAO.
vstinner added a commit that referenced this issue Sep 20, 2022
At Python exit, sometimes a thread holding the GIL can wait forever
for a thread (usually a daemon thread) which requested to drop the
GIL, whereas the thread already exited. To fix the race condition,
the thread which requested the GIL drop now resets its request before
exiting.

take_gil() now calls RESET_GIL_DROP_REQUEST() before
PyThread_exit_thread() if it called SET_GIL_DROP_REQUEST to fix a
race condition with drop_gil().

Issue discovered and analyzed by Mingliang ZHAO.

(cherry picked from commit 04f4977)
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Sep 20, 2022
…96869) (pythonGH-96941)

At Python exit, sometimes a thread holding the GIL can wait forever
for a thread (usually a daemon thread) which requested to drop the
GIL, whereas the thread already exited. To fix the race condition,
the thread which requested the GIL drop now resets its request before
exiting.

take_gil() now calls RESET_GIL_DROP_REQUEST() before
PyThread_exit_thread() if it called SET_GIL_DROP_REQUEST to fix a
race condition with drop_gil().

Issue discovered and analyzed by Mingliang ZHAO.

(cherry picked from commit 04f4977)
(cherry picked from commit 6ff5471)

Co-authored-by: Victor Stinner <vstinner@python.org>
miss-islington added a commit that referenced this issue Sep 20, 2022
…6941)

At Python exit, sometimes a thread holding the GIL can wait forever
for a thread (usually a daemon thread) which requested to drop the
GIL, whereas the thread already exited. To fix the race condition,
the thread which requested the GIL drop now resets its request before
exiting.

take_gil() now calls RESET_GIL_DROP_REQUEST() before
PyThread_exit_thread() if it called SET_GIL_DROP_REQUEST to fix a
race condition with drop_gil().

Issue discovered and analyzed by Mingliang ZHAO.

(cherry picked from commit 04f4977)
(cherry picked from commit 6ff5471)

Co-authored-by: Victor Stinner <vstinner@python.org>
bdraco added a commit to bdraco/docker-base that referenced this issue Feb 12, 2023
pvizeli pushed a commit to home-assistant/docker-base that referenced this issue Feb 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

6 participants