Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault during garbage collection of the asyncio event loop #113566

Closed
mathieui opened this issue Dec 29, 2023 · 8 comments
Closed

Segmentation fault during garbage collection of the asyncio event loop #113566

mathieui opened this issue Dec 29, 2023 · 8 comments
Labels
topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@mathieui
Copy link
Contributor

mathieui commented Dec 29, 2023

Crash report

What happened?

Original details are on: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1040057 ; despite what is written, the issue also happens on python 3.12, and can be reliably reproduced when running the test from inside a PDB, suggesting some kind of timing issue.

I could reproduce independently on Archlinux an python 3.11 as well as on 3.12 in a debian docker image (https://ci.codeberg.org/repos/12939/pipeline/7/26 ).

--Return--
> /home/mathieui/dev/slixmpp/run_tests.py(73)<module>()->None
-> sys.exit(not result.wasSuccessful())
(Pdb) quit

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
PyObject_CallOneArg (func=0x0, arg=0x7ffff41be950) at Objects/call.c:376
376         return _PyObject_VectorcallTstate(tstate, func, args, nargsf, NULL);                                                                                                                                                               
(gdb) bt 

#0  PyObject_CallOneArg (func=0x0, arg=0x7ffff41beb90) at Objects/call.c:376
#1  0x00007ffff7a1b7b5 in PyObject_Repr (v=0x7ffff41beb90) at Objects/object.c:433
#2  0x00007ffff79df41e in unicode_fromformat_arg (vargs=0x7fffffffdb50, f=0x7ffff7b42af4 "R is not in deque", writer=0x7fffffffdaf0) at Objects/unicodeobject.c:3036
#3  PyUnicode_FromFormatV (format=<optimized out>, vargs=<optimized out>) at Objects/unicodeobject.c:3100
#4  0x00007ffff7a0f007 in _PyErr_FormatV (vargs=0x7fffffffdbd0, format=0x7ffff7b42af3 "%R is not in deque", exception=0x7ffff7c73a60 <_PyExc_ValueError>, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at Python/errors.c:1106
#5  PyErr_Format (exception=0x7ffff7c73a60 <_PyExc_ValueError>, format=0x7ffff7b42af3 "%R is not in deque") at Python/errors.c:1149
#6  0x00007ffff7958751 in deque_remove (deque=0x7ffff41ef010, value=0x7ffff41beb90) at ./Modules/_collectionsmodule.c:1256
#7  0x00007ffff7a14e94 in method_vectorcall_O (func=0x7ffff7081e40, args=0x7ffff404cfd8, nargsf=<optimized out>, kwnames=0x0) at Objects/descrobject.c:481
#8  0x00007ffff79f2987 in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=0x7ffff7081e40, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_call.h:92
#9  PyObject_Vectorcall (callable=0x7ffff7081e40, args=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:299
#10 0x00007ffff79e4c23 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4760
#11 0x00007ffff7aafcf2 in _PyEval_EvalFrame (throwflag=<optimized out>, frame=0x7ffff404cf70, tstate=0x7ffff7d89378 <_PyRuntime+166328>) at ./Include/internal/pycore_ceval.h:73
#12 gen_send_ex2 (gen=gen@entry=0x7ffff404cf20, arg=arg@entry=0x7ffff7d58ac0 <_Py_NoneStruct>, presult=presult@entry=0x7fffffffdf70, exc=exc@entry=1, closing=closing@entry=1) at Objects/genobject.c:219
#13 0x00007ffff7ab08e7 in gen_send_ex (gen=gen@entry=0x7ffff404cf20, arg=arg@entry=0x7ffff7d58ac0 <_Py_NoneStruct>, exc=exc@entry=1, closing=closing@entry=1) at Objects/genobject.c:287
#14 0x00007ffff7ab05c4 in gen_close (gen=gen@entry=0x7ffff404cf20, args=args@entry=0x0) at Objects/genobject.c:391
#15 0x00007ffff7b1c5fe in gen_close_iter (yf=yf@entry=0x7ffff404cf20) at Objects/genobject.c:327
#16 0x00007ffff7ab0620 in gen_close (gen=gen@entry=0x7ffff455be00, args=args@entry=0x0) at Objects/genobject.c:385
#17 0x00007ffff7a2e61e in _PyGen_Finalize (self=0x7ffff455be00) at Objects/genobject.c:96
#18 0x00007ffff79d01b8 in finalize_garbage (tstate=0x7ffff7d89378 <_PyRuntime+166328>, collectable=0x7fffffffe110) at Modules/gcmodule.c:978
#19 gc_collect_main (tstate=0x7ffff7d89378 <_PyRuntime+166328>, generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1) at Modules/gcmodule.c:1274
#20 0x00007ffff7acc77c in _PyGC_CollectNoFail (tstate=<optimized out>) at Modules/gcmodule.c:2110
#21 0x00007ffff7ab8f18 in Py_FinalizeEx () at Python/pylifecycle.c:1833
#22 0x00007ffff7ac5093 in Py_RunMain () at Modules/main.c:682
#23 0x00007ffff7a8e79b in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:734
#24 0x00007ffff7645cd0 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=5, argv=argv@entry=0x7fffffffe4a8) at ../sysdeps/nptl/libc_start_call_main.h:58
#25 0x00007ffff7645d8a in __libc_start_main_impl (main=0x555555555120 <main>, argc=5, argv=0x7fffffffe4a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe498) at ../csu/libc-start.c:360

py-bt does not catch anything, suggesting it happens during interpreter shutdown, which would make sense. Our workaround is registering a loop.close() in atexit(), which makes the issue disappear, but my feeling is that this should not happen.
I can provide the full backtrace but most values are optimized out, making it much less useful.

The setup requires going through quite a bit of code so we do not have a minimal test case (in terms of code), albeit running our test suite should not be too difficult.

CPython versions tested on:

3.10, 3.11, 3.12

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

No response

Linked PRs

@mathieui mathieui added the type-crash A hard crash of the interpreter, possibly with a core dump label Dec 29, 2023
@Eclips4
Copy link
Member

Eclips4 commented Dec 29, 2023

Hello!
Can you provide a MRE, please?

@gvanrossum
Copy link
Member

The repro you have so far is very complex and we may not be able to debug this meaningfully, even if we can reproduce it.

Let's first start with the backtrace you have. What Python version was that exactly? (And was it built with debug mode?) I can't even find a recent version of CPython where PyObject_Repr calls PyObject_CallOneArg (but I dodn't search very hard). The immediate problem seems to be that the latter is being called with a NULL pointer, but without being able to read the exact source code I don't know where that is coming from. The rest of the stack trace makes me think that we're in some finalization stage where some code, invoked by a generator (or async def) finalization, is trying to remove something from a deque that's not there. The problem occurs during printing the error message.

I doubt that this is timing related exactly, though it is quite possible that the root cause is that the thing that's being called has already been removed by the finalization code -- this would be a bug in finalization order.

@mathieui
Copy link
Contributor Author

The repro you have so far is very complex and we may not be able to debug this meaningfully, even if we can reproduce it.

Let's first start with the backtrace you have. What Python version was that exactly? (And was it built with debug mode?) I can't even find a recent version of CPython where PyObject_Repr calls PyObject_CallOneArg (but I dodn't search very hard). The immediate problem seems to be that the latter is being called with a NULL pointer, but without being able to read the exact source code I don't know where that is coming from. The rest of the stack trace makes me think that we're in some finalization stage where some code, invoked by a generator (or async def) finalization, is trying to remove something from a deque that's not there. The problem occurs during printing the error message.

I doubt that this is timing related exactly, though it is quite possible that the root cause is that the thing that's being called has already been removed by the finalization code -- this would be a bug in finalization order.

I have indeed a hard time producing a MRE, my tries getting a single file exhibiting the behavior without pulling the library were not met with success yet.
The backtrace is from Python 3.11.6 (main, Nov 14 2023, 09:36:21) [GCC 13.2.1 20230801] (current archlinux version).
According to https://gitlab.archlinux.org/archlinux/packaging/packages/python/-/blob/main/PKGBUILD?ref_type=heads it is built using:

./configure --prefix=/usr \
              --enable-shared \
              --with-computed-gotos \
              --enable-optimizations \
              --with-lto \
              --enable-ipv6 \
              --with-system-expat \
              --with-dbmliborder=gdbm:ndbm \
              --with-system-libmpdec \
              --enable-loadable-sqlite-extensions \
              --without-ensurepip \
              --with-tzpath=/usr/share/zoneinfo

@gvanrossum
Copy link
Member

All right, line 433 in object.c is indeed in PyObject_Repr and reads:

    res = (*Py_TYPE(v)->tp_repr)(v);

Presumably this calls some object's tp_repr function, which in turn calls PyObject_CallOneArg. Given that we're in asyncio land, could it be FutureObj_repr in _asynciomodule.c? That ends with

    return PyObject_CallOneArg(asyncio_future_repr_func, (PyObject *)fut);

which could conceivably have been tail-optimized so that the debugger sees its direct caller as PyObject_Repr, explaining the traceback.

And this leads to the hypothesis that asyncio_future_repr_func, which in 3.11 is a global variable, may have been cleared already by a premature call to module_clear in the same file. (The value of that variable is a Python function, but that's not very important at this point -- what matters is that it was cleared prematurely.)

I think to fix this, we may have to add a null pointer check for this variable to FutureObj_repr -- fixing the scenario that leads to premature calling of module_clear would be quite complicated.

I do ask you to ensure if you can repeat the same result in 3.12, where those global variables have been moved to state attributes, and the control flow is a bit different. And again, if you can repeat it on the main branch, where the situation might be slightly different again (not sure).

@kumaraditya303 Do you have any thoughts on the premature finalization of the _asyncio module?

@mathieui
Copy link
Contributor Author

@gvanrossum I could narrow down the issue on python 3.11 to doing a put_nowait in an asyncio.Queue that would trigger the issue, but still nothing minimal to run easily.

On the other hand, I also produced a (different) stack trace on Python 3.12.1 (main, Dec 19 2023, 20:14:15) [GCC 12.2.0]) with the following backtrace, and where toggling that put_nowait call does not seem to work, which is more with etree than asyncio:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007f36000e8f90 in xmlparser_gc_clear (self=self@entry=0x7f35ffb1b910) at ./Modules/_elementtree.c:3788
3788    ./Modules/_elementtree.c: No such file or directory.
(gdb) bt 
#0  0x00007f36000e8f90 in xmlparser_gc_clear (self=self@entry=0x7f35ffb1b910) at ./Modules/_elementtree.c:3788
#1  0x00007f36000eaba7 in xmlparser_dealloc (self=0x7f35ffb1b910) at ./Modules/_elementtree.c:3813
#2  0x00007f36010dfd95 in _Py_Dealloc (op=<optimized out>) at Objects/object.c:2625
#3  Py_DECREF (op=<optimized out>) at ./Include/object.h:706
#4  Py_XDECREF (op=<optimized out>) at ./Include/object.h:799
#5  _PyObject_FreeInstanceAttributes (self=<XMLPullParser(_events_queue=<collections.deque at remote 0x7f35ff941f30>, _parser=<xml.etree.ElementTree.XMLParser at remote 0x7f35ffb1b910>) at remote 0x7f35ff9a93d0>)
    at Objects/dictobject.c:5571
#6  subtype_dealloc (self=<XMLPullParser(_events_queue=<collections.deque at remote 0x7f35ff941f30>, _parser=<xml.etree.ElementTree.XMLParser at remote 0x7f35ffb1b910>) at remote 0x7f35ff9a93d0>) at Objects/typeobject.c:2017
#7  0x00007f36010a6308 in _Py_Dealloc (op=<optimized out>) at Objects/object.c:2625
#8  Py_DECREF (op=<optimized out>) at ./Include/object.h:706
#9  Py_XDECREF (op=<optimized out>) at ./Include/object.h:799
#10 free_keys_object (interp=0x7f3601434728 <_PyRuntime+76392>, keys=0x56250956f490) at Objects/dictobject.c:673
#11 0x00007f36010a6169 in dictkeys_decref (dk=0x56250956f490, interp=0x7f3601434728 <_PyRuntime+76392>) at Objects/dictobject.c:333
#12 dict_dealloc (mp=0x7f35ff960f80) at Objects/dictobject.c:2374
#13 0x00007f360113aab4 in subtype_clear (self=<ClientXMPP at remote 0x7f36008f2060>) at Objects/typeobject.c:1863
#14 0x00007f36010ba54d in delete_garbage (old=0x7f36014347e0 <_PyRuntime+76576>, collectable=0x7ffdac77b6b0, gcstate=0x7f3601434798 <_PyRuntime+76504>, tstate=0x7f3601492048 <_PyRuntime+459656>) at Modules/gcmodule.c:1029
#15 gc_collect_main (tstate=0x7f3601492048 <_PyRuntime+459656>, generation=generation@entry=2, n_collected=n_collected@entry=0x0, n_uncollectable=n_uncollectable@entry=0x0, nofail=nofail@entry=1) at Modules/gcmodule.c:1303
#16 0x00007f360113a51c in _PyGC_CollectNoFail (tstate=<optimized out>) at Modules/gcmodule.c:2135
#17 0x00007f3601123cb2 in Py_FinalizeEx () at Python/pylifecycle.c:1889
#18 0x00007f3601133ddd in Py_RunMain () at Modules/main.c:711
#19 0x00007f36010ee0d7 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:763
#20 0x00007f3600cb41ca in __libc_start_call_main (main=main@entry=0x562508664140 <main>, argc=argc@entry=2, argv=argv@entry=0x7ffdac77baa8) at ../sysdeps/nptl/libc_start_call_main.h:58
#21 0x00007f3600cb4285 in __libc_start_main_impl (main=0x562508664140 <main>, argc=2, argv=0x7ffdac77baa8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffdac77ba98) at ../csu/libc-start.c:360
#22 0x0000562508664071 in _start ()

@mathieui
Copy link
Contributor Author

mathieui commented Dec 29, 2023

Ok, here is a minimal code reproducing the first issue, but it does seem like python3.12 has it fixed:

import asyncio
loop = asyncio.get_event_loop()
q = asyncio.Queue()

async def consume():
    while True:
        await q.get()

async def coro():
    q.put_nowait(1)
    await asyncio.sleep(500)

async def main():
    asyncio.ensure_future(consume())
    asyncio.ensure_future(coro())

loop.run_until_complete(main())

@gvanrossum
Copy link
Member

Okay, your example fails for me too with 3.11 on my Mac. And with this diff it passes:

diff --git a/Modules/_asynciomodule.c b/Modules/_asynciomodule.c
index b2fef017050..a92feebcdbc 100644
--- a/Modules/_asynciomodule.c
+++ b/Modules/_asynciomodule.c
@@ -1377,6 +1377,9 @@ static PyObject *
 FutureObj_repr(FutureObj *fut)
 {
     ENSURE_FUTURE_ALIVE(fut)
+    if (asyncio_future_repr_func == NULL) {
+        return PyUnicode_FromFormat("<Future at %p>", fut);
+    }
     return PyObject_CallOneArg(asyncio_future_repr_func, (PyObject *)fut);
 }
 

I think this was fixed in 3.12 by the refactor to remove all the globals.

For your other segfault I have no idea -- I don't know that part of the code that well, and looking at that line, nothing comes to my mind. I recommend that you try to repro it on main, and then file a separate bug report, which someone with more expat knowledge will have to address. (If it doesn't repro on main, it's still worth filing, just state as much in your bug report.)

I'll cook up a PR for the asyncio crash.

@kumaraditya303
Copy link
Contributor

kumaraditya303 commented Dec 30, 2023

Do you have any thoughts on the premature finalization of the _asyncio module?

In 3.11 there only module_free which could have been called prematurely by gc at runtime finalization and it cleared the globals. The best I think is to check for null globals, I have approved the PR. This should not be a problem in 3.12+ as globals were isolated to module state and multiple phase initialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-asyncio type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Status: Done
Development

No branches or pull requests

4 participants