[incomplete] segfault in python plugin #215

Closed
prymitive opened this Issue Apr 5, 2013 · 21 comments

Projects

None yet

3 participants

Contributor
Apr  5 11:28:51 localhost onion: [busyness] 15s average busyness is at 0%, cheap one of 17 running workers
Apr  5 11:28:52 localhost onion: *** glibc detected *** [onion] uWSGI worker 5: free(): invalid next size (fast): 0x00000000021daca0 ***
Apr  5 11:28:52 localhost onion: ======= Backtrace: =========
Apr  5 11:28:52 localhost onion: /lib/libc.so.6(+0x78bb6)[0x7fc4c27a8bb6]
Apr  5 11:28:52 localhost onion: /lib/libc.so.6(cfree+0x73)[0x7fc4c27af483]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x83955)[0x7fc4c0035955]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x8f717)[0x7fc4c0041717]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(PyDict_SetItem+0xa7)[0x7fc4c0044047]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(_PyModule_Clear+0x16d)[0x7fc4c00460bd]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(PyImport_Cleanup+0x46a)[0x7fc4c00bbfda]
Apr  5 11:28:52 localhost onion: /usr/lib/libpython2.6.so.1.0(Py_Finalize+0x11b)[0x7fc4c00c8acb]
Apr  5 11:28:52 localhost onion: [onion] uWSGI worker 5(uwsgi_plugins_atexit+0x6d)[0x44d38d]
Apr  5 11:28:52 localhost onion: /lib/libc.so.6(exit+0xe2)[0x7fc4c2769312]
Apr  5 11:28:52 localhost onion: [onion] uWSGI worker 5(uwsgi_ignition+0xe2)[0x44d522]
Apr  5 11:28:52 localhost onion: [onion] uWSGI worker 5(uwsgi_start+0x126c)[0x44e86c]
Apr  5 11:28:52 localhost onion: /lib/libc.so.6(clone+0x6d)[0x7fc4c281921d]
Apr  5 12:09:45 localhost onion: *** glibc detected *** [onion] uWSGI worker 2: free(): invalid next size (fast): 0x00000000021daca0 ***
Apr  5 12:09:45 localhost onion: ======= Backtrace: =========
Apr  5 12:09:45 localhost onion: /lib/libc.so.6(+0x78bb6)[0x7fc4c27a8bb6]
Apr  5 12:09:45 localhost onion: /lib/libc.so.6(cfree+0x73)[0x7fc4c27af483]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x83955)[0x7fc4c0035955]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x8f717)[0x7fc4c0041717]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(PyDict_SetItem+0xa7)[0x7fc4c0044047]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(_PyModule_Clear+0x16d)[0x7fc4c00460bd]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(PyImport_Cleanup+0x46a)[0x7fc4c00bbfda]
Apr  5 12:09:45 localhost onion: /usr/lib/libpython2.6.so.1.0(Py_Finalize+0x11b)[0x7fc4c00c8acb]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(uwsgi_plugins_atexit+0x6d)[0x44d38d]
Apr  5 12:09:45 localhost onion: /lib/libc.so.6(exit+0xe2)[0x7fc4c2769312]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2[0x450091]
Apr  5 12:09:45 localhost onion: /lib/libpthread.so.0(+0xf8f0)[0x7fc4c4e168f0]
Apr  5 12:09:45 localhost onion: /lib/libc.so.6(epoll_wait+0x33)[0x7fc4c2819813]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(event_queue_wait+0x22)[0x4462e2]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(wsgi_req_accept+0x90)[0x41c0d0]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(simple_loop_run+0xb7)[0x44a517]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(uwsgi_ignition+0x197)[0x44d5d7]
Apr  5 12:09:45 localhost onion: [onion] uWSGI worker 2(uwsgi_start+0x126c)[0x44e86c]
Apr  5 12:09:45 localhost onion: /lib/libc.so.6(clone+0x6d)[0x7fc4c281921d]

I had few more of those in my logs. I've noticed those segfaults because my node had only 4 workers running during ab run, it should start up to 20, but cheaper_bussynes didn't started more since it was detecting them as busy (they had status "cheap" but >0 pid).

I'll do more testing to see if I can reproduce it. If not please keep this bug open for future reports, I had such issue before during 1.9 dev cycle, but I was unable to reproduce it.

Contributor

using --idle kills all workers but pid is not zeroed:

"pid":138,
"status":"cheap",

ps shows only master process running. Might be related or it's separate bug, but it is zeroed in the code, so there is something weird happening here.

Next I do few requests to wake up workers, only 1 is started since others have non-zero pids and cheaper_busyness thinks they are busy. So I try to reload my vassal by touching config file and my single running worker segfaults:

Apr  5 12:26:27 localhost onion: *** glibc detected *** [onion@4965] uWSGI worker 1: free(): invalid next size (fast): 0x000000000201bca0 ***
Apr  5 12:26:27 localhost onion: ======= Backtrace: =========
Apr  5 12:26:27 localhost onion: /lib/libc.so.6(+0x78bb6)[0x7f5ad87bbbb6]
Apr  5 12:26:27 localhost onion: /lib/libc.so.6(cfree+0x73)[0x7f5ad87c2483]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x83955)[0x7f5ad6048955]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(+0x8f717)[0x7f5ad6054717]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(PyDict_SetItem+0xa7)[0x7f5ad6057047]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(_PyModule_Clear+0x16d)[0x7f5ad60590bd]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(PyImport_Cleanup+0x46a)[0x7f5ad60cefda]
Apr  5 12:26:27 localhost onion: /usr/lib/libpython2.6.so.1.0(Py_Finalize+0x11b)[0x7f5ad60dbacb]
Apr  5 12:26:27 localhost onion: [onion@4965] uWSGI worker 1(uwsgi_plugins_atexit+0x6d)[0x44d38d]
Apr  5 12:26:27 localhost onion: /lib/libc.so.6(exit+0xe2)[0x7f5ad877c312]
Apr  5 12:26:27 localhost onion: [onion@4965] uWSGI worker 1[0x450091]
Apr  5 12:26:27 localhost onion: /lib/libpthread.so.0(+0xf8f0)[0x7f5adae298f0]
Apr  5 12:26:27 localhost onion: /lib/libc.so.6(epoll_wait+0x33)[0x7f5ad882c813]
Apr  5 12:26:27 localhost onion: [onion] uWSGI worker 1(event_queue_wait+0x22)[0x4462e2]
Apr  5 12:26:27 localhost onion: [onion] uWSGI worker 1(wsgi_req_accept+0x90)[0x41c0d0]
Apr  5 12:26:27 localhost onion: [onion] uWSGI worker 1(simple_loop_run+0xb7)[0x44a517]
Apr  5 12:26:27 localhost onion: [onion] uWSGI worker 1(uwsgi_ignition+0x197)[0x44d5d7]
Apr  5 12:26:27 localhost onion: [onion] uWSGI worker 1(uwsgi_start+0x126c)[0x44e86c]
Apr  5 12:26:27 localhost onion: /lib/libc.so.6(clone+0x6d)[0x7f5ad882c21d]
Apr  5 12:26:27 localhost onion: ======= Memory map: ========
Contributor

happens every time, I'll check if any option makes a difference

Contributor
onion: Fri Apr  5 12:50:00 2013 - worker 8 (pid: 191) is taking too much time to die...NO MERCY !!!

master tries to kill those workers, but they are not running, it's just the pid that is not zeroed for some reason

Collaborator
xrmx commented Apr 5, 2013

Since you can reproduce, can you run a debug build under valgrind? It would be way more easier to debug.

Contributor

Sure, I'll do some basic tests first to verify few things, after that I'll move to current master debug build.

Contributor

pids are not zeroed due to --smart-attach-daemon usage:

exec-as-root = chown -R www-data /etc/dropbear
namespace-keep-mount = /dev/pts
smart-attach-daemon = /tmp/dropbear.pid /usr/sbin/dropbear -b /etc/dropbear/banner -w -s -j -k -p 5001 -E -F -I 300 -P /tmp/dropbear.pid
Contributor

segfault is more tricky, I could easily reproduce it with reload, but not anymore, it's gone and I didn't changed anything.
Is it possible that this is python-ldap bug (it's C module for python) ?

Owner
unbit commented Apr 5, 2013

have you built uWSGi with ldap support ?

Contributor

yes, ldap is linked (but I'm not using ldap with uwsgi itself), my vassal uses ldap (it's fronted for managing ldap users)

Contributor

it seems that I need to leave it running for a while to trigger it, but I have issue with valgrind:

[uwsgi-logger] registered "python"
optind:3 argc:3
[uwsgi-static] added mapping for /static => /home/onion/static
python plugin already available
python plugin already available
==3622== Syscall param clone(child_tidptr) contains uninitialised byte(s)
==3622==    at 0x71F71E1: clone (clone.S:84)
==3622==    by 0x35A59C9F0386A6FF: ???
==3622==    by 0x13: ???
==3622==    by 0x697D7F: ???
==3622==  Uninitialised value was created by a stack allocation
==3622==    at 0x45559B: linux_namespace_start (linux_ns.c:55)
==3622== 
==1== 
==1== HEAP SUMMARY:
==1==     in use at exit: 91,838 bytes in 268 blocks
==1==   total heap usage: 508 allocs, 240 frees, 244,631 bytes allocated
=

and then uWSGI exits. Is it --namespace issue?

Owner
unbit commented Apr 5, 2013

can you try disabling ldap support in uWSGI ?

Contributor

still segfaults, and final procedure to trigger it is:

  • run ab for a while until ~half of all workers are spawned
  • stop ab and let --idle cheap all running workers
  • run ab for 5-10 minutes
  • stop ab and touch vassal config so that it will reload itself

Note: after --idle kicks in and you make new requests only first worker is reloaded but it also has non-zero pid.

It still does not happen everytime, but often enough.
I can't reproduce it with other apps, only one particular.

But it always happen when calling Py_Finalize() and google shows that such segfaults are common, maybe there is a rare condition uWSGI does not handle properly (for example additional PyErr_Clear(); is needed in some cases before Py_Finalize).

Owner
unbit commented Apr 5, 2013

can you confirm that returning soon before Py_Finalize solve the issue ?

Contributor

I'll do some testing with that after weekend, it's gonna take a while to make sure since I can't trigger it every time.

Contributor

Again I can't reproduce it, probably it depends on some rare condition I did not spotted.
All I can confirm is that removing Py_Finalize() doesn't break anything and seems to be needed only if app wants to shut down embedded python interpreter but continue running. If uWSGI doesn't do anything after that then it seems that we can safely remove Py_Finalize to protect ourselves from such rare and hard to debug issues.

Owner
unbit commented Apr 8, 2013

It is needed for user having atexit hooks directly in their code. i will try to investigate it a bit further (even if googling it seems useless as those segfaults are all over the place, even in blender). Eventually i will add a --py-no-finalize option

Contributor

I had this crash repeating in every worker after network problems:

*** glibc detected *** onion uWSGI worker 3: double free or corruption (!prev): 0x08552150 ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6(+0x6e341)[0xb701d341]
/lib/tls/i686/cmov/libc.so.6(+0x6fb98)[0xb701eb98]
/lib/tls/i686/cmov/libc.so.6(cfree+0x6d)[0xb7021c7d]
/usr/lib/libpython2.6.so.1.0(PyObject_Free+0x54)[0xb71fb014]
/usr/lib/libpython2.6.so.1.0(+0x7da87)[0xb7202a87]
/usr/lib/libpython2.6.so.1.0(+0x5c2dc)[0xb71e12dc]
/usr/lib/libpython2.6.so.1.0(+0x6d8e9)[0xb71f28e9]
/usr/lib/libpython2.6.so.1.0(PyDict_SetItem+0x94)[0xb71f5334]
/usr/lib/libpython2.6.so.1.0(_PyModule_Clear+0x16e)[0xb71f75de]
/usr/lib/libpython2.6.so.1.0(PyImport_Cleanup+0x51b)[0xb726a8eb]
/usr/lib/libpython2.6.so.1.0(Py_Finalize+0x128)[0xb72777d8]
onion uWSGI worker 3(uwsgi_plugins_atexit+0x58)[0x809d658]
/lib/tls/i686/cmov/libc.so.6(+0x2f1ef)[0xb6fde1ef]
/lib/tls/i686/cmov/libc.so.6(+0x2f25f)[0xb6fde25f]
onion uWSGI worker 3[0x80a05e2]
[0xb7738400]
onion uWSGI worker 3(wsgi_req_accept+0x8a)[0x8065a1a]
onion uWSGI worker 3(simple_loop_run+0xba)[0x809a20a]
onion uWSGI worker 3(simple_loop+0x12)[0x809a332]
onion uWSGI worker 3(uwsgi_ignition+0x145)[0x809d895]
onion uWSGI worker 3(uwsgi_worker_run+0x278)[0x809db78]
onion uWSGI worker 3(uwsgi_start+0x12b5)[0x809eed5]
onion uWSGI worker 3(main+0x1269)[0x80a2279]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6)[0xb6fc5bd6]
onion uWSGI worker 3[0x8060361]
*** glibc detected *** onion uWSGI worker 1: double free or corruption (!prev): 0x08552150 ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6(+0x6e341)[0xb701d341]
/lib/tls/i686/cmov/libc.so.6(+0x6fb98)[0xb701eb98]
/lib/tls/i686/cmov/libc.so.6(cfree+0x6d)[0xb7021c7d]
/usr/lib/libpython2.6.so.1.0(PyObject_Free+0x54)[0xb71fb014]
/usr/lib/libpython2.6.so.1.0(+0x7da87)[0xb7202a87]
/usr/lib/libpython2.6.so.1.0(+0x5c2dc)[0xb71e12dc]
/usr/lib/libpython2.6.so.1.0(+0x6d8e9)[0xb71f28e9]
/usr/lib/libpython2.6.so.1.0(PyDict_SetItem+0x94)[0xb71f5334]
/usr/lib/libpython2.6.so.1.0(_PyModule_Clear+0x16e)[0xb71f75de]
/usr/lib/libpython2.6.so.1.0(PyImport_Cleanup+0x51b)[0xb726a8eb]
/usr/lib/libpython2.6.so.1.0(Py_Finalize+0x128)[0xb72777d8]
onion uWSGI worker 1(uwsgi_plugins_atexit+0x58)[0x809d658]
/lib/tls/i686/cmov/libc.so.6(+0x2f1ef)[0xb6fde1ef]
/lib/tls/i686/cmov/libc.so.6(+0x2f25f)[0xb6fde25f]
onion uWSGI worker 1(end_me+0x28)[0x809d738]
onion uWSGI worker 1(uwsgi_ignition+0xcf)[0x809d81f]
onion uWSGI worker 1(uwsgi_worker_run+0x278)[0x809db78]
onion uWSGI worker 1(uwsgi_start+0x12b5)[0x809eed5]
onion uWSGI worker 1(main+0x1269)[0x80a2279]
/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xe6)[0xb6fc5bd6]
onion uWSGI worker 1[0x8060361]
Owner
unbit commented Jun 9, 2013

It looks Py_Finalize generates more problems than the one it solves (calling python atexit). --skip-atexit should solve your issue (unless you need atexit hooks defined in python apps).

Just in case, are you able to test it with python 2.7.x ?

Contributor

I can run it under 2.7 on test box, but since it happens rather rarely I don't think I will get any useful info that way.

Contributor

I don't have access to the app that had this problem so let's close it.

@prymitive prymitive closed this May 5, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment