Built-in poller deadlock #4
It looks like the built-in poller is deadlocking in XrdSys::IOEvents::Channel::Enable vs XrdSys::IOEvents::Poller::CbkTMO, in threads 26 and 24.

Comments
More feedback from Andreas: The owner of the lock in question is here:
owner=17344 ... unfortunately this thread is not alive anymore :-(
It looks like an internal mutex of the poller is not released cleanly in some circumstances.
Here is another case where it is clearer:
Now looking at the owner of the lock (which in this case still exists):
Even the owner wants to lock the same thing again .... looks like a reentrant lock ...
In Lukasz's example, the thread holding the lock is no longer there. So, it would seem that the thread was cancelled while it was holding that lock. I can't do anything about that; the thread should not have been cancelled. In Andreas' second example, had he gone to thread 253 and printed the lock information for that thread, he would likely have seen that the thread holding the lock which 253 is waiting for is gone as well. Both the timeout queue lock and the callback lock are recursive, so there is no problem with multiple locking by the same thread. So, who is cancelling these threads? Certainly not the poller.
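As a side note, here is a minimal standalone sketch (plain pthreads, not the actual XrdSys wrappers) of both points above: a recursive mutex happily accepts repeated locking by its owner, but if the owning thread is cancelled while holding it, the mutex is never released and every later lock attempt blocks forever:

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical demo, not XRootD code.
static pthread_mutex_t recMtx;

void *holder(void *)
{
    pthread_mutex_lock(&recMtx);
    pthread_mutex_lock(&recMtx);   // fine: recursive lock, same owner thread
    sleep(5);                      // cancellation point while holding the lock
    pthread_mutex_unlock(&recMtx); // never reached if the thread is cancelled
    pthread_mutex_unlock(&recMtx);
    return nullptr;
}

int main()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&recMtx, &attr);

    pthread_t t;
    pthread_create(&t, nullptr, holder, nullptr);
    sleep(1);                      // let holder acquire the lock
    pthread_cancel(t);             // kill the owner while it holds recMtx
    pthread_join(t, nullptr);

    printf("trying to lock the orphaned mutex...\n");
    pthread_mutex_lock(&recMtx);   // blocks forever: the owner is gone
    printf("never printed\n");
    return 0;
}
```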
I have another case, and it looks like a deadlock triggered by lock inversion in the poller implementation. These are the two threads blocking each other:
Thread 26 is stuck here:
Thread 21 is stuck here:
In Thread 26 the owner of chMutex is ID=2391 => Thread 21
===> deadlock between both threads.
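For illustration only, a minimal ABBA sketch of the reported inversion, with hypothetical names standing in for the poller's channel mutex and timeout lock (the real XrdSys::IOEvents code is more involved):

```cpp
#include <mutex>
#include <thread>
#include <chrono>

// Hypothetical stand-ins for the poller's channel and timeout-queue locks.
std::mutex chMutex;   // per-channel lock
std::mutex tmoMutex;  // timeout-queue lock

void enablePath()     // e.g. an Enable()-like path: channel lock first
{
    std::lock_guard<std::mutex> ch(chMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> tmo(tmoMutex);   // waits on callbackPath
}

void callbackPath()   // e.g. a CbkTMO()-like path: timeout lock first
{
    std::lock_guard<std::mutex> tmo(tmoMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> ch(chMutex);     // waits on enablePath
}

int main()
{
    std::thread a(enablePath), b(callbackPath);
    a.join(); b.join();   // never returns: classic ABBA deadlock
}
```

The usual cures are a single global acquisition order for the two locks, or releasing one lock before taking the other.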
OK, I see the issue now. It's how timeouts are handled. It seems that just as there was a timeout, the Enable() method was called from another thread for the same object that is in the process of calling the timeout callback. It seems rather suspicious that someone should do that, because it means the channel is being re-enabled while it was already enabled for timeouts. That means it is quite likely that the timeout callback will be called in any case. Was this really intended? I pushed a patch that should hopefully fix the deadlock.
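A generic pattern for this class of problem, sketched here under assumed names and not necessarily what the pushed patch does, is to release the internal lock before calling out into the timeout callback, so a concurrent Enable() on the same channel cannot wedge:

```cpp
#include <mutex>
#include <functional>
#include <utility>

// Hypothetical fragment; identifiers are illustrative, not XRootD's.
class Channel {
    std::mutex            chMutex;
    std::function<void()> tmoCallback;

public:
    void setCallback(std::function<void()> cb)
    {
        std::lock_guard<std::mutex> lk(chMutex);
        tmoCallback = std::move(cb);
    }

    void dispatchTimeout()
    {
        std::unique_lock<std::mutex> lk(chMutex);
        auto cb = tmoCallback;   // copy the callback while locked
        lk.unlock();             // drop the lock before calling out
        if (cb) cb();            // Enable() elsewhere can now take chMutex
    }
};
```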
Thanks for the patch! We'll test it. The timeout is related to the reading subscription, which never gets disabled; here, the writing subscription was being enabled. Yes, this is intended.
OK, since I have to review the timeout section to see if I can get it to be more efficient, I may have to split the read and write queues. At the moment there is a minimization algorithm that tries to use only one of the two timeout values to keep the timeout queues short.
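As a rough illustration of that minimization, under assumed semantics rather than the actual algorithm: a channel with both read and write timeouts gets a single queue entry keyed on the nearer deadline, which keeps the timeout queue short but couples the two paths, hence the idea of splitting the queues:

```cpp
#include <algorithm>
#include <ctime>

// Hypothetical sketch of one timeout-queue entry per channel.
struct ChannelTO {
    time_t rdDeadline;   // absolute read-timeout deadline (0 = none)
    time_t wrDeadline;   // absolute write-timeout deadline (0 = none)

    // The single queue entry uses the earlier of the two deadlines,
    // so only one of the two timeout values drives the queue.
    time_t queueDeadline() const
    {
        if (!rdDeadline) return wrDeadline;
        if (!wrDeadline) return rdDeadline;
        return std::min(rdDeadline, wrDeadline);
    }
};
```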
I think that it's good enough for me for the moment. |
Yes, but I still have to go through the cpugrind output to see where I am inefficient, sigh. |