Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Built-in poller deadlock #4

Closed
ljanyst opened this issue Apr 12, 2013 · 10 comments
Closed

Built-in poller deadlock #4

ljanyst opened this issue Apr 12, 2013 · 10 comments
Assignees
Milestone

Comments

@ljanyst
Copy link
Contributor

ljanyst commented Apr 12, 2013

It looks like the built-in poller is deadlocking in:

XrdSys::IOEvents::Channel::Enable vs XrdSys::IOEvents::Poller::CbkTMO. In threads 26 and 24.

Thread 26 (Thread 0x40304940 (LWP 17344)):
#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003da6a08e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003da6a08d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000326f222637 in Lock (this=0x2aaaae41d180, cP=0x2aaaacf4f200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  Lock (this=0x2aaaae41d180, cP=0x2aaaacf4f200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:197
#5  XrdSys::IOEvents::Poller::CbkXeq (this=0x2aaaae41d180, cP=0x2aaaacf4f200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:626
#6  0x000000326f22286c in XrdSys::IOEvents::Poller::CbkTMO (this=0x2aaaae41d180) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:548
#7  0x000000326f2229bb in XrdSys::IOEvents::PollE::Begin (this=0x2aaaae41d180, syncsem=<value optimized out>, retcode=<value optimized out>, eTxt=<value optimized out>)
    at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEventsPollE.icc:202
#8  0x000000326f221aa4 in XrdSys::IOEvents::BootStrap::Start (parg=0x40e802b0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:110
#9  0x000000326f21fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#10 0x0000003da6a0673d in start_thread () from /lib64/libpthread.so.0
#11 0x0000003da62d44bd in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x42738940 (LWP 17345)):
#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003da6a08e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003da6a08d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00002aaaad668394 in Lock (this=0x2aaaacf94140, now=128) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  XrdCl::Stream::Tick (this=0x2aaaacf94140, now=128) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:310
#5  0x00002aaaad665f2f in XrdCl::Channel::Tick (this=<value optimized out>, now=1365673017) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClChannel.cc:303
#6  0x00002aaaad666acf in (anonymous namespace)::TickGeneratorTask::Run (this=<value optimized out>, now=128) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClChannel.cc:171
#7  0x00002aaaad671d50 in XrdCl::TaskManager::RunTasks (this=0x2aaaae41d0c0) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClTaskManager.cc:216
#8  0x00002aaaad671e59 in RunRunnerThread (arg=0x2aaaacf94170) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClTaskManager.cc:36
---Type <return> to continue, or q <return> to quit---
#9  0x0000003da6a0673d in start_thread () from /lib64/libpthread.so.0
#10 0x0000003da62d44bd in clone () from /lib64/libc.so.6

Thread 24 (Thread 0x41b3d940 (LWP 17368)):
#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003da6a08e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003da6a08d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000326f22211c in Lock (this=0x2aaaae41d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  XrdSysMutexHelper (this=0x2aaaae41d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:208
#5  XrdSys::IOEvents::Poller::TmoAdd (this=0x2aaaae41d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:965
#6  0x000000326f22300a in XrdSys::IOEvents::Channel::Enable (this=0x2aaaacf4f200, events=<value optimized out>, timeout=1, eText=0x41b3c768)
    at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:373
#7  0x00002aaaad663293 in XrdCl::PollerBuiltIn::EnableWriteNotification (this=<value optimized out>, socket=0x2aaaacf99190, notify=<value optimized out>, timeout=1)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPollerBuiltIn.cc:404
#8  0x00002aaaad669423 in EnableUplink (this=0x2aaaacf94140, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClAsyncSocketHandler.hh:96
#9  XrdCl::Stream::EnableLink (this=0x2aaaacf94140, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:186
#10 0x00002aaaad669729 in XrdCl::Stream::Send (this=0x2aaaacf94140, msg=0x2aaaaf3dee20, handler=0x2aaaaf13ff08, stateful=true, expires=1365673317)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:273
#11 0x00002aaaad665ede in XrdCl::Channel::Send (this=0x2aaaac1edec0, msg=0x2aaaaf3dee20, handler=0x2aaaaf13ff08, stateful=true, expires=1365673317)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClChannel.cc:266
#12 0x00002aaaad66539e in XrdCl::PostMaster::Send (this=<value optimized out>, url=<value optimized out>, msg=0x2aaaaf3dee20, handler=0x2aaaaf13ff08, stateful=true, expires=1365673317)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPostMaster.cc:169
#13 0x00002aaaad684f10 in XrdCl::MessageUtils::SendMessage (url=..., msg=0x2aaaaf3dee20, handler=<value optimized out>, sendParams=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClMessageUtils.cc:109
#14 0x00002aaaad68e836 in XrdCl::FileStateHandler::SendOrQueue (this=0x2aaaac1ed600, url=<value optimized out>, msg=0x2aaaaf3dee20, handler=0x2aaaaf02d9e0, sendParams=...)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:1129
#15 0x00002aaaad690716 in XrdCl::FileStateHandler::Stat (this=0x2aaaac1ed600, force=<value optimized out>, handler=0x41b3cf00, timeout=0)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:496
#16 0x00002aaaad687904 in XrdCl::File::Stat (this=<value optimized out>, force=false, handler=0xffffffffffffffff, timeout=53752) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:103
#17 0x00002aaaad68858e in XrdCl::File::Stat (this=0x2aaaac388810, force=true, response=@0x41b3d020, timeout=0) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:114
#18 0x00002aaaad2db2e9 in XrdMqClient::RecvMessage (this=0x2aaaad62ce60) at /afs/cern.ch/work/a/apeters/eos-master/eos/mq/XrdMqClient.cc:304
#19 0x00002aaaad083f58 in eos::mgm::Messaging::Listen (this=0x2aaaac1c9600) at /afs/cern.ch/work/a/apeters/eos-master/eos/mgm/Messaging.cc:185
#20 0x00002aaaad2e1669 in XrdMqMessaging::Start (pp=<value optimized out>) at /afs/cern.ch/work/a/apeters/eos-master/eos/mq/XrdMqMessaging.cc:35
#21 0x000000326f21fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#22 0x0000003da6a0673d in start_thread () from /lib64/libpthread.so.0
#23 0x0000003da62d44bd in clone () from /lib64/libc.so.6
@ghost ghost assigned abh3 Apr 12, 2013
@ljanyst
Copy link
Contributor Author

ljanyst commented Apr 12, 2013

More feedback from Andres:

The owner of the lock in question is here:

(gdb) print *this
$3 = {cs = {__data = {__lock = 2, __count = 1, __owner = 17344, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\002\000\000\000\001\000\000\000\300C\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}}

owner=17344

... unfortunately this thread is not alive anymore :-(

@ljanyst
Copy link
Contributor Author

ljanyst commented Apr 12, 2013

It looks like an internal mutex of the poller is not released cleanly in some circumstances.

@apeters1971
Copy link
Contributor

Here is another case where it is more clear:

(gdb) thread 251
[Switching to thread 251 (Thread 0x2ad501d36940 (LWP 11329))]#0  0x0000003d6640d654 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x0000003d6640d654 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d66408f65 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003d66408e63 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003dec42211c in Lock (this=0x2ad4f941d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  XrdSysMutexHelper (this=0x2ad4f941d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:208
#5  XrdSys::IOEvents::Poller::TmoAdd (this=0x2ad4f941d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:965
#6  0x0000003dec42300a in XrdSys::IOEvents::Channel::Enable (this=0x2ad4f6cfde20, events=<value optimized out>, timeout=1, eText=0x2ad501d35768)
    at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:373
#7  0x0000003ded035293 in XrdCl::PollerBuiltIn::EnableWriteNotification (this=<value optimized out>, socket=0x2ad4fb62b580, notify=<value optimized out>, timeout=1)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPollerBuiltIn.cc:404
#8  0x0000003ded03b423 in EnableUplink (this=0x2ad4fb60fc80, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClAsyncSocketHandler.hh:96
#9  XrdCl::Stream::EnableLink (this=0x2ad4fb60fc80, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:186
#10 0x0000003ded03b729 in XrdCl::Stream::Send (this=0x2ad4fb60fc80, msg=0x2ad4fbe3a700, handler=0x2ad4fbc20d88, stateful=true, expires=1365672412)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:273
#11 0x0000003ded037ede in XrdCl::Channel::Send (this=0x2ad4fb60fb40, msg=0x2ad4fbe3a700, handler=0x2ad4fbc20d88, stateful=true, expires=1365672412)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClChannel.cc:266
#12 0x0000003ded03739e in XrdCl::PostMaster::Send (this=<value optimized out>, url=<value optimized out>, msg=0x2ad4fbe3a700, handler=0x2ad4fbc20d88, stateful=true, expires=1365672412)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPostMaster.cc:169
#13 0x0000003ded056f10 in XrdCl::MessageUtils::SendMessage (url=..., msg=0x2ad4fbe3a700, handler=<value optimized out>, sendParams=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClMessageUtils.cc:109
#14 0x0000003ded060836 in XrdCl::FileStateHandler::SendOrQueue (this=0x2ad4fb60f280, url=<value optimized out>, msg=0x2ad4fbe3a700, handler=0x2ad4fbc2b1a0, sendParams=...)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:1129
#15 0x0000003ded062716 in XrdCl::FileStateHandler::Stat (this=0x2ad4fb60f280, force=<value optimized out>, handler=0x2ad501d35f00, timeout=0)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:496
#16 0x0000003ded059904 in XrdCl::File::Stat (this=<value optimized out>, force=false, handler=0xffffffffffffffff, timeout=53752) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:103
#17 0x0000003ded05a58e in XrdCl::File::Stat (this=0x2ad4f6c0f710, force=true, response=@0x2ad501d36020, timeout=0) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:114
#18 0x00002ad4f72aa299 in XrdMqClient::RecvMessage (this=0x2ad4f75fdca0) at /usr/src/debug/eos-0.3.0-1/mq/XrdMqClient.cc:304
#19 0x00002ad4f70874f8 in eos::mgm::Messaging::Listen (this=0x2ad4f61cdf80) at /usr/src/debug/eos-0.3.0-1/mgm/Messaging.cc:185
#20 0x00002ad4f72b0b29 in XrdMqMessaging::Start (pp=<value optimized out>) at /usr/src/debug/eos-0.3.0-1/mq/XrdMqMessaging.cc:35
#21 0x0000003dec41fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#22 0x0000003d6640683d in start_thread () from /lib64/libpthread.so.0
#23 0x0000003d65cd503d in clone () from /lib64/libc.so.6
(gdb) frame 3
(gdb) print *this
$7 = {cs = {__data = {__lock = 2, __count = 1, __owner = 11305, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = "\002\000\000\000\001\000\000\000),\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}}

Now looking at the owner of the lock (which in this case still exists)

(gdb) thread 253
[Switching to thread 253 (Thread 0x2ad4f9302940 (LWP 11305))]#0  0x0000003d6640d654 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x0000003d6640d654 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003d66408f65 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003d66408e63 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003dec422637 in Lock (this=0x2ad4f941d180, cP=0x2ad4f6cfde20, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  Lock (this=0x2ad4f941d180, cP=0x2ad4f6cfde20, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:197
#5  XrdSys::IOEvents::Poller::CbkXeq (this=0x2ad4f941d180, cP=0x2ad4f6cfde20, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:626
#6  0x0000003dec42286c in XrdSys::IOEvents::Poller::CbkTMO (this=0x2ad4f941d180) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:548
#7  0x0000003dec4229bb in XrdSys::IOEvents::PollE::Begin (this=0x2ad4f941d180, syncsem=<value optimized out>, retcode=<value optimized out>, eTxt=<value optimized out>)
    at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEventsPollE.icc:202
#8  0x0000003dec421aa4 in XrdSys::IOEvents::BootStrap::Start (parg=0x2ad4f90ff2e0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:110
#9  0x0000003dec41fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#10 0x0000003d6640683d in start_thread () from /lib64/libpthread.so.0
#11 0x0000003d65cd503d in clone () from /lib64/libc.so.6

Even the owner wants to lock the same thing again .... look's like a reentrant lock ...

@abh3
Copy link
Member

abh3 commented Apr 13, 2013

In Lukasz's example, the thread holding the lock is no longer there. So, it would seem that the thread was cancelled while it was holding that lock. I can't do anything about that. The thread should not have been cancelled. In Andreas' second example, had he gone to thread 253 and printed out the lock information for that thread he would have likely seen that the thread holding the lock which 253 is waiting for is gone as well. Both the timeout queue lock and the callback lock are recursive so no problem with multiple locking by the same thread.

So, who is cancelling these threads? Certainly not the poller.

@apeters1971
Copy link
Contributor

Have another case and it looks like a dead-lock triggered by lock inversion in the poller implementation.

This are two threads blocking each other:

 26 Thread 0x41cbc940 (LWP 2364)  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
 21 Thread 0x41dbd940 (LWP 2391)  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0

gdb) thread 26
[Switching to thread 26 (Thread 0x41cbc940 (LWP 2364))]#6  0x000000326f22286c in XrdSys::IOEvents::Poller::CbkTMO (this=0x2aaaaec1d180) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:548
548         CbkXeq(cP, cP->dlType, 0, 0);
(gdb) where
#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003da6a08e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003da6a08d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000326f222637 in Lock (this=0x2aaaaec1d180, cP=0x2aaaacc91200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  Lock (this=0x2aaaaec1d180, cP=0x2aaaacc91200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:197
#5  XrdSys::IOEvents::Poller::CbkXeq (this=0x2aaaaec1d180, cP=0x2aaaacc91200, events=<value optimized out>, eNum=0, eTxt=0x0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:626
#6  0x000000326f22286c in XrdSys::IOEvents::Poller::CbkTMO (this=0x2aaaaec1d180) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:548
#7  0x000000326f2229bb in XrdSys::IOEvents::PollE::Begin (this=0x2aaaaec1d180, syncsem=<value optimized out>, retcode=<value optimized out>, eTxt=<value optimized out>)
    at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEventsPollE.icc:202
#8  0x000000326f221aa4 in XrdSys::IOEvents::BootStrap::Start (parg=0x415652b0) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:110
#9  0x000000326f21fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#10 0x0000003da6a0673d in start_thread () from /lib64/libpthread.so.0
#11 0x0000003da62d44bd in clone () from /lib64/libc.so.6


(gdb) thread 21
[Switching to thread 21 (Thread 0x41dbd940 (LWP 2391))]#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
(gdb) where
#0  0x0000003da6a0d524 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x0000003da6a08e35 in _L_lock_1127 () from /lib64/libpthread.so.0
#2  0x0000003da6a08d33 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x000000326f22211c in Lock (this=0x2aaaaec1d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#4  XrdSysMutexHelper (this=0x2aaaaec1d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:208
#5  XrdSys::IOEvents::Poller::TmoAdd (this=0x2aaaaec1d180, cP=0x80) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:965
#6  0x000000326f22300a in XrdSys::IOEvents::Channel::Enable (this=0x2aaaacc91200, events=<value optimized out>, timeout=1, eText=0x41dbb458) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysIOEvents.cc:373
#7  0x00002aaaad663293 in XrdCl::PollerBuiltIn::EnableWriteNotification (this=<value optimized out>, socket=0x2aaaacf633d0, notify=<value optimized out>, timeout=1)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPollerBuiltIn.cc:404
#8  0x00002aaaad669423 in EnableUplink (this=0x2aaaacf93280, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClAsyncSocketHandler.hh:96
#9  XrdCl::Stream::EnableLink (this=0x2aaaacf93280, path=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:186
#10 0x00002aaaad669729 in XrdCl::Stream::Send (this=0x2aaaacf93280, msg=0x2aaab001d100, handler=0x2aaab0026888, stateful=true, expires=1366069376)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.cc:273
#11 0x00002aaaad665ede in XrdCl::Channel::Send (this=0x2aaaacf93140, msg=0x2aaab001d100, handler=0x2aaab0026888, stateful=true, expires=1366069376)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClChannel.cc:266
#12 0x00002aaaad66539e in XrdCl::PostMaster::Send (this=<value optimized out>, url=<value optimized out>, msg=0x2aaab001d100, handler=0x2aaab0026888, stateful=true, expires=1366069376)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPostMaster.cc:169
#13 0x00002aaaad684f10 in XrdCl::MessageUtils::SendMessage (url=..., msg=0x2aaab001d100, handler=<value optimized out>, sendParams=...) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClMessageUtils.cc:109
#14 0x00002aaaad68e836 in XrdCl::FileStateHandler::SendOrQueue (this=0x2aaaae517ac0, url=<value optimized out>, msg=0x2aaab001d100, handler=0x2aaab002c5c0, sendParams=...)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:1129
#15 0x00002aaaad690716 in XrdCl::FileStateHandler::Stat (this=0x2aaaae517ac0, force=<value optimized out>, handler=0x41dbbbf0, timeout=0)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:496
#16 0x00002aaaad687904 in XrdCl::File::Stat (this=<value optimized out>, force=false, handler=0xffffffffffffffff, timeout=53752) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:103
#17 0x00002aaaad68858e in XrdCl::File::Stat (this=0x2aaaaf0256a0, force=true, response=@0x41dbbd10, timeout=0) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFile.cc:114
#18 0x00002aaaad2db749 in XrdMqClient::RecvMessage (this=0x2aaaad627130) at /afs/cern.ch/work/a/apeters/eos-master/eos/mq/XrdMqClient.cc:304
#19 0x00002aaaad1a27bd in eos::mgm::Iostat::Receive (this=0x2aaaad626a48) at /afs/cern.ch/work/a/apeters/eos-master/eos/mgm/Iostat.cc:258
#20 0x000000326f21fc2f in XrdSysThread_Xeq (myargs=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#21 0x0000003da6a0673d in start_thread () from /lib64/libpthread.so.0
#22 0x0000003da62d44bd in clone () from /lib64/libc.so.6

Thread 26 is stuck here:

cbkMHelp.Lock(&(cP->chMutex));

Thread 21 is stuck here:

XrdSysMutexHelper mHelper(toMutex);

In Thread 26 the owner of chMutex is ID=2391 => Thread21

(gdb) print cP->chMutex
$28 = {<XrdSysMutex> = {cs = {__data = {__lock = 2, __count = 1, __owner = 2391, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\001\000\000\000W\t\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}}, <No data fields>}

In Thread 21 the owner of toMutex is ID=2364 => Thread 26
(gdb) print this->toMutex
$29 = {<XrdSysMutex> = {cs = {__data = {__lock = 2, __count = 1, __owner = 2364, __nusers = 1, __kind = 1, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\002\000\000\000\001\000\000\000<\t\000\000\001\000\000\000\001", '\000' <repeats 22 times>, __align = 4294967298}}, <No data fields>}

===> deadlock between both threads.

@abh3
Copy link
Member

abh3 commented Apr 17, 2013

OK, I see the issue now. It's how timeouts are handled. Seems that just as there was a timeout, the Enable() method was called from another thread for the same object that is in the process of calling the timeout callback. It seems rather suspicious that someone should do that because it means that the channel is being re-enabled while it's was enabled for timeouts. That does mean that it quite likely that the timeout callback will be called in any case. Was this really intended? I pushed a patch that should hopefully fix the deadlock.

@abh3 abh3 closed this as completed Apr 17, 2013
@ljanyst
Copy link
Contributor Author

ljanyst commented Apr 17, 2013

Thanks for the patch! We'll test it.

The timeout is related to reading subscription, which never gets disabled. Here, the writing subscription was being enabled. Yes, this is intended.

@abh3
Copy link
Member

abh3 commented Apr 17, 2013

OK, since I have to review the timeout section to see if I can get it to be more efficient I may have to split the read and write queues. At the moment there is a minimization algorithm that tries to use only one of the two timeout values to keep the timeout queues short.

@ljanyst
Copy link
Contributor Author

ljanyst commented Apr 17, 2013

I think that it's good enough for me for the moment.

@abh3
Copy link
Member

abh3 commented Apr 17, 2013

Yes, but I still have to go through the cpugrind output to see where I am inefficient, sigh.

amadio pushed a commit to amadio/xrootd that referenced this issue May 12, 2023
Added additional statistics to ceph access.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants