
Self deadlock in XrdFileCache #239

Closed
illingwo opened this issue May 28, 2015 · 6 comments · Fixed by #240

@illingwo

I can consistently provoke a deadlock in the file cache (version 4.2.0):

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x0000003120209508 in _L_lock_854 () from /lib64/libpthread.so.0
#2  0x00000031202093d7 in __pthread_mutex_lock (mutex=0x7f86ec0057c8) at pthread_mutex_lock.c:61
#3  0x00007f86fbddd110 in Lock (this=0x7f86ec0056b0, __in_chrg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:57
#4  XrdFileCache::Prefetch::~Prefetch (this=0x7f86ec0056b0, __in_chrg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFileCache/XrdFileCachePrefetch.cc:133
#5  0x00007f86fbdde57f in XrdFileCache::IOEntireFile::Detach (this=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdFileCache/XrdFileCacheIOEntireFile.cc:77
#6  0x000000312260a4cc in XrdPosixFile::~XrdPosixFile (this=0x7f86ec0031e0, __in_chrg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdPosix/XrdPosixFile.cc:84
#7  0x000000312260a6a9 in XrdPosixFile::~XrdPosixFile (this=0x7f86ec0031e0, __in_chrg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdPosix/XrdPosixFile.cc:93
#8  0x000000312260dc4f in XrdPosixXrootd::Close (fildes=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdPosix/XrdPosixXrootd.cc:166
#9  0x00007f86fc200c41 in XrdPssFile::Close (this=0x7f86ec0031b0, retsz=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdPss/XrdPss.cc:734
#10 0x000000312126a86a in XrdOfsHandle::Retire (this=0x245df48, retc=@0x7f86fc960d98, retsz=0x0, buff=<value optimized out>, blen=0)
    at /usr/src/debug/xrootd/xrootd/src/XrdOfs/XrdOfsHandle.cc:400
#11 0x0000003121260107 in XrdOfsFile::close (this=0x7f86ec003810) at /usr/src/debug/xrootd/xrootd/src/XrdOfs/XrdOfs.cc:753
#12 0x0000003121259151 in XrdXrootdProtocol::do_Close (this=0x7f86ec002df0) at /usr/src/debug/xrootd/xrootd/src/XrdXrootd/XrdXrootdXeq.cc:474
#13 0x000000312165e119 in XrdLink::DoIt (this=0x7f86ec002b38) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdLink.cc:397
#14 0x00000031216615f5 in XrdScheduler::Run (this=0x610038) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:333
#15 0x00000031216617e9 in XrdStartWorking (carg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/Xrd/XrdScheduler.cc:85
#16 0x00000031216263af in XrdSysThread_Xeq (myargs=0x244b250) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.cc:86
#17 0x00000031202079d1 in start_thread (arg=0x7f86fc961700) at pthread_create.c:301
#18 0x000000311fee88fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

where the thread is stuck trying to acquire a lock it already holds.

I think the cause is Prefetch::InitiateClose in XrdFileCache/XrdFileCachePrefetch.cc (called earlier from XrdOfsFile::close() via IOEntireFile::ioActive()):

bool Prefetch::InitiateClose()
{
   // Retruns true if delay is needed

   clLog()->Debug(XrdCl::AppMsg, "Prefetch::Initiate close start", lPath());
   if (m_cfi.IsComplete()) return false;

   m_stateCond.Lock();
   if (m_started == false) return false;
   m_stopping = true;
   m_stateCond.UnLock();

   return true;
}

If m_started is false, the early return leaves m_stateCond locked, causing a self-deadlock when the destructor tries to lock it again.

@osschar (Contributor) commented May 28, 2015

Sure, we'll get this sorted out asap.

What do you do to get it to lock consistently?

@illingwo (Author)

All I'm doing is running xrdcp on a small file (8.2M) that's already in the cache. This is part of a trivial test setup, not a production system. The weird thing is that it worked fine earlier today; now it's consistently hanging even after several server restarts.

@alja (Contributor) commented May 28, 2015

@illingwo

I see it would be better to use XrdSysCondVarHelper for m_stateCond in XrdFileCachePrefetch::InitiateClose() instead of calling Lock/UnLock directly on m_stateCond. Still, I'm trying to understand how this deadlock happens. The m_started variable is set to true as soon as the Prefetch::Run() thread is started and is never set back to false except in the constructor.

Can you send me an xrootd log file if you have it?

I could not reproduce the lock. Did you run xrdcp simultaneously from different clients? Were you killing xrdcp commands? Is it by any chance possible disk usage exceeded the limit and the files were purged in the meantime? Is the lock global for all files or for a specific file?

Thanks,
Alja

@illingwo (Author)

I've figured out how I can reproduce it. If I restart the server and copy the file, it works. But if I don't remove the existing local file and try to copy over it, I get the hang (which is something I keep doing accidentally). After that, all transfers are stuck:

[illingwo@fermicloud126 ~]$ xrdcp -d2 root://localhost//public/blast/data/drosoph.aa.phr /tmp
[2015-05-28 13:52:44.785589 -0500][Debug  ][Utility           ] CopyProcess: 2 jobs to prepare
[2015-05-28 13:52:44.785728 -0500][Debug  ][Utility           ] Creating a classic copy job, from root://localhost:1094//public/blast/data/drosoph.aa.phr to file:///tmp//drosoph.aa.phr
[2015-05-28 13:52:44.785763 -0500][Debug  ][Utility           ] Monitor library name not set. No monitoring
[2015-05-28 13:52:44.785814 -0500][Debug  ][Utility           ] Opening root://localhost:1094//public/blast/data/drosoph.aa.phr for reading
[2015-05-28 13:52:44.785974 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Sending an open command
[2015-05-28 13:52:44.786023 -0500][Debug  ][Poller            ] Available pollers: built-in
[2015-05-28 13:52:44.786034 -0500][Debug  ][Poller            ] Attempting to create a poller according to preference: built-in,libevent
[2015-05-28 13:52:44.786044 -0500][Debug  ][Poller            ] Creating poller: built-in
[2015-05-28 13:52:44.786053 -0500][Debug  ][Poller            ] Creating and starting the built-in poller...
[2015-05-28 13:52:44.786133 -0500][Debug  ][TaskMgr           ] Starting the task manager...
[2015-05-28 13:52:44.786176 -0500][Debug  ][TaskMgr           ] Task manager started
[2015-05-28 13:52:44.786190 -0500][Debug  ][JobMgr            ] Starting the job manager...
[2015-05-28 13:52:44.786263 -0500][Debug  ][JobMgr            ] Job manager started, 3 workers
[2015-05-28 13:52:44.786280 -0500][Debug  ][TaskMgr           ] Registering task: "FileTimer task" to be run at: [2015-05-28 13:52:44 -0500]
[2015-05-28 13:52:44.786312 -0500][Debug  ][PostMaster        ] Creating new channel to: localhost:1094 1 stream(s)
[2015-05-28 13:52:44.786345 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Stream parameters: Network Stack: IPAuto, Connection Window: 120, ConnectionRetry: 5, Stream Error Widnow: 1800
[2015-05-28 13:52:44.787284 -0500][Debug  ][TaskMgr           ] Registering task: "TickGeneratorTask for: localhost:1094" to be run at: [2015-05-28 13:52:59 -0500]
[2015-05-28 13:52:44.787408 -0500][Debug  ][PostMaster        ] [localhost:1094] Found 1 address(es): [::ffff:127.0.0.1]:1094
[2015-05-28 13:52:44.787441 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Attempting connection to [::ffff:127.0.0.1]:1094
[2015-05-28 13:52:44.787513 -0500][Debug  ][Poller            ] Adding socket 0x1befdf0 to the poller
[2015-05-28 13:52:44.788023 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Async connection call returned
[2015-05-28 13:52:44.788057 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Sending out the initial hand shake + kXR_protocol
[2015-05-28 13:52:44.788265 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Got the server hand shake response (type: server [], protocol version 300)
[2015-05-28 13:52:44.788291 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] kXR_protocol successful (type: server [], protocol version 300)
[2015-05-28 13:52:44.791703 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Sending out kXR_login request, username: illingwo, cgi: ?xrd.cc=us&xrd.tz=-6&xrd.appname=xrdcp&xrd.info=&xrd.hostname=fermicloud126.fnal.gov, dual-stack: false, private IPv4: false, private IPv6: false
[2015-05-28 13:52:44.791963 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Logged in, session: 01000000b20a00000700000001000000
[2015-05-28 13:52:44.791989 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Stream 0 connected.
[2015-05-28 13:52:47.085138 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Open has returned with status [SUCCESS] 
[2015-05-28 13:52:47.085180 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] successfully opened at localhost:1094, handle: 0x0, session id: 1
[2015-05-28 13:52:47.085220 -0500][Debug  ][Utility           ] Opening /tmp//drosoph.aa.phr for writing
[2015-05-28 13:52:47.085299 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Sending a read command for handle 0x0 to localhost:1094
[1.699MB/1.699MB][100%][==================================================][580kB/s]  [2015-05-28 13:52:47.090811 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Sending a close command for handle 0x0 to localhost:1094
[2015-05-28 13:52:47.384986 -0500][Debug  ][File              ] [0x1bea760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Close returned from localhost:1094 with: [SUCCESS] 
[1.699MB/1.699MB][100%][==================================================][580kB/s]  
[2015-05-28 13:52:47.385145 -0500][Debug  ][JobMgr            ] Stopping the job manager...
[2015-05-28 13:52:47.385371 -0500][Debug  ][JobMgr            ] Job manager stopped
[2015-05-28 13:52:47.385388 -0500][Debug  ][TaskMgr           ] Stopping the task manager...
[2015-05-28 13:52:47.385464 -0500][Debug  ][TaskMgr           ] Task manager stopped
[2015-05-28 13:52:47.385480 -0500][Debug  ][Poller            ] Stopping the poller...
[2015-05-28 13:52:47.385556 -0500][Debug  ][TaskMgr           ] Requesting unregistration of: "TickGeneratorTask for: localhost:1094"
[2015-05-28 13:52:47.385573 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Closing the socket
[2015-05-28 13:52:47.385583 -0500][Debug  ][Poller            ] <[::ffff:127.0.0.1]:32953><--><[::ffff:127.0.0.1]:1094> Removing socket from the poller
[2015-05-28 13:52:47.385767 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Destroying stream
[2015-05-28 13:52:47.385784 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Closing the socket
[illingwo@fermicloud126 ~]$ xrdcp -d2 root://localhost//public/blast/data/drosoph.aa.phr /tmp
[2015-05-28 13:52:53.705630 -0500][Debug  ][Utility           ] CopyProcess: 2 jobs to prepare
[2015-05-28 13:52:53.705755 -0500][Debug  ][Utility           ] Creating a classic copy job, from root://localhost:1094//public/blast/data/drosoph.aa.phr to file:///tmp//drosoph.aa.phr
[2015-05-28 13:52:53.705787 -0500][Debug  ][Utility           ] Monitor library name not set. No monitoring
[2015-05-28 13:52:53.705951 -0500][Debug  ][Utility           ] Opening root://localhost:1094//public/blast/data/drosoph.aa.phr for reading
[2015-05-28 13:52:53.706000 -0500][Debug  ][File              ] [0x18d2760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Sending an open command
[2015-05-28 13:52:53.706045 -0500][Debug  ][Poller            ] Available pollers: built-in
[2015-05-28 13:52:53.706055 -0500][Debug  ][Poller            ] Attempting to create a poller according to preference: built-in,libevent
[2015-05-28 13:52:53.706070 -0500][Debug  ][Poller            ] Creating poller: built-in
[2015-05-28 13:52:53.706082 -0500][Debug  ][Poller            ] Creating and starting the built-in poller...
[2015-05-28 13:52:53.706168 -0500][Debug  ][TaskMgr           ] Starting the task manager...
[2015-05-28 13:52:53.706212 -0500][Debug  ][TaskMgr           ] Task manager started
[2015-05-28 13:52:53.706227 -0500][Debug  ][JobMgr            ] Starting the job manager...
[2015-05-28 13:52:53.706298 -0500][Debug  ][JobMgr            ] Job manager started, 3 workers
[2015-05-28 13:52:53.706314 -0500][Debug  ][TaskMgr           ] Registering task: "FileTimer task" to be run at: [2015-05-28 13:52:53 -0500]
[2015-05-28 13:52:53.706347 -0500][Debug  ][PostMaster        ] Creating new channel to: localhost:1094 1 stream(s)
[2015-05-28 13:52:53.706382 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Stream parameters: Network Stack: IPAuto, Connection Window: 120, ConnectionRetry: 5, Stream Error Widnow: 1800
[2015-05-28 13:52:53.707284 -0500][Debug  ][TaskMgr           ] Registering task: "TickGeneratorTask for: localhost:1094" to be run at: [2015-05-28 13:53:08 -0500]
[2015-05-28 13:52:53.707403 -0500][Debug  ][PostMaster        ] [localhost:1094] Found 1 address(es): [::ffff:127.0.0.1]:1094
[2015-05-28 13:52:53.707436 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Attempting connection to [::ffff:127.0.0.1]:1094
[2015-05-28 13:52:53.707502 -0500][Debug  ][Poller            ] Adding socket 0x18d7df0 to the poller
[2015-05-28 13:52:53.707770 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Async connection call returned
[2015-05-28 13:52:53.707802 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Sending out the initial hand shake + kXR_protocol
[2015-05-28 13:52:53.708131 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Got the server hand shake response (type: server [], protocol version 300)
[2015-05-28 13:52:53.708158 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] kXR_protocol successful (type: server [], protocol version 300)
[2015-05-28 13:52:53.710948 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Sending out kXR_login request, username: illingwo, cgi: ?xrd.cc=us&xrd.tz=-6&xrd.appname=xrdcp&xrd.info=&xrd.hostname=fermicloud126.fnal.gov, dual-stack: false, private IPv4: false, private IPv6: false
[2015-05-28 13:52:53.711086 -0500][Debug  ][XRootDTransport   ] [localhost:1094 #0.0] Logged in, session: 02000000b20a00001900000002000000
[2015-05-28 13:52:53.711110 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Stream 0 connected.
[2015-05-28 13:52:53.726693 -0500][Debug  ][File              ] [0x18d2760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Open has returned with status [SUCCESS] 
[2015-05-28 13:52:53.726712 -0500][Debug  ][File              ] [0x18d2760@root://localhost:1094//public/blast/data/drosoph.aa.phr] successfully opened at localhost:1094, handle: 0x0, session id: 1
[2015-05-28 13:52:53.726739 -0500][Debug  ][Utility           ] Opening /tmp//drosoph.aa.phr for writing
[2015-05-28 13:52:53.726764 -0500][Debug  ][Utility           ] Unable to open /tmp//drosoph.aa.phr: File exists
[2015-05-28 13:52:53.726779 -0500][Debug  ][File              ] [0x18d2760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Sending a close command for handle 0x0 to localhost:1094
[2015-05-28 13:53:53.787431 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Closing the socket
[2015-05-28 13:53:53.787489 -0500][Debug  ][Poller            ] <[::ffff:127.0.0.1]:32957><--><[::ffff:127.0.0.1]:1094> Removing socket from the poller
[2015-05-28 13:53:53.787578 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Recovering error for stream #0: [ERROR] Socket error.
[2015-05-28 13:53:53.787590 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Reporting disconnection to queued message handlers.
[2015-05-28 13:53:53.787603 -0500][Error  ][XRootD            ] [localhost:1094] Handling error while processing kXR_close (handle: 0x00000000): [ERROR] Socket error.
[2015-05-28 13:53:53.787612 -0500][Error  ][XRootD            ] [localhost:1094] Unable to get the response to request kXR_close (handle: 0x00000000)
[2015-05-28 13:53:53.787628 -0500][Debug  ][File              ] [0x18d2760@root://localhost:1094//public/blast/data/drosoph.aa.phr] Close returned from localhost:1094 with: [ERROR] Socket error
[0B/0B][100%][==================================================][0B/s]  
Run: [ERROR] OS Error: File exists
[2015-05-28 13:53:53.787891 -0500][Debug  ][JobMgr            ] Stopping the job manager...
[2015-05-28 13:53:53.788290 -0500][Debug  ][JobMgr            ] Job manager stopped
[2015-05-28 13:53:53.788309 -0500][Debug  ][TaskMgr           ] Stopping the task manager...
[2015-05-28 13:53:53.788391 -0500][Debug  ][TaskMgr           ] Task manager stopped
[2015-05-28 13:53:53.788407 -0500][Debug  ][Poller            ] Stopping the poller...
[2015-05-28 13:53:53.788482 -0500][Debug  ][TaskMgr           ] Requesting unregistration of: "TickGeneratorTask for: localhost:1094"
[2015-05-28 13:53:53.788498 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Closing the socket
[2015-05-28 13:53:53.788507 -0500][Debug  ][PostMaster        ] [localhost:1094 #0] Destroying stream
[2015-05-28 13:53:53.788515 -0500][Debug  ][AsyncSock         ] [localhost:1094 #0.0] Closing the socket

(Note the one-minute delay after "Sending a close command for handle 0x0 to localhost:1094" in the second copy.) The corresponding server log:

150528 13:52:44 2745 XrdInet: Accepted connection from 7@localhost.localdomain
150528 13:52:44 2744 XrdSched: Now have 3 workers
150528 13:52:44 2744 XrdSched: running main accept inq=0
150528 13:52:44 2745 XrdProtocol: matched protocol xrootd
150528 13:52:44 2745 ?:7@localhost.localdomain XrdPoll: FD 7 attached to poller 0; num=1
150528 13:52:44 2745 XrootdXeq: illingwo.2844:7@localhost.localdomain pvt IPv4 login
150528 13:52:44 2745 illingwo.2844:7@localhost.localdomain ofs_open: 0-600 fn=/public/blast/data/drosoph.aa.phr
150528 13:52:47 2745 illingwo.2844:7@localhost.localdomain ofs_fstat:  fn=/public/blast/data/drosoph.aa.phr
150528 13:52:47 2745 illingwo.2844:7@localhost.localdomain ofs_read: 1781790@0 fn=/public/blast/data/drosoph.aa.phr
150528 13:52:47 2745 illingwo.2844:7@localhost.localdomain ofs_close: use=1 fn=/public/blast/data/drosoph.aa.phr
150528 13:52:47 2745 illingwo.2844:7@localhost.localdomain ofs_close: use=0 fn=dummy
150528 13:52:47 2745 XrootdXeq: illingwo.2844:7@localhost.localdomain disc 0:00:03
150528 13:52:47 2745 illingwo.2844:7@localhost.localdomain XrdPoll: FD 7 detached from poller 0; num=0
150528 13:52:53 2744 XrdInet: Accepted connection from 25@localhost.localdomain
150528 13:52:53 2850 XrdSched: running main accept inq=0
150528 13:52:53 2744 XrdProtocol: matched protocol xrootd
150528 13:52:53 2744 ?:25@localhost.localdomain XrdPoll: FD 25 attached to poller 0; num=1
150528 13:52:53 2744 XrootdXeq: illingwo.2852:25@localhost.localdomain pvt IPv4 login
150528 13:52:53 2744 illingwo.2852:25@localhost.localdomain ofs_open: 0-600 fn=/public/blast/data/drosoph.aa.phr
150528 13:52:53 2744 illingwo.2852:25@localhost.localdomain ofs_fstat:  fn=/public/blast/data/drosoph.aa.phr
150528 13:52:53 2744 illingwo.2852:25@localhost.localdomain ofs_close: use=1 fn=/public/blast/data/drosoph.aa.phr

@illingwo reopened this May 28, 2015
alja added a commit to alja/xrootd that referenced this issue May 29, 2015
… is started. Observed with xrdcp ran without -f option and an existing local file. Fixes xrootd#239.
@alja (Contributor) commented May 29, 2015

@illingwo
I submitted a PR #240 to fix this issue.

@illingwo (Author)

I can confirm this fixes the problem for me.

@abh3 abh3 closed this as completed in #240 Jun 1, 2015
@esindril esindril added this to the 4.2.2 milestone Jul 9, 2015
simonmichal pushed a commit that referenced this issue Jul 13, 2015
… is started. Observed with xrdcp ran without -f option and an existing local file. Fixes #239.