xrdcp hanging for weeks/months #293

Closed

esindril opened this issue Sep 28, 2015 · 1 comment
@esindril (Contributor)

We have seen xrdcp processes that stay hung for weeks or even months. For example:

root      9775  0.0  0.0 103308   864 pts/0    S+   19:08   0:00 grep xrdcp
stage    31823  0.0  0.0 427516 61896 ?        Sl   Jul19   2:09 xrdcp -N root://p05798818s26894.cern.ch:1095///srv/castor/04/58/1403770558@castorns.26998125389?castor.accessop=0&castor.txtype=d2duser&castor.exptime=1437339330&castor.pfn2=0:15511:1a5a38c6-84a5-7fee-e053-a908100acf8f&castor.pfn1=/srv/castor/04/58/1403770558@castorns.26998125389&castor.signature=I+7E0Q5KcvQn8wYXWy070SCq3IAPZOeuBzUHgwN8wz738q52vGJy5D9TbMDUeYnHrg/T3ErFi6F6ZEKuvdv3OA== root://localhost:1095///srv/castor/01/58/1403770558@castorns.28601710010?castor.accessop=0&castor.txtype=d2duser&castor.exptime=1437339330&castor.pfn2=0:15511:1a5a38c6-84a5-7fee-e053-a908100acf8f&castor.pfn1=/srv/castor/01/58/1403770558@castorns.28601710010&castor.signature=NVQ6OqRWr1gz1sN/vscKqSLsRkSyi9lDjVofTU3haepuGNdG+WFokT1Tth4exwPiyZfzyOLltyI3mdn+ZKu7og==

Looking at the state of the threads:

(gdb) info threads 
* 1 Thread 31828  0x00000038bf80efe0 in nanosleep () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00000038bf80efe0 in nanosleep () from /lib64/libpthread.so.0
#1  0x00000038bf80917b in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x000000310526bb73 in Lock (this=0x249c010, status=0x7feec40008c0, openInfo=0x0, hostList=0x7feed4000c60) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:149
#3  XrdSysMutexHelper (this=0x249c010, status=0x7feec40008c0, openInfo=0x0, hostList=0x7feed4000c60) at /usr/src/debug/xrootd/xrootd/src/XrdSys/XrdSysPthread.hh:208
#4  XrdCl::FileStateHandler::OnOpen (this=0x249c010, status=0x7feec40008c0, openInfo=0x0, hostList=0x7feed4000c60) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:889
#5  0x000000310526c641 in (anonymous namespace)::OpenHandler::HandleResponseWithHosts (this=0x7feed4002f90, status=0x7feec40008c0, response=0x0, hostList=0x7feed4000c60)
    at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClFileStateHandler.cc:83
#6  0x0000003105259891 in XrdCl::XRootDMsgHandler::HandleResponse (this=0x7feed40035f0) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:1064
#7  0x000000310525a0d3 in XrdCl::XRootDMsgHandler::HandleError (this=0x7feed40035f0, status=..., msg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:1795
#8  0x000000310525ae3d in XrdCl::XRootDMsgHandler::Process (this=0x7feed40035f0, msg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClXRootDMsgHandler.cc:342
#9  0x0000003105244bde in XrdCl::Stream::HandleIncMsgJob::Run (this=0x7feed4002a10, arg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClStream.hh:279
#10 0x00000031052889cd in XrdCl::JobManager::RunJobs (this=0x2496e40) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClJobManager.cc:148
#11 0x0000003105288a49 in RunRunnerThread (arg=<value optimized out>) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClJobManager.cc:33
#12 0x00000038bf8079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00000038bf4e88fd in sysctl () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()

A second backtrace, from the thread running the exit handlers:

(gdb) bt
#0  0x00000038bf80822d in pthread_join () from /lib64/libpthread.so.0
#1  0x0000003105288784 in XrdCl::JobManager::StopWorkers (this=0x2496e40, n=2) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClJobManager.cc:127
#2  0x0000003105288be6 in XrdCl::JobManager::Stop (this=0x2496e40) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClJobManager.cc:103
#3  0x000000310523e161 in XrdCl::PostMaster::Stop (this=0x2496260) at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClPostMaster.cc:140
#4  0x0000003105230cca in XrdCl::DefaultEnv::Finalize () at /usr/src/debug/xrootd/xrootd/src/XrdCl/XrdClDefaultEnv.cc:590
#5  0x00000038bf435ebd in __cxa_finalize () from /lib64/libc.so.6
#6  0x0000003105224db6 in __do_global_dtors_aux () from /usr/lib64/libXrdCl.so.2
#7  0x0000000000000000 in ?? ()

These are the only threads for which I could get useful information. For these processes the xrdcp binary they were started from has usually already been deleted from disk, i.e.:

# ls -lrt /proc/31823/
total 0
lrwxrwxrwx. 1 stage st 0 Sep 28 13:40 exe -> /usr/bin/xrdcp (deleted)
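Taken together, the two backtraces suggest a shutdown deadlock: a job-manager worker is delivering the open callback on the error/retry path and blocks trying to acquire a mutex (frames #2-#4 of the first trace), while the exiting side is stuck in pthread_join inside JobManager::StopWorkers waiting for that same worker. The minimal sketch below is not XrdCl code; it only illustrates, under the assumption that the lock the callback needs is held (or never released) by the shutting-down side, how such a join can never return. The names handlerLock and worker are made up for the illustration.

// Sketch only (not XrdCl code): joining a worker thread while holding the
// mutex that the worker's callback needs. Running this hangs forever,
// mirroring the shape of the two backtraces above.
#include <chrono>
#include <mutex>
#include <thread>

int main()
{
  std::mutex handlerLock;   // stands in for the file-state-handler mutex

  // Worker: models the JobManager runner delivering an async open/retry
  // callback that must take the handler lock (frames #2-#4 of the first bt).
  std::thread worker( [&handlerLock]()
  {
    std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );
    std::lock_guard<std::mutex> guard( handlerLock );   // blocks forever
  } );

  // Exiting side: models __cxa_finalize -> JobManager::StopWorkers. Because
  // the lock is still held here, the join below can never return.
  {
    std::lock_guard<std::mutex> guard( handlerLock );
    worker.join();   // waits for a thread that is waiting for 'guard'
  }
  return 0;
}

Any shutdown path that can reach pthread_join while such a lock stays unavailable to the joined thread has this failure mode, whichever component happens to own the lock.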
@esindril (Contributor, Author)

This was fixed in XRootD 4.3.0 by correcting the handling of async requests in retry scenarios.
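The actual 4.3.0 change lives in the XrdCl async-request retry path and is not reproduced here. Purely as a hedged illustration of the general hazard (not of the real fix), the hypothetical JobQueue below uses the usual "stop, drain, then join" shutdown pattern: it refuses new work, lets outstanding callbacks run without holding the queue lock, and only then joins its worker, so a late callback cannot deadlock the shutdown.

// Hypothetical JobQueue, for illustration only: shutdown never holds the
// queue lock across the join, and callbacks run with the lock released.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class JobQueue
{
  public:
    JobQueue() : stopped( false ), worker( [this]{ Run(); } ) {}

    ~JobQueue()
    {
      {
        std::lock_guard<std::mutex> guard( lock );
        stopped = true;              // refuse new work first ...
      }
      wakeup.notify_all();
      worker.join();                 // ... then join, with the lock released
    }

    void Submit( std::function<void()> job )
    {
      std::lock_guard<std::mutex> guard( lock );
      if( !stopped )
        jobs.push( std::move( job ) );
      wakeup.notify_one();
    }

  private:
    void Run()
    {
      std::unique_lock<std::mutex> guard( lock );
      while( true )
      {
        wakeup.wait( guard, [this]{ return stopped || !jobs.empty(); } );
        if( jobs.empty() && stopped )
          return;                    // drained and told to stop
        std::function<void()> job = std::move( jobs.front() );
        jobs.pop();
        guard.unlock();
        job();                       // run the callback without the lock
        guard.lock();
      }
    }

    std::mutex                        lock;
    std::condition_variable           wakeup;
    std::queue<std::function<void()>> jobs;
    bool                              stopped;
    std::thread                       worker;
};

int main()
{
  JobQueue q;
  q.Submit( []{ /* e.g. deliver an open/retry response */ } );
  return 0;                          // destructor drains and joins cleanly
}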
