Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[XrdCl] Crash in on-exit destruction in ROOT #1487

Closed
osschar opened this issue Jul 28, 2021 · 10 comments
Closed

[XrdCl] Crash in on-exit destruction in ROOT #1487

osschar opened this issue Jul 28, 2021 · 10 comments
Assignees

Comments

@osschar
Copy link
Contributor

osschar commented Jul 28, 2021

root master, xrootd-5.3.0

To reproduce: TFile::Open("root://....."), then .q root.

Also opened as issue in root git: root-project/root#8767

I can try reproducing this with a minimal c++ program.

0x00007ffff74b1a66 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007ffff74b1a66 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00007fffe91ecd62 in XrdCl::Env::GetInt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int&) ()
   from /lib64/libXrdCl.so.3
#2  0x00007fffe9207a39 in XrdCl::PostMaster::PostMaster() () from /lib64/libXrdCl.so.3
#3  0x00007fffe91fed5e in XrdCl::DefaultEnv::GetPostMaster() () from /lib64/libXrdCl.so.3
#4  0x00007fffe9245fc5 in XrdCl::FileStateHandler::~FileStateHandler() () from /lib64/libXrdCl.so.3
#5  0x00007fffe9243bd0 in XrdCl::File::~File() () from /lib64/libXrdCl.so.3
#6  0x00007fffe9243c6d in XrdCl::File::~File() () from /lib64/libXrdCl.so.3
#7  0x00007fffef69630f in TNetXNGFile::~TNetXNGFile (this=0x2019f20, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/net/netxng/src/TNetXNGFile.cxx:233
#8  0x00007fffef696396 in TNetXNGFile::~TNetXNGFile (this=0x2019f20, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/net/netxng/src/TNetXNGFile.cxx:236
#9  0x00007ffff7bc5ad3 in TCollection::GarbageCollect (obj=0x2019f20) at /home/matevz/root-dev/dev-1/core/cont/src/TCollection.cxx:736
#10 0x00007ffff7bcdbe5 in TList::Delete (this=0x4307c0, option=0x7ffff7e1726b "slow") at /home/matevz/root-dev/dev-1/core/cont/src/TList.cxx:508
#11 0x00007ffff7af756b in TROOT::~TROOT (this=0x7ffff7f5a900 <ROOT::Internal::GetROOT1()::alloc>, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/core/base/src/TROOT.cxx:889
#12 0x00007ffff7af4f5f in at_exit_of_TROOT () at /home/matevz/root-dev/dev-1/core/base/src/TROOT.cxx:290
#13 0x00007ffff7317247 in __run_exit_handlers () from /lib64/libc.so.6
#14 0x00007ffff73173f0 in exit () from /lib64/libc.so.6
#15 0x00007ffff7c91627 in TUnixSystem::Exit (this=0x42a800, code=0, mode=true) at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:2157
#16 0x00007ffff7b0cc16 in TApplication::Terminate (this=0x482000, status=0) at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1672
#17 0x00007ffff7facf9d in TRint::Terminate (this=0x482000, status=0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:696
#18 0x00007ffff7b0b37d in TApplication::ProcessLine (this=0x482000, line=0x7fffffffb589 ".q", sync=false, err=0x7fffffffb51c)
    at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1334
#19 0x00007ffff7fad334 in TRint::ProcessLineNr (this=0x482000, filestem=0x7ffff7fbd6cf "ROOT_prompt_", line=0x7fffffffb589 ".q", error=0x7fffffffb51c)
    at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:766
#20 0x00007ffff7facab4 in TRint::HandleTermInput (this=0x482000) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:612
#21 0x00007ffff7faa5c5 in TTermInputHandler::Notify (this=0x1381dc0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:132
#22 0x00007ffff7fae98f in TTermInputHandler::ReadNotify (this=0x1381dc0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:124
#23 0x00007ffff7c8fa8a in TUnixSystem::CheckDescriptors (this=0x42a800) at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:1322
#24 0x00007ffff7c8eec6 in TUnixSystem::DispatchOneEvent (this=0x42a800, pendingOnly=false)
    at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:1077
#25 0x00007ffff7b7e773 in TSystem::InnerLoop (this=0x42a800) at /home/matevz/root-dev/dev-1/core/base/src/TSystem.cxx:406
#26 0x00007ffff7b7e519 in TSystem::Run (this=0x42a800) at /home/matevz/root-dev/dev-1/core/base/src/TSystem.cxx:356
#27 0x00007ffff7b0c963 in TApplication::Run (this=0x482000, retrn=false) at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1604
#28 0x00007ffff7fabddd in TRint::Run (this=0x482000, retrn=false) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:463
#29 0x00000000004011e8 in main (argc=1, argv=0x7fffffffda28) at /home/matevz/root-dev/dev-1/main/src/rmain.cxx:30
@simonmichal
Copy link
Contributor

hmm it seems to be related to this comment:

File::~File()
{
//--------------------------------------------------------------------------
// This, in principle, should never ever happen. Except for the case
// when we're interfaced with ROOT that may call this desctructor from
// its garbage collector, from its __cxa_finalize, ie. after the XrdCl lib
// has been finalized by the linker. So, if we don't have the log object
// at this point we just give up the hope.
//--------------------------------------------------------------------------
if ( DefaultEnv::GetLog() && IsOpen() ) {XRootDStatus status = Close();}
delete pStateHandler;
delete pPlugIn;
}

@simonmichal
Copy link
Contributor

It must happen here:

FileStateHandler::~FileStateHandler()
{
if( pSessionId ) // if the file object was bound to a physical connection
DefaultEnv::GetPostMaster()->DecFileInstCnt( *pDataServer );

I think the if statement needs the DefaultEnv::GetLog() similarly as the destructor ~File destructor.

@osschar
Copy link
Contributor Author

osschar commented Jul 28, 2021

Thanks! I did a stupid test that just opens a file and exits ... and it seems to exit fine. Should I try something else or look harder at what happens in root?

#include "XrdCl/XrdClFile.hh"

#include <cassert>
#include <cstdio>
#include <iostream>
 
// c++ -o noclose-test  noclose-test.cxx -I/usr/include/xrootd -lXrdCl

const std::string FNAME = "root://eoscms.cern.ch//store/mc/RunIISummer20UL18MiniAOD/BToKee_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/MINIAODSIM/bParking_bParking_106X_upgrade2018_realistic_v11_L1v1-v1/280000/0040A793-C5FE-CB42-9846-DA1E89C342AB.root";

int main()
{
  XrdCl::XRootDStatus st;
  XrdCl::File *f;

  f = new XrdCl::File;
  st = f->Open(FNAME, XrdCl::OpenFlags::Read);

  std::cout << "Status " << st.ToStr() << "\n";
}

@simonmichal
Copy link
Contributor

I am rather sure this is root specific, my gut feeling is the TCollection::GarbageCollect is called after the libXrdCl has been already unloaded (hence the call to the XrdCl::PostMaster::PostMaster() constructor in the backtrace)

@osschar
Copy link
Contributor Author

osschar commented Jul 28, 2021

Yes, something in that direction was my first guess as well ... but wanted to make sure all is good in xrootd before we start digging on the root side :)

Backtrace with debuginfo installed (not much better, but at least it's clear this pointers are garbage towards the top):

Thread 1 "root.exe" received signal SIGSEGV, Segmentation fault.
[Switching to process 2459374]
0x00007ffff74b1a66 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007ffff74b1a66 in pthread_rwlock_rdlock () from /lib64/libpthread.so.0
#1  0x00007fffe91ecd62 in XrdSysRWLock::ReadLock (this=0x8) at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/./XrdSys/XrdSysPthread.hh:342
#2  XrdSysRWLockHelper::XrdSysRWLockHelper (rd=true, l=..., this=<synthetic pointer>)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/./XrdSys/XrdSysPthread.hh:392
#3  XrdCl::Env::GetInt (this=0x0, k=..., value=@0x7fffffffad4c: 3) at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClEnv.cc:92
#4  0x00007fffe9207a39 in XrdCl::PostMasterImpl::PostMasterImpl (this=0x1942140)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClPostMaster.cc:64
#5  XrdCl::PostMaster::PostMaster (this=<optimized out>, this=<optimized out>)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClPostMaster.cc:60
#6  0x00007fffe91fed5e in XrdCl::DefaultEnv::GetPostMaster () at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClDefaultEnv.cc:464
#7  0x00007fffe9245fc5 in XrdCl::FileStateHandler::~FileStateHandler (this=<optimized out>, this=<optimized out>)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClFileStateHandler.cc:796
#8  0x00007fffe9243bd0 in XrdCl::File::~File (this=<optimized out>, this=<optimized out>)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClFile.cc:71
#9  0x00007fffe9243c6d in XrdCl::File::~File (this=<optimized out>, this=<optimized out>)
    at /usr/src/debug/xrootd-5.3.0-1.fc33.x86_64/src/XrdCl/XrdClFile.cc:73
#10 0x00007fffef69630f in TNetXNGFile::~TNetXNGFile (this=0x1f8d5d0, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/net/netxng/src/TNetXNGFile.cxx:233
#11 0x00007fffef696396 in TNetXNGFile::~TNetXNGFile (this=0x1f8d5d0, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/net/netxng/src/TNetXNGFile.cxx:236
#12 0x00007ffff7bc5ad3 in TCollection::GarbageCollect (obj=0x1f8d5d0) at /home/matevz/root-dev/dev-1/core/cont/src/TCollection.cxx:736
#13 0x00007ffff7bcdbe5 in TList::Delete (this=0x4307c0, option=0x7ffff7e1726b "slow") at /home/matevz/root-dev/dev-1/core/cont/src/TList.cxx:508
#14 0x00007ffff7af756b in TROOT::~TROOT (this=0x7ffff7f5a900 <ROOT::Internal::GetROOT1()::alloc>, __in_chrg=<optimized out>)
    at /home/matevz/root-dev/dev-1/core/base/src/TROOT.cxx:889
#15 0x00007ffff7af4f5f in at_exit_of_TROOT () at /home/matevz/root-dev/dev-1/core/base/src/TROOT.cxx:290
#16 0x00007ffff7317247 in __run_exit_handlers () from /lib64/libc.so.6
#17 0x00007ffff73173f0 in exit () from /lib64/libc.so.6
#18 0x00007ffff7c91627 in TUnixSystem::Exit (this=0x42a800, code=0, mode=true) at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:2157
#19 0x00007ffff7b0cc16 in TApplication::Terminate (this=0x482000, status=0) at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1672
#20 0x00007ffff7facf9d in TRint::Terminate (this=0x482000, status=0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:696
#21 0x00007ffff7b0b37d in TApplication::ProcessLine (this=0x482000, line=0x7fffffffb589 ".q", sync=false, err=0x7fffffffb51c)
    at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1334
#22 0x00007ffff7fad334 in TRint::ProcessLineNr (this=0x482000, filestem=0x7ffff7fbd6cf "ROOT_prompt_", line=0x7fffffffb589 ".q", error=0x7fffffffb51c)
    at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:766
#23 0x00007ffff7facab4 in TRint::HandleTermInput (this=0x482000) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:612
#24 0x00007ffff7faa5c5 in TTermInputHandler::Notify (this=0x1381dc0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:132
#25 0x00007ffff7fae98f in TTermInputHandler::ReadNotify (this=0x1381dc0) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:124
#26 0x00007ffff7c8fa8a in TUnixSystem::CheckDescriptors (this=0x42a800) at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:1322
#27 0x00007ffff7c8eec6 in TUnixSystem::DispatchOneEvent (this=0x42a800, pendingOnly=false)
    at /home/matevz/root-dev/dev-1/core/unix/src/TUnixSystem.cxx:1077
#28 0x00007ffff7b7e773 in TSystem::InnerLoop (this=0x42a800) at /home/matevz/root-dev/dev-1/core/base/src/TSystem.cxx:406
#29 0x00007ffff7b7e519 in TSystem::Run (this=0x42a800) at /home/matevz/root-dev/dev-1/core/base/src/TSystem.cxx:356
#30 0x00007ffff7b0c963 in TApplication::Run (this=0x482000, retrn=false) at /home/matevz/root-dev/dev-1/core/base/src/TApplication.cxx:1604
#31 0x00007ffff7fabddd in TRint::Run (this=0x482000, retrn=false) at /home/matevz/root-dev/dev-1/core/rint/src/TRint.cxx:463
#32 0x00000000004011e8 in main (argc=1, argv=0x7fffffffda28) at /home/matevz/root-dev/dev-1/main/src/rmain.cxx:30

@osschar
Copy link
Contributor Author

osschar commented Jul 28, 2021

Trying to summarize:

  1. ROOT should close opened files sooner, probably in TApplication::Terminate()
  2. If there is a very good reason for them not doing this, we can add a similar protection as in ~File to ~FileStateHandler, with similar caveat.

@simonmichal
Copy link
Contributor

well, I think I will just add the protection to ~FileStateHandler, the PostMaster staff has been added recently to the destructor

FileStateHandler::~FileStateHandler()
{
if( pSessionId ) // if the file object was bound to a physical connection
DefaultEnv::GetPostMaster()->DecFileInstCnt( *pDataServer );

so from root perspective this is a regression in XrdCl I suppose.

Of course, it would be great if they could garbage collect all XrdCl object before unloading the library ;-)

@osschar
Copy link
Contributor Author

osschar commented Jul 28, 2021

Thank you! Yes, I left a comment to this effect, that known global objects should be deleted before calling exit.

I somewhat suspect nobody wants to touch that code, not even with a 10 ft pole :)

@abh3 abh3 assigned abh3 and simonmichal and unassigned abh3 and simonmichal Jul 29, 2021
@simonmichal
Copy link
Contributor

I pushed the patch: ca92cb3

I somewhat suspect nobody wants to touch that code, not even with a 10 ft pole :)

yes, I think so too ;-)

@osschar
Copy link
Contributor Author

osschar commented Jul 29, 2021

Excellent, I confirm this avoids the crash in root shutdown.

@osschar osschar closed this as completed Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants