Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data races TThread TTimer TApplication #8365

Open
1 task done
Tracked by #14319
ferdymercury opened this issue Jun 8, 2021 · 15 comments
Open
1 task done
Tracked by #14319

Data races TThread TTimer TApplication #8365

ferdymercury opened this issue Jun 8, 2021 · 15 comments

Comments

@ferdymercury
Copy link
Collaborator

ferdymercury commented Jun 8, 2021

  • Checked for duplicates

Describe the bug

I encounter some data races when using TThread. They are reported by helgrind, and I also see sometimes dead-locks, see full discussion on https://root-forum.cern.ch/t/trentrantrwlock-thread-lock-program-freezes/45116/14

   at 0x5201F09: TOrdCollection::PhysIndex(int) const (TOrdCollection.h:135)
   by 0x5200D7E: TOrdCollection::IndexOf(TObject const*) const (TOrdCollection.cxx:258)
   by 0x5201355: TOrdCollection::Remove(TObject*) (TOrdCollection.cxx:369)
   by 0x519E6E4: TSystem::RemoveTimer(TTimer*) (TSystem.cxx:486)
   by 0x52CDD63: TUnixSystem::RemoveTimer(TTimer*) (TUnixSystem.cxx:2987)
   by 0x51B4903: TTimer::TurnOff() (TTimer.cxx:232)
   by 0x1C9E0029: ???
   by 0x8D6D5A9: TClingCallFunc::exec(void*, void*) (TClingCallFunc.cxx:1843)
   by 0x8D6E882: TClingCallFunc::Exec(void*, TInterpreterValue*) (TClingCallFunc.cxx:2102)
   by 0x8C164CB: TCling::CallFunc_Exec(CallFunc_t*, void*) const (TCling.cxx:7788)
   by 0x517DC35: TQConnection::SendSignal() (TQConnection.h:76)
   by 0x487CC2B: void TQObject::EmitVA<>(char const*, int) (TQObject.h:137)

This conflicts with a previous write of size 4 by thread #3
Locks held: 1, at address 0x167C9B50
   at 0x52005E4: TOrdCollection::AddAt(TObject*, int) (TOrdCollection.cxx:70)
   by 0x5200785: TOrdCollection::AddLast(TObject*) (TOrdCollection.cxx:102)
   by 0x5201C2D: TSeqCollection::Add(TObject*) (TSeqCollection.h:38)
   by 0x519E68B: TSystem::AddTimer(TTimer*) (TSystem.cxx:476)
   by 0x52CDC32: TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2974)
   by 0x64AC1E3: TThreadTimer::TThreadTimer(long) (TThread.cxx:1200)
   by 0x64AB943: TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1037)
   by 0x64AB448: TThread::Printf(char const*, ...) (TThread.cxx:950)
TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1058)
TThreadTimer::Notify() (TThread.cxx:1208)

image

--> TThread::fgXact is accessed unprotected, it could have been changed at the same time by XARequest:

image

TApplication::IsRunning() const (TApplication.h:148)
TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1029)
This conflicts with a previous write of size 1 by thread #1
TApplication::Run(bool) (TApplication.cxx:1622)

image
image

Possible data race during read of size 4 at 0x7F25098 by thread #3
Locks held: 1, at address 0x167C9B50
TCollection::GetSize() const (TCollection.h:182)
TOrdCollectionIter::Next() (TOrdCollection.cxx:506)
TIter::Next() (TCollection.h:249)
TIter::operator()() (TCollection.h:248)
TCollection::FindObject(TObject const*) const (TCollection.cxx:342)
TSystem::AddTimer(TTimer*) (TSystem.cxx:475)
TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2974)
TThreadTimer::TThreadTimer(long) (TThread.cxx:1200)
TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1037)
This conflicts with a previous write of size 4 by thread #1
Locks held: none
TOrdCollection::AddAt(TObject*, int) (TOrdCollection.cxx:85)
TOrdCollection::AddLast(TObject*) (TOrdCollection.cxx:102)
TSeqCollection::Add(TObject*) (TSeqCollection.h:38)
TSystem::AddTimer(TTimer*) (TSystem.cxx:476)
TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2974)
TTimer::TurnOn() (TTimer.cxx:247)
TTimer::Start(long, bool) (TTimer.cxx:216)

Possible data race during write of size 8 at 0x1B96F790 by thread #3
Locks held: none
TOrdCollection::MoveGapTo(int) (TOrdCollection.cxx:309)
TOrdCollection::AddAt(TObject*, int) (TOrdCollection.cxx:78)
TOrdCollection::AddLast(TObject*) (TOrdCollection.cxx:102)
TSeqCollection::Add(TObject*) (TSeqCollection.h:38)
TSystem::AddTimer(TTimer*) (TSystem.cxx:476)
TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2974)
TTimer::TurnOn() (TTimer.cxx:247)
TTimer::Start(long, bool) (TTimer.cxx:216)
MainWindow::SaveAndExit() (MainWindow.cpp:1216)
???
TClingCallFunc::exec(void*, void*) (TClingCallFunc.cxx:1843)
TClingCallFunc::Exec(void*, TInterpreterValue*) (TClingCallFunc.cxx:2102)
Address 0x1b96f790 is 64 bytes inside a block of size 72 alloc'd
operator new[](unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_helgrind-amd64-linux.so)
TStorage::ReAlloc(void*, unsigned long, unsigned long) (TStorage.cxx:238)
TOrdCollection::SetCapacity(int) (TOrdCollection.cxx:387)
TOrdCollection::AddAt(TObject*, int) (TOrdCollection.cxx:66)
TOrdCollection::AddLast(TObject*) (TOrdCollection.cxx:102)
TSeqCollection::Add(TObject*) (TSeqCollection.h:38)
TSystem::AddTimer(TTimer*) (TSystem.cxx:476)
TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2974)
TTimer::TurnOn() (TTimer.cxx:247)
TGCommandPlugin::TGCommandPlugin(TGWindow const*, unsigned int, unsigned int) (TGCommandPlugin.cxx:110)
MainWindow::MainWindow(TGWindow const*, unsigned int, unsigned int, PulseSurfer*, unsigned int, unsigned int, bool) (MainWindow.cpp:616)
main (main.cxx:86)

and several more can be seen in helgrind.log in the forum post, or in helgrind.xml attached here. helgrind.xml.zip

When opening the XML with QtCreator, they are rendered nicely:
image

Expected behavior

No data races are found. Or they are added to helgrind-root.supp

To Reproduce

  1. cd /opt/ && git clone https://github.com/CLIUtils/CLI11
  2. Download https://root-forum.cern.ch/uploads/short-url/z59x8uBIVMEE5S46EWUUHyyO3df.zip and unzip it, cd into it
  3. mkdir build && cd build
  4. cmake -DROOT_DIR=/build-debug-mode-ROOT -DCMAKE_BUILD_TYPE=Debug ../
  5. make
  6. valgrind --xml=yes --xml-file=helgrind.xml --tool=helgrind --suppressions=/pathtoDebugROOT/etc/helgrind-root.supp --log-file=helgrind.log gui/runGUI --prbs 3 -q

Setup

  1. ROOT master
  2. Ubuntu 18
  3. Self-built

Additional context

https://root-forum.cern.ch/t/trentrantrwlock-thread-lock-program-freezes/45116/14
#8297

@ferdymercury
Copy link
Collaborator Author

I managed to reduce a bit the number of data races with the attached patch.
race.patch.zip
Now I get:


---Thread-Announcement------------------------------------------

Thread #2 was created
   at 0x6D6070E: clone (clone.S:71)
   by 0x60E8EC4: create_thread (createthread.c:100)
   by 0x60E8EC4: pthread_create@@GLIBC_2.2.5 (pthread_create.c:797)
   by 0x4C38A27: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x7ABA260: TPosixThread::Run(TThread*) (TPosixThread.cxx:52)
   by 0x7AB61CA: TThread::Run(void*) (TThread.cxx:576)
   by 0x4E5D108: RThread::Start() (RThread.cpp:62)
   by 0x4E6C34B: MainWindow::DoStartDAQ() (MainWindow.cpp:1216)
   by 0x4E6DB7C: MainWindow::AcquirePRBS() (MainWindow.cpp:1476)
   by 0x11D697: main (main.cxx:94)

---Thread-Announcement------------------------------------------

Thread #1 is the program's root thread

----------------------------------------------------------------

Possible data race during write of size 1 at 0xB2D0758 by thread #1
Locks held: none
   at 0x5BDAAB9: TUnixSystem::DispatchTimers(bool) (TUnixSystem.cxx:2956)
   by 0x5BD5A48: TUnixSystem::DispatchOneEvent(bool) (TUnixSystem.cxx:1100)
   by 0x5AB6BC2: TSystem::InnerLoop() (TSystem.cxx:409)
   by 0x5AB6953: TSystem::Run() (TSystem.cxx:359)
   by 0x5A3F120: TApplication::Run(bool) (TApplication.cxx:1624)
   by 0x11D768: main (main.cxx:112)

This conflicts with a previous read of size 1 by thread #2
Locks held: none
   at 0x5BDAF58: TUnixSystem::ResetTimer(TTimer*) (TUnixSystem.cxx:3010)
   by 0x5BDADAE: TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2987)
   by 0x7AB7E06: TThreadTimer::TThreadTimer(long) (TThread.cxx:1209)
   by 0x7AB750D: TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1042)
   by 0x7AB6FE8: TThread::Printf(char const*, ...) (TThread.cxx:950)
   by 0x4E5F218: ThSFMC01::ThreadFunction() (ThSFMC01.cpp:51)
   by 0x4E5E88A: RThread::ThreadHandle(void*) (RThread.cpp:258)
   by 0x7AB6A70: TThread::Function(void*) (TThread.cxx:828)
 Address 0xb2d0758 is 168 bytes inside a block of size 672 alloc'd
   at 0x4C3359F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x5A9F485: TStorage::ObjectAlloc(unsigned long) (TStorage.cxx:330)
   by 0x4E5E9DD: TObject::operator new(unsigned long) (TObject.h:167)
   by 0x5A2C9D3: TROOT::InitSystem() (TROOT.cxx:1926)
   by 0x5A277F6: TROOT::TROOT(char const*, char const*, void (**)()) (TROOT.cxx:667)
   by 0x5A32A0C: ROOT::Internal::TROOTAllocator::TROOTAllocator() (TROOT.cxx:334)
   by 0x5A267B6: ROOT::Internal::GetROOT1() (TROOT.cxx:376)
   by 0x5A26B62: ROOT::GetROOT() (TROOT.cxx:466)
   by 0x5A31563: __static_initialization_and_destruction_0(int, int) (TROOT.cxx:584)
   by 0x5A315AA: _GLOBAL__sub_I_TROOT.cxx (TROOT.cxx:3125)
   by 0x40108D2: call_init (dl-init.c:72)
   by 0x40108D2: _dl_init (dl-init.c:119)
   by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
 Block was alloc'd by thread #1

----------------------------------------------------------------

 Lock at 0x265C1600 was first observed
   at 0x4C39F2A: pthread_mutex_init (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x7AB9D77: TPosixMutex::TPosixMutex(bool) (TPosixMutex.cxx:42)
   by 0x7ABA912: TPosixThreadFactory::CreateMutexImp(bool) (TPosixThreadFactory.cxx:45)
   by 0x7A9913B: TMutex::TMutex(bool) (TMutex.cxx:36)
   by 0x7A99280: TMutex::Factory(bool) (TMutex.cxx:89)
   by 0x5AB6DC1: TSystem::AddTimer(TTimer*) (TSystem.cxx:479)
   by 0x5BDAD8E: TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2986)
   by 0x53C5642: TGTextView::Init(unsigned long) (TGTextView.cxx:101)
   by 0x53C58C3: TGTextView::TGTextView(TGWindow const*, unsigned int, unsigned int, int, unsigned int, unsigned long) (TGTextView.cxx:122)
   by 0x52C6394: TGCommandPlugin::TGCommandPlugin(TGWindow const*, unsigned int, unsigned int) (TGCommandPlugin.cxx:72)
   by 0x4E6810B: MainWindow::MainWindow(TGWindow const*, unsigned int, unsigned int, PulseSurfer*, unsigned int, unsigned int, bool) (MainWindow.cpp:614)
   by 0x11D4FE: main (main.cxx:82)
 Address 0x265c1600 is 16 bytes inside a block of size 56 alloc'd
   at 0x4C3359F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x5A9F485: TStorage::ObjectAlloc(unsigned long) (TStorage.cxx:330)
   by 0x4E5E9DD: TObject::operator new(unsigned long) (TObject.h:167)
   by 0x7ABA901: TPosixThreadFactory::CreateMutexImp(bool) (TPosixThreadFactory.cxx:45)
   by 0x7A9913B: TMutex::TMutex(bool) (TMutex.cxx:36)
   by 0x7A99280: TMutex::Factory(bool) (TMutex.cxx:89)
   by 0x5AB6DC1: TSystem::AddTimer(TTimer*) (TSystem.cxx:479)
   by 0x5BDAD8E: TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2986)
   by 0x53C5642: TGTextView::Init(unsigned long) (TGTextView.cxx:101)
   by 0x53C58C3: TGTextView::TGTextView(TGWindow const*, unsigned int, unsigned int, int, unsigned int, unsigned long) (TGTextView.cxx:122)
   by 0x52C6394: TGCommandPlugin::TGCommandPlugin(TGWindow const*, unsigned int, unsigned int) (TGCommandPlugin.cxx:72)
   by 0x4E6810B: MainWindow::MainWindow(TGWindow const*, unsigned int, unsigned int, PulseSurfer*, unsigned int, unsigned int, bool) (MainWindow.cpp:614)
 Block was alloc'd by thread #1

Possible data race during write of size 1 at 0xB2D0758 by thread #1
Locks held: 1, at address 0x265C1600
   at 0x5BDACDE: TUnixSystem::DispatchTimers(bool) (TUnixSystem.cxx:2976)
   by 0x5BD5A48: TUnixSystem::DispatchOneEvent(bool) (TUnixSystem.cxx:1100)
   by 0x5AB6BC2: TSystem::InnerLoop() (TSystem.cxx:409)
   by 0x5AB6953: TSystem::Run() (TSystem.cxx:359)
   by 0x5A3F120: TApplication::Run(bool) (TApplication.cxx:1624)
   by 0x11D768: main (main.cxx:112)

This conflicts with a previous read of size 1 by thread #2
Locks held: none
   at 0x5BDAF58: TUnixSystem::ResetTimer(TTimer*) (TUnixSystem.cxx:3010)
   by 0x5BDADAE: TUnixSystem::AddTimer(TTimer*) (TUnixSystem.cxx:2987)
   by 0x7AB7E06: TThreadTimer::TThreadTimer(long) (TThread.cxx:1209)
   by 0x7AB750D: TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1042)
   by 0x7AB6FE8: TThread::Printf(char const*, ...) (TThread.cxx:950)
   by 0x4E5F218: ThSFMC01::ThreadFunction() (ThSFMC01.cpp:51)
   by 0x4E5E88A: RThread::ThreadHandle(void*) (RThread.cpp:258)
   by 0x7AB6A70: TThread::Function(void*) (TThread.cxx:828)
 Address 0xb2d0758 is 168 bytes inside a block of size 672 alloc'd
   at 0x4C3359F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x5A9F485: TStorage::ObjectAlloc(unsigned long) (TStorage.cxx:330)
   by 0x4E5E9DD: TObject::operator new(unsigned long) (TObject.h:167)
   by 0x5A2C9D3: TROOT::InitSystem() (TROOT.cxx:1926)
   by 0x5A277F6: TROOT::TROOT(char const*, char const*, void (**)()) (TROOT.cxx:667)
   by 0x5A32A0C: ROOT::Internal::TROOTAllocator::TROOTAllocator() (TROOT.cxx:334)
   by 0x5A267B6: ROOT::Internal::GetROOT1() (TROOT.cxx:376)
   by 0x5A26B62: ROOT::GetROOT() (TROOT.cxx:466)
   by 0x5A31563: __static_initialization_and_destruction_0(int, int) (TROOT.cxx:584)
   by 0x5A315AA: _GLOBAL__sub_I_TROOT.cxx (TROOT.cxx:3125)
   by 0x40108D2: call_init (dl-init.c:72)
   by 0x40108D2: _dl_init (dl-init.c:119)
   by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
 Block was alloc'd by thread #1


Process terminating with default action of signal 2 (SIGINT)
   at 0x60F21FD: __lll_lock_wait (lowlevellock.S:135)
   by 0x60EB0F3: pthread_mutex_lock (pthread_mutex_lock.c:115)
   by 0x4C35FD6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x7AB9F0D: TPosixMutex::Lock() (TPosixMutex.cxx:76)
   by 0x7A991BC: TMutex::Lock() (TMutex.cxx:48)
   by 0x7AB7EB7: TThreadTimer::Notify() (TThread.cxx:1220)
   by 0x5ACC1D9: TTimer::CheckTimer(TTime const&) (TTimer.cxx:132)
   by 0x5BDAC20: TUnixSystem::DispatchTimers(bool) (TUnixSystem.cxx:2967)
   by 0x5BD5A48: TUnixSystem::DispatchOneEvent(bool) (TUnixSystem.cxx:1100)
   by 0x5AB6BC2: TSystem::InnerLoop() (TSystem.cxx:409)
   by 0x5AB6953: TSystem::Run() (TSystem.cxx:359)
   by 0x5A3F120: TApplication::Run(bool) (TApplication.cxx:1624)
----------------------------------------------------------------

Thread #2: Exiting thread still holds 1 lock
   at 0x60EEAD3: futex_wait_cancelable (futex-internal.h:88)
   by 0x60EEAD3: __pthread_cond_wait_common (pthread_cond_wait.c:502)
   by 0x60EEAD3: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.c:655)
   by 0x4C38DC3: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x7AB9A02: TPosixCondition::Wait() (TPosixCondition.cxx:65)
   by 0x7AB765D: TThread::XARequest(char const*, int, void**, int*) (TThread.cxx:1069)
   by 0x7AB6FE8: TThread::Printf(char const*, ...) (TThread.cxx:950)
   by 0x4E5F218: ThSFMC01::ThreadFunction() (ThSFMC01.cpp:51)
   by 0x4E5E88A: RThread::ThreadHandle(void*) (RThread.cpp:258)
   by 0x7AB6A70: TThread::Function(void*) (TThread.cxx:828)
   by 0x4C38C26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x60E86DA: start_thread (pthread_create.c:463)
   by 0x6D6071E: clone (clone.S:95)

----------------------------------------------------------------

Thread #1: Exiting thread still holds 1 lock
   at 0x60F21FD: __lll_lock_wait (lowlevellock.S:135)
   by 0x60EB0F3: pthread_mutex_lock (pthread_mutex_lock.c:115)
   by 0x4C35FD6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
   by 0x7AB9F0D: TPosixMutex::Lock() (TPosixMutex.cxx:76)
   by 0x7A991BC: TMutex::Lock() (TMutex.cxx:48)
   by 0x7AB7EB7: TThreadTimer::Notify() (TThread.cxx:1220)
   by 0x5ACC1D9: TTimer::CheckTimer(TTime const&) (TTimer.cxx:132)
   by 0x5BDAC20: TUnixSystem::DispatchTimers(bool) (TUnixSystem.cxx:2967)
   by 0x5BD5A48: TUnixSystem::DispatchOneEvent(bool) (TUnixSystem.cxx:1100)
   by 0x5AB6BC2: TSystem::InnerLoop() (TSystem.cxx:409)
   by 0x5AB6953: TSystem::Run() (TSystem.cxx:359)
   by 0x5A3F120: TApplication::Run(bool) (TApplication.cxx:1624)

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jun 30, 2021

After the changes from #8562 and #8572, I am left only with this data race:

  • TList::NewLink vs TList::FindObject
    image

  • And then I see this error, that might be the reason of the freeze of my program.
    image

@pcanal
Copy link
Member

pcanal commented Jul 6, 2021

TList::NewLink vs TList::FindObject

Is this still an issue after making the collection 'thread-safe'?

And then I see this error, that might be the reason of the freeze of my program.

Does it say 'which' lock is still being held?

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jul 6, 2021

Thanks for the reply.

Yes, it is still an issue because the problem is in TList itself not being (completely) thread-safe even if UseRWLock is called. In particular, there is a suspicious std::make_shared< link > that might be giving problems when 'sharing it'.

Find attached the complete log.
helgrindlog.txt

The issue is that when a timer is added, it calls FindObject:
https://github.com/root-project/root/blob/master/core/cont/src/TList.cxx#L617

but at the same time, a new object might have been added in the meanwhile by the other thread:
https://github.com/root-project/root/blob/master/core/cont/src/TList.cxx#L738

Maybe a mutex should be added inside the TList class to handle this?

Here the XML version (lower part of it):
image

@ferdymercury
Copy link
Collaborator Author

It happens in several places, but it is always in the TList class when retrieving or adding links:

image

Maybe the deadlock disappears once we fix this last race.

@ferdymercury
Copy link
Collaborator Author

Side note: I was able to reproduce the same race and deadlock on different computers using:

0. cd /opt/ && git clone https://github.com/CLIUtils/CLI11
1. Download https://root-forum.cern.ch/uploads/short-url/z59x8uBIVMEE5S46EWUUHyyO3df.zip and unzip it, cd into it
2. mkdir build && cd build
3. cmake -DROOT_DIR=/build-debug-mode-ROOT -DCMAKE_BUILD_TYPE=Debug ../
4. make
5. valgrind --xml=yes --xml-file=helgrind.xml --tool=helgrind --suppressions=/pathtoDebugROOT/etc/helgrind-root.supp --log-file=helgrind.log gui/runGUI --prbs 3 -q

@ferdymercury
Copy link
Collaborator Author

Another comment: If I change the TThread::Printf to std::cout, the deadlock and data races disappear. So this points that the data race happens when Printf is called, which creates as TTimer that is somehow lost during the TObjLink creation.

@pcanal
Copy link
Member

pcanal commented Jul 7, 2021

The race is likely unrelated to the dead lock but maybe that it is time to follow the comment's advise:

class TObjLink ...

   TObject    *fObject; // should be atomic ...

well unless we really still have:

which creates as TTimer that is somehow lost during the TObjLink creation.

do we?

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jul 7, 2021

well unless we really still have:

which creates as TTimer that is somehow lost during the TObjLink creation.

do we?

Well, that's what I suspect based on the Helgrind log, because Printf creates a TThreadTimer and then it's racing with NewLink:

image

(Side note: the NewLink function appears also in several places in valgrind-root.supp, so that a related memory leak might be hidden.)

@ferdymercury
Copy link
Collaborator Author

class TObjLink ...

   TObject    *fObject; // should be atomic ...

Good point!

The change is however non-trivial, because there is:
TObject **GetObjectRef() { return &fObject; }

@pcanal
Copy link
Member

pcanal commented Jul 7, 2021

In the last screen shot, the "no lock held" in the write is suspicious. It is suppose to have take the write lock in AddLast line 158. For the lock to be missed, it would need to have been executed before the thread-safety was enable or before the list was made thread-safe.

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jul 7, 2021

Yes, it's weird indeed. Well, I call ROOT::EnableThreadSafety before any thread starts, so that should be fine.

the list is immediately made thread-safe in the function TSystem::Init,

   fTimers              = new TList;
   fTimers->UseRWLock();

I can try to rewrite it using an intermediary variable to lock it before assigning it to fTimers, to see if it changes something.

@ferdymercury
Copy link
Collaborator Author

If I do:

   TList *timers        = new TList;
   timers->UseRWLock();//lock it before assigning it
   fTimers = timers;
   timers = nullptr;

then I get this helgrind log, seems similar:
helgrindlog.txt

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jul 12, 2021

This is the gdb output in Release mode:

Thread 1 "runGUI" received signal SIGINT, Interrupt.
0x00007ffff6b46ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562a4138) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88	../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7f8fb80 (LWP 20703) "runGUI" 0x00007ffff6b46ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562a4138) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  2    Thread 0x7fffa4c86700 (LWP 20736) "runGUI" 0x00007ffff6b46ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555573d87fc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) thread apply all bt

Thread 2 (Thread 0x7fffa4c86700 (LWP 20736)):
#0  0x00007ffff6b46ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555573d87fc) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x5555573d8880, cond=0x5555573d87d0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5555573d87d0, mutex=0x5555573d8880) at pthread_cond_wait.c:655
#3  0x00007ffff537d8fe in TThread::XARequest(char const*, int, void**, int*) () from /opt/root/lib/libThread.so
#4  0x00007ffff537dadd in TThread::Printf(char const*, ...) () from /opt/root/lib/libThread.so
#5  0x00007ffff7ba899d in RThread::Stop (this=0x5555568eb150) at /home/user/Desktop/zdt/gui/RThread.cpp:119
#6  0x00007ffff7bb776e in MainWindow::DoStopDAQ (this=0x7ffffffdd1f0) at /home/user/Desktop/zdt/gui/MainWindow.cpp:1250
#7  0x00007ffff7bb5164 in MainWindow::SaveAndExit (this=0x7ffffffdd1f0) at /home/user/Desktop/zdt/gui/MainWindow.cpp:848
#8  0x00007fffa43c802a in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7f8fb80 (LWP 20703)):
#0  0x00007ffff6b46ad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562a4138) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55555620d3b0, cond=0x5555562a4110) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5555562a4110, mutex=0x55555620d3b0) at pthread_cond_wait.c:655
#3  0x00007ffff68678bc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff536e1c8 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock() () from /opt/root/lib/libThread.so
#5  0x00007ffff709ec6b in TApplication::Class() () from /opt/root/lib/libCore.so
#6  0x00007ffff6f3a04d in TObject::InheritsFrom(char const*) const () from /opt/root/lib/libCore.so
#7  0x00007ffff76d1770 in TGCommandPlugin::CheckRemote(char const*) () from /opt/root/lib/libGui.so
#8  0x00007ffff76d1970 in TGCommandPlugin::HandleTimer(TTimer*) () from /opt/root/lib/libGui.so
#9  0x00007ffff6f7eced in TTimer::Notify() () from /opt/root/lib/libCore.so
#10 0x00007ffff6f7e771 in TTimer::CheckTimer(TTime const&) () from /opt/root/lib/libCore.so
#11 0x00007ffff705487f in TUnixSystem::DispatchTimers(bool) () from /opt/root/lib/libCore.so
#12 0x00007ffff705586d in TUnixSystem::DispatchOneEvent(bool) () from /opt/root/lib/libCore.so
#13 0x00007ffff6f6bf51 in TSystem::Run() () from /opt/root/lib/libCore.so
#14 0x00007ffff6f01a72 in TApplication::Run(bool) () from /opt/root/lib/libCore.so
#15 0x000055555555d755 in main (argc=3, argv=0x7fffffffded8) at /home/user/Desktop/zdt/gui/main.cxx:112
(gdb) print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex
No symbol "ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>" in current context.

And here in Debug mode:

Thread 1 "runGUI" received signal SIGINT, Interrupt.
0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88	../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7f8fb80 (LWP 23653) "runGUI" 0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  2    Thread 0x7fffa2a96700 (LWP 23702) "runGUI" 0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55555747c2c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) thread apply all bt

Thread 2 (Thread 0x7fffa2a96700 (LWP 23702)):
#0  0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55555747c2c8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55555747c350, cond=0x55555747c2a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x55555747c2a0, mutex=0x55555747c350) at pthread_cond_wait.c:655
#3  0x00007ffff4fd7227 in TPosixCondition::Wait (this=0x55555747c290) at /opt/root_src/core/thread/src/TPosixCondition.cxx:65
#4  0x00007ffff4fd4ec6 in TThread::XARequest (xact=0x7ffff4fe1260 "PRTF", nb=2, ar=0x7fffa2a90420, iret=0x0) at /opt/root_src/core/thread/src/TThread.cxx:1067
#5  0x00007ffff4fd4857 in TThread::Printf (fmt=0x7ffff7bc640e "%s finished.") at /opt/root_src/core/thread/src/TThread.cxx:953
#6  0x00007ffff7ba8943 in RThread::Stop (this=0x5555569a6820) at /home/user/Desktop/zdt/gui/RThread.cpp:116
#7  0x00007ffff7bb778e in MainWindow::DoStopDAQ (this=0x7ffffffdd1f0) at /home/user/Desktop/zdt/gui/MainWindow.cpp:1251
#8  0x00007ffff7bb5184 in MainWindow::SaveAndExit (this=0x7ffffffdd1f0) at /home/user/Desktop/zdt/gui/MainWindow.cpp:849
#9  0x00007fffa21d802a in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7f8fb80 (LWP 23653)):
#0  0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555556100770, cond=0x5555562dea30) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5555562dea30, mutex=0x555556100770) at pthread_cond_wait.c:655
#3  0x00007ffff665c8bc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff4fc44cb in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x5555562dea30, __lock=...) at /usr/include/c++/8/condition_variable:263
#5  0x00007ffff4fc31fc in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}) (this=0x5555562dea30, __lock=..., __p=...) at /usr/include/c++/8/condition_variable:272
#6  0x00007ffff4fbf283 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555562de9f8) at /opt/root_src/core/thread/src/TReentrantRWLock.cxx:190
#7  0x00007ffff4fcd2f6 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555562de9f0) at /opt/root_src/core/thread/src/TRWMutexImp.cxx:42
#8  0x00007ffff701788b in ROOT::TVirtualRWMutex::Lock (this=0x5555562de9f0) at /home/user/builds/build-root_src-Desktop-Debug/include/TVirtualRWMutex.h:79
#9  0x00007ffff76abb7a in TLockGuard::TLockGuard (this=0x7ffffffdc988, mutex=0x5555562de9f0) at /opt/root_src/core/base/inc/TVirtualMutex.h:77
#10 0x00007ffff6f87040 in TObjectSpy::TObjectSpy (this=0x7ffffffdc9e0, obj=0x5555573f9700, fixMustCleanupBit=true) at /opt/root_src/core/gui/src/TObjectSpy.cxx:40
#11 0x00007ffff771f746 in TGFrame::HandleEvent (this=0x5555573f9700, event=0x7ffffffdcac0) at /opt/root_src/gui/gui/src/TGFrame.cxx:462
#12 0x00007ffff76d18e6 in TGClient::HandleEvent (this=0x5555563eef20, event=0x7ffffffdcac0) at /opt/root_src/gui/gui/src/TGClient.cxx:846
#13 0x00007ffff76d11bc in TGClient::ProcessOneEvent (this=0x5555563eef20) at /opt/root_src/gui/gui/src/TGClient.cxx:656
#14 0x00007ffff76d1364 in TGClient::HandleInput (this=0x5555563eef20) at /opt/root_src/gui/gui/src/TGClient.cxx:703
#15 0x00007ffff76cf70c in TGInputHandler::Notify (this=0x5555564c87f0) at /opt/root_src/gui/gui/src/TGClient.cxx:117
#16 0x00007ffff6fa0d9f in TUnixSystem::DispatchOneEvent (this=0x5555557f57c0, pendingOnly=false) at /opt/root_src/core/unix/src/TUnixSystem.cxx:1067
#17 0x00007ffff6e863af in TSystem::InnerLoop (this=0x5555557f57c0) at /opt/root_src/core/base/src/TSystem.cxx:406
#18 0x00007ffff6e86144 in TSystem::Run (this=0x5555557f57c0) at /opt/root_src/core/base/src/TSystem.cxx:356
#19 0x00007ffff6e0f64b in TApplication::Run (this=0x7ffffffdce30, retrn=false) at /opt/root_src/core/base/src/TApplication.cxx:1604
#20 0x00005555555695a5 in main (argc=3, argv=0x7fffffffded8) at /home/user/Desktop/zdt/gui/main.cxx:112
(gdb) print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex
$1 = {<std::__mutex_base> = {_M_mutex = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with no waiters, Owner ID = Unknown, Robust = No, Shared = No, Protocol = None}}, <No data fields>}

again debug, but a different trace

Thread 1 "runGUI" received signal SIGINT, Interrupt.
0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
88	../sysdeps/unix/sysv/linux/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id         Frame 
* 1    Thread 0x7ffff7f8fb80 (LWP 23769) "runGUI" 0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
  2    Thread 0x7fffa2a96700 (LWP 23803) "runGUI" 0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55555747dfa8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
(gdb) thread apply all bt

Thread 2 (Thread 0x7fffa2a96700 (LWP 23803)):
#0  0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x55555747dfa8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x55555747e030, cond=0x55555747df80) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x55555747df80, mutex=0x55555747e030) at pthread_cond_wait.c:655
#3  0x00007ffff4fd7227 in TPosixCondition::Wait (this=0x55555747df70) at /opt/root_src/core/thread/src/TPosixCondition.cxx:65
#4  0x00007ffff4fd4ec6 in TThread::XARequest (xact=0x7ffff4fe1260 "PRTF", nb=2, ar=0x7fffa2a90420, iret=0x0) at /opt/root_src/core/thread/src/TThread.cxx:1067
#5  0x00007ffff4fd4857 in TThread::Printf (fmt=0x7ffff7bc641c "Start: %s") at /opt/root_src/core/thread/src/TThread.cxx:953
#6  0x00007ffff7ba89e8 in RThread::Stop (this=0x5555569a4f70) at /home/huesof/Desktop/zdt/gui/RThread.cpp:120
#7  0x00007ffff7bb778e in MainWindow::DoStopDAQ (this=0x7ffffffdd1f0) at /home/huesof/Desktop/zdt/gui/MainWindow.cpp:1251
#8  0x00007ffff7bb5184 in MainWindow::SaveAndExit (this=0x7ffffffdd1f0) at /home/huesof/Desktop/zdt/gui/MainWindow.cpp:849
#9  0x00007fffa21d802a in ?? ()
#10 0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7f8fb80 (LWP 23769)):
#0  0x00007ffff693bad3 in futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5555562dea58) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x555556100770, cond=0x5555562dea30) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x5555562dea30, mutex=0x555556100770) at pthread_cond_wait.c:655
#3  0x00007ffff665c8bc in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff4fc44cb in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x5555562dea30, __lock=...) at /usr/include/c++/8/condition_variable:263
#5  0x00007ffff4fc31fc in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock()::{lambda()#1}) (this=0x5555562dea30, __lock=..., __p=...) at /usr/include/c++/8/condition_variable:272
#6  0x00007ffff4fbf283 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555562de9f8) at /opt/root_src/core/thread/src/TReentrantRWLock.cxx:190
#7  0x00007ffff4fcd2f6 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::WriteLock (this=0x5555562de9f0) at /opt/root_src/core/thread/src/TRWMutexImp.cxx:42
#8  0x00007ffff701788b in ROOT::TVirtualRWMutex::Lock (this=0x5555562de9f0) at /home/huesof/builds/build-root_src-Desktop-Debug/include/TVirtualRWMutex.h:79
#9  0x00007ffff76abb7a in TLockGuard::TLockGuard (this=0x7ffffffdc9b0, mutex=0x5555562de9f0) at /opt/root_src/core/base/inc/TVirtualMutex.h:77
#10 0x00007ffff6fd9755 in TApplication::Class () at /home/huesof/builds/build-root_src-Desktop-Debug/core/G__Core.cxx:12360
#11 0x00007ffff6e10f79 in TApplication::IsA (this=0x7ffffffdce30) at /opt/root_src/core/base/inc/TApplication.h:162
#12 0x00007ffff6e51951 in TObject::InheritsFrom (this=0x7ffffffdce30, classname=0x7ffff789295d "TRint") at /opt/root_src/core/base/src/TObject.cxx:447
#13 0x00007ffff76e6205 in TGCommandPlugin::CheckRemote (this=0x5555566b5b30) at /opt/root_src/gui/gui/src/TGCommandPlugin.cxx:141
#14 0x00007ffff76e6f8b in TGCommandPlugin::HandleTimer (this=0x5555566b5b30, t=0x555557414fe0) at /opt/root_src/gui/gui/src/TGCommandPlugin.cxx:275
#15 0x00007ffff6e9b6d9 in TTimer::Notify (this=0x555557414fe0) at /opt/root_src/core/base/src/TTimer.cxx:146
#16 0x00007ffff6e9b670 in TTimer::CheckTimer (this=0x555557414fe0, now=...) at /opt/root_src/core/base/src/TTimer.cxx:132
#17 0x00007ffff6fa6105 in TUnixSystem::DispatchTimers (this=0x5555557f57c0, mode=true) at /opt/root_src/core/unix/src/TUnixSystem.cxx:2956
#18 0x00007ffff6fa1018 in TUnixSystem::DispatchOneEvent (this=0x5555557f57c0, pendingOnly=false) at /opt/root_src/core/unix/src/TUnixSystem.cxx:1096
#19 0x00007ffff6e863af in TSystem::InnerLoop (this=0x5555557f57c0) at /opt/root_src/core/base/src/TSystem.cxx:406
#20 0x00007ffff6e86144 in TSystem::Run (this=0x5555557f57c0) at /opt/root_src/core/base/src/TSystem.cxx:356
#21 0x00007ffff6e0f64b in TApplication::Run (this=0x7ffffffdce30, retrn=false) at /opt/root_src/core/base/src/TApplication.cxx:1604
#22 0x00005555555695a5 in main (argc=3, argv=0x7fffffffded8) at /home/huesof/Desktop/zdt/gui/main.cxx:112
(gdb) print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex
$1 = {<std::__mutex_base> = {_M_mutex = pthread_mutex_t = {Type = Normal, Status = Acquired, possibly with no waiters, Owner ID = Unknown, Robust = No, Shared = No, Protocol = None}}, <No data fields>}
(gdb) 

Valgrind:
image

And in another computer:

Thread 1 "runGUI" received signal SIGINT, Interrupt.
futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556a28e98) at ../sysdeps/nptl/futex-internal.h:183
183	../sysdeps/nptl/futex-internal.h: No such file or directory.
(gdb) info threads
  Id   Target Id                                   Frame 
* 1    Thread 0x7ffff523cb80 (LWP 705986) "runGUI" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556a28e98) at ../sysdeps/nptl/futex-internal.h:183
  2    Thread 0x7ffee552f700 (LWP 706016) "runGUI" futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556c5e1ec) at ../sysdeps/nptl/futex-internal.h:183
(gdb) thread apply all bt

Thread 2 (Thread 0x7ffee552f700 (LWP 706016)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556c5e1ec) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x5555579d0380, cond=0x555556c5e1c0) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x555556c5e1c0, mutex=0x5555579d0380) at pthread_cond_wait.c:638
#3  0x00007ffff66c5077 in TPosixCondition::Wait (this=0x555556c5e1b0) at /opt/root_src/core/thread/src/TPosixCondition.cxx:65
#4  0x00007ffff66c2b46 in TThread::XARequest (xact=0x7ffff66d0208 "PRTF", nb=2, ar=0x7ffee5529510, iret=0x0) at /opt/root_src/core/thread/src/TThread.cxx:1067
#5  0x00007ffff66c24d7 in TThread::Printf (fmt=0x7ffff7fbc0f1 "Start: %s") at /opt/root_src/core/thread/src/TThread.cxx:953
#6  0x00007ffff7fa0617 in RThread::Stop() () from /tmp/zdt/build/gui/libZDTgui.so
#7  0x00007ffff7fa6785 in MainWindow::SaveAndExit() () from /tmp/zdt/build/gui/libZDTgui.so
#8  0x00007ffee4c2f02a in ?? ()
#9  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff523cb80 (LWP 705986)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x555556a28e98) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555556a2a610, cond=0x555556a28e70) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x555556a28e70, mutex=0x555556a2a610) at pthread_cond_wait.c:638
#3  0x00007ffff7157e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff66b1225 in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex> > (this=0x555556a28e70, __lock=...) at /usr/include/c++/9/condition_variable:273
#5  0x00007ffff66afe12 in std::_V2::condition_variable_any::wait<std::unique_lock<std::mutex>, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::ReadLock()::{lambda()#1}>(std::unique_lock<std::mutex>&, ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::ReadLock()::{lambda()#1}) (this=0x555556a28e70, __lock=..., __p=...) at /usr/include/c++/9/condition_variable:282
#6  0x00007ffff66ab912 in ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::ReadLock (this=0x555556a28e38) at /opt/root_src/core/thread/src/TReentrantRWLock.cxx:110
#7  0x00007ffff66ba786 in ROOT::TRWMutexImp<std::mutex, ROOT::Internal::RecurseCountsTBBUnique>::ReadLock (this=0x555556a28e30) at /opt/root_src/core/thread/src/TRWMutexImp.cxx:33
#8  0x00007ffff7589bea in ROOT::TReadLockGuard::TReadLockGuard (this=0x7ffffffdcbf0, mutex=0x555556a28e30) at /opt/root_src/core/base/inc/TVirtualRWMutex.h:120
#9  0x00007ffff7662143 in THashTable::FindObject (this=0x5555555ca010, name=0x7ffff7e8d9dd "TRint") at /opt/root_src/core/cont/src/THashTable.cxx:242
#10 0x00007ffff76baf03 in TClass::GetClass (name=0x7ffff7e8d9dd "TRint", load=true, silent=true, hint_pair_offset=0, hint_pair_size=0) at /opt/root_src/core/meta/src/TClass.cxx:2973
#11 0x00007ffff76badd7 in TClass::GetClass (name=0x7ffff7e8d9dd "TRint", load=true, silent=true) at /opt/root_src/core/meta/src/TClass.cxx:2957
#12 0x00007ffff76c1e04 in TClass::InheritsFrom (this=0x555557ad69e0, classname=0x7ffff7e8d9dd "TRint") at /opt/root_src/core/meta/src/TClass.cxx:4871
#13 0x00007ffff75da92c in TObject::InheritsFrom (this=0x7ffffffdd280, classname=0x7ffff7e8d9dd "TRint") at /opt/root_src/core/base/src/TObject.cxx:447
#14 0x00007ffff7cd58f1 in TGCommandPlugin::CheckRemote (this=0x5555572a3120) at /opt/root_src/gui/gui/src/TGCommandPlugin.cxx:141
#15 0x00007ffff7cd66c7 in TGCommandPlugin::HandleTimer (this=0x5555572a3120, t=0x555557972600) at /opt/root_src/gui/gui/src/TGCommandPlugin.cxx:275
#16 0x00007ffff7626f0d in TTimer::Notify (this=0x555557972600) at /opt/root_src/core/base/src/TTimer.cxx:146
#17 0x00007ffff7626ea0 in TTimer::CheckTimer (this=0x555557972600, now=...) at /opt/root_src/core/base/src/TTimer.cxx:132
#18 0x00007ffff773c635 in TUnixSystem::DispatchTimers (this=0x5555555c3800, mode=true) at /opt/root_src/core/unix/src/TUnixSystem.cxx:2956
#19 0x00007ffff7737353 in TUnixSystem::DispatchOneEvent (this=0x5555555c3800, pendingOnly=false) at /opt/root_src/core/unix/src/TUnixSystem.cxx:1096
#20 0x00007ffff7610e47 in TSystem::InnerLoop (this=0x5555555c3800) at /opt/root_src/core/base/src/TSystem.cxx:408
#21 0x00007ffff7610bb8 in TSystem::Run (this=0x5555555c3800) at /opt/root_src/core/base/src/TSystem.cxx:358
#22 0x00007ffff75976bd in TApplication::Run (this=0x7ffffffdd280, retrn=false) at /opt/root_src/core/base/src/TApplication.cxx:1604
#23 0x000055555555cd0e in main ()
(gdb)  print (('ROOT::TReentrantRWLock<std::mutex, ROOT::Internal::RecurseCounts>' * const)ROOT::gCoreMutex)->fMutex
$1 = {<std::__mutex_base> = {_M_mutex = {__data = {__lock = 0, __count = 1, __owner = 0, __nusers = 0, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, 
      __size = "\000\000\000\000\001", '\000' <repeats 34 times>, __align = 4294967296}}, <No data fields>}

with slightly more warnings in valgrind:
image

@ferdymercury
Copy link
Collaborator Author

ferdymercury commented Jul 12, 2021

One hypothesis: there is lock is due to a race between TThread::Join and TThread::Printf. When stopping, the thread is asked to "join". After that, a Printf is issued, which creates a TThreadTimer and retrieves TThread::Self and TThread::GetThread before locking the mutex, and then it hangs at condimp->Wait.

image

image

image

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants