[I/O] Race condition when reading vectors with custom allocators with TTreeProcessorMT #10357

Closed · eguiraud opened this issue Apr 8, 2022 · 4 comments · Fixed by #10361

eguiraud commented Apr 8, 2022

This is a reproducer (segfaults frequently but not always):

#include <ROOT/TTreeProcessorMT.hxx>
#include <TROOT.h>
#include <TTreeReader.h>
#include <TTreeReaderArray.h>

void workload(TTreeReader &r) {
  // Read the "truthCaloPt" branch entry by entry; each worker thread runs
  // this on its own TTreeReader.
  TTreeReaderArray<double> ra(r, "truthCaloPt");
  while (r.Next())
    ra.GetSize();
}

int main() {
  ROOT::EnableImplicitMT(2); // two worker threads
  ROOT::TTreeProcessorMT mt({"f1.root", "f2.root", "f3.root", "f4.root", "f5.root"}, "t");
  mt.Process(workload); // runs workload concurrently on the different files
}

With these files: files.zip

The problem seems to be at the level of TGenCollectionProxy: multiple threads end up sharing the same TGenCollectionProxy object, which is not thread-safe; see e.g. this snippet from TEmulatedCollectionProxy::Destructor:

// FIXME: This is not thread safe.
TVirtualCollectionProxy::TPushPop env(const_cast<TEmulatedCollectionProxy*>(this), p);

In principle, however, this should not happen: since we use a different TChain/TTreeReader in each thread, each thread should also access different TGenCollectionProxy instances.
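To make the failure mode concrete, here is a minimal standalone sketch (not ROOT code): the proxy state is modeled as a plain std::vector, mirroring the TGenCollectionProxy::fProxyList push/pop that shows up in the helgrind report further down. Two threads doing unsynchronized push/pop on the same instance is a data race:

#include <thread>
#include <vector>

// Stand-in for the proxy: TPushPop's constructor/destructor boil down to
// an unguarded push_back/pop_back on a per-proxy vector.
struct ProxySketch {
  std::vector<void *> fProxyList; // shared, no lock
  void PushProxy(void *env) { fProxyList.push_back(env); }
  void PopProxy() { fProxyList.pop_back(); }
};

int main() {
  ProxySketch sharedProxy; // one instance reachable from both threads
  auto use = [&sharedProxy] {
    for (int i = 0; i < 100000; ++i) {
      sharedProxy.PushProxy(nullptr); // ~ TPushPop constructor
      sharedProxy.PopProxy();         // ~ TPushPop destructor
    }
  };
  std::thread t1(use), t2(use);
  t1.join();
  t2.join(); // helgrind/TSan flag the concurrent accesses to fProxyList
}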

Example backtraces at the point of crash. This is one of several failure modes, but it is the one where the problem is clearest: at frame 0, both threads are accessing the same TGenCollectionProxy instance (this=0x7fffd4016090):

>>> thread apply all bt 10

Thread 2 (Thread 0x7fffdc0e2640 (LWP 312745) "repro_ttreeproc"):
#0  0x00007ffff767d973 in TGenCollectionProxy::PopProxy (this=0x7fffd4016090) at ../io/io/src/TGenCollectionProxy.cxx:1333
#1  0x00007ffff7d57a15 in TVirtualCollectionProxy::TPushPop::~TPushPop (this=0x7fffdc0dad20, __in_chrg=<optimized out>) at ../core/cont/inc/TVirtualCollectionProxy.h:65
#2  0x00007ffff76274b1 in TEmulatedCollectionProxy::Destructor (this=0x7fffd4016090, p=0x7fffd40156e0, dtorOnly=false) at ../io/io/src/TEmulatedCollectionProxy.cxx:87
#3  0x00007ffff7d4f8c2 in TClass::Destructor (this=0x7fffd40152c0, obj=0x7fffd40156e0, dtorOnly=false) at ../core/meta/src/TClass.cxx:5417
#4  0x00007ffff676afdb in TBranchElement::ReleaseObject (this=0x7fffd4017590) at ../tree/tree/src/TBranchElement.cxx:4743
#5  0x00007ffff676b265 in TBranchElement::ResetAddress (this=0x7fffd4017590) at ../tree/tree/src/TBranchElement.cxx:4806
#6  0x00007ffff675b10b in TBranchElement::~TBranchElement (this=0x7fffd4017590, __in_chrg=<optimized out>) at ../tree/tree/src/TBranchElement.cxx:982
#7  0x00007ffff675b338 in TBranchElement::~TBranchElement (this=0x7fffd4017590, __in_chrg=<optimized out>) at ../tree/tree/src/TBranchElement.cxx:1003
#8  0x00007ffff7ceae9f in TCollection::GarbageCollect (obj=0x7fffd4017590) at ../core/cont/src/TCollection.cxx:736
#9  0x00007ffff7cfbe70 in TObjArray::Delete (this=0x7fffd4011ab8) at ../core/cont/src/TObjArray.cxx:376
(More stack frames follow...)

Thread 1 (Thread 0x7ffff42bec00 (LWP 312681) "repro_ttreeproc"):
#0  0x00007ffff767d973 in TGenCollectionProxy::PopProxy (this=0x7fffd4016090) at ../io/io/src/TGenCollectionProxy.cxx:1333
#1  0x00007ffff656b78d in (anonymous namespace)::TCollectionLessSTLReader::GetSize (this=0x5555577ccb80, proxy=0x5555577cdde0) at ../tree/treeplayer/src/TTreeReaderArray.cxx:130
#2  0x0000555555561837 in ROOT::Internal::TTreeReaderArrayBase::GetSize (this=0x7fffffffc1c0) at /home/blue/ROOT/master/cmake-build-foo/include/TTreeReaderArray.h:35
#3  0x00005555555612bc in workload (r=...) at repro_ttreeprocmt.cpp:10
#4  0x0000555555563ef5 in std::__invoke_impl<void, void (*&)(TTreeReader&), TTreeReader&> (__f=@0x7fffffffde50: 0x555555561269 <workload(TTreeReader&)>) at /usr/include/c++/11.2.0/bits/invoke.h:61
#5  0x0000555555563784 in std::__invoke_r<void, void (*&)(TTreeReader&), TTreeReader&> (__fn=@0x7fffffffde50: 0x555555561269 <workload(TTreeReader&)>) at /usr/include/c++/11.2.0/bits/invoke.h:111
#6  0x0000555555562df8 in std::_Function_handler<void (TTreeReader&), void (*)(TTreeReader&)>::_M_invoke(std::_Any_data const&, TTreeReader&) (__functor=..., __args#0=...) at /usr/include/c++/11.2.0/bits/std_function.h:291
#7  0x00007ffff659e8a9 in std::function<void (TTreeReader&)>::operator()(TTreeReader&) const (this=0x7fffffffde50, __args#0=...) at /usr/include/c++/11.2.0/bits/std_function.h:560
#8  0x00007ffff659881c in operator() (__closure=0x7fffffffcf10, c=...) at ../tree/treeplayer/src/TTreeProcessorMT.cxx:555
#9  0x00007ffff6599d8c in operator() (__closure=0x7fffffffceb0, i=0) at ../core/imt/inc/ROOT/TThreadExecutor.hxx:231
(More stack frames follow...)

First reported at https://root-forum.cern.ch/t/root-6-26-00-issue-with-multi-threaded-rdataframe-and-rvec/49310 .

eguiraud commented Apr 8, 2022

@pcanal suggests the following patch, which does seem to fix the crash for me:

diff --git a/core/meta/src/TClass.cxx b/core/meta/src/TClass.cxx
index a36ff3f6f5..ba5dbc50d6 100644
--- a/core/meta/src/TClass.cxx
+++ b/core/meta/src/TClass.cxx
@@ -5414,7 +5414,7 @@ void TClass::Destructor(void *obj, Bool_t dtorOnly)
       // There is no dictionary at all, so this is an emulated
       // class; however we do have the services of a collection proxy,
       // so this is an emulated STL class.
-      fCollectionProxy->Destructor(p, dtorOnly);
+      GetCollectionProxy()->Destructor(p, dtorOnly);
    } else if (!HasInterpreterInfo() && !fCollectionProxy) {
       // There is no dictionary at all and we do not have
       // the services of a collection proxy available, so
@@ -5540,7 +5540,7 @@ void TClass::DeleteArray(void *ary, Bool_t dtorOnly)
       // There is no dictionary at all, so this is an emulated
       // class; however we do have the services of a collection proxy,
       // so this is an emulated STL class.
-      fCollectionProxy->DeleteArray(ary, dtorOnly);
+      GetCollectionProxy()->DeleteArray(ary, dtorOnly);
    } else if (!HasInterpreterInfo() && !fCollectionProxy) {
       // There is no dictionary at all and we do not have
       // the services of a collection proxy available, so
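For context on why going through the accessor helps: TClass::GetCollectionProxy() can hand out a per-thread clone of the proxy (created with TVirtualCollectionProxy::Generate()) instead of the fCollectionProxy member that is shared across threads, so each thread gets its own push/pop state. A simplified sketch of that pattern (a one-class approximation with illustrative names; the real implementation keeps one clone per thread and per class):

#include <memory>

struct CollectionProxySketch {
  virtual ~CollectionProxySketch() = default;
  // Clone the proxy, as TVirtualCollectionProxy::Generate() does.
  virtual CollectionProxySketch *Generate() const { return new CollectionProxySketch(*this); }
};

struct ClassSketch {
  CollectionProxySketch *fCollectionProxy = nullptr; // shared: unsafe to use directly from many threads

  CollectionProxySketch *GetCollectionProxy() const {
    // One clone per thread (ROOT uses per-class thread-local storage;
    // a single thread_local slot is enough for this one-class sketch).
    thread_local std::unique_ptr<CollectionProxySketch> localProxy;
    if (!localProxy && fCollectionProxy)
      localProxy.reset(fCollectionProxy->Generate());
    return localProxy ? localProxy.get() : fCollectionProxy;
  }
};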

eguiraud commented Apr 8, 2022

Valgrind's helgrind reports several problems, but it cannot be trusted completely here: it does not understand the thread-synchronization semantics of TBB tasks. One report that seems relevant:

==144544== Possible data race during read of size 8 at 0x1A752C00 by thread #1
==144544== Locks held: none
==144544==    at 0x51BD968: TGenCollectionProxy::PopProxy() (TGenCollectionProxy.cxx:1332)
==144544==    by 0x63D678C: (anonymous namespace)::TCollectionLessSTLReader::GetSize(ROOT::Detail::TBranchProxy*) (TTreeReaderArray.cxx:130)
==144544==    by 0x168CEE: ROOT::Internal::TTreeReaderArrayBase::GetSize() const (TTreeReaderArray.h:35)
==144544==    by 0x180901: ROOT::Internal::RDF::RTreeColumnReader<ROOT::VecOps::RVec<double> >::GetImpl(long long) (RTreeColumnReader.hxx:79)
==144544==    by 0x178553: ROOT::VecOps::RVec<double>& ROOT::Detail::RDF::RColumnReaderBase::Get<ROOT::VecOps::RVec<double> >(long long) (RColumnReaderBase.hxx:38)
==144544==    by 0x167C89: void ROOT::Detail::RDF::RDefine<plot_rc()::{lambda(ROOT::VecOps::RVec<double>&)#1}, ROOT::Detail::RDF::CustomColExtraArgs::None>::UpdateHelper<ROOT::VecOps::RVec<double>, 0ul>(unsigned int, long long, ROOT::TypeTraits::TypeList<ROOT::VecOps::RVec<double> >, std::integer_sequence<unsigned long, 0ul>, ROOT::Detail::RDF::CustomColExtraArgs::None) (RDefine.hxx:77)
==144544==    by 0x16782C: ROOT::Detail::RDF::RDefine<plot_rc()::{lambda(ROOT::VecOps::RVec<double>&)#1}, ROOT::Detail::RDF::CustomColExtraArgs::None>::Update(unsigned int, long long) (RDefine.hxx:138)
==144544==    by 0x168A5E: ROOT::Internal::RDF::RDefineReader::GetImpl(long long) (RDefineReader.hxx:41)
==144544==    by 0x176D71: ROOT::VecOps::RVec<int>& ROOT::Detail::RDF::RColumnReaderBase::Get<ROOT::VecOps::RVec<int> >(long long) (RColumnReaderBase.hxx:38)
==144544==    by 0x17588E: void ROOT::Internal::RDF::RAction<ROOT::Internal::RDF::SumHelper<int>, ROOT::Detail::RDF::RLoopManager, ROOT::TypeTraits::TypeList<ROOT::VecOps::RVec<int> > >::CallExec<ROOT::VecOps::RVec<int>, 0ul>(unsigned int, long long, ROOT::TypeTraits::TypeList<ROOT::VecOps::RVec<int> >, std::integer_sequence<unsigned long, 0ul>) (RAction.hxx:103)
==144544==    by 0x174BAF: ROOT::Internal::RDF::RAction<ROOT::Internal::RDF::SumHelper<int>, ROOT::Detail::RDF::RLoopManager, ROOT::TypeTraits::TypeList<ROOT::VecOps::RVec<int> > >::Run(unsigned int, long long) (RAction.hxx:111)
==144544==    by 0x6F1EF5E: ROOT::Detail::RDF::RLoopManager::RunAndCheckFilters(unsigned int, long long) (RLoopManager.cxx:589)
==144544==  Address 0x1a752c00 is 0 bytes inside a block of size 16 alloc'd
==144544==    at 0x4848093: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==144544==    by 0x51C300F: __gnu_cxx::new_allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*>::allocate(unsigned long, void const*) (new_allocator.h:127)
==144544==    by 0x51C29DA: std::allocator_traits<std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*> >::allocate(std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*>&, unsigned long) (alloc_traits.h:460)
==144544==    by 0x51C20DF: std::_Vector_base<ROOT::Detail::TCollectionProxyInfo::EnvironBase*, std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*> >::_M_allocate(unsigned long) (stl_vector.h:346)
==144544==    by 0x51C1620: void std::vector<ROOT::Detail::TCollectionProxyInfo::EnvironBase*, std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*> >::_M_realloc_insert<ROOT::Detail::TCollectionProxyInfo::EnvironBase* const&>(__gnu_cxx::__normal_iterator<ROOT::Detail::TCollectionProxyInfo::EnvironBase**, std::vector<ROOT::Detail::TCollectionProxyInfo::EnvironBase*, std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*> > >, ROOT::Detail::TCollectionProxyInfo::EnvironBase* const&) (vector.tcc:440)
==144544==    by 0x51C09AB: std::vector<ROOT::Detail::TCollectionProxyInfo::EnvironBase*, std::allocator<ROOT::Detail::TCollectionProxyInfo::EnvironBase*> >::push_back(ROOT::Detail::TCollectionProxyInfo::EnvironBase* const&) (stl_vector.h:1198)
==144544==    by 0x51BD8FB: TGenCollectionProxy::PushProxy(void*) (TGenCollectionProxy.cxx:1322)
==144544==    by 0x4C589E6: TVirtualCollectionProxy::TPushPop::TPushPop(TVirtualCollectionProxy*, void*) (TVirtualCollectionProxy.h:64)
==144544==    by 0x5167481: TEmulatedCollectionProxy::Destructor(void*, bool) const (TEmulatedCollectionProxy.cxx:85)
==144544==    by 0x4C508C1: TClass::Destructor(void*, bool) (TClass.cxx:5417)
==144544==    by 0x60EBFDA: TBranchElement::ReleaseObject() (TBranchElement.cxx:4743)
==144544==    by 0x60EC264: TBranchElement::ResetAddress() (TBranchElement.cxx:4806)
==144544==  Block was alloc'd by thread #2

eguiraud commented Apr 8, 2022

The problem appears when reading branches of type vector<double,ROOT::Detail::VecOps::RAdoptAllocator<double>> (this is how RVecs were written out by RDF's Snapshot before v6.26). Removing the custom allocator, or reading RVecs written out with their new I/O implementation (based on collection proxies), "fixes" the problem (EDIT: because in those cases no emulated collection proxy is used, and the emulated proxy is what causes the race).
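For reference, a sketch of how files with such branches could have been produced (illustrative file and column names, assuming a ROOT version older than v6.26):

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>

// With ROOT < 6.26, Snapshot writes RVec<double> columns to disk as
// vector<double,ROOT::Detail::VecOps::RAdoptAllocator<double>>, the branch
// type that is read back through an emulated collection proxy.
void write_old_style_rvec() {
  ROOT::RDataFrame df(10);
  df.Define("truthCaloPt", [] { return ROOT::RVec<double>{1., 2., 3.}; })
    .Snapshot("t", "f1.root");
}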

eguiraud added a commit to eguiraud/root that referenced this issue Apr 8, 2022
This fixes root-project#10357 (a race condition when reading vectors with custom
allocators with TTreeProcessorMT that also affected RDataFrame).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>
eguiraud added a commit that referenced this issue Apr 10, 2022
This fixes #10357 (a race condition when reading vectors with custom
allocators with TTreeProcessorMT that also affected RDataFrame).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>
eguiraud added a commit to eguiraud/root that referenced this issue Apr 10, 2022
This fixes root-project#10357 (a race condition when reading vectors with custom
allocators with TTreeProcessorMT that also affected RDataFrame).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>
eguiraud reopened this Apr 10, 2022

eguiraud commented Apr 10, 2022
Reopening until backport is in.

eguiraud added a commit that referenced this issue Apr 10, 2022
This fixes #10357 (a race condition when reading vectors with custom
allocators with TTreeProcessorMT that also affected RDataFrame).

Co-authored-by: Philippe Canal <pcanal@fnal.gov>