Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TBufferMerger cannot handle TTree spanning over multiple files #6523

Closed
vepadulano opened this issue Oct 1, 2020 · 4 comments · Fixed by #6570
Closed

TBufferMerger cannot handle TTree spanning over multiple files #6523

vepadulano opened this issue Oct 1, 2020 · 4 comments · Fixed by #6570

Comments

@vepadulano
Copy link
Member

vepadulano commented Oct 1, 2020

Describe the bug

This is a continuation of a jira issue originally found in multithreaded RDF ROOT-10896.

If a TTree spans over multiple files (i.e. when its size is higher than fgMaxTreeSize) TBufferMerger cannot correctly save its contents to the (multiple) TBufferMergerFile(s) in its queue. This is due to the original TFile being deleted in TTree::ChangeFile but not updated in TBufferMergerFile which then holds a dangling pointer. The program crashes with the following stacktrace (not really helpful even with debug symbols enabled in ROOT as well):

vpadulan@fedorathinkpad-T550 [~/Projects/ROOT-10896]: ./tbuffermerger_setmaxsize_seq 
Fill: Switching to new file: tbuffermerger_sequential_1.root
Fill: Switching to new file: tbuffermerger_sequential_2.root

 *** Break *** segmentation violation
===========================================================
There was a crash (kSigSegmentationViolation).
This is the entire stack trace of all threads:
===========================================================
#0  0x00007f4fca7fbeca in wait4 () from /lib64/libc.so.6
#1  0x00007f4fca777aa7 in do_system () from /lib64/libc.so.6
#2  0x00007f4fcd15e590 in TUnixSystem::Exec (this=0xe4c800, shellcmd=0x1f92840 "/home/vpadulan/Programs/rootproject/rootinstall/devdebugtest/etc/gdb-backtrace.sh 8100 1>&2") at /home/vpadulan/Programs/rootproject/root/core/unix/src/TUnixSystem.cxx:2120
#3  0x00007f4fcd15edf6 in TUnixSystem::StackTrace (this=0xe4c800) at /home/vpadulan/Programs/rootproject/root/core/unix/src/TUnixSystem.cxx:2411
#4  0x00007f4fcd16271a in TUnixSystem::DispatchSignals (this=0xe4c800, sig=kSigSegmentationViolation) at /home/vpadulan/Programs/rootproject/root/core/unix/src/TUnixSystem.cxx:3649
#5  0x00007f4fcd15a82a in SigHandler (sig=kSigSegmentationViolation) at /home/vpadulan/Programs/rootproject/root/core/unix/src/TUnixSystem.cxx:407
#6  0x00007f4fcd162628 in sighandler (sig=11) at /home/vpadulan/Programs/rootproject/root/core/unix/src/TUnixSystem.cxx:3620
#7  <signal handler called>
#8  0x0000000000000000 in ?? ()
#9  0x00000000004012fb in main () at tbuffermerger_setmaxsize_seq.cpp:25
===========================================================

The lines below might hint at the cause of the crash.
You may get help by asking at the ROOT forum http://root.cern.ch/forum
Only if you are really convinced it is a bug in ROOT then please submit a
report at http://root.cern.ch/bugs Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#8  0x0000000000000000 in ?? ()
#9  0x00000000004012fb in main () at tbuffermerger_setmaxsize_seq.cpp:25
===========================================================

Expected behavior

The program doesn't crash. TBufferMergerFile is updated alongside the tree (i.e. it should be the same pointer as TTree::GetCurrentFile()) and the partial files are correctly saved when the TTree changes to new ones.

To Reproduce

See the following gist:
https://gist.github.com/vepadulano/fc09150516193f1a801e0991591958ee

Setup

  1. ROOT master
  2. Fedora 32
  3. ROOT built with cmake: cmake -GNinja -Ddev=ON -DCMAKE_BUILD_TYPE=Debug -Dtesting=ON -Droottest=ON

Additional Context

Some print statements from gdb, also commented in the gist.
Before filling the TTree we can see that the current TFile of the tree is correctly the TBufferMergerFile (tbmfile):

(gdb) p tbmfile.get()
$4 = (std::__shared_ptr<ROOT::Experimental::TBufferMergerFile, (__gnu_cxx::_Lock_policy)2>::element_type *) 0x1f0e750
(gdb) p tree.GetCurrentFile()
$5 = (TFile *) 0x1f0e750
(gdb) p tree.GetDirectory()  
$6 = (TDirectory *) 0x1f0e750
(gdb) p tbmfile.get()->GetName()
$7 = 0x1eca1f0 "tbuffermerger_sequential.root"
(gdb) p tree.GetDirectory()->GetName()
$8 = 0x1eca1f0 "tbuffermerger_sequential.root"
(gdb) p tree.GetCurrentFile()->GetName()
$9 = 0x1eca1f0 "tbuffermerger_sequential.root"

But after the event loop the TTree changes files twice and the latest file it points to is called "tbuffermerger_sequential_2.root", but this is not reflected correctly in the TBufferMergerFile:

(gdb) p tbmfile.get()
$10 = (std::__shared_ptr<ROOT::Experimental::TBufferMergerFile, (__gnu_cxx::_Lock_policy)2>::element_type *) 0x1f0e750
(gdb) p tree.GetDirectory()                  
$11 = (TDirectory *) 0x19865c0
(gdb) p tree.GetCurrentFile()                
$12 = (TFile *) 0x19865c0
(gdb) p tree.GetCurrentFile()->GetName()
$13 = 0x1db1780 "tbuffermerger_sequential_2.root"
(gdb) p tree.GetDirectory()->GetName()       
$14 = 0x1db1780 "tbuffermerger_sequential_2.root"
(gdb) p tbmfile.get()->GetName()

Program received signal SIGSEGV, Segmentation fault.
@vepadulano vepadulano self-assigned this Oct 1, 2020
@github-actions github-actions bot added this to Needs triage in Triage Oct 1, 2020
@vepadulano
Copy link
Member Author

vepadulano commented Oct 6, 2020

Update: see the gist for some comments from Philippe.

@vepadulano
Copy link
Member Author

vepadulano commented Oct 6, 2020

One way to go about this would be to mimic the suggested usage of the TFile the TTree is attached to as per the TTree::ChangeFile docs in TBufferMerger.

One of the main limitations I see is that TBufferMerger is hardwired to the initial TFile it is constructed with.

Supposing we modify TTree::ChangeFile to check the tree is attached to a TBufferMergerFile, if it doesn't get deleted (like Philippe suggests in the gist comments) then we still have to find a way to change its name to not collide with the original one. If instead it still gets deleted then we have to create a new TBufferMergerFile and not a simple TFile, but we couldn't attach it to the TBufferMerger inside TTree::ChangeFile.

@eguiraud eguiraud moved this from Needs triage to Rest in Triage Oct 6, 2020
@amadio
Copy link
Member

amadio commented Oct 6, 2020

Since merging implies the intention of creating a single file at the end, I recommend disabling the mechanism to change files by setting the max TTree size to the maximum possible size. This mechanism cannot be controlled from within TBufferMerger since it doesn't have access to the objects that are being merged, so I think that it will be difficult to find a solution that doesn't involve changing TTree as well (i.e. like adding a ChangeFile method to TBufferMerger and calling that from TTree::ChangeFile if necessary. That shouldn't change the file the TTree is associated with, but the output file used by TBufferMerger to have the expected effect). Note, however, that as a complication you may have multiple threads writing to different trees each attached to a different TBufferMergerFile switching files at different times as they reach their size limit and that may trigger several small files to appear in the end if they reach their limit around the same time and try to change files nthreads number of times... Another complication is what happens to other kinds of objects (like histograms), as they may now be split among several output files rather than being in one merged output file. Disabling the mechanism sounds like the better option to me.

@amadio
Copy link
Member

amadio commented Oct 6, 2020

A better option for disabling the mechanism that I just considered with @vepadulano is to change TTree::ChangeFile to be a noop if the file it's attached to is a TBufferMergerFile. That has the advantage that if users try to use TTree with TBufferMerger by hand (i.e. outside of RDataFrame), then it would work for them too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

3 participants