Skip to content

Conversation

@pcanal
Copy link
Member

@pcanal pcanal commented Mar 3, 2021

See discussion at #7286

This set of improvements to TBufferMerger (and more) was inspired by the terrible performance of the parallel merging (and fast merging in general) in the case where the TTree has a very large number of branches (1000+).

Where in the original version a TBufferMerger with a file with 1000+ branches and only 50s and ran with any number of threads would take more than 3m (did not wait until the end) the new version takes 11s with 1 thread, 8s with 6 threads and 22s with 6 thread when increased to 500 events. (using the CMS file ../data//250202_181_RECO.root).

This PR includes:

  • Skipping the boxing/compressing/uncompressing/unboxing cycle if the TBufferMerger is available (not already merging) when the thread is writing its TMemFile.

  • Skipping SetBranchAddress and SetMakeClass in when doing fast cloning (where that information is not used anyway).

  • Replacing calling to the slow TObjArray::GetEntries (which counts the slot used) by calling GetEntriesFast.

  • Speeding up the GetMother implementation (caching parent's address sooner when reading, use that information in GetMother).

  • In fast cloning, delay writing the output until the last input is processed (instead of writing the output after each input),

  • Optimization of fast cloning handing of empty write basket.

It also contains a couple of bug fixes (RNtuple merging forgetting to merge the result of the objects in the file, iofeatures incorrectly cloned).

Also made TBufferMerger::GetQueueSize actually thread safe (it is necessary to use it to stop the producer from adding more data if the queue is too full).

Also improved TClass::GetBaseClassOffset parallelism (benefit parallel boxing/unboxing) and reduced/removed contention see in the mechanism use to determine the actual object type at run-time (TIsAProxy).

pcanal added 30 commits March 2, 2021 16:20
fParent was already set during the usual branch creation (TTree::Branch and friends) but
might still be not set if the user create a hierarchy by hand
When fMother is not set, first try to use fParent rather than iterate through
the TTree's list of top level branches
This allow the fast cloning of the first tree in the list when using TTree::MergeTrees (rare)
This speeds up the first step of merging (hadd)
This can now be safely use to avoid overloading the queue.
This can be used to monitor the queue size in bytes rather than number of files
We do not apply the same to CloneTree as the addresses copying
is a side-effect (no reset at the end) that is relied upon by
user code.g
Limited to TTree and Histogram.
TObjArray::GetEntries() counts the number of non-zero slots
while TObjArray::GetEntriesFast() return (in most case) the value of a high water mark (the slot over which all remaining slots
are empty).

In one use case (doing read a TTree and do many TTree cloning (via TBufferMerger), TObjArray::GetEntries was taking 8% of the
running time (compression was disabled)

Note: TObjArray::IsEmpty use GetEntriesFast rather than GetEntries.
For iteration over the list of baskets, the existing code try to reduce the range of the iteration
to stop as soon as all non-nullptr slot have been seen.
However to know the number of non-nullptr slot, the code (has to) call GetEntries which scan the
entire array anyway ...
Currently only in the case where we have a Merge function and a dictionary.
If TBufferMerger can take the merge lock, then rather than queuing the input file, merge it immediately
and skip the Writing the TTree, doing a memcpy, Pushing to the queue then Reading and then deleting parts.
Add new enum for TObject::Write: kOnlyPrepStep  = BIT(3)         ///< Just prepare the objects do not write them (i.e. call TTree::FlushBaskets)
TClass::GetBaseClassOffset is a 'often' used routine in the I/O.
In the fast path (looking up in a cache), rather than taking the
global read lock (which is not only global but also 'slower' than
a regular mutex), we are now use a instance specific mutex to
protect the cache.
To save space, the compression can be re-enabled by TBuffferMerger::SetCompressTemporaryKeys.

The TBaskets are still compressed and thus the end result is unchanged.
This allows to significant reduce the number of search through the list of cached result ...
That search was a noticeable cause of slow down with an increase of number threads due to the contention
on updating the number of readers (used for the local read/write lock mechanism).
Since fClass is modified only 'once' at init time of the IsAProxy object, it is actual more efficient
to protect the initialization by a lock rather than having fClass being atomic.
@phsft-bot
Copy link

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-fedora30/cxx14, ROOT-fedora31/noimt, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14
How to customize builds

@pcanal
Copy link
Member Author

pcanal commented Mar 3, 2021

@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=On

@phsft-bot
Copy link

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-fedora30/cxx14, ROOT-fedora31/noimt, ROOT-ubuntu16/nortcxxmod, mac1014/python3, mac11.0/cxx17, windows10/cxx14 with flags -DCTEST_TEST_EXCLUDE_NONE=On
How to customize builds

@pcanal pcanal merged commit 7966dee into root-project:v6-22-00-patches Mar 3, 2021
@pcanal pcanal deleted the v622-BufferMergerImprove branch March 3, 2021 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants