-
Notifications
You must be signed in to change notification settings - Fork 1.4k
v6.22 Buffer merger improvement #7339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v6.22 Buffer merger improvement #7339
Conversation
... instead of just always ignoring it
fParent was already set during the usual branch creation (TTree::Branch and friends) but might still be not set if the user create a hierarchy by hand
When fMother is not set, first try to use fParent rather than iterate through the TTree's list of top level branches
This allow the fast cloning of the first tree in the list when using TTree::MergeTrees (rare)
This speeds up the first step of merging (hadd)
This can now be safely use to avoid overloading the queue.
This can be used to monitor the queue size in bytes rather than number of files
We do not apply the same to CloneTree as the addresses copying is a side-effect (no reset at the end) that is relied upon by user code.g
Limited to TTree and Histogram.
TObjArray::GetEntries() counts the number of non-zero slots while TObjArray::GetEntriesFast() return (in most case) the value of a high water mark (the slot over which all remaining slots are empty). In one use case (doing read a TTree and do many TTree cloning (via TBufferMerger), TObjArray::GetEntries was taking 8% of the running time (compression was disabled) Note: TObjArray::IsEmpty use GetEntriesFast rather than GetEntries.
For iteration over the list of baskets, the existing code try to reduce the range of the iteration to stop as soon as all non-nullptr slot have been seen. However to know the number of non-nullptr slot, the code (has to) call GetEntries which scan the entire array anyway ...
Currently only in the case where we have a Merge function and a dictionary.
If TBufferMerger can take the merge lock, then rather than queuing the input file, merge it immediately and skip the Writing the TTree, doing a memcpy, Pushing to the queue then Reading and then deleting parts.
Add new enum for TObject::Write: kOnlyPrepStep = BIT(3) ///< Just prepare the objects do not write them (i.e. call TTree::FlushBaskets)
TClass::GetBaseClassOffset is a 'often' used routine in the I/O. In the fast path (looking up in a cache), rather than taking the global read lock (which is not only global but also 'slower' than a regular mutex), we are now use a instance specific mutex to protect the cache.
To save space, the compression can be re-enabled by TBuffferMerger::SetCompressTemporaryKeys. The TBaskets are still compressed and thus the end result is unchanged.
This allows to significant reduce the number of search through the list of cached result ... That search was a noticeable cause of slow down with an increase of number threads due to the contention on updating the number of readers (used for the local read/write lock mechanism).
Since fClass is modified only 'once' at init time of the IsAProxy object, it is actual more efficient to protect the initialization by a lock rather than having fClass being atomic.
…tween now and the next search).
This is necessary to support the 'fast' incremental merge where the input will continue to be used.
This enum will be removed shortly (functionality moved from TObject::Write to *::Merge)
|
Starting build on |
|
@phsft-bot build with flags -DCTEST_TEST_EXCLUDE_NONE=On |
|
Starting build on |
See discussion at #7286
This set of improvements to TBufferMerger (and more) was inspired by the terrible performance of the parallel merging (and fast merging in general) in the case where the TTree has a very large number of branches (1000+).
Where in the original version a TBufferMerger with a file with 1000+ branches and only 50s and ran with any number of threads would take more than 3m (did not wait until the end) the new version takes 11s with 1 thread, 8s with 6 threads and 22s with 6 thread when increased to 500 events. (using the CMS file ../data//250202_181_RECO.root).
This PR includes:
Skipping the boxing/compressing/uncompressing/unboxing cycle if the TBufferMerger is available (not already merging) when the thread is writing its TMemFile.
Skipping SetBranchAddress and SetMakeClass in when doing fast cloning (where that information is not used anyway).
Replacing calling to the slow TObjArray::GetEntries (which counts the slot used) by calling GetEntriesFast.
Speeding up the GetMother implementation (caching parent's address sooner when reading, use that information in GetMother).
In fast cloning, delay writing the output until the last input is processed (instead of writing the output after each input),
Optimization of fast cloning handing of empty write basket.
It also contains a couple of bug fixes (RNtuple merging forgetting to merge the result of the objects in the file, iofeatures incorrectly cloned).
Also made TBufferMerger::GetQueueSize actually thread safe (it is necessary to use it to stop the producer from adding more data if the queue is too full).
Also improved TClass::GetBaseClassOffset parallelism (benefit parallel boxing/unboxing) and reduced/removed contention see in the mechanism use to determine the actual object type at run-time (TIsAProxy).