[RDF] Add internal GetDatasetGlobalClusterBoundaries utility to retrieve cluster entry ranges #21768

Merged
siliataider merged 4 commits into root-project:master from siliataider:rdf-clusterranges
Apr 10, 2026
Conversation

Contributor

@siliataider siliataider commented Apr 1, 2026

This pull request:

Adds GetDatasetGlobalClusterBoundaries as an internal utility to retrieve the entry range of each cluster in a TTree- or RNTuple-based RDataFrame.

When possible, the files are processed in parallel with a ROOT::TThreadExecutor.

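It returns a list of cluster boundaries across all files, expressed as global entry numbers: the boundaries of each file are shifted by the cumulative entry count of the files before it.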

This utility is required by the RDataLoader to shuffle and prefetch data for ML training.

Changes

  • RNTupleDS: add GetDatasetGlobalClusterBoundaries as a friend function to access private members fNTupleName and fFileNames
  • RNTupleDS: set fNTupleName in the single-file constructor (as the multi-file constructor already does)
  • RInterface: add GetDatasetGlobalClusterBoundaries() implementation for both TTree and RNTuple data sources (dispatches to the existing GetClustersAndEntries implementations)
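
For readers unfamiliar with the friend-function pattern mentioned in the first two bullets, here is a minimal sketch. `NTupleSourceSketch` and `DescribeSource` are hypothetical names invented for illustration; they are not the real RNTupleDS class or the utility added by this PR, and only the member names fNTupleName and fFileNames are taken from the description above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data source class; not the real RNTupleDS.
class NTupleSourceSketch {
   std::string fNTupleName;
   std::vector<std::string> fFileNames;

   // Grant exactly one free function access to the private members above.
   friend std::pair<std::string, std::vector<std::string>> DescribeSource(const NTupleSourceSketch &ds);

public:
   // Single-file constructor: it stores the name as well, mirroring the
   // multi-file constructor, so the friend function sees it in both cases.
   NTupleSourceSketch(std::string ntupleName, std::string fileName)
      : fNTupleName(std::move(ntupleName)), fFileNames{std::move(fileName)}
   {
   }

   // Multi-file constructor.
   NTupleSourceSketch(std::string ntupleName, std::vector<std::string> fileNames)
      : fNTupleName(std::move(ntupleName)), fFileNames(std::move(fileNames))
   {
      assert(!fFileNames.empty());
   }
};

// The friend function reads the private members without widening the public API.
std::pair<std::string, std::vector<std::string>> DescribeSource(const NTupleSourceSketch &ds)
{
   return {ds.fNTupleName, ds.fFileNames};
}
```

If the single-file constructor left the name unset, a helper like this would see an empty name for single-file sources, which is presumably why the second bullet is needed.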

@siliataider siliataider self-assigned this Apr 1, 2026
@siliataider siliataider added in:RDataFrame in:ML Everything under ROOT/ML labels Apr 1, 2026
@siliataider
Contributor Author

@vepadulano the multithreaded approach may not be so straightforward: the global offset of a file depends on the cumulative entry count of the previous files, which are opened sequentially.

We could consider a two-step approach: collect the clusters of each file in parallel, then sequentially shift them by the global offset while preserving the original order of the files.
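
The two-step idea can be sketched in plain C++ without ROOT. This is an illustration only, not the code merged in this PR: `FileClusters`, `ReadFileClusters` and `GlobalClusterBoundaries` are hypothetical names, `std::async` stands in for whatever thread pool is actually used, and the per-file metadata is fabricated from an entry count and a cluster size instead of being read from a file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <future>
#include <utility>
#include <vector>

// Hypothetical per-file metadata: [begin, end) cluster ranges in file-local
// entry numbers, plus the number of entries in the file.
struct FileClusters {
   std::vector<std::pair<std::uint64_t, std::uint64_t>> fLocalBoundaries;
   std::uint64_t fNEntries = 0;
};

// Stand-in for opening one file and reading its cluster metadata.
// It fabricates evenly sized clusters so the sketch runs on its own.
FileClusters ReadFileClusters(std::uint64_t nEntries, std::uint64_t clusterSize)
{
   assert(clusterSize > 0);
   FileClusters result;
   result.fNEntries = nEntries;
   for (std::uint64_t begin = 0; begin < nEntries; begin += clusterSize)
      result.fLocalBoundaries.emplace_back(begin, std::min(begin + clusterSize, nEntries));
   return result;
}

// Each input pair is (number of entries, cluster size) for one "file".
std::vector<std::pair<std::uint64_t, std::uint64_t>>
GlobalClusterBoundaries(const std::vector<std::pair<std::uint64_t, std::uint64_t>> &files)
{
   // Step 1: one asynchronous task per file. The futures are stored in input
   // order, so the file order survives regardless of completion order.
   std::vector<std::future<FileClusters>> futures;
   futures.reserve(files.size());
   for (const auto &file : files)
      futures.push_back(std::async(std::launch::async, ReadFileClusters, file.first, file.second));

   // Step 2: sequential pass that shifts local boundaries by the running offset.
   std::vector<std::pair<std::uint64_t, std::uint64_t>> global;
   std::uint64_t offset = 0;
   for (auto &future : futures) {
      const FileClusters clusters = future.get();
      for (const auto &range : clusters.fLocalBoundaries)
         global.emplace_back(range.first + offset, range.second + offset);
      offset += clusters.fNEntries;
   }
   return global;
}
```

Only the metadata reads run concurrently; the offset pass stays sequential because each offset depends on all preceding files, which is the constraint raised in the comment above.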


github-actions bot commented Apr 1, 2026

Test Results

    22 files      22 suites   3d 6h 37m 20s ⏱️
 3 830 tests  3 818 ✅  1 💤 11 ❌
75 611 runs  75 582 ✅ 18 💤 11 ❌

For more details on these failures, see this check.

Results for commit 5442dae.

♻️ This comment has been updated with latest results.

@siliataider siliataider force-pushed the rdf-clusterranges branch 5 times, most recently from cef5c00 to 395d045 Compare April 7, 2026 14:54
Member

@vepadulano vepadulano left a comment

I think a better function name would be GetClusterBoundaries. More generally, I think we should take this chance to merge the functionalities of ROOT::Internal::TreeUtils::GetClustersAndEntries and ROOT::Internal::RDF::GetClustersAndEntries.

All in all, let's discuss how much extra work it would be to implement GetClustersAndEntries as one function in the Internal::RDF namespace that dispatches the retrieval of cluster boundaries and number of entries depending on whether the input dataset is a TTree or an RNTuple.

@siliataider siliataider force-pushed the rdf-clusterranges branch 4 times, most recently from 57dd8d0 to 58b3d7c Compare April 8, 2026 16:35
@siliataider siliataider requested a review from vepadulano April 8, 2026 16:39
Contributor Author

siliataider commented Apr 8, 2026

@vepadulano as a follow-up to the conversation we had earlier, I rewrote GetClusterBoundaries as a function that detects the type of the RDataSource and dispatches accordingly, either to ROOT::Internal::TreeUtils::GetClustersAndEntries for TTree or to ROOT::Internal::RDF::GetClustersAndEntries for RNTuple, then unifies the results into a common return format.
It also now uses a pool of threads to open the files and read the metadata concurrently, which makes it a bit faster.
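
A rough sketch of the "detect the type, dispatch, unify" shape described above, written without ROOT. Everything here is an assumption made for illustration: the class names, the use of `dynamic_cast`, and the two backend formats (a list of cluster edges per file for the tree-like case, a list of file-local ranges per file for the ntuple-like case) are invented and may not match what the two GetClustersAndEntries functions return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Common return format: global [begin, end) entry ranges, one per cluster.
using Boundaries = std::vector<std::pair<std::uint64_t, std::uint64_t>>;

// Hypothetical stand-ins for a data source hierarchy.
struct DataSourceBase {
   virtual ~DataSourceBase() = default;
};
struct TreeLikeSource : DataSourceBase {
   // Assumed format: per file, cluster edges such as {0, 4, 8, 10}.
   std::vector<std::vector<std::uint64_t>> fEdgesPerFile;
};
struct NTupleLikeSource : DataSourceBase {
   // Assumed format: per file, file-local [begin, end) ranges.
   std::vector<Boundaries> fRangesPerFile;
};

Boundaries FromEdges(const std::vector<std::vector<std::uint64_t>> &edgesPerFile)
{
   Boundaries out;
   std::uint64_t offset = 0;
   for (const auto &edges : edgesPerFile) {
      assert(edges.empty() || edges.front() == 0); // edges are file-local
      for (std::size_t i = 0; i + 1 < edges.size(); ++i)
         out.emplace_back(edges[i] + offset, edges[i + 1] + offset);
      if (!edges.empty())
         offset += edges.back(); // last edge equals the file's entry count
   }
   return out;
}

Boundaries FromRanges(const std::vector<Boundaries> &rangesPerFile)
{
   Boundaries out;
   std::uint64_t offset = 0;
   for (const auto &ranges : rangesPerFile) {
      for (const auto &range : ranges)
         out.emplace_back(range.first + offset, range.second + offset);
      if (!ranges.empty())
         offset += ranges.back().second; // end of last range equals the entry count
   }
   return out;
}

// Inspect the dynamic type, call the matching backend, return one format.
Boundaries GetBoundaries(const DataSourceBase &ds)
{
   if (auto *tree = dynamic_cast<const TreeLikeSource *>(&ds))
      return FromEdges(tree->fEdgesPerFile);
   if (auto *ntuple = dynamic_cast<const NTupleLikeSource *>(&ds))
      return FromRanges(ntuple->fRangesPerFile);
   throw std::runtime_error("Unsupported data source type");
}
```

With this shape, callers such as a data loader depend only on the common format and never need to know which backend produced it.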

@siliataider siliataider changed the title [RDF] Add internal GetClusterRanges utility to retrieve cluster entry ranges [RDF] Add internal GetClusterBoundaries utility to retrieve cluster entry ranges Apr 8, 2026
Member

@vepadulano vepadulano left a comment

Very nice! I left some comments to refine the PR.

@siliataider siliataider changed the title [RDF] Add internal GetClusterBoundaries utility to retrieve cluster entry ranges [RDF] Add internal GetDatasetGlobalClusterBoundaries utility to retrieve cluster entry ranges Apr 9, 2026
@siliataider siliataider requested a review from vepadulano April 9, 2026 11:04
Member

@vepadulano vepadulano left a comment

Nice work, thanks!

@siliataider siliataider merged commit 257fd07 into root-project:master Apr 10, 2026
48 of 53 checks passed
Labels

in:ML Everything under ROOT/ML in:RDataFrame

3 participants