
Improve startup time of distributed RDataFrame application #8232

Closed · vepadulano opened this issue May 25, 2021 · 1 comment

@vepadulano (Member)
Explain what you would like to see improved

Currently, a setup step runs in the client before the distributed computations actually start. During this setup, a list of ranges of entries from the original dataset is computed. The logic is as follows:

  1. For each file in the dataset, open it and compute the list of all clusters it contains.
  2. Divide the resulting list of clusters from all files into groups of clusters (Ranges), according to the npartitions parameter of the dataframe (see the sketch after the code snippet below).

Each Range will be assigned its own task on the distributed resources. Step 1 above can be particularly expensive to run since it relies on TFile::Open. If the files of the dataset are stored remotely, this overhead adds up quickly. The call happens specifically in:

import ROOT

for filename in filelist:
    # Opening the file is the expensive step, especially over the network
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    # Iterate over the clusters of entries in this tree
    it = t.GetClusterIterator(0)
Optional: share how it could be improved

Ideally we could avoid calling TFile::Open in the client altogether. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size, and from that compute the number of tasks to run on the distributed resources:

If you have these files:

    50 MB
    100 MB
    300 MB
    3 GB

then I'd translate that to cluster estimates:

    2
    3
    10
    100

and split this into n tasks accordingly.

Each task on a distributed worker would then be responsible for opening only the file(s) that contain its estimated clusters. This needs to be explored.
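
A minimal sketch of such an estimate, assuming local files and a hypothetical average cluster size of about 30 MB (the ratio implied by the numbers above, not a ROOT default); a real heuristic would also need the sizes of remote files, e.g. from the storage system's metadata instead of os.path.getsize:

import os

# Hypothetical average cluster size (~30 MB), matching the ratio in the
# examples above; this constant is an assumption, not a ROOT default.
AVG_CLUSTER_BYTES = 30 * 1024 ** 2

def estimate_clusters(filenames):
    # Estimate clusters per file from file size alone, avoiding the
    # expensive TFile::Open on the client.
    return [max(1, round(os.path.getsize(f) / AVG_CLUSTER_BYTES))
            for f in filenames]

The estimated counts could then feed the same splitting logic as above, so that each task knows which files overlap its cluster range and opens only those.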

Additional context

Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.

@vepadulano (Member, Author)

This issue is fixed by #10098
