
Improve startup time of distributed RDataFrame application #8232

Closed · vepadulano opened this issue May 25, 2021 · 1 comment

@vepadulano (Member)
Explain what you would like to see improved

Currently, a setup step runs in the client before the distributed computations actually start. During this setup, a list of ranges of entries from the original dataset is computed. The logic is as follows:

  1. For each file in the dataset, open it and compute the list of all clusters it contains.
  2. Divide the resulting list of clusters from all files into groups of clusters (Ranges), according to the npartitions parameter of the dataframe (see the sketch after the code snippet below).

Each Range will be assigned its own task on the distributed resources. Step 1 above can be particularly expensive to run since it relies on TFile::Open. If the files of the dataset are stored remotely, this overhead adds up quickly. The call happens specifically in:

import ROOT

for filename in filelist:
    # Opening the file is the expensive step, especially over the network
    f = ROOT.TFile.Open(filename)
    t = f.Get(treename)
    entries = t.GetEntriesFast()
    # Iterate over the clusters of entries in this tree
    it = t.GetClusterIterator(0)
Optional: share how it could be improved

Ideally we could avoid calling TFile::Open in the client altogether. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size, and from that compute the number of tasks to run on the distributed resources:

If you have these files:

    50 MB
    100 MB
    300 MB
    3 GB

then I'd translate that to cluster estimates:

    2
    3
    10
    100

and split this into n tasks accordingly.

Each task on a distributed worker would then be responsible for opening only the file(s) that contain its estimated clusters. This needs to be explored.
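
A minimal sketch of such an estimate, assuming local files and a hypothetical average cluster size of about 30 MB (the ratio implied by the numbers above, not a ROOT default); a real heuristic would also need the sizes of remote files, e.g. from the storage system's metadata instead of os.path.getsize:

import os

# Hypothetical average cluster size (~30 MB), matching the ratio in the
# examples above; this constant is an assumption, not a ROOT default.
AVG_CLUSTER_BYTES = 30 * 1024 ** 2

def estimate_clusters(filenames):
    # Estimate clusters per file from file size alone, avoiding the
    # expensive TFile::Open on the client.
    return [max(1, round(os.path.getsize(f) / AVG_CLUSTER_BYTES))
            for f in filenames]

The estimated counts could then feed the same splitting logic as above, so that each task knows which files overlap its cluster range and opens only those.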

Additional context

Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.

@vepadulano (Member, Author)

This issue is fixed by #10098
