Currently, a setup step runs in the client before the distributed computations actually start. During the setup, a list of ranges of entries from the original dataset is computed. The logic is as follows:

1. For each file of the dataset, open it and compute the list of all the clusters in the file.
2. From the list of all clusters of all files, divide it into groups of clusters (`Range`s) depending on the `npartitions` parameter of the dataframe.

Each `Range` will be assigned its own task on the distributed resources. Step 1 above can be particularly expensive to run, since it relies on `TFile::Open`. If the files of the dataset are stored remotely, the overhead adds up quickly. The call happens specifically in `root/bindings/experimental/distrdf/python/DistRDF/Node.py`, lines 363 to 368 in db3d424.
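The two steps above can be sketched in plain Python. This is a hypothetical illustration, not the actual DistRDF code: `get_clusters` is a placeholder for the real logic that opens each file, and `build_ranges` shows one way to split the flat cluster list into `npartitions` groups.

```python
def get_clusters(filename):
    """Placeholder for step 1: in DistRDF this opens the file
    (via TFile::Open) and reads the entry boundaries of each
    cluster, which is what makes the client-side setup expensive."""
    raise NotImplementedError

def build_ranges(cluster_boundaries, npartitions):
    """Step 2: split a flat, ordered list of clusters into
    npartitions groups (Ranges), each destined for one task.

    cluster_boundaries: list of (start, end) entry pairs.
    Returns a list of npartitions lists of clusters.
    """
    nclusters = len(cluster_boundaries)
    base, extra = divmod(nclusters, npartitions)
    ranges, pos = [], 0
    for i in range(npartitions):
        # Spread any remainder over the first `extra` partitions
        size = base + (1 if i < extra else 0)
        ranges.append(cluster_boundaries[pos:pos + size])
        pos += size
    return ranges
```

For example, 10 clusters split over 3 partitions yields groups of 4, 3, and 3 clusters.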
Ideally we could avoid calling `TFile::Open` in the client altogether. @Axel-Naumann proposed on Mattermost to estimate the number of clusters in each file from its size, and consequently compute the number of tasks to run on the distributed resources:
> If you have these files:
>
> | File size | Cluster estimate |
> | --------- | ---------------- |
> | 50 MB     | 2                |
> | 100 MB    | 3                |
> | 300 MB    | 10               |
> | 3 GB      | 100              |
>
> and split this into n tasks accordingly.
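A minimal sketch of such an estimate follows. The ~30 MB-per-cluster figure is an assumption inferred from the example numbers above (it reproduces them), not a ROOT constant; a real implementation would need to pick or tune this value.

```python
# Assumed average on-disk cluster size, inferred from the example above.
AVG_CLUSTER_SIZE_MB = 30

def estimate_clusters(file_size_mb):
    """Estimate a file's cluster count from its size alone,
    so the client never has to call TFile::Open."""
    return max(1, round(file_size_mb / AVG_CLUSTER_SIZE_MB))
```

With this assumption, file sizes of 50 MB, 100 MB, 300 MB, and 3 GB map to estimates of 2, 3, 10, and 100 clusters, matching the table above.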
Each task on a distributed worker would then be responsible for opening only the file(s) where its estimated clusters should be stored. This needs to be explored.
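One way this could work is to lay the per-file estimates out as a global cluster index, slice that index into tasks, and record which files each slice overlaps. This is an exploratory sketch under the size-based-estimate assumption, not DistRDF code; the function name and signature are hypothetical.

```python
def files_for_tasks(file_estimates, ntasks):
    """file_estimates: list of (filename, estimated_nclusters) pairs.
    Returns, for each of ntasks tasks, the list of files that task
    would need to open."""
    # Global (estimated) cluster index range covered by each file
    spans, start = [], 0
    for name, nclusters in file_estimates:
        spans.append((name, start, start + nclusters))
        start += nclusters
    total = start
    base, extra = divmod(total, ntasks)
    tasks, pos = [], 0
    for i in range(ntasks):
        size = base + (1 if i < extra else 0)
        lo, hi = pos, pos + size
        # A file is needed if its span overlaps the task's slice
        tasks.append([n for n, s, e in spans if s < hi and e > lo])
        pos = hi
    return tasks
```

Note that the estimates only decide which files a worker opens; the worker would still have to reconcile the estimated cluster boundaries with the real ones once it opens the file.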
Additional context
Thanks to @stwunsch for bringing this up. This issue will keep track of further discussion and updates on the matter.