[DF] Crash with distributed RDataFrame on dask with dask_jobqueue #9429
Comments
I should add that when running the RDF on an existing file and TTree, there is a different exception.

With the code:

```python
import ROOT
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(cores=1, processes=1, memory="1GB", disk="0.1GB",
                          job_extra={"jobflavour": "espresso"})
cluster.scale(jobs=1)
client = Client(cluster)

df = RDataFrame("Events", "tree.root", daskclient=client)
prod = df.Mean("myCol")
val = prod.GetValue()
print(f"Value: {val}")
```

And the file […]
Hi @swertz,

In the case of Dask, this is done by accessing the […]. In particular, the `print(client.scheduler_info())` call after you create your […]. This effect can be mitigated if you first call […]. With that being said, the […]
As for your second comment: I think you are still seeing the effect of HTCondor not yet having handed resources over to Dask. Try again adding […]
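The polling behaviour behind dask's `Client.wait_for_workers` can be sketched in plain Python. The `wait_for_workers` helper below and its `get_n_workers` callback are illustrative assumptions for this sketch, not the dask implementation:

```python
import time

def wait_for_workers(get_n_workers, n, timeout=60.0, poll=0.5):
    """Block until get_n_workers() reports at least n workers, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_n_workers() >= n:
            return True
        time.sleep(poll)
    return False

# Simulated cluster that gains one worker per poll:
state = {"workers": 0}
def fake_count():
    state["workers"] += 1
    return state["workers"]

print(wait_for_workers(fake_count, 3, timeout=5.0, poll=0.01))  # True
```

The point is simply that the call blocks until the batch system has actually started the requested workers, instead of letting the computation begin on an empty cluster.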
Thanks for the quick feedback! I've tried with […]. I'll try again once #9431 is merged; in particular, I'm interested in using Dask's adaptive worker management, so that the exact number of submitted jobs is not fixed a priori but adapts automatically to the workload…
The fact that […]:

```python
import time
import random
from dask import delayed
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(cores=1, processes=1, memory="1GB", disk="0.1GB",
                          job_extra={"jobflavour": "espresso"})
cluster.scale(jobs=1)
client = Client(cluster)

# Try with and without this
client.wait_for_workers(1)

def inc(x):
    time.sleep(random.random())
    return x + 1

def dec(x):
    time.sleep(random.random())
    return x - 1

def add(x, y):
    time.sleep(random.random())
    return x + y

inc = delayed(inc)
dec = delayed(dec)
add = delayed(add)

x = inc(1)
y = dec(2)
z = add(x, y)
print(f"Result is: {z.compute()}")
```

Coming back to RDF, the […]:

```python
df = RDataFrame("treename", "filename.root", daskclient=client, npartitions=NPARTITIONS)
```

A good parallelisation can often be obtained when the number of partitions is roughly 3x the number of cores you can use.
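As a minimal illustration of that heuristic, one could compute a candidate `npartitions` from the available core count. The `suggest_npartitions` helper and its default factor of 3 are assumptions for this sketch, not RDataFrame API:

```python
import os

def suggest_npartitions(ncores=None, factor=3):
    """Heuristic from this thread: roughly `factor` partitions per core."""
    if ncores is None:
        ncores = os.cpu_count() or 1  # fall back to 1 if undetectable
    return max(1, factor * ncores)

# e.g. with 8 cores available this suggests 24 partitions
print(suggest_npartitions(8))
```

The resulting number would then be passed as `npartitions` when constructing the distributed RDataFrame.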
Sounds good, then the […]
Strange, I've tried the dask-only example (with lxbatch), but in both cases (with or without the […]). About the choice of […]
Deciding how many tasks to submit to a distributed scheduler is a fine art in its own right :) From what I understand of your case, I would try to test first with […]. Not sure about the details of the […]
I managed to get it to work on HTCondor/lxbatch thanks to the instructions here. The distributed RDF case (with the […]):

```python
import socket

cluster = HTCondorCluster(cores=1, processes=1, memory="1GB", disk="0.1GB",
                          job_extra={
                              "+JobFlavour": '"espresso"',
                              'should_transfer_files': 'Yes',
                              'when_to_transfer_output': 'ON_EXIT',
                              'getenv': 'True'
                          },
                          death_timeout='60',
                          scheduler_options={
                              'port': 8786,
                              'host': socket.gethostname()
                          },
                          extra=['--worker-port 10000:10100'])
```

Note that for the dask-only example I had to use futures instead of […]:

```python
x = client.submit(inc, 1)
y = client.submit(inc, 2)
z = client.submit(add, x, y)
print(f"Result is: {z.result()}")
```

The dask documentation is not very enlightening about the use of […]. As for using the […]

If you'd like, I'll try again with the next nightly build including your merged PR, without the explicit call to […]
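For readers without a batch cluster at hand, the same futures-style submission pattern can be sketched with the standard library's `concurrent.futures` as a stand-in for `dask.distributed.Client.submit`. One caveat worth noting in the sketch: unlike dask, the stdlib executor does not resolve future arguments automatically, so `.result()` must be called explicitly before chaining:

```python
from concurrent.futures import ThreadPoolExecutor

def inc(x):
    return x + 1

def add(x, y):
    return x + y

with ThreadPoolExecutor(max_workers=2) as ex:
    # submit() returns a Future immediately, analogous to client.submit in dask
    x = ex.submit(inc, 1)
    y = ex.submit(inc, 2)
    # stdlib futures must be resolved before being passed to another task
    z = ex.submit(add, x.result(), y.result())
    print(f"Result is: {z.result()}")  # Result is: 5
```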
If I may abuse the thread with a naive question: how do you get the number of clusters in a file?
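For reference, one way to count a TTree's clusters is ROOT's `TTree::GetClusterIterator`, which yields the start entry of each cluster in turn. This is a sketch assuming a file `tree.root` containing a tree `Events` (the file and tree names are taken from the earlier reproducer, not from an actual answer in this thread):

```python
import ROOT

f = ROOT.TFile.Open("tree.root")  # assumed file name
tree = f.Get("Events")            # assumed tree name

it = tree.GetClusterIterator(0)   # iterate cluster start entries from entry 0
nclusters = 0
start = it.Next()
while start < tree.GetEntries():
    nclusters += 1
    start = it.Next()

print(f"Number of clusters: {nclusters}")
```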
Thanks for all the insights! We are still learning how to cope with all the different interfaces. It is possible that at some point all this extra configuration will be collected in a single place, to make it easier for new users to activate directly from distributed RDataFrame. It would be amazing if you could try your reproducer again with the next nightlies, if you have time. Thank you so much 😄!
Describe the bug
I've been trying out the new RDF.Experimental.Distributed.Dask.RDataFrame in ROOT master, which is a great addition. It seems to work fine when using a single-machine cluster of workers (dask.distributed.LocalCluster), but it fails when using a batch cluster (either dask_jobqueue.HTCondorCluster or dask_jobqueue.SLURMCluster).

To Reproduce
Minimal example, run on lxplus/lxbatch.

Setup:

```shell
source /cvmfs/sft.cern.ch/lcg/views/dev3/latest/x86_64-centos7-gcc11-opt/setup.sh
pip install dask-jobqueue
```