
tree-reduce the combine for open_mfdataset(..., parallel=True, combine="nested") #8523

Open
dcherian opened this issue Dec 5, 2023 · 4 comments

@dcherian (Contributor) commented Dec 5, 2023

Is your feature request related to a problem?

When parallel=True and a distributed client is active, Xarray reads every file in parallel, constructs a Dataset per file (with indexed coordinates loaded), and then ships all of those Datasets back to the "head node" for the combine.

Instead, we can tree-reduce the combine (example) by switching from dask.delayed to dask.bag, and skip the overhead of shipping thousands of copies of an indexed coordinate back to the head node (a rough sketch follows the list below).

  1. The downside is that the dask graph is "worse", but perhaps that shouldn't stop us.
  2. I think this is only feasible for combine="nested".
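
A rough, untested sketch of what the dask.bag version could look like. The `combine` helper, the file list, the "time" concat dimension, and the fan-in of 8 are all illustrative assumptions, not settled API; a real implementation would live inside open_mfdataset:

```python
# Sketch only: tree-reduce the nested combine with dask.bag instead of
# dask.delayed. The paths and the "time" dimension are hypothetical.
import dask.bag as db
import xarray as xr

def combine(datasets):
    # Runs remotely on the workers, at every level of the reduction tree.
    return xr.combine_nested(list(datasets), concat_dim="time")

paths = ["file_000.nc", "file_001.nc"]  # placeholder file list

# One file per partition, opened on the workers rather than the head node.
bag = db.from_sequence(paths, npartitions=len(paths)).map(xr.open_dataset)

# split_every=8 combines 8 datasets at a time, then 8 partial results at a
# time, and so on; only the final combined Dataset ships to the head node.
combined = bag.reduction(combine, combine, split_every=8).compute()
```

The graph this builds is a multi-level reduction tree rather than a single fan-in node, which is presumably the "worse" graph referred to in point 1 above.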

cc @TomNicholas

@TomNicholas (Contributor)

Oh this is an interesting idea...

How much faster is this? What does the graph look like? (The notebook in the gist doesn't seem to show either)

> skip the overhead of shipping thousands of copies of an indexed coordinate back to the head node

What is this proposal doing instead? Don't the coordinates still ultimately get shipped to be on the same node in order to do the alignment?

@dcherian (Contributor, Author) commented Dec 5, 2023

> How much faster is this?

Haven't tested; happy to say I don't use open_mfdataset any more :). I am just posting this experiment so someone else can pursue it if they want.

> Don't the coordinates still ultimately get shipped to be on the same node in order to do the alignment?

No, it'll execute the combine 8 datasets at a time, then combine the results of that step 8 at a time, and so on, all remotely, and ship only the final combined dataset back to the head node.
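
To make the fan-in concrete (illustrative numbers only), the number of remote combine rounds grows logarithmically in the file count:

```python
# Illustrative only: rounds of tree reduction for 4096 files, fan-in of 8.
n, rounds = 4096, 0
while n > 1:
    n = -(-n // 8)  # ceiling division: each round combines up to 8 inputs
    rounds += 1
print(rounds)  # 4 rounds: 4096 -> 512 -> 64 -> 8 -> 1
```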

@TomNicholas (Contributor)

> Haven't tested; happy to say I don't use open_mfdataset any more :)

I used it today for the first time in a while 😅, mostly because of fsspec/kerchunk#386.

> No, it'll execute the combine 8 datasets at a time, then combine the results of that step 8 at a time, and so on, all remotely, and ship only the final combined dataset back to the head node.

I'm definitely missing something, but won't the same amount of data still need to be moved around in the end? Is this potentially faster just because the communication doesn't all clobber the lone head node at once?

@dcherian (Contributor, Author)

> but won't the same amount of data still need to be moved around in the end?

In the Coiled pattern, where you orchestrate remote workers but download results to the user's machine, that is a lot of copies moving to the user's machine. I agree this is less of a concern in remote JupyterHub deployments or HPC environments, but I bet you'll still see an improvement when opening O(10,000) files.
