Description
What is your issue?
I'm trying to use map_blocks in combination with to_zarr, but this unfortunately results in very suboptimal dask graphs.
The issue, as I understand it: map_blocks requires all data variables for each chunk, which works fine on its own. But to_zarr adds a finalize task per data variable, which means dask tries to write the dask arrays more or less one at a time. The combination of the two means that memory usage explodes.
Very simple example:

```python
import dask.array as da
import numpy as np
import xarray as xr

ds = xr.DataArray(
    da.arange(30 * 3 * 4).reshape(30, 3, 4), dims=("time", "y", "x")
).to_dataset(name="hello")
ds.coords["doy"] = "time", np.arange(30)
ds.coords["y"] = "time", np.arange(30)
ds = ds.chunk(time=3, y=-1, x=-1)
ds.to_zarr("./tmp/in3.zarr", mode="w", zarr_format=2)

ds = xr.open_dataarray("./tmp/in3.zarr", chunks={}, inline_array=True)
mapped = xr.map_blocks(lambda x: x + 1, ds, template=ds)
delayed_obj = mapped.drop_encoding().to_zarr(
    "./tmp/out4.zarr", mode="w", zarr_format=2, compute=False
)
delayed_obj.visualize(
    "map-blocks-to-zarr-2.svg", color="order", cmap="autumn", optimize_graph=False
)
```

What I expected:
I had hoped that finished chunks would be written to disk immediately, so I expected the graph to look more like this (which is what we see when there is only a single data variable):

What I see:
Because of the finalize tasks and the way that dask ordering works, I see that nearly all of the data is loaded into memory before much of it is written.
Questions:
- Is there a way to work around this, by forcing the ordering to be more logical?
- Is there a way to fix this in xarray itself?