
xr.map_blocks in combination with to_zarr results in very suboptimal dask graphs #10745

@mark-boer

Description

What is your issue?

I'm trying to use map_blocks in combination with to_zarr, but this unfortunately results in very suboptimal dask graphs.

The issue, as I understand it: map_blocks needs all data variables for each chunk, which on its own works fine. But to_zarr adds a finalize task per data variable, so dask tries to write the arrays more or less one variable at a time. The combination of the two makes memory explode, because chunks have to be held in memory until their variable's finalize task runs.

Very simple example:

import dask.array as da
import numpy as np
import xarray as xr

# build a small dataset with two auxiliary coordinates along "time"
ds = xr.DataArray(da.arange(30 * 3 * 4).reshape(30, 3, 4), dims=("time", "y", "x")).to_dataset(name="hello")
ds.coords["doy"] = "time", np.arange(30)
ds.coords["y"] = "time", np.arange(30)
ds = ds.chunk(time=3, y=-1, x=-1)
ds.to_zarr("./tmp/in3.zarr", mode="w", zarr_format=2)

ds = xr.open_dataarray("./tmp/in3.zarr", chunks={}, inline_array=True)

mapped = xr.map_blocks(lambda x: x + 1, ds, template=ds)

delayed_obj = mapped.drop_encoding().to_zarr("./tmp/out4.zarr", mode="w", zarr_format=2, compute=False)
delayed_obj.visualize("map-blocks-to-zarr-2.svg", color="order", cmap="autumn", optimize_graph=False)
[Image: task graph of the delayed to_zarr (map-blocks-to-zarr-2.svg), colored by dask ordering, showing the suboptimal schedule]
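
Not part of the original report, but one way to see the per-variable finalize structure is to inspect the layers of the delayed object's high-level graph (exact layer names vary across dask/xarray versions):

# sketch: list the graph layers of the delayed store; with multiple
# data variables you should see one store/finalize-style layer per variable
hlg = delayed_obj.__dask_graph__()
print(list(hlg.layers))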

What I expected:
I had hoped that chunks would be written to disk as soon as they were computed, so I expected the graph to look more like the one below (this is what we see when there is only a single data variable):

[Image: expected task graph, in which each chunk is written immediately after it is computed]

What I see:
Because of the finalize tasks and the way dask ordering works, nearly all of the data is loaded into memory before much of it is written.
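
This can be checked (my addition, not from the report) with dask's static ordering, which assigns lower numbers to tasks scheduled earlier; the write tasks end up with late priorities:

from dask.order import order

# sketch: materialize the graph and inspect dask's static task priorities;
# write tasks receiving late (high) numbers confirms loads are ordered first
priorities = order(dict(delayed_obj.__dask_graph__()))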

Questions:

  1. Is there a way to work around this, for example by forcing the ordering to be more sensible? (One possible approach is sketched below.)
  2. Is there a way to fix this in xarray itself?
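
Not from the original issue, but one workaround sketch for question 1: do a metadata-only write first, then stream aligned region writes one time-chunk at a time, so that all variables of a chunk are computed and flushed together. The chunk size of 3 and the output path are taken from the example above; the loop trades cross-chunk parallelism for bounded memory:

# metadata-only write; data writes are deferred (and discarded here)
mapped.drop_encoding().to_zarr("./tmp/out4.zarr", mode="w", zarr_format=2, compute=False)

for start in range(0, mapped.sizes["time"], 3):  # 3 = time chunk size
    region = {"time": slice(start, start + 3)}
    # each iteration computes and writes one time-chunk of every variable,
    # keeping peak memory at roughly one chunk's worth of data
    mapped.isel(time=region["time"]).drop_encoding().to_zarr("./tmp/out4.zarr", region=region)

Since the region slices align with the zarr chunk boundaries, each write touches whole chunks only; a distributed-friendly variant could submit the per-region writes as separate tasks instead of looping serially.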
