
Add metadata_only param to .to_zarr? #8343

Open
max-sixty opened this issue Oct 19, 2023 · 17 comments · May be fixed by #8460
Labels
enhancement topic-zarr Related to zarr storage library

Comments

@max-sixty
Collaborator

Is your feature request related to a problem?

A leaf from #8245, which has a bullet:

compute=False is arguably a less-than-obvious kwarg meaning "write metadata". Maybe this should be a method, maybe it's a candidate for renaming? Or maybe make_template can be an abstraction over it

I've also noticed that for large arrays, running .to_zarr(compute=False) can take several minutes, despite the indexes being very small. I think this is because it's building a dask task graph — which is then discarded, since the array is written from different machines with the region pattern.

Describe the solution you'd like

Would introducing a metadata_only parameter to to_zarr help here? It would give us (see the sketch below):

  • Better name
  • No dask graph
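
For illustration, a hypothetical sketch of how this could look (metadata_only is the proposal here, not an existing parameter; assume ds is a chunked Dataset and store is any Zarr store):

import xarray as xr

# Write only metadata and indexes up front: no dask graph is built,
# and no chunk data is written.
ds.to_zarr(store, mode="w", metadata_only=True)  # proposed API, not implemented

# Workers then fill in their slices with the existing region pattern.
ds.isel(x=slice(0, 100)).to_zarr(store, region={"x": slice(0, 100)})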

Describe alternatives you've considered

No response

Additional context

No response

@TomNicholas added the topic-zarr (Related to zarr storage library) label on Oct 20, 2023
@shoyer
Member

shoyer commented Oct 21, 2023

Yes, this is a great idea!

@jhamman
Member

jhamman commented Oct 22, 2023

+1, this is a really nice idea.

Related to this, there could also be a write-through cache of sorts. For high-latency stores (e.g. S3), synchronously populating the store metadata can really add up. If we knew we were only writing metadata, we could safely populate all the Zarr JSON objects and then send them in one bulk write step.

The combination of these two features would be a lightning fast Zarr initialization routine 🚀
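
A rough sketch of the bulk-write idea (this is not existing xarray functionality; it assumes zarr-python v2 semantics, fsspec, and a hypothetical S3 path): build all the metadata against a fast in-memory store, then push the keys to the high-latency store in one step.

import fsspec
import zarr

# Creating groups/arrays against a plain dict writes only the metadata keys
# (.zgroup, .zarray, .zattrs); no chunk data is produced.
memory_store = {}
root = zarr.group(store=memory_store)
root.create_dataset("a", shape=(100_000, 100_000), chunks=(1_000, 1_000), dtype="f8")

# One bulk upload instead of many sequential, high-latency round trips.
remote = fsspec.get_mapper("s3://my-bucket/dataset.zarr")  # hypothetical path
remote.setitems(dict(memory_store))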

@max-sixty
Collaborator Author

I came across #8343 recently — this seems to be a similar suggestion to what I was intending. Is that correct?

The challenge is that Xarray needs some way to represent the "schema" for the desired entire dataset. I'm very open to alternatives, but so far, the most convenient way to do this has been to load Dask arrays into an xarray.Dataset.

Is anyone more familiar with whether there is a cost to producing the dask task graph? I'm seeing .to_zarr(compute=False) take well over a minute with large arrays with lots of chunks. And it's only writing very small metadata.

@shoyer
Member

shoyer commented Oct 25, 2023

Is anyone more familiar with whether there is a cost to producing the dask task graph? I'm seeing .to_zarr(compute=False) take well over a minute with large arrays with lots of chunks. And it's only writing very small metadata.

If to_zarr(compute=False) is slow, that's more likely due to Xarray doing lots of sequential blocking IO operations. Once you've made the Dataset object, the dask graphs have already been created.

@max-sixty
Collaborator Author

max-sixty commented Oct 25, 2023

If to_zarr(compute=False) is slow, that's more likely due to Xarray doing lots of sequential blocking IO operations. Once you've made the Dataset object, the dask graphs have already been created.

This is an area I don't understand well, so obviously I defer.

I had thought it was driven by dask, since most of the time is spent in dask/array/optimization.py. But maybe we're saying that this is because xarray is writing indexes during that time? (Though the indexes are fairly small, and it's spending lots of time in a fuse function?)

From py-spy:


  %Own   %Total  OwnTime  TotalTime  Function (filename)
  0.00%   0.00%   14.07s    16.42s   fuse (dask/optimization.py)
  0.00%   0.00%    8.15s     8.15s   reverse_dict (dask/core.py)
  0.00%   0.00%    3.17s     4.17s   make_blockwise_graph (dask/blockwise.py)
  0.00%   0.00%    2.81s     2.94s   _cull_dependencies (dask/blockwise.py)
 20.00%  25.00%    2.50s     3.04s   fuse_slice (dask/array/optimization.py)
  0.00%   0.00%    2.49s     2.49s   keys_in_tasks (dask/core.py)
  0.00%   0.00%    2.19s     2.90s   functions_of (dask/optimization.py)
  0.00%   0.00%    1.32s     9.20s   cull (dask/highlevelgraph.py)
  0.00%   0.00%   0.930s    48.16s   optimize (dask/array/optimization.py)


@dcherian
Contributor

Max, I think you're right. These days, dask has a "lazy graph" (HighLevelGraph) that gets lowered down to an old-style graph expressed as dicts. That lowering is still slow, and that's potentially what's happening here.
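
A toy way to see the two layers (assuming dask's graph protocol; sizes are arbitrary):

import dask.array as da

a = da.zeros((50_000, 50_000), chunks=(100, 100))  # 250,000 chunks
hlg = a.__dask_graph__()  # HighLevelGraph: built lazily, cheap to create
dsk = dict(hlg)           # lowering to a flat dict of tasks is the slow part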

@dcherian
Contributor

dcherian commented Oct 25, 2023

Yeah, here's the call to optimize in dask.array.store, before it handles the compute kwarg:

https://github.com/dask/dask/blob/1203b1bb6d52b1fb00d54656af4af8e6e35aa615/dask/array/core.py#L1169-L1171

@dcherian
Contributor

I really think we're just taking advantage of a side-effect by using compute=False for this use-case. I like the idea of a separate method, say initialize_zarr, for this.

@max-sixty
Collaborator Author

max-sixty commented Oct 25, 2023

Yes, we could do either.

How difficult do we think this is? Is it something I can bang out in 45 mins or is it a bigger effort that requires more context?

@max-sixty
Collaborator Author

FYI — for the moment, xr.ones_like(ds).to_zarr(compute=False) saves a few minutes each go! (though the data needs to be floats)

@shoyer
Member

shoyer commented Nov 7, 2023

FYI — for the moment, xr.ones_like(ds).to_zarr(compute=False) saves a few minutes each go! (though the data needs to be floats)

In xarray-beam we use zeros_like, which I believe works for any NumPy dtype.
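
For reference, a minimal sketch of that workaround (assuming ds is dask-backed and store is any Zarr store):

import xarray as xr

# zeros_like keeps dims, dtypes, and chunking but swaps in trivial zero
# arrays, so the graph that to_zarr(compute=False) optimizes is cheap.
xr.zeros_like(ds).to_zarr(store, mode="w", compute=False)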

@max-sixty
Collaborator Author

How difficult do we think this is? Is it something I can bang out in 45 mins or is it a bigger effort that requires more context?

In xarray-beam we use zeros_like, which I believe works for any NumPy dtype.

Could this just be running xr.zeros_like(ds).to_zarr(compute=False)? Are there any data types that zarr supports which wouldn't be covered by zeros_like? Possibly strings...

@dcherian
Contributor

Possibly strings...

Seems to work!

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(20).reshape(4, 5)),
        "b": ("x", ["aa", "b", "c", "d"]),
        "c": ("x", np.array(["aa", "b", "c", "d"], dtype="S")),
        "time": ("t", pd.date_range("2001-01-01", periods=4, freq="D")),
    },
    coords={"x": np.arange(4), "y": [33, 44, 22, 11, 55]},
)
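
A quick check of the template this produces (dtype behavior as I understand zeros_like):

template = xr.zeros_like(ds)
print(template.dtypes)  # dtypes are preserved, e.g. int64, <U2, |S2, datetime64[ns]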

@dcherian
Contributor

dcherian commented Nov 14, 2023

I'm playing around with this piece of code. Does it make sense? There's a fair bit of complexity around the write_empty_chunks=False optimization.

import xarray as xr


def make_template(ds, *, encoding=None):
    encoding = encoding or {}
    # Collect fill values from existing variable encodings, letting the
    # user-provided encoding dict take precedence.
    fillvalues = {
        name: var.encoding["_FillValue"]
        for name, var in ds._variables.items()
        if var.encoding and "_FillValue" in var.encoding
    }
    fillvalues.update(
        {
            name: enc["_FillValue"]
            for name, enc in encoding.items()
            if "_FillValue" in enc
        }
    )

    # Zero-fill everything, then overwrite any variable that has a fill value
    # with a template full of that value.
    template = xr.zeros_like(ds.drop_vars(set(fillvalues)))
    for var, fillvalue in fillvalues.items():
        template[var] = xr.full_like(ds[var], fillvalue)
    return template


def initialize_zarr(ds, store, *, region_dims=None, append_dim=None, **kwargs):
    if "compute" in kwargs:
        raise ValueError("The ``compute`` kwarg is not supported in `initialize_zarr`.")

    if kwargs.get("mode", "w") != "w":
        raise ValueError(
            f"Only mode='w' is allowed for initialize_zarr. Received {kwargs['mode']}"
        )

    encoding = kwargs.get("encoding", {})

    template = make_template(ds, encoding=encoding)

    # TODO: handle `write_empty_chunks` in init_kwargs["encoding"]
    init_kwargs = kwargs.copy()
    init_kwargs.pop("write_empty_chunks", None)

    # Write metadata (and fill values) without computing any dask arrays.
    template.to_zarr(store, compute=False, write_empty_chunks=False, **init_kwargs)

    if region_dims:
        # At this point, the store has been initialized (and potentially
        # overwritten), so switch to append mode.
        kwargs.pop("mode", None)
        dropped = ds.drop_dims(region_dims)
        new_encoding = kwargs.pop("encoding", None)
        if new_encoding:
            new_encoding = {k: v for k, v in new_encoding.items() if k in dropped}

        # Eagerly write everything that does not depend on the region dims.
        dropped.to_zarr(
            store,
            **kwargs,
            encoding=new_encoding,
            compute=True,
            mode="a",
        )

        # can't use drop_dims since that will also remove any variable
        # with the dims to be dropped, even if it also has some of region_dims
        dims_to_drop = set(ds.dims) - set(region_dims)
        vars_to_drop = [
            name
            for name, var in ds._variables.items()
            if set(var.dims).issubset(dims_to_drop)
        ]
        return ds.drop_vars(vars_to_drop)

    elif append_dim:
        # TODO
        pass

    else:
        return ds


encoding = {"a": {"_FillValue": -1}}
initialized = initialize_zarr(ds, store, region_dims="y", mode="w", encoding=encoding)
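
The returned dataset keeps only the variables that touch region_dims; each worker would then write its slice, e.g. (illustrative slice):

initialized.isel(y=slice(0, 2)).to_zarr(store, region={"y": slice(0, 2)}, mode="r+")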

@max-sixty
Collaborator Author

n00b question — why do we need the code around _FillValue?

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

(very minor point — we also want to allow w-, indeed that should be the default, given that Zarr will wipe everything, including a whole bucket, for any path it's given)

@dcherian
Contributor

why do we need the code around _FillValue?

🤦🏾‍♂️ I was testing with a numpy dataset, so compute=False made no difference :/

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

Yes. But the indexes for the dims get written during template.to_zarr. After the drop, we are looking to write non-dim coordinate variables and data variables without region dims. So if there was a big dask array that had no dimensions in common with region_dims it would get written at this point. Perhaps initialize_ isn't a good prefix for this function. Thoughts?

we also want to allow w-, indeed that should be the default, given that Zarr will wipe everything,

👍

@max-sixty
Collaborator Author

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

Yes. But the indexes for the dims get written during template.to_zarr. After the drop, we are looking to write non-dim coordinate variables and data variables without region dims. So if there was a big dask array that had no dimensions in common with region_dims it would get written at this point.

I think the code is doing the right thing — dims_to_drop = set(ds.dims) - set(region_dims) means that we'll drop anything with no dims in common, as you replied.

A clearer way to ask my initial question was "If there's a dataset with one array with two dims, and both dims are region_dims, does this drop the array?" — it doesn't drop anything (dims_to_drop is empty).

Perhaps initialize_ isn't a good prefix for this function.

I like it! make_template etc. seems fine too, if anyone has strong preferences.


I see there's a logic branch for append_dim; I haven't used region with append_dim; is that a commonly used pattern?

dcherian added a commit to dcherian/xarray that referenced this issue Nov 16, 2023
@dcherian linked pull request #8460 on Nov 16, 2023 that will close this issue