
Add metadata_only param to .to_zarr? #8343

Open
max-sixty opened this issue Oct 19, 2023 · 17 comments · May be fixed by #8460
Labels
enhancement topic-zarr Related to zarr storage library

Comments

@max-sixty
Collaborator

Is your feature request related to a problem?

A leaf from #8245, which has a bullet:

compute=False is arguably a less-than-obvious kwarg meaning "write metadata". Maybe this should be a method, maybe it's a candidate for renaming? Or maybe make_template can be an abstraction over it

I've also noticed that for large arrays, running .to_zarr(compute=False) can take several minutes, despite the indexes being very small. I think this is because it's building a dask task graph — which is then discarded, since the array is written from different machines with the region pattern.

Describe the solution you'd like

Would introducing a metadata_only parameter to to_zarr help here? It would give us (see the sketch below):

  • Better name
  • No dask graph
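
For illustration, a hypothetical sketch of how this could look (metadata_only is the proposal here, not an existing parameter; assume ds is a chunked Dataset and store is any Zarr store):

import xarray as xr

# Write only metadata and indexes up front: no dask graph is built,
# and no chunk data is written.
ds.to_zarr(store, mode="w", metadata_only=True)  # proposed API, not implemented

# Workers then fill in their slices with the existing region pattern.
ds.isel(x=slice(0, 100)).to_zarr(store, region={"x": slice(0, 100)})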

Describe alternatives you've considered

No response

Additional context

No response

@TomNicholas added the topic-zarr (Related to zarr storage library) label on Oct 20, 2023
@shoyer
Member

shoyer commented Oct 21, 2023

Yes, this is a great idea!

@jhamman
Member

jhamman commented Oct 22, 2023

+1, this is a really nice idea.

Related to this, there could also be a write-through cache of sorts. For high-latency stores (e.g. S3), synchronously populating the store metadata can really add up. If we knew we were only writing metadata, we could safely populate all the Zarr JSON objects and then send them in one bulk write step.

The combination of these two features would be a lightning fast Zarr initialization routine 🚀
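
A rough sketch of the bulk-write idea (this is not existing xarray functionality; it assumes zarr-python v2 semantics, fsspec, and a hypothetical S3 path): build all the metadata against a fast in-memory store, then push the keys to the high-latency store in one step.

import fsspec
import zarr

# Creating groups/arrays against a plain dict writes only the metadata keys
# (.zgroup, .zarray, .zattrs); no chunk data is produced.
memory_store = {}
root = zarr.group(store=memory_store)
root.create_dataset("a", shape=(100_000, 100_000), chunks=(1_000, 1_000), dtype="f8")

# One bulk upload instead of many sequential, high-latency round trips.
remote = fsspec.get_mapper("s3://my-bucket/dataset.zarr")  # hypothetical path
remote.setitems(dict(memory_store))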

@max-sixty
Collaborator Author

I came across #8343 recently — this seems to be a similar suggestion to what I was intending. Is that correct?

The challenge is that Xarray needs some way to represent the "schema" for the desired entire dataset. I'm very open to alternatives, but so far, the most convenient way to do this has been to load Dask arrays into an xarray.Dataset.

Is anyone more familiar with whether there is a cost to producing the dask task graph? I'm seeing .to_zarr(compute=False) take well over a minute with large arrays with lots of chunks. And it's only writing very small metadata.

@shoyer
Member

shoyer commented Oct 25, 2023

Is anyone more familiar with whether there is a cost to producing the dask task graph? I'm seeing .to_zarr(compute=False) take well over a minute with large arrays with lots of chunks. And it's only writing very small metadata.

If to_zarr(compute=False) is slow, that's more likely due to Xarray doing lots of sequential blocking IO operations. Once you've made the Dataset object, the dask graphs have already been created.

@max-sixty
Collaborator Author

max-sixty commented Oct 25, 2023

If to_zarr(compute=False) is slow, that's more likely due to Xarray doing lots of sequential blocking IO operations. Once you've made the Dataset object, the dask graphs have already been created.

This is an area I don't understand well, so obviously I defer.

I had thought it was driven by dask, since most of the time is spent in dask/array/optimization.py. But maybe we're saying that this is because xarray is writing indexes during that time? (Though the indexes are fairly small, and it's spending lots of time in a fuse function?)

From py-spy:


  %Own   %Total  OwnTime  TotalTime  Function (filename)
  0.00%   0.00%   14.07s    16.42s   fuse (dask/optimization.py)
  0.00%   0.00%    8.15s     8.15s   reverse_dict (dask/core.py)
  0.00%   0.00%    3.17s     4.17s   make_blockwise_graph (dask/blockwise.py)
  0.00%   0.00%    2.81s     2.94s   _cull_dependencies (dask/blockwise.py)
 20.00%  25.00%    2.50s     3.04s   fuse_slice (dask/array/optimization.py)
  0.00%   0.00%    2.49s     2.49s   keys_in_tasks (dask/core.py)
  0.00%   0.00%    2.19s     2.90s   functions_of (dask/optimization.py)
  0.00%   0.00%    1.32s     9.20s   cull (dask/highlevelgraph.py)
  0.00%   0.00%   0.930s    48.16s   optimize (dask/array/optimization.py)


@dcherian
Contributor

Max, I think you're right. These days, dask has a "lazy graph" (HighLevelGraph) that gets lowered down to an old-style graph expressed as dicts. That lowering is still slow, and that's potentially what's happening here.
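
A toy way to see the two layers (assuming dask's graph protocol; sizes are arbitrary):

import dask.array as da

a = da.zeros((50_000, 50_000), chunks=(100, 100))  # 250,000 chunks
hlg = a.__dask_graph__()  # HighLevelGraph: built lazily, cheap to create
dsk = dict(hlg)           # lowering to a flat dict of tasks is the slow part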

@dcherian
Contributor

dcherian commented Oct 25, 2023

Yeah, here's the call to optimize in dask.array.store, before it handles the compute kwarg:

https://github.com/dask/dask/blob/1203b1bb6d52b1fb00d54656af4af8e6e35aa615/dask/array/core.py#L1169-L1171

@dcherian
Contributor

I really think we're just taking advantage of a side-effect by using compute=False for this use-case. I like the idea of a separate method, say initialize_zarr, for this.

@max-sixty
Collaborator Author

max-sixty commented Oct 25, 2023

Yes, we could do either.

How difficult do we think this is? Is it something I can bang out in 45 mins or is it a bigger effort that requires more context?

@max-sixty
Collaborator Author

FYI — for the moment, xr.ones_like(ds).to_zarr(compute=False) saves a few minutes each go! (though the data needs to be floats)

@shoyer
Member

shoyer commented Nov 7, 2023

FYI — for the moment, xr.ones_like(ds).to_zarr(compute=False) saves a few minutes each go! (though the data needs to be floats)

In xarray-beam we use zeros_like, which I believe works for any NumPy dtype.
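
For reference, a minimal sketch of that workaround (assuming ds is dask-backed and store is any Zarr store):

import xarray as xr

# zeros_like keeps dims, dtypes, and chunking but swaps in trivial zero
# arrays, so the graph that to_zarr(compute=False) optimizes is cheap.
xr.zeros_like(ds).to_zarr(store, mode="w", compute=False)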

@max-sixty
Collaborator Author

How difficult do we think this is? Is it something I can bang out in 45 mins or is it a bigger effort that requires more context?

In xarray-beam we use zeros_like, which I believe works for any NumPy dtype.

Could this just be running xr.zeros_like(ds).to_zarr(compute=False)? Are there any data types that zarr supports which wouldn't be covered by zeros_like? Possibly strings...

@dcherian
Contributor

Possibly strings...

Seems to work!

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    {
        "a": (("x", "y"), np.arange(20).reshape(4, 5)),
        "b": ("x", ["aa", "b", "c", "d"]),
        "c": ("x", np.array(["aa", "b", "c", "d"], dtype="S")),
        "time": ("t", pd.date_range("2001-01-01", periods=4, freq="D")),
    },
    coords={"x": np.arange(4), "y": [33, 44, 22, 11, 55]},
)
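
A quick check of the template this produces (dtype behavior as I understand zeros_like):

template = xr.zeros_like(ds)
print(template.dtypes)  # dtypes are preserved, e.g. int64, <U2, |S2, datetime64[ns]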

@dcherian
Contributor

dcherian commented Nov 14, 2023

I'm playing around with this piece of code. Does it make sense? There's a fair bit of complexity around the write_empty_chunks=False optimization.

import xarray as xr


def make_template(ds, *, encoding=None):
    encoding = encoding or {}
    # Collect fill values from existing variable encodings, letting the
    # user-provided encoding dict take precedence.
    fillvalues = {
        name: var.encoding["_FillValue"]
        for name, var in ds._variables.items()
        if var.encoding and "_FillValue" in var.encoding
    }
    fillvalues.update(
        {
            name: enc["_FillValue"]
            for name, enc in encoding.items()
            if "_FillValue" in enc
        }
    )

    # Zero-fill everything, then overwrite any variable that has a fill value
    # with a template full of that value.
    template = xr.zeros_like(ds.drop_vars(set(fillvalues)))
    for var, fillvalue in fillvalues.items():
        template[var] = xr.full_like(ds[var], fillvalue)
    return template


def initialize_zarr(ds, store, *, region_dims=None, append_dim=None, **kwargs):
    if "compute" in kwargs:
        raise ValueError("The ``compute`` kwarg is not supported in `initialize_zarr`.")

    if kwargs.get("mode", "w") != "w":
        raise ValueError(
            f"Only mode='w' is allowed for initialize_zarr. Received {kwargs['mode']}"
        )

    encoding = kwargs.get("encoding", {})

    template = make_template(ds, encoding=encoding)

    # TODO: handle `write_empty_chunks` in init_kwargs["encoding"]
    init_kwargs = kwargs.copy()
    init_kwargs.pop("write_empty_chunks", None)

    # Write metadata (and fill values) without computing any dask arrays.
    template.to_zarr(store, compute=False, write_empty_chunks=False, **init_kwargs)

    if region_dims:
        # At this point, the store has been initialized (and potentially
        # overwritten), so switch to append mode.
        kwargs.pop("mode", None)
        dropped = ds.drop_dims(region_dims)
        new_encoding = kwargs.pop("encoding", None)
        if new_encoding:
            new_encoding = {k: v for k, v in new_encoding.items() if k in dropped}

        # Eagerly write everything that does not depend on the region dims.
        dropped.to_zarr(
            store,
            **kwargs,
            encoding=new_encoding,
            compute=True,
            mode="a",
        )

        # can't use drop_dims since that will also remove any variable
        # with the dims to be dropped, even if it also has some of region_dims
        dims_to_drop = set(ds.dims) - set(region_dims)
        vars_to_drop = [
            name
            for name, var in ds._variables.items()
            if set(var.dims).issubset(dims_to_drop)
        ]
        return ds.drop_vars(vars_to_drop)

    elif append_dim:
        # TODO
        pass

    else:
        return ds


encoding = {"a": {"_FillValue": -1}}
initialized = initialize_zarr(ds, store, region_dims="y", mode="w", encoding=encoding)
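
The returned dataset keeps only the variables that touch region_dims; each worker would then write its slice, e.g. (illustrative slice):

initialized.isel(y=slice(0, 2)).to_zarr(store, region={"y": slice(0, 2)}, mode="r+")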

@max-sixty
Collaborator Author

n00b question — why do we need the code around _FillValue?

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

(very minor point — we also want to allow w-, indeed that should be the default, given that Zarr will wipe everything, including a whole bucket, for any path it's given)

@dcherian
Contributor

why do we need the code around _FillValue?

🤦🏾‍♂️ I was testing with a numpy dataset, so compute=False made no difference :/

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

Yes. But the indexes for the dims get written during template.to_zarr. After the drop, we are looking to write non-dim coordinate variables and data variables without region dims. So if there was a big dask array that had no dimensions in common with region_dims it would get written at this point. Perhaps initialize_ isn't a good prefix for this function. Thoughts?

we also want to allow w-, indeed that should be the default, given that Zarr will wipe everything,

👍

@max-sixty
Collaborator Author

If there's a dataset with two dims, and they're both region_dims, does this drop all the vars?

Yes. But the indexes for the dims get written during template.to_zarr. After the drop, we are looking to write non-dim coordinate variables and data variables without region dims. So if there was a big dask array that had no dimensions in common with region_dims it would get written at this point.

I think the code is doing the right thing — dims_to_drop = set(ds.dims) - set(region_dims) means that we'll drop anything with no dims in common, as you replied.

A clearer way to ask my initial question was "If there's a dataset with one array with two dims, and both dims are region_dims, does this drop the array?" — it doesn't drop anything (dims_to_drop is empty).

Perhaps initialize_ isn't a good prefix for this function.

I like it! make_template etc. seems fine too, if anyone has strong preferences.


I see there's a logic branch for append_dim; I haven't used region with append_dim; is that a commonly used pattern?

dcherian added a commit to dcherian/xarray that referenced this issue Nov 16, 2023
@dcherian linked pull request #8460 on Nov 16, 2023 that will close this issue