
Zarr consolidated #2559

Merged: 16 commits merged into pydata:master on Dec 4, 2018

Conversation

@rabernat (Contributor) commented Nov 20, 2018

This PR adds support for reading and writing of consolidated metadata in zarr stores.

  • Closes #2558 (how to incorporate zarr's new open_consolidated method?)
  • Tests added
  • Fully documented, including whats-new.rst and api.rst

@pep8speaks commented Nov 20, 2018

Hello @rabernat! Thanks for updating the PR.

Line 240:80: E501 line too long (82 > 79 characters)

Comment last updated on Dec 4, 2018 at 19:34 UTC

Review thread on xarray/backends/api.py (outdated, resolved)
@rabernat (Contributor, author)

Ping @lilyminium for a review.

if consolidate:
    import zarr
    zarr.consolidate_metadata(store)
    # do we need to reload the store now that we have consolidated?
Member: would it make sense for zarr to handle this?

Contributor (author): What do you mean?

Member: I meant reloading the zarr store automatically.

Contributor (author): I think that would be hard to achieve, and I'm not sure it's necessary. Frankly, I don't know why we return a store object from to_zarr at all.

Contributor: zarr.consolidate_metadata returns the output of open_consolidated on the same store, so this is already happening.
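
(For reference, a minimal sketch of that round trip; the store path is hypothetical:)

import numpy as np
import zarr

store = zarr.DirectoryStore('example.zarr')  # hypothetical path
root = zarr.group(store=store, overwrite=True)
root.create_dataset('foo', data=np.arange(5))

# consolidate_metadata writes the combined metadata into the store and
# returns the result of open_consolidated on it, so the returned group
# already reads from the consolidated view
consolidated_root = zarr.consolidate_metadata(store)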

@rabernat (Contributor, author)

Also need to add some version checks; this will only work with zarr > 2.2.

Review thread on doc/io.rst (resolved)

def __init__(self, zarr_group):

if consolidated or consolidate_on_close:
    if LooseVersion(zarr.__version__) <= '2.2':  # pragma: no cover
Member: reminder to update this version check too.

Contributor (author): Being more explicit about the version seems to fix this issue here. In the tests I have used the importorskip approach.
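
(A sketch of that test-guard pattern; the minimum version shown here is an assumption:)

import pytest

# skip these tests entirely unless a new-enough zarr is installed;
# the exact minversion is whatever the PR settled on
zarr = pytest.importorskip('zarr', minversion='2.2.1')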

@rabernat (Contributor, author)

Not sure I understand why there are tests failing now. The failing function is test_basic_compute.

https://travis-ci.org/pydata/xarray/jobs/460873430#L7489

At first glance, this does not appear to have anything to do with my PR. The relevant error is:


______________________________ test_basic_compute ______________________________
    def test_basic_compute():
        ds = Dataset({'foo': ('x', range(5)),
                      'bar': ('x', range(5))}).chunk({'x': 2})
        for get in [dask.threaded.get,
                    dask.multiprocessing.get,
                    dask.local.get_sync,
                    None]:
            with (dask.config.set(scheduler=get)
                  if LooseVersion(dask.__version__) >= LooseVersion('0.19.4')
                  else dask.config.set(scheduler=get)
                  if LooseVersion(dask.__version__) >= LooseVersion('0.18.0')
                  else dask.set_options(get=get)):
>               ds.compute()
xarray/tests/test_dask.py:843: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xarray/core/dataset.py:597: in compute
    return new.load(**kwargs)
xarray/core/dataset.py:494: in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:390: in compute
    collections=collections)
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:865: in get_scheduler
    return get_scheduler(scheduler=config.get('scheduler', None))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
get = None, scheduler = <function get at 0x7fc31d9ae048>, collections = None
cls = None
    def get_scheduler(get=None, scheduler=None, collections=None, cls=None):
        """ Get scheduler function
    
        There are various ways to specify the scheduler to use:
    
        1.  Passing in get= parameters (deprecated)
        2.  Passing in scheduler= parameters
        3.  Passing these into global confiuration
        4.  Using defaults of a dask collection
    
        This function centralizes the logic to determine the right scheduler to use
        from those many options
        """
        if get is not None:
            if scheduler is not None:
                raise ValueError("Both get= and scheduler= provided.  Choose one")
            warn_on_get(get)
            return get
    
        if scheduler is not None:
>           if scheduler.lower() in named_schedulers:
E           AttributeError: 'function' object has no attribute 'lower'
../../../miniconda/envs/test_env/lib/python3.6/site-packages/dask/base.py:854: AttributeError

@shoyer (Member) commented Nov 28, 2018

I bet this is due to the latest dask release (1.0). We can fix this in another PR.

@lilyminium (Contributor)

I remember dealing with this in my pull request; if I recall correctly, scheduler was pointing to the scheduler.get function instead. It was a minor bug that was fixed in the next release of either xarray (0.11.0) or dask (0.20.1).

@rabernat (Contributor, author)

So if the test issues can be considered resolved, the only decision we need to make is about the API.

Do we prefer (the current way):

ds.to_zarr(fname, consolidate=True)
xr.open_zarr(fname, consolidated=True)

or @shoyer's suggestion

ds.to_zarr(fname, consolidated=True)
xr.open_zarr(fname, consolidated=True)

???

@martindurant (Contributor)

Will the default for both options be False for the time being?

@rabernat (Contributor, author)

> Will the default for both options be False for the time being?

Yes.

@martindurant (Contributor)

Glad to see this happening, by the way. Once this is in, catalogs using intake-xarray can be updated, and I don't think the code will need to change.

@alimanfoo (Contributor)

Great to see this. On the API, FWIW I'd vote for using the same keyword (consolidated) in both; less burden on the user to remember what to use.

@rabernat (Contributor, author)
Keywords are now all consolidated and all tests are go.

Ready to merge?
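
(A minimal sketch of the settled API; the path is hypothetical:)

import xarray as xr

ds = xr.Dataset({'foo': ('x', range(5))})
# write the store, consolidating its metadata on close
ds.to_zarr('example.zarr', consolidated=True)
# open it again, reading all metadata from the single consolidated key
ds2 = xr.open_zarr('example.zarr', consolidated=True)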

@jhamman (Member) left a review

I think this is basically ready. I had a few small questions/comments, but this looks safe to merge soon.

@@ -36,6 +36,8 @@ Breaking changes
Enhancements
~~~~~~~~~~~~

- Ability to read and write consolidated metadata in zarr stores.
  By `Ryan Abernathey <https://github.com/rabernat>`_.
Member: Can you reference the issue this is attached to: (:issue:`2558`).


open_kwargs = dict(mode=mode, synchronizer=synchronizer, path=group)
if consolidated:
    # TODO: an option to pass the metadata_key keyword
Member: do we need to consider this TODO here?


Member: Anything to do here now?

Contributor (author): Do we feel that it's important to expose this functionality from within xarray? I don't.

Contributor: I also don't. I think it's OK for xarray to have an opinion on what the special key is called.

Contributor (author): I propose we just leave these TODOs here as-is. If anyone ever needs this feature from the xarray side, this will help guide them on how to implement it.
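
(For anyone who does pick it up, a sketch of the zarr-side hook; '.zmetadata' is zarr's default key, and the store path is hypothetical:)

import zarr

store = zarr.DirectoryStore('example.zarr')  # hypothetical
# zarr.open_consolidated already accepts a metadata_key argument, so
# exposing it from xarray would mostly mean forwarding the keyword here
grp = zarr.open_consolidated(store, metadata_key='.zmetadata')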

@martindurant (Contributor)

LGTM

Do you think there should be more explicit text on how to add consolidation to existing zarr/xarray datasets, rather than creating them with consolidation turned on?

We may also need some text around updating consolidated datasets, but that can maybe wait until we see what kind of usage people try.

@rabernat (Contributor, author) commented Dec 4, 2018

> We may also need some text around updating consolidated datasets, but that can maybe wait until we see what kind of usage people try.

Since xarray cannot append to or modify existing zarr stores in place, this seems outside the scope of xarray for now. But maybe it is worth mentioning in the docs.
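
(Consolidation can be added to an existing store after the fact by calling zarr directly; a sketch, with a hypothetical path:)

import zarr

# consolidate the metadata of a store that was written without it;
# xr.open_zarr(..., consolidated=True) can then use the result
store = zarr.DirectoryStore('existing.zarr')  # hypothetical
zarr.consolidate_metadata(store)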

@jhamman (Member) commented Dec 4, 2018

I'm happy here. ...but Appveyor is not.

@shoyer (Member) commented Dec 4, 2018

@rabernat if you're ready, let's merge this.

The failures on Appveyor are unrelated (an issue with int32 and cftime).

@rabernat (Contributor, author) commented Dec 4, 2018 via email

@shoyer merged commit 3ae93ac into pydata:master on Dec 4, 2018
@rabernat (Contributor, author) commented Dec 5, 2018

If anyone wants to see how awesome consolidated metadata is, you can try it in this binder:
https://github.com/rabernat/pangeo_ecco_examples/

I did a bit of lazy profiling here:
https://gist.github.com/rabernat/ce1fb414cf53541afe2245363b06c49d

Things that used to take ~40s now take ~1s. Especially since loading the data is one of the first steps in any pangeo notebook, this is a huge improvement in usability.
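
(A rough sketch of that comparison; the path is hypothetical, and the effect is largest on high-latency stores such as cloud object storage:)

import time
import xarray as xr

for consolidated in (False, True):
    start = time.time()
    ds = xr.open_zarr('example.zarr', consolidated=consolidated)  # hypothetical path
    print('consolidated=%s: %.2f s' % (consolidated, time.time() - start))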

Thanks to everyone who helped make it happen!

@martindurant (Contributor)

I like those timings.

dcherian pushed a commit to yohai/xarray that referenced this pull request Dec 16, 2018
* upstream/master:
  Feature: N-dimensional auto_combine (pydata#2553)
  Support HighLevelGraphs (pydata#2603)
  Bump cftime version in doc environment (pydata#2604)
  use keep_attrs in binary operations II (pydata#2590)
  Temporarily mark dask-dev build as an allowed failure (pydata#2602)
  Fix wrong error message in interp() (pydata#2598)
  Add dayofyear and dayofweek accessors (pydata#2599)
  Fix h5netcdf saving scalars with filters or chunks (pydata#2591)
  Minor update to PR template (pydata#2596)
  Zarr consolidated (pydata#2559)
  fix examples (pydata#2581)
  Fix typo (pydata#2578)
  Concat docstring typo (pydata#2577)
  DOC: remove example using Dataset.T (pydata#2572)
  python setup.py test now works by default (pydata#2573)
  Return slices when possible from CFTimeIndex.get_loc() (pydata#2569)
  DOC: fix computation.rst (pydata#2567)