
WIP: Performance improvements for zarr backend #1800

Merged · 38 commits · Jan 24, 2018

Conversation

jhamman
Member

@jhamman jhamman commented Dec 26, 2017

  • Closes pangeo-data/pangeo#48 (workflow for moving data to the cloud)
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Passes git diff upstream/master **/*py | flake8 --diff (remove if you did not edit any Python files)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is building on top of #1799. Based on the suggestion from @alimanfoo in pangeo-data/pangeo#48 (comment), I have reworked the handling of attributes in the zarr backend. There is more to do here, particularly around set_dimensions, but this already gives almost a 2x speedup when writing to GCP.

cc @rabernat, @mrocklin and @alimanfoo

def set_dimension(self, name, length, is_unlimited=False):
    if is_unlimited:
        # TODO: we need these checks one way or another
        # def set_dimension(self, name, length, is_unlimited=False):
Member

It seems that we should probably rewrite our backend logic to write/check dimensions once rather than doing it each time a Variable is written.

Member Author

This seems like a good idea. I've gone ahead and done this and will push some more changes later tonight.

@jhamman
Member Author

jhamman commented Jan 2, 2018

@shoyer - I've added a set_dimensions method to the writable datastores. This seems to work quite well except for string variables. I have a hacky fix with a comment in there right now; it works for all backends except h5netcdf. Thoughts on a better way to encode variables prior to setting dimensions would be great.

@@ -331,29 +331,42 @@ def get_variables(self):
                        for k, v in self.ds.arrays())

def get_attrs(self):
    _, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
    attributes = HiddenKeyDict(self.ds.attrs, [_DIMENSION_KEY])
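The HiddenKeyDict used here wraps the underlying attrs mapping so that the special dimensions key never leaks into user-visible attributes. A minimal sketch of the idea follows; the class name matches xarray's internal helper, but this particular implementation is illustrative:

```python
from collections.abc import MutableMapping


class HiddenKeyDict(MutableMapping):
    """Wrap a mapping, hiding a set of keys from access and iteration.

    Illustrative sketch; the real xarray helper may differ in detail.
    """

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = frozenset(hidden_keys)

    def _check(self, key):
        if key in self._hidden:
            raise KeyError('Key `%r` is hidden.' % key)

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._check(key)
        self._data[key] = value

    def __delitem__(self, key):
        self._check(key)
        del self._data[key]

    def __iter__(self):
        # Hidden keys are filtered out of iteration entirely.
        return (k for k in self._data if k not in self._hidden)

    def __len__(self):
        return sum(1 for _ in self)


attrs = {'_ARRAY_DIMENSIONS': ['x'], 'units': 'm'}
visible = HiddenKeyDict(attrs, ['_ARRAY_DIMENSIONS'])
# dict(visible) contains only 'units'; '_ARRAY_DIMENSIONS' is hidden.
```

The payoff over copying the attrs into a fresh dict is that no per-variable copy is made: reads and writes go straight through to the zarr attrs, which is what makes this cheap on high-latency stores.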
Contributor

👍

for d, l in dims.items():
    if d in existing_dims and l != existing_dims[d]:
        raise ValueError("Unable to update size for existing dimension "
                         "%r (%d != %d)" % (d, l, existing_dims[d]))
Contributor

I don't think we have any tests for this sort of error.
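A regression test for this error path could look something like the sketch below. The helper `check_dims` and its standalone form are illustrative; the real check lives inside the backend's dimension-setting logic:

```python
def check_dims(dims, existing_dims):
    # Mirror of the quoted check: refuse to resize an existing dimension.
    for d, l in dims.items():
        if d in existing_dims and l != existing_dims[d]:
            raise ValueError("Unable to update size for existing dimension "
                             "%r (%d != %d)" % (d, l, existing_dims[d]))


def test_conflicting_dimension_size():
    # A mismatched size for an existing dimension must raise ValueError.
    try:
        check_dims({'x': 5}, {'x': 4})
    except ValueError as err:
        assert 'Unable to update size' in str(err)
    else:
        raise AssertionError('expected ValueError')


test_conflicting_dimension_size()
```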

if name in self.ds:
    zarr_array = self.ds[name]
    zarr_array.attrs.update(encoded_attrs)
Contributor

I'm trying to understand when we would ever hit this case. It seems like it would only happen if we were trying to modify an existing store in place. That is not currently supported.

@@ -216,7 +216,11 @@ def store_dataset(self, dataset):

def store(self, variables, attributes, check_encoding_set=frozenset(),
          unlimited_dims=None):
    # This seems out of place
    variables = OrderedDict([(k, maybe_encode_as_char_array(v))
                             for k, v in variables.items()])
Member Author

Should each backend define encode/decode methods? For string variables, we need to encode before setting the dimensions.

@jhamman jhamman added this to the 0.10.1 milestone Jan 2, 2018

@pytest.mark.xfail(reason="Zero-dimension variables are broken")
def test_roundtrip_coordinates_with_space(self):
    super(CFEncodedDataTest, self).test_roundtrip_coordinates_with_space()
Member Author

@rabernat - I have not been able to get zero-dim variables to work with zarr. I know you iterated on this but did you ever get it to work?

xref: zarr-developers/zarr-python#150

Contributor

This issue was fixed in zarr-developers/zarr-python#160. But zarr has not had a release since then. So you need to be installing zarr from github master in order to get zero-dim variables.

Zero dim variables are covered by the test suite
https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L170
And you can see the test passing in the zarr-dev environment in my original zarr PR:
https://travis-ci.org/pydata/xarray/jobs/315571803#L1558

Member Author

You're right. I thought I had zarr-master in my environment, but I didn't. These all work now.

# del zarr_group.attrs[self.DIMENSION_KEY]
# with pytest.raises(KeyError):
#     with xr.decode_cf(store) as actual:
#         pass
Member Author

I still need to figure out why this wasn't raising an error.

Contributor

Probably because the error checking occurs in _get_zarr_dims_and_attrs, but you have bypassed that function in many cases for performance optimizations.

What does the dataset look like if you open it after deleting DIMENSION_KEY? It should be broken in some way.

for k, v in iteritems(attrs):
    attributes[k] = _encode_zarr_attr_value(v)
    encoded_attrs[k] = _encode_zarr_attr_value(v)
Contributor

maybe leave this as a #TODO?

_, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
attributes[key] = _encode_zarr_attr_value(value)

dims = {}
Member Author

make this an OrderedDict

except KeyError:
    raise KeyError("Zarr object is missing the attribute `%s`, "
                   "which is required for xarray to determine "
                   "variable dimensions." % (_DIMENSION_KEY))
Member Author

@rabernat - I want to get your take on this change. What I've done is remove the _DIMENSION_KEY attribute from the dataset altogether; I'm now pulling it from the individual arrays/variables. I realized we were storing this information in multiple places and it was getting too tricky keeping everything straight.

Contributor

I see your point that the coord information was stored twice.

My understanding was that having _DIMENSION_KEY in the global dataset attributes was necessary for set_coords to work on non-dimension coordinates. But maybe this is wrong; I can't find any justification for it by looking through the code.

Contributor

I suppose one option here is to just have no dimensions if there is no _DIMENSION_KEY on the variable. That would allow us to open any zarr store, even those not created by xarray.

Member Author

My understanding was that having _DIMENSION_KEY in the global dataset attributes was necessary for set_coords to work on non-dimension coordinates. But maybe this is wrong; I can't find any justification for it by looking through the code.

Right, I think that may have been the case but I've worked around that now.

I suppose one option here is to just have no dimensions if there is no _DIMENSION_KEY on the variable. That would allow us to open any zarr store, even those not created by xarray.

I think this is doable but I want to leave that for another PR since we'll need to make a number of decisions about how to name dimensions that were not specified in the store.

Contributor

👍

for k, v in self.ds.arrays():
    try:
        for d, s in zip(v.attrs[_DIMENSION_KEY], v.shape):
            dimensions[d] = s
Member Author

I think we need a consistency check here to ensure all the Zarr arrays have the same length on their labeled dimensions.
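Such a check could look like the sketch below. `collect_dimensions` and its (name, dims, shape) input format are illustrative rather than xarray's actual API, which reads the dimension names from each array's attributes:

```python
from collections import OrderedDict


def collect_dimensions(arrays):
    """Gather dimension sizes from (name, dims, shape) tuples,
    raising if two arrays disagree about a dimension's length."""
    dimensions = OrderedDict()
    for name, dims, shape in arrays:
        for d, s in zip(dims, shape):
            if d in dimensions and dimensions[d] != s:
                raise ValueError(
                    "found conflicting lengths for dimension %r "
                    "(%d != %d) while reading array %r"
                    % (d, s, dimensions[d], name))
            dimensions[d] = s
    return dimensions


dims = collect_dimensions([('temp', ('time', 'x'), (10, 5)),
                           ('mask', ('x',), (5,))])
# dims maps 'time' -> 10 and 'x' -> 5; a conflicting length would raise.
```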

@jhamman
Member Author

jhamman commented Jan 10, 2018

This is ready for some final reviews. Although the title refers to Zarr, I ended up doing some minor refactoring of all the backends so a critical eye there would be nice. I'm also hoping to use @shoyer's coding/encoding work in #1803 here once that is merged.

@jhamman jhamman mentioned this pull request Jan 11, 2018
Contributor

@rabernat rabernat left a comment

This seems just about good to go; just a little cleanup required.

I also suggest you use this PR to add some comments / docstrings to the common backend classes which explain, in words, what those methods are for. This will make it easier for new developers to implement new backends. I struggled for a long time to understand the code paths.

@@ -234,23 +253,42 @@ def set_variables(self, variables, check_encoding_set,

        self.writer.add(source, target)

def set_necessary_dimensions(self, variable, unlimited_dims=None):
def set_dimensions(self, variables, unlimited_dims=None):
Contributor

Can you add a docstring or comment explaining what this method does?

This would help new developers (including myself) come onboard with backend development.
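In the spirit of that request, a docstring plus a rough sketch of the control flow might read as follows. `FakeStore` and the namedtuple `Variable` are stand-ins for illustration; the real method lives on xarray's writable data store base class and differs in detail:

```python
from collections import namedtuple

# Stand-in for xarray.Variable, carrying only the fields the sketch needs.
Variable = namedtuple('Variable', ['dims', 'shape'])


class FakeStore:
    """Minimal illustrative store; not xarray's actual base class."""

    def __init__(self):
        self.dimensions = {}

    def get_dimensions(self):
        return self.dimensions

    def set_dimension(self, name, length, is_unlimited=False):
        self.dimensions[name] = None if is_unlimited else length

    def set_dimensions(self, variables, unlimited_dims=None):
        """Create all dimensions needed by ``variables`` in the store.

        Called once per ``store()`` call, before any variable data is
        written, so dimension creation and consistency checks happen a
        single time rather than once per variable.
        """
        if unlimited_dims is None:
            unlimited_dims = set()
        existing = self.get_dimensions()
        dims = {}
        for v in variables.values():
            dims.update(dict(zip(v.dims, v.shape)))
        for name, length in dims.items():
            if name in existing and existing[name] is not None \
                    and length != existing[name]:
                raise ValueError(
                    "Unable to update size for existing dimension "
                    "%r (%d != %d)" % (name, length, existing[name]))
            if name not in existing:
                self.set_dimension(name, length,
                                   is_unlimited=name in unlimited_dims)


store = FakeStore()
store.set_dimensions({'t': Variable(('time',), (8,)),
                      'u': Variable(('time', 'x'), (8, 4))})
```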

except KeyError:
    raise KeyError("Zarr object is missing the attribute `%s`, "
                   "which is required for xarray to determine "
                   "variable dimensions." % (_DIMENSION_KEY))
Contributor

👍

for k, v in iteritems(attrs):
    attributes[k] = _encode_zarr_attr_value(v)
    encoded_attrs[k] = _encode_zarr_attr_value(v)
Contributor

maybe leave this as a #TODO?


with xr.decode_cf(store) as actual:
    # make sure it is hidden
    assert self.DIMENSION_KEY not in actual.attrs
    # assert self.DIMENSION_KEY not in actual.attrs
Contributor

remove comment

# put it back and try removing from a variable
zarr_group.attrs[self.DIMENSION_KEY] = {}
# zarr_group.attrs[self.DIMENSION_KEY] = {}
Contributor

remove comment

@jhamman
Member Author

jhamman commented Jan 13, 2018

@rabernat - thanks for the review. I've addressed all your comments. I've also tested using this to push data to GCS. Combined with gcsfs:master and zarr:master, this gives the best performance we've seen so far.

@rabernat
Contributor

rabernat commented Jan 13, 2018

Eventually we will want to think about testing xarray / zarr / gcsfs integration automatically somewhere. (but not here and not now)

with pytest.raises(KeyError):
    with self.roundtrip(original,
                        save_kwargs={'group': group}) as actual:
        self.assertDatasetIdentical(original, actual)
Contributor

Why did you remove this part of the test?

Member Author

This is an interesting one. I wanted to see if the tests all passed after this change before bringing this up for discussion.

The KeyError that was raised here came from looking for _DIMENSION_KEY on the group. Now that each group doesn't have that key, it obviously isn't raised. So the question up for discussion is: what should we do when we open a zarr group that doesn't have anything in it? Currently, an empty group is returned as an empty Dataset.

Contributor

@rabernat rabernat Jan 14, 2018

So this is related to my earlier comment. I think we should pick a behavior we want to support. Either

  • You can open any zarr group with xarray
  • You can only open zarr groups that themselves were created by xarray

The current docs say the latter. However, having eliminated _DIMENSION_KEY from the group attributes, there is no way to check whether the group was created by xarray if there are no variables.

I can't imagine any scenario in which a user would want to open an empty zarr group in read-only mode. That would be useless. I imagine that this means the dataset was written with a group path but then the user forgot to specify a path when reading. So let's raise an error for now if the group is empty.

In a future PR, we can figure out how to more comprehensively support the reading of zarr datasets that do not have xarray's special attributes.

Member Author

Just to play devil's advocate for a moment: the current behavior is consistent with how we treat groups in netCDF4. Consider the following example:

# create a dataset with a group and nothing in the root group
In [16]: import netCDF4 as nc4

In [17]: rg = nc4.Dataset('/Users/jhamman/workdir/junk.nc', 'w')

In [18]: g1 = rg.createGroup('some/random/path')

In [19]: g1
Out[19]:
<class 'netCDF4._netCDF4.Group'>
group /some/random/path:
    dimensions(sizes):
    variables(dimensions):
    groups:

In [20]: g1.createDimension('time', None)
Out[20]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 0

In [22]: time = g1.createVariable('time', 'f8', ('time', ))

In [23]: time[:] = 30

In [25]: rg
Out[25]:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes):
    variables(dimensions):
    groups: some

In [26]: rg.close()

# opening the dataset in xarray returns an empty dataset
In [27]: xr.open_dataset('/Users/jhamman/workdir/junk.nc')
Out[27]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

# but if you specify the group, you get the expected dataset
In [28]: xr.open_dataset('/Users/jhamman/workdir/junk.nc', group='some/random/path')
Out[28]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) float64 30.0
Data variables:
    *empty*

If we were to raise an error in the Zarr backend, would we want to check for the presence of variables or attributes?

Contributor

Ok, that's a pretty good point.

I'm fine with leaving as is.

@jhamman
Member Author

jhamman commented Jan 15, 2018

Time for final comments. I'll merge tomorrow if nobody has any further changes.

return variables, attributes

def store(self, variables, attributes, *args, **kwargs):
    AbstractWritableDataStore.store(self, variables, attributes,
Member

You can remove this method -- the implementation is exactly the parent class method.

@@ -195,6 +198,37 @@ def __init__(self, writer=None):
        writer = ArrayWriter()
        self.writer = writer

    def encode(self, variables, attributes):
Member

This is definitely cleaner than what we had before, but I am reluctant to give the impression that this is a new supported method for third-party DataStore classes. Maybe we can call this _encode for now, or add a warning about implementing it?

Eventually, I would like to remove all implementations from the DataStore base classes, and leave them as purely abstract. This will make it clearer to new backend implementers what they actually should/can implement.

So instead of implementing an encode() method, data store classes could have a list of default encoders (see xarray.coding) used when reading/writing data. But xarray.coding isn't quite ready for this yet...
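That "list of encoders" idea might look like the following sketch, where each encoder is a function from variable to variable and the store applies them in order before writing. All names here are illustrative; xarray.coding's actual interface may differ:

```python
def encode_variables(variables, encoders):
    """Apply each encoder, in order, to every variable before writing."""
    encoded = {}
    for name, var in variables.items():
        for encode in encoders:
            var = encode(var)
        encoded[name] = var
    return encoded


# A toy "variable" is a (dims, data, attrs) tuple, and a toy encoder
# normalizes attribute keys to uppercase.
def uppercase_attrs(var):
    dims, data, attrs = var
    return (dims, data, {k.upper(): v for k, v in attrs.items()})


variables = {'x': (('x',), [1, 2, 3], {'units': 'm'})}
result = encode_variables(variables, [uppercase_attrs])
```

With this shape, a backend declares its default encoders as data rather than overriding an encode() method, which is closer to the purely abstract DataStore base class described above.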

Contributor

Clearly some refactoring of this will be needed once the overall backend refactoring moves forward. For now, however, this seems like a reasonable compromise.

Member Author

@shoyer - I'm a bit confused here. As you'll see in the Zarr backend, the encode_variable method is applying a list of encoders. Where in the WritableDataStore were you envisioning the application of the encoders?

Member Author

@shoyer - What would you say to merging this in its current state and leaving the encoders refactor to a separate PR? I'm happy to make more changes here but a) I'm not sure how to address your last comment, and b) I've already drifted a fair ways off track with this PR.

Member

I'm fine with that. We may want to change it more in the future but this is a clear improvement for now.

@jhamman jhamman merged commit 0a0593d into pydata:master Jan 24, 2018
@mrocklin
Contributor

mrocklin commented Jan 24, 2018 via email

@jhamman jhamman mentioned this pull request Feb 11, 2018