
WIP: Performance improvements for zarr backend #1800

Merged · 38 commits · Jan 24, 2018

Conversation

jhamman
Member

@jhamman jhamman commented Dec 26, 2017

  • Closes pangeo-data/pangeo#48 (workflow for moving data to the cloud)
  • Tests added (for all bug fixes or enhancements)
  • Tests passed (for all non-documentation changes)
  • Passes git diff upstream/master **/*py | flake8 --diff (remove if you did not edit any Python files)
  • Fully documented, including whats-new.rst for all changes and api.rst for new API (remove if this change should not be visible to users, e.g., if it is an internal clean-up, or if this is part of a larger project that will be documented later)

This is building on top of #1799. Based on the suggestion from @alimanfoo in pangeo-data/pangeo#48 (comment), I have reworked the handling of attributes in the zarr backend. There is more to do here, particularly around set_dimensions, but this already gives almost a 2x speedup when writing to GCP.

cc @rabernat, @mrocklin and @alimanfoo

def set_dimension(self, name, length, is_unlimited=False):
    if is_unlimited:
        # TODO: we need these checks one way or another
        # def set_dimension(self, name, length, is_unlimited=False):
Member

It seems that we should probably rewrite our backend logic to write/check dimensions once rather than doing it each time a Variable is written.

Member Author

This seems like a good idea. I've gone ahead and done this and will push some more changes later tonight.

@jhamman
Member Author

jhamman commented Jan 2, 2018

@shoyer - I've added a set_dimensions method to the writable datastores. This seems to work quite well except for string variables. I have a hacky fix with a comment in there right now; it works for all backends except h5netcdf. Thoughts on a better way to encode variables prior to setting dimensions would be great.

@@ -331,29 +331,42 @@ def get_variables(self):
                        for k, v in self.ds.arrays())

def get_attrs(self):
    _, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
    attributes = HiddenKeyDict(self.ds.attrs, [_DIMENSION_KEY])
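The HiddenKeyDict used here wraps the underlying attrs mapping so that the special dimensions key never leaks into user-visible attributes. A minimal sketch of the idea follows; the class name matches xarray's internal helper, but this particular implementation is illustrative:

```python
from collections.abc import MutableMapping


class HiddenKeyDict(MutableMapping):
    """Wrap a mapping, hiding a set of keys from access and iteration.

    Illustrative sketch; the real xarray helper may differ in detail.
    """

    def __init__(self, data, hidden_keys):
        self._data = data
        self._hidden = frozenset(hidden_keys)

    def _check(self, key):
        if key in self._hidden:
            raise KeyError('Key `%r` is hidden.' % key)

    def __getitem__(self, key):
        self._check(key)
        return self._data[key]

    def __setitem__(self, key, value):
        self._check(key)
        self._data[key] = value

    def __delitem__(self, key):
        self._check(key)
        del self._data[key]

    def __iter__(self):
        # Hidden keys are filtered out of iteration entirely.
        return (k for k in self._data if k not in self._hidden)

    def __len__(self):
        return sum(1 for _ in self)


attrs = {'_ARRAY_DIMENSIONS': ['x'], 'units': 'm'}
visible = HiddenKeyDict(attrs, ['_ARRAY_DIMENSIONS'])
# dict(visible) contains only 'units'; '_ARRAY_DIMENSIONS' is hidden.
```

The payoff over copying the attrs into a fresh dict is that no per-variable copy is made: reads and writes go straight through to the zarr attrs, which is what makes this cheap on high-latency stores.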
Contributor

👍

for d, l in dims.items():
    if d in existing_dims and l != existing_dims[d]:
        raise ValueError("Unable to update size for existing dimension "
                         "%r (%d != %d)" % (d, l, existing_dims[d]))
Contributor

I don't think we have any tests for this sort of error.
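A regression test for this error path could look something like the sketch below. The helper `check_dims` and its standalone form are illustrative; the real check lives inside the backend's dimension-setting logic:

```python
def check_dims(dims, existing_dims):
    # Mirror of the quoted check: refuse to resize an existing dimension.
    for d, l in dims.items():
        if d in existing_dims and l != existing_dims[d]:
            raise ValueError("Unable to update size for existing dimension "
                             "%r (%d != %d)" % (d, l, existing_dims[d]))


def test_conflicting_dimension_size():
    # A mismatched size for an existing dimension must raise ValueError.
    try:
        check_dims({'x': 5}, {'x': 4})
    except ValueError as err:
        assert 'Unable to update size' in str(err)
    else:
        raise AssertionError('expected ValueError')


test_conflicting_dimension_size()
```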

if name in self.ds:
    zarr_array = self.ds[name]
    zarr_array.attrs.update(encoded_attrs)
Contributor

I'm trying to understand when we would ever hit this case. It seems like it would only happen if we were trying to modify an existing store in place. That is not currently supported.

@@ -216,7 +216,11 @@ def store_dataset(self, dataset):

def store(self, variables, attributes, check_encoding_set=frozenset(),
          unlimited_dims=None):
    # This seems out of place
    variables = OrderedDict([(k, maybe_encode_as_char_array(v))
                             for k, v in variables.items()])
Member Author

Should each backend define encode/decode methods? For string variables, we need to encode before setting the dimensions.

@jhamman jhamman added this to the 0.10.1 milestone Jan 2, 2018

@pytest.mark.xfail(reason="Zero-dimension variables are broken")
def test_roundtrip_coordinates_with_space(self):
    super(CFEncodedDataTest, self).test_roundtrip_coordinates_with_space()
Member Author

@rabernat - I have not been able to get zero-dim variables to work with zarr. I know you iterated on this but did you ever get it to work?

xref: zarr-developers/zarr-python#150

Contributor

This issue was fixed in zarr-developers/zarr-python#160. But zarr has not had a release since then. So you need to be installing zarr from github master in order to get zero-dim variables.

Zero dim variables are covered by the test suite
https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L170
And you can see the test passing in the zarr-dev environment in my original zarr PR:
https://travis-ci.org/pydata/xarray/jobs/315571803#L1558

Member Author

You're right. I thought I had zarr-master in my environment, but I didn't. These all work now.

# del zarr_group.attrs[self.DIMENSION_KEY]
# with pytest.raises(KeyError):
#     with xr.decode_cf(store) as actual:
#         pass
Member Author

I still need to figure out why this wasn't raising an error.

Contributor

Probably because the error checking occurs in _get_zarr_dims_and_attrs, but you have bypassed that function in many cases for performance optimizations.

What does the dataset look like if you open it after deleting DIMENSION_KEY? It should be broken in some way.

for k, v in iteritems(attrs):
    attributes[k] = _encode_zarr_attr_value(v)
    encoded_attrs[k] = _encode_zarr_attr_value(v)
Contributor

maybe leave this as a #TODO?

_, attributes = _get_zarr_dims_and_attrs(self.ds, _DIMENSION_KEY)
attributes[key] = _encode_zarr_attr_value(value)

dims = {}
Member Author

make this an OrderedDict

except KeyError:
    raise KeyError("Zarr object is missing the attribute `%s`, "
                   "which is required for xarray to determine "
                   "variable dimensions." % (_DIMENSION_KEY))
Member Author

@rabernat - I want to get your take on this change. What I've done is remove the _DIMENSION_KEY attribute from the dataset altogether; I'm now pulling it from the individual arrays/variables. I realized we were storing this information in multiple places and it was getting too tricky keeping everything straight.

Contributor

I see your point that the coord information was stored twice.

My understanding was that having _DIMENSION_KEY in the global dataset attributes was necessary for set_coords to work on non-dimension coordinates. But maybe this is wrong; I can't find any justification for it by looking through the code.

Contributor

I suppose one option here is to just have no dimensions if there is no _DIMENSION_KEY on the variable. That would allow us to open any zarr store, even those not created by xarray.

Member Author

My understanding was that having _DIMENSION_KEY in the global dataset attributes was necessary for set_coords to work on non-dimension coordinates. But maybe this is wrong; I can't find any justification for it by looking through the code.

Right, I think that may have been the case but I've worked around that now.

I suppose one option here is to just have no dimensions if there is no _DIMENSION_KEY on the variable. That would allow us to open any zarr store, even those not created by xarray.

I think this is doable but I want to leave that for another PR since we'll need to make a number of decisions about how to name dimensions that were not specified in the store.

Contributor

👍

for k, v in self.ds.arrays():
    try:
        for d, s in zip(v.attrs[_DIMENSION_KEY], v.shape):
            dimensions[d] = s
Member Author

I think we need a consistency check here to ensure all the Zarr arrays have the same length on their labeled dimensions.
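Such a check could look like the sketch below. `collect_dimensions` and its (name, dims, shape) input format are illustrative rather than xarray's actual API, which reads the dimension names from each array's attributes:

```python
from collections import OrderedDict


def collect_dimensions(arrays):
    """Gather dimension sizes from (name, dims, shape) tuples,
    raising if two arrays disagree about a dimension's length."""
    dimensions = OrderedDict()
    for name, dims, shape in arrays:
        for d, s in zip(dims, shape):
            if d in dimensions and dimensions[d] != s:
                raise ValueError(
                    "found conflicting lengths for dimension %r "
                    "(%d != %d) while reading array %r"
                    % (d, s, dimensions[d], name))
            dimensions[d] = s
    return dimensions


dims = collect_dimensions([('temp', ('time', 'x'), (10, 5)),
                           ('mask', ('x',), (5,))])
# dims maps 'time' -> 10 and 'x' -> 5; a conflicting length would raise.
```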

@jhamman
Member Author

jhamman commented Jan 10, 2018

This is ready for some final reviews. Although the title refers to Zarr, I ended up doing some minor refactoring of all the backends so a critical eye there would be nice. I'm also hoping to use @shoyer's coding/encoding work in #1803 here once that is merged.

@jhamman jhamman mentioned this pull request Jan 11, 2018
Contributor

@rabernat rabernat left a comment

This seems just about good to go; just a little cleanup required.

I also suggest you use this PR to add some comments / docstrings to the common backend classes which explain, in words, what those methods are for. This will make it easier for new developers to implement new backends. I struggled for a long time to understand the code paths.

@@ -234,23 +253,42 @@ def set_variables(self, variables, check_encoding_set,

        self.writer.add(source, target)

def set_necessary_dimensions(self, variable, unlimited_dims=None):
def set_dimensions(self, variables, unlimited_dims=None):
Contributor

Can you add a docstring or comment explaining what this method does?

This would help new developers (including myself) come onboard with backend development.
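In the spirit of that request, a docstring plus a rough sketch of the control flow might read as follows. `FakeStore` and the namedtuple `Variable` are stand-ins for illustration; the real method lives on xarray's writable data store base class and differs in detail:

```python
from collections import namedtuple

# Stand-in for xarray.Variable, carrying only the fields the sketch needs.
Variable = namedtuple('Variable', ['dims', 'shape'])


class FakeStore:
    """Minimal illustrative store; not xarray's actual base class."""

    def __init__(self):
        self.dimensions = {}

    def get_dimensions(self):
        return self.dimensions

    def set_dimension(self, name, length, is_unlimited=False):
        self.dimensions[name] = None if is_unlimited else length

    def set_dimensions(self, variables, unlimited_dims=None):
        """Create all dimensions needed by ``variables`` in the store.

        Called once per ``store()`` call, before any variable data is
        written, so dimension creation and consistency checks happen a
        single time rather than once per variable.
        """
        if unlimited_dims is None:
            unlimited_dims = set()
        existing = self.get_dimensions()
        dims = {}
        for v in variables.values():
            dims.update(dict(zip(v.dims, v.shape)))
        for name, length in dims.items():
            if name in existing and existing[name] is not None \
                    and length != existing[name]:
                raise ValueError(
                    "Unable to update size for existing dimension "
                    "%r (%d != %d)" % (name, length, existing[name]))
            if name not in existing:
                self.set_dimension(name, length,
                                   is_unlimited=name in unlimited_dims)


store = FakeStore()
store.set_dimensions({'t': Variable(('time',), (8,)),
                      'u': Variable(('time', 'x'), (8, 4))})
```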

except KeyError:
    raise KeyError("Zarr object is missing the attribute `%s`, "
                   "which is required for xarray to determine "
                   "variable dimensions." % (_DIMENSION_KEY))
Contributor

👍

for k, v in iteritems(attrs):
    attributes[k] = _encode_zarr_attr_value(v)
    encoded_attrs[k] = _encode_zarr_attr_value(v)
Contributor

maybe leave this as a #TODO?


with xr.decode_cf(store) as actual:
    # make sure it is hidden
    assert self.DIMENSION_KEY not in actual.attrs
    # assert self.DIMENSION_KEY not in actual.attrs
Contributor

remove comment

# put it back and try removing from a variable
zarr_group.attrs[self.DIMENSION_KEY] = {}
# zarr_group.attrs[self.DIMENSION_KEY] = {}
Contributor

remove comment

@jhamman
Member Author

jhamman commented Jan 13, 2018

@rabernat - thanks for the review. I've addressed all your comments. I've also tested using this to push data to GCS. Combined with gcsfs:master and zarr:master, this gives the best performance we've seen so far.

@rabernat
Contributor

rabernat commented Jan 13, 2018

Eventually we will want to think about testing xarray / zarr / gcsfs integration automatically somewhere. (but not here and not now)

with pytest.raises(KeyError):
    with self.roundtrip(original,
                        save_kwargs={'group': group}) as actual:
        self.assertDatasetIdentical(original, actual)
Contributor

Why did you remove this part of the test?

Member Author

This is an interesting one. I wanted to see if the tests all passed after this change before bringing this up for discussion.

The KeyError that was raised here came from looking for _DIMENSION_KEY on the group. Now that each group doesn't have that key, it obviously isn't raised. So the question up for discussion is: what should we do when we open a zarr group that doesn't have anything in it? Currently, an empty group is returned as an empty Dataset.

Contributor

@rabernat rabernat Jan 14, 2018

So this is related to my earlier comment. I think we should pick a behavior we want to support. Either

  • You can open any zarr group with xarray
  • You can only open zarr groups that themselves were created by xarray

The current docs say the latter. However, having eliminated _DIMENSION_KEY from the group attributes, there is no way to check whether the group was created by xarray if there are no variables.

I can't imagine any scenario in which a user would want to open an empty zarr group in read-only mode. That would be useless. I imagine that this means the dataset was written with a group path but then the user forgot to specify a path when reading. So let's raise an error for now if the group is empty.

In a future PR, we can figure out how to more comprehensively support the reading of zarr datasets that do not have xarray's special attributes.

Member Author

Just to play devil's advocate for a moment: the current behavior is consistent with how we treat groups in netCDF4. Consider the following example:

# create a dataset with a group and nothing in the root group
In [16]: import netCDF4 as nc4

In [17]: rg = nc4.Dataset('/Users/jhamman/workdir/junk.nc', 'w')

In [18]: g1 = rg.createGroup('some/random/path')

In [19]: g1
Out[19]:
<class 'netCDF4._netCDF4.Group'>
group /some/random/path:
    dimensions(sizes):
    variables(dimensions):
    groups:

In [20]: g1.createDimension('time', None)
Out[20]: <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 0

In [22]: time = g1.createVariable('time', 'f8', ('time', ))

In [23]: time[:] = 30

In [25]: rg
Out[25]:
<class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    dimensions(sizes):
    variables(dimensions):
    groups: some

In [26]: rg.close()

# opening the dataset in xarray returns an empty dataset
In [27]: xr.open_dataset('/Users/jhamman/workdir/junk.nc')
Out[27]:
<xarray.Dataset>
Dimensions:  ()
Data variables:
    *empty*

# but if you specify the group, you get the expected dataset
In [28]: xr.open_dataset('/Users/jhamman/workdir/junk.nc', group='some/random/path')
Out[28]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) float64 30.0
Data variables:
    *empty*

If we were to raise an error in the Zarr backend, would we want to check for the presence of variables or attributes?

Contributor

Ok, that's a pretty good point.

I'm fine with leaving as is.

@jhamman
Member Author

jhamman commented Jan 15, 2018

Time for final comments. I'll merge tomorrow if nobody has any further changes.

return variables, attributes

def store(self, variables, attributes, *args, **kwargs):
    AbstractWritableDataStore.store(self, variables, attributes,
Member

You can remove this method -- the implementation is exactly the parent class method.

@@ -195,6 +198,37 @@ def __init__(self, writer=None):
        writer = ArrayWriter()
        self.writer = writer

    def encode(self, variables, attributes):
Member

This is definitely cleaner than what we had before, but I am reluctant to give the impression that this is a new supported method for third-party DataStore classes. Maybe we can call this _encode for now, or add a warning about implementing it?

Eventually, I would like to remove all implementations from the DataStore base classes, and leave them as purely abstract. This will make it clearer to new backend implementers what they actually should/can implement.

So instead of implementing an encode() method, data store classes could have a list of default encoders (see xarray.coding) used when reading/writing data. But xarray.coding isn't quite ready for this yet...
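That "list of encoders" idea might look like the following sketch, where each encoder is a function from variable to variable and the store applies them in order before writing. All names here are illustrative; xarray.coding's actual interface may differ:

```python
def encode_variables(variables, encoders):
    """Apply each encoder, in order, to every variable before writing."""
    encoded = {}
    for name, var in variables.items():
        for encode in encoders:
            var = encode(var)
        encoded[name] = var
    return encoded


# A toy "variable" is a (dims, data, attrs) tuple, and a toy encoder
# normalizes attribute keys to uppercase.
def uppercase_attrs(var):
    dims, data, attrs = var
    return (dims, data, {k.upper(): v for k, v in attrs.items()})


variables = {'x': (('x',), [1, 2, 3], {'units': 'm'})}
result = encode_variables(variables, [uppercase_attrs])
```

With this shape, a backend declares its default encoders as data rather than overriding an encode() method, which is closer to the purely abstract DataStore base class described above.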

Contributor

Clearly some refactoring of this will be needed once the overall backend refactoring moves forward. For now, however, this seems like a reasonable compromise.

Member Author

@shoyer - I'm a bit confused here. As you'll see in the Zarr backend, the encode_variable method is applying a list of encoders. Where in the WritableDataStore were you envisioning the application of the encoders?

Member Author

@shoyer - What would you say to merging this in its current state and leaving the encoders refactor to a separate PR? I'm happy to make more changes here but a) I'm not sure how to address your last comment, and b) I've already drifted a fair ways off track with this PR.

Member

I'm fine with that. We may want to change it more in the future but this is a clear improvement for now.

@jhamman jhamman merged commit 0a0593d into pydata:master Jan 24, 2018
@mrocklin
Contributor

mrocklin commented Jan 24, 2018 via email

@jhamman jhamman mentioned this pull request Feb 11, 2018