WIP: New DataStore / Encoder / Decoder API for review #1087
Conversation
The goal here is to make something extensible that we can live with for quite some time, and to clean up the internals of xarray's backend interface. Most of these are analogues of existing xarray classes with a cleaned up interface. I have not yet worried about backwards compatibility or tests -- I would appreciate feedback on the approach here.

Several parts of the logic exist for the sake of dask. I've included the word "dask" in comments to facilitate inspection by @mrocklin.

CC @rabernat, @pwolfram, @jhamman, @mrocklin -- for review

CC @mcgibbon, @JoyMonteiro -- this is relevant to our discussion today about adding support for appending to netCDF files. Don't let this stop you from getting started on that with the existing interface, though.
def get_token(self):
    """Return a token identifier suitable for use by dask."""
    return None
This could default to str(uuid.uuid4())
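For illustration, a minimal sketch of how that default could slot in (the class names here just mirror the snippets in this diff; the UUID fallback is the only new piece):

```python
import os
import uuid


class AbstractDataStore(object):
    def get_token(self):
        """Return a token identifier suitable for use by dask.

        Fallback: a random UUID is always safe, but gives up deterministic
        caching of results across sessions.
        """
        return str(uuid.uuid4())


class NetCDF4DataStore(AbstractDataStore):
    def __init__(self, filename):
        self.filename = filename

    def get_token(self):
        # Deterministic token, as in the diff below: path plus mtime.
        return (self.filename, os.path.getmtime(self.filename))
```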
    return (self.filename, os.path.getmtime(self.filename))

def get_name(self):
    return self.filename
I would think that this would just be 'read-from-disk'
Yeah, maybe that's better than using the full filename.
# Note: this mostly exists for the benefit of future support for partial
# reads -- we don't actually make use of this in the current version of
# xarray.
raise NotImplementedError
Presumably the thing returned by this method is never the result of a task and so doesn't need to be serialized?
This will be passed as a target into da.store, so I think it does need to be serializable.
zarr could definitely use this
""" | ||
# Again, we actually have a use for the region argument? Could be useful | ||
# to ensure writes to zarr are safe. | ||
return None |
What behavior does HDF5 allow here? Can we write from multiple threads to non-overlapping blocks of the on-disk array? Is the library safe enough to allow this?
Can we write from multiple threads to non-overlapping blocks of the on-disk array?
I wish! Unfortunately, my understanding is that this is not the case. HDF5 isn't at all threadsafe -- not even for reading entirely different files at the same time. In the best case scenario, you have compiled HDF5 in "threadsafe" mode which just means they add their own global lock around every API call. So we will need to use some sort of global lock for all HDF5 files.
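A minimal sketch of what a shared global lock would look like when handing writes to dask (the lock object and array names are illustrative; `da.store` does take a `lock` argument):

```python
import threading

import dask.array as da
import numpy as np

# One process-wide lock shared by all HDF5 access (illustrative).
HDF5_LOCK = threading.Lock()

source = da.ones((100, 100), chunks=(10, 10))
target = np.empty((100, 100))  # stand-in for an on-disk HDF5 variable

# Every chunk write acquires the same lock, serializing all HDF5 calls.
da.store(source, target, lock=HDF5_LOCK)
```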
Another reason to like zarr over HDF. ;)
    self._attributes[name] = copy.deepcopy(value)

def get_write_lock(self, name, region=Ellipsis):
    return self._write_locks[name]
This, presumably, is what @alimanfoo would implement if we wanted to support climate data on Zarr?
Indeed. The main complexity would be mapping region (in array coordinates) to the set of overlapping blocks (each of which probably needs its own lock), but he probably already has such a system.
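Roughly, that mapping could look something like this (purely illustrative, not zarr's actual machinery): given a region in array coordinates and a block shape, enumerate the indices of every block the region touches, then look up or create a lock per block index.

```python
import itertools


def overlapping_blocks(region, block_shape):
    """Yield the index of every block touched by ``region``.

    ``region`` is a tuple of slices with explicit start/stop in array
    coordinates; ``block_shape`` gives the per-dimension block size.
    """
    per_dim = []
    for sl, size in zip(region, block_shape):
        first = sl.start // size
        last = (sl.stop - 1) // size
        per_dim.append(range(first, last + 1))
    return itertools.product(*per_dim)


# Example: a (5:30, 0:60) region over 10x50 blocks touches six blocks.
list(overlapping_blocks((slice(5, 30), slice(0, 60)), (10, 50)))
# -> [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```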
import dask.array as da
# TODO: dask.array.store needs to be able to accept a list of Lock
# objects.
da.store(self.sources, self.targets, lock=self.lock)
Seems doable to me
def write_datastore(dataset, store, encode=None, encoding=None,
                    close_on_error=False):
    # TODO: add compute keyword argument to allow for returning futures, like
    # dask.array.store.
I suspect that you don't want to deal with futures directly. Instead you want to expose a dask.graph that the distributed client can collect and replace with futures on its own.
I was thinking of simply returning the dask.delayed object returned by da.store (that's what I meant by "future"). Unless you think this function should be returning a dask graph directly?
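For reference, a tiny example of the behavior I mean, assuming a dask version where `da.store` accepts `compute=False` and returns a `dask.delayed` object:

```python
import dask.array as da
import numpy as np

source = da.ones((100, 100), chunks=(10, 10))
target = np.empty((100, 100))

# Nothing is written yet; we get back a delayed object instead.
delayed_store = da.store(source, target, compute=False)

# Writing happens only when the delayed object is computed (possibly on
# a distributed scheduler).
delayed_store.compute()
```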
I'd like to also add in a prototype of consolidated file handling (with an LRU cache and pickle-ability) that DataStores can plug in to. That will be a cleaner solution for dask.distributed.
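Roughly what I have in mind, as a sketch only (the `CachingFileManager` name and its methods are hypothetical here): pickle the recipe for opening a file rather than the handle itself, and route every open through a shared LRU cache so dask workers can rebuild handles cheaply.

```python
import functools


@functools.lru_cache(maxsize=128)
def _cached_open(opener, args, kwargs_items):
    # Shared cache of open file handles, keyed by how they were opened.
    return opener(*args, **dict(kwargs_items))


class CachingFileManager(object):  # hypothetical name
    """Picklable handle to a file, opened lazily through the LRU cache."""

    def __init__(self, opener, *args, **kwargs):
        self._opener = opener
        self._args = args
        self._kwargs = kwargs

    def acquire(self):
        return _cached_open(self._opener, self._args,
                            tuple(sorted(self._kwargs.items())))

    def __getstate__(self):
        # Only the opener and its arguments cross process boundaries.
        return (self._opener, self._args, self._kwargs)

    def __setstate__(self, state):
        self._opener, self._args, self._kwargs = state
```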
CC @alexamici for interest in the backends refactor

def __call__(self, variables, attrs):
    return conventions.decode_cf_variables(
        variables, attrs, **self._kwargs)
This looks great. It would definitely solve some of the encoding challenges with zarr.
Stephan, this looks awesome! It should simplify the backends a lot! I do worry that it will be painful to refactor the existing backends, but I guess that is the cost of progress.
OK, I'm going to try to reboot this and finish it up in the form of an API that we'll be happy with going forward. I just discovered two more xarray backends over the past two days (in Unidata's Siphon and something @alexamici and colleagues are writing to read GRIB files), so clearly the demand is here. One additional change I'd like to make is to try to rewrite the encoding/decoding functions for variables into a series of invertible coding filters that can potentially be chained together in a flexible way (this is somewhat inspired by zarr). This will allow different backends to mix and match filters as necessary, depending on their particular needs. I'll start on that in another PR.
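To make that concrete, a rough sketch of the shape I have in mind (class and function names here are illustrative, not a final API): each coder is an invertible transformation on a Variable, and backends compose whichever coders they need.

```python
from xarray import Variable


class VariableCoder(object):  # illustrative base class
    """An invertible transformation of a variable's data and metadata."""

    def encode(self, variable):
        raise NotImplementedError

    def decode(self, variable):
        raise NotImplementedError


class ScaleOffsetCoder(VariableCoder):  # illustrative example
    """Undo CF-style scale_factor/add_offset packing on decode."""

    def decode(self, variable):
        attrs = dict(variable.attrs)
        scale = attrs.pop('scale_factor', 1)
        offset = attrs.pop('add_offset', 0)
        return Variable(variable.dims, variable.data * scale + offset, attrs)


def decode_pipeline(variable, coders):
    # Coders compose; each backend picks the subset and order it needs.
    for coder in coders:
        variable = coder.decode(variable)
    return variable
```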
FWIW for the filters, if it would be possible to use the numcodecs Codec API (http://numcodecs.readthedocs.io/en/latest/abc.html) then that could be beneficial beyond xarray, as any work you put into developing filters could then be used elsewhere (e.g., in zarr).
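For anyone unfamiliar, the numcodecs Codec interface boils down to paired encode/decode methods; a small example with one of the bundled codecs (assuming numcodecs is installed):

```python
import numpy as np
from numcodecs import Delta

codec = Delta(dtype='i4')

data = np.arange(10, dtype='i4')
encoded = codec.encode(data)   # stores element-to-element differences
decoded = codec.decode(encoded)

np.testing.assert_array_equal(data, decoded)
```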
See #1752 for getting started on filters. I had a productive plane ride! @alimanfoo thanks for the pointer to numcodecs. I'm sure that will come in handy eventually. Most of the "filters" xarray uses here are at a slightly higher level, decoding/encoding metadata. The actual filters themselves are generally quite simple, e.g., just coercing a dtype, but there is lots of metadata to keep track of to know when they are appropriate to use.
Looks really cool. Will make adding/maintaining backends much easier.
# type: (Hashable, Union[Ellipsis, Tuple[slice, ...]]) -> object
"""Return a lock for writing a given variable.

This method may be useful for DataStores that from which data is
...DataStores for which data...
class InMemoryDataStore(AbstractWritableDataStore):
    """Stores variables and attributes directly in OrderedDicts.

    This store exists internal testing purposes, e.g., for integration tests
...exists for internal...
def __init__(self):
    self._variables = OrderedDict()
    self._attributes = OrderedDict()
    # do we need locks? are writes to NumPy arrays thread-safe?
I think the answer to this question is yes, we need locks (no, writes are not thread-safe), but I imagine @mrocklin can give the final word.
We never do overlapping writes though, right? I've found that locks are not necessary as long as the underlying data store's chunking doesn't overlap poorly with how we're writing chunks. Given that NumPy arrays are entirely fine-grained this doesn't seem like it would be an issue.
from xarray.core.pycompat import OrderedDict, dask_array_type

from xarray import Variable
nit: you need to import xarray, and conventions. I'm guessing you are also expecting to put all the Coders in a module coders?
    self.targets.append(target)
    self.locks.append(lock)
else:
    target[...] = source
This looks great. Nice to get this in a cleaner form.
@shoyer regarding the xarray-grib-driver (not public yet, sorry): we have been working on the GRIB side lately and I didn't review this branch until today. Now we are coming back to the xarray side and I welcome the new "pluggability" of the encoding/decoding engine. Anyway, since a lot of the coding work is already done by the ecCodes library, my hope is that most of the complexity will stay outside of xarray anyway.
This has gotten a little stale at this point. Coders did make it in, though we haven't moved over everything from