Tools for converting between xray.Dataset and nested dictionaries/JSON #432

Closed
shoyer opened this issue Jun 13, 2015 · 6 comments

@shoyer
Member

shoyer commented Jun 13, 2015

This came up in discussion with @freeman-lab -- xray does not have direct support for converting datasets to or from nested dictionaries (i.e., as could be serialized in JSON).

This is quite straightforward to implement oneself, of course, but there's something to be said for making this more obvious. I'm thinking of a serialization format that looks something like this:

{
    'variables': {
        'temperature': {
            'dimensions': ['x'],
            'data': [1, 2, 3],
            'attributes': {}
        },
        ...
    },
    'attributes': {
        'title': 'My example dataset',
        ...
    }
}

The solution here would be to either:

  1. add a few examples to the IO documentation of how to roll this oneself, or
  2. create a few helper methods/functions to make this even easier: xray.Dataset.to_dict/xray.read_dict.
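Option 1 might look something like the following minimal sketch. Plain dicts and numpy arrays stand in for a real xray.Dataset here (the variable and attribute names are made up for illustration), since the point is the shape of the output rather than the API:

```python
import json
import numpy as np

# Stand-ins for the contents of a small dataset; a real version would
# iterate over the dataset's variables and attributes instead.
variables = {'temperature': (['x'], np.array([1, 2, 3]), {})}
attributes = {'title': 'My example dataset'}

d = {
    'variables': {
        name: {'dimensions': dims,
               'data': data.tolist(),  # nested native-Python lists
               'attributes': attrs}
        for name, (dims, data, attrs) in variables.items()
    },
    'attributes': attributes,
}

print(json.dumps(d, indent=4))
```

Because `.tolist()` produces native Python types, the resulting dict serializes with the stdlib `json` module directly, no custom encoder needed.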
@jsignell
Contributor

Is this still of interest? I was thinking it would look something like this:

d = {'coordinates': {}, 'variables': {}, 'attributes': {}}

d['attributes'].update(dict(self.attrs))

for k in self.coords:
    d['coordinates'].update({k: {'data': self[k].data,
                                 'dimensions': list(self[k].dims),
                                 'attributes': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
    for k in self.data_vars:
        d['variables'].update({k: {'data': self[k].data,
                                   'dimensions': list(self[k].dims),
                                   'attributes': dict(self[k].attrs)}})
else:
    d['variables'].update({'data': self.data,
                           'dimensions': list(self.dims),
                           'attributes': dict(self.attrs)})

@shoyer
Member Author

shoyer commented Jul 22, 2016

Yes, I think this is still of interest, though of course the devil is in the details.

  1. Do we make this look closer to the xarray.Dataset data model (coords, data_vars, attrs, dims) or netCDF (variables, attributes, dimensions)?
  2. If the latter -- do we go so far as to encode all data types (e.g., dates and times) according to CF conventions?
  3. Do we save data in the form of nested lists or in a numpy array?
  4. Do we output directly to JSON or just a dict?
  5. Do we include dims or dimensions (providing dimension sizes) as a top level field/check?
  6. How does the format differ for xarray.DataArray? Do we even bother with DataArray?

My inclinations:

  1. Mirror xarray.Dataset
  2. NA
  3. Use nested lists of native Python types, e.g., generated with numpy's .tolist() method.
  4. Just a dict, to preserve flexibility for different serialization formats.
  5. Yes, sanity checks are important.
  6. Probably not a bad idea to cover xarray.DataArray, too, but the format should be clearly distinct (not reusing variables as a top level key).

@jsignell
Contributor

I agree.

3. Couldn't this make the dict blow up for large datasets? Maybe there could be a flag that lets the user decide whether to leave the data in its current form (could use self.data in case it is a dask array)
6. The trouble with xarray.DataArray is that it doesn't require a name but it can have one. Is that something that we would want to preserve? If not, then maybe it would look more like this:

tolist = True

d = {'coords': {}, 'attrs': dict(self.attrs), 'dims': self.dims}

def func(x, tolist):
    if tolist:
        return x.tolist()
    return x

for k in self.coords:
    d['coords'].update({k: {'data': func(self[k].data, tolist),
                            'dims': list(self[k].dims),
                            'attrs': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
    d.update({'data_vars': {}})
    for k in self.data_vars:
        d['data_vars'].update({k: {'data': func(self[k].data, tolist),
                                   'dims': list(self[k].dims),
                                   'attrs': dict(self[k].attrs)}})
else:
    d.update({'data': func(self.data, tolist)})

@shoyer
Member Author

shoyer commented Jul 22, 2016

3. Couldn't this make the dict blow up for large datasets? Maybe there could be a flag that lets the user decide whether to leave the data in its current form (could use self.data in case it is a dask array)

Which use cases for this functionality would want the numpy/dask array? If you're planning on serializing to JSON or a similar format, then you'll need to add a custom decoder/encoder to handle arrays.

6. The trouble with xarray.DataArray is that it doesn't require a name but it can have one. Is that something that we would want to preserve?

Yes, we should preserve the name if possible (serialization formats are much more useful if they are not lossy). Fortunately, None is a perfectly valid value when translated into JSON (as null), so I think we could simply use that as a default.
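As a quick illustration with the stdlib `json` module (the dict layout below is hypothetical, just a DataArray-style example), a `None` name survives a JSON round trip as `null`:

```python
import json

# Hypothetical DataArray-style dict; `name` defaults to None when the
# array is unnamed, which JSON represents losslessly as null.
d = {'name': None, 'dims': ['x'], 'data': [1, 2, 3], 'attrs': {}}

encoded = json.dumps(d)
decoded = json.loads(encoded)
assert decoded == d            # round trip is lossless
assert decoded['name'] is None  # null decodes back to None
```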

@jsignell
Contributor

jsignell commented Jul 22, 2016

Ok. That makes sense. I added a function to common and submitted a pull request: #917

@shoyer
Member Author

shoyer commented Aug 11, 2016

Fixed by #917
