Tools for converting between xray.Dataset and nested dictionaries/JSON #432

Closed
shoyer opened this issue Jun 13, 2015 · 6 comments

@shoyer
Member

shoyer commented Jun 13, 2015

This came up in discussion with @freeman-lab -- xray does not have direct support for converting datasets to or from nested dictionaries (i.e., as could be serialized in JSON).

This is quite straightforward to implement oneself, of course, but there's something to be said for making this more obvious. I'm thinking of a serialization format that looks something like this:

{
    'variables': {
        'temperature': {
            'dimensions': ['x'],
            'data': [1, 2, 3],
            'attributes': {}
        },
        ...
    },
    'attributes': {
        'title': 'My example dataset',
        ...
    }
}

The solution here would be to either:

  1. add a few examples to the IO documentation of how to roll this oneself, or
  2. create a few helper methods/functions to make this even easier: xray.Dataset.to_dict/xray.read_dict.
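Option 1 might look something like the following minimal sketch. Plain dicts and numpy arrays stand in for a real xray.Dataset here (the variable and attribute names are made up for illustration), since the point is the shape of the output rather than the API:

```python
import json
import numpy as np

# Stand-ins for the contents of a small dataset; a real version would
# iterate over the dataset's variables and attributes instead.
variables = {'temperature': (['x'], np.array([1, 2, 3]), {})}
attributes = {'title': 'My example dataset'}

d = {
    'variables': {
        name: {'dimensions': dims,
               'data': data.tolist(),  # nested native-Python lists
               'attributes': attrs}
        for name, (dims, data, attrs) in variables.items()
    },
    'attributes': attributes,
}

print(json.dumps(d, indent=4))
```

Because `.tolist()` produces native Python types, the resulting dict serializes with the stdlib `json` module directly, no custom encoder needed.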
@jsignell
Contributor

Is this still of interest? I was thinking it would look something like this:

d = {'coordinates': {}, 'variables': {}, 'attributes': {}}

d['attributes'].update(dict(self.attrs))

for k in self.coords:
    d['coordinates'].update({k: {'data': self[k].data,
                                 'dimensions': list(self[k].dims),
                                 'attributes': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
    for k in self.data_vars:
        d['variables'].update({k: {'data': self[k].data,
                                   'dimensions': list(self[k].dims),
                                   'attributes': dict(self[k].attrs)}})
else:
    d['variables'].update({'data': self.data,
                           'dimensions': list(self.dims),
                           'attributes': dict(self.attrs)})

@shoyer
Member Author

shoyer commented Jul 22, 2016

Yes, I think this is still of interest, though of course the devil is in the details.

  1. Do we make this look closer to the xarray.Dataset data model (coords, data_vars, attrs, dims) or netCDF (variables, attributes, dimensions)?
  2. If the latter -- do we go so far as to encode all data types (e.g., dates and times) according to CF conventions?
  3. Do we save data in the form of nested lists or in a numpy array?
  4. Do we output directly to JSON or just a dict?
  5. Do we include dims or dimensions (providing dimension sizes) as a top level field/check?
  6. How does the format differ for xarray.DataArray? Do we even bother with DataArray?

My inclinations:

  1. Mirror xarray.Dataset
  2. NA
  3. Use nested lists of native Python types, e.g., generated with numpy's .tolist() method.
  4. Just a dict, to preserve flexibility for different serialization formats.
  5. Yes, sanity checks are important.
  6. Probably not a bad idea to cover xarray.DataArray, too, but the format should be clearly distinct (not reusing variables as a top level key).

@jsignell
Contributor

I agree.

3. Couldn't this make the dict blow up for large datasets? Maybe there could be a flag that lets the user decide whether to leave the data in its current form (could use self.data in case it is a dask array)
6. The trouble with xarray.DataArray is that it doesn't require a name but it can have one. Is that something that we would want to preserve? If not, then maybe it would look more like this:

tolist = True

d = {'coords': {}, 'attrs': dict(self.attrs), 'dims': self.dims}

def func(x, tolist):
    if tolist:
        return x.tolist()
    return x

for k in self.coords:
    d['coords'].update({k: {'data': func(self[k].data, tolist),
                            'dims': list(self[k].dims),
                            'attrs': dict(self[k].attrs)}})
if hasattr(self, 'data_vars'):
    d.update({'data_vars': {}})
    for k in self.data_vars:
        d['data_vars'].update({k: {'data': func(self[k].data, tolist),
                                   'dims': list(self[k].dims),
                                   'attrs': dict(self[k].attrs)}})
else:
    d.update({'data': func(self.data, tolist)})

@shoyer
Member Author

shoyer commented Jul 22, 2016

3. Couldn't this make the dict blow up for large datasets? Maybe there could be a flag that lets the user decide whether to leave the data in its current form (could use self.data in case it is a dask array)

Which use cases for this functionality would want the numpy/dask array? If you're planning on serializing to JSON or a similar format, then you'll need to add a custom decoder/encoder to handle arrays.

6. The trouble with xarray.DataArray is that it doesn't require a name but it can have one. Is that something that we would want to preserve?

Yes, we should preserve the name if possible (serialization formats are much more useful if they are not lossy). Fortunately, None is a perfectly valid value when translated into JSON (as null), so I think we could simply use that as a default.
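As a quick illustration with the stdlib `json` module (the dict layout below is hypothetical, just a DataArray-style example), a `None` name survives a JSON round trip as `null`:

```python
import json

# Hypothetical DataArray-style dict; `name` defaults to None when the
# array is unnamed, which JSON represents losslessly as null.
d = {'name': None, 'dims': ['x'], 'data': [1, 2, 3], 'attrs': {}}

encoded = json.dumps(d)
decoded = json.loads(encoded)
assert decoded == d            # round trip is lossless
assert decoded['name'] is None  # null decodes back to None
```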

@jsignell
Contributor

jsignell commented Jul 22, 2016

Ok. That makes sense. I added a function to common and submitted a pull request: #917

@shoyer
Member Author

shoyer commented Aug 11, 2016

Fixed by #917
