Inconsistency between the types of Dataset.dims and DataArray.dims #921

Closed
shoyer opened this issue Jul 29, 2016 · 10 comments

@shoyer (Member) commented Jul 29, 2016

DataArray.dims is currently a tuple, whereas Dataset.dims is a dict mapping dimension names to sizes. This results in ugly code like this, taken from xarray/core/groupby.py:

    try:  # Dataset: dims maps name -> size
        expected_size = obj.dims[group_dim]
    except TypeError:  # DataArray: dims is a tuple, so look up the size by axis number
        expected_size = obj.shape[obj.get_axis_num(group_dim)]

One way to resolve this inconsistency would be to switch DataArray.dims to a (frozen) OrderedDict. The downside is that code like x.dims[0] wouldn't work anymore (unless we add some terrible fallback logic). You'd have to write x.dims.keys()[0], which is pretty ugly. On the plus side, x.dims['time'] would always return the size of the time dimension, regardless of whether x is a DataArray or Dataset.
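
For concreteness, a minimal sketch of the current behavior (the array shape and dimension names here are made up for illustration):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'))
    ds = da.to_dataset(name='foo')

    da.dims          # ('time', 'space') -- a tuple of names
    da.dims[0]       # 'time' -- positional lookup works on the tuple
    ds.dims          # mapping of name -> size, e.g. {'time': 2, 'space': 3}
    ds.dims['time']  # 2 -- name lookup returns the dimension size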

Another option would be to add an attribute dim_shape (or some other name) that serves as an alias to dims on Dataset and an alias to OrderedDict(zip(dims, shape)) on DataArray. This would be fully backwards compatible, but further pollute the namespace on Dataset and DataArray.
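
A rough sketch of what that alias might look like on DataArray; dim_shape is purely hypothetical here, with the name and behavior taken only from the suggestion above:

    from collections import OrderedDict

    class DataArray:
        ...
        @property
        def dim_shape(self):
            # Hypothetical alias: map each dimension name to its length,
            # mirroring the mapping that Dataset.dims already returns.
            return OrderedDict(zip(self.dims, self.shape))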

@max-sixty (Collaborator) commented:

What are the arguments against having dims be a tuple for both DataArray & Datasets?

@shoyer (Member, Author) commented Jul 29, 2016

What are the arguments against having dims be a tuple for both DataArray & Datasets?

I guess that's a third option! In that case, we'd have to add shape to Dataset, too, to ensure there is some way to get size information.

The main downside is that we'd still be stuck without any easy way to get dimension sizes by name -- ds.shape[ds.get_axis_num('time')] is pretty bad. Also, I'm not sure how many use cases there are for actually pulling out the first dimension of a dataset (or its shape) by position, given that dimension order on a Dataset is currently sorted independently of dimension order on any of its data arrays.

@max-sixty (Collaborator) commented:

I frequently need to pull out the dimension names, and I've never needed to pull out their sizes. I realize that the internals need to do that though. And maybe my use case is specific.

Zooming out a bit: the access pattern for most of these (variables, coords) is indeed an OrderedDict of name to object. But the sticking point for dims is: why should the size be the value here? It seems very different from having dimension objects as values.

I could certainly understand a convenience dims_shape (i.e. {dim: ds[dim].shape for dim in ds.dims}).

But I'm coming from a place of less context, without that much confidence...

@shoyer (Member, Author) commented Jul 29, 2016

I frequently need to pull out the dimension names, and I've never needed to pull out their sizes.

Indeed, this is why this discrepancy has survived for so long. Both 'foo' in x and iteration over x work the same for dictionary keys and tuples (thanks to duck typing).

Zooming out a bit: the access pattern for most of these (variables, coords) is indeed an OrderedDict of name to object. But the sticking point for dims is: why should the size be the value here? It seems very different from having dimension objects as values.

Agreed, it is a little weird, and probably unnecessary. It is arguably a left-over from the very early days of xarray (before we had a DataArray type!).

It occurs to me that it's not hard to pull sizes out of ds.coords['time'].size if necessary, and that's probably totally sufficient for getting dimension sizes for both DataArray and Dataset. We don't really need dims_shape.
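
For example (a sketch assuming the dimension of interest has a coordinate variable, here named 'time'):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'),
                      coords={'time': [10, 20]})
    ds = da.to_dataset(name='foo')

    # The same expression works for both types:
    da.coords['time'].size  # 2
    ds.coords['time'].size  # 2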

@max-sixty (Collaborator) commented:

Thanks for the, as ever, thoughtful response.

Should we move Dataset.dims to be a tuple, then?

shoyer changed the title from "Switch DataArray.dims to a OrderedDict?" to "Inconsistency between the types of Dataset.dims and DataArray.dims" on Jul 29, 2016
@shoyer (Member, Author) commented Aug 3, 2016

One thing I like about dims being a frozen dictionary is that it makes the fixed, immutable nature of dimension sizes on the Dataset very obvious in the data model. If you had to use ds.coords[...].size to look up dimension sizes, this would be much less obvious.

CC @jhamman in case he has thoughts here.

@jhamman (Member) commented Aug 3, 2016

It seems to me that xarray has the same convention for dims that netCDF4 has for dimensions: OrderedDict for the dataset, tuple for the variables (DataArrays).

I do see the utility in switching to an OrderedDict for both. I have had times where I needed to know the size of a dimension on a data array and, as @shoyer mentioned, it's a bit of a hunt at the moment. That would be my vote if we're willing to make that breaking change.

@shoyer (Member, Author) commented Oct 22, 2016

With optional indexes (#1017) meaning that ds.coords[...].size may not always succeed (OK, depending on some implementation choices), the need for a consistent way to look up dimension sizes becomes even more severe.

Rather than resolving the type inconsistency of dims, we could simply add a new attribute sizes to both Dataset and DataArray that is exactly equivalent to Dataset.dims. The Dataset API ends up a little bit redundant, but we have a consistent way to access this information.

shoyer added a commit to shoyer/xarray that referenced this issue Nov 3, 2016
This allows for consistent access to dimension lengths on ``Dataset`` and
``DataArray``

xref pydata#921 (doesn't resolve it 100%, but should help significantly)
shoyer added a commit that referenced this issue Nov 4, 2016
This allows for consistent access to dimension lengths on ``Dataset`` and
``DataArray``

xref #921 (doesn't resolve it 100%, but should help significantly)
@stale (bot) commented Jan 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

stale bot added the stale label on Jan 25, 2019
@shoyer (Member, Author) commented Jan 25, 2019

We have .sizes now.
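
As implemented, sizes is a mapping from dimension name to length on both Dataset and DataArray, so dimension sizes can be looked up by name regardless of the type; a quick illustration:

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'))
    ds = da.to_dataset(name='foo')

    da.sizes['time']  # 2
    ds.sizes['time']  # 2
    da.dims           # still a tuple: ('time', 'space')
    ds.dims           # still a mapping of name -> size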
