Inconsistency between the types of Dataset.dims and DataArray.dims #921

Closed
shoyer opened this issue Jul 29, 2016 · 10 comments

@shoyer (Member) commented Jul 29, 2016

DataArray.dims is currently a tuple, whereas Dataset.dims is a dict mapping dimension names to sizes. This results in ugly code like this, taken from xarray/core/groupby.py:

    try:  # Dataset: dims maps name -> size
        expected_size = obj.dims[group_dim]
    except TypeError:  # DataArray: dims is a tuple, so look up the size by axis number
        expected_size = obj.shape[obj.get_axis_num(group_dim)]

One way to resolve this inconsistency would be to switch DataArray.dims to a (frozen) OrderedDict. The downside is that code like x.dims[0] wouldn't work anymore (unless we add some terrible fallback logic). You'd have to write x.dims.keys()[0], which is pretty ugly. On the plus side, x.dims['time'] would always return the size of the time dimension, regardless of whether x is a DataArray or Dataset.
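
For concreteness, a minimal sketch of the current behavior (the array shape and dimension names here are made up for illustration):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'))
    ds = da.to_dataset(name='foo')

    da.dims          # ('time', 'space') -- a tuple of names
    da.dims[0]       # 'time' -- positional lookup works on the tuple
    ds.dims          # mapping of name -> size, e.g. {'time': 2, 'space': 3}
    ds.dims['time']  # 2 -- name lookup returns the dimension size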

Another option would be to add an attribute dim_shape (or some other name) that serves as an alias to dims on Dataset and an alias to OrderedDict(zip(dims, shape)) on DataArray. This would be fully backwards compatible, but further pollute the namespace on Dataset and DataArray.
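
A rough sketch of what that alias might look like on DataArray; dim_shape is purely hypothetical here, with the name and behavior taken only from the suggestion above:

    from collections import OrderedDict

    class DataArray:
        ...
        @property
        def dim_shape(self):
            # Hypothetical alias: map each dimension name to its length,
            # mirroring the mapping that Dataset.dims already returns.
            return OrderedDict(zip(self.dims, self.shape))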

@max-sixty (Collaborator) commented:

What are the arguments against having dims be a tuple for both DataArray & Datasets?

@shoyer (Member, Author) commented Jul 29, 2016

What are the arguments against having dims be a tuple for both DataArray & Datasets?

I guess that's a third option! In that case, we'd have to add shape to Dataset, too, to ensure there is some way to get size information.

The main downside is that we'd still be stuck without any easy way to get dimension sizes by name -- ds.shape[ds.get_axis_num('time')] is pretty bad. Also, I'm not sure how many use cases there are for actually pulling out the first dimension of a dataset (or its shape) by position, given that dimension order on a Dataset is currently sorted independently of dimension order on any of its data arrays.

@max-sixty (Collaborator) commented:

I frequently need to pull out the dimension names, and I've never needed to pull out their sizes. I realize that the internals need to do that though. And maybe my use case is specific.

Zooming out a bit: the access pattern for most of these (variables, coords) is indeed an OrderedDict of name to object. But the sticking point for dims is: why should the size be the value here? It seems very different from having dimension objects as values.

I could certainly understand a convenience dims_shape (i.e. {dim: ds[dim].shape for dim in ds.dims}).

But I'm coming from a place of less context, without that much confidence...

@shoyer (Member, Author) commented Jul 29, 2016

I frequently need to pull out the dimension names, and I've never needed to pull out their sizes.

Indeed, this is why this discrepancy has survived for so long. Both 'foo' in x and iteration over x work the same for dictionary keys and tuples (thanks to duck typing).

Zooming out a bit: the access pattern for most of these (variables, coords) is indeed an OrderedDict of name to object. But the sticking point for dims is: why should the size be the value here? It seems very different from having dimension objects as values.

Agreed, it is a little weird, and probably unnecessary. It is arguably a left-over from the very early days of xarray (before we had a DataArray type!).

It occurs to me that it's not hard to pull sizes out of ds.coords['time'].size if necessary, and that's probably totally sufficient for getting dimension sizes for both DataArray and Dataset. We don't really need dims_shape.
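
For example (a sketch assuming the dimension of interest has a coordinate variable, here named 'time'):

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'),
                      coords={'time': [10, 20]})
    ds = da.to_dataset(name='foo')

    # The same expression works for both types:
    da.coords['time'].size  # 2
    ds.coords['time'].size  # 2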

@max-sixty (Collaborator) commented:

Thanks for the, as ever, thoughtful response.

Should we move Dataset.dims to be a tuple, then?

shoyer changed the title from "Switch DataArray.dims to a OrderedDict?" to "Inconsistency between the types of Dataset.dims and DataArray.dims" on Jul 29, 2016
@shoyer (Member, Author) commented Aug 3, 2016

One thing I like about dims being a frozen dictionary is that it makes the fixed, immutable nature of dimension sizes on the Dataset very obvious in the data model. If you had to use ds.coords[...].size to look up dimension sizes, this would be much less obvious.

CC @jhamman in case he has thoughts here.

@jhamman (Member) commented Aug 3, 2016

It seems to me that xarray has the same convention for dims that netCDF4 has for dimensions: OrderedDict for the dataset, tuple for the variables (DataArrays).

I do see the utility in switching to an OrderedDict for both. I have had times where I needed to know the size of a dimension on a data array and, as @shoyer mentioned, it's a bit of a hunt at the moment. That would be my vote if we're willing to make that breaking change.

@shoyer (Member, Author) commented Oct 22, 2016

With optional indexes (#1017) meaning that ds.coords[...].size may not always succeed (OK, depending on some implementation choices), the need for a consistent way to look up dimension sizes becomes even more severe.

Rather than resolving the type inconsistency of dims, we could simply add a new attribute sizes to both Dataset and DataArray that is exactly equivalent to Dataset.dims. The Dataset API ends up a little bit redundant, but we have a consistent way to access this information.

shoyer added a commit to shoyer/xarray that referenced this issue Nov 3, 2016
This allows for consistent access to dimension lengths on ``Dataset`` and
``DataArray``

xref pydata#921 (doesn't resolve it 100%, but should help significantly)
shoyer added a commit that referenced this issue Nov 4, 2016
This allows for consistent access to dimension lengths on ``Dataset`` and
``DataArray``

xref #921 (doesn't resolve it 100%, but should help significantly)
@stale (bot) commented Jan 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here; otherwise it will be marked as closed automatically.

stale bot added the stale label on Jan 25, 2019
@shoyer (Member, Author) commented Jan 25, 2019

We have .sizes now.
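
As implemented, sizes is a mapping from dimension name to length on both Dataset and DataArray, so dimension sizes can be looked up by name regardless of the type; a quick illustration:

    import numpy as np
    import xarray as xr

    da = xr.DataArray(np.zeros((2, 3)), dims=('time', 'space'))
    ds = da.to_dataset(name='foo')

    da.sizes['time']  # 2
    ds.sizes['time']  # 2
    da.dims           # still a tuple: ('time', 'space')
    ds.dims           # still a mapping of name -> size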
