Many methods are broken (e.g., concat/stack/sortby) when using repeated dimensions #1378
Yes, also happening on latest master. I suspect there are several other things which won't work properly (or will at least behave unexpectedly) when using repeated dims...
Indeed, we don't have very good test coverage for operations with repeated dimensions. Fixes would certainly be appreciated, though they might be somewhat tricky. Even failing loudly in these cases would be an improvement.
Right, positional indexing also works unexpectedly in this case, though I understand it's tricky and should probably be discouraged:

```python
A[0, :]  # returns A
A[:, 0]  # returns A.isel(dim0=0)
```
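For reference, a minimal sketch of how such an array might be constructed (assuming the older xarray behaviour where duplicate dimension names were accepted at construction time; newer versions may warn or behave differently):

```python
import numpy as np
import xarray as xr

# A 2-D array whose two axes share the dimension name 'dim0'.
A = xr.DataArray(np.arange(9).reshape(3, 3), dims=("dim0", "dim0"))

A[0, :]  # reported above to return A unchanged
A[:, 0]  # reported above to behave like A.isel(dim0=0)
```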
I guess it would be good to document the expected behaviour with repeated dims somewhere, i.e. what should happen when doing operations like the ones listed in the title?
I cannot see a use case in which repeated dims actually make sense. In my case this situation originates from HDF5 files which do contain repeated dimensions.
Yes, this happened to me too. The first thing I did was convert the files to proper netCDF datasets...
Agreed. I would have disallowed them entirely, but sometimes it's useful to allow loading variables with duplicate dimensions, even if the only valid operation you can do is de-duplicate them. Every routine that looks up dimensions by name should go through a single common code path.
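As a quick illustration of why by-name lookup is problematic here (a minimal sketch; constructing an array with duplicate dims may warn in newer xarray versions):

```python
import numpy as np
import xarray as xr

# With a repeated dimension name, any by-name lookup is ambiguous:
# get_axis_num can only ever point at one of the two axes.
A = xr.DataArray(np.zeros((2, 2)), dims=("x", "x"))
A.get_axis_num("x")  # a single axis number, even though 'x' labels both axes
```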
I use repeated dimensions to store a covariance matrix. The data variable containing the covariance matrix has 4 dimensions, of which the last 2 are repeated. This is valid NetCDF and should be valid in xarray. It would be a significant problem for me if repeated dimensions became disallowed.
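For illustration, a minimal sketch of such a variable; the dimension names and sizes here are made up, and recent xarray versions may warn about duplicate dimension names at construction time:

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D covariance variable: the last two dimensions share a name.
n_scanline, n_scanpos, n_channel = 100, 56, 19
covariance = xr.Variable(
    dims=("scanline", "scanpos", "channel", "channel"),
    data=np.zeros((n_scanline, n_scanpos, n_channel, n_channel)),
)
```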
@gerritholl - rereading this issue, I don't think we're particularly opposed to supporting duplicate dimensions. We do know there are things that don't work right now and that we don't have test coverage for operations that use duplicate dimensions. This is marked as an issue where contributions are welcome.
@jhamman Ok, good to hear it's not slated to be removed. I would love to work on this; I wish I had the time! I'll keep it in mind if I do find some spare time.
This also affects the …
Annotating distance matrices with xarray is also not possible, due to the duplicate dimension.
I'm not too fond of having multiple dimensions with the same name because, whenever you need to operate on one but not the other, you have little to no choice but to revert to positional indexing. Consider also how many methods expect either **kwargs or a dict-like parameter with the dimension or variable names as the keys. I would not be surprised to find that many API design choices fall apart in the face of this use case. Also, having two non-positional (as it should always be in xarray!) dimensions with the same name only makes sense when modelling symmetric N:N relationships. Two good examples are covariance matrices and the weights for Dijkstra's algorithm. The problems start when the object represents an asymmetric relationship, e.g. the cost of travelling from one station to another, which need not equal the cost of the return trip.
I could easily come up with many other cases. What if, instead of allowing duplicate dimensions, we allowed sharing an index across different dimensions? Something like:

```python
river_transport = Dataset(
    coords={
        'station': ['Kingston', 'Montreal'],
        'station_from': ('station', ),
        'station_to': ('station', ),
    },
    data_vars={
        'cost': (('station_from', 'station_to'), [[0, 20], [15, 0]]),
    },
)
```

or, for DataArrays:

```python
river_transport = DataArray(
    [[0, 20], [15, 0]],
    dims=('station_from', 'station_to'),
    coords={
        'station': ['Kingston', 'Montreal'],
        'station_from': ('station', ),
        'station_to': ('station', ),
    },
)
```

Note how this syntax doesn't exist as of today:

```python
'station_from': ('station', ),
'station_to': ('station', ),
```

From an implementation point of view, I think it could be implemented fairly easily by keeping track of a map of aliases. This design would not resolve the issue of compatibility with NetCDF, though.
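A minimal sketch of the alias bookkeeping this proposal describes; the names and helper are hypothetical, not an existing xarray API:

```python
# Both pseudo-dimensions alias the single real dimension 'station'.
dim_aliases = {"station_from": "station", "station_to": "station"}

def resolve_dim(name, aliases=dim_aliases):
    # Any routine that looks up a dimension by name would first translate
    # an alias into the real, shared dimension name.
    return aliases.get(name, name)

resolve_dim("station_to")  # -> 'station'
resolve_dim("time")        # -> 'time' (not an alias, returned unchanged)
```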
According to xarray issues pydata/xarray#3286 and pydata/xarray#1378, the open_mfdataset function has problems creating a merged dataset from multiple files in which variables have repeated dimension names. The easiest thing to do in this case is to prevent such variables from being read in. We have now added the drop_variables keyword to avoid reading in the "anchor" variable in all calls to open_dataset and open_mfdataset in both benchmark.py and core.py. This variable is only present in GCHP-created netCDF files using MAPL v1.0.0, which is in GCHP 12.5.0 and later. This commit should resolve GCPy issue #26. Signed-off-by: Bob Yantosca <yantosca@seas.harvard.edu>
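As a sketch of the workaround that commit message describes (the file pattern here is hypothetical; "anchor" is the variable name the message cites):

```python
import xarray as xr

# Skip the variable with repeated dimensions so the remaining variables
# can be merged across files without error.
ds = xr.open_mfdataset("GCHP.SpeciesConc.*.nc4", drop_variables=["anchor"])
```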
Just wondering what the status of this is. I've been running into bugs trying to model symmetric distance matrices using the same dimension. Interestingly, selection does work very well.
Alternatively to @gimperiale's suggestion of coordinate aliases, I wonder if a solution could be extending dims so that an entry can optionally carry a label, surfaced through something like a dims_full property:

```python
da = DataArray(
    [[0, 20], [15, 0]],
    dims=(('station', 'from'), ('station', 'to')),
    coords={'station': ['Kingston', 'Montreal']},
)
da.dims
# ('station', 'station')
da.dims_full
# (('station', 'from'), ('station', 'to'))

da2 = DataArray([[1, 2], [3, 4]], dims=('x', 'y'))
da2.dims
# ('x', 'y')
da2.dims_full
# ('x', 'y')
```

Repr could look like:

```python
da
# <xarray.DataArray (station[from]: 2, station[to]: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

da.to_dataset(name='river_transport')
# <xarray.Dataset>
# Dimensions:  (station: 2)
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'
# Data variables:
#     river_transport  (station[from], station[to]) int64 0 20 15 0
```
Some more food for thought. Instead of extending the Xarray data model, we could now leverage Xarray's flexible indexes and provide some utility methods like this:

```python
da = DataArray(
    [[0, 20], [15, 0]],
    dims=('station', 'station'),
    coords={'station': ['Kingston', 'Montreal']},
)
da
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'

da_split = da.split_repeated_dims(station=('station_from', 'station_to'))
da_split
# <xarray.DataArray (station_from: 2, station_to: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station_from  (station_from) <U8 'Kingston' 'Montreal'
#   * station_to    (station_to) <U8 'Kingston' 'Montreal'
#   * station       (station_from, station_to) object ...
# Indexes:
#   ┌ station_from  RepeatedIndex
#   │ station_to
#   └ station

da_merged = da_split.merge_repeated_index('station')
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'
```

Where split_repeated_dims and merge_repeated_index convert between the repeated-dims form and a split form backed by a RepeatedIndex. The coordinate station in the split form could be backed by a lazy duck array of label pairs:

```python
da_split.station.data
# some custom lazy duck-array object
da_split.station.values
# [[('Kingston', 'Kingston') ('Kingston', 'Montreal')]
#  [('Montreal', 'Kingston') ('Montreal', 'Montreal')]]
```

So RepeatedIndex could support three kinds of selection (maybe more):

```python
da_split.sel(station="Kingston")  # shorthand for station=('Kingston', 'Kingston')
# <xarray.DataArray ()>
# array(0)
# Coordinates:
#     station  <U8 'Kingston'

da_split.sel(station_from="Kingston")
# <xarray.DataArray (station_to: 2)>
# array([ 0, 20])
# Coordinates:
#   * station_to    (station_to) <U8 'Kingston' 'Montreal'
#     station_from  <U8 'Kingston'

da_split.sel(station_to="Kingston")
# <xarray.DataArray (station_from: 2)>
# array([ 0, 15])
# Coordinates:
#   * station_from  (station_from) <U8 'Kingston' 'Montreal'
#     station_to    <U8 'Kingston'
```

Now, regarding the concat example mentioned in the top comment of this issue, there's an extra step required:

```python
da_split = da.split_repeated_dims(station=('station_from', 'station_to'))
da_concat = xr.concat([da_split, da_split], 'newdim')
da_result = da_concat.merge_repeated_index('station')
```

Overall this would be a pretty nice and explicit way to solve the issue of repeated dimensions, IMHO. It is 100% compatible with the current Xarray data model, which might be preferred for such a special case. EDIT: sorry for the multiple edits!
Regarding the suggestion in my previous comment, perhaps it is not even worth having a RepeatedIndex, and it might be enough to provide two utility methods, split_dims and merge_dims:

```python
da_split = da.split_dims(station=('station_from', 'station_to'))
# <xarray.DataArray (station_from: 2, station_to: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station_from  (station_from) <U8 'Kingston' 'Montreal'
#   * station_to    (station_to) <U8 'Kingston' 'Montreal'

da_merged = da_split.merge_dims(station=('station_from', 'station_to'))
# <xarray.DataArray (station: 2)>
# array([[ 0, 20],
#        [15,  0]])
# Coordinates:
#   * station  (station) <U8 'Kingston' 'Montreal'
```

Converting between the two forms is very cheap, so it can be done as many times as desired. It involves just a few operations on metadata and shallow copies of Xarray variable and index objects (plus maybe some index identity and/or equality checks for merge_dims).
Concatenating DataArrays with repeated dimensions does not work; the call fails with an error.
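The original example and traceback were not preserved here; a minimal sketch of the kind of call that reportedly fails:

```python
import numpy as np
import xarray as xr

# A square array whose two dimensions share the same name.
a = xr.DataArray(np.eye(2), dims=("x", "x"))

# Reported to fail when the inputs have repeated dimensions.
xr.concat([a, a], dim="y")
```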