Error in DataArray.from_dict(data_array.to_dict()) when using pd.MultiIndex #4073
Comments
Didn't know that it is such a fundamental problem. Managed to get around with set/reset index. |
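A minimal sketch of what a set/reset index workaround along these lines might look like. The array mirrors the example used elsewhere in this thread, but it is built here from level coordinates plus `set_index`, which is an assumption made for illustration and not necessarily what the commenter did:

```python
import pandas as pd
import xarray as xr

# Array with a two-level MultiIndex on the "idx" dimension.
array = xr.DataArray(
    [0, 1],
    dims="idx",
    coords={"one": ("idx", [1, 2]), "two": ("idx", [3, 4])},
).set_index(idx=["one", "two"])

# Flatten the MultiIndex into plain level coordinates before serializing.
as_dict = array.reset_index("idx").to_dict()

# Rebuild the array, then re-create the MultiIndex from its levels.
restored = xr.DataArray.from_dict(as_dict).set_index(idx=["one", "two"])

# The rebuilt "idx" index is a pandas MultiIndex again.
assert isinstance(restored.indexes["idx"], pd.MultiIndex)
```

The level names have to be known when rebuilding, so they would need to be stored alongside the dictionary if they are not fixed in advance.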
In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity. If this issue remains relevant, please comment here or remove the |
This is indeed not ideal. I'm not sure we have great round-trip support for I'll mark as a bug. Any PRs working towards this would be greatly appreciated. (and apologies your excellent issue didn't get picked up @genric ) |
I also ran into this when trying to serialize to Some notes below in case it is useful for anyone else trying to do this. Also - this is all quite ugly, and I may have missed some existing core functionality, so I'll be very happy to hear if there is a better way to handle this.
Following the above, a minimal example:
import pandas as pd
import xarray as xr
idx = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=('one', 'two'))
array = xr.DataArray([0, 1], dims='idx', coords={'idx': idx})
# Stacked multidim coords > dict > recreate array - Fails
xr.DataArray.from_dict(array.to_dict())
# Unstack multidim coords > dict > recreate array - OK
xr.DataArray.from_dict(array.unstack().to_dict())
# Set non-dimensional coord
array2 = array.copy()
array2['Labels'] = ('idx', ['A','B']) # Add non-dim coord
array2 = array2.swap_dims({'idx':'Labels'}) # Swap dims
# Non-dim coord case - also need to reset and drop non-dim coords
# This will fail
array2_dict = array2.unstack().reset_index('idx').to_dict()
xr.DataArray.from_dict(array2_dict)
# This is OK
array2_dict = array2.unstack().reset_index('idx', drop=True).to_dict()
xr.DataArray.from_dict(array2_dict)
# This is also OK
array2_dict = array2.unstack().drop('idx').to_dict()
xr.DataArray.from_dict(array2_dict)
In all cases the reconstructed array is flat, and missing non-dim coords. My work-around for this so far is to pull various mappings manually, and dump everything to
def mapDims(data):
    # Get dims from Xarray
    dims = data.dims  # Set dim list - this excludes stacked dims
    dimsUS = data.unstack().dims  # Set unstacked (full) dim list
    # List stacked dims and map
    # Could also do this by type checking vs. 'pandas.core.indexes.multi.MultiIndex'?
    stackedDims = list(set(dims) - set(dimsUS))
    stackedDimsMap = {k: list(data.indexes[k].names) for k in stackedDims}
    # Get non-dimensional coords
    # These may be stacked, are not listed in self.dims, and are not addressed by .unstack()
    idxKeys = list(data.indexes.keys())
    coordsKeys = list(data.coords.keys())
    nonDimCoords = list(set(coordsKeys) - set(idxKeys))
    # nonDimCoords = list(set(dims) - set(idxKeys))
    # Get non-dim indexes
    # nddimIndexes = {k:data.coords[k].to_index() for k,v in data.coords.items() if k in nonDimCoords} # Note this returns Pandas Indexes, so may fail on file IO.
    nddimMap = {k: list(data.coords[k].to_index().names) for k, v in data.coords.items() if k in nonDimCoords}
    # Get dict maps - to_dict per non-dim coord
    # nddimDicts = {k:data.coords[k].reset_index(k).to_dict() for k,v in data.coords.items() if k in nonDimCoords}
    # Use Pandas - this allows direct dump of PD multiindex to dicts
    nddimDicts = {k: data.coords[k].to_index().to_frame().to_dict() for k, v in data.coords.items() if k in nonDimCoords}
    # Get coords correlated to non-dim coords, need these to recreate original links & stacking (?)
    nddimDims = {k: data.coords[k].dims for k, v in data.coords.items() if k in nonDimCoords}
    # Return every local mapping built above, except the input array itself
    return {k: v for k, v in locals().items() if k != 'data'}
def deconstructDims(data):
    xrDecon = data.copy()
    # Map dims
    xrDecon.attrs['dimMaps'] = mapDims(data)
    # Unstack all coords
    xrDecon = xrDecon.unstack()
    # Remove non-dim coords
    for nddim in xrDecon.attrs['dimMaps']['nonDimCoords']:
        xrDecon = xrDecon.drop(nddim)
    return xrDecon
def reconstructDims(data):
    xrRecon = data.copy()
    # Restack coords
    for stacked in xrRecon.attrs['dimMaps']['stackedDims']:
        # Stack the original level dims (from stackedDimsMap) back into the stacked dim
        xrRecon = xrRecon.stack({stacked: xrRecon.attrs['dimMaps']['stackedDimsMap'][stacked]})
    # General non-dim coord rebuild
    for nddim in xrRecon.attrs['dimMaps']['nonDimCoords']:
        # Add nddim back into main XR array
        xrRecon.coords[nddim] = (
            xrRecon.attrs['dimMaps']['nddimDims'][nddim],
            pd.MultiIndex.from_frame(pd.DataFrame.from_dict(xrRecon.attrs['dimMaps']['nddimDicts'][nddim])),
        )  # OK
    return xrRecon
Dict round-trip is then OK, and the dictionary can also be pushed to standard file types (contains only python native types + numpy array).
# IO with funcs
# With additional tuple coord
array2 = array.copy()
array2['Labels'] = ('idx', ['A','B']) # Add non-dim coord
array2 = array2.swap_dims({'idx':'Labels'}) # Swap dims
# Decon to dict
safeDict = deconstructDims(array2).to_dict()
# Rebuild
xrFromDict = reconstructDims(xr.DataArray.from_dict(safeDict))
# Same as array2 (aside from added attrs)
array2.attrs = xrFromDict.attrs
array2.identical(xrFromDict)  # True
Again, there is likely some cleaner/more obvious thing I'm missing here, but I'm not very familiar with Xarray or Pandas internals - this is just where I ended up when trying to convert to HDF5 compatible data structures in a semi-general way. (As a side-note, I ran into similar issues with |
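The type-checking idea raised in the mapDims comments could be sketched roughly as follows. The helper name `stacked_dims_map` is hypothetical, and the example array is built from level coordinates with `set_index` purely for illustration. This only inspects dimensions that carry an index, so it does not cover the non-dimensional coordinate case handled by the functions above:

```python
import pandas as pd
import xarray as xr


def stacked_dims_map(data):
    """Map each dimension backed by a pandas MultiIndex to its level names."""
    out = {}
    for dim in data.dims:
        index = data.indexes.get(dim)  # None when the dim has no index
        if isinstance(index, pd.MultiIndex):
            out[dim] = list(index.names)
    return out


# Array with a two-level MultiIndex on the "idx" dimension.
array = xr.DataArray(
    [0, 1],
    dims="idx",
    coords={"one": ("idx", [1, 2]), "two": ("idx", [3, 4])},
).set_index(idx=["one", "two"])

level_map = stacked_dims_map(array)
print(level_map)
```

Checking the index type directly avoids calling `unstack()` just to compare dimension lists, which may matter for large arrays.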
Error recovering DataArray with from_dict from what was persisted by to_dict when using pandas.MultiIndex.
MCVE Code Sample
Expected Output
No error.
Problem Description
ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (('idx',), [(1, 3), (2, 4)], {}) to Variable.
Versions
python: 3.7.6
xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1