Error in DataArray.from_dict(data_array.to_dict()) when using pd.MultiIndex #4073

Open
genric opened this issue May 18, 2020 · 4 comments

genric commented May 18, 2020

DataArray.from_dict raises an error when given the output of DataArray.to_dict for an array whose coordinate is a pandas.MultiIndex.

MCVE Code Sample

import pandas as pd
import xarray as xr
idx = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=('one', 'two'))
array = xr.DataArray([0, 1], dims='idx', coords={'idx': idx})
assert array.sel(one=1, two=3) == 0
assert array.sel(one=2, two=4) == 1
array_dict = array.to_dict()
xr.DataArray.from_dict(array_dict)

Expected Output

No error.

Problem Description

ValueError: Could not convert tuple of form (dims, data[, attrs, encoding]): (('idx',), [(1, 3), (2, 4)], {}) to Variable.
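
The tuples in the message hint at the likely cause: the MultiIndex is serialized as a list of tuples, and NumPy converts such a list into a 2-D array, which no longer fits the single dimension `('idx',)`. A minimal sketch using only pandas and NumPy, assuming xarray converts coordinate data with ordinary NumPy array conversion (this does not reproduce xarray's internal code path):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=("one", "two"))

# A MultiIndex flattens to a list of tuples, one tuple per entry.
as_list = idx.tolist()

# NumPy reads a list of equal-length tuples as a 2-D array,
# so the data has two axes while the coordinate declares one dimension.
as_array = np.asarray(as_list)
print(as_list, as_array.shape)
```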

Versions

python: 3.7.6
xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.4
scipy: 1.4.1

Author

genric commented Jun 10, 2020

I didn't realize this was such a fundamental problem. I managed to work around it with set_index/reset_index.
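
The comment does not show the workaround itself, so the following is only a guess at what it might look like: reset the index before `to_dict`, then rebuild it after `from_dict`. The array here is built with `set_index` rather than by passing the MultiIndex directly, which should be equivalent to the MCVE; the variable names are illustrative.

```python
import xarray as xr

# Build the array from plain level coordinates, then combine them into a MultiIndex.
array = xr.DataArray(
    [0, 1],
    dims="idx",
    coords={"one": ("idx", [1, 2]), "two": ("idx", [3, 4])},
).set_index(idx=["one", "two"])

# reset_index turns the MultiIndex levels back into ordinary coordinates,
# so to_dict no longer has to emit a list of tuples for "idx".
array_dict = array.reset_index("idx").to_dict()

# from_dict now succeeds; set_index rebuilds the MultiIndex afterwards.
restored = xr.DataArray.from_dict(array_dict).set_index(idx=["one", "two"])
```

The level names have to be known when rebuilding, since the dictionary itself no longer records that `one` and `two` formed an index.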

stale bot commented Apr 19, 2022

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Apr 19, 2022
@max-sixty
Collaborator

This is indeed not ideal. I'm not sure we have great round-trip support for dict but I'd hope this would be possible to make work.

I'll mark as a bug. Any PRs working towards this would be greatly appreciated.

(and apologies your excellent issue didn't get picked up @genric )

@stale stale bot removed the stale label Apr 19, 2022
@max-sixty max-sixty added the bug label Apr 19, 2022
Contributor

phockett commented Jun 22, 2022

I also ran into this when trying to serialize to dict for general file-writing routines (especially HDF5 writing with h5py), but in my case the issue was in the non-dimensional coordinates. I thought I was being careful by already calling array.unstack() in my IO routine, but I also needed .reset_index() or .drop() for the non-dimensional coordinates.

Some notes below in case it is useful for anyone else trying to do this. Also - this is all quite ugly, and I may have missed some existing core functionality, so I'll be very happy to hear if there is a better way to handle this.


Following the above, a minimal example:

import pandas as pd
import xarray as xr
idx = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=('one', 'two'))
array = xr.DataArray([0, 1], dims='idx', coords={'idx': idx})


# Stacked multidim coords > dict > recreate array - Fails
xr.DataArray.from_dict(array.to_dict()) 

# Unstack multidim coords > dict > recreate array - OK
xr.DataArray.from_dict(array.unstack().to_dict()) 

# Set non-dimensional coord
array2 = array.copy()
array2['Labels'] = ('idx', ['A','B'])  # Add non-dim coord
array2 = array2.swap_dims({'idx':'Labels'})  # Swap dims

# Non-dim coord case - also need to reset and drop non-dim coords
# This will fail
array2_dict = array2.unstack().reset_index('idx').to_dict()
xr.DataArray.from_dict(array2_dict)

# This is OK
array2_dict = array2.unstack().reset_index('idx', drop=True).to_dict()
xr.DataArray.from_dict(array2_dict)

# This is also OK
array2_dict = array2.unstack().drop('idx').to_dict()
xr.DataArray.from_dict(array2_dict)

In all cases the reconstructed array is flat, and missing non-dim coords. My work-around for this so far is to pull various mappings manually, and dump everything to .attrs, then rebuild from those if required, e.g.

def mapDims(data):
    # Get dims from Xarray
    dims = data.dims # Set dim list - this excludes stacked dims
    dimsUS = data.unstack().dims  # Set unstacked (full) dim list
    
    # List stacked dims and map
    # Could also do this by type checking vs. 'pandas.core.indexes.multi.MultiIndex'?
    stackedDims = list(set(dims) - set(dimsUS))
    stackedDimsMap = {k: list(data.indexes[k].names) for k in stackedDims} 
    
    # Get non-dimensional coords
    # These may be stacked, are not listed in self.dims, and are not addressed by .unstack()
    idxKeys = list(data.indexes.keys())
    coordsKeys = list(data.coords.keys())
    nonDimCoords = list(set(coordsKeys) - set(idxKeys))
    # nonDimCoords = list(set(dims) - set(idxKeys))
    
    # Get non-dim indexes
    # nddimIndexes = {k:data.coords[k].to_index() for k,v in data.coords.items() if k in nonDimCoords}  # Note this returns Pandas Indexes, so may fail on file IO.
    nddimMap = {k:list(data.coords[k].to_index().names) for k,v in data.coords.items() if k in nonDimCoords}
    
    # Get dict maps - to_dict per non-dim coord
    #  nddimDicts = {k:data.coords[k].reset_index(k).to_dict() for k,v in data.coords.items() if k in nonDimCoords}
    # Use Pandas - this allows direct dump of PD multiindex to dicts
    nddimDicts = {k:data.coords[k].to_index().to_frame().to_dict() for k,v in data.coords.items() if k in nonDimCoords}
    # Get coords correlated to non-dim coords, need these to recreate original links & stacking (?)
    nddimDims = {k:data.coords[k].dims for k,v in data.coords.items() if k in nonDimCoords}
    
    return {k:v for k,v in locals().items() if k !='data'}


def deconstructDims(data):
    
    xrDecon = data.copy()
    
    # Map dims
    xrDecon.attrs['dimMaps'] = mapDims(data)
    
    # Unstack all coords
    xrDecon = xrDecon.unstack()
    
    # Remove non-dim coords
    for nddim in xrDecon.attrs['dimMaps']['nonDimCoords']:
        xrDecon = xrDecon.drop(nddim)
        
    return xrDecon
    

def reconstructDims(data):

    xrRecon = data.copy()
    
    # Restack coords
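    for stacked in xrRecon.attrs['dimMaps']['stackedDims']:
        # Stack using the level names recorded for this dim (stackedDimsMap), not the list of stacked dim names
        xrRecon = xrRecon.stack({stacked:xrRecon.attrs['dimMaps']['stackedDimsMap'][stacked]})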
    
    # General non-dim coord rebuild
    for nddim in xrRecon.attrs['dimMaps']['nonDimCoords']:
        # Add nddim back into main XR array
        xrRecon.coords[nddim] = (xrRecon.attrs['dimMaps']['nddimDims'][nddim] ,pd.MultiIndex.from_frame(pd.DataFrame.from_dict(xrRecon.attrs['dimMaps']['nddimDicts'][nddim])))  # OK

    return xrRecon
    

Dict round-trip is then OK, and the dictionary can also be pushed to standard file types (contains only python native types + numpy array).
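
For the simpler case of a stacked dimension with no extra non-dimensional coordinates, a more compact variant might look like the sketch below. It detects MultiIndex-backed dimensions by type checking, as the comment in mapDims suggests, and uses JSON as a stand-in for a "standard file type". The helper name `multiindex_levels` and the payload layout are hypothetical, not part of xarray.

```python
import json

import pandas as pd
import xarray as xr


def multiindex_levels(da):
    """Map each MultiIndex-backed dimension to its list of level names (hypothetical helper)."""
    return {
        dim: list(da.indexes[dim].names)
        for dim in da.dims
        if dim in da.indexes and isinstance(da.indexes[dim], pd.MultiIndex)
    }


array = xr.DataArray(
    [0, 1],
    dims="idx",
    coords={"one": ("idx", [1, 2]), "two": ("idx", [3, 4])},
).set_index(idx=["one", "two"])

# Record which dims were stacked, then flatten them before serializing.
levels = multiindex_levels(array)
payload = json.dumps(
    {"levels": levels, "array": array.reset_index(list(levels)).to_dict()}
)

# Load the plain dictionary back and rebuild the MultiIndex from the recorded levels.
loaded = json.loads(payload)
restored = xr.DataArray.from_dict(loaded["array"]).set_index(loaded["levels"])
```

This does not handle the non-dimensional coordinate case covered by the functions above; it only illustrates the detection and the plain-type round trip.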

# IO with funcs
# With additional tuple coord
array2 = array.copy()
array2['Labels'] = ('idx', ['A','B'])  # Add non-dim coord
array2 = array2.swap_dims({'idx':'Labels'})  # Swap dims

# Decon to dict
safeDict = deconstructDims(array2).to_dict()

# Rebuild
xrFromDict = reconstructDims(xr.DataArray.from_dict(safeDict))

# Same as array2 (aside from added attrs)
array2.attrs = xrFromDict.attrs
array2.identical(xrFromDict)  # True

Again, there is likely some cleaner or more obvious approach I'm missing, but I'm not very familiar with Xarray or Pandas internals. This is just where I ended up when trying to convert to HDF5-compatible data structures in a semi-general way.

(As a side-note, I ran into similar issues with xr.DataArray.to_netcdf() and multi-index coords, or at least did last time I tried it - but I didn't look into this further since I prefer using h5py for other reasons.)


3 participants