DataFrame with MultiIndex -> xarray with sparse array #3206

Closed
shoyer opened this issue Aug 12, 2019 · 1 comment · Fixed by #3210
Labels
topic-arrays related to flexible array support

Comments

@shoyer
Member

shoyer commented Aug 12, 2019

Now that we have preliminary support for sparse arrays in xarray, one really cool feature we could explore is creating sparse arrays from MultiIndexed pandas DataFrames.

Right now, xarray's methods for creating objects from pandas always create dense arrays. These can get big really quickly if the MultiIndex is sparsely populated, because the dense size is the product of the number of unique values in each index level, e.g.,

import pandas as pd
import numpy as np
import xarray
df = pd.DataFrame({
    'w': range(10),
    'x': list('abcdefghij'),
    'y': np.arange(0, 100, 10),
    'z': np.ones(10),
}).set_index(['w', 'x', 'y'])
print(xarray.Dataset.from_dataframe(df))

This length-10 DataFrame turns into a dense array with 1000 elements (only 10 of which are not NaN):

<xarray.Dataset>
Dimensions:  (w: 10, x: 10, y: 10)
Coordinates:
  * w        (w) int64 0 1 2 3 4 5 6 7 8 9
  * x        (x) object 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j'
  * y        (y) int64 0 10 20 30 40 50 60 70 80 90
Data variables:
    z        (w, x, y) float64 1.0 nan nan nan nan nan ... nan nan nan nan 1.0

We can imagine that xarray.Dataset.from_dataframe(df, sparse=True) would make the same Dataset, but with sparse arrays (with a NaN fill value) instead of dense arrays.
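To make the idea concrete, here is a minimal sketch of how the COO inputs could be derived from the MultiIndex by hand. It is not xarray's implementation: the variable names (`coords`, `shape`, `dense_size`) are illustrative only, and the final step assumes the optional `sparse` package is installed, so it is guarded by a `try`/`except`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'w': range(10),
    'x': list('abcdefghij'),
    'y': np.arange(0, 100, 10),
    'z': np.ones(10),
}).set_index(['w', 'x', 'y'])

# Drop unused level values so the integer codes index into compact levels.
index = df.index.remove_unused_levels()

# One row of integer positions per index level: shape (ndim, nnz).
coords = np.stack([np.asarray(codes) for codes in index.codes])

# The dense shape is the number of unique values in each level.
shape = tuple(len(level) for level in index.levels)
data = df['z'].to_numpy()

dense_size = int(np.prod(shape))
print(f"stored values: {data.size}, dense elements: {dense_size}")

# Assumption: the `sparse` package (pydata/sparse) is available.
try:
    import sparse
except ImportError:
    sparse = None

if sparse is not None:
    # Positions not listed in `coords` take the fill value (NaN here).
    z = sparse.COO(coords, data, shape=shape, fill_value=np.nan)
    print(z)
```

Only the 10 populated positions and their values are stored, rather than all 1000 cells of the dense array.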

Once sparse arrays work pretty well, this could actually obviate most of the use cases for MultiIndex in arrays. Arguably the model is quite a bit cleaner.

@shoyer shoyer changed the title MultiIndex -> sparse array DataFrame with MultiIndex -> xarray with sparse array Aug 12, 2019
shoyer added a commit to shoyer/xarray that referenced this issue Aug 13, 2019
Fixes pydata#3206

Example usage:

    In [3]: import pandas as pd
       ...: import numpy as np
       ...: import xarray
       ...: df = pd.DataFrame({
       ...:     'w': range(10),
       ...:     'x': list('abcdefghij'),
       ...:     'y': np.arange(0, 100, 10),
       ...:     'z': np.ones(10),
       ...: }).set_index(['w', 'x', 'y'])
       ...:

    In [4]: ds = xarray.Dataset.from_dataframe(df, sparse=True)

    In [5]: ds.z.data
    Out[5]: <COO: shape=(10, 10, 10), dtype=float64, nnz=10, fill_value=nan>
@crusaderky
Contributor

It would be great to have unstack(sparse=True) too, for exactly the same reasons.

3 participants