pandas MultiIndex series "unstack" to scipy sparse array functionality #8048

cottrell · 2014-08-17T10:24:54Z

related #4343

I have found myself on occasion writing code to convert from a pandas Series (with n-level MultiIndex) to a scipy sparse n-dim array (plus dimension labels) and back. I am not terribly familiar with the newer sparse data structure features, but my impression is that they provide a totally different kind of functionality. Does this kind of functionality exist somewhere? If not, would something like this be of interest?

TomAugspurger · 2014-08-17T15:31:21Z

Do you have any short, concrete examples of what you're doing? Also let us know why using a SparseSeries doesn't work for it.

cottrell · 2014-08-17T16:41:16Z

Here is a quick example of going from a scipy.sparse array to a pandas Series and back. Mostly I am thinking of the case where you are data munging, reading in labeled data, and then need to switch to a sparse matrix in order to use something in scipy.sparse (multiplication, sparse svds or whatever). I am running this on 0.14.1. Apologies if this functionality is available on the dev branch. I have seen related questions on Stack Overflow, but they all seem to indicate that this is not yet implemented and that no one has given a strong opinion that it should be included either (maybe there is a good reason to avoid this kind of thing).

import scipy.sparse.linalg
import scipy.sparse
import pandas

# * limit ourselves to 2d situation for now
# * also, for simplicity, all column and row labels are just the integer indices

# create a sparse array
m = int(1e6)
n = int(1e5)
A = scipy.sparse.rand(m, n, density=1e-8)

# scipy.sparse -> sparse series? Is there a better way?
s = pandas.Series(A.data)
s.index = pandas.MultiIndex.from_tuples(list(zip(A.row, A.col)))
s = s.to_sparse(fill_value=0)

# sparse series -> scipy.sparse array? s.unstack() is producing errors for me
i, j = list(zip(*s.index))
data = s.values
sA = scipy.sparse.coo_matrix((data, (i, j)), shape=(m, n))

check = A - sA
print('This should be zero', abs(check).max())

TomAugspurger · 2014-08-17T20:30:56Z

Thanks for the example. I agree with your intuition that s.unstack() should do something like what you want. I haven't used the sparse structures at all, but I'll take a look later. May just be not implemented yet.

jreback · 2014-08-17T21:28:19Z

@cottrell This would definitely be an improvement. See the related issue I linked to. I think several of the sparse routines could easily have an option to return a scipy coo type (or other sparse type matrix). It should be straightforward from a pandas sparse structure (which are very similar to coo types).

jreback · 2014-08-17T21:31:56Z

Here's an idea of the internal structure.

In [40]: ts = Series(randn(10))

In [41]: ts[2:-2] = np.nan

In [42]: sts = ts.to_sparse()

In [43]: sts
Out[43]: 
0   -0.785433
1    1.791538
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -1.192896
9    2.672896
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

In [44]: sts._data.values
Out[44]: 
[-0.785433163534, 1.7915375118, nan, nan, nan, nan, nan, nan, -1.19289573606, 2.67289566074]
Fill: nan
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

In [45]: sts._data
Out[45]: 
SingleBlockManager
Items: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
SparseBlock: 10 dtype: float64

In [46]: sts._data.values.sp_index.blengths
Out[46]: array([2, 2], dtype=int32)

In [47]: sts._data.values.sp_index.blocs
Out[47]: array([0, 8], dtype=int32)
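
To connect the block layout shown above to scipy's COO format: COO stores one explicit integer coordinate per value, whereas a BlockIndex stores runs. A minimal sketch of expanding the runs, assuming only the `blocs` and `blengths` arrays printed above (the helper name `block_to_int_index` is hypothetical and is not a pandas function):

```python
import numpy as np


def block_to_int_index(blocs, blengths):
    """Expand block starts and lengths into explicit integer positions."""
    runs = [np.arange(start, start + length, dtype=np.int32)
            for start, length in zip(blocs, blengths)]
    if not runs:
        # No stored blocks at all: return an empty position array.
        return np.empty(0, dtype=np.int32)
    return np.concatenate(runs)


# Same block layout as the 10-element example above.
positions = block_to_int_index([0, 8], [2, 2])
print(positions)  # [0 1 8 9]
```

These positions, together with the stored non-fill values, are the kind of (coordinate, value) pairs a COO constructor takes, so no dense intermediate is needed.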

cottrell · 2014-08-18T23:13:21Z

Cool. I am not quite clear on whether the pandas.sparse structure should be the attachment point of some sort of unstack_to_scipy_sparse (this is a bad name) routine or not. Which do you think would make more sense (consider Series only for simplicity):

unstack_to_scipy_sparse for generic MultiIndexed Series
unstack_to_scipy_sparse for pandas.sparse MultiIndex Series

Basically, I think unstacking to scipy.sparse can happen with or without pandas.sparse storage.

jreback · 2014-08-18T23:17:53Z

I think it should go from a sparse structure. If you have a Series it is dense by definition (whether it is unstacked from something else or not). The key is that you can efficiently translate a pandas sparse structure to scipy without densifying. Anything else doesn't make much sense.

shoyer · 2014-08-20T07:58:10Z

@cottrell Think of s.to_sparse().unstack().to_coo() as a better way of spelling s.unstack_to_scipy_sparse().

byronyi · 2014-11-11T22:53:12Z

Now that v0.15.2 has been released, has this feature been implemented?

shoyer · 2014-11-11T23:03:25Z

@byronyi I don't think so -- that's why this issue is still open! :)

byronyi · 2014-11-11T23:05:18Z

As a suggestion, you can simply do it like this:

from scipy.sparse import coo_matrix
coo = coo_matrix((series.values, zip(*series.index.values)))

This assumes the series has a two-level integer MultiIndex.

We still need to handle the case where the index is not integer, in which case some kind of mapping from labels to positions is necessary.
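
A minimal sketch of such a mapping for non-integer labels, assuming a two-level MultiIndex and using `pd.factorize` to turn each level's labels into integer positions. The function name `series_to_coo` and the sample data are illustrative only and are not part of pandas:

```python
import pandas as pd
from scipy.sparse import coo_matrix


def series_to_coo(series):
    """Convert a Series with a two-level MultiIndex to (matrix, rows, cols)."""
    # factorize maps arbitrary labels to integer codes 0..k-1; sort=True
    # makes the codes follow the sorted order of the labels.
    row_codes, row_labels = pd.factorize(series.index.get_level_values(0), sort=True)
    col_codes, col_labels = pd.factorize(series.index.get_level_values(1), sort=True)
    shape = (len(row_labels), len(col_labels))
    matrix = coo_matrix((series.values, (row_codes, col_codes)), shape=shape)
    return matrix, list(row_labels), list(col_labels)


index = pd.MultiIndex.from_tuples([("b", "x"), ("a", "y"), ("b", "y")])
series = pd.Series([1.0, 2.0, 3.0], index=index)
matrix, row_labels, col_labels = series_to_coo(series)
print(matrix.toarray())
print(row_labels, col_labels)
```

The returned label lists are what lets the caller map matrix positions back to the original index values.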

byronyi · 2014-11-11T23:11:49Z

I think that implementing this will be kind of tricky when the index is not numerical (or does not even start from 0).

Nevertheless, we can still put it in the documentation on the Sparse page so people don't have to keep asking for answers.

jreback · 2014-11-11T23:39:50Z

@byronyi doc pull-requests are welcome!

sparse is kind of a neglected step-child ATM. need some interest from contributors!

will help you along. lmk.

cottrell · 2014-11-25T19:37:04Z

I have come back to this a few times and have yet to settle on what exactly is the right feature to implement (and haven't had much time to really play around unfortunately). One thing I should point out is that, as far as I know, there is no such thing as an n-dim sparse scipy array so I was having a false memory when I first wrote the comment above. I think there might be two separate features here?

1. SparseSeries.unstack -> SparseDataFrame and SparseDataFrame.stack/unstack -> SparseDataFrame.
2. SparseDataFrame.to_coo/to_csr methods as above.

I think 2 is simple and I will hopefully try this soon, but I am wondering if there is a simple method of handling the stack/unstack within the sparse framework. The only thing I could see quickly is to use the sparse constructor, but I am guessing that is not the right way to go.

cottrell · 2014-11-30T22:26:57Z

I hacked something together to demo how a SparseSeries.to_coo might behave (i.e. point 2 above but for Series). If this looks like it is moving in the right direction let me know and I can try to take this a little further.

https://gist.github.com/cottrell/a17fa777afd2cc4a7289

shoyer · 2014-12-01T01:21:47Z

@cottrell For your feature (1), do you really mean "SparseDataFrame.stack/unstack -> SparseDataFrame" or should that last SparseDataFrame be a SparsePanel? If so, I understand and agree with your two features.

It would indeed be nice to have an n-dimensional sparse scipy array -- too bad that doesn't exist! It would be an interesting side project to make that.

Your gist looks like roughly the right direction to me -- though we'll want to break that to_coo function into two parts (if feasible).

jreback · 2014-12-01T11:37:08Z

@cottrell your approach looks reasonable.

Having a SparseSeries.to_coo would be a great start!

pls do a PR and we can have a look at the impl.

Other things that will be necessary:

tests
doc example
vbench (just to track the perf over time)

Further, related to #4343, I think it would be straightforward to have the SparseSeries constructor accept a scipy.coo style matrix and return a SparseSeries (and potentially other types of Sparse 1-d structures).

This would be especially useful for testing.

You can implement this with a SparseSeries.from_coo helper method (and then just call it from __init__)
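
A rough sketch of what such a from_coo-style helper could look like. It assumes integer row/column positions are used directly as labels and returns an ordinary (dense-storage) Series holding only the stored entries, not a SparseSeries. The name `coo_to_series` is hypothetical:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def coo_to_series(matrix):
    """Return the stored entries of a scipy sparse matrix as a Series
    indexed by a (row, col) MultiIndex."""
    coo = coo_matrix(matrix)  # also accepts other scipy sparse formats
    index = pd.MultiIndex.from_arrays([coo.row, coo.col], names=["row", "col"])
    return pd.Series(coo.data, index=index)


A = coo_matrix(([3.0, 1.0, 2.0], ([0, 1, 1], [2, 0, 2])), shape=(2, 3))
s = coo_to_series(A)

# Round trip back to COO, which is the kind of check a test could use.
rows = np.asarray(s.index.get_level_values("row"))
cols = np.asarray(s.index.get_level_values("col"))
B = coo_matrix((s.values, (rows, cols)), shape=A.shape)
```

Note that the matrix shape is not recoverable from the Series alone, so the round trip passes it explicitly.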

cottrell · 2014-12-03T19:31:45Z

@shoyer Re: "that last SparseDataFrame be a SparsePanel" ... Does stack/unstack (on non-sparse DataFrames) ever take you to Panels? I only ever use stack/unstack to reshape values of DataFrame and modify the (MultiIndex) columns and indices.

shoyer · 2014-12-04T01:00:15Z

@cottrell I was confused. You are correct.

cottrell · 2014-12-14T22:03:09Z

I've created a PR as requested: #9076. There is still some work to do (I haven't updated the docs yet, for example) but it would be good to get some feedback. Also, I am having trouble with Travis CI failures as of this weekend. Even 15.2 appears to be failing now.

jreback · 2015-03-03T00:59:19Z

closed by #9076

kernc · 2017-05-03T10:55:20Z

Does anyone object to SparseSeries no longer having to_coo()/from_coo() methods in favor of SparseDataFrame providing roughly the same API (accepting sparse matrices in the constructor and implementing a somewhat simpler to_coo())?

Discuss in #15634.

cottrell · 2017-05-04T18:50:41Z

I haven't read through the new SparseDataFrame API yet, but I think the main convenience of the _coo methods was to_coo in the case len(row_levels) > 1 or len(column_levels) > 1, where you need to effectively turn an index of tuples into a single index and get at the codes. There are probably better ways to do this by directly accessing the internals (index labels and hashing tricks on arrays).

For my understanding, is there a replacement function that passes the tests for to_coo and only uses the SparseDataFrame api or would this be a bit of a feature drop (probably fine if no one else is using this stuff)?
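
To make the "index of tuples into a single index" step concrete, here is a minimal sketch that merges several levels into one integer code per row using `pd.factorize` and `np.ravel_multi_index`. This is an illustration of the idea, not the implementation used by to_coo, and `merge_level_codes` is a hypothetical name:

```python
import numpy as np
import pandas as pd


def merge_level_codes(index, levels):
    """Collapse several MultiIndex levels into one integer code per row.

    Returns (codes, labels), where labels[k] is the tuple of level values
    that code k stands for, in sorted order.
    """
    factorized = [pd.factorize(index.get_level_values(level), sort=True)
                  for level in levels]
    level_codes = tuple(codes for codes, _ in factorized)
    sizes = tuple(len(uniques) for _, uniques in factorized)
    # Combine the per-level codes into one flat integer per row. This can
    # overflow if the product of the level sizes is very large.
    flat = np.ravel_multi_index(level_codes, sizes)
    # Keep only the combinations that occur and renumber them 0..k-1.
    present, codes = np.unique(flat, return_inverse=True)
    positions = np.unravel_index(present, sizes)
    labels = list(zip(*(uniques[pos].tolist()
                        for (_, uniques), pos in zip(factorized, positions))))
    return codes, labels


index = pd.MultiIndex.from_tuples(
    [(1, 2, "a", 0), (1, 1, "b", 0), (1, 1, "b", 1)],
    names=["A", "B", "C", "D"])
row_codes, row_labels = merge_level_codes(index, ["A", "B"])
col_codes, col_labels = merge_level_codes(index, ["C", "D"])
```

The resulting `row_codes` and `col_codes` can be passed straight to `scipy.sparse.coo_matrix` as coordinates, with `row_labels` and `col_labels` kept alongside as the dimension labels.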

kernc · 2017-06-07T11:15:38Z

@cottrell sorry for the delay. Scipy sparse only supports 2d matrices, so with a multi-level indexed series, one would first transform/unstack into a 2d sparse dataframe and then call .to_coo() on it. Like:

# Old .to_coo()
>>> ss = pd.SparseSeries(
...     [3.0, 1.0, 3.0],
...     index=pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
...                                      (1, 1, 'b', 0),
...                                      (1, 1, 'b', 1)],
...                                      names=['A', 'B', 'C', 'D']))
>>> ss
A  B  C  D
1  2  a  0    3.0
   1  b  0    1.0
         1    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

>>> A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
...                              column_levels=['C', 'D'],
...                              sort_labels=True)
>>> A.toarray()
array([[ 0.,  1.,  3.],
       [ 3.,  0.,  0.]])
>>> rows
[(1, 1), (1, 2)]
>>> columns
[('a', 0), ('b', 0), ('b', 1)]

# Instead, we could ... (new)
>>> ss2 = ss.copy()
# ... use any means to make a two-level index
>>> ss2.index = pd.MultiIndex.from_tuples([(v[:2], v[2:])
...                                        for v in ss.index.values],
...                                       names=['AB', 'CD'])
>>> ss2
AB      CD    
(1, 2)  (a, 0)    3.0
(1, 1)  (b, 0)    1.0
        (b, 1)    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

# Would work either way because series unstacks into a sparse data frame ...
>>> sdf = ss2.unstack()
>>> sdf
Out[34]: 
CD      (a, 0)  (b, 0)  (b, 1)
AB                            
(1, 1)     NaN     1.0     3.0
(1, 2)     3.0     NaN     NaN

# ... which has .to_coo()
>>> A = sdf.to_coo()
>>> A
<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>
>>> A.toarray()
array([[ 0.,  1.,  3.],
       [ 3.,  0.,  0.]])
>>> sdf.index  # rows
Index([(1, 1), (1, 2)], dtype='object', name='AB')
>>> sdf.columns.tolist()  # columns, as a list
[('a', 0), ('b', 0), ('b', 1)]

Unstacking SparseSeries currently doesn't work, but is fixed in #16616.

cottrell · 2017-06-08T21:55:50Z

Yeah, that functionally seems to cover it. I have a feeling from_tuples is pretty bad performance wise.

It seems like all the usefulness of that feature really comes down to just being able to efficiently create codes for groups of levels (merging levels like you show above). After that, it is just record keeping to get the labels. Do you know if there is a more efficient version of something like this somewhere in the pandas code base?


import numpy as np


def index_hash(*codes):
    # Stack the per-level integer codes into an (m, n) array, one column per level.
    m = len(codes[0])
    n = len(codes)
    a = np.empty((m, n), dtype=np.int64)
    for i in range(n):
        a[:, i] = codes[i]
    # Hash the raw bytes of each row to get a single key per row.
    a = np.apply_along_axis(lambda x: hash(x.data.tobytes()), 1, a)
    return a

def index_hash_from_df(df, levels=None, catcols=None):
    assert (levels is not None) or (catcols is not None), 'Must not both be None'
    codes = list()
    if levels is not None:
        # Integer codes of the requested MultiIndex levels.
        for level in levels:
            codes.append(df.index.labels[df.index._get_level_number(level)].values())
    if catcols is not None:
        # Integer codes of the requested categorical columns.
        for catcol in catcols:
            codes.append(df[catcol].cat.codes.values)
    return index_hash(*codes)

jreback · 2017-06-09T11:04:13Z

@cottrell looks like you need generalized data hashing; try pandas.util.hash_pandas_object. These routines are highly performant (and used internally in indexing).
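
A minimal sketch of how `pandas.util.hash_pandas_object` might be used for the level-merging step, assuming the levels are first laid out as DataFrame columns. The variable names and sample index are illustrative; only `hash_pandas_object` and `pd.factorize` are pandas functions:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 2, "a", 0), (1, 1, "b", 0), (1, 1, "b", 1)],
    names=["A", "B", "C", "D"])

# Lay the index levels out as ordinary columns.
frame = pd.DataFrame({name: np.asarray(index.get_level_values(name))
                      for name in index.names})

# One uint64 hash per row over the chosen group of levels.
row_hash = pd.util.hash_pandas_object(frame[["A", "B"]], index=False)
col_hash = pd.util.hash_pandas_object(frame[["C", "D"]], index=False)

# Turn the hashes into compact integer codes, in order of first appearance.
row_codes, _ = pd.factorize(row_hash)
col_codes, _ = pd.factorize(col_hash)
```

Rows that share the same values in the chosen levels get the same hash and therefore the same code. The labels for each code still have to be tracked separately, for example by taking the first row seen for each code.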

jreback mentioned this issue Aug 17, 2014

Sparse data structure support for MultiIndex? #445

Closed

jreback added this to the 0.15.1 milestone Aug 17, 2014

jreback added the Sparse label Aug 17, 2014

cottrell mentioned this issue Dec 16, 2014

Add SparseSeries.to_coo method, a single test and one example. #9076

Closed

jreback added the Enhancement label Jan 2, 2015

jreback closed this as completed Mar 3, 2015

shoyer mentioned this issue Apr 11, 2015

API: SparseSeries.to_frame and SparseDataFrame.to_panel result in dense structures #9850

Closed

kernc mentioned this issue Apr 24, 2017

API/DEPR: deprecate SparseSeries.from_coo and accept in constructor #15634

Closed
