pandas MultiIndex series "unstack" to scipy sparse array functionality #8048

Closed
cottrell opened this issue Aug 17, 2014 · 26 comments
Labels
Enhancement Sparse Sparse Data Type

@cottrell
Contributor

related #4343

I have found myself on occasion writing code to convert from a pandas Series (with an n-level MultiIndex) to a scipy sparse n-dim array (plus dimension labels) and back. I am not terribly familiar with the newer sparse data structure features, but my impression is that they provide a totally different kind of functionality. Does this kind of functionality exist somewhere? If not, would something like this be of interest?

@TomAugspurger
Contributor

Do you have any short, concrete examples of what you're doing? Also let us know why using a SparseSeries doesn't work for it.

@cottrell
Contributor Author

Here is a quick example of going from a scipy.sparse array to a pandas Series and back. Mostly I am thinking of the case where you are data munging: you read in labeled data and then need to switch to a sparse matrix in order to use something in scipy.sparse (multiplication, sparse SVDs, or whatever). I am running this on 0.14.1. Apologies if this functionality is already available on the dev branch. I have seen related questions on Stack Overflow, but they all seem to indicate that this is not yet implemented, and no one has given a strong opinion that it should be included either (maybe there is a good reason to avoid this kind of thing).

import scipy.sparse.linalg
import scipy.sparse
import pandas

# * limit ourselves to the 2d situation for now
# * also, for simplicity, all column and row labels are just the integer indices

# create a sparse array
m = int(1e6)
n = int(1e5)
A = scipy.sparse.rand(m, n, density=1e-8)

# scipy.sparse -> sparse series? Is there a better way?
s = pandas.Series(A.data)
s.index = pandas.MultiIndex.from_tuples(list(zip(A.row, A.col)))
s = s.to_sparse(fill_value=0)

# sparse series -> scipy.sparse array? s.unstack() is producing errors for me
i, j = list(zip(*s.index))
data = s.values
sA = scipy.sparse.coo_matrix((data, (i, j)), shape=(m, n))

check = A - sA
print('This should be zero', abs(check).max())

@TomAugspurger
Contributor

Thanks for the example. I agree with your intuition that s.unstack() should do something like what you want. I haven't used the sparse structures at all, but I'll take a look later. It may just not be implemented yet.

@jreback
Contributor

jreback commented Aug 17, 2014

@cottrell This would definitely be an improvement. See the related issue I linked to. I think several of the sparse routines could easily have an option to return a scipy coo type (or other sparse matrix type). It should be straightforward starting from a pandas sparse structure (these are very similar to coo types).

@jreback jreback added this to the 0.15.1 milestone Aug 17, 2014
@jreback jreback added the Sparse label Aug 17, 2014
@jreback
Contributor

jreback commented Aug 17, 2014

Here's an idea of the internal structure.

In [40]: ts = Series(randn(10))

In [41]: ts[2:-2] = np.nan

In [42]: sts = ts.to_sparse()

In [43]: sts
Out[43]: 
0   -0.785433
1    1.791538
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8   -1.192896
9    2.672896
dtype: float64
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

In [44]: sts._data.values
Out[44]: 
[-0.785433163534, 1.7915375118, nan, nan, nan, nan, nan, nan, -1.19289573606, 2.67289566074]
Fill: nan
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)

In [45]: sts._data
Out[45]: 
SingleBlockManager
Items: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
SparseBlock: 10 dtype: float64

In [46]: sts._data.values.sp_index.blengths
Out[46]: array([2, 2], dtype=int32)

In [47]: sts._data.values.sp_index.blocs
Out[47]: array([0, 8], dtype=int32)
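
For readers unfamiliar with the block layout printed above, here is a minimal sketch of how block locations and block lengths expand into the integer positions of the stored values, which is the information a COO-style conversion would need. The helper name `block_positions` is hypothetical and is not part of pandas; only NumPy is assumed.

```python
import numpy as np


def block_positions(blocs, blengths):
    """Expand (start, length) blocks into the positions of the stored values.

    Hypothetical helper for illustration only; not a pandas function.
    """
    blocs = np.asarray(blocs, dtype=np.int64)
    blengths = np.asarray(blengths, dtype=np.int64)
    if len(blocs) == 0:
        return np.empty(0, dtype=np.int64)
    # Each block contributes the consecutive positions start .. start + length - 1.
    return np.concatenate(
        [np.arange(start, start + length) for start, length in zip(blocs, blengths)]
    )


# Block locations and lengths taken from the session above.
positions = block_positions([0, 8], [2, 2])
print(positions)  # [0 1 8 9]
```

Positions 0, 1, 8 and 9 are exactly the non-NaN entries of the series printed in Out[43].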

@cottrell
Contributor Author

Cool. I am not quite clear on whether the pandas.sparse structure should be the attachment point of some sort of unstack_to_scipy_sparse (this is a bad name) routine or not. Which do you think would make more sense (consider Series only for simplicity):

  1. unstack_to_scipy_sparse for generic MultiIndexed Series
  2. unstack_to_scipy_sparse for pandas.sparse MultiIndex Series

Basically, I think unstacking to scipy.sparse can happen with or without pandas.sparse storage.

@jreback
Contributor

jreback commented Aug 18, 2014

I think it should go from a sparse structure. If you have a Series it is dense by definition (whether it is unstacked from something else or not). The key is that you can efficiently translate a pandas sparse structure to scipy w/o densifying. Anything else doesn't make much sense.

@shoyer
Member

shoyer commented Aug 20, 2014

@cottrell Think of s.to_sparse().unstack().to_coo() as a better way of spelling s.unstack_to_scipy_sparse().

@byronyi

byronyi commented Nov 11, 2014

Now that v0.15.2 has been released, is this feature implemented?

@shoyer
Member

shoyer commented Nov 11, 2014

@byronyi I don't think so -- that's why this issue is still open! :)

@byronyi

byronyi commented Nov 11, 2014

As a suggestion, you can simply do this:

from scipy.sparse import coo_matrix
coo = coo_matrix((series.values, zip(*series.index.values)))

This assumes the series has a two-level integer MultiIndex.

But we also need to handle the case where the index is not integer, in which case some kind of mapping from labels to positions is necessary.
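
One way such a mapping could look is sketched below, assuming NumPy and SciPy only. The helper `labeled_to_coo` is hypothetical (it is not a pandas or SciPy function), and the label arrays stand in for what `series.index.get_level_values(0)` and `series.index.get_level_values(1)` would return for a two-level MultiIndex.

```python
import numpy as np
from scipy.sparse import coo_matrix


def labeled_to_coo(values, row_labels, col_labels):
    """Build a COO matrix from labeled entries.

    Hypothetical helper for illustration. Returns (matrix, rows, columns),
    where rows[i] / columns[j] is the label of matrix row i / column j.
    """
    # np.unique sorts the distinct labels and, with return_inverse=True,
    # also returns each entry's position in that sorted array (its code).
    rows, i = np.unique(np.asarray(row_labels), return_inverse=True)
    cols, j = np.unique(np.asarray(col_labels), return_inverse=True)
    data = np.asarray(values, dtype=float)
    matrix = coo_matrix((data, (i, j)), shape=(len(rows), len(cols)))
    return matrix, rows, cols


A, rows, cols = labeled_to_coo([3.0, 1.0, 3.0], ["x", "y", "y"], ["b", "a", "b"])
print(rows.tolist(), cols.tolist())  # ['x', 'y'] ['a', 'b']
print(A.toarray())
```

Because the labels are sorted, the row and column order of the matrix does not depend on the order of the entries; the returned label arrays are what lets you map matrix positions back to the original index.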

@byronyi

byronyi commented Nov 11, 2014

I think that implementing this will be kind of tricky when the index is not numerical (or does not even start from 0).

Nevertheless, we can still put it in the documentation on the Sparse page so people don't have to keep asking for answers.

@jreback
Contributor

jreback commented Nov 11, 2014

@byronyi doc pull-requests are welcome!

sparse is kind of a neglected step-child ATM. need some interest from contributors!

will help you along. lmk.

@cottrell
Contributor Author

I have come back to this a few times and have yet to settle on what exactly is the right feature to implement (and haven't had much time to really play around unfortunately). One thing I should point out is that, as far as I know, there is no such thing as an n-dim sparse scipy array so I was having a false memory when I first wrote the comment above. I think there might be two separate features here?

  1. SparseSeries.unstack -> SparseDataFrame and SparseDataFrame.stack/unstack -> SparseDataFrame.
  2. SparseDataFrame.to_coo/to_csr methods as above.

I think 2 is simple and I will hopefully try this soon, but I am wondering if there is a simple method of handling the stack/unstack within the sparse framework. The only thing I could see quickly is to use the sparse constructor, but I am guessing that is not the right way to go.

@cottrell
Contributor Author

I hacked something together to demo how a SparseSeries.to_coo might behave (i.e. point 2 above, but for Series). If this looks like it is moving in the right direction, let me know and I can try to take this a little further.

https://gist.github.com/cottrell/a17fa777afd2cc4a7289

@shoyer
Member

shoyer commented Dec 1, 2014

@cottrell For your feature (1), do you really mean "SparseDataFrame.stack/unstack -> SparseDataFrame" or should that last SparseDataFrame be a SparsePanel? If so, I understand and agree with your two features.

It would indeed be nice to have an n-dimensional sparse scipy array -- too bad that doesn't exist! It would be an interesting side project to make that.

Your gist looks like roughly the right direction to me -- though we'll want to break that to_coo function into two parts (if feasible).

@jreback
Contributor

jreback commented Dec 1, 2014

@cottrell your approach looks reasonable.

Having a SparseSeries.to_coo would be a great start!

pls do a PR and we can have a look at the impl.

Other things that will be necessary:

  • tests
  • doc example
  • vbench (just to track the perf over time)

Further, related to #4343, I think it would be straightforward to have the SparseSeries constructor accept a scipy.coo style matrix and return a SparseSeries (and potentially other types of Sparse 1-d structures).

This would be especially useful for testing.

You can implement this with a SparseSeries.from_coo helper method (and then just call it from __init__)
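
To make the shape of such a helper concrete, here is a rough sketch assuming pandas and SciPy. The function `series_from_coo` is hypothetical and is not the pandas implementation; it returns a plain (dense-typed) Series holding only the stored entries, indexed by (row, col), rather than a SparseSeries.

```python
import pandas as pd
from scipy.sparse import coo_matrix


def series_from_coo(A):
    """Return the stored values of A as a Series indexed by (row, col).

    Hypothetical helper for illustration only.
    """
    A = coo_matrix(A)  # accepts other scipy sparse formats as well
    index = pd.MultiIndex.from_arrays([A.row, A.col], names=["row", "col"])
    return pd.Series(A.data, index=index)


A = coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 0, 2])), shape=(3, 4))
s = series_from_coo(A)
print(s)

# Round trip back to scipy, reusing the original shape.
B = coo_matrix(
    (
        s.values,
        (
            s.index.get_level_values("row").values,
            s.index.get_level_values("col").values,
        ),
    ),
    shape=A.shape,
)
```

Note that the matrix shape is not recoverable from the Series alone (trailing empty rows or columns leave no trace in the index), so a real from_coo/to_coo pair would need to carry it separately, as the round trip above does.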

@cottrell
Contributor Author

cottrell commented Dec 3, 2014

@shoyer Re: "that last SparseDataFrame be a SparsePanel" ... Does stack/unstack (on non-sparse DataFrames) ever take you to Panels? I only ever use stack/unstack to reshape values of DataFrame and modify the (MultiIndex) columns and indices.

@shoyer
Member

shoyer commented Dec 4, 2014

@cottrell I was confused. You are correct.

@cottrell
Contributor Author

I've created a PR as requested: #9076. There is still some work to do (I haven't updated the docs yet, for example) but it would be good to get some feedback. Also, I am having trouble with Travis CI failures as of this weekend. Even 15.2 appears to be failing now.

@jreback
Contributor

jreback commented Mar 3, 2015

closed by #9076

@kernc
Contributor

kernc commented May 3, 2017

Does anyone object to SparseSeries no longer having to_coo()/from_coo() methods in favor of SparseDataFrame providing roughly the same API (accepting sparse matrices in the constructor and implementing a somewhat simpler to_coo())?

Discuss in #15634.

@cottrell
Contributor Author

cottrell commented May 4, 2017

I haven't read through the new SparseDataFrame API yet, but I think the main convenience of the _coo methods was to_coo in the case len(row_levels) > 1 or len(column_levels) > 1, where you need to effectively turn an index of tuples into a single index and get at the codes. There are probably better ways to do this by directly accessing the internals (index labels and hashing tricks on arrays).

For my understanding, is there a replacement function that passes the tests for to_coo and only uses the SparseDataFrame api or would this be a bit of a feature drop (probably fine if no one else is using this stuff)?

@kernc
Contributor

kernc commented Jun 7, 2017

@cottrell sorry for the delay. Scipy sparse only supports 2d matrices, so with a multi-level indexed series, one would first transform/unstack into a 2d sparse dataframe and then call .to_coo() on it. Like:

# Old .to_coo()
>>> ss = pd.SparseSeries(
...     [3.0, 1.0, 3.0],
...     index=pd.MultiIndex.from_tuples([(1, 2, 'a', 0),
...                                      (1, 1, 'b', 0),
...                                      (1, 1, 'b', 1)],
...                                      names=['A', 'B', 'C', 'D']))
>>> ss
A  B  C  D
1  2  a  0    3.0
   1  b  0    1.0
         1    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

>>> A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
...                              column_levels=['C', 'D'],
...                              sort_labels=True)
>>> A.toarray()
array([[ 0.,  1.,  3.],
       [ 3.,  0.,  0.]])
>>> rows
[(1, 1), (1, 2)]
>>> columns
[('a', 0), ('b', 0), ('b', 1)]

# Instead, we could ... (new)
>>> ss2 = ss.copy()
# ... use any means to make a two-level index
>>> ss2.index = pd.MultiIndex.from_tuples([(v[:2], v[2:])
...                                        for v in ss.index.values],
...                                       names=['AB', 'CD'])
>>> ss2
AB      CD    
(1, 2)  (a, 0)    3.0
(1, 1)  (b, 0)    1.0
        (b, 1)    3.0
dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)

# Would work either way because series unstacks into a sparse data frame ...
>>> sdf = ss2.unstack()
>>> sdf
CD      (a, 0)  (b, 0)  (b, 1)
AB                            
(1, 1)     NaN     1.0     3.0
(1, 2)     3.0     NaN     NaN

# ... which has .to_coo()
>>> A = sdf.to_coo()
>>> A
<2x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in COOrdinate format>
>>> A.toarray()
array([[ 0.,  1.,  3.],
       [ 3.,  0.,  0.]])
>>> sdf.index  # rows
Index([(1, 1), (1, 2)], dtype='object', name='AB')
>>> sdf.columns.tolist()  # columns, as a list
[('a', 0), ('b', 0), ('b', 1)]

Unstacking SparseSeries currently doesn't work, but is fixed in #16616.

@cottrell
Contributor Author

cottrell commented Jun 8, 2017

Yeah, that functionally seems to cover it. I have a feeling from_tuples is pretty bad performance-wise.

It seems like all the usefulness of that feature really comes down to just being able to efficiently create codes for groups of levels (merging levels like you show above). After that, it is just record keeping to get the labels. Do you know if there is a more efficient version of something like this somewhere in the pandas code base?


import numpy as np

def index_hash(*codes):
    # Stack the per-level code arrays as the columns of an (m, n) int64 array.
    m = len(codes[0])
    n = len(codes)
    a = np.empty((m, n), dtype=np.int64)
    for i in range(n):
        a[:, i] = codes[i]
    # Hash the raw bytes of each row: one hash per combination of codes.
    a = np.apply_along_axis(lambda x: hash(x.data.tobytes()), 1, a)
    return a

def index_hash_from_df(df, levels=None, catcols=None):
    assert (levels is not None) or (catcols is not None), 'Must not both be None'
    codes = list()
    if levels is not None:
        # Integer codes of the requested index levels.
        for level in levels:
            codes.append(df.index.labels[df.index._get_level_number(level)].values())
    if catcols is not None:
        # Integer codes of the requested categorical columns.
        for catcol in catcols:
            codes.append(df[catcol].cat.codes.values)
    return index_hash(*codes)
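
As an aside on the question above, one hash-free way to merge per-level codes is to treat them as digits of a mixed-radix number. The sketch below assumes NumPy only; `combine_codes` is a hypothetical name, not something from the pandas code base. It requires non-negative codes (a -1 code for a missing value would raise) and a product of level sizes that fits in the platform integer.

```python
import numpy as np


def combine_codes(codes, sizes):
    """Merge per-level integer codes into one compact code per row.

    Hypothetical helper for illustration only.
    codes: sequence of equal-length non-negative integer arrays, one per level
    sizes: number of distinct values in each level
    """
    # Combine each row's codes into a single flat integer (mixed radix).
    flat = np.ravel_multi_index(tuple(np.asarray(c) for c in codes), sizes)
    # Compress the flat integers that actually occur to 0..k-1.
    uniques, compact = np.unique(flat, return_inverse=True)
    # Record keeping: per-level codes behind each compact code.
    level_codes = np.unravel_index(uniques, sizes)
    return compact, level_codes


# Levels A and B of the three-row example earlier in the thread:
# (A, B) = (1, 2), (1, 1), (1, 1)
a_codes = np.array([0, 0, 0])  # A has one distinct value: 1 -> 0
b_codes = np.array([1, 0, 0])  # B has two distinct values: 1 -> 0, 2 -> 1
compact, level_codes = combine_codes([a_codes, b_codes], (1, 2))
print(compact)  # [1 0 0]
```

Here the first entry lands in row 1 and the other two in row 0, which matches the row order [(1, 1), (1, 2)] in the to_coo output shown earlier.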

@jreback
Contributor

jreback commented Jun 9, 2017

@cottrell looks like you need generalized data hashing; try pandas.util.hash_pandas_object. These hashing routines are highly performant (and used internally in indexing).
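
A small sketch of how that suggestion might be applied to the level-merging question, assuming a pandas version that provides `pandas.util.hash_pandas_object`. The frame `levels` is made-up illustration data holding the values of the levels to be merged.

```python
import pandas as pd

# Values of levels A and B for three entries (illustrative data only).
levels = pd.DataFrame({"A": [1, 1, 1], "B": [2, 1, 1]})

# index=False hashes only the row values, not the row labels.
# The result is a Series with one uint64 hash per row.
row_hash = pd.util.hash_pandas_object(levels, index=False)

# factorize turns the hashes into compact integer codes,
# numbered in order of first appearance.
codes, uniques = pd.factorize(row_hash)
print(codes)  # [0 1 1]
```

Identical rows always receive identical hashes, so they share a code. Distinct rows could in principle collide, so code that must be exact would still need to compare the underlying values.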
