Roadmap: pandas SparseDataFrame may be deprecated #12800

Closed
justmarkham opened this issue Dec 17, 2018 · 16 comments · Fixed by #16728
@justmarkham
Contributor

justmarkham commented Dec 17, 2018

I noticed in the new roadmap that the second item on the list relates to "Pandas DataFrames and SparseDataFrames". I wanted to mention that according to an August 2018 talk by @datapythonista (who is a pandas core contributor), SparseDataFrames will be deprecated. Here are the slides from the talk (in notebook form).

I mentioned this to @amueller on Twitter, and he suggested opening an issue to document this.

This PR seems to be the most up-to-date discussion among the pandas team about deprecation.

Hope this is helpful!

@justmarkham
Contributor Author

justmarkham commented Dec 17, 2018

Additional information that I just noticed on Twitter, from @jorisvandenbossche (another pandas core contributor):

Small clarification: for now, DataFrames with sparse data will not be deprecated, but we are considering to deprecate the SparseDataFrame subclass (because a normal DataFrame can also hold sparse data) https://mail.python.org/pipermail/pandas-dev/2018-November/000855.html

@amueller
Member

amueller commented Dec 17, 2018

Thanks for the input.
So it looks like they might

Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

So maybe we should change it to "DataFrames for dense and sparse data" to accommodate whatever solution they end up going with.

@TomAugspurger
Contributor

TomAugspurger commented Dec 17, 2018

Just to note: the memory usage of the old SparseDataFrame and the "new" (it's been possible for a while) DataFrame where some of the columns are sparse is identical.

And I don't think either is that good for very wide matrices. Each column is stored independently as two arrays: one for the "sparse index" (the positions where a value is explicitly stored) and one for the "sparse values" (the value at each sparse index position).
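
To make the per-column layout described above concrete, here is a minimal sketch using `pd.arrays.SparseArray`. The data are made up for illustration, and the sketch assumes a reasonably recent pandas where `pd.arrays.SparseArray` is available.

```python
import pandas as pd

# Illustrative data (an assumption, not from the thread): mostly zeros,
# so 0 is used as the fill value.
arr = pd.arrays.SparseArray([0, 0, 3, 0, 5], fill_value=0)

# The "sparse values": only the explicitly stored entries.
values = arr.sp_values

# The "sparse index": the positions of those stored entries.
positions = arr.sp_index.to_int_index().indices

print(values)
print(positions)
```

Since every column carries its own pair of arrays, a very wide frame pays this per-column overhead once for each column.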

@jnothman
Member

jnothman commented Dec 18, 2018

@mitar
Contributor

mitar commented May 26, 2019

So if one passed a DataFrame with SparseSeries columns to a scikit-learn method that accepts a "sparse array" as input, would that work or explode at the moment?

@jnothman
Member

jnothman commented May 26, 2019

@jorisvandenbossche
Member

jorisvandenbossche commented May 26, 2019

A check_array on a (sparse) DataFrame will convert it to a numpy array (__array__), which for a sparse DataFrame means it gets densified, and memory usage will increase.
You can convert your sparse DataFrame to a scipy sparse matrix, which is probably the best approach for now if you want to keep it sparse.
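
As a rough sketch of the two paths described above: the first conversion densifies, the second keeps the data sparse. It assumes a pandas version that provides the `DataFrame.sparse` accessor (0.25 or later); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Illustrative all-sparse DataFrame; fill_value=0.0 is set explicitly
# because float SparseArrays default to NaN as the fill value.
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 1.0, 0.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 0.0, 2.0, 0.0], fill_value=0.0),
})

# Going through __array__ produces a dense NumPy array.
X_dense = np.asarray(df)

# Converting explicitly keeps only the stored entries.
X_sparse = df.sparse.to_coo().tocsr()

print(X_dense.shape, sparse.issparse(X_sparse), X_sparse.nnz)
```

The resulting scipy matrix can then be passed to any estimator that accepts sparse input.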

@mitar
Contributor

mitar commented May 26, 2019

So for me the beauty of using DataFrame is that I can have both string columns and numerical columns and throw it at random forest and it just works.

Now, with some columns sparse, we cannot just convert it to a scipy sparse matrix, because the latter does not support object dtype.

What would be the ideal implementation here to support this properly in scikit-learn?

@jnothman
Member

jnothman commented May 26, 2019

@amueller
Member

amueller commented Sep 18, 2019

I agree that we should make it a scipy sparse matrix. Should densifying be considered a bug? Also cc @thomasjpfan as a potential todo.

@adrinjalali adrinjalali added this to To do in Pandas Oct 21, 2019
Pandas automation moved this from To do to Done Mar 27, 2020
@mitar
Contributor

mitar commented Mar 27, 2020

So currently sparse pandas DataFrames are supported only if all columns are sparse. What about supporting the mixed case, where dense columns could be converted to sparse ones?

@thomasjpfan
Member

thomasjpfan commented Apr 5, 2020

What kind of behavior would be reasonable for the mixed case?

@mitar
Contributor

mitar commented Apr 5, 2020

Making dense columns into sparse columns, I would say, and then everything into sparse?
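
Done by hand on the user side, that conversion might look like the sketch below. This is not scikit-learn behavior; it assumes all columns are numeric, that 0.0 is an acceptable fill value, and the column names are made up for illustration.

```python
import pandas as pd
from scipy import sparse

# Illustrative mixed frame: "x" is dense, "s" is already sparse.
df = pd.DataFrame({
    "x": [1.0, 0.0, 3.0, 0.0],
    "s": pd.arrays.SparseArray([0.0, 1.0, 0.0, 2.0], fill_value=0.0),
})

# Cast every column to the same sparse dtype (assumed fill value: 0.0).
all_sparse = df.astype(pd.SparseDtype("float64", fill_value=0.0))

# With all columns sparse, the frame converts to scipy without densifying.
X = all_sparse.sparse.to_coo().tocsr()

print(X.shape, X.nnz)
```

Whether this saves memory depends on how many zeros the dense columns actually contain.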

@jnothman
Member

jnothman commented Apr 5, 2020

@mitar
Contributor

mitar commented Apr 5, 2020

That would be ideal, but it might be hard to know in advance?

@TomAugspurger
Contributor

TomAugspurger commented Apr 6, 2020

I don't think trying to guess here is a good idea. I would direct the user to process the dense and sparse subsets independently using a ColumnTransformer. Or, if they really want them processed together, the user should convert the dense columns to sparse ahead of time.
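
A minimal sketch of the ColumnTransformer route, under some assumptions: the column names, data, and choice of scalers are illustrative only, and it assumes a scikit-learn version (0.23 or later) in which all-sparse DataFrames are converted to scipy sparse matrices during input validation.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Illustrative mixed frame: "x" is dense, "s1" and "s2" are sparse.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "s1": pd.arrays.SparseArray([0.0, 1.0, 0.0, 0.0], fill_value=0.0),
    "s2": pd.arrays.SparseArray([0.0, 0.0, 2.0, 0.0], fill_value=0.0),
})

# Each subset gets its own transformer, so neither has to guess
# how to treat the other kind of column.
ct = ColumnTransformer(
    [
        ("dense", StandardScaler(), ["x"]),
        # MaxAbsScaler accepts sparse input and preserves sparsity.
        ("sparse", MaxAbsScaler(), ["s1", "s2"]),
    ],
    # Assumed setting: prefer a sparse result when any block is sparse.
    sparse_threshold=1.0,
)

Xt = ct.fit_transform(df)
print(Xt.shape, sparse.issparse(Xt))
```

The transformers here are placeholders; the point is only that the dense and sparse column groups are routed separately.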
