Roadmap: pandas SparseDataFrame may be deprecated #12800

Open
justmarkham opened this issue Dec 17, 2018 · 10 comments

Comments

Contributor

@justmarkham justmarkham commented Dec 17, 2018

I noticed in the new roadmap that the second item on the list relates to "Pandas DataFrames and SparseDataFrames". I wanted to mention that according to an August 2018 talk by @datapythonista (who is a pandas core contributor), SparseDataFrames will be deprecated. Here are the slides from the talk (in notebook form).

I mentioned this to @amueller on Twitter, and he suggested opening an issue to document this.

This PR seems to be the most up-to-date discussion among the pandas team about deprecation.

Hope this is helpful!

Contributor Author

@justmarkham justmarkham commented Dec 17, 2018

Additional information that I just noticed on Twitter from @jorisvandenbossche (another pandas core contributor):

Small clarification: for now, DataFrames with sparse data will not be deprecated, but we are considering to deprecate the SparseDataFrame subclass (because a normal DataFrame can also hold sparse data) https://mail.python.org/pipermail/pandas-dev/2018-November/000855.html
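To illustrate the point about a normal DataFrame holding sparse data, here is a minimal sketch. It assumes a pandas version that provides `pd.arrays.SparseArray` and `pd.SparseDtype`; the column names and values are made up for illustration.

```python
import pandas as pd

# A plain DataFrame (not the SparseDataFrame subclass) with one dense
# column and one sparse column. Column names are illustrative only.
df = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0, 4.0],
    "sparse": pd.arrays.SparseArray([0.0, 0.0, 5.0, 0.0], fill_value=0.0),
})

# The container is an ordinary DataFrame; only the column dtype is sparse.
print(type(df).__name__)
print(df.dtypes)

# Fraction of explicitly stored (non-fill) values in the sparse column.
print(df["sparse"].sparse.density)
```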

Member

@amueller amueller commented Dec 17, 2018

Thanks for the input.
So it looks like they might

Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

So maybe we should change it to "DataFrames for dense and sparse data" to accommodate whatever solution they are going for.

Contributor

@TomAugspurger TomAugspurger commented Dec 17, 2018

Just to note: the memory usage of the old SparseDataFrame and the "new" (it's been possible for a while) DataFrame where some of the columns are sparse is identical.

And I don't think either is that good for very wide matrices. Each column is stored independently as two arrays: one for the "sparse index" (the positions where a value is explicitly stored) and one for the "sparse values" (the value at each sparse index position).
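The two-array layout described above can be inspected on a single sparse column. This is a rough sketch, assuming `pd.arrays.SparseArray` with its `sp_index` and `sp_values` attributes; the data is invented for illustration.

```python
import numpy as np
import pandas as pd

# A mostly-zero column stored sparsely, with 0 as the implicit fill value.
dense = np.array([0, 0, 3, 0, 5, 0, 0, 0], dtype="int64")
column = pd.arrays.SparseArray(dense, fill_value=0)

# "Sparse index": positions where a value is explicitly stored.
positions = column.sp_index.to_int_index().indices.tolist()

# "Sparse values": the value stored at each of those positions.
values = column.sp_values.tolist()

print(positions)  # [2, 4]
print(values)     # [3, 5]
```

Since every column carries its own pair of arrays, a very wide frame pays that per-column overhead many times over, which is presumably why this layout is less attractive for wide matrices.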

Member

@jnothman jnothman commented Dec 18, 2018

Contributor

@mitar mitar commented May 26, 2019

So if one passed a DataFrame with SparseSeries columns into a scikit-learn method that accepts a "sparse array" as input, would that work or explode at the moment?

Member

@jnothman jnothman commented May 26, 2019

Member

@jorisvandenbossche jorisvandenbossche commented May 26, 2019

A check_array on a (sparse) DataFrame will convert it to a numpy array (__array__), which for a sparse DataFrame means it gets densified, and memory usage will increase.
You can convert your sparse DataFrame to a scipy sparse matrix, which is probably the best approach for now if you want to keep it sparse.
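A minimal sketch of both paths, assuming a pandas version with the `DataFrame.sparse` accessor and an installed scipy. `np.asarray` stands in here for the `__array__` conversion that `check_array` triggers; scikit-learn itself is not called, and the data is invented for illustration.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

# A DataFrame whose columns are all sparse, with 0.0 as the fill value.
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0], fill_value=0.0),
})

# Going through __array__ produces a dense NumPy array: every zero is
# materialized and the memory savings are lost.
densified = np.asarray(df)

# Converting explicitly keeps the data sparse. The accessor returns a
# COO matrix; CSR is usually more convenient for row-wise estimators.
csr = df.sparse.to_coo().tocsr()

print(type(densified).__name__, densified.shape)
print(sp.issparse(csr), csr.nnz)
```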

Contributor

@mitar mitar commented May 26, 2019

So for me the beauty of using DataFrame is that I can have both string columns and numerical columns and throw it at random forest and it just works.

Now, with some columns sparse, we cannot just convert it to a scipy sparse matrix, because the latter does not support object dtype.

What would be the ideal implementation here to support this properly in scikit-learn?
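One possible workaround, not a proposal for how scikit-learn should handle this, is to split the frame by dtype, convert the sparse numeric columns directly, encode the string columns separately, and stack the results. The sketch below assumes `pd.get_dummies(..., sparse=True)` as the encoder and uses made-up column names; any other encoder that yields a sparse matrix would fit the same pattern.

```python
import pandas as pd
import scipy.sparse as sp

# Mixed frame: one string column plus two sparse numeric columns.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "x": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0),
    "y": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0], fill_value=0.0),
})

# Split the columns by whether they already hold sparse data.
sparse_cols = [c for c in df.columns if isinstance(df[c].dtype, pd.SparseDtype)]
other_cols = [c for c in df.columns if c not in sparse_cols]

# Sparse numeric columns convert to a scipy matrix without densifying.
numeric_part = df[sparse_cols].sparse.to_coo()

# String columns are one-hot encoded into sparse columns first,
# then converted the same way.
encoded = pd.get_dummies(df[other_cols], sparse=True, dtype=float)
encoded_part = encoded.sparse.to_coo()

# Stack both parts side by side into a single CSR matrix.
combined = sp.hstack([numeric_part, encoded_part]).tocsr()

print(combined.shape, combined.nnz)
```

The cost of this approach is that the string columns no longer travel with the numeric ones as a single DataFrame, which is exactly the convenience the question is about.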

Member

@jnothman jnothman commented May 26, 2019

Member

@amueller amueller commented Sep 18, 2019

I agree that we should make it a scipy sparse matrix. Should densifying be considered a bug? Also cc @thomasjpfan as a potential todo.

@adrinjalali adrinjalali added this to To do in Pandas Oct 21, 2019