
Roadmap: pandas SparseDataFrame may be deprecated #12800

Closed
justmarkham opened this issue Dec 17, 2018 · 16 comments · Fixed by #16728

@justmarkham
Contributor

I noticed in the new roadmap that the second item on the list relates to "Pandas DataFrames and SparseDataFrames". I wanted to mention that according to an August 2018 talk by @datapythonista (who is a pandas core contributor), SparseDataFrames will be deprecated. Here are the slides from the talk (in notebook form).

I mentioned this to @amueller on Twitter, and he suggested opening an issue to document this.

This PR seems to be the most up-to-date discussion among the pandas team about deprecation.

Hope this is helpful!

@justmarkham
Contributor Author

Additional information that I just noticed on Twitter from @jorisvandenbossche (another pandas core contributor):

Small clarification: for now, DataFrames with sparse data will not be deprecated, but we are considering deprecating the SparseDataFrame subclass (because a normal DataFrame can also hold sparse data) https://mail.python.org/pipermail/pandas-dev/2018-November/000855.html

@amueller
Member

Thanks for the input. So it looks like they might:

Deprecate SparseDataFrame in favor of a DataFrame holding sparse arrays

So maybe we should change the roadmap item to "DataFrames for dense and sparse data" to accommodate whatever solution they end up going for.

@TomAugspurger
Contributor

Just to note: the memory usage of the old SparseDataFrame and the "new" (it's been possible for a while) DataFrame where some of the columns are sparse is identical.

And I don't think either is that good for very wide matrices. Each column is stored independently as two arrays: one for the "sparse index" (the positions where a value is explicitly stored) and one for the "sparse values" (the value at each sparse index position).
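
For illustration, a minimal sketch of that storage layout (assuming pandas >= 0.24, where pd.arrays.SparseArray is available):

```python
import pandas as pd

# Each sparse column is backed by a SparseArray that keeps only the
# explicitly stored values and their positions.
arr = pd.arrays.SparseArray([0, 0, 3, 0, 5], fill_value=0)

print(arr.sp_values)  # the explicitly stored values: [3 5]
print(arr.sp_index)   # an IntIndex holding their positions: [2, 4]
```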

@jnothman
Member

jnothman commented Dec 18, 2018 via email

@mitar
Contributor

mitar commented May 26, 2019

So if one passed a DataFrame with SparseSeries into a scikit-learn method that accepts a "sparse array" as input, would that work or explode at the moment?

@jnothman
Member

jnothman commented May 26, 2019 via email

@jorisvandenbossche
Member

A check_array call on a (sparse) DataFrame will convert it to a numpy array (via __array__), which for a sparse DataFrame means it gets densified and memory usage will increase.
You can convert your sparse DataFrame to a scipy sparse matrix, which is probably the best approach for now if you want to keep it sparse.
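
A minimal sketch of that conversion (assuming pandas >= 0.25, where the .sparse accessor offers to_coo() for DataFrames whose columns are all sparse):

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0]),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0]),
})

# Hand scikit-learn a scipy sparse matrix directly, instead of letting
# check_array densify the frame through __array__.
X = sparse.csr_matrix(df.sparse.to_coo())
```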

@mitar
Contributor

mitar commented May 26, 2019

So for me the beauty of using a DataFrame is that I can have both string columns and numerical columns, throw it at a random forest, and it just works.

Now, with some columns sparse, we cannot just convert it to a scipy sparse matrix, because the latter does not support object dtype.

What would be the ideal implementation here to support this properly in scikit-learn?
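
To make the constraint concrete, a small sketch (column names are placeholders): a scipy sparse matrix holds a single numeric dtype, so only the numeric sparse subset converts directly, while string columns would have to be encoded or handled separately.

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({
    "num": pd.arrays.SparseArray([0.0, 0.0, 1.5]),
    "cat": ["a", "b", "a"],  # object dtype -- has no scipy sparse equivalent
})

# The numeric sparse subset converts cleanly...
X_num = sparse.csr_matrix(df[["num"]].sparse.to_coo())
# ...but converting the whole frame would require densifying or encoding
# the string column first.
```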

@jnothman
Member

jnothman commented May 26, 2019 via email

@amueller
Member

I agree that we should make it a scipy sparse matrix. Should densifying be considered a bug? Also cc @thomasjpfan as a potential todo.

@mitar
Contributor

mitar commented Mar 27, 2020

So currently sparse pandas DataFrames are supported only if all columns are sparse. What about supporting the mixed case, where dense columns could be converted to sparse ones?
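
One way to do that conversion by hand today, as a sketch (assuming pandas >= 0.24 with pd.SparseDtype; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "already_sparse": pd.arrays.SparseArray([0.0, 0.0, 1.0]),
    "still_dense": [1.0, 2.0, 3.0],
})

# Cast the dense column to a sparse dtype so that every column is sparse
# and the frame can then be converted via df.sparse.to_coo().
df["still_dense"] = df["still_dense"].astype(pd.SparseDtype("float64", fill_value=0.0))
print(df.dtypes)  # both columns are now Sparse[float64, 0.0]
```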

@thomasjpfan
Member

What kind of behavior would be reasonable for the mixed case?

@mitar
Contributor

mitar commented Apr 5, 2020

Convert the dense columns into sparse columns, I would say, and then everything into a sparse matrix?

@jnothman
Member

jnothman commented Apr 5, 2020 via email

@mitar
Contributor

mitar commented Apr 5, 2020

That would be ideal, but it might be hard to know in advance?

@TomAugspurger
Contributor

I don't think trying to guess here is a good idea. I would direct the user to process the dense and sparse subsets independently using a ColumnTransformer. Or, if they really want them processed together, the user should convert the dense columns to sparse ahead of time.
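
A minimal sketch of that suggestion (column names and transformers are illustrative assumptions, not taken from this thread):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Route the dense and the sparse-dtype column subsets through separate
# transformers instead of guessing one conversion for the whole frame.
dense_cols = ["age", "income"]
sparse_cols = ["counts_a", "counts_b"]

preprocess = ColumnTransformer([
    ("dense", StandardScaler(), dense_cols),
    ("sparse", MaxAbsScaler(), sparse_cols),  # MaxAbsScaler accepts sparse input
])
```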
