Roadmap: pandas SparseDataFrame may be deprecated #12800
Additional information that I just noticed on Twitter from @jorisvandenbossche (another pandas core contributor):
Thanks for the input.
So maybe we should change it to "DataFrames for dense and sparse data" to accommodate whatever solution they are going for.
Just to note: the memory usage of the old SparseDataFrame and of the "new" approach (possible for a while now) of a DataFrame where some of the columns are sparse is identical. And I don't think either is that good for very wide matrices. Each column is stored independently as two arrays: one array for the "sparse index" (the positions where a value is explicitly stored) and one for the "sparse values" (the value at each sparse-index position).
That's essentially equivalent to CSC, which is still much better than dense.
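A minimal sketch of what that storage looks like in practice (pandas >= 0.25 API; the column name "x", the sizes, and the fill value are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# A column that is mostly zeros, stored dense and as a pandas sparse column.
dense = pd.Series(np.zeros(100_000))
dense.iloc[::100] = 1.0
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

# Each sparse column holds two arrays: the explicitly stored values and the
# positions ("sparse index") at which they occur -- the same information one
# column of a CSC matrix carries.
arr = sparse.array            # pandas.arrays.SparseArray
print(arr.sp_values[:5])      # the stored (non-fill) values
print(arr.sp_index)           # positions of those values

# Memory comparison: the sparse representation only pays for stored values.
print(dense.memory_usage(deep=True), sparse.memory_usage(deep=True))

# An all-sparse DataFrame can be handed to scipy as COO and converted to CSC,
# which keeps the same per-column (index, value) pairs.
coo = pd.DataFrame({"x": sparse}).sparse.to_coo()
print(coo.tocsc().shape)
```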
Yes, update the wording to say that we need to directly support DataFrame sparsity.
So if one passed a DataFrame with SparseSeries columns into a scikit-learn method that accepts a "sparse array" as input, would that work or explode at the moment?
Check? I assume it would work, but inefficiently by making it dense, unless pandas raises an error when converting to an array.
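One way to check empirically rather than guess (an illustrative sketch; it simply reports what scikit-learn's input validation produces for a frame whose columns are all sparse):

```python
import pandas as pd
from scipy import sparse as sp
from sklearn.utils import check_array

df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0]),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0]),
})

# accept_sparse=True lets validation keep a sparse result if it produces one;
# the printout shows whether the input was densified or kept sparse.
X = check_array(df, accept_sparse=True)
print(type(X), sp.issparse(X))
```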
So for me the beauty of using [...]. Now, with some columns sparse, we cannot just convert it to a scipy sparse matrix, because the latter does not support object dtype. What would be the ideal implementation here to support this properly in scikit-learn?
Well, at the moment I wouldn't have thought that throwing categorical dtypes at our random forest "just works" either.
I agree that we should make it a scipy sparse matrix. Should densifying be considered a bug? Also cc @thomasjpfan as a potential todo.
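For users who want to avoid any chance of densification today, a minimal sketch of the explicit conversion, assuming all columns are sparse and numeric (LogisticRegression is just an arbitrary estimator that accepts sparse input, and the data here is made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The DataFrame.sparse accessor requires every column to be sparse (and numeric).
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0, 3.0]),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0, 0.0]),
})
y = [0, 1, 1, 0, 1]

# Convert to a scipy.sparse matrix up front, so scikit-learn never densifies.
X = df.sparse.to_coo().tocsr()
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:2]))
```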
So currently sparse Pandas DataFrames are supported only if all columns are sparse. What about supporting the mixed case, where dense columns could be converted to sparse ones?
What kind of behavior would be reasonable for the mixed case?
Making dense columns into sparse columns, I would say, and then converting everything into a sparse matrix?
Choosing whichever has a smaller memory representation?
That would be ideal, but it might be hard to know in advance?
I don't think trying to guess here is a good idea. I would direct the user to process the dense and sparse subsets independently using a ColumnTransformer. Or, if they really want them processed together, the user should convert the dense columns to sparse ahead of time.
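A rough sketch of those two options, with made-up column names ("price" dense, "clicks" sparse) and arbitrary scalers standing in for whatever preprocessing the user actually needs:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MaxAbsScaler

# Hypothetical mixed frame: "price" is a dense column, "clicks" is sparse.
df = pd.DataFrame({
    "price": [1.5, 2.0, 0.0, 3.5],
    "clicks": pd.arrays.SparseArray([0.0, 0.0, 7.0, 0.0]),
})

# Option 1: process the dense and sparse subsets independently; the
# ColumnTransformer stacks the results (its sparse_threshold parameter
# controls whether the combined output stays sparse).
ct = ColumnTransformer([
    ("dense", StandardScaler(), ["price"]),
    ("sparse", MaxAbsScaler(), ["clicks"]),
])
Xt = ct.fit_transform(df)

# Option 2: convert the dense columns to sparse ahead of time, so the whole
# frame shares one sparse representation before being passed to an estimator.
df["price"] = df["price"].astype(pd.SparseDtype("float64", 0.0))
```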
I noticed in the new roadmap that the second item on the list relates to "Pandas DataFrames and SparseDataFrames". I wanted to mention that according to an August 2018 talk by @datapythonista (who is a pandas core contributor), SparseDataFrames will be deprecated. Here are the slides from the talk (in notebook form).
I mentioned this to @amueller on Twitter, and he suggested opening an issue to document this.
This PR seems to be the most up-to-date discussion among the pandas team about deprecation.
Hope this is helpful!