ENH Adds support for pandas dataframe with only sparse arrays #16728
Conversation
This is nice! Please add a what's new entry for check_array.
Conversions from a sparse DataFrame to COO/CSC etc. should already be fairly efficient now.
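As a rough illustration of that point (a sketch, assuming pandas >= 0.25 and scipy are installed; the variable names are made up for this example), an all-sparse DataFrame exposes the `.sparse` accessor, and the COO matrix it produces converts cheaply to other formats:

```python
import pandas as pd
import scipy.sparse as sp

# Build a DataFrame whose columns are all pandas SparseArrays
sp_mat = sp.random(10, 3, density=0.3, format="csr", random_state=0)
df = pd.DataFrame.sparse.from_spmatrix(sp_mat)

# The accessor only offers a direct conversion to COO ...
coo = df.sparse.to_coo()
# ... but COO -> CSC/CSR afterwards is a cheap O(nnz) operation
csc = coo.tocsc()
print(coo.format, csc.format)
```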
@@ -498,6 +500,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     estimator_name = "Estimator"
     context = " by %s" % estimator_name if estimator is not None else ""

+    # handles pandas sparse by checking for sparse attribute
+    if hasattr(array, 'sparse') and array.ndim > 1:
+        array = array.sparse.to_coo()
Why COO and not CSC?
For now, .sparse only supports .to_coo().
assert sp.issparse(result)
assert result.format == 'coo'
assert_allclose_dense_sparse(sp_mat, result)
Could we also add a check that, e.g.:

result = check_array(sdf, accept_sparse='csr')
assert result.format == 'csr'
PR updated with a parametrized test.
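A parametrized test along the lines requested above might look like this (a sketch, not the PR's actual test; it assumes pytest, pandas >= 0.25, scipy, and a scikit-learn version that includes this change; the test name is made up):

```python
import pytest
import pandas as pd
import scipy.sparse as sp
from sklearn.utils.validation import check_array


@pytest.mark.parametrize('sp_format', [True, 'csr', 'csc', 'coo', 'bsr'])
def test_check_array_sparse_dataframe(sp_format):
    # An all-sparse DataFrame should come back as a scipy sparse matrix
    # in the requested format (COO when accept_sparse=True).
    sp_mat = sp.random(10, 3, density=0.3, random_state=0)
    sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
    result = check_array(sdf, accept_sparse=sp_format)
    if sp_format is True:
        sp_format = 'coo'  # the intermediate format pandas produces
    assert sp.issparse(result)
    assert result.format == sp_format
```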
Nice, thanks @thomasjpfan
Codecov says this is always false in our CI?
Looks like codecov is more often wrong than right. Pandas 1.0 should be installed in pylatest_pip_openblas_pandas: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=14767&view=logs&j=78a0bf4f-79e5-5387-94ec-13e67d216d6e&t=75a90307-b084-59e7-ba24-7f7161804495&l=252
But that's pandas 1.0, not something older than 0.24.
Thanks @thomasjpfan, a few comments, but this looks good.
sklearn/utils/validation.py (outdated)
@@ -451,10 +451,12 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     # DataFrame), and store them. If not, store None.
     dtypes_orig = None
     if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
-        # throw warning if pandas dataframe is sparse
+        # throw warning if columns are sparse. If all columns are sparse, then
+        # array.sparse exist and sparsity will be perserved.
Suggested change:
-        # array.sparse exist and sparsity will be perserved.
+        # array.sparse exists and sparsity will be preserved (later).
doc/whats_new/v0.23.rst (outdated)
@@ -408,6 +408,10 @@ Changelog
    pandas sparse DataFrame.
    :pr:`16021` by :user:`Rushabh Vasani <rushabh-v>`.

+ - |Enhancement| :func:`utils.validation.check_array` now constructs a sparse
+   matrix from a pandas DataFrame with containing only `SparseArray`s.
Suggested change:
-   matrix from a pandas DataFrame with containing only `SparseArray`s.
+   matrix from a pandas DataFrame that contains only `SparseArray`s.
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
result = check_array(sdf, accept_sparse=True)

# by default pandas converts to coo
Is this pandas default, or is it rather what we did in check_array?
Is this pandas default, or is it rather what we did in check_array?

Both -- pandas only has a direct conversion to COO at the moment, so it makes sense to keep that as the default. Conversion from COO to either CSC or CSR is fast.
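The claim that COO-to-CSR/CSC conversion is fast can be sanity-checked directly with scipy (a sketch; the matrix sizes here are arbitrary):

```python
import scipy.sparse as sp

# COO -> CSR/CSC is essentially an O(nnz) sort of coordinate triplets,
# so using COO as the intermediate format costs little.
coo = sp.random(1000, 200, density=0.01, format='coo', random_state=0)
csr = coo.tocsr()
csc = coo.tocsc()
assert csr.nnz == coo.nnz == csc.nnz
```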
sklearn/utils/validation.py (outdated)
@@ -498,6 +500,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     estimator_name = "Estimator"
     context = " by %s" % estimator_name if estimator is not None else ""

+    # handles pandas sparse by checking for sparse attribute
Suggested change:
-    # handles pandas sparse by checking for sparse attribute
+    # When all dataframe columns are sparse, convert to a sparse array
assert_allclose_dense_sparse(sp_mat, result)


@pytest.mark.parametrize('sp_format', ['csr', 'csc', 'coo', 'bsr'])
Maybe you can add the True option here to merge both tests? Also, as usual, a short comment describing the test would help future us.

I wonder if the coverage error has something to do with this error:
Coverage.py warning: Data file '...' doesn't seem to be a coverage data file: Couldn't use data file '/home/vsts/work/tmp_folder/.coverage.fv-az762.6807.984702': no such table: coverage_schema
Thanks @thomasjpfan, LGTM. I assume you need this for #16772. Merging.
Indeed. Opened #16775 for a follow-up discussion.
@@ -498,6 +500,11 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
     estimator_name = "Estimator"
     context = " by %s" % estimator_name if estimator is not None else ""

+    # When all dataframe columns are sparse, convert to a sparse array
+    if hasattr(array, 'sparse') and array.ndim > 1:
To be safe, I would also add a check that all columns are indeed sparse (the .sparse accessor now raises if not all columns are sparse, but I am not sure this is necessarily guaranteed to stay that way).
The explicit check would be something like:

df.dtypes.apply(pd.api.types.is_sparse).all()

See also pandas-dev/pandas#26706
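The suggested guard can be sketched as follows (hypothetical variable names; note that pd.api.types.is_sparse accepts dtypes, though newer pandas deprecates it in favor of checking isinstance(dtype, pd.SparseDtype)):

```python
import pandas as pd
import scipy.sparse as sp

sp_mat = sp.eye(3, format='csr')
df_all_sparse = pd.DataFrame.sparse.from_spmatrix(sp_mat)

# Densify one column to get a mixed sparse/dense frame
df_mixed = df_all_sparse.copy()
df_mixed[0] = [1.0, 2.0, 3.0]

# The explicit all-columns-sparse check from the review comment
assert df_all_sparse.dtypes.apply(pd.api.types.is_sparse).all()
assert not df_mixed.dtypes.apply(pd.api.types.is_sparse).all()
```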
I moved my comment to a new issue: #16845
Reference Issues/PRs
Fixes #12800
What does this implement/fix? Explain your changes.
Does not convert a pandas DataFrame into a dense array if ALL of its columns are sparse arrays.
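Concretely, the new behavior can be exercised like this (a sketch assuming pandas >= 0.25, scipy, and a scikit-learn release that contains this change):

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.utils.validation import check_array

sp_mat = sp.eye(4, format='csr')
sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)

# All columns are SparseArrays, so check_array keeps the data sparse
# instead of densifying the whole frame.
result = check_array(sdf, accept_sparse=True)
assert sp.issparse(result)
```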
Any other comments?
Since we have pandas sparse support on our minds: CC @rth