Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Adds support for pandas dataframe with only sparse arrays #16728

Merged
merged 10 commits into from Mar 27, 2020

Conversation

thomasjpfan
Copy link
Member

Reference Issues/PRs

Fixes #12800

What does this implement/fix? Explain your changes.

Does not convert pandas dataframe into a dense array if ALL its columns are sparse arrays.

Any other comments?

Since we have pandas sparse support on our minds: CC @rth

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice! Please add a what's new entry for check_array.

The conversions DataFrame (sparse) -> COO/CSC etc should be fairly efficient already now.

@@ -498,6 +500,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
estimator_name = "Estimator"
context = " by %s" % estimator_name if estimator is not None else ""

# handles pandas sparse by checking for sparse attribute
if hasattr(array, 'sparse') and array.ndim > 1:
array = array.sparse.to_coo()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why COO and not CSC?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, .sparse only supports .to_coo().

assert sp.issparse(result)
assert result.format == 'coo'
assert_allclose_dense_sparse(sp_mat, result)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add check that, e.g.

result = check_array(sdf, accept_sparse='csr')
assert result.format == csr

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR updated with a parametrized test.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks @thomasjpfan

@adrinjalali
Copy link
Member

Codecov says this is always false in our CI?

if LooseVersion(pd.__version__) < '0.24.0'

@rth
Copy link
Member

rth commented Mar 23, 2020

Looks like codecov is more often wrong than right. Pandas 1.0 should be installed in pylatest_pip_openblas_pandas: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=14767&view=logs&j=78a0bf4f-79e5-5387-94ec-13e67d216d6e&t=75a90307-b084-59e7-ba24-7f7161804495&l=252

@adrinjalali
Copy link
Member

But that's pandas 1.0, not something older than 0.24.

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan , a few comments but looks good

@@ -451,10 +451,12 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
# DataFrame), and store them. If not, store None.
dtypes_orig = None
if hasattr(array, "dtypes") and hasattr(array.dtypes, '__array__'):
# throw warning if pandas dataframe is sparse
# throw warning if columns are sparse. If all columns are sparse, then
# array.sparse exist and sparsity will be perserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# array.sparse exist and sparsity will be perserved.
# array.sparse exists and sparsity will be perserved (later).

@@ -408,6 +408,10 @@ Changelog
pandas sparse DataFrame.
:pr:`16021` by :user:`Rushabh Vasani <rushabh-v>`.

- |Enhancement| :func:`utils.validation.check_array` now constructs a sparse
matrix from a pandas DataFrame with containing only `SparseArray`s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
matrix from a pandas DataFrame with containing only `SparseArray`s.
matrix from a pandas DataFrame that contains only `SparseArray`s.

sdf = pd.DataFrame.sparse.from_spmatrix(sp_mat)
result = check_array(sdf, accept_sparse=True)

# by default pandas converts to coo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pandas default, or is it rather what we did in check_array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this pandas default, or is it rather what we did in check_array?

Both -- pandas only have direct conversion to coo at the moment, so it makes sense to keep that as default. Conversions from COO to either CSC or CSR is fast.

@@ -498,6 +500,10 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
estimator_name = "Estimator"
context = " by %s" % estimator_name if estimator is not None else ""

# handles pandas sparse by checking for sparse attribute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# handles pandas sparse by checking for sparse attribute
# When all dataframe columns are sparse, convert to a sparse array

assert_allclose_dense_sparse(sp_mat, result)


@pytest.mark.parametrize('sp_format', ['csr', 'csc', 'coo', 'bsr'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can add the True option here to merge both tests?

Also as usual, a short comment describing the test would help for future us

@thomasjpfan
Copy link
Member Author

I wonder if the coverage error has something to do with this error

Coverage.py warning: Data file '...' doesn't seem to be a coverage data file:
Couldn't use data file '/home/vsts/work/tmp_folder/.coverage.fv-az762.6807.984702':
no such table: coverage_schema

rth
rth approved these changes Mar 27, 2020
Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @thomasjpfan , LGTM. I assume you need this for #16772 . Merging.

@rth
Copy link
Member

rth commented Mar 27, 2020

But that's pandas 1.0, not something older than 0.24.

Indeed. Opened #16775 for a follow up discussion.

@rth rth merged commit fa1ea2a into scikit-learn:master Mar 27, 2020
@@ -498,6 +500,11 @@ def check_array(array, accept_sparse=False, accept_large_sparse=True,
estimator_name = "Estimator"
context = " by %s" % estimator_name if estimator is not None else ""

# When all dataframe columns are sparse, convert to a sparse array
if hasattr(array, 'sparse') and array.ndim > 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe, I would also add a check that all columns are indeed sparse (the .sparse accessor now raises if not all columns are sparse, but I am not sure this is necessarily guaranteed to stay that way).
The explicit check would be something like:

df.dtypes.apply(pd.api.types.is_sparse).all()

See also pandas-dev/pandas#26706

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved my comment to a new issue: #16845

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Roadmap: pandas SparseDataFrame may be deprecated
5 participants