New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Large sparse matrix support #11327

Merged
merged 39 commits into from Jun 25, 2018

Conversation

Projects
None yet
4 participants
@jnothman
Member

jnothman commented Jun 20, 2018

Supersedes and closes #9678, given @kdhingra307's silence
Fixes #9545
Fixes #4149
Partially addresses #2969

TODO:

  • set accept_large_sparse=True by default
  • review error message wording

Dhingra and others added some commits Sep 3, 2017

Dhingra Dhingra
Fix to Issue #9545, it includes a new param accept_large_sparse to ch…
…eck_array and new functon sparse_indices_check
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn
…into sparse_indices_check

Merging changes between fork and master:
Estimator Check now supports int32 and int64 based indices, _large_sp…
…arse validation function has been extended to COO and CSC too
Dhingra Dhingra

@jnothman jnothman added this to the 0.20 milestone Jun 20, 2018

@jnothman jnothman changed the title from [WIP] Large sparse matrix support to [MRG] Large sparse matrix support Jun 20, 2018

@jnothman jnothman changed the title from [MRG] Large sparse matrix support to [WIP] Large sparse matrix support Jun 20, 2018

@jnothman jnothman changed the title from [WIP] Large sparse matrix support to [MRG] Large sparse matrix support Jun 20, 2018

@rth

Thanks for continuing this PR!

def test_check_array_accept_large_sparse_no_exception(X_64bit):
# When large sparse are allowed
if LARGE_SPARSE_SUPPORTED:

This comment has been minimized.

@rth

rth Jun 20, 2018

Member

Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED) instead? otherwise we will be generating 64 bit sparse in the fixture with a scipy that doesn't support it.

@rth

rth Jun 20, 2018

Member

Maybe mark this with pytest.mark.skipif(not LARGE_SPARSE_SUPPORTED) instead? otherwise we will be generating 64 bit sparse in the fixture with a scipy that doesn't support it.

This comment has been minimized.

@jnothman

jnothman Jun 20, 2018

Member

No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.

@jnothman

jnothman Jun 20, 2018

Member

No, we want to do this: we want to test what happens if the user passes one that they could not have constructed directly with scipy.

Show outdated Hide outdated sklearn/utils/tests/test_validation.py Outdated
Show outdated Hide outdated sklearn/utils/validation.py Outdated
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 20, 2018

Member
Member

jnothman commented Jun 20, 2018

@ogrisel

LGTM besides the following comments.

Show outdated Hide outdated sklearn/decomposition/nmf.py Outdated
Show outdated Hide outdated sklearn/decomposition/nmf.py Outdated
@@ -598,6 +640,13 @@ def check_X_y(X, y, accept_sparse=False, dtype="numeric", order=None,
deprecated in version 0.19 "and will be removed in 0.21. Use
``accept_sparse=False`` instead.
accept_large_sparse : bool (default=True)
If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by
accept_sparse, accept_large_sparse will cause it to be accepted only

This comment has been minimized.

@ogrisel

ogrisel Jun 20, 2018

Member

accept_large_sparse=False instead of just accept_large_sparse.

@ogrisel

ogrisel Jun 20, 2018

Member

accept_large_sparse=False instead of just accept_large_sparse.

Show outdated Hide outdated sklearn/utils/validation.py Outdated
@@ -433,6 +434,40 @@ def pairwise_estimator_convert_X(X, estimator, kernel=linear_kernel):
return X
def _generate_sparse_matrix(X_csr):
"""Generate sparse matrices with {32,64}bit indices of diverse format

This comment has been minimized.

@rth

rth Jun 21, 2018

Member

Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )

@rth

rth Jun 21, 2018

Member

Side comment: once this is merged, I think there are some other existing tests where this function could be useful (cc @glemaitre )

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 21, 2018

Member

Want to give an opinion on ValueError vs TypeError?? ;)

Member

jnothman commented Jun 21, 2018

Want to give an opinion on ValueError vs TypeError?? ;)

@ogrisel

This comment has been minimized.

Show comment
Hide comment
@ogrisel

ogrisel Jun 22, 2018

Member

I think I'm still in favor (+0) of ValueError for consistency's sake, considering the existing behavior of our estimators when fed numpy arrays with an invalid dtype.

Member

ogrisel commented Jun 22, 2018

I think I'm still in favor (+0) of ValueError for consistency's sake, considering the existing behavior of our estimators when fed numpy arrays with an invalid dtype.

@ogrisel

This comment has been minimized.

Show comment
Hide comment
@ogrisel

ogrisel Jun 22, 2018

Member

One more data point: we explicitly reject complex dtyped arrays with ValueError in sklearn validation (not a numpy side-effect):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
  File "<ipython-input-9-70badd21e619>", line 1, in <module>
    LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
    "{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]]
Member

ogrisel commented Jun 22, 2018

One more data point: we explicitly reject complex dtyped arrays with ValueError in sklearn validation (not a numpy side-effect):

>>> import numpy as np
>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
Traceback (most recent call last):
  File "<ipython-input-9-70badd21e619>", line 1, in <module>
    LogisticRegression().fit(np.array([[1.2 + 1j]]), [0])
  File "/home/ogrisel/code/scikit-learn/sklearn/linear_model/logistic.py", line 1218, in fit
    order="C")
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 671, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/ogrisel/code/scikit-learn/sklearn/utils/validation.py", line 497, in check_array
    "{}\n".format(array))
ValueError: Complex data not supported
[[1.2+1.j]]

jnothman added some commits Jun 23, 2018

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 24, 2018

Member

@rth does this have your +1?

Member

jnothman commented Jun 24, 2018

@rth does this have your +1?

@rth

rth approved these changes Jun 25, 2018

I don't have a strong opinion on the TypeError vs ValueError. I feel that for complex arrays (and possibly 32/64 bit indices), a TypeError would have made more sense. At least scipy does raise it in case of dtype mismatch (see e.g. scipy/scipy#8360 (comment) or this line) But ValueError seems fine for consistency sake.

LGTM. Thanks @jnothman !

@rth rth merged commit a7e1711 into scikit-learn:master Jun 25, 2018

8 checks passed

LGTM analysis: Python No alert changes
Details
ci/circleci: deploy Your tests passed on CircleCI!
Details
ci/circleci: python2 Your tests passed on CircleCI!
Details
ci/circleci: python3 Your tests passed on CircleCI!
Details
codecov/patch 96.66% of diff hit (target 95.34%)
Details
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +1.32% compared to 62301aa
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Jun 25, 2018

Member
Member

jnothman commented Jun 25, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment