Informative exception when passing a sparse matrix #7566

Merged: 11 commits, Apr 8, 2018

@urigoren urigoren commented Jul 5, 2017

Proposed solution to #7565

@@ -1364,6 +1364,11 @@ def pdist(X, metric='euclidean', p=None, w=None, V=None, VI=None):
# between all pairs of vectors in X using the distance metric 'abc' but
# with a more succinct, verifiable, but less efficient implementation.


if scipy.sparse.issparse(X):
    raise NotImplementedError("pdist does not support sparse matrices, use skearn's pairwise_distances\nhttp://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html\n")
Member

The docs indicate that NotImplementedError should be used for abstract methods or in-progress methods. I don't think there's a plan to support sparse inputs here, so TypeError might be more appropriate.

Also, I'd like a shorter message. Maybe something like: "pdist doesn't support sparse matrix inputs, use sklearn.metrics.pairwise_distances instead."
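
A minimal sketch of the distinction drawn in this review comment, using only the standard library. `SparseStandIn`, `looks_sparse`, and `pdist_sketch` are hypothetical names introduced for illustration; `looks_sparse` merely stands in for `scipy.sparse.issparse`, and none of this is SciPy's actual implementation. The message text is the one suggested above.

```python
class SparseStandIn:
    """Hypothetical placeholder for a sparse matrix type (not a SciPy class)."""


def looks_sparse(x):
    # Illustrative stand-in for scipy.sparse.issparse (assumption).
    return isinstance(x, SparseStandIn)


def pdist_sketch(X):
    """Toy pairwise Euclidean distances over a list of equal-length rows."""
    if looks_sparse(X):
        # TypeError: the argument is of an unsupported type. NotImplementedError
        # would suggest an abstract or in-progress method, which is not the case.
        raise TypeError(
            "pdist doesn't support sparse matrix inputs, "
            "use sklearn.metrics.pairwise_distances instead."
        )
    rows = [list(r) for r in X]
    distances = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            squared = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            distances.append(squared ** 0.5)
    return distances
```

Callers that already catch `TypeError` for bad argument types would then handle the sparse case with no special-casing.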

Contributor Author

Thank you, Done...

@@ -1364,6 +1364,11 @@ def pdist(X, metric='euclidean', p=None, w=None, V=None, VI=None):
# between all pairs of vectors in X using the distance metric 'abc' but
# with a more succinct, verifiable, but less efficient implementation.


if scipy.sparse.issparse(X):
    raise TypeError("pdist does not support sparse matrix input, use skearn's pairwise_distances instead")
Member

typo: skearn -> sklearn

Also wrap the string to fit in 80 columns, please.
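
One common way to satisfy the 80-column request is implicit concatenation of adjacent string literals inside parentheses. The sketch below uses a hypothetical constant name, `MESSAGE`, and shows only the wrapping technique, not the code that was eventually merged.

```python
# Adjacent string literals inside parentheses are joined at compile time,
# so the long message can be split across lines without "+" or backslashes.
MESSAGE = (
    "pdist does not support sparse matrix input, "
    "use sklearn's pairwise_distances instead"
)
```

Note the trailing space before the first closing quote: without it, the two fragments would run together.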

Contributor Author

Thanks, Done !

@@ -1364,6 +1364,11 @@ def pdist(X, metric='euclidean', p=None, w=None, V=None, VI=None):
# between all pairs of vectors in X using the distance metric 'abc' but
# with a more succinct, verifiable, but less efficient implementation.


if scipy.sparse.issparse(X):
Member

scipy.sparse needs to be imported before use.

@urigoren
Contributor Author

urigoren commented Jul 8, 2017

@perimosocordiae, please re-review.

Member

@perimosocordiae perimosocordiae left a comment

Looks good, thanks @urigoren!

My final concern is that this will cause scipy.spatial to depend on importing scipy.sparse, which will slightly increase the time it takes to run import scipy.spatial the first time. I don't expect this to be a big slowdown, though, so I'm +1 to merge.

@rgommers rgommers added the maintenance and scipy.sparse labels Jul 10, 2017
@rgommers
Member

rgommers commented Jul 10, 2017

This doesn't look quite right to me:

  • _copy_array_if_base_present is called from functions other than pdist (e.g. squareform), so it is not the right place to put this check.
  • Adding coupling between scipy submodules just for an exception message is slightly questionable. There are many types of inputs that aren't accepted.
    EDIT: and there are lots of other scipy functions where one could add this check; that doesn't seem desirable.
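
The first bullet can be illustrated with a toy sketch. Every name here (`SparseStandIn`, `_shared_copy_helper`, `pdist_sketch`, `squareform_sketch`, `error_message`) is hypothetical and only mimics the call structure described above; it is not SciPy's code.

```python
class SparseStandIn:
    """Hypothetical placeholder for a sparse matrix (not a SciPy class)."""


def _shared_copy_helper(a):
    # Hypothetical shared helper, playing the role that
    # _copy_array_if_base_present plays in the discussion.
    if isinstance(a, SparseStandIn):
        # The message names pdist even though other callers reach this line.
        raise TypeError("pdist does not support sparse matrix input")
    return list(a)


def pdist_sketch(X):
    return _shared_copy_helper(X)


def squareform_sketch(X):
    return _shared_copy_helper(X)


def error_message(func, arg):
    """Return the TypeError message raised by func(arg), or None."""
    try:
        func(arg)
    except TypeError as exc:
        return str(exc)
    return None
```

Calling `error_message(squareform_sketch, SparseStandIn())` returns a message about pdist, which is misleading for a caller who never invoked pdist. A check placed in the public function itself would avoid that.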

@rgommers
Member

Looks like a documentation change is a better way to go about this. And perhaps it would make sense to move the scikit-learn distance metrics into scipy.spatial?

@urigoren
Contributor Author

@rgommers, what do you suggest doing next? Move this check to _copy_array_if_base_present, or copy the code from sklearn?

@perimosocordiae
Member

For now, let's just add a warning not to use sparse matrix inputs to the docstring of pdist and cdist, perhaps in the "Notes" section.
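
As a rough sketch of what such a docstring note could look like, here is a toy function with a numpydoc-style "Notes" section. The function `pdist_sketch` and the exact wording of the note are assumptions made for illustration, not the text that ended up in SciPy.

```python
def pdist_sketch(X):
    """Toy stand-in for a pairwise-distance function (hypothetical).

    Parameters
    ----------
    X : sequence of sequences of float
        Dense observations, one row per observation.

    Returns
    -------
    list of float
        Euclidean distances between all pairs of rows.

    Notes
    -----
    Sparse matrix inputs are not supported. For sparse data, consider
    ``sklearn.metrics.pairwise_distances`` instead.
    """
    rows = [list(r) for r in X]
    return [
        sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5
        for i in range(len(rows))
        for j in range(i + 1, len(rows))
    ]
```

A documentation-only change like this adds no runtime check and no dependency between submodules.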

I do think that eventually scipy.spatial should grow sparse matrix support similar to what scikit-learn provides, but that will be a large effort that needs to be designed carefully.

@urigoren
Contributor Author

What are the downsides of merging this PR? Only the possible increase in loading time?
Correct me if I'm wrong, but that can be tested with a simple %timeit call, right?

I agree with @perimosocordiae that we definitely want pdist and cdist to support sparse matrices in the future,
but until that happens, what's the harm in having an informative exception?

@rgommers
Member

rgommers commented Jul 11, 2017

> What are the downsides of merging this PR? Only the possible increase in loading time?
> Correct me if I'm wrong, but that can be tested with a simple %timeit call, right?

No you can't. Measuring import time is quite tricky to get right (after the first import it's a lookup in globals so much faster) - we had some discussion on that on the numpy-discussion list last year.

Either way, import time is not the main issue. The issues are the two I listed in #7566 (comment).
EDIT: for the first one, just try squareform(some_sparse_matrix) to see what is wrong
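
On the import-time point above: a repeated import inside one process is served from the `sys.modules` cache, so an in-process timing loop mostly measures that lookup. One way to get a cold measurement is to time the import in a fresh interpreter. The helper names `cold_import_seconds` and `warm_import_seconds` below are hypothetical, and this is only a sketch; it ignores disk caching and run-to-run noise.

```python
import importlib
import subprocess
import sys
import time


def cold_import_seconds(module):
    """Time `import module` in a fresh interpreter, where it is not yet cached."""
    code = (
        "import time; t0 = time.perf_counter(); "
        f"import {module}; "
        "print(time.perf_counter() - t0)"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


def warm_import_seconds(module):
    """Time a repeated in-process import, which is a sys.modules cache hit."""
    importlib.import_module(module)  # make sure the module is already cached
    t0 = time.perf_counter()
    importlib.import_module(module)
    return time.perf_counter() - t0
```

Comparing `cold_import_seconds("scipy.spatial")` with `warm_import_seconds("scipy.spatial")` would show why a repeated in-process timing understates the first-import cost. Python 3.7 and later also provide `python -X importtime` for a per-module breakdown.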

@rgommers
Member

> I do think that eventually scipy.spatial should grow sparse matrix support similar to what scikit-learn provides, but that will be a large effort that needs to be designed carefully.

Agreed

> For now, let's just add a warning not to use sparse matrix inputs to the docstring of pdist and cdist, perhaps in the "Notes" section.

+1

@ev-br
Member

ev-br commented Jul 22, 2017

Just noting that there's a similar utility function instead of a direct use of issparse, used e.g. in scipy.linalg IIRC: https://github.com/scipy/scipy/blob/master/scipy/_lib/_util.py#L192

@pv pv added the needs-work label Aug 18, 2017
Uri Goren added 4 commits February 21, 2018 12:45
@urigoren
Contributor Author

@rgommers, @ev-br, I fixed the sparsity check as you suggested.

The tests seem to pass (despite what the CI says); the failure is an accuracy issue on Python 3.5. Please have a look at the logs.

Could you please merge?

@pv pv merged commit e9fd161 into scipy:master Apr 8, 2018
@pv pv added this to the 1.1.0 milestone Apr 8, 2018
@pv pv removed the needs-work label Apr 8, 2018
StrahinjaLukic pushed a commit to StrahinjaLukic/scipy that referenced this pull request Apr 14, 2018