Skip to content

[MRG] value error when distance matrix is not square and affinity=precomputed #16257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

simonamaggio
Copy link
Contributor

Reference Issues/PRs

Closes #4901

What does this implement/fix? Explain your changes.

Added a check in AgglomerativeClustering, to make sure the input matrix of fit() is square, when affinity=precomputed.
This raises a more clear error when a non square matrix is used as input (instead of out of bounds IndexError).

Any other comments?

@simonamaggio simonamaggio changed the title [MRG] [MRG] value error when distance matrix is not square and affinity=precomputed Jan 28, 2020
@@ -455,6 +455,11 @@ def linkage_tree(X, connectivity=None, n_clusters=None, linkage='complete',
# data, provide as first argument an ndarray of the shape returned
# by pdist: it is a flat array containing the upper triangular of
# the distance matrix.
if len(X.shape) != 2 or X.shape[0] != X.shape[1]:
raise ValueError(
'Matrix should be square, as returned by pdist. '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pairwise_distances would be more familiar to our users than pdist

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. Let me change the Error Message.

# containing the upper triangular of the distance matrix.
if len(X.shape) != 2 or X.shape[0] != X.shape[1]:
raise ValueError(
'Matrix should be square, '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused here. I have to say I am not sure what shape X should have when affinity == 'precomputed'.

The thing is at the moment, the error message says that X should be square and the comment above says that X should be flat ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremiedbb thinks that the matrix should be square because of the code below:

i, j = np.triu_indices(X.shape[0], k=1)
X = X[i, j]  # now X is a flat array containing the upper triangular of the distance matrix 

So the comment needs to be changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than

provide as first argument an ndarray of the shape returned
by pdist: it is a flat array containing the upper triangular of
by sklearn.metrics.pairwise_distances: it is a flat array

say something like:

X should be a distance matrix as returned by sklearn.metrics.pairwise_distances 	

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the comment. The input matrix should be squared. And for sure it shouldn't be flat, as there is another check somewhere else asking the user to reshape the potential 1D input as a 2D matrix (n,1).

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @simonamaggio. A few comments below.

Comment on lines 459 to 461
'Matrix should be square, '
'as returned by pairwise_distances. '
'Found dimensionality %s' % str(X.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not say as return by pairwise_distances, because pairwise distances can return non square matrix (if called on X and Y with different shapes). I would just say

'Matrix should be squared. Found matrix of shape {}'.format(X.shape)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Just added 'Distance matrix' instead of 'Matrix' to be more specific.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the .format syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides my last comment, lgtm !

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

if len(X.shape) != 2 or X.shape[0] != X.shape[1]:
raise ValueError(
'Distance matrix should be square, '
'Found dimensionality %s' % str(X.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Found dimensionality %s' % str(X.shape)
'Got shape %s' % str(X.shape)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need an entry in the whats new as well.

@jeremiedbb
Copy link
Member

@simonamaggio could you please add an entry in the what's new (scikit-learn/doc/whats_new/v0.23.rst) ? you can tag it |Fix|.

Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I did not see the typo the first time.
LGTM otherwise

# by sklearn.metrics.pairwise_distances.
if X.shape[0] != X.shape[1]:
raise ValueError(
'Distance matrix should be square, '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Distance matrix should be square, '
'Distance matrix should be squared, '

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I think it's "square matrix", not "squared matrix". This doesn't seem right, but if you insist, I'll change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right.

if X.shape[0] != X.shape[1]:
raise ValueError(
'Distance matrix should be square, '
'Got matrix of shape {}'.format(X.shape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'Got matrix of shape {}'.format(X.shape)
f'Got matrix of shape {X.shape}.'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for this.

# and a non square matrix is passed (PR #16257).
rng = np.random.RandomState(0)
X = rng.rand(5, 3)
with pytest.raises(ValueError, match="Distance matrix should be square, "):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with pytest.raises(ValueError, match="Distance matrix should be square, "):
with pytest.raises(ValueError, match="Distance matrix should be squared, "):

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I think it's "square matrix", not "squared matrix". This doesn't seem right, but if you insist, I'll change it.

@glemaitre glemaitre merged commit 06c71ec into scikit-learn:master Feb 17, 2020
@glemaitre
Copy link
Member

@simonamaggio Thank you for finishing the PR. Do not hesitate to take another one :)

@simonamaggio simonamaggio deleted the check_precomputed_shape_agglo_clustering branch February 18, 2020 10:55
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Feb 22, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants