-
-
Notifications
You must be signed in to change notification settings - Fork 26k
[MRG] value error when distance matrix is not square and affinity=precomputed #16257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] value error when distance matrix is not square and affinity=precomputed #16257
Conversation
sklearn/cluster/_agglomerative.py
Outdated
@@ -455,6 +455,11 @@ def linkage_tree(X, connectivity=None, n_clusters=None, linkage='complete', | |||
# data, provide as first argument an ndarray of the shape returned | |||
# by pdist: it is a flat array containing the upper triangular of | |||
# the distance matrix. | |||
if len(X.shape) != 2 or X.shape[0] != X.shape[1]: | |||
raise ValueError( | |||
'Matrix should be square, as returned by pdist. ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pairwise_distances would be more familiar to our users than pdist
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks. Let me change the Error Message.
…e_distances. Update test accordingly.
sklearn/cluster/_agglomerative.py
Outdated
# containing the upper triangular of the distance matrix. | ||
if len(X.shape) != 2 or X.shape[0] != X.shape[1]: | ||
raise ValueError( | ||
'Matrix should be square, ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused here. I have to say I am not sure what shape X
should have when affinity == 'precomputed'
.
The thing is at the moment, the error message says that X
should be square and the comment above says that X
should be flat ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeremiedbb thinks that the matrix should be square because of the code below:
i, j = np.triu_indices(X.shape[0], k=1)
X = X[i, j] # now X is a flat array containing the upper triangular of the distance matrix
So the comment needs to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than
provide as first argument an ndarray of the shape returned
by pdist: it is a flat array containing the upper triangular of
by sklearn.metrics.pairwise_distances: it is a flat array
say something like:
X should be a distance matrix as returned by sklearn.metrics.pairwise_distances
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the comment. The input matrix should be squared. And for sure it shouldn't be flat, as there is another check somewhere else asking the user to reshape the potential 1D input as a 2D matrix (n,1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @simonamaggio. A few comments below.
sklearn/cluster/_agglomerative.py
Outdated
'Matrix should be square, ' | ||
'as returned by pairwise_distances. ' | ||
'Found dimensionality %s' % str(X.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not say as return by pairwise_distances, because pairwise distances can return non square matrix (if called on X and Y with different shapes). I would just say
'Matrix should be squared. Found matrix of shape {}'.format(X.shape)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Just added 'Distance matrix' instead of 'Matrix' to be more specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the .format
syntax.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides my last comment, lgtm !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise lgtm
sklearn/cluster/_agglomerative.py
Outdated
if len(X.shape) != 2 or X.shape[0] != X.shape[1]: | ||
raise ValueError( | ||
'Distance matrix should be square, ' | ||
'Found dimensionality %s' % str(X.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Found dimensionality %s' % str(X.shape) | |
'Got shape %s' % str(X.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will need an entry in the whats new as well.
@simonamaggio could you please add an entry in the what's new ( |
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I did not see the typo the first time.
LGTM otherwise
# by sklearn.metrics.pairwise_distances. | ||
if X.shape[0] != X.shape[1]: | ||
raise ValueError( | ||
'Distance matrix should be square, ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Distance matrix should be square, ' | |
'Distance matrix should be squared, ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? I think it's "square matrix", not "squared matrix". This doesn't seem right, but if you insist, I'll change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right.
sklearn/cluster/_agglomerative.py
Outdated
if X.shape[0] != X.shape[1]: | ||
raise ValueError( | ||
'Distance matrix should be square, ' | ||
'Got matrix of shape {}'.format(X.shape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Got matrix of shape {}'.format(X.shape) | |
f'Got matrix of shape {X.shape}.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for this.
# and a non square matrix is passed (PR #16257). | ||
rng = np.random.RandomState(0) | ||
X = rng.rand(5, 3) | ||
with pytest.raises(ValueError, match="Distance matrix should be square, "): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with pytest.raises(ValueError, match="Distance matrix should be square, "): | |
with pytest.raises(ValueError, match="Distance matrix should be squared, "): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? I think it's "square matrix", not "squared matrix". This doesn't seem right, but if you insist, I'll change it.
@simonamaggio Thank you for finishing the PR. Do not hesitate to take another one :) |
Reference Issues/PRs
Closes #4901
What does this implement/fix? Explain your changes.
Added a check in AgglomerativeClustering, to make sure the input matrix of fit() is square, when affinity=precomputed.
This raises a more clear error when a non square matrix is used as input (instead of out of bounds IndexError).
Any other comments?