Set diagonal entries of distance matrices to zero in silhouette_samples #12258
@@ -170,6 +170,22 @@ def test_non_numpy_labels():
        silhouette_score(list(X), list(y)), silhouette_score(X, y))


def test_silhouette_nonzero_diag():
    # Construct a zero-diagonal matrix
    dists = pairwise_distances(
        np.array([[0.2, 0.1, 0.12, 1.34, 1.11, 1.6]]).transpose())

    # Construct a nonzero-diagonal distance matrix
    diag_dists = dists.copy()
    np.fill_diagonal(diag_dists, 1)

    labels = [0, 0, 0, 1, 1, 1]

    assert_raise_message(ValueError, "distance matrix contains non-zero",
                         silhouette_samples,
                         diag_dists, labels, metric='precomputed')

Review comment on the `assert_raise_message(...)` line: we use
sjtrny marked this conversation as resolved.


def assert_raises_on_only_one_label(func):
    """Assert message when there is only one label"""
    rng = np.random.RandomState(seed=0)

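To inspect the two matrices the test builds without running the test suite, here is a minimal NumPy-only sketch. It assumes the default Euclidean metric of `pairwise_distances`, which for one-dimensional points reduces to absolute differences. `strict_diagonal_is_zero` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

# The six one-dimensional points used by the test, as a column vector.
points = np.array([[0.2, 0.1, 0.12, 1.34, 1.11, 1.6]]).transpose()

# Assumption: Euclidean distance between 1-D points is |x_i - x_j|, so
# broadcasting the column against its transpose yields the full matrix.
dists = np.abs(points - points.T)

# Same corruption as in the test: overwrite the diagonal with ones.
diag_dists = dists.copy()
np.fill_diagonal(diag_dists, 1)


def strict_diagonal_is_zero(matrix):
    """Hypothetical helper: True only if every diagonal entry is exactly 0."""
    return not np.any(np.diagonal(matrix))


print(strict_diagonal_is_zero(dists))       # True
print(strict_diagonal_is_zero(diag_dists))  # False
```

Only the diagonal differs between the two matrices, so any change in behavior between them is attributable to the diagonal entries alone.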
@@ -210,6 +210,16 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

    """
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])

    # Check for diagonal entries in precomputed distance matrix
    if metric == 'precomputed':
        diag_indices = np.diag_indices(X.shape[0])
        if np.any(X[diag_indices]):
            raise ValueError(
                'The precomputed distance matrix contains non-zero '
                'elements on the diagonal.'
            )

    le = LabelEncoder()
    labels = le.fit_transform(labels)
    n_samples = len(labels)

Review thread on the `if np.any(X[diag_indices]):` line:

Review comment: Maybe we still need to consider floating point error here? e.g.,

Review comment: What's the purpose? Is it to allow some small but non-zero values on the diagonal? That seems counter to the pull request because tiny values will change the result of the silhouette score.

Review comment: The point here is to avoid unexpected errors; we generally avoid doing things like

Do you have an example to show that small numbers on the diagonal (e.g., 1e-10) will change the score significantly?

Review comment: Uhhhhh, errors are generally unexpected, it's pretty rare that you aim to write code with errors.

How much is "significant"? To me any answer that is not exactly the same (up to floating point precision limits) is not good enough. The easiest way to achieve this goal is to prevent any non-zero entries on the diagonal. Stuffing about with tolerances seems like a waste of time.

Review comment: @sjtrny,

Also, ideally, I wouldn't call reviews from long-time core-devs a "waste of time".

Review comment: Keeping that in mind, it does not make sense to tolerate some values that are close to but not zero. I did not call the review a waste of time. I was referring to the effort in implementing it and deciding on a threshold value, since it is at odds with the goal of this PR. I highly value the work and effort of all contributors.

Review comment: We understand what you want to do. The point is that very small non-zero values may be present in the pre-computed matrix, and maybe these small non-zero values have no significant effect on the silhouette score. If that's the case, then the check is a bit too strict. We basically need you to illustrate your claim "tiny values will change the result of the silhouette score". You can do this e.g. by computing the score when setting the diagonal to zero vs setting the diagonal to random values with e.g. In any case, I would suggest: