Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set diagonal entries of distance matrices to zero in silhoutte_samples #12258

Merged
merged 18 commits into from Aug 12, 2019
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
9 changes: 8 additions & 1 deletion doc/whats_new/v0.22.rst
Expand Up @@ -72,8 +72,15 @@ Changelog
memory if ``fit_intercept=False`` and ``X`` is a CSR sparse matrix.
:pr:`14108` by :user:`Alex Henrie <alexhenrie>`.

:mod:`sklearn.metrics`
......................

- |Fix| Raise a ValueError in :func:`metrics.silhouette_score` when a
precomputed distance matrix contains non-zero diagonal entries.
:issue:`12258` by :user:`Stephen Tierney <sjtrny>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:issue:`12258` by :user:`Stephen Tierney <sjtrny>`.
:issue:`pr` by :user:`Stephen Tierney <sjtrny>`.


:mod:`sklearn.model_selection`
..................
..............................

- |Enhancement| :class:`model_selection.learning_curve` now accepts parameter
``return_times`` which can be used to retrieve computation times in order to
Expand Down
16 changes: 16 additions & 0 deletions sklearn/metrics/cluster/tests/test_unsupervised.py
Expand Up @@ -170,6 +170,22 @@ def test_non_numpy_labels():
silhouette_score(list(X), list(y)), silhouette_score(X, y))


def test_silhouette_nonzero_diag():
# Construct a zero-diagonal matrix
dists = pairwise_distances(
np.array([[0.2, 0.1, 0.12, 1.34, 1.11, 1.6]]).transpose())

# Construct a nonzero-diagonal distance matrix
diag_dists = dists.copy()
np.fill_diagonal(diag_dists, 1)

labels = [0, 0, 0, 1, 1, 1]

assert_raise_message(ValueError, "distance matrix contains non-zero",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use with pytest.raises(ValueError, match=...): now, but I wouldn't prevent this from merging.

silhouette_samples,
diag_dists, labels, metric='precomputed')

sjtrny marked this conversation as resolved.
Show resolved Hide resolved

def assert_raises_on_only_one_label(func):
"""Assert message when there is only one label"""
rng = np.random.RandomState(seed=0)
Expand Down
10 changes: 10 additions & 0 deletions sklearn/metrics/cluster/unsupervised.py
Expand Up @@ -210,6 +210,16 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

"""
X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])

# Check for diagonal entries in precomputed distance matrix
if metric == 'precomputed':
diag_indices = np.diag_indices(X.shape[0])
if np.any(X[diag_indices]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we still need to consider floating point error here? e.g., np.any([1e-18]) is True. We can use np.isclose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose? Is it to allow some small but non-zero values on the diagonal?

That seems counter to the pull request because tiny values will change the result of the silhouette score.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point here is to avoid unexpected error, we generally avoid doing things like float == 0.

tiny values will change the result of the silhouette score.

Do you have an example to show that small numbers on the diagonal (e.g., 1e-10) will change the score significantly?

Copy link
Contributor Author

@sjtrny sjtrny Jun 26, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point here is to avoid unexpected error

Uhhhhh, errors are generally unexpected, it's pretty rare that you aim to write code with errors.

Do you have an example to show that small numbers on the diagonal (e.g., 1e-10) will change the score significantly?

How much is "significant"? To me any answer that is not exactly the same (up to floating point precision limits) is not good enough. The easiest way to achieve this goal is to prevent any non-zero entries on the diagonal. Stuffing about with tolerances seems like a waste of time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uhhhhh, errors are generally unexpected, it's pretty rare that you aim to write code with errors.

I mean sometimes some elements on the diagonal are actually zero, but due to floating point errors, we need to use np.isclose to check whether they're zero, see e.g., #10005 #11591

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sjtrny ,

  1. you can just use np.diagonal to directly get the values
  2. As noted by @qinhanmin2014 , exact comparison with float values are almost always false, due to numerical precision issues. Having a small tolerance threshold would make sense here.

Also, ideally, I wouldn't call reviews from long-time core-devs a "waste of time".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I want it to return False if it is not exactly zero since these numerical precision issues cause incorrect calculation. I want to force users (myself) to set the diag to zero before calling this function to avoid incorrect calculation.

Keeping that in mind it does not make sense to tolerate some values that are close to but not zero.

I did not call the review a waste of time. I was referring to the effort in implementing it and deciding on threshold value since it is at odds of the goal of this PR. I highly value the work and effort of all contributors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We understand what you want to do.

The point is that very small non-zero values may be present in the pre-computed matrix, and maybe these small non-zeros values have no significant effect on the silhouette score. If that's the case, then the check is a bit too strict.

We basically need you to illustrate your claim "tiny values will change the result of the silhouette score". You can do this e.g. by computing the score when setting the diagonal to zero vs setting the diagonal to random values with e.g. |x| < 1e-10.

In any case, I would suggest:

  • address previous comments
  • indicate in the docstring for metric that the matrix must have zero diagonal entries when it's precomputed
  • indicate in the error message how to set the diagonal entries to zero (e.g. np.fill_diagonal(X, 0))

raise ValueError(
'The precomputed distance matrix contains non-zero '
'elements on the diagonal.'
)

le = LabelEncoder()
labels = le.fit_transform(labels)
n_samples = len(labels)
Expand Down