
Set diagonal entries of distance matrices to zero in silhoutte_samples #12258

Merged: 18 commits, Aug 12, 2019

Conversation

@sjtrny (Contributor) commented Oct 3, 2018

Reference Issues/PRs

Fixes #12178

What does this implement/fix? Explain your changes.

  • Added a warning in the presence of nonzero diagonal entries in the distance matrix.
  • Added a test to ensure that this warning is present.

Any other comments?

Long time listener, first time caller. Be gentle!

@qinhanmin2014 (Member) commented Oct 3, 2018

Please correct the flake8 error according to https://travis-ci.org/scikit-learn/scikit-learn/jobs/436612965.
I trust the decision from Joel but honestly I have mixed feelings about the PR. Filling the matrix in a way that leaves these cells arbitrary seems strange to me. What's more, we don't do such validation in many other classes/functions (e.g., DBSCAN). But I'm not opposed to the PR.

sjtrny added 2 commits Oct 4, 2018
@sjtrny (Contributor, Author) commented Oct 4, 2018

Please correct the flake8 error according to https://travis-ci.org/scikit-learn/scikit-learn/jobs/436612965.

All done.

I trust the decision from Joel but honestly I have mixed feelings about the PR. Filling the matrix in a way that leaves these cells arbitrary seems strange to me. What's more, we don't do such validation in many other classes/functions (e.g., DBSCAN). But I'm not opposed to the PR.

I wasn't filling them arbitrarily. The value of the cells had a specific meaning in my application. I assumed that these cells would be ignored when passed to silhouette_samples.

I believe validation via a warning is required. If I ran into this issue then it is likely that:

  1. future users will have the same problem, whether or not they are aware of it
  2. current users may be using incorrect values in production without being aware of it

I would expect that if similar bugs were discovered in other classes/functions such as DBSCAN then a similar warning should be implemented.

@jnothman (Member) left a comment

I'd rather an error than a warning. And you need to use not np.allclose(X[diag_indices], 0) rather than np.any(X[diag_indices]), since pairwise distance calculations cannot always assure exact 0.
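The difference between the two checks can be sketched in a few lines (a minimal illustration with made-up values, not the PR's actual code):

```python
import numpy as np

# A precomputed distance matrix whose diagonal is not exactly zero,
# e.g. because of floating-point error in the pairwise computation.
D = np.array([[1e-12, 0.5],
              [0.5,   0.0]])
diag_indices = np.diag_indices(D.shape[0])

# Exact check: flags any non-zero entry, however tiny.
strict_flag = bool(np.any(D[diag_indices]))          # True

# Tolerant check: only flags entries that are not close to zero.
tolerant_flag = not np.allclose(D[diag_indices], 0)  # False
```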

@jnothman (Member) commented Oct 4, 2018

I think this is input validation, like we do to tell users that their data contains nan or inf and something is not as expected. It should raise a ValueError.

@sjtrny (Contributor, Author) commented Oct 4, 2018

And you need to use not np.allclose(X[diag_indices], 0) rather than np.any(X[diag_indices]), since pairwise distance calculations cannot always assure exact 0.

The distance matrix MUST have exactly zero diagonal entries, otherwise silhouette_samples will be erroneous. Even a small nonzero value like the default atol of 1e-08 for np.allclose could be catastrophic if the differences between points are also very small.

@jnothman (Member) commented Oct 4, 2018

Even a small nonzero value like the default atol of 1e-08 for np.allclose could be catastrophic if the differences between points are also very small.

I don't think the concern is in reality that significant.

However, I think it is no great cost to just ignore those diagonal terms with something like:

--- a/sklearn/metrics/cluster/unsupervised.py
+++ b/sklearn/metrics/cluster/unsupervised.py
@@ -142,6 +142,8 @@ def _silhouette_reduce(D_chunk, start, labels, label_freqs):
     intra_index = (np.arange(len(D_chunk)), labels[start:start + len(D_chunk)])
     # intra_clust_dists are averaged over cluster size outside this function
     intra_clust_dists = clust_dists[intra_index]
+    # subtract off diagonal distances in case they are not zero
+    intra_clust_dists -= np.diag(D_chunk, start)
     # of the remaining distances we normalise and extract the minimum
     clust_dists[intra_index] = np.inf
     clust_dists /= label_freqs
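The key call in the patch above, np.diag(D_chunk, start), extracts the self-distances for the rows of the current chunk. A small sketch (toy matrix, values arbitrary) of what it picks out:

```python
import numpy as np

# Toy matrix standing in for a full 6x6 distance matrix.
D = np.arange(36, dtype=float).reshape(6, 6)

# A chunk of rows starting at row `start`, as in _silhouette_reduce.
start = 2
D_chunk = D[start:start + 2]

# The start-th diagonal of the chunk is exactly the self-distances
# D[2, 2] and D[3, 3] of the rows in the chunk.
self_dists = np.diag(D_chunk, start)
print(self_dists)  # [14. 21.]
```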
@sjtrny (Contributor, Author) commented Oct 5, 2018

I don't think the concern is in reality that significant.

I more meant that adding code which is not invariant to the scale is just as bad as doing nothing.

However, I think it is no great cost to just ignore those diagonal terms with something like:

+++ b/sklearn/metrics/cluster/unsupervised.py
+    intra_clust_dists -= np.diag(D_chunk, start)

Yep, that was my original idea in #12178. Should I change the PR to this?


sjtrny added 3 commits Oct 10, 2018
@jnothman (Member) left a comment

Thanks, Stephen.

- |Fix| Fixed a bug in :func:`metrics.silhouette_score` where diagonal entries
of a precomputed distance matrix affected the final score.

@jnothman (Member) commented Oct 10, 2018:

It would have affected non-precomputed distances too, wouldn't it, if there were precision issues in calculating those diagonals? We should handle that case too.

@@ -210,6 +210,10 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
    """
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    if metric == 'precomputed':
        diag_indices = np.diag_indices(X.shape[0])
        X[diag_indices] = 0

@jnothman (Member) commented Oct 10, 2018:

This modifies the user's data. It either needs a copy, or it needs to happen somewhere else. I'd suggested doing it in the chunked reduce operation.
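The hazard described here can be demonstrated in a few lines (a minimal sketch, not the PR's code):

```python
import numpy as np

# Zeroing the diagonal without a copy mutates the caller's array.
X_user = np.array([[0.5, 1.0],
                   [1.0, 0.5]])
X = X_user                  # no copy: X aliases the user's data
X[np.diag_indices(2)] = 0
# X_user[0, 0] is now 0.0: the user's data was silently modified.

# With an explicit copy, the caller's array is left untouched.
Y_user = np.array([[0.5, 1.0],
                   [1.0, 0.5]])
Y = Y_user.copy()
Y[np.diag_indices(2)] = 0
# Y_user[0, 0] is still 0.5.
```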

sjtrny added 2 commits Oct 14, 2018
@sjtrny changed the title from "Add warning to silhoutte_samples in presence of nonzero diagonal entries" to "Set diagonal entries of distance matrices to zero in silhoutte_samples" on Oct 14, 2018
@jnothman (Member) left a comment

Hopefully only minor things now.


- |Fix| Fixed a bug in :func:`metrics.silhouette_score` where diagonal entries
of a precomputed distance matrix affected the final score.
:issue:`12258` by `Stephen Tierney`.

@jnothman (Member) commented Oct 14, 2018:

Use :user: here.

@sjtrny (Author, Contributor) commented Oct 15, 2018:

Sorry, I followed the bottom example, which seems to be missing the user tag.

- |Fix| Fixed bug in :class:`preprocessing.OrdinalEncoder` when passing
  manually specified categories. :issue:`12365` by `Joris Van den Bossche`_.
@@ -1356,4 +1363,4 @@ Mehta, Vinit, Vinod Kumar L, Viraj Mavani, Viraj Navkal, Vivek Kumar, Vlad
Niculae, vqean3, Vrishank Bhardwaj, vufg, wallygauze, Warut Vijitbenjaronk,
wdevazelhes, Wenhao Zhang, Wes Barnett, Will, William de Vazelhes, Will
Rosenfeld, Xin Xiong, Yiming (Paul) Li, ymazari, Yufeng, Zach Griffith, Zé
Vinícius, Zhenqing Hu, Zhiqing Xiao, Zijie (ZJ) Poh
Vinícius, Zhenqing Hu, Zhiqing Xiao, Zijie (ZJ) Poh

@jnothman (Member) commented Oct 14, 2018:

Please don't remove this trailing newline

@sjtrny (Author, Contributor) commented Oct 15, 2018:

Not sure what happened here. My local copy is not missing this line.

@sjtrny (Contributor, Author) commented Oct 15, 2018

The test that is failing on AppVeyor is in sklearn\utils\tests\test_sparsefuncs.py.

@jnothman (Member) commented Oct 15, 2018


- |Fix| Fixed a bug in :func:`metrics.silhouette_score` where diagonal entries
of a precomputed distance matrix affected the final score.
:issue:`12258` by :user:`Stephen Tierney`.

@jnothman (Member) commented Oct 15, 2018:

Yes, core devs get a simplified syntax. This syntax, provided by the sphinx-issues plugin, generates a link to GitHub. But you need something like :user:`Stephen Tierney <sjtrny>`

@@ -131,6 +131,11 @@ def _silhouette_reduce(D_chunk, start, labels, label_freqs):
    label_freqs : array
        distribution of cluster labels in ``labels``
    """
    # Copy the data so that we don't change the original

@jnothman (Member) commented Oct 15, 2018:

I hope this is an acceptable sacrifice to make for an edge case...

@jnothman (Member) commented Oct 15, 2018

Could you please separately check that there is no similar issue affecting davies-bouldin?

@rth (Member) left a comment

I think it's better to explicitly raise a ValueError when the user-provided precomputed distance matrix doesn't have a diagonal smaller than a few times np.finfo(X.dtype).eps, rather than trying to correct for it (and making a copy of the data in the process). IMO it's up to the code that generates the precomputed distance matrix to make sure that the diagonal is 0.

We could have some validation function; this affects all estimators that take a precomputed distance matrix as input.

Even a small nonzero value like the default atol of 1e-08 for np.allclose could be catastrophic if the differences between points is also very small.

I more meant that adding code which is not invariant to the scale is just as bad as doing nothing.

Well, if the data has a scale of 1e-8, a number of numerical algorithms would produce unsatisfactory results, so it is probably better to rescale it in any case.

@qinhanmin2014 (Member) commented Jun 26, 2019

I think it's better to explicitly raise a ValueError when the user-provided precomputed distance matrix doesn't have a diagonal smaller than a few times np.finfo(X.dtype).eps,

+1 to raise an error.
Actually we manually set the diagonal to 0 when calculating some distances (e.g., euclidean_distances), though I think we still need to discuss whether and when it is appropriate to do so (see the discussions in #13877).

sjtrny added 3 commits Jun 26, 2019
@sjtrny (Contributor, Author) commented Jun 26, 2019

I've changed the behaviour to raise a ValueError if silhouette_samples is passed a precomputed distance matrix with non-zero diagonal entries. Could someone please re-review the changes?

Overall this seems to be the approach preferred by the community, and it requires less code and fewer tests than the alternative methods.

@jnothman (Member) left a comment

Is this not also applicable to davies-bouldin?

@sjtrny (Contributor, Author) commented Jun 26, 2019

Is this not also applicable to davies-bouldin?

From the current API perspective, no, as you can't pass a precomputed distance matrix to davies_bouldin_score. If we wanted to support precomputed distance matrices or a function to compute distances, then yes, as discussed in #12409.

# Check for diagonal entries in precomputed distance matrix
if metric == 'precomputed':
    diag_indices = np.diag_indices(X.shape[0])
    if np.any(X[diag_indices]):

@qinhanmin2014 (Member) commented Jun 26, 2019:

Maybe we still need to consider floating point error here? e.g., np.any([1e-18]) is True. We can use np.isclose.
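The point about floating-point error can be shown in two lines (illustrative values only):

```python
import numpy as np

diag = np.array([1e-18, 0.0, 0.0])

# Exact test: even 1e-18 counts as "non-zero".
exact_nonzero = bool(np.any(diag))               # True

# Tolerant test: every entry is close to zero within default tolerances.
all_near_zero = bool(np.isclose(diag, 0).all())  # True
```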

@sjtrny (Author, Contributor) commented Jun 26, 2019:

What's the purpose? Is it to allow some small but non-zero values on the diagonal?

That seems counter to the pull request because tiny values will change the result of the silhouette score.

@qinhanmin2014 (Member) commented Jun 26, 2019:

The point here is to avoid unexpected errors; we generally avoid doing things like float == 0.

tiny values will change the result of the silhouette score.

Do you have an example to show that small numbers on the diagonal (e.g., 1e-10) will change the score significantly?

@sjtrny (Author, Contributor) commented Jun 26, 2019:

The point here is to avoid unexpected error

Uhhhhh, errors are generally unexpected, it's pretty rare that you aim to write code with errors.

Do you have an example to show that small numbers on the diagonal (e.g., 1e-10) will change the score significantly?

How much is "significant"? To me any answer that is not exactly the same (up to floating point precision limits) is not good enough. The easiest way to achieve this goal is to prevent any non-zero entries on the diagonal. Stuffing about with tolerances seems like a waste of time.

@qinhanmin2014 (Member) commented Jun 26, 2019:

Uhhhhh, errors are generally unexpected, it's pretty rare that you aim to write code with errors.

I mean sometimes some elements on the diagonal are actually zero, but due to floating point errors we need to use np.isclose to check whether they're zero; see e.g. #10005, #11591.

@NicolasHug (Contributor) commented Aug 7, 2019:

@sjtrny,

  1. you can just use np.diagonal to directly get the values
  2. As noted by @qinhanmin2014, exact comparisons with float values are almost always false due to numerical precision issues. Having a small tolerance threshold would make sense here.

Also, ideally, I wouldn't call reviews from long-time core-devs a "waste of time".

@sjtrny (Author, Contributor) commented Aug 8, 2019:
  1. I want it to return False if it is not exactly zero, since these numerical precision issues cause incorrect calculations. I want to force users (myself included) to set the diagonal to zero before calling this function to avoid incorrect results.

Keeping that in mind it does not make sense to tolerate some values that are close to but not zero.

I did not call the review a waste of time. I was referring to the effort of implementing it and deciding on a threshold value, since that is at odds with the goal of this PR. I highly value the work and effort of all contributors.

@NicolasHug (Contributor) commented Aug 8, 2019:

We understand what you want to do.

The point is that very small non-zero values may be present in the pre-computed matrix, and maybe these small non-zeros values have no significant effect on the silhouette score. If that's the case, then the check is a bit too strict.

We basically need you to illustrate your claim "tiny values will change the result of the silhouette score". You can do this e.g. by computing the score when setting the diagonal to zero vs setting the diagonal to random values with e.g. |x| < 1e-10.

In any case, I would suggest:

  • address previous comments
  • indicate in the docstring for metric that the matrix must have zero diagonal entries when it's precomputed
  • indicate in the error message how to set the diagonal entries to zero (e.g. np.fill_diagonal(X, 0))
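The np.fill_diagonal call mentioned in the last bullet zeroes the diagonal in place, e.g.:

```python
import numpy as np

# A distance matrix with floating-point dust on the diagonal.
D = np.array([[1e-12, 0.4],
              [0.4,   5e-13]])
np.fill_diagonal(D, 0)   # in-place: guarantees exactly-zero self-distances
print(D[0, 0], D[1, 1])  # 0.0 0.0
```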
@jnothman (Member) commented Jun 26, 2019


- |Fix| Raise a ValueError in :func:`metrics.silhouette_score` when a
precomputed distance matrix contains non-zero diagonal entries.
:issue:`12258` by :user:`Stephen Tierney <sjtrny>`.

@NicolasHug (Contributor) commented Aug 7, 2019:
Suggested change:
- :issue:`12258` by :user:`Stephen Tierney <sjtrny>`.
+ :issue:`pr` by :user:`Stephen Tierney <sjtrny>`.

labels = [0, 0, 0, 1, 1, 1]

assert_raise_message(ValueError, "distance matrix contains non-zero",

@NicolasHug (Contributor) commented Aug 7, 2019:

we use with pytest.raises(ValueError, match=...): now, but I wouldn't prevent this from merging.


@sjtrny (Contributor, Author) commented Aug 8, 2019

I agree with your comments about the documentation, the others not so much.

There’s no need for me to illustrate “tiny values will change the result of silhouette score” as I have already done so.

As for “no significant effect on silhouette score”, what is deemed to be significant?

We can argue forever about the tolerance value for “significance” since it is subjective. I would rather take the easy and less time consuming route of strictness.

@NicolasHug (Contributor) commented Aug 8, 2019

There’s no need for me to illustrate “tiny values will change the result of silhouette score” as I have already done so.

I'm sorry if I missed something, but I don't see where you back this claim with data.

We cannot just take your word for it, we're simple scientists, we need data.


@sjtrny, if I may: addressing reviewers' comments and concerns is always way, way faster than arguing over and over about what you think makes sense and what does not. As reviewers, we don't make comments because we like to annoy people; we do it because we want what's most sensible for scikit-learn users. Sometimes these comments will make sense, sometimes they won't.

@sjtrny (Contributor, Author) commented Aug 8, 2019

Supporting evidence can be found in the linked issue #12178

@NicolasHug (Contributor) commented Aug 8, 2019

Supporting evidence can be found in the linked issue #12178

No. You're setting the diagonal to ones there. What we suggested were very small values, e.g. 1e-10.

@rth (Member) commented Aug 8, 2019

Taking the example from the parent issue, it doesn't look like a diagonal at 1e-10 has an effect (or at least a measurable one there):

In [1]: import numpy as np 
   ...: from sklearn.metrics.pairwise import pairwise_distances 
   ...: from sklearn.metrics import silhouette_samples 
   ...:  
   ...: dists = pairwise_distances(np.array([[0.2, 0.1, 0.12, 1.34, 1.11, 1.6]]).transpose())                                                  

In [5]: labels = [0,0,0,1,1,1]                                                                                                                 

In [7]: silhouette_samples(dists, labels, metric="precomputed")                                                                                
Out[7]: 
array([0.92173913, 0.952     , 0.95934959, 0.79583333, 0.62886598,
       0.74315068])

In [8]: silhouette_samples(dists + np.diag(1e-10*np.ones(6)), labels, metric="precomputed")                                                    
Out[8]: 
array([0.92173913, 0.952     , 0.95934959, 0.79583333, 0.62886598,
       0.74315068])

In [11]: silhouette_samples(dists + np.diag(1e-8*np.ones(6)), labels, metric="precomputed")                                                    
Out[11]: 
array([0.92173913, 0.952     , 0.95934959, 0.79583333, 0.62886597,
       0.74315068])

Raising an error if a float is not exactly equal to some value is just looking for trouble; we had failing tests on some architectures due to this in the past (ref #10562), and I would also be +1 to raising an error only if the diagonal is larger in absolute value than some threshold.
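One possible shape for such a threshold check is an error that scales with the magnitude of the distances (a hypothetical sketch, not what was merged; the name check_zero_diagonal and the rtol value are made up):

```python
import numpy as np

def check_zero_diagonal(D, rtol=1e-10):
    """Hypothetical validator: reject diagonals that are large relative
    to the scale of the distance matrix itself."""
    atol = rtol * np.abs(D).max()
    if np.any(np.abs(np.diagonal(D)) > atol):
        raise ValueError(
            "The precomputed distance matrix contains non-zero elements "
            "on the diagonal. Use np.fill_diagonal(X, 0) to set them to 0.")

# Floating-point dust on the diagonal passes the check.
check_zero_diagonal(np.array([[1e-14, 1.0], [1.0, 0.0]]))  # no error
```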

@sjtrny (Contributor, Author) commented Aug 8, 2019

No. You're setting the diagonal to ones here. What we suggested were very small values e.g. 1e-10.

You literally said, and I quote: "We basically need you to illustrate your claim 'tiny values will change the result of the silhouette score'." You only said "change". Tiny values do change the result of the silhouette score.

import numpy as np
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import pairwise_distances

np.random.seed(0)

dist_scale = 1

dists = pairwise_distances(dist_scale * np.random.rand(6, 1))

labels = [0, 0, 0, 1, 1, 1]

sil1 = silhouette_samples(dists, labels, metric="precomputed")
sil2 = silhouette_samples(dists + np.diag(1e-10 * np.ones(6)), labels, metric="precomputed")

sil1, sil2, np.array_equal(sil1, sil2)

Output:

(array([-0.31565073,  0.21262581,  0.10908099, -0.30370346,  0.13528038,
        -0.56791659]),
 array([-0.31565073,  0.21262581,  0.10908099, -0.30370346,  0.13528038,
        -0.56791659]),
 False)
@sjtrny (Contributor, Author) commented Aug 8, 2019

An important point that seems to be lost is that any tolerance needs to be relative to the scale of the precomputed distances. Smaller distances are more negatively affected.

import numpy as np
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import pairwise_distances

np.random.seed(0)

dist_scale = 0.0000001

dists = pairwise_distances(dist_scale * np.random.rand(6, 1))

labels = [0, 0, 0, 1, 1, 1]

sil1 = silhouette_samples(dists, labels, metric="precomputed")
sil2 = silhouette_samples(dists + np.diag(1e-10 * np.ones(6)), labels, metric="precomputed")

sil1, sil2

Output:

(array([-0.31565073,  0.21262581,  0.10908099, -0.30370346,  0.13528038,
        -0.56791659]),
 array([-0.31874277,  0.20980168,  0.10372613, -0.30682252,  0.13276277,
        -0.56924916]))
sjtrny added 2 commits Aug 8, 2019
@NicolasHug (Contributor) commented Aug 12, 2019

Failures are unrelated.

Merging, I'll address the remaining issues in another PR. Thanks @sjtrny

@NicolasHug NicolasHug merged commit 3eacf94 into scikit-learn:master Aug 12, 2019
14 of 17 checks passed
scikit-learn.scikit-learn Build #20190808.60: failed
scikit-learn.scikit-learn (Linux py35_conda_openblas): failed
scikit-learn.scikit-learn (Linux pylatest_conda_mkl_pandas): failed
LGTM analysis: C/C++: no code changes detected
LGTM analysis: JavaScript: no code changes detected
LGTM analysis: Python: no new or fixed alerts
ci/circleci: deploy: passed
ci/circleci: doc: passed
ci/circleci: doc-min-dependencies: passed
ci/circleci: lint: passed
codecov/patch: 100% of diff hit (target 96.87%)
codecov/project: absolute coverage decreased by 1.09% but relative coverage increased by 3.12% compared to a7a834b
scikit-learn.scikit-learn (Linux py35_ubuntu_atlas): succeeded
scikit-learn.scikit-learn (Linux32 py35_ubuntu_atlas_32bit): succeeded
scikit-learn.scikit-learn (Windows py35_pip_openblas_32bit): succeeded
scikit-learn.scikit-learn (Windows py37_conda_mkl): succeeded
scikit-learn.scikit-learn (macOS pylatest_conda_mkl): succeeded