
[MRG+1] Fix regression in silhouette_score for clusters of size 1 #7438

Merged
merged 7 commits on Sep 22, 2016

Conversation

jnothman (Member) commented Sep 15, 2016

Fixes the regression described in #5988. #6089 and the initial patch here were incorrect. This replaces #6089: it is #6089 rebased, with an error context manager added.

@jnothman jnothman added this to the 0.18 milestone Sep 15, 2016
jnothman (Member, Author):

The test could probably be strengthened.

ogrisel (Member) commented Sep 16, 2016

@jnothman here is an excerpt from the Silhouette paper mentioned in the docstring:

When cluster A contains only a single object it is unclear how u(i) should be defined, and then
we simply set s(i) equal to zero. This choice is of course arbitrary, but a value of zero appears to
be most neutral. Indeed, from the above definition we easily see that -1 <= s(i) <= 1 for each object i.

http://www.sciencedirect.com/science/article/pii/0377042787901257

Unfortunately your PR does not seem to implement this:

>>> import numpy as np
>>> from sklearn.metrics import silhouette_samples
>>> silhouette_samples(np.array([[0], [0.1], [1]]), np.array([0, 0, 1]))
array([ 0.9       ,  0.88888889, -0.05      ])

The last value of that array should be 0 instead of -0.05.
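The paper's convention can be checked against a small hand-rolled helper (a sketch only, not the scikit-learn implementation; `silhouette_paper` is a hypothetical name):

```python
import numpy as np

def silhouette_paper(X, labels):
    # Minimal sketch of Rousseeuw (1987): s(i) = 0 for singleton clusters,
    # otherwise s(i) = (b(i) - a(i)) / max(a(i), b(i)).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton cluster: s(i) stays 0 per the paper
        a = D[i, same].sum() / (same.sum() - 1)   # mean intra-cluster distance
        b = min(D[i, labels == lab].mean()        # nearest other cluster
                for lab in np.unique(labels) if lab != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

print(silhouette_paper(np.array([[0.], [0.1], [1.]]), np.array([0, 0, 1])))
# -> [0.9, 0.88888889, 0.]: same first two values as above, but 0 for the
#    singleton rather than -0.05
```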

jnothman (Member, Author):

Good catch. Though I think it might have been similarly misbehaved in 0.17. And annoyingly, the main example in the paper appears not to cover this case.

jnothman (Member, Author):

Yes, this patch is wrong in a few ways.

jnothman (Member, Author):

@ogrisel, what do you think the behaviour should be for a cluster consisting exclusively of multiple identical points? Score 0 or 1?

Should we leave this broken for 0.18.0? Add a note that we know it's broken?

ogrisel (Member) commented Sep 20, 2016

@ogrisel, what do you think behaviour should be for a cluster exclusively of multiple identical points??? Score 0 or 1?

I think it should be 1 by continuity with a cluster with 2 very close points.
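That continuity argument can be made concrete with toy numbers (illustrative values, not from the PR):

```python
# Evaluating s(i) = (b - a) / max(a, b) by hand for one of two points that
# are eps apart in their own cluster, with the nearest other cluster about
# 3.0 away: the score tends to 1 as eps -> 0, which supports scoring
# identical duplicates as 1.
for eps in (0.1, 1e-3, 1e-6):
    a, b = eps, 3.0
    print(eps, (b - a) / max(a, b))
```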

ogrisel (Member) left a comment

It would be great to fix this before the 0.18 release, although I would not make it a strict blocker.

labels = np.array([1, 0, 1, 1, 1])
# The distance matrix doesn't actually matter.
D = np.random.RandomState(0).rand(len(labels), len(labels))
silhouette = silhouette_score(D, labels, metric='precomputed')
assert_false(np.isnan(silhouette))
ss = silhouette_samples(D, labels, metric='precomputed')
assert_false(np.isnan(ss).any())
assert_equal(ss[1], 0.)
ogrisel (Member) commented Sep 20, 2016:
Instead of using a random precomputed metric array, could you please use a toy 2D X array with shape (4, 1) in 1-D Euclidean space, such as X = [[1.], [0.], [1.], [2.]], and then check that the output of silhouette_samples is [1., 0., 1., 0.] as expected?

jnothman (Member, Author):
Thanks for the review and for handing an example to me on a silver platter.

ogrisel (Member) left a comment

Please consider the following comments:

# silhouette = [.5, .5, 0]
# Cluster 3: intra-cluster = [0, 0]
# inter-cluster = [arbitrary, arbitrary]
# silhouette = [1., 1.]

Please rename the cluster ids (1, 2, 3) in this comment to match 0, 1, and 2 used in the labels variable.

@@ -180,22 +182,24 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):
# cluster in inter_clust_dists[i]
inter_clust_dists = np.inf * intra_clust_dists

-    for curr_label in unique_labels:
+    for curr_label in range(len(unique_labels)):

# Find inter_clust_dist for all samples belonging to the same
# label.
mask = labels == curr_label
This is now broken if labels are not contiguous integers starting at zero, no?

Yes, it was broken previously :/

# is not discussed in reference material, and we choose for it a sample
# score of 1.
X = [[0.], [1.], [1.], [2.], [3.], [3.]]
labels = np.array([0, 1, 1, 1, 2, 2])
Actually please update this test to have non-contiguous labels that do not start at zero, for instance:

labels = np.array([1, 3, 3, 3, 4, 4])

jnothman (Member, Author):

I think that would be confusing here. There's already a test_non_encoded_labels, and if I add a *2 in it, the labels become gappy. Actually, it was only testing silhouette_score, not silhouette_samples. Changed. Also removed the double encoding of labels from silhouette_score.

@@ -205,6 +209,8 @@ def silhouette_samples(X, labels, metric='euclidean', **kwds):

sil_samples = inter_clust_dists - intra_clust_dists
sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
This will still raise a division-by-zero warning if all samples are duplicates of one another and all belong to the same cluster. But maybe we don't care that much about such a rare edge case...

Also if they belong to different clusters. Quick example:

 silhouette_samples([[1.0], [1.0], [1.0], [1.0]], [0, 0, 1, 1])

However, according to the formula, this might be expected behaviour.
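In that degenerate case every a(i) and b(i) is 0, so the expression reduces to 0/0. One possible guard (an assumption for illustration, not necessarily what the merged patch does) is an errstate context plus nan_to_num:

```python
import numpy as np

# All samples identical: every intra- and inter-cluster distance is 0,
# so the silhouette formula yields 0/0 (nan) for every sample. Suppress
# the warning and map nan -> 0 afterwards.
intra_clust_dists = np.zeros(4)
inter_clust_dists = np.zeros(4)
with np.errstate(divide="ignore", invalid="ignore"):
    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
print(np.nan_to_num(sil_samples))  # -> [0. 0. 0. 0.]
```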

jnothman (Member, Author):
If all samples belong to one cluster, silhouette_score is undefined according to the paper. All points being identical is not covered in the paper. Understandably.
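For the single-cluster case, scikit-learn refuses the computation outright rather than returning an undefined score (behaviour as I understand current releases; check your version):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# silhouette_score requires 2 <= n_labels <= n_samples - 1, so a single
# cluster raises a ValueError.
try:
    silhouette_score(np.array([[0.], [1.], [2.]]), np.array([0, 0, 0]))
except ValueError as exc:
    print("rejected:", type(exc).__name__)
```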

Fair enough then.

# (cluster 0). We also test the case where there are identical samples
# as the only members of a cluster (cluster 2). To our knowledge, this case
# is not discussed in reference material, and we choose for it a sample
# score of 1.
MechCoder (Member) commented Sep 21, 2016:
Why is having identical samples in a cluster a special case? It comes directly from the formula, right?

After fixing i, a(i) can be viewed as mean distance to all other identical points which is zero and b(i) can be anything. So the formula reduces to b(i) / b(i) which is 1, no?
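On the toy data used in the test under review, both conventions can be checked directly (the expected values follow from the formula; the output assumes a scikit-learn version with this fix):

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy data from the test: cluster 0 is a singleton, cluster 2 holds two
# identical points at 3.0.
X = np.array([[0.], [1.], [1.], [2.], [3.], [3.]])
labels = np.array([0, 1, 1, 1, 2, 2])
s = silhouette_samples(X, labels)
print(s[0], s[4], s[5])  # singleton -> 0.0; identical pair -> 1.0 1.0
```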

jnothman (Member, Author):
Why is having identical samples in a cluster a special case?

It's not a special case, it just represents a non-smooth part of the function, where you might have expected, for clustering evaluation, that duplicating a point and assigning it the same label should not affect the score.

As a matter of implementation, I could have, even accidentally, handled the single-point cluster case in a way that assigned this 0, so it's important that we make a decision on it and test it.


MechCoder (Member):

LGTM pending @ogrisel's comments

MechCoder changed the title from "[MRG] Fix regression in silhouette_score for clusters of size 1" to "[MRG+1] Fix regression in silhouette_score for clusters of size 1" on Sep 21, 2016
if n_samples_curr_lab != 0:
intra_clust_dists[mask] = np.sum(
current_distances[:, mask], axis=1) / n_samples_curr_lab
else:
intra_clust_dists[mask] = 0
Is it worth initialising intra_clust_dists with np.zeros at the start and getting rid of the else?

jnothman (Member, Author):
Done.
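The simplification looks roughly like this (illustrative data and variable names, not the actual patch):

```python
import numpy as np

# Initialising with np.zeros means a cluster whose mask selects a single
# sample (n_other == 0) needs no special-case else branch: its entries
# simply stay 0.
points = np.array([0., 1., 2.])
labels = np.array([0, 1, 1])
D = np.abs(points[:, None] - points[None, :])   # pairwise distances

intra_clust_dists = np.zeros(len(labels))
for lab in np.unique(labels):
    mask = labels == lab
    n_other = mask.sum() - 1                    # exclude the sample itself
    if n_other > 0:
        intra_clust_dists[mask] = D[np.ix_(mask, mask)].sum(axis=1) / n_other
print(intra_clust_dists)  # -> [0. 1. 1.]
```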


silhouette_score(X, labels + 10), silhouette_score(X, labels))
silhouette_score(X, labels * 2 + 10), silhouette_score(X, labels))
assert_array_equal(
silhouette_samples(X, labels * 2 + 10), silhouette_samples(X, labels))
Great checks, thanks!

ogrisel (Member) left a comment
LGTM 👍

@ogrisel ogrisel merged commit 3e7a7ca into scikit-learn:master Sep 22, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
* tag '0.18': (1286 commits)
  [MRG + 1] More versionadded everywhere! (scikit-learn#7403)
  minor doc fixes
  fix lbfgs rename (scikit-learn#7503)
  minor fixes to whatsnew
  fix scoring function table
  fix rebase messup
  DOC more what's new subdivision
  DOC Attempt to impose some order on What's New 0.18
  no fixed width within bold
  REL changes for release in 0.18.X branch (scikit-learn#7414)
  [MRG+2] Timing and training score in GridSearchCV (scikit-learn#7325)
  DOC: Added Nested Cross Validation Example (scikit-learn#7111)
  Sync docstring and definition default argument in kneighbors (scikit-learn#7476)
  added contributors for 0.18, minor formatting fixes.
  Fix typo in whats_new.rst
  [MRG+2] FIX adaboost estimators not randomising correctly (scikit-learn#7411)
  Addressing issue scikit-learn#7468. (scikit-learn#7472)
  Reorganize README
  clean up deprecation warning stuff in common tests
  [MRG+1] Fix regression in silhouette_score for clusters of size 1 (scikit-learn#7438)
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017