[MRG+1] Make Silhouette_score support sparse X #7170

Merged
merged 7 commits into from Aug 11, 2016

Conversation

Projects
None yet
3 participants
@yenchenlin
Contributor

yenchenlin commented Aug 10, 2016

Currently silhouette_score only support dense data, this PR would make it support sparse input.
Also, this PR fix #6317.

+ silhouette = silhouette_score(D, y, metric='precomputed',
+ sample_size=int(X.shape[0] / 2),
+ random_state=0)
+ silhouette_metric = silhouette_score(X, y, metric='euclidean',

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

bad naming

@jnothman

jnothman Aug 10, 2016

Member

bad naming

- silhouette = silhouette_score(D, y, metric='precomputed')
- assert(silhouette > 0)
+
+ for X in [X_dense, X_sparse]:

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

but we also want to test that the score equal whether the input is sparse/dense

@jnothman

jnothman Aug 10, 2016

Member

but we also want to test that the score equal whether the input is sparse/dense

@@ -15,31 +15,29 @@
def test_silhouette():
# Tests the Silhouette Coefficient.
dataset = datasets.load_iris()
- X = dataset.data
+ X_dense = dataset.data
+ X_sparse = csr_matrix(X_dense)

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

should also test with something other than CSR

@jnothman

jnothman Aug 10, 2016

Member

should also test with something other than CSR

This comment has been minimized.

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

You mean csc_matrix?

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

You mean csc_matrix?

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

I'd rather something a bit weirder, like LIL or DOK, just to be sure it's not relying on data matrix properties, etc.

@jnothman

jnothman Aug 10, 2016

Member

I'd rather something a bit weirder, like LIL or DOK, just to be sure it's not relying on data matrix properties, etc.

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Comments addressed, please have a look!

ping @jnothman

Contributor

yenchenlin commented Aug 10, 2016

Comments addressed, please have a look!

ping @jnothman

+ assert(silhouette_metric > 0)
+ assert_almost_equal(score_euclidean, score_precomputed)
+ if X is X_dense:
+ X_dense_score['precomputed'] = score_precomputed

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

These are asserted to be equal, so we needn't test for both.

@jnothman

jnothman Aug 10, 2016

Member

These are asserted to be equal, so we needn't test for both.

This comment has been minimized.

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

done

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

done

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 10, 2016

Member

LGTM

Member

jnothman commented Aug 10, 2016

LGTM

@jnothman jnothman changed the title from Make Silhouette_score support sparse X to [MRG+1] Make Silhouette_score support sparse X Aug 10, 2016

+ D = pairwise_distances(X, metric='euclidean')
+ # Given that the actual labels are used, we can assume that S would be
+ # positive.
+ silhouette = silhouette_score(D, y, metric='precomputed')

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

Can you make the name changes for this and the next metric variable?

@jnothman

jnothman Aug 10, 2016

Member

Can you make the name changes for this and the next metric variable?

This comment has been minimized.

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Fixed!

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Fixed!

+ sample_size=int(X.shape[0] / 2),
+ random_state=0)
+ assert(score_precomputed > 0)
+ assert(score_euclidean > 0)

This comment has been minimized.

@ogrisel

ogrisel Aug 10, 2016

Member

This PR is a great opportunity to use assert_greater to check those assertions and get meaningful error messages in case of future regression.

@ogrisel

ogrisel Aug 10, 2016

Member

This PR is a great opportunity to use assert_greater to check those assertions and get meaningful error messages in case of future regression.

This comment has been minimized.

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Let's use it!

@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Let's use it!

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

So now you're not using assert_greater? I meant use the helper, but don't provide a custom message. assert_greater will construct a message that says what the arguments' values are, unlike this form of assertion.

@jnothman

jnothman Aug 10, 2016

Member

So now you're not using assert_greater? I meant use the helper, but don't provide a custom message. assert_greater will construct a message that says what the arguments' values are, unlike this form of assertion.

This comment has been minimized.

@yenchenlin

yenchenlin Aug 11, 2016

Contributor

Sorry for my misunderstanding, fixed.

@yenchenlin

yenchenlin Aug 11, 2016

Contributor

Sorry for my misunderstanding, fixed.

@ogrisel

This comment has been minimized.

Show comment
Hide comment
@ogrisel

ogrisel Aug 10, 2016

Member

LGTM as well. +1 for merge as soon as CI is all green (with or without my nitpick comment addressed).

Member

ogrisel commented Aug 10, 2016

LGTM as well. +1 for merge as soon as CI is all green (with or without my nitpick comment addressed).

@@ -22,7 +22,6 @@ def test_silhouette():
X_lil = sp.lil_matrix(X_dense)
y = dataset.target
- X_dense_score = {}
for X in [X_dense, X_csr, X_dok, X_lil]:

This comment has been minimized.

@ogrisel

ogrisel Aug 10, 2016

Member

Only 'csc' and 'csr' are now accepted. I don't understand how dok and lil are expected to work.

@ogrisel

ogrisel Aug 10, 2016

Member

Only 'csc' and 'csr' are now accepted. I don't understand how dok and lil are expected to work.

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

That's not how accept_sparse works: "If the input is sparse but not in the allowed format, it will be converted to the first listed format."

@jnothman

jnothman Aug 10, 2016

Member

That's not how accept_sparse works: "If the input is sparse but not in the allowed format, it will be converted to the first listed format."

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Aug 10, 2016

Contributor

@ogrisel , CI has passed, and I also adopt assert_greater here.

Please have a look, thanks!

Contributor

yenchenlin commented Aug 10, 2016

@ogrisel , CI has passed, and I also adopt assert_greater here.

Please have a look, thanks!

+ # Given that the actual labels are used, we can assume that S would be
+ # positive.
+ score_precomputed = silhouette_score(D, y, metric='precomputed')
+ assert_greater(score_precomputed, 0, "silhouette score is not greater than 0")

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

PEP8 line length!

And actually I think the explicit error message adds little. Just use the default.

@jnothman

jnothman Aug 10, 2016

Member

PEP8 line length!

And actually I think the explicit error message adds little. Just use the default.

+ score_precomputed = silhouette_score(D, y, metric='precomputed',
+ sample_size=int(X.shape[0] / 2),
+ random_state=0)
+ assert_greater(score_precomputed, 0, "silhouette score is not greater than 0")

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

ditto

+ score_euclidean = silhouette_score(X, y, metric='euclidean',
+ sample_size=int(X.shape[0] / 2),
+ random_state=0)
+ assert_greater(score_euclidean, 0, "silhouette score is not greater than 0")

This comment has been minimized.

@jnothman

jnothman Aug 10, 2016

Member

ditto

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Aug 10, 2016

Contributor

Okay, done!

Contributor

yenchenlin commented Aug 10, 2016

Okay, done!

yenchenlin added some commits Aug 10, 2016

@jnothman jnothman merged commit 37dbb1c into scikit-learn:master Aug 11, 2016

1 of 3 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
ci/circleci Your tests passed on CircleCI!
Details
@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 11, 2016

Member

Thanks!

Member

jnothman commented Aug 11, 2016

Thanks!

@jnothman

This comment has been minimized.

Show comment
Hide comment
@jnothman

jnothman Aug 11, 2016

Member

Please submit PR to add what's new entry as bug fix.

Member

jnothman commented Aug 11, 2016

Please submit PR to add what's new entry as bug fix.

@yenchenlin

This comment has been minimized.

Show comment
Hide comment
@yenchenlin

yenchenlin Aug 12, 2016

Contributor

okay!

Contributor

yenchenlin commented Aug 12, 2016

okay!

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment