Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Make Silhouette_score support sparse X #7170

Merged
merged 7 commits into from Aug 11, 2016

Conversation

@yenchenlin
Copy link
Contributor

yenchenlin commented Aug 10, 2016

Currently silhouette_score only support dense data, this PR would make it support sparse input.
Also, this PR fix #6317.

silhouette = silhouette_score(D, y, metric='precomputed',
sample_size=int(X.shape[0] / 2),
random_state=0)
silhouette_metric = silhouette_score(X, y, metric='euclidean',

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

bad naming

silhouette = silhouette_score(D, y, metric='precomputed')
assert(silhouette > 0)

for X in [X_dense, X_sparse]:

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

but we also want to test that the score equal whether the input is sparse/dense

@@ -15,31 +15,29 @@
def test_silhouette():
# Tests the Silhouette Coefficient.
dataset = datasets.load_iris()
X = dataset.data
X_dense = dataset.data
X_sparse = csr_matrix(X_dense)

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

should also test with something other than CSR

This comment has been minimized.

Copy link
@yenchenlin

yenchenlin Aug 10, 2016

Author Contributor

You mean csc_matrix?

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

I'd rather something a bit weirder, like LIL or DOK, just to be sure it's not relying on data matrix properties, etc.

@yenchenlin
Copy link
Contributor Author

yenchenlin commented Aug 10, 2016

Comments addressed, please have a look!

ping @jnothman

assert(silhouette_metric > 0)
assert_almost_equal(score_euclidean, score_precomputed)
if X is X_dense:
X_dense_score['precomputed'] = score_precomputed

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

These are asserted to be equal, so we needn't test for both.

This comment has been minimized.

Copy link
@yenchenlin

yenchenlin Aug 10, 2016

Author Contributor

done

@jnothman
Copy link
Member

jnothman commented Aug 10, 2016

LGTM

@jnothman jnothman changed the title Make Silhouette_score support sparse X [MRG+1] Make Silhouette_score support sparse X Aug 10, 2016
D = pairwise_distances(X, metric='euclidean')
# Given that the actual labels are used, we can assume that S would be
# positive.
silhouette = silhouette_score(D, y, metric='precomputed')

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

Can you make the name changes for this and the next metric variable?

This comment has been minimized.

Copy link
@yenchenlin

yenchenlin Aug 10, 2016

Author Contributor

Fixed!

@yenchenlin yenchenlin force-pushed the yenchenlin:silhouette-score branch from 30942b0 to 614baa4 Aug 10, 2016
sample_size=int(X.shape[0] / 2),
random_state=0)
assert(score_precomputed > 0)
assert(score_euclidean > 0)

This comment has been minimized.

Copy link
@ogrisel

ogrisel Aug 10, 2016

Member

This PR is a great opportunity to use assert_greater to check those assertions and get meaningful error messages in case of future regression.

This comment has been minimized.

Copy link
@yenchenlin

yenchenlin Aug 10, 2016

Author Contributor

Let's use it!

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

So now you're not using assert_greater? I meant use the helper, but don't provide a custom message. assert_greater will construct a message that says what the arguments' values are, unlike this form of assertion.

This comment has been minimized.

Copy link
@yenchenlin

yenchenlin Aug 11, 2016

Author Contributor

Sorry for my misunderstanding, fixed.

@ogrisel
Copy link
Member

ogrisel commented Aug 10, 2016

LGTM as well. +1 for merge as soon as CI is all green (with or without my nitpick comment addressed).

@yenchenlin yenchenlin force-pushed the yenchenlin:silhouette-score branch from 3a42817 to abd3038 Aug 10, 2016
@@ -22,7 +22,6 @@ def test_silhouette():
X_lil = sp.lil_matrix(X_dense)
y = dataset.target

X_dense_score = {}
for X in [X_dense, X_csr, X_dok, X_lil]:

This comment has been minimized.

Copy link
@ogrisel

ogrisel Aug 10, 2016

Member

Only 'csc' and 'csr' are now accepted. I don't understand how dok and lil are expected to work.

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

That's not how accept_sparse works: "If the input is sparse but not in the allowed format, it will be converted to the first listed format."

@yenchenlin
Copy link
Contributor Author

yenchenlin commented Aug 10, 2016

@ogrisel , CI has passed, and I also adopt assert_greater here.

Please have a look, thanks!

# Given that the actual labels are used, we can assume that S would be
# positive.
score_precomputed = silhouette_score(D, y, metric='precomputed')
assert_greater(score_precomputed, 0, "silhouette score is not greater than 0")

This comment has been minimized.

Copy link
@jnothman

jnothman Aug 10, 2016

Member

PEP8 line length!

And actually I think the explicit error message adds little. Just use the default.

score_precomputed = silhouette_score(D, y, metric='precomputed',
sample_size=int(X.shape[0] / 2),
random_state=0)
assert_greater(score_precomputed, 0, "silhouette score is not greater than 0")

This comment has been minimized.

Copy link
@jnothman
score_euclidean = silhouette_score(X, y, metric='euclidean',
sample_size=int(X.shape[0] / 2),
random_state=0)
assert_greater(score_euclidean, 0, "silhouette score is not greater than 0")

This comment has been minimized.

Copy link
@jnothman
@yenchenlin yenchenlin force-pushed the yenchenlin:silhouette-score branch from 9240c7c to abd3038 Aug 10, 2016
@yenchenlin
Copy link
Contributor Author

yenchenlin commented Aug 10, 2016

Okay, done!

@yenchenlin yenchenlin force-pushed the yenchenlin:silhouette-score branch from abd3038 to e57bf76 Aug 10, 2016
@jnothman jnothman merged commit 37dbb1c into scikit-learn:master Aug 11, 2016
1 of 3 checks passed
1 of 3 checks passed
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
ci/circleci Your tests passed on CircleCI!
Details
@jnothman
Copy link
Member

jnothman commented Aug 11, 2016

Thanks!

@jnothman
Copy link
Member

jnothman commented Aug 11, 2016

Please submit PR to add what's new entry as bug fix.

@yenchenlin
Copy link
Contributor Author

yenchenlin commented Aug 12, 2016

okay!

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

3 participants
You can’t perform that action at this time.