[MRG+1] Add Davies-Bouldin index #10827

Merged
merged 7 commits into scikit-learn:master from logc:davies-bouldin-index on May 18, 2018

Conversation

@logc
Contributor

logc commented Mar 17, 2018

Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.
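
For context, the metric being added is defined as follows (a standard formulation, consistent with the docstring excerpts quoted later in this thread; s_i is the average distance of the points in cluster i to their centroid, and d_{ij} is the distance between the centroids of clusters i and j):

R_{ij} = \frac{s_i + s_j}{d_{ij}}, \qquad DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}

Lower values indicate a better partition.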

Reference Issues/PRs

closes #7942

What does this implement/fix? Explain your changes.

This implements an unsupervised quality metric for clustering results, the Davies-Bouldin Index, based on an existing PR that had stalled. The differences between this commit and the changes proposed there are minimal. In particular, the tests are copied verbatim, to ensure that this implementation still conforms to what was expected.

Any other comments?

I noticed while working on a toy problem that there are not many alternatives in sklearn for unsupervised metrics of clustering quality. In particular, I missed Davies-Bouldin, which I knew from working with other ML packages (Weka, IIRC).

Looking through sklearn, I found the above-mentioned PR, noticed that its author @tomron does not seem to be responding anymore, and decided to push for this change to get accepted by making a similar proposal. I fixed all remaining open comments.

If there is a better way to get the DB index into sklearn, please tell me. If there are other comments that can still be improved in this implementation, I will do my best to correct them, too.

@jnothman

Thanks for this. Also, flake8 is failing

----------
.. [1] `Davies, David L.; Bouldin, Donald W. (1979).
"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_

@jnothman

jnothman Mar 18, 2018

Member

Please add an Examples section

@logc

logc Mar 18, 2018

Contributor

I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)

@jnothman

jnothman Mar 18, 2018

Member

It should just be a couple of lines showing how you would use this function in a simple case.
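
For reference, a doctest-style sketch of what such an Examples section could look like (the fit parameters and the output value here are illustrative, not taken from the PR):

>>> from sklearn import datasets
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davies_bouldin_index
>>> iris = datasets.load_iris()
>>> kmeans = KMeans(n_clusters=3, random_state=1).fit(iris.data)
>>> davies_bouldin_index(iris.data, kmeans.labels_)  # doctest: +SKIP
0.66...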

@logc

logc Mar 18, 2018

Contributor

Isn't the doctest example in lines 1625-1637 doing that?

@jnothman

jnothman Apr 9, 2018

Member

Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide

cluster analysis.
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import davis_bouldin_index

@jnothman

jnothman Mar 18, 2018

Member

Typo here

@logc

logc Mar 18, 2018

Contributor

Fixed

rng.rand(10, 2), np.arange(10))
# Assert the value is 0. when all samples are equals
assert_equal(0., davies_bouldin_index(np.ones((10, 2)),

@jnothman

jnothman Mar 18, 2018

Member

These days we would prefer a bare assert, rather than assert_equal
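
For illustration, the requested change, completing the truncated excerpt above with the labels used elsewhere in this thread:

# before
assert_equal(0., davies_bouldin_index(np.ones((10, 2)), [0] * 5 + [1] * 5))
# after
assert 0. == davies_bouldin_index(np.ones((10, 2)), [0] * 5 + [1] * 5)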

@logc

logc Mar 18, 2018

Contributor

Fixed

@jnothman

Member

jnothman commented Mar 18, 2018

Feel like wrapping up #8135 for us also so we don't need to add tests here?

@glemaitre

Contributor

glemaitre commented Mar 18, 2018

#10828 is taking over #8135 and could almost be merged if we don't consider adding the two tests which I mentioned in the summary.

@jnothman

Member

jnothman commented Mar 18, 2018

@logc, I don't think you've pushed a second commit.

@jnothman

Member

jnothman commented Mar 18, 2018

Please merge in the latest master, where we have just added common tests for clustering metrics. Please add this to the common tests in sklearn/metrics/cluster/tests/test_common.py
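
For orientation, registering the new metric there might look roughly like this (a sketch; the dictionary name and surrounding structure are assumptions about test_common.py, not copied from the actual file):

from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_index  # the function this PR adds

# assumed shape: the common tests iterate over a registry of unsupervised metrics
UNSUPERVISED_METRICS = {
    "silhouette_score": silhouette_score,
    "davies_bouldin_index": davies_bouldin_index,
}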

@logc

Contributor

logc commented Mar 18, 2018

@jnothman sorry, I commented before pushing the second commit. The tests take really long locally (!)

@jnothman

Member

jnothman commented Mar 18, 2018

The tests take really long locally (!)

Do you mean the full test suite? The clustering metrics tests?

@logc

Contributor

logc commented Mar 18, 2018

@jnothman yes, the full test suite. Currently, I am running it with make from the project root, before creating a commit. If I only have changes to the test_unsupervised.py file, then I only run that. If there is a better strategy, please tell me.

@jnothman

Member

jnothman commented Mar 18, 2018

If I were you, I'd run pytest sklearn/metrics/cluster and let Travis do the rest.

@logc

Contributor

logc commented Mar 18, 2018

@jnothman added davies_bouldin_index to test_common.

In order to pass test_format_invariance, I had to change the dtype of the centroids variable -- from inheriting the dtype of X to hardcoded float.

This was one of the open questions in the previous PR, and now I know the answer: without that assumption, the resulting index is not the same for an array of ints as for the same array cast to float.
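
A minimal sketch of the pitfall (illustrative, not the PR's actual code): assigning float cluster means into a centroids array that inherits an integer dtype from X silently truncates them.

import numpy as np

X_int = np.array([[0, 0], [1, 1]])

centroids = np.zeros((1, 2), dtype=X_int.dtype)  # dtype inherited from X (int)
centroids[0] = X_int.mean(axis=0)                # [0.5, 0.5] is truncated to [0, 0]
print(centroids)                                 # [[0 0]]

centroids = np.zeros((1, 2), dtype=float)        # hardcoded float
centroids[0] = X_int.mean(axis=0)
print(centroids)                                 # [[0.5 0.5]]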

@logc

Contributor

logc commented Apr 5, 2018

@jnothman can I have another review here? 😄

Also, @tguillemot had comments on the previously existing PR; maybe I could have a review here, too?

Thanks!

@jnothman

Member

jnothman commented Apr 9, 2018

@jnothman

LGTM, thanks

[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
2*np.sqrt(0.5)/3)

@jnothman

jnothman Apr 9, 2018

Member

Please include spaces around * and /

@logc

logc Apr 10, 2018

Contributor

Done

X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
(5./4)/3)

@jnothman

jnothman Apr 9, 2018

Member

Spaces around /, please

@logc

logc Apr 10, 2018

Contributor

Done

@jnothman jnothman changed the title from Add Davies-Bouldin index to [MRG+1] Add Davies-Bouldin index Apr 9, 2018

@logc

Contributor

logc commented Apr 11, 2018

My ping to @tguillemot was not very successful 😄; maybe @glemaitre could help here with a second opinion? Thanks to any and all reviewers.

@glemaitre

Contributor

glemaitre commented Apr 11, 2018

I put it on my list of reviews :) You should receive my review soon.

@glemaitre

Sorry for the delay.

A couple of comments, and this is missing a what's new entry.

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}
>>> from sklearn import datasets

@glemaitre

@logc

logc Apr 22, 2018

Contributor

Done!

.. topic:: References
* Davies, David L.; Bouldin, Donald W. (1979).

@glemaitre

glemaitre Apr 21, 2018

Contributor

You are missing an indent in the references to properly render them:

https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#id27

@logc

logc Apr 22, 2018

Contributor

Done! My RST skills are rusty, I only write Markdown these days ...

@@ -80,6 +81,7 @@
'confusion_matrix',
'consensus_score',
'coverage_error',
'davies_bouldin_index',

@glemaitre

glemaitre Apr 21, 2018

Contributor

@jnothman do we have a convention of ending names with score, given that this is bounded between 0 and 1?

rng = np.random.RandomState(seed=0)
# Assert message when there is only one label
assert_raise_message(ValueError, "Number of labels is",

@glemaitre

glemaitre Apr 21, 2018

Contributor

I think it is time to factorize this error test, which is the same across the different metrics.
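
One possible shape for such a factored check (a sketch; the helper name is illustrative, not the PR's actual code):

import numpy as np
import pytest

def check_raises_on_trivial_labels(metric_func):
    # a single cluster and one cluster per sample should both raise
    rng = np.random.RandomState(seed=0)
    X = rng.rand(10, 2)
    for labels in (np.zeros(10), np.arange(10)):
        with pytest.raises(ValueError, match="Number of labels is"):
            metric_func(X, labels)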

@logc

logc Apr 22, 2018

Contributor

Done!

"A Cluster Separation Measure". IEEE Transactions on
Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224-227`_
"""
X, labels = check_X_y(X, labels)

@glemaitre

glemaitre Apr 21, 2018

Contributor

I would factorize the part which is the same across all the different metrics.
I think that this is redundant and stands there as a kind of check/validation.

@logc

logc Apr 22, 2018

Contributor

This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.

rng.rand(10, 2), np.zeros(10))
# Assert message when all point are in different clusters
assert_raise_message(ValueError, "Number of labels is",

@glemaitre

glemaitre Apr 21, 2018

Contributor

Same as above

@logc

logc Apr 22, 2018

Contributor

Done!

rng.rand(10, 2), np.arange(10))
# Assert the value is 0. when all samples are equals
assert 0. == davies_bouldin_index(np.ones((10, 2)),

@glemaitre

glemaitre Apr 21, 2018

Contributor

you should write assert computed == expected; with floats, use pytest.approx:

assert davies_bouldin_index(...) == pytest.approx(0.0)

@logc

logc Apr 22, 2018

Contributor

Done!

[0] * 5 + [1] * 5)
# Assert the value is 0. when all the mean cluster are equal
assert 0. == davies_bouldin_index([[-1, -1], [1, 1]] * 10,

@glemaitre

glemaitre Apr 21, 2018

Contributor

Same as above

@logc

logc Apr 22, 2018

Contributor

Done!

X = ([[0, 0], [1, 1]] * 5 + [[3, 3], [4, 4]] * 5 +
[[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),

@glemaitre

glemaitre Apr 21, 2018

Contributor

Use pytest.approx

@logc

logc Apr 22, 2018

Contributor

Done! Also, refactored other usages of assert_almost_equal to pytest.approx in this module, for consistency.
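
For example, the intended pattern on the assertion from the excerpt above:

# before
assert_almost_equal(davies_bouldin_index(X, labels), 2 * np.sqrt(0.5) / 3)
# after
assert davies_bouldin_index(X, labels) == pytest.approx(2 * np.sqrt(0.5) / 3)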

# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),

@glemaitre

glemaitre Apr 21, 2018

Contributor

pytest.approx

@logc

logc Apr 22, 2018

Contributor

Done!

@lzfelix

lzfelix commented Apr 22, 2018

Thanks for your PR, but could you please provide supporting evidence that the Davies-Bouldin index is bounded on the interval [0, 1]? The original paper by Davies and Bouldin [1] only explicitly states that this measure is greater than or equal to zero. On the other hand, the MATLAB documentation shows a scenario where the index is larger than one under K-means clustering:

NumObservations: 600
InspectedK: [1 2 3 4 5 6]
CriterionValues: [NaN 0.4663 0.4454 0.8316 1.0444 0.9236]
OptimalK: 3

Reference: [1] Davies, David L., and Donald W. Bouldin. "A cluster separation measure." IEEE transactions on pattern analysis and machine intelligence 2 (1979): 224-227.

@glemaitre

Contributor

glemaitre commented Apr 22, 2018

@lzfelix Reading the references, it seems that there is nothing mentioning an upper bound equal to 1.
I am not sure exactly how to construct a formal proof.

My intuition for finding an upper bound of 1 would be the configuration below:
[figure: sketch of a circular cluster configuration]

But honestly, this is just a drawing, and I would appreciate it if somebody could provide a formal proof either way.

@jnothman @GaelVaroquaux @lesteve @agramfort

@lzfelix

lzfelix commented Apr 22, 2018

@glemaitre it seems that DBI > 1 if there's some degree of overlap between two clusters, at least in the bidimensional case. It's harder to figure out what happens in higher-dimensional spaces without a better analysis. Still, it is possible to obtain DBI ~ 1.19 using the code below:

import matplotlib.pyplot as plt
import sklearn.cluster
import sklearn.datasets
from sklearn.metrics import davies_bouldin_index  # the function added by this PR

# one true blob, deliberately over-partitioned into two clusters
X, y = sklearn.datasets.make_blobs(n_samples=400, centers=1, center_box=(0, 1))
y_hat = sklearn.cluster.KMeans(n_clusters=2).fit_predict(X)

plt.scatter(*X.T, c=y_hat, alpha=0.4)
print('Davies-Bouldin index: {:4.4}'.format(davies_bouldin_index(X, y_hat)))

which produces:
[figure: scatter plot of the single blob split into two K-means clusters]

@glemaitre

Contributor

glemaitre commented Apr 22, 2018

@lzfelix

lzfelix commented Apr 22, 2018

@glemaitre you are right. Maybe an upper bound would help in deciding whether the generated clusters are not well defined, in the opposite sense that DBI ~ 0 corresponds to a good partitioning of the data? Still, it's just speculation...

Anyway, I just wanted to contribute to the docs of this PR to avoid later confusion :)

@glemaitre

Contributor

glemaitre commented Apr 22, 2018

@logc

Contributor

logc commented Apr 22, 2018

@lzfelix Thanks for pointing out that inconsistency! I am having a close look at the original reference to see where this "upper bound is 1" idea is rooted ... I am not sure whether it is the documentation or the implementation that is wrong.

By the way, if I understand make_blobs correctly, in your example you are forcing the data to be scattered around a single center and then fitting a KMeans model with n=2. One could argue that this is a fundamental shortcoming of the algorithm -- it always tries to partition the dataset into as many labels as given.

In fact, DBI is "warning" (via the RuntimeWarning) that the clustering is artificial because it results in centroid distances that are 0 ... ?

@logc

Contributor

logc commented Apr 22, 2018

@glemaitre What is the correct way to create a "What's new" entry for this change?

@lzfelix

lzfelix commented Apr 22, 2018

@logc the argument that DBI is warning that the clusters are malformed is rooted in the assumption that DBI should be smaller than one. On the other hand, so far we have neither proven this bound nor are we sure it has a meaning, so these conclusions are not built on solid assumptions.

Your observations about make_blobs are correct. I created the example this way because it was easy to understand and reproduce. On the other hand, I was able to reproduce the same behaviour on a dataset with high-dimensional data: while DBI > 1, the V-measure is larger than 0.5. Still, since the data is high-dimensional (and dense), it's hard to tell what's going on there. 😞

@glemaitre

Contributor

glemaitre commented Apr 22, 2018

@logc

Contributor

logc commented Apr 22, 2018

I removed the reference to bounds in the DBI value. Now the doc only says that

A lower index value is indicative of a better clustering partition

@lzfelix

lzfelix commented Apr 22, 2018

@logc great! Maybe "index values closer to zero indicate a better clustering partition"? This way you make sure to convey that there is only a lower bound.

@jnothman

Member

jnothman commented Apr 22, 2018

@lzfelix

lzfelix commented Apr 22, 2018

Actually, the smaller the better.

@jnothman

Member

jnothman commented Apr 23, 2018

@logc

Contributor

logc commented Apr 23, 2018

I went for a mix of both formulations 😄

Zero is the lowest possible score. Values closer to zero indicate a better partition.

@logc

Contributor

logc commented May 14, 2018

@glemaitre I am sure this PR is on your list, but let me ask: is there something else that needs fixing here? thanks! 😄

@glemaitre

Contributor

glemaitre commented May 14, 2018

Mainly, I would rename index to score to be consistent with the other metrics. Other indices follow this terminology:
https://22132-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#calinski-harabaz-index

@jnothman Do you think it is right to do so?

@jnothman

Member

jnothman commented May 14, 2018

@logc

Contributor

logc commented May 14, 2018

@jnothman @glemaitre renamed to score in the implementation. Following the example of the linked Calinski-Harabaz index, the doc still refers to it as an "index".
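
Concretely, the public function is now the score spelling (a sketch; the expected value matches the (5. / 4) / 3 test case quoted below):

>>> from sklearn.metrics import davies_bouldin_score  # renamed from davies_bouldin_index
>>> davies_bouldin_score([[0, 0], [2, 2], [3, 3], [5, 5]], [0, 0, 1, 2])  # doctest: +SKIP
0.4166...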

logc added some commits Mar 17, 2018

Add Davies-Bouldin index
Add another unsupervised quality metric for clustering results, the
Davies-Bouldin Index.
Code review changes
- Add Davies-Bouldin to clustering test_common
- Fix flake8 issue
- Use bare `assert` in test
- Fix typo
Second code review changes
- Fix incorrectly parsed documentation block
- Fix references indentation
- Refactor test assertions
@logc

Contributor

logc commented May 14, 2018

Rebased on top of the current master

@logc

Contributor

logc commented May 14, 2018

The failing check does not seem related to these changes, but I don't know how to deal with the error:

Complete output from command python setup.py egg_info:
This backport is meant only for Python 2.
It does not work on Python 3, and Python 3 users do not need it as the concurrent.futures package is available in the standard library.
For projects that work on both Python 2 and 3, the dependency needs to be conditional on the Python version, like so:
extras_require={':python_version == "2.7"': ['futures']}

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\appveyor\AppData\Local\Temp\1\pip-install-dzi69ulw\futures\

futures is a transitive dependency pulled in from wheelhouse_uploader. Any ideas?

# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
pytest.approx(davies_bouldin_index(X, labels), (5. / 4) / 3)

@glemaitre

glemaitre May 16, 2018

Contributor
  1. -> 5
.. math::
R_{ij} = \frac{s_i + s_j}{d_{ij}}
Then the DB index is defined as:

@glemaitre

glemaitre May 16, 2018

Contributor

We should probably define DB -> Davies-Bouldin (DB) index

@logc

logc May 16, 2018

Contributor

Changed to "Davies-Bouldin"

Zero is the lowest possible score. Values closer to zero indicate a better
partition.
In normal usage, the Davies-Bouldin index is applied to the results of a

@glemaitre

glemaitre May 16, 2018

Contributor

DB index

@logc

logc May 16, 2018

Contributor

I am sorry, I do not understand what is requested here (?)

@logc

logc May 16, 2018

Contributor

I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data

@glemaitre

glemaitre May 16, 2018

Contributor

remove this line

~~~~~~~~~~
- The computation of Davies-Bouldin is simpler than that of Silhouette scores.

@glemaitre

glemaitre May 16, 2018

Contributor

remove this line

DBSCAN.
- The usage of centroid distance limits the distance metric to Euclidean space.

@glemaitre

glemaitre May 16, 2018

Contributor

remove this line

@glemaitre

Contributor

glemaitre commented May 16, 2018

Weird failure. It should not be related.

@glemaitre

Contributor

glemaitre commented May 16, 2018

Also, please add a what's new entry.
Once that is done, I think we are good to merge!!!

@logc

Contributor

logc commented May 16, 2018

About the what's new entry, I added one in commit cd52612, for release 0.20 ...

@glemaitre

Contributor

glemaitre commented May 16, 2018

About the what's new entry, I added one in commit cd52612, for release 0.20 ...

Sorry I missed the file.

@glemaitre

Contributor

glemaitre commented May 16, 2018

I'll merge when it is (almost) green

@logc

Contributor

logc commented May 18, 2018

@glemaitre the checks passed this time 😄 can we merge then?

@glemaitre glemaitre merged commit 680c36b into scikit-learn:master May 18, 2018

8 checks passed

ci/circleci: deploy Your tests passed on CircleCI!
ci/circleci: python2 Your tests passed on CircleCI!
ci/circleci: python3 Your tests passed on CircleCI!
codecov/patch 100% of diff hit (target 95.12%)
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +4.87% compared to 4905934
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
lgtm analysis: Python No alert changes
@glemaitre

Contributor

glemaitre commented May 18, 2018

Thanks for the reminder and the PR

@logc logc deleted the logc:davies-bouldin-index branch May 18, 2018
