# [MRG+1] Add Davies-Bouldin index #10827

Merged

Merged 7 commits into scikit-learn:master on May 18, 2018

## Conversation

4 participants
Contributor

### logc commented Mar 17, 2018 • edited by glemaitre

Add another unsupervised quality metric for clustering results, the Davies-Bouldin Index.

closes #7942

#### What does this implement/fix? Explain your changes.

This implements an unsupervised quality metric for clustering results, the Davies-Bouldin Index, based on an already existing PR that was stalled. The differences between this commit and the changes proposed there are minimal. In particular, the tests are copied verbatim, to ensure that this implementation does still conform to what was expected.

I noticed while working on a toy problem that there are not many alternatives in sklearn for unsupervised metrics of clustering quality. In particular, I missed Davies-Bouldin from working with other ML packages (Weka, IIRC).

Looking through sklearn, I found the above-mentioned PR, noticed that its author @tomron does not seem to be responding anymore, and decided to push for this change to get accepted by making a similar proposal. I fixed all remaining open comments.

If there is a better way to get the DB index into sklearn, please tell me. If there are other comments that can still be improved in this implementation, I will do my best to correct them, too.
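For context, here is a minimal usage sketch of the metric this PR adds, written against the name `davies_bouldin_score` that the function ends up with by the end of this review; the dataset and parameters below are illustrative, not from the PR:

```python
# Usage sketch (assumes scikit-learn >= 0.20, where this PR landed).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Three well-separated blobs: a good partition should score close to zero.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

score = davies_bouldin_score(X, labels)
print(score)  # small value: lower is better, zero is the minimum
```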

### jnothman reviewed Mar 18, 2018

Thanks for this. Also, flake8 is failing

```rst
----------

.. [1] Davies, David L.; Bouldin, Donald W. (1979). "A Cluster Separation
   Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence.
   PAMI-1 (2): 224-227_
```


#### logc Mar 18, 2018

Contributor

I'm sorry, I can see there are sections like this in other parts of the doc, but I don't know how to generate the example contents (?)

#### jnothman Mar 18, 2018

Member

It should just be a couple of lines showing how you would use this function in a simple case.

#### logc Mar 18, 2018

Contributor

Isn't the doctest example in lines 1625-1637 doing that?

#### jnothman Apr 9, 2018

Member

Perhaps. I believe it belongs more here, in the API documentation, than in the narrative user guide

```rst
cluster analysis.

  >>> from sklearn.cluster import KMeans
  >>> from sklearn.metrics import davis_bouldin_index
```

#### jnothman

Member

Typo here

#### logc

Contributor

Fixed

### jnothman reviewed Mar 18, 2018

```python
    rng.rand(10, 2), np.arange(10))
# Assert the value is 0. when all samples are equals
assert_equal(0., davies_bouldin_index(np.ones((10, 2)),
```

#### jnothman Mar 18, 2018

Member

These days we would prefer a bare assert, rather than assert_equal

Contributor

Fixed


Member

### jnothman commented Mar 18, 2018

Feel like wrapping up #8135 for us as well, so we don't need to add tests here?
Contributor

### glemaitre commented Mar 18, 2018

#10828 is taking over #8135 and could almost be merged if we don't consider adding the two tests which I mentioned in the summary.
Member

### jnothman commented Mar 18, 2018

 @logc, I don't think you've pushed a second commit.
Member

### jnothman commented Mar 18, 2018

 Please merge in the latest master, where we have just added common tests for clustering metrics. Please add this to the common tests in sklearn/metrics/cluster/tests/test_common.py
Contributor

### logc commented Mar 18, 2018

 @jnothman sorry, commented before pushing the second commit. The tests run really long, locally (!)
Member

### jnothman commented Mar 18, 2018

> The tests run really long, locally (!)

Do you mean the full test suite? The clustering metrics tests?
Contributor

### logc commented Mar 18, 2018

 @jnothman yes, the full test suite. Currently, I am running it with make from the project root, before creating a commit. If I only have changes to the test_unsupervised.py file, then I only run that. If there is a better strategy, please tell me.
Member

### jnothman commented Mar 18, 2018

 If I were you, I'd run pytest sklearn/metrics/cluster and let Travis do the rest.
Contributor

### logc commented Mar 18, 2018 • edited

@jnothman added davies_bouldin_index to test_common. In order to pass test_format_invariance, I had to change the dtype of the centroids variable -- from matching the dtype of X to hardcoded float. This was one of the open questions in the previous PR, and now I know the answer: if you do not make that assumption, the resulting index is not the same for an array of ints as for the same array cast to float.
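The dtype behaviour described above can be checked directly against the released function (shown here under its final name `davies_bouldin_score`; the example array is my own): with the centroids held as float internally, an integer input and its float cast produce the same score.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

X_int = np.array([[0, 0], [1, 1], [10, 10], [11, 11]])  # integer dtype
labels = [0, 0, 1, 1]

# Format invariance: the score must not depend on the input dtype.
score_int = davies_bouldin_score(X_int, labels)
score_float = davies_bouldin_score(X_int.astype(np.float64), labels)
assert score_int == score_float
```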
Contributor

### logc commented Apr 5, 2018

 @jnothman can I have another review here? 😄 Also, @tguillemot had comments on the previously existing PR; maybe I could have a review here, too? Thanks!
Member

### jnothman commented Apr 9, 2018

 is the difference between int and float X substantial?

### jnothman approved these changes Apr 9, 2018

LGTM, thanks


```python
    [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels), 2*np.sqrt(0.5)/3)
```

#### jnothman Apr 9, 2018

Member

Please include spaces around * and /

#### logc Apr 10, 2018

Contributor

Done

```python
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels), (5./4)/3)
```

#### logc

Contributor

Done

Contributor

### logc commented Apr 11, 2018

My ping to @tguillemot was not very successful 😄; maybe @glemaitre could help here with a second opinion? Thanks to any and all reviewers.
Contributor

### glemaitre commented Apr 11, 2018

I put it on my review list :) You should receive my review soon.


### glemaitre requested changes Apr 21, 2018

Sorry for the delay.

Couple of comments and this is missing a what's new entry.

```rst
.. math:: DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} R_{ij}

>>> from sklearn import datasets
```

Contributor

#### logc Apr 22, 2018

Contributor

Done!

```rst
.. topic:: References

 * Davies, David L.; Bouldin, Donald W. (1979).
```

#### glemaitre Apr 21, 2018

Contributor

You are missing an indent in the references to properly render them:

https://21311-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#id27

#### logc Apr 22, 2018

Contributor

Done! My RST skills are rusty, I only write Markdown these days ...

```diff
@@ -80,6 +81,7 @@
     'confusion_matrix',
     'consensus_score',
     'coverage_error',
+    'davies_bouldin_index',
```

#### glemaitre Apr 21, 2018

Contributor

@jnothman do we have a convention of finishing with `_score`, given that this is bounded between 0 and 1?

```python
rng = np.random.RandomState(seed=0)
# Assert message when there is only one label
assert_raise_message(ValueError, "Number of labels is",
```

#### glemaitre Apr 21, 2018

Contributor

I think it is time to factorize this error test, which is the same across the different metrics.

#### logc Apr 22, 2018

Contributor

Done!

```python
       "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis
       and Machine Intelligence. PAMI-1 (2): 224-227_
    """
    X, labels = check_X_y(X, labels)
```

#### glemaitre Apr 21, 2018

Contributor

I would factorize the part which is the same in all the different metrics.
I think this is redundant and stands there as a kind of check/validation.

#### logc Apr 22, 2018 • edited

Contributor

This seems a bit out of scope for this PR. Also, there are small differences in each metric that make the refactor non-trivial. I could take it up in a later PR if you do not mind.

```python
    rng.rand(10, 2), np.zeros(10))
# Assert message when all point are in different clusters
assert_raise_message(ValueError, "Number of labels is",
```

Contributor

Same as above

#### logc Apr 22, 2018

Contributor

Done!

```python
    rng.rand(10, 2), np.arange(10))
# Assert the value is 0. when all samples are equals
assert 0. == davies_bouldin_index(np.ones((10, 2)),
```

#### glemaitre Apr 21, 2018

Contributor

you should write `assert computed == expected`. With floats, use `pytest.approx`:

```python
assert davies_bouldin_index(...) == pytest.approx(0.0)
```
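The suggested testing pattern in general form (a sketch with made-up values, not the PR's actual test code):

```python
import pytest

# Exact comparison is fine for exactly representable values:
computed = 2 * 3
assert computed == 6

# For floating point, wrap the expected value in pytest.approx so the
# assertion tolerates rounding error:
computed = 0.1 + 0.2  # not exactly 0.3 in binary floating point
assert computed == pytest.approx(0.3)
```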

#### logc Apr 22, 2018

Contributor

Done!

```python
    [0] * 5 + [1] * 5)
# Assert the value is 0. when all the mean cluster are equal
assert 0. == davies_bouldin_index([[-1, -1], [1, 1]] * 10,
```

Contributor

Same as above

#### logc Apr 22, 2018

Contributor

Done!

```python
X = ([[0, 0], [1, 1]] * 5 + [[3, 3], [4, 4]] * 5 +
     [[0, 4], [1, 3]] * 5 + [[3, 1], [4, 0]] * 5)
labels = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
assert_almost_equal(davies_bouldin_index(X, labels),
```

#### glemaitre Apr 21, 2018

Contributor

Use pytest approx

#### logc Apr 22, 2018

Contributor

Done! Also, refactored other usages of assert_almost_equal to pytest.approx in this module, for consistency.

```python
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
assert_almost_equal(davies_bouldin_index(X, labels),
```

Contributor

pytest approx

Contributor

Done!

### lzfelix commented Apr 22, 2018 • edited

Thanks for your PR, but could you please provide supporting evidence that the Davies-Bouldin index is bounded on the interval [0, 1]? The original paper by Davies and Bouldin [1] only explicitly states that this measure is greater than or equal to zero. On the other hand, the MATLAB documentation shows a scenario where the index is larger than one under K-means clustering:

```
NumObservations: 600
     InspectedK: [1 2 3 4 5 6]
CriterionValues: [NaN 0.4663 0.4454 0.8316 1.0444 0.9236]
       OptimalK: 3
```

Reference: [1] Davies, David L., and Donald W. Bouldin. "A cluster separation measure." IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1979): 224-227.
Contributor

### glemaitre commented Apr 22, 2018

@lzfelix Reading the references, it seems that there is nothing mentioning an upper bound equal to 1. I am not sure exactly how to make a formal proof. My intuition for an upper bound of 1 would be the configuration below (figure not preserved in this transcript). But honestly, this is just a drawing and I would appreciate it if somebody could provide a formal proof, either way.

### lzfelix commented Apr 22, 2018 • edited

@glemaitre it seems that DBI > 1 if there's some degree of overlap between two clusters, at least for the bidimensional case. It's harder to figure out what happens in higher-dimensional spaces without a better analysis. Still, it is possible to obtain DBI ~ 1.19 using the code below:

```python
X, y = sklearn.datasets.make_blobs(n_samples=400, centers=1, center_box=(0, 1))
y_hat = sklearn.cluster.KMeans(n_clusters=2).fit_predict(X)

plt.scatter(*X.T, c=y_hat, alpha=0.4)
print('Davies-Bouldin index: {:4.4}'.format(davies_bouldin_index(X, y_hat)))
```

which produces an index of about 1.19 (output figure not preserved).
Contributor

### glemaitre commented Apr 22, 2018

Actually this is not really an overlap but more something like ellipse-like shapes. In this case, it will be higher than 1. However, I am not sure what the upper bound would formally be.

### lzfelix commented Apr 22, 2018

@glemaitre you are right. Maybe the upper bound helps in deciding whether the generated clusters are not well defined, in the opposite sense that DBI ~ 0 corresponds to a good partitioning of the data? Still, it's just speculation... Anyway, I just wanted to try to contribute to the docs of this PR to avoid later confusion :)
Contributor

### glemaitre commented Apr 22, 2018

Thanks, this is a useful review. We don't like to put out erroneous docs :)
Contributor

### logc commented Apr 22, 2018 • edited

@lzfelix Thanks for pointing out that inconsistency! I am having a close look at the original reference to see where this "upper bound is 1" idea is rooted ... I am not sure whether it is the documentation or the implementation that is wrong. By the way, if I understand make_blobs correctly, in your example you are forcing the data to be scattered around a single center, and then fitting a KMeans model with n=2. One could argue that this is a fundamental shortcoming of the algorithm -- it always tries to partition the dataset into as many labels as given. In fact, DBI is "warning" (via the RuntimeWarning) that the clustering is artificial because it results in centroid distances that are 0 ... ?
Contributor

### logc commented Apr 22, 2018 • edited

 @glemaitre What is the correct way to create a "What's new" entry for this change?

### lzfelix commented Apr 22, 2018

@logc the argument that DBI is warning that the clusters are malformed is rooted in the assumption that DBI should be smaller than one. On the other hand, so far we have neither proven this bound nor are we sure it has a meaning, so these conclusions are not built on solid assumptions. Your observations about make_blobs are correct. I created the example this way because it was easy to understand and reproduce. On the other hand, I was able to reproduce the same behaviour on a dataset with high-dimensional data: while DBI > 1, the V-measure is larger than 0.5. Still, since the data is high-dimensional (and dense), it's hard to tell what's going on there. 😞
Contributor

### glemaitre commented Apr 22, 2018

In the "new features" section of v0.20 in the doc/whatsnew folder. You can check the format there.
Contributor

### logc commented Apr 22, 2018

I removed the reference to bounds on the DBI value. Now the doc only says that "A lower index value is indicative of a better clustering partition".

### lzfelix commented Apr 22, 2018 • edited

 @logc great! Maybe "index values closer to zero indicate a better clustering partition"? This way you ensure to convey the message that there is a lower bound only.
Member

### jnothman commented Apr 22, 2018

Or just say: "Zero is the minimum score and a greater score is better."

### lzfelix commented Apr 22, 2018

 Actually, the smaller the better.
Member

### jnothman commented Apr 23, 2018

 of course. forgetting the context. just trying to simplify the message
Contributor

### logc commented Apr 23, 2018 • edited

 I went for a mix of both formulations 😄 Zero is the lowest possible score. Values closer to zero indicate a better partition.
Contributor

### logc commented May 14, 2018 • edited

 @glemaitre I am sure this PR is on your list, but let me ask: is there something else that needs fixing here? thanks! 😄
Contributor

### glemaitre commented May 14, 2018

Mainly, I would rename index to score to be similar to the other metrics. Other indices follow this terminology: https://22132-843222-gh.circle-artifacts.com/0/doc/modules/clustering.html#calinski-harabaz-index @jnothman Do you think it is right to do so?
Member

### jnothman commented May 14, 2018

Yes, it should be _score for consistency.
Contributor

### logc commented May 14, 2018

@jnothman @glemaitre renamed to score in the implementation. Following the linked Calinski-Harabaz example, the doc still refers to it as an "index".

### logc added some commits Mar 17, 2018

- **Add Davies-Bouldin index** (76facb5)
  Add another unsupervised quality metric for clustering results, the
  Davies-Bouldin Index.
- **Code review changes** (ed1325d)
  - Add Davies-Bouldin to clustering test_common
  - Fix flake8 issue
  - Use bare assert in test
  - Fix typo
- **Add spaces around operators** (a588458)
- **Second code review changes** (cd52612)
  - Fix incorrectly parsed documentation block
  - Fix references indentation
  - Refactor test assertions
- **Documentation changes** (975944e)
- **Rename index to score** (3dcf3cb)
Contributor

### logc commented May 14, 2018

 Rebased on top of the current master
Contributor

### logc commented May 14, 2018

The failing check does not seem related to these changes, but I don't know how to deal with the error:

```
Complete output from command python setup.py egg_info:
This backport is meant only for Python 2.
It does not work on Python 3, and Python 3 users do not need it as the
concurrent.futures package is available in the standard library.
For projects that work on both Python 2 and 3, the dependency needs to be
conditional on the Python version, like so:
extras_require={':python_version == "2.7"': ['futures']}

----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\appveyor\AppData\Local\Temp\1\pip-install-dzi69ulw\futures\
```

futures is a transitive dependency pulled in from wheelhouse_uploader. Any ideas?

### glemaitre reviewed May 16, 2018

```python
# General case - cluster have one sample
X = ([[0, 0], [2, 2], [3, 3], [5, 5]])
labels = [0, 0, 1, 2]
pytest.approx(davies_bouldin_index(X, labels), (5. / 4) / 3)
```

#### glemaitre May 16, 2018

Contributor
1. -> 5
```rst
.. math:: R_{ij} = \frac{s_i + s_j}{d_{ij}}

Then the DB index is defined as:
```

#### glemaitre May 16, 2018

Contributor

We should probably define the abbreviation first: DB -> Davies-Bouldin (DB) index

#### logc May 16, 2018

Contributor

Changed to "Davies-Bouldin"
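For reference, the definition discussed in this review (within-cluster dispersions s_i, centroid distances d_ij, similarities R_ij = (s_i + s_j) / d_ij, and the index as the mean of the per-cluster maxima) can be sketched directly in NumPy. This is an illustrative re-derivation assuming Euclidean distances, not the PR's actual implementation:

```python
import numpy as np

def davies_bouldin_sketch(X, labels):
    """Compute the Davies-Bouldin index straight from its definition."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    # Cluster centroids and within-cluster mean distances s_i.
    centroids = np.array([X[labels == k].mean(axis=0) for k in uniq])
    s = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                  for k, c in zip(uniq, centroids)])
    # DB = mean over i of max_{j != i} (s_i + s_j) / d_ij.
    worst = []
    for i in range(len(uniq)):
        r = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(len(uniq)) if j != i]
        worst.append(max(r))
    return float(np.mean(worst))

# Reproduces the "general case" value asserted in the PR's tests:
print(davies_bouldin_sketch([[0, 0], [2, 2], [3, 3], [5, 5]], [0, 0, 1, 2]))
# ≈ 0.4167, i.e. (5. / 4) / 3
```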

```rst
Zero is the lowest possible score. Values closer to zero indicate a better
partition.

In normal usage, the Davies-Bouldin index is applied to the results of a
```

#### glemaitre May 16, 2018

Contributor

DB index

#### logc May 16, 2018

Contributor

I am sorry, I do not understand what is requested here (?)

#### logc May 16, 2018

Contributor

I guess we can leave it explicitly as "Davies-Bouldin". "DB" might be confused with database, or DBSCAN.

```python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> X = iris.data
```

#### glemaitre May 16, 2018

Contributor

remove this line

```rst
~~~~~~~~~~

- The computation of Davies-Bouldin is simpler than that of Silhouette scores.
```

#### glemaitre May 16, 2018

Contributor

remove this line

```rst
  DBSCAN.

- The usage of centroid distance limits the distance metric to Euclidean space.
```

#### glemaitre May 16, 2018

Contributor

remove this line

Contributor

### glemaitre commented May 16, 2018

Weird failure. It should not be related.
Contributor

### glemaitre commented May 16, 2018

Also, please add a what's new entry. Once that's done, I think we are good to merge!!!
- **Documentation changes** (7848d87)
Contributor

### logc commented May 16, 2018

 About the what's new entry, I added one in commit cd52612 , for release 0.20 ...
Contributor

### glemaitre commented May 16, 2018

> About the what's new entry, I added one in commit cd52612, for release 0.20 ...

Sorry, I missed the file.
Contributor

### glemaitre commented May 16, 2018

I'll merge when it is (almost) green.
Contributor

### logc commented May 18, 2018

 @glemaitre the checks passed this time 😄 can we merge then?

### glemaitre merged commit 680c36b into scikit-learn:master May 18, 2018

#### 8 checks passed

- ci/circleci: deploy - Your tests passed on CircleCI!
- ci/circleci: python2 - Your tests passed on CircleCI!
- ci/circleci: python3 - Your tests passed on CircleCI!
- codecov/patch - 100% of diff hit (target 95.12%)
- codecov/project - Absolute coverage decreased by -<.01% but relative coverage increased by +4.87% compared to 4905934
- continuous-integration/appveyor/pr - AppVeyor build succeeded
- continuous-integration/travis-ci/pr - The Travis CI build passed
- lgtm analysis: Python - No alert changes
Contributor

### glemaitre commented May 18, 2018 • edited

Thanks for the reminder and the PR!