# [MRG+1] Added support for multiclass Matthews correlation coefficient #8094

Merged
merged 37 commits into from Jun 19, 2017

## Conversation

Contributor

### Erotemic commented Dec 21, 2016 • edited by jnothman

#### What does this implement/fix? Explain your changes.

This extends the current matthews_corrcoef to handle the multiclass case.

Also fixes #7929 and #8354

The extension is defined here: http://www.sciencedirect.com/science/article/pii/S1476927104000799
(pdf is behind a paywall, but the author has a website with details here http://rk.kvl.dk/introduction/index.html )

and my implementation follows equation (2) in this paper:
http://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0041882&type=printable

The new implementation can handle both the binary and multiclass cases. I've left in the original binary-case implementation for now (it is a bit faster and clearer about what is going on).

I've added new tests that inspect properties of the multiclass case as well as ensure that the multiclass case reduces to the binary case.

There's not much else to say. This is a pretty straightforward change.
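For readers who want to see the extension concretely, here is a minimal self-contained numpy sketch of the multiclass statistic (the R_K coefficient from the paper linked above); the helper name `multiclass_mcc` is mine and is not part of this PR:

```python
import numpy as np

def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC computed from a confusion matrix."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    idx = {c: i for i, c in enumerate(classes)}
    # Build the confusion matrix by hand to keep the sketch dependency-free.
    C = np.zeros((len(classes), len(classes)), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        C[idx[t], idx[p]] += 1
    t_sum = C.sum(axis=1)   # per-class counts of true labels
    p_sum = C.sum(axis=0)   # per-class counts of predicted labels
    n = C.sum()
    cov_ytyp = np.trace(C) * n - t_sum @ p_sum
    cov_ytyt = n ** 2 - t_sum @ t_sum
    cov_ypyp = n ** 2 - p_sum @ p_sum
    return cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)

print(multiclass_mcc([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2]))  # perfect: 1.0
```

With two classes this reduces to the familiar binary MCC.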

### Erotemic added some commits Dec 21, 2016

 Added support for multiclass MCC 
 db088c4 
 Cleaned up implementation and referenced original paper 
 2b69e58 
 Accidently added temporary work 
 0d61d44 
 fixed pep8 errors 
 405048c 
 hopefully last pep8 error 
 15f0fe3 
 Unified multiclass and binary MCC cases 
 0d08a88 
Contributor

### Erotemic commented Dec 22, 2016 • edited

I made some updates to this code to both simplify it and unify the binary and multiclass cases. For my own reference, and to describe my process here, these are the iterations I went through. The original binary case computed the MCC as such:

```python
mean_yt = np.average(y_true, weights=sample_weight)
mean_yp = np.average(y_pred, weights=sample_weight)

y_true_u_cent = y_true - mean_yt
y_pred_u_cent = y_pred - mean_yp

cov_ytyp = np.average(y_true_u_cent * y_pred_u_cent, weights=sample_weight)
var_yt = np.average(y_true_u_cent ** 2, weights=sample_weight)
var_yp = np.average(y_pred_u_cent ** 2, weights=sample_weight)

mcc = cov_ytyp / np.sqrt(var_yt * var_yp)
```

My first pass at computing the multiclass case looked like this and directly followed the paper:

```python
C = confusion_matrix(y_pred, y_true, sample_weight=sample_weight)
N = len(C)
cov_ytyp = sum([
    C[k, k] * C[m, l] - C[l, k] * C[k, m]
    for k in range(N) for m in range(N) for l in range(N)
])
cov_ytyt = sum([
    C[:, k].sum() *
    np.sum([C[g, f] for f in range(N) for g in range(N) if f != k])
    for k in range(N)
])
cov_ypyp = np.sum([
    C[k, :].sum() *
    np.sum([C[f, g] for f in range(N) for g in range(N) if f != k])
    for k in range(N)
])
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```

I was able to improve on this a bit using numpy shortcuts:

```python
C = confusion_matrix(y_pred, y_true, sample_weight=sample_weight)
N = len(C)
cov_ytyp = ((np.diag(C)[:, np.newaxis, np.newaxis] * C).sum() -
            (C[np.newaxis, :, :] * C[:, :, np.newaxis]).sum())
cov_ytyt = np.sum([
    (C[:, k].sum() * (C[:, :k].sum() + C[:, k + 1:].sum()))
    for k in range(N)
])
cov_ypyp = np.sum([
    (C[k, :].sum() * (C[:k, :].sum() + C[k + 1:, :].sum()))
    for k in range(N)
])
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```

My latest iteration significantly simplifies the code and increases its interpretability. It runs the fastest of the multiclass options I've written and is only marginally slower for the binary case (231.1230 µs vs 285.2201 µs on a set of binary labels of length 200):

```python
class_covariances = (
    np.cov(y_pred == k, y_true == k, bias=True, fweights=sample_weight)
    for k in range(len(lb.classes_))
)
covariance = np.sum(class_covariances, axis=0)
cov_ypyp, cov_ytyp, _, cov_ytyt = covariance.ravel()
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```

Moving to this simplified code exposed a small bug in the original tests. I had to remove the following lines:

```python
y_true_inv2 = label_binarize(y_true, ["a", "b"]) * -1
assert_almost_equal(matthews_corrcoef(y_true, y_true_inv2), -1)
```

because the only reason they were passing was bug #8098.
 removed double ravel 
 613d661 

### Erotemic referenced this pull request Dec 22, 2016

Open

#### Bug in metrics.classification._check_targets? #8098

 Added whats new for multiclass MCC 
 806448b 

### jnothman reviewed Dec 27, 2016

Description of this extension is due in doc/modules/model_evaluation.rst.

I also wonder why you prefer the calculation based on cov over using confusion_matrix, which I suspect would be more readable given the discrete application.

Contributor

### Erotemic commented Dec 27, 2016

The `np.cov` implementation more closely resembles the calculation used in the original paper; the fancy indexing in the list comprehensions looks a bit more confusing to me. However, it seems older versions of numpy don't support the `fweights` keyword argument, so regardless, I'll have to switch back to the confusion matrix implementation. I'll make that change and update doc/modules/model_evaluation.rst.
Member

### jnothman commented Dec 27, 2016

We could easily include cov in utils.fixes if fweights support is the only issue. Could you show me what it looks like with confusion matrix in a comment?
Contributor

### Erotemic commented Dec 27, 2016 • edited

My second comment shows my iterations on approaching the problem; the third block of code there is my best confusion_matrix implementation. However, I'll repost the relevant code blocks here to avoid confusion. My confusion_matrix implementation is:

```python
C = confusion_matrix(y_pred, y_true, sample_weight=sample_weight)
N = len(C)
cov_ytyp = ((np.diag(C)[:, np.newaxis, np.newaxis] * C).sum() -
            (C[np.newaxis, :, :] * C[:, :, np.newaxis]).sum())
cov_ytyt = np.sum([
    (C[:, k].sum() * (C[:, :k].sum() + C[:, k + 1:].sum()))
    for k in range(N)
])
cov_ypyp = np.sum([
    (C[k, :].sum() * (C[:k, :].sum() + C[k + 1:, :].sum()))
    for k in range(N)
])
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```

whereas the `np.cov` implementation looks like this:

```python
class_covariances = (
    np.cov(y_pred == k, y_true == k, bias=True, fweights=sample_weight)
    for k in range(len(lb.classes_))
)
covariance = np.sum(class_covariances, axis=0)
cov_ypyp, cov_ytyp, _, cov_ytyt = covariance.ravel()
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```

Perhaps there is a way to make the confusion_matrix implementation more concise that I've been unable to think of?
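Since the thread is weighing the two formulations against each other, here is a quick self-contained check (mine, not from the PR) that they agree: the summed per-class `np.cov` matrix equals the confusion-matrix covariances up to an n² factor that cancels in the final ratio:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 2, 2, 2])
n = len(y_true)
classes = np.unique(np.concatenate([y_true, y_pred]))

# np.cov formulation: sum of per-class covariance matrices of indicators.
# (Plain sum() over the generator; np.sum(..., axis=0) would not reduce it.)
cov = sum(np.cov(y_pred == k, y_true == k, bias=True) for k in classes)
cov_ypyp, cov_ytyp, _, cov_ytyt = cov.ravel()
mcc_cov = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)

# Confusion-matrix formulation, built by hand to stay self-contained.
C = np.zeros((len(classes), len(classes)), dtype=np.int64)
for t, p in zip(y_true, y_pred):
    C[t, p] += 1
t_sum, p_sum = C.sum(axis=1), C.sum(axis=0)
mcc_cm = (np.trace(C) * n - t_sum @ p_sum) / np.sqrt(
    (n ** 2 - t_sum @ t_sum) * (n ** 2 - p_sum @ p_sum))

assert np.isclose(mcc_cov, mcc_cm)
```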
Member

### jnothman commented Dec 28, 2016

Sorry, I'd failed to read your comment above. Isn't `C[:k, :].sum() + C[k + 1:, :].sum()` the same as `C.sum() - C[k].sum()`? Or am I misreading? If so, I get:

```python
s = C.sum(axis=1)
cov_ypyp = s.sum() ** 2 - np.dot(s, s)
```

I must be doing something wrong.
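The identity is easy to confirm numerically; a tiny check of my own:

```python
import numpy as np

rng = np.random.RandomState(0)
C = rng.randint(0, 10, size=(4, 4))
n = C.sum()

# C[:k, :].sum() + C[k + 1:, :].sum() is "everything except row k"
for k in range(len(C)):
    assert C[:k, :].sum() + C[k + 1:, :].sum() == n - C[k].sum()

# so sum_k row_k * (n - row_k) collapses to s.sum() ** 2 - dot(s, s)
s = C.sum(axis=1)
lhs = sum(C[k].sum() * (n - C[k].sum()) for k in range(len(C)))
assert lhs == s.sum() ** 2 - np.dot(s, s)
```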
Contributor

### Erotemic commented Dec 28, 2016 • edited

I think you are correct. Following your observation, I was also able to see a similar pattern in computing cov_ytyp. I've been able to greatly simplify the above code, removing all need for list comprehensions and np.newaxis. The new code also runs about 7x faster.

```python
C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
t_sum = C.sum(axis=1)
p_sum = C.sum(axis=0)
n_correct = np.diag(C).sum()
n_samples = p_sum.sum()
cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
cov_ypyp = n_samples ** 2 - np.dot(p_sum, p_sum)
cov_ytyt = n_samples ** 2 - np.dot(t_sum, t_sum)
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
```
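As a sanity check of my own (not part of the PR): in the binary case this confusion-matrix computation reduces to the Pearson correlation of the label vectors, which is exactly how the original binary implementation defined the MCC:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# hand-rolled 2x2 confusion matrix keeps the snippet self-contained
C = np.zeros((2, 2), dtype=np.int64)
for t, p in zip(y_true, y_pred):
    C[t, p] += 1

t_sum = C.sum(axis=1)
p_sum = C.sum(axis=0)
n_correct = np.trace(C)
n_samples = p_sum.sum()
cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
cov_ypyp = n_samples ** 2 - np.dot(p_sum, p_sum)
cov_ytyt = n_samples ** 2 - np.dot(t_sum, t_sum)
mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)

pearson = np.corrcoef(y_true, y_pred)[0, 1]
assert np.isclose(mcc, pearson)
```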

### Erotemic added some commits Dec 28, 2016

 Improved MCC calculation. Added test to ensure consistency with paper… 
… version. Edited relevant documentation.
 1c742cf 
 Undid changes to quoted wiki text 
 7b12781 
 removed mcc from undefined multiclass in test_common 
 ea985d4 
Member

### jnothman commented Dec 29, 2016

7x faster than using `np.cov`? Or than the list comprehensions?

### jnothman reviewed Dec 29, 2016

I'll try to review test and correctness soon.

```diff
+    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    t_sum = C.sum(axis=1)
+    p_sum = C.sum(axis=0)
+    n_correct = np.diag(C).sum()
```

#### jnothman Dec 29, 2016

Member

Can use `np.trace(C)`.

Contributor

duh @me, fixed.

### jnothman reviewed Dec 29, 2016

otherwise, LGTM!!

```diff
+    # These two weighted vectors have 0 correlation and hence mcc should be 0
+    y_1 = [0, 1, 2, 0, 1, 2, 0, 1, 2]
+    y_2 = [1, 1, 1, 2, 2, 2, 0, 0, 0]
+    np.cov(y_1, y_2)
```

Member

?

#### Erotemic Dec 29, 2016

Contributor

A mistake in the comment, and a leftover np.cov from testing. Fixing.

```diff
+    }{\sqrt{
+        (s^2 - \sum_{k}^{K} p_k^2) \times
+        (s^2 - \sum_{k}^{K} t_k^2)
+    }}
```

#### jnothman Dec 29, 2016

Member

You should probably note that this no longer ranges from -1 to 1...?

#### Erotemic Dec 29, 2016

Contributor

Technically it does still range from -1 to +1, because the multiclass case encompasses the binary case. However, when there are more than two labels it will not be possible to achieve -1. I'll note that:

When there are more than two labels, the value of the MCC will no longer range
between -1 and +1. Instead the minimum value will be somewhere between -1 and 0
depending on the number and distribution of ground truth labels. The maximum
value is always +1.
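To make the asymmetric range concrete, a small sketch of mine that hand-rolls the confusion-matrix formula from this PR:

```python
import numpy as np

def mcc(y_true, y_pred, n_classes):
    # confusion-matrix MCC, self-contained for the example
    C = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    t_sum, p_sum = C.sum(axis=1), C.sum(axis=0)
    n = C.sum()
    return (np.trace(C) * n - t_sum @ p_sum) / np.sqrt(
        (n ** 2 - t_sum @ t_sum) * (n ** 2 - p_sum @ p_sum))

y_true = [0, 0, 1, 1, 2, 2]
print(mcc(y_true, y_true, 3))              # perfect prediction: 1.0
print(mcc(y_true, [2, 2, 0, 0, 1, 1], 3))  # every prediction wrong: -0.5, not -1
```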

```diff
   .. math:: MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.
+In the multiclass case, the Matthews correlation coefficient can be defined
+in terms of a
+:ref:`sphx_glr_auto_examples_model_selection_plot_confusion_matrix.py`
```

#### jnothman Dec 29, 2016

Member

I think

``:func:`confusion_matrix` ``


would be more apt here.

Contributor

fixing

Contributor

### Erotemic commented Dec 29, 2016 • edited

@jnothman It was 7x faster than the list comprehensions. That benchmark does not include the time it takes to compute the confusion matrix, which is the bottleneck of the function.

On a separate note, I'm noticing on AppVeyor that a test for MCC is failing, and I'm not sure why, as it seems to pass on my machine as well as on the other CI machines. The failing test, `test_common.py.test_sample_weight_invariance:check_sample_weight_invariance(matthews_corrcoef_score)`, is a yield test, which is something that I had problems with in #7654. I'm not sure if it is causing issues here. I've manually tried scaling the sample weights on some dummy data and it always seems consistent when I do it.

```
[00:07:30] ======================================================================
[00:07:30] FAIL: C:\Python27-x64\lib\site-packages\sklearn\metrics\tests\test_common.py.test_sample_weight_invariance:check_sample_weight_invariance(matthews_corrcoef_score)
[00:07:30] ----------------------------------------------------------------------
[00:07:30] Traceback (most recent call last):
[00:07:30]   File "C:\Python27-x64\lib\site-packages\nose\case.py", line 197, in runTest
[00:07:30]     self.test(*self.arg)
[00:07:30]   File "C:\Python27-x64\lib\site-packages\sklearn\utils\testing.py", line 741, in __call__
[00:07:30]     return self.check(*args, **kwargs)
[00:07:30]   File "C:\Python27-x64\lib\site-packages\sklearn\utils\testing.py", line 292, in wrapper
[00:07:30]     return fn(*args, **kwargs)
[00:07:30]   File "C:\Python27-x64\lib\site-packages\sklearn\metrics\tests\test_common.py", line 1006, in check_sample_weight_invariance
[00:07:30]     "under scaling" % name)
[00:07:30]   File "C:\Python27-x64\lib\site-packages\numpy\testing\utils.py", line 490, in assert_almost_equal
[00:07:30]     raise AssertionError(_build_err_msg())
[00:07:30] AssertionError:
[00:07:30] Arrays are not almost equal to 7 decimals
[00:07:30] matthews_corrcoef_score sample_weight is not invariant under scaling
[00:07:30]  ACTUAL: 0.19988003199146895
[00:07:30]  DESIRED: 0.61482001028003908
```

The second failure has an identical traceback apart from the values:

```
[00:07:30] AssertionError:
[00:07:30] Arrays are not almost equal to 7 decimals
[00:07:30] matthews_corrcoef_score sample_weight is not invariant under scaling
[00:07:30]  ACTUAL: 0.0
[00:07:30]  DESIRED: -0.039763715905510061
```

When I run

```shell
nosetests "sklearn/metrics/tests/test_common.py:test_sample_weight_invariance" --verbose 2>&1 | grep matthew
```

it outputs

```
/home/joncrall/code/scikit-learn/sklearn/metrics/tests/test_common.py.test_sample_weight_invariance:check_sample_weight_invariance(matthews_corrcoef_score) ... ok
/home/joncrall/code/scikit-learn/sklearn/metrics/tests/test_common.py.test_sample_weight_invariance:check_sample_weight_invariance(matthews_corrcoef_score) ... ok
```

Am I running the test wrong? Is there anything about this test or AppVeyor that is known to be unstable?

Continuing to look into the AppVeyor failure, I'm just unable to reproduce the issue. I wrote the following standalone script with additional cases, and I don't see how the failure numbers could be getting generated. The function seems perfectly scale invariant.

```python
from sklearn.metrics import matthews_corrcoef
import numpy as np


def test_scaled(metric, y1, y2, rng):
    sample_weight = rng.randint(1, 10, size=len(y1))
    mcc_want = metric(y1, y2, sample_weight=sample_weight)
    print('mcc_want = %r' % (mcc_want,))
    for s in [.003, .03, .5, 2, 2.1, 10.9]:
        weight = sample_weight * s
        mcc = metric(y1, y2, sample_weight=weight)
        assert np.isclose(mcc, mcc_want)


rng = np.random.RandomState(0)
metric = matthews_corrcoef
for n_classes in range(1, 10):
    for n_samples in [1, 2, 5, 20, 50, 100, 1000]:
        y1 = rng.randint(0, n_classes, size=(n_samples,))
        y2 = rng.randint(0, n_classes, size=(n_samples,))
        print('n_classes, n_samples = %r, %r' % (n_classes, n_samples))
        test_scaled(metric, y1, y2, rng)
```

### Erotemic and others added some commits Dec 29, 2016

 Small changes 
 66f9ed9 
 Debugging appveyor: ensure sample weights are copied 
 96a53eb 
Member

### jnothman commented Dec 29, 2016

I hope you don't mind me hacking on your branch to try to debug this. I'm as lost as you are... except that I thought I'd check that this isn't an issue of sample_weight being modified somehow between checks (not that I see how it can be).
Member

### jnothman commented Dec 30, 2016

Still failing. It's hard to comprehend what might make this fail on Windows but work elsewhere, if not some kind of interaction across the generated tests. :\
Member

### jnothman commented Dec 30, 2016 • edited

I've considered replacing y1 with y1.copy(), etc., again to isolate assertions/metrics from one another, but I've not done it yet. Any bright ideas for debugging an AppVeyor failure, @ogrisel, @lesteve?
 Added more debug info for AppVeyor 
 90c752c 
Contributor

### Erotemic commented Dec 30, 2016

I don't mind anyone else hacking on the branch. I added my own debugging statements, and the behavior is extremely weird. My debug statements print out the state of the variables and some sanity-check measures. The extra checks look like this:

```python
# common factor
for scaling in [2, 0.3]:
    import textwrap
    sample_weight2 = sample_weight * scaling
    metric1_sanity = metric(y1, y2, sample_weight=sample_weight)
    metric2_sanity = metric(y1, y2, sample_weight=sample_weight2)
    err_msg2 = textwrap.dedent(
        """
        {name} sample_weight is not invariant under scaling.
        This is weird, so here is a longer form debug message
        metric1_sanity = {metric1_sanity}
        metric2_sanity = {metric2_sanity}
        weighted_score = {weighted_score}
        scaling = {scaling}
        y1 = {y1}
        y2 = {y2}
        sample_weight = {sample_weight}
        sample_weight2 = {sample_weight2}
        """
    ).format(
        name=repr(name),
        metric1_sanity=repr(metric1_sanity),
        metric2_sanity=repr(metric2_sanity),
        weighted_score=repr(weighted_score),
        scaling=repr(scaling),
        y1=repr(y1),
        y2=repr(y2),
        sample_weight=repr(sample_weight),
        sample_weight2=repr(sample_weight2),
    )
    assert_almost_equal(
        weighted_score,
        metric(y1, y2, sample_weight=sample_weight * scaling),
        err_msg=err_msg2,
    )
```

**test1 - AppVeyor windows**

The tests still fail only on Windows, and these are the values I get for the first test:

```
'matthews_corrcoef_score' sample_weight is not invariant under scaling.
This is weird, so here is a longer form debug message
metric1_sanity = 0.19988003199146895
metric2_sanity = 0.61482001028003908
weighted_score = 0.19988003199146895
scaling = 2
y1 = array([0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 1])
y2 = array([0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 0])
sample_weight = array([6, 1, 4, 4, 8, 4, 6, 3, 5, 8, 7, 9, 9, 2, 7, 8, 8, 9, 2, 6, 9, 5, 4,
       1, 4, 6, 1, 3, 4, 9, 2, 4, 4, 4, 8, 1, 2, 1, 5, 8, 4, 3, 8, 3, 1, 1,
       5, 6, 6, 7])
sample_weight2 = array([12, 2, 8, 8, 16, 8, 12, 6, 10, 16, 14, 18, 18, 4, 14, 16, 16,
       18, 4, 12, 18, 10, 8, 2, 8, 12, 2, 6, 8, 18, 4, 8, 8, 8,
       16, 2, 4, 2, 10, 16, 8, 6, 16, 6, 2, 2, 10, 12, 12, 14])
ACTUAL: 0.19988003199146895
DESIRED: 0.61482001028003908
```

**test2 - AppVeyor windows**

The second failing test has these values (same sample_weight and sample_weight2 as test1):

```
metric1_sanity = 0.0
metric2_sanity = -0.039763715905510061
weighted_score = 0.0
scaling = 2
y1 = array([4, 0, 3, 3, 3, 1, 3, 2, 4, 0, 0, 4, 2, 1, 0, 1, 1, 0, 1, 4, 3, 0, 3,
       0, 2, 3, 0, 1, 3, 3, 3, 0, 1, 1, 1, 0, 2, 4, 3, 3, 2, 4, 2, 0, 0, 4,
       0, 4, 1, 4])
y2 = array([1, 2, 2, 0, 1, 1, 1, 1, 3, 3, 2, 3, 0, 3, 4, 1, 2, 4, 3, 4, 4, 4, 3,
       4, 4, 4, 0, 4, 3, 2, 0, 1, 1, 3, 0, 0, 1, 2, 4, 2, 0, 3, 2, 2, 0, 1,
       0, 2, 2, 3])
ACTUAL: 0.0
DESIRED: -0.039763715905510061
```

So I'm putting these numbers into my Ubuntu machine, and I get these values.

**test1 - local ubuntu** (after pasting the `scaling`, `y1`, `y2` and `sample_weight` values from the test1 log above):

```python
In [18]: matthews_corrcoef(y1, y2, sample_weight)
Out[18]: 0.19988003199146895

In [19]: matthews_corrcoef(y1, y2, sample_weight * scaling)
Out[19]: 0.19988003199146895
```

**test2 - local ubuntu** (likewise with the test2 values):

```python
In [21]: matthews_corrcoef(y1, y2, sample_weight)
Out[21]: -0.0084553590158631987

In [22]: matthews_corrcoef(y1, y2, sample_weight * scaling)
Out[22]: -0.0084553590158631987
```

One thing that I notice is that on my machine test1 reproduces the metric1_sanity / ACTUAL value, but test2 reproduces neither value. I'm really stumped as to what could be causing this. I'll check copying y1, but I feel like anything at this point is a shot in the dark. (I'm really hoping for some very satisfying explanation to come out of all this.)

### Erotemic added some commits Dec 30, 2016

 Shooting in the dark 
 fe87c9a 
 syntax error 
 1595089 
 temporary reduce appveyor load 
 59d44ed 
 wip 
 03f19f1 
 wip 
 255a031 
 more sanity checks 
 afd9ac0 
 wip 
 8d52b91 
 wip 
 14a19ee 
 more debug 
 5d70f46 
 Merge branch 'master' into multiclass_mcc 
 455a78c 

Member

### jnothman commented Feb 14, 2017

 Merging this may also fix #8354

### jnothman referenced this pull request Feb 14, 2017

Closed

#### cohen_kappa_score overflows integers #8354

 Merge branch 'master' into multiclass_mcc 
 1f73d35 

# Codecov Report

Merging #8094 into master will increase coverage by 0.74%.
The diff coverage is 100%.

```diff
@@            Coverage Diff             @@
##           master    #8094      +/-   ##
==========================================
+ Coverage   94.75%   95.49%   +0.74%
==========================================
  Files         342      342
  Lines       60801    61062     +261
==========================================
+ Hits        57609    58309     +700
+ Misses       3192     2753     -439
```

| Impacted Files | Coverage Δ |
| --- | --- |
| sklearn/metrics/tests/test_common.py | 99.52% <ø> (+0.01%) |
| sklearn/metrics/classification.py | 98.07% <100%> (+0.29%) |
| sklearn/metrics/tests/test_classification.py | 100% <100%> (+0.42%) |
| sklearn/model_selection/_split.py | 98.6% <0%> (-0.17%) |
| sklearn/linear_model/tests/test_randomized_l1.py | 100% <0%> (ø) |
| sklearn/model_selection/__init__.py | 100% <0%> (ø) |
| sklearn/ensemble/forest.py | 98.16% <0%> (ø) |
| sklearn/tree/tree.py | 98.41% <0%> (ø) |
| sklearn/cluster/tests/test_dbscan.py | 100% <0%> (ø) |
| ... and 62 more | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


### GaelVaroquaux reviewed Mar 5, 2017

```diff
   .. math:: MCC = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}.
+In the multiclass case, the Matthews correlation coefficient can be defined
+in terms of a
```

#### GaelVaroquaux Mar 5, 2017

Member

Aren't we missing an underscore at the end of the markup for this link?

### GaelVaroquaux reviewed Mar 5, 2017

```diff
+    if sample_weight.dtype.kind in {'i', 'u', 'b'}:
+        dtype = np.int64
+    else:
+        dtype = np.float64
```

#### GaelVaroquaux Mar 5, 2017

Member

I don't understand the logic of upcasting everything to the maximum resolution.

Typically, I expect code to keep the same types as what I put in. If I put in float32, it is often a choice, to limit memory consumption.

#### Erotemic Mar 5, 2017

Contributor

This is because the confusion matrix accumulates values. It's common for accumulation functions to have a dtype that differs from the input dtype (see the documentation of np.sum). The default dtype depends on the platform, and on one of those platforms (Windows) tests were failing due to this behavior. Always choosing int64 maintains consistent cross-platform behavior.
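To illustrate the failure mode (my own example, not from the PR): accumulating 32-bit integers with a 32-bit result can silently wrap, while the explicit dtype choice in the diff keeps results stable across platforms:

```python
import numpy as np

def accumulation_dtype(sample_weight):
    # mirrors the dtype choice in the diff above
    if sample_weight.dtype.kind in {'i', 'u', 'b'}:
        return np.int64
    return np.float64

assert accumulation_dtype(np.array([1, 2], dtype=np.int32)) is np.int64
assert accumulation_dtype(np.array([1.0], dtype=np.float32)) is np.float64

big = np.array([2 ** 30, 2 ** 30, 2 ** 30], dtype=np.int32)
print(big.sum(dtype=np.int32))  # wraps around to a negative value
print(big.sum(dtype=np.int64))  # 3221225472
```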

### GaelVaroquaux reviewed Mar 5, 2017

```diff
+    # The minimum will be different for depending on the input
+    y_true = [0, 0, 1, 1, 2, 2]
+    y_pred_min = [1, 1, 0, 0, 0, 0]
+    assert_almost_equal(matthews_corrcoef(y_true, y_pred_min), -0.6123724)
```

#### GaelVaroquaux Mar 5, 2017

Member

Where does this value come from? I am not very comfortable with tests comparing against such a hard-coded value if it is not easily understandable why the value is the correct one.

#### Erotemic Mar 5, 2017

Contributor

This is simply the correct output for this specific multiclass instance. The reason this example takes a weird value rather than -1 is that technically some of the negative predictions are correct. With two classes you can construct an instance that is completely wrong, but with more than two classes, every time you say class 2 when it should have been class 1, you are still technically correct that it wasn't class 0, so you'll always get something right.

Perhaps `-12 / np.sqrt(24 * 16)` would be better? I'm not sure how to give a better intuition without redefining the function itself or using a lot of terms.

I actually found this particular example by doing a brute-force search over 6 examples with 3 labels to find the minimum value the MCC would take in this instance.
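That brute-force search is small enough to reproduce inline; a sketch of mine (the `mcc` helper hand-rolls the confusion-matrix formula, returning 0.0 when the denominator vanishes, purely so the search skips degenerate predictions):

```python
import numpy as np
from itertools import product

def mcc(y_true, y_pred, n_classes=3):
    C = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    t_sum, p_sum = C.sum(axis=1), C.sum(axis=0)
    n = C.sum()
    denom = (n ** 2 - t_sum @ t_sum) * (n ** 2 - p_sum @ p_sum)
    if denom == 0:
        return 0.0  # degenerate prediction; convention only, for the search
    return (np.trace(C) * n - t_sum @ p_sum) / np.sqrt(denom)

y_true = [0, 0, 1, 1, 2, 2]
# exhaustively search all 3**6 = 729 candidate predictions
worst = min(product(range(3), repeat=6), key=lambda y_pred: mcc(y_true, y_pred))
print(mcc(y_true, worst))  # -0.61237... == -12 / np.sqrt(24 * 16)
```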


### jnothman commented Mar 5, 2017

Regarding the upcast, we have found this necessary to avoid numerical stability issues in confusion_matrix. Please do tell if there is a better fix. (And good catch on the ReST link syntax: I hate it and forget to see that error frequently.)

> On 6 Mar 2017, "Gael Varoquaux" wrote: I made a few small comments. Overall, this looks good.
Member

### jnothman commented Mar 5, 2017

Yes, `-12 / np.sqrt(24 * 16)` would be better.

### lesteve reviewed Mar 6, 2017

```diff
@@ -366,8 +397,6 @@ def test_matthews_corrcoef():
     y_true_inv = ["b" if i == "a" else "a" for i in y_true]
     assert_almost_equal(matthews_corrcoef(y_true, y_true_inv), -1)
-    y_true_inv2 = label_binarize(y_true, ["a", "b"]) * -1
```

#### lesteve Mar 6, 2017

Member

I think #8377 should be merged before this PR, which would reduce its scope. I already had @jnothman's +1 maybe @GaelVaroquaux you can have a look?

### Erotemic and others added some commits Mar 6, 2017

 Changed based on reviews 
 95f4d3d 
 Merge branch 'multiclass_mcc' of github.com:Erotemic/scikit-learn int… 
…o multiclass_mcc
 552d8c2 
 Merge branch 'master' into multiclass_mcc 
 f45015c 
Member

### lesteve commented Apr 25, 2017

> I think #8377 should be merged before this PR, which would reduce its scope. I already had @jnothman's +1 maybe @GaelVaroquaux you can have a look?

#8377 has been merged. I fixed the conflicts via the web interface; let's see what the CIs have to say.
Member

### agramfort commented Jun 8, 2017

 @jnothman @lesteve all green here good to go?

### lesteve reviewed Jun 8, 2017

```diff
+
+    .. [4] Jurman, Riccadonna, Furlanello, (2012). A Comparison of MCC and CEN
+       Error Measures in MultiClass Prediction
+       _
```

Member

Contributor

fixed

Member

### lesteve commented Jun 8, 2017 • edited

 It would be nice to merge this one during the sprint. It looks like it is useful and has been sitting idle for a while.
Member

### agramfort commented Jun 8, 2017

 then merge and fix the link on master :)
 fixed error in link url 
 58df854 
Member

### lesteve commented Jun 8, 2017

> then merge and fix the link on master :)

For completeness, you can even push to people's branches now (or edit inline via the GitHub web interface for small things). So if you have the necessary rights you can do the fix yourself before merging.
Member

### lesteve commented Jun 8, 2017

 Also I am not familiar at all with the ML aspects of this PR.
Member

### jnothman commented Jun 8, 2017

What ML aspects, @lesteve?

Member

### jnothman commented Jun 19, 2017

 I think there is consensus to merge this. I'm taking Gael's "overall this looks good" to be a +1. Enough other eyes have looked at it.

### jnothman merged commit e339240 into scikit-learn:master Jun 19, 2017 2 of 3 checks passed

#### 2 of 3 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

### dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 1af0395 

### dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 eff0046 

### NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 3ff1768 

### paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 f2b0262 

### AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 50f4a69 

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 561bed5 

### jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

 [MRG+1] Added support for multiclass Matthews correlation coefficient (… 
…#8094)

Also ensure confusion matrix is accumulated with high precision.
 44582f9