
[MRG+1] avoid integer overflow by using floats for matthews_corrcoef #9693

Merged
merged 6 commits into from Sep 11, 2017

Conversation

@sam-s (Contributor) commented Sep 5, 2017

Reference Issue

Fixes #9622

What does this implement/fix? Explain your changes.

avoid integer overflow by using floats

@jnothman (Member) left a comment

Could you please add a test?

@@ -527,7 +527,8 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_true = lb.transform(y_true)
     y_pred = lb.transform(y_pred)

-    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight
+                         ).astype(np.float64)  # bug#9622

@jnothman (Member) commented Sep 5, 2017

Better to use C = C.astype... than this awkward style

@jnothman (Member) left a comment

Otherwise LGTM

true_pos = conf_matrix[1,1]
false_pos = conf_matrix[1,0]
false_neg = conf_matrix[0,1]
n_points = float(conf_matrix.sum())

@jnothman (Member) commented Sep 6, 2017

Is this float only for the sake of Python 2 division? It's not needed; from __future__ import division is in effect.

@jnothman (Member) commented Sep 6, 2017

This can also be determined as len(y_true)

@sam-s (Author) commented Sep 6, 2017

done (the float was more along the lines of "code as documentation": added emphasis on why this is "correct").

return mcc_numerator / np.sqrt(mcc_denominator)

def random_ys(n_points):
x_true = np.random.sample(n_points)

@jnothman (Member) commented Sep 6, 2017

We usually want tests to be deterministic (so we don't get occasional test failures). So please use a np.random.RandomState with a fixed seed to generate data.
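A standalone sketch of what the review asks for: a dedicated np.random.RandomState with a fixed seed makes the "random" labels reproducible across runs. The seed value and the noise term here are made up, since the original generator is only partially shown.

```python
import numpy as np

def random_ys(n_points, seed=417):  # seed is an arbitrary fixed choice
    rng = np.random.RandomState(seed)
    x_true = rng.random_sample(n_points)
    # Hypothetical perturbation so y_pred imperfectly tracks y_true.
    x_pred = x_true + 0.2 * (rng.random_sample(n_points) - 0.5)
    y_true = (x_true > 0.5) * 1.0
    y_pred = (x_pred > 0.5) * 1.0
    return y_true, y_pred

# Same seed, same data: the test can never fail intermittently.
a_true, a_pred = random_ys(1000)
b_true, b_pred = random_ys(1000)
print(np.array_equal(a_true, b_true) and np.array_equal(a_pred, b_pred))  # True
```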

@sam-s (Author) commented Sep 6, 2017

done

@@ -528,6 +528,7 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)

     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = C.astype(np.float64)    # Bug#9622
     t_sum = C.sum(axis=1)
     p_sum = C.sum(axis=0)
     n_correct = np.trace(C)

@jnothman (Member) commented Sep 6, 2017

can we use n_samples = len(y_true) to avoid imprecision there?

@sam-s (Author) commented Sep 6, 2017

no, because of sample_weight
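A minimal sketch of the point (the arrays here are made up): with sample_weight, the confusion matrix accumulates weights, so its grand total is the sum of the weights, not the number of samples.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
weights = np.array([0.5, 2.0, 1.0, 1.0])

# The weighted confusion matrix totals 4.5, while len(y_true) == 4,
# so len(y_true) is not a valid substitute for C.sum().
C = confusion_matrix(y_true, y_pred, sample_weight=weights)
print(C.sum())      # 4.5
print(len(y_true))  # 4
```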

@jnothman (Member) commented Sep 6, 2017

good point. Thanks

y_pred = (x_pred > 0.5) * 1.0
return y_true, y_pred

for n_points in [100, 10000, 1000000]:

@jnothman (Member) commented Sep 6, 2017

Is there any reason to try with smaller numbers?

@sam-s (Author) commented Sep 6, 2017

The main reason is that the success at smaller numbers makes it clear that the failure with the big numbers indicates a problem with the stock function rather than with mcc_safe.
IOW, this is a "test of the test".
This is cheap, so I see no reason not to do it.

@@ -528,6 +528,7 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)

     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = C.astype(np.float64)    # Bug#9622

@jnothman (Member) commented Sep 6, 2017

the bug number is not needed here, as long as the test fails if this line is removed (which it does).

@sam-s (Author) commented Sep 6, 2017

mathematically the astype is unnecessary and, in fact, could be viewed as dangerous, because it increases the possibility of round-off errors.
the bug reference explains why we do need it (because python lacks transparent bignums).

@jnothman (Member) commented Sep 6, 2017

I think you mean numpy lacks transparent bignums. I'm pretty sure Python by default does them.

I don't mind this terribly. If we keep it, write "Issue #9622" rather than "Bug#9622".

FWIW, to pass the test I found it sufficient to just replace

t_sum = C.sum(axis=1, dtype=np.float64)
p_sum = C.sum(axis=0, dtype=np.float64)

but perhaps in other cases the trace is the problem??
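The dtype-based variant can be sketched in isolation (the confusion matrix here is made up): passing dtype=np.float64 to sum and trace makes the marginals floats, so every downstream product is computed in floating point rather than wrapping integer arithmetic.

```python
import numpy as np

# A made-up confusion matrix with large counts.
C = np.array([[10**6, 0], [0, 10**6]])

# float64 marginals keep the whole MCC computation in floating point.
t_sum = C.sum(axis=1, dtype=np.float64)
p_sum = C.sum(axis=0, dtype=np.float64)
n_correct = np.trace(C, dtype=np.float64)

print(t_sum.dtype, p_sum.dtype)  # float64 float64
```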

@lesteve (Member) commented Sep 7, 2017

Maybe a slightly less magical change would be to do:

diff --git a/sklearn/metrics/classification.py b/sklearn/metrics/classification.py
index 7c4697d..9c92be5 100644
--- a/sklearn/metrics/classification.py
+++ b/sklearn/metrics/classification.py
@@ -528,15 +528,15 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)
 
     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
-    C = C.astype(np.float64)    # Bug#9622
     t_sum = C.sum(axis=1)
     p_sum = C.sum(axis=0)
     n_correct = np.trace(C)
     n_samples = p_sum.sum()
-    cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
-    cov_ypyp = n_samples ** 2 - np.dot(p_sum, p_sum)
-    cov_ytyt = n_samples ** 2 - np.dot(t_sum, t_sum)
-    mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
+    n_samples_square = n_samples ** 2
+    corr_ytyp = n_correct / n_samples - np.dot(t_sum, p_sum) / n_samples_square
+    corr_ypyp = 1 - np.dot(p_sum, p_sum) / n_samples_square
+    corr_ytyt = 1 - np.dot(t_sum, t_sum) / n_samples_square
+    mcc = corr_ytyp / np.sqrt(corr_ytyt * corr_ypyp)
 
     if np.isnan(mcc):
         return 0.

Also, I think this may be better in terms of floating point precision: manipulating quantities that are of order 1 reduces the likelihood of overflowing.

Note the from __future__ import division ensures that int divisions work as in Python 3.
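The normalized computation from the diff can be tried standalone (the sample arrays are made up): each term is divided by n_samples or n_samples**2 before any product is formed, so intermediates stay of order 1, and the result agrees with sklearn's matthews_corrcoef.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

C = confusion_matrix(y_true, y_pred)
t_sum = C.sum(axis=1)
p_sum = C.sum(axis=0)
n_correct = np.trace(C)
n_samples = p_sum.sum()
n_samples_square = n_samples ** 2

# Normalize before multiplying: every corr_* term is of order 1.
corr_ytyp = n_correct / n_samples - np.dot(t_sum, p_sum) / n_samples_square
corr_ypyp = 1 - np.dot(p_sum, p_sum) / n_samples_square
corr_ytyt = 1 - np.dot(t_sum, t_sum) / n_samples_square
mcc = corr_ytyp / np.sqrt(corr_ytyt * corr_ypyp)

print(np.isclose(mcc, matthews_corrcoef(y_true, y_pred)))  # True
```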

@jnothman (Member) commented Sep 7, 2017

That seems a decent approach!

@jnothman jnothman changed the title Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef [MRG+1] avoid integer overflow by using floats for matthews_corrcoef Sep 6, 2017
@jnothman jnothman added this to the 0.19.1 milestone Sep 6, 2017
@jnothman (Member) commented Sep 6, 2017

flake8 errors:


./sklearn/metrics/tests/test_classification.py:491:33: E231 missing whitespace after ','
        true_pos = conf_matrix[1,1]
                                ^
./sklearn/metrics/tests/test_classification.py:492:34: E231 missing whitespace after ','
        false_pos = conf_matrix[1,0]
                                 ^
./sklearn/metrics/tests/test_classification.py:493:34: E231 missing whitespace after ','
        false_neg = conf_matrix[0,1]
@lesteve (Member) left a comment

Maybe the code can be made simpler and the test as well


for n_points in [100, 10000, 1000000]:
y_true, y_pred = random_ys(n_points)
assert_almost_equal(matthews_corrcoef(y_true, y_true), 1.0)

@lesteve (Member) commented Sep 7, 2017

Just checking this with a simple array is a perfectly fine regression test for the bug we are seeing. A simpler example showing the bug:

import numpy as np
from sklearn.metrics.classification import matthews_corrcoef


n = int(1e5)

arr = np.repeat([0., 1.], n)
assert matthews_corrcoef(arr, arr) == 1.

# multi-class test
arr = np.repeat([0., 1., 2.], n)
assert matthews_corrcoef(arr, arr) == 1.

For completeness the value of matthews_corrcoef(arr, arr) is 0 (two classes) and 35.32 (multi-class) on master.

@sam-s (Author) commented Sep 7, 2017

given the lack of transparent bignums in numpy, your magic is dangerous.
I don't mind the other change ("manipulating quantities that are of order 1"), but casting to float is critical.

@sam-s (Author) commented Sep 7, 2017

I will gladly add the simple test, but I see no reason to give up the big random one.

@sam-s (Author) commented Sep 7, 2017

what does "MRG+1" mean?

@NelleV (Member) commented Sep 7, 2017

@sam-s It means the pull request has been approved by one core contributor.

@jnothman (Member) commented Sep 7, 2017

@lesteve I don't entirely get why your solution works. Isn't the largest number n_samples_square, which remains an integer in your code?

@lesteve (Member) commented Sep 8, 2017

@lesteve I don't entirely get why your solution works. Isn't the largest number n_samples_square, which remains an integer in your code?

Integer division behaves as we want for Python3 (i.e. true division) and we have from __future__ import division so that it behaves as we want as well for Python 2. Maybe more related to what you wanted to hear, the overflow was because we were multiplying two covariances (order n^2 * n^2 super roughly). For n^2 to overflow you need 3 billion samples.
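The orders of magnitude in this explanation can be checked directly (a sketch, not scikit-learn code): int64 tops out near 9.2e18, so the product of two order-n**2 covariances wraps around already at n around 1e5, while n**2 alone only overflows near n of 3 billion.

```python
import numpy as np

print(np.iinfo(np.int64).max)    # 9223372036854775807, about 9.2e18

n = np.int64(10**5)
cov = n * n                      # order 1e10: fine on its own
with np.errstate(over='ignore'):
    product = cov * cov          # order 1e20: wraps around in int64
print(int(product) == 10**20)    # False: the product silently overflowed

# n**2 by itself only overflows around n of 3e9, as noted above.
print(int(np.sqrt(np.iinfo(np.int64).max)))  # about 3.04e9
```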

@sam-s (Author) commented Sep 8, 2017

in the current patch, n_samples is a float64.

@lesteve (Member) commented Sep 8, 2017

given the lack of transparent bignums in numpy, your magic is dangerous.
I don't mind the other change ("manipulating quantities that are of order 1"), but casting to float is critical.

Although I agree it can be surprising at first, I would not call it magic; it is just the behaviour of true division that operands get cast to floats before the division takes place.

I am not going to claim to be a floating point expert, but I would go with my version posted in #9693 (comment).


@amueller (Member) commented Sep 8, 2017

At any rate, "manipulating quantities that are of order 1" does not matter for floats (by the mere meaning of the term "floating point number").

The exponent can also overflow, though I feel that is very unlikely to happen.

I'm more concerned about C.sum() overflowing (though also unlikely?), which is avoided by casting to float.

@amueller (Member) commented Sep 8, 2017

Either solution seems fine for me.


@jnothman (Member) left a comment

I'm happy to merge whichever solution


@jnothman (Member) commented Sep 10, 2017

We still need a second lgtm

@amueller amueller merged commit 4c61e8b into scikit-learn:master Sep 11, 2017
6 checks passed
- ci/circleci: Your tests passed on CircleCI!
- codecov/patch: 100% of diff hit (target 96.17%)
- codecov/project: 96.17% (+<.01%) compared to fb64216
- continuous-integration/appveyor/pr: AppVeyor build succeeded
- continuous-integration/travis-ci/pr: The Travis CI build passed
- lgtm analysis: Python: No alert changes
@amueller (Member) commented Sep 11, 2017

thanks

jnothman added a commit to jnothman/scikit-learn that referenced this pull request Sep 12, 2017

[MRG+1] avoid integer overflow by using floats for matthews_corrcoef (scikit-learn#9693)

* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef

* matthews_corrcoef: cosmetic change requested by jnothman

* Add test_matthews_corrcoef_overflow for Bug#9622

* test_matthews_corrcoef_overflow: clean-up and make deterministic

* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype

* test_matthews_corrcoef_overflow: add simple deterministic tests
amueller added a commit to amueller/scikit-learn that referenced this pull request Sep 12, 2017
massich added a commit to massich/scikit-learn that referenced this pull request Sep 15, 2017
amueller added a commit to amueller/scikit-learn that referenced this pull request Sep 19, 2017
maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017