[MRG+1] avoid integer overflow by using floats for matthews_corrcoef #9693
Conversation
Could you please add a test?
@@ -527,7 +527,8 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_true = lb.transform(y_true)
     y_pred = lb.transform(y_pred)

-    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight
+                         ).astype(np.float64)  # bug#9622
jnothman
Sep 5, 2017
Member
Better to use C = C.astype... than this awkward style.
Otherwise LGTM
true_pos = conf_matrix[1, 1]
false_pos = conf_matrix[1, 0]
false_neg = conf_matrix[0, 1]
n_points = float(conf_matrix.sum())
jnothman
Sep 6, 2017
Member
Is this float only for the sake of Python 2 division? It's not needed; future.division is in effect.
jnothman
Sep 6, 2017
Member
This can also be determined as len(y_true)
sam-s
Sep 6, 2017
Author
Contributor
done (the float was more along the lines of "code as documentation": added emphasis on why this is "correct").
    return mcc_numerator / np.sqrt(mcc_denominator)

def random_ys(n_points):
    x_true = np.random.sample(n_points)
jnothman
Sep 6, 2017
Member
We usually want tests to be deterministic (so we don't get occasional test failures). So please use a np.random.RandomState with a fixed seed to generate data.
sam-s
Sep 6, 2017
Author
Contributor
done
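For reference, a minimal sketch of the deterministic helper the review asks for; the seed value and the exact perturbation used to derive x_pred from x_true are assumptions, not taken from the patch:

import numpy as np

def random_ys(n_points, seed=42):
    # fixed seed -> identical data on every test run, no flaky failures
    rng = np.random.RandomState(seed)
    x_true = rng.random_sample(n_points)
    # hypothetical perturbation: nudge x_true so y_pred mostly agrees with y_true
    x_pred = x_true + 0.2 * (rng.random_sample(n_points) - 0.5)
    y_true = (x_true > 0.5) * 1.0
    y_pred = (x_pred > 0.5) * 1.0
    return y_true, y_pred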
@@ -528,6 +528,7 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)

     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = C.astype(np.float64)  # Bug#9622
     t_sum = C.sum(axis=1)
     p_sum = C.sum(axis=0)
     n_correct = np.trace(C)
jnothman
Sep 6, 2017
Member
can we use n_samples = len(y_true) to avoid imprecision there?
sam-s
Sep 6, 2017
Author
Contributor
no, because of sample_weight
jnothman
Sep 6, 2017
Member
good point. Thanks
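A tiny illustration of that point (values invented for the example): with sample_weight, the confusion matrix totals the weights, so C.sum() and len(y_true) genuinely differ.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
w = np.array([0.5, 0.5, 2.0, 3.0])

C = confusion_matrix(y_true, y_pred, sample_weight=w)
print(C.sum())      # 6.0 -> weighted sample count, equals w.sum()
print(len(y_true))  # 4   -> raw, unweighted sample count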
    y_pred = (x_pred > 0.5) * 1.0
    return y_true, y_pred

for n_points in [100, 10000, 1000000]:
jnothman
Sep 6, 2017
Member
Is there any reason to try with smaller numbers?
sam-s
Sep 6, 2017
Author
Contributor
The main reason is that the success at smaller numbers makes it clear that the failure with the big numbers indicates a problem with the stock function rather than with mcc_safe. IOW, this is a "test of the test". This is cheap, so I see no reason not to do it.
@@ -528,6 +528,7 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)

     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
+    C = C.astype(np.float64)  # Bug#9622
jnothman
Sep 6, 2017
Member
the bug number is not needed here, as long as the test fails if this line is removed (which it does).
sam-s
Sep 6, 2017
Author
Contributor
mathematically the astype is unnecessary and, in fact, could be viewed as dangerous, because it increases the possibility of round-off errors.
the bug reference explains why we do need it (because python lacks transparent bignums).
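Concretely, the round-off risk alluded to here only bites once a count exceeds float64's 53-bit mantissa; a quick check:

n = 2 ** 53 + 1
print(float(n))                     # 9007199254740992.0 -> the +1 is rounded away
print(float(n) == float(2 ** 53))   # True: two different ints, one float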
jnothman
Sep 6, 2017
Member
I think you mean numpy lacks transparent bignums. I'm pretty sure Python by default does them.
I don't mind this terribly. If we keep it, write "Issue #9622" rather than "Bug#9622".
FWIW, to pass the test I found it sufficient to just replace
t_sum = C.sum(axis=1, dtype=np.float64)
p_sum = C.sum(axis=0, dtype=np.float64)
but perhaps in other cases the trace is the problem?
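To make the failure mode explicit (a sketch, not sklearn code): the product under the square root in the MCC denominator scales like n_samples ** 4, which wraps around in numpy's fixed-width int64, whereas float64 and Python's arbitrary-precision ints cope.

import numpy as np

n_samples = np.int64(2_000_000)
cov = n_samples ** 2            # 4e12: still fits in int64
print(cov * cov)                # wraps around: numpy has no transparent bignums
print(float(cov) * float(cov))  # 1.6e+25: fine in float64
print(int(cov) * int(cov))      # exact: Python ints are arbitrary precision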
lesteve
Sep 7, 2017
Member
Maybe a slightly less magical change would be to do:

diff --git a/sklearn/metrics/classification.py b/sklearn/metrics/classification.py
index 7c4697d..9c92be5 100644
--- a/sklearn/metrics/classification.py
+++ b/sklearn/metrics/classification.py
@@ -528,15 +528,15 @@ def matthews_corrcoef(y_true, y_pred, sample_weight=None):
     y_pred = lb.transform(y_pred)
     C = confusion_matrix(y_true, y_pred, sample_weight=sample_weight)
-    C = C.astype(np.float64)  # Bug#9622
     t_sum = C.sum(axis=1)
     p_sum = C.sum(axis=0)
     n_correct = np.trace(C)
     n_samples = p_sum.sum()
-    cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
-    cov_ypyp = n_samples ** 2 - np.dot(p_sum, p_sum)
-    cov_ytyt = n_samples ** 2 - np.dot(t_sum, t_sum)
-    mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)
+    n_samples_square = n_samples ** 2
+    corr_ytyp = n_correct / n_samples - np.dot(t_sum, p_sum) / n_samples_square
+    corr_ypyp = 1 - np.dot(p_sum, p_sum) / n_samples_square
+    corr_ytyt = 1 - np.dot(t_sum, t_sum) / n_samples_square
+    mcc = corr_ytyp / np.sqrt(corr_ytyt * corr_ypyp)
     if np.isnan(mcc):
         return 0.

Also I think that this may be better in terms of floating point precision: manipulating quantities that are of order 1 reduces the likelihood of overflowing.
Note that from __future__ import division ensures that int divisions work as in Python 3.
jnothman
Sep 7, 2017
Member
That seems a decent approach!
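For readers skimming, lesteve's proposal extracted into a self-contained function (a sketch assuming Python 3 division semantics; the function name is mine, and the diff above is the authoritative version):

import numpy as np

def mcc_from_confusion_matrix(C):
    # every correlation term below is O(1), so no huge intermediate products
    t_sum = C.sum(axis=1)
    p_sum = C.sum(axis=0)
    n_correct = np.trace(C)
    n_samples = p_sum.sum()
    n_samples_square = n_samples ** 2
    corr_ytyp = n_correct / n_samples - np.dot(t_sum, p_sum) / n_samples_square
    corr_ypyp = 1 - np.dot(p_sum, p_sum) / n_samples_square
    corr_ytyt = 1 - np.dot(t_sum, t_sum) / n_samples_square
    mcc = corr_ytyp / np.sqrt(corr_ytyt * corr_ypyp)
    return 0. if np.isnan(mcc) else mcc

# the perfect-prediction case that overflowed on master now yields exactly 1.0
print(mcc_from_confusion_matrix(np.diag([100000, 100000])))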
flake8 errors:
Maybe the code can be made simpler and the test as well
for n_points in [100, 10000, 1000000]:
    y_true, y_pred = random_ys(n_points)
    assert_almost_equal(matthews_corrcoef(y_true, y_true), 1.0)
lesteve
Sep 7, 2017
Member
Just checking this with a simple array is a perfectly fine regression test for the bug we are seeing. A simpler example showing the bug:

import numpy as np
from sklearn.metrics.classification import matthews_corrcoef

n = int(1e5)
arr = np.repeat([0., 1.], n)
assert matthews_corrcoef(arr, arr) == 1.

# multi-class test
arr = np.repeat([0., 1., 2.], n)
assert matthews_corrcoef(arr, arr) == 1.

For completeness, the value of matthews_corrcoef(arr, arr) is 0 (two classes) and 35.32 (multi-class) on master.
sam-s
Sep 7, 2017
Author
Contributor
given the lack of transparent bignums in numpy, your magic is dangerous.
I don't mind the other change ("manipulating quantities that are of order 1"), but casting to float is critical.
sam-s
Sep 7, 2017
Author
Contributor
I will gladly add the simple test, but I see no reason to give up the big random one.
what does "MRG+1" mean? |
@sam-s It means the pull request has been approved by one core contributor.
@lesteve I don't entirely get why your solution works. Isn't the largest number
Integer division behaves as we want for Python 3 (i.e. true division) and we have from __future__ import division in the current patch.
Although I agree it can be surprising at first, I would not call it magic: it is just the behaviour of true division that operands get cast to floats before the division takes place. I am not going to claim to be a floating point expert, but I would go with my version posted in #9693 (comment).
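The true-division behaviour lesteve describes, in two lines:

import numpy as np

print(np.int64(3) / np.int64(2))   # 1.5 -> true division promotes to float64
print(np.int64(3) // np.int64(2))  # 1   -> floor division stays integer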
* Loïc Estève ***@***.***> [2017-09-08 07:16:35 -0700]:
> > given the lack of transparent bignums in numpy, your magic is dangerous.
> > I don't mind the other change ("manipulating quantities that are of
> > order 1"), but casting to float is critical.

Yes, casting to float is the crux of the matter, and the current version
does it via the `dtype=np.float64` argument.

> Although I agree it can be surprising at first I would not call it
> magic, it is just the behaviour of true division that operands get
> cast to floats before the division takes place.

Yes, but we must ensure that they are correct before the cast.
The current version does it.

> I am not going to claim to be a floating point expert, but I would go
> with my version posted in #9693 (comment).

The only advantage of your version over the current one is that you
compute `n_samples ** 2` once instead of twice.
This is more than made up for by several extra divisions.

At any rate, "manipulating quantities that are of order 1" does not
matter for floats (by the mere _meaning_ of the term "floating point
number").
The exponent can also overflow, though I feel that is very unlikely to happen. I'm more concerned about
Either solution seems fine to me.
* Andreas Mueller ***@***.***> [2017-09-08 15:51:57 +0000]:
> > At any rate, "manipulating quantities that are of order 1" does not
> > matter for floats (by the mere _meaning_ of the term "floating point
> > number").
>
> The exponent can also overflow, though I feel that is very unlikely to happen.

The only way it could happen is if sample_weight.sum() overflows, as in

    sample_weight = np.repeat([1e305], 10000)

We can deal with this issue by scaling:

    sample_weight /= abs(sample_weight.max()) + abs(sample_weight.min())

but it is unclear if this is worth the effort.
Thanks.
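That scenario and the proposed rescaling, checked directly (a sketch; note that MCC is unchanged by uniformly rescaling the weights):

import numpy as np

sample_weight = np.repeat([1e305], 10000)
print(sample_weight.sum())  # inf: 1e309 exceeds the float64 exponent range

# proposed mitigation: rescale the weights before summing
scale = abs(sample_weight.max()) + abs(sample_weight.min())
print((sample_weight / scale).sum())  # 5000.0: finite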
I'm happy to merge whichever solution
On Sat, Sep 9, 2017 at 7:50 AM, Joel Nothman ***@***.***> wrote:
> @jnothman commented on this pull request.
> I'm happy to merge whichever solution

Thank you, the current version works.
We still need a second LGTM.
Merged 4c61e8b into scikit-learn:master
thanks
…cikit-learn#9693)

* Fix bug#9622: avoid integer overflow by using floats for matthews_corrcoef
* matthews_corrcoef: cosmetic change requested by jnothman
* Add test_matthews_corrcoef_overflow for Bug#9622
* test_matthews_corrcoef_overflow: clean-up and make deterministic
* matthews_corrcoef: pass dtype=np.float64 to sum & trace instead of using astype
* test_matthews_corrcoef_overflow: add simple deterministic tests
Reference Issue
Fixes #9622
What does this implement/fix? Explain your changes.
avoid integer overflow by using floats
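For completeness, the approach that was merged according to the commit log above (a sketch, not the verbatim sklearn source): pass dtype=np.float64 to sum and trace, so the accumulation itself happens in floats and no integer intermediate can overflow.

import numpy as np

C = np.diag([1_000_000, 1_000_000])  # a confusion matrix that overflowed on master
t_sum = C.sum(axis=1, dtype=np.float64)
p_sum = C.sum(axis=0, dtype=np.float64)
n_correct = np.trace(C, dtype=np.float64)
n_samples = p_sum.sum()
cov_ytyp = n_correct * n_samples - np.dot(t_sum, p_sum)
cov_ypyp = n_samples ** 2 - np.dot(p_sum, p_sum)
cov_ytyt = n_samples ** 2 - np.dot(t_sum, t_sum)
print(cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp))  # 1.0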