
scipy.stats.pearsonr overflows with high values of x and y #8980

Closed
antoinecarme opened this issue Jun 28, 2018 · 4 comments · Fixed by #9562
Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Comments

antoinecarme commented Jun 28, 2018

Hi,

This issue reports that the implementation of pearsonr is not tolerant to overflow. The problem is this line:

r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))

Pearson's r is invariant under scaling the inputs by a multiplicative coefficient, but when that coefficient is large, the computation fails with a warning:

scipy/stats/stats.py:3013: RuntimeWarning: overflow encountered in double_scalars
  r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))

The product of the two sums of squares is too large to represent. The computation needs to be reorganized so that this intermediate product is never formed.
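Concretely, with 1e90-scaled inputs like those in the reproduction below, each individual sum of squares still fits in a float64, but their product does not (variable names here are illustrative, not from SciPy's source):

```python
import numpy as np

a = np.array([0., 0., 0., 1., 1., 1., 1.]) * 1e90
b = np.arange(7.0) * 1e90
xm = a - a.mean()
ym = b - b.mean()
sx = np.sum(xm**2)        # ~1.7e180, still representable
sy = np.sum(ym**2)        # ~2.8e181, still representable
with np.errstate(over='ignore'):
    prod = sx * sy        # ~4.8e361 exceeds the float64 max (~1.8e308) -> inf
print(sx, sy, prod, np.sqrt(prod))   # prod and its square root are both inf
```

The float64 range tops out near 1.8e308, so any reformulation that keeps intermediate values near the scale of the inputs (e.g. normalizing before multiplying) avoids the problem.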

To reproduce (starting from the documentation example on the master branch):

import numpy as np
from scipy import stats

a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)
pearson1 = stats.pearsonr(a, b)

# Scaling both inputs should leave r unchanged, but triggers the overflow.
a1 = a * 1e90
b1 = b * 1e90
pearson2 = stats.pearsonr(a1, b1)

print(pearson1, pearson2)
assert np.allclose(pearson1[0], pearson2[0])
assert np.allclose(pearson1[1], pearson2[1])

Versions:

Output of import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info):

1.1.0 1.14.5 sys.version_info(major=3, minor=6, micro=5, releaselevel='final', serial=0)

I also searched for similar existing issues before opening this one.

@WarrenWeckesser added the defect and scipy.stats labels Jun 29, 2018

@WarrenWeckesser (Member)

Thanks for reporting this problem. It looks like some simple changes could be made to avoid this overflow.

@antoinecarme (Author)

Yes, sure.

May I take care of fixing this bug? The following change seems to work:

    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    mx = x.mean()
    my = y.mean()
    xm, ym = x - mx, y - my
    sum_sqr_x, sum_sqr_y = _sum_of_squares(xm), _sum_of_squares(ym)
    if sum_sqr_x == 0.0 or sum_sqr_y == 0.0:
        return (0.0, 0.0)
    # Normalize each centered vector before multiplying, so the
    # intermediate values stay near the scale of the inputs.
    xm1 = xm / np.sqrt(sum_sqr_x)
    ym1 = ym / np.sqrt(sum_sqr_y)
    r = np.add.reduce(xm1 * ym1)
    ....

I can add a pull request with the code change and a failing test for review.

@ilayn (Member)

ilayn commented Jul 9, 2018

Sure, please go ahead.

I don't know why that quantity is computed manually; numpy.hypot is a better elementwise alternative for the manual size computation. I also don't know whether there is a reason for not using norm(..., p=2).

@antoinecarme (Author)

As you said, norm is also a solution.

_sum_of_squares performs some additional checks on NumPy arrays, so I kept it as is.

I have just opened a PR for this bug. In summary, it is simply a reordering of the computations.

Reviews are welcome.
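The reordering can be sketched as follows. This is a minimal illustration, not SciPy's actual implementation; `pearson_r_stable` is a hypothetical helper name, and the p-value computation is omitted:

```python
import numpy as np
from scipy import stats

def pearson_r_stable(x, y):
    # Normalize the centered vectors *before* taking the dot product,
    # so no intermediate quantity reaches the float64 overflow range.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    xm /= np.linalg.norm(xm)
    ym /= np.linalg.norm(ym)
    return float(np.dot(xm, ym))

a = np.array([0., 0., 0., 1., 1., 1., 1.])
b = np.arange(7.0)
r_small = stats.pearsonr(a, b)[0]
r_big = pearson_r_stable(a * 1e90, b * 1e90)
print(r_small, r_big)   # both are close to sqrt(3)/2 ~ 0.8660
```

Because `np.linalg.norm` of the 1e90-scale vectors is itself only about 1e90, every intermediate stays comfortably inside the float64 range.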

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 3, 2018
The calculation of r in pearsonr(x, y) is rewritten in a more
robust form.

Let

    xm = x - mean(x)
    ym = y - mean(y)

In this commit, the calculation of r is changed from (roughly)

    r = dot(xm, ym)/sqrt((xm**2).sum() * (ym**2).sum())

to

    r = dot(xm/norm(xm), ym/norm(ym))

where the norm is calculated using scipy.linalg.norm.

`scipy.linalg.norm` avoids the underflow and overflow of intermediate results
that was the source of a couple bug reports.

The function has the following additional argument validation:

 * A ValueError is raised if the lengths of x and y are not the same.
 * A ValueError is raised if the length is less than 2.

New warnings are generated in the following cases:

 * If an input is *constant*, a PearsonRConstantInputWarning is generated,
   and nan is returned for both r and prob.  The correlation coefficient
   is not defined in this case.
 * If an input is *nearly* constant, a PearsonRNearConstantWarning is generated,
   but the calculation proceeds.  The condition for x being classified as
   near constant is

       norm(xm)/abs(mean(x)) < 1e-13

   (This is a heuristic condition, and it may generate false warnings.)
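The near-constant heuristic above can be sketched as follows (`near_constant` is a hypothetical helper name; the sketch assumes a nonzero mean, as a zero-mean constant input would already have been caught by the exact-constant check):

```python
import numpy as np

def near_constant(x, threshold=1e-13):
    # Heuristic from the commit message: if the relative spread of x about
    # its mean is below ~1e-13, the centered values are likely dominated by
    # rounding noise, so r would be mostly noise too.
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.linalg.norm(xm) / abs(x.mean()) < threshold

print(near_constant([1.0, 1.0, 1.0 + 1e-15]))   # True: rounding-level spread
print(near_constant([1.0, 2.0, 3.0]))           # False: genuine variation
```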

Another change is that inputs with length 2 are handled as a special case.
The only possible values of r in this case are +1 or -1, and in both cases,
the p-value is 1.

The calculation of the p-value has also been rewritten. It now has the simpler
form

    halfdf = len(x)/2 - 1
    prob = 2*special.btdtr(halfdf, halfdf, 0.5*(1 - abs(r)))

`special.btdtr` is the CDF of the beta distribution.
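Under the null hypothesis, (1 + r)/2 follows a Beta(n/2 - 1, n/2 - 1) distribution, so the beta-CDF form agrees with the classical t-statistic formula. The check below uses `scipy.stats.beta.cdf`, which computes the same beta CDF as `special.btdtr`:

```python
import numpy as np
from scipy import stats

n, r = 10, 0.5                       # illustrative sample size and r value
ab = n / 2 - 1                       # both beta shape parameters

# Beta-CDF form from the commit message.
p_beta = 2 * stats.beta.cdf(0.5 * (1 - abs(r)), ab, ab)

# Classical form: t = r*sqrt(df/(1 - r**2)), two-sided tail of Student's t.
t = r * np.sqrt((n - 2) / (1 - r**2))
p_t = 2 * stats.t.sf(abs(t), n - 2)

print(p_beta, p_t)                   # the two values agree
```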

There were two existing tests that did not pass with the new code.  For the
inputs x = [0.0, 1.0, 2.0] with y = x and y = -x, the previous version of
pearsonr returned *exactly* 1 and -1, respectively, with p-value 0 in both
cases.  With the new code, we have:

    In [14]: x = np.arange(3.0)

    In [15]: pearsonr(x, x)
    Out[15]: (0.9999999999999998, 1.3415758552508151e-08)

    In [16]: pearsonr(x, -x)
    Out[16]: (-0.9999999999999998, 1.3415758552508151e-08)

The r values differ from the exact values by 2 ULPs.  The reason the p-values
are then so large (well, large compared to the exact value 0) is that, for n=3,
the p-value for r near 1 or -1 grows like the square root of 1 - abs(r), and
np.spacing(1) is about 2.22e-16.  When n is larger, the p-value is not so
sensitive to small perturbations in r.
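This can be checked directly: for n = 3 both beta shape parameters are 1/2, so the beta CDF is the arcsine law (2/pi)*arcsin(sqrt(x)), and the p-value near |r| = 1 grows like sqrt(1 - |r|). The sketch below uses `scipy.stats.beta.cdf` in place of `special.btdtr`:

```python
import numpy as np
from scipy import stats

r = 0.9999999999999998              # the value returned above, 2 ULPs below 1
p = 2 * stats.beta.cdf(0.5 * (1 - r), 0.5, 0.5)
print(p)                            # ~1.34e-08, matching the pearsonr output
```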

New tests have been added to verify that problems reported with the old
code are now fixed.

Closes scipygh-3728.
Closes scipygh-7730.
Closes scipygh-8980.
Closes scipygh-9353.
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 3, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 6, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 6, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 9, 2019
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 16, 2019
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 18, 2019
@tylerjereddy tylerjereddy added this to the 1.3.0 milestone Apr 18, 2019