
scipy.stats.pearsonr overflows with high values of x and y #8980

Closed
antoinecarme opened this issue Jun 28, 2018 · 4 comments · Fixed by #9562
Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Comments

antoinecarme commented Jun 28, 2018

Hi,

This issue reports that the implementation of pearsonr is not tolerant to overflow. The problem is this line:

r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))

Pearson's r is invariant under scaling the inputs by a multiplicative coefficient, but when that coefficient is large, the computation fails with a warning:

scipy/stats/stats.py:3013: RuntimeWarning: overflow encountered in double_scalars
  r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))

The product of the two sums of squares is too large to represent. The computation needs to be reorganized so that this intermediate product is never formed.
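Concretely, with 1e90-scaled inputs like those in the reproduction below, each individual sum of squares still fits in a float64, but their product does not (variable names here are illustrative, not from SciPy's source):

```python
import numpy as np

a = np.array([0., 0., 0., 1., 1., 1., 1.]) * 1e90
b = np.arange(7.0) * 1e90
xm = a - a.mean()
ym = b - b.mean()
sx = np.sum(xm**2)        # ~1.7e180, still representable
sy = np.sum(ym**2)        # ~2.8e181, still representable
with np.errstate(over='ignore'):
    prod = sx * sy        # ~4.8e361 exceeds the float64 max (~1.8e308) -> inf
print(sx, sy, prod, np.sqrt(prod))   # prod and its square root are both inf
```

The float64 range tops out near 1.8e308, so any reformulation that keeps intermediate values near the scale of the inputs (e.g. normalizing before multiplying) avoids the problem.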

To reproduce (starting from the documentation example on the master branch):

import numpy as np
from scipy import stats

a = np.array([0, 0, 0, 1, 1, 1, 1])
b = np.arange(7)
pearson1 = stats.pearsonr(a, b)

# Scaling both inputs should leave r unchanged, but triggers the overflow.
a1 = a * 1e90
b1 = b * 1e90
pearson2 = stats.pearsonr(a1, b1)

print(pearson1, pearson2)
assert np.allclose(pearson1[0], pearson2[0])
assert np.allclose(pearson1[1], pearson2[1])

Versions:

Output of import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info):

1.1.0 1.14.5 sys.version_info(major=3, minor=6, micro=5, releaselevel='final', serial=0)

I also searched for similar existing issues before opening this one.

@WarrenWeckesser added the defect and scipy.stats labels Jun 29, 2018

@WarrenWeckesser (Member)

Thanks for reporting this problem. It looks like some simple changes could be made to avoid this overflow.

@antoinecarme (Author)

Yes, sure.

May I take care of fixing this bug? The following change seems to work:

    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    mx = x.mean()
    my = y.mean()
    xm, ym = x - mx, y - my
    sum_sqr_x, sum_sqr_y = _sum_of_squares(xm), _sum_of_squares(ym)
    if sum_sqr_x == 0.0 or sum_sqr_y == 0.0:
        return (0.0, 0.0)
    # Normalize each centered vector before multiplying, so the
    # intermediate values stay near the scale of the inputs.
    xm1 = xm / np.sqrt(sum_sqr_x)
    ym1 = ym / np.sqrt(sum_sqr_y)
    r = np.add.reduce(xm1 * ym1)
    ....

I can add a pull request with the code change and a failing test for review.

@ilayn (Member)

ilayn commented Jul 9, 2018

Sure, please go ahead.

I don't know why that quantity is computed manually; numpy.hypot is a better elementwise alternative for the manual size computation. I also don't know whether there is a reason for not using norm(..., p=2).

@antoinecarme (Author)

As you said, norm is also a solution.

_sum_of_squares performs some additional checks on NumPy arrays, so I kept it as is.

I have just opened a PR for this bug. In summary, it is simply a reordering of the computations.

Reviews are welcome.
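The reordering can be sketched as follows. This is a minimal illustration, not SciPy's actual implementation; `pearson_r_stable` is a hypothetical helper name, and the p-value computation is omitted:

```python
import numpy as np
from scipy import stats

def pearson_r_stable(x, y):
    # Normalize the centered vectors *before* taking the dot product,
    # so no intermediate quantity reaches the float64 overflow range.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean()
    ym = y - y.mean()
    xm /= np.linalg.norm(xm)
    ym /= np.linalg.norm(ym)
    return float(np.dot(xm, ym))

a = np.array([0., 0., 0., 1., 1., 1., 1.])
b = np.arange(7.0)
r_small = stats.pearsonr(a, b)[0]
r_big = pearson_r_stable(a * 1e90, b * 1e90)
print(r_small, r_big)   # both are close to sqrt(3)/2 ~ 0.8660
```

Because `np.linalg.norm` of the 1e90-scale vectors is itself only about 1e90, every intermediate stays comfortably inside the float64 range.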

WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 3, 2018
The calculation of r in pearsonr(x, y) is rewritten in a more
robust form.

Let

    xm = x - mean(x)
    ym = y - mean(y)

In this commit, the calculation of r is changed from (roughly)

    r = dot(xm, ym)/sqrt((xm**2).sum() * (ym**2).sum())

to

    r = dot(xm/norm(xm), ym/norm(ym))

where the norm is calculated using scipy.linalg.norm.

`scipy.linalg.norm` avoids the underflow and overflow of intermediate results
that was the source of a couple bug reports.

The function has the following additional argument validation:

 * A ValueError is raised if the lengths of x and y are not the same.
 * A ValueError is raised if the length is less than 2.

New warnings are generated in the following cases:

 * If an input is *constant*, a PearsonRConstantInputWarning is generated,
   and nan is returned for both r and prob.  The correlation coefficient
   is not defined in this case.
 * If an input is *nearly* constant, a PearsonRNearConstantWarning is generated,
   but the calculation proceeds.  The condition for x being classified as
   near constant is

       norm(xm)/abs(mean(x)) < 1e-13

   (This is a heuristic condition, and it may generate false warnings.)
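The near-constant heuristic above can be sketched as follows (`near_constant` is a hypothetical helper name; the sketch assumes a nonzero mean, as a zero-mean constant input would already have been caught by the exact-constant check):

```python
import numpy as np

def near_constant(x, threshold=1e-13):
    # Heuristic from the commit message: if the relative spread of x about
    # its mean is below ~1e-13, the centered values are likely dominated by
    # rounding noise, so r would be mostly noise too.
    x = np.asarray(x, dtype=float)
    xm = x - x.mean()
    return np.linalg.norm(xm) / abs(x.mean()) < threshold

print(near_constant([1.0, 1.0, 1.0 + 1e-15]))   # True: rounding-level spread
print(near_constant([1.0, 2.0, 3.0]))           # False: genuine variation
```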

Another change is that inputs with length 2 are handled as a special case.
The only possible values of r in this case are +1 or -1, and in both cases,
the p-value is 1.

The calculation of the p-value has also been rewritten. It now has the simpler
form

    halfdf = len(x)/2 - 1
    prob = 2*special.btdtr(halfdf, halfdf, 0.5*(1 - abs(r)))

`special.btdtr` is the CDF of the beta distribution.
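Under the null hypothesis, (1 + r)/2 follows a Beta(n/2 - 1, n/2 - 1) distribution, so the beta-CDF form agrees with the classical t-statistic formula. The check below uses `scipy.stats.beta.cdf`, which computes the same beta CDF as `special.btdtr`:

```python
import numpy as np
from scipy import stats

n, r = 10, 0.5                       # illustrative sample size and r value
ab = n / 2 - 1                       # both beta shape parameters

# Beta-CDF form from the commit message.
p_beta = 2 * stats.beta.cdf(0.5 * (1 - abs(r)), ab, ab)

# Classical form: t = r*sqrt(df/(1 - r**2)), two-sided tail of Student's t.
t = r * np.sqrt((n - 2) / (1 - r**2))
p_t = 2 * stats.t.sf(abs(t), n - 2)

print(p_beta, p_t)                   # the two values agree
```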

There were two existing tests that did not pass with the new code.  For the
inputs x = [0.0, 1.0, 2.0] with y = x and y = -x, the previous version of
pearsonr returned *exactly* 1 and -1, respectively, with p-value 0 in both
cases.  With the new code, we have:

    In [14]: x = np.arange(3.0)

    In [15]: pearsonr(x, x)
    Out[15]: (0.9999999999999998, 1.3415758552508151e-08)

    In [16]: pearsonr(x, -x)
    Out[16]: (-0.9999999999999998, 1.3415758552508151e-08)

The r values differ from the exact values by 2 ULPs.  The reason the p-values
are then so large (well, large compared to the exact value 0) is that, for n=3,
the p-value for r near 1 or -1 grows like the square root of 1 - abs(r), and
np.spacing(1) is about 2.22e-16.  When n is larger, the p-value is not so
sensitive to small perturbations in r.
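This can be checked directly: for n = 3 both beta shape parameters are 1/2, so the beta CDF is the arcsine law (2/pi)*arcsin(sqrt(x)), and the p-value near |r| = 1 grows like sqrt(1 - |r|). The sketch below uses `scipy.stats.beta.cdf` in place of `special.btdtr`:

```python
import numpy as np
from scipy import stats

r = 0.9999999999999998              # the value returned above, 2 ULPs below 1
p = 2 * stats.beta.cdf(0.5 * (1 - r), 0.5, 0.5)
print(p)                            # ~1.34e-08, matching the pearsonr output
```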

New tests have been added to verify that problems reported with the old
code are now fixed.

Closes scipygh-3728.
Closes scipygh-7730.
Closes scipygh-8980.
Closes scipygh-9353.
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 3, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 6, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Dec 6, 2018
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 9, 2019
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 16, 2019
WarrenWeckesser added a commit to WarrenWeckesser/scipy that referenced this issue Apr 18, 2019
@tylerjereddy tylerjereddy added this to the 1.3.0 milestone Apr 18, 2019