scipy.stats.pearsonr overflows with high values of x and y #8980
Thanks for reporting this problem. It looks like some simple changes could be made to avoid this overflow.
Yes, sure. May I take care of fixing this bug? The following change seems OK (see the sketch below):
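A minimal sketch of the kind of reordering meant here, assuming the fix described later in the thread; the function name `pearson_r_reordered` is hypothetical, and the actual PR kept scipy's private `_sum_of_squares` helper:

```python
import numpy as np

def pearson_r_reordered(x, y):
    # Hypothetical sketch: take the square root of each sum of squares
    # *before* multiplying, so the intermediate product
    # ss(xm) * ss(ym) -- which is what overflows -- is never formed.
    xm = x - x.mean()
    ym = y - y.mean()
    r_den = np.sqrt(np.sum(xm**2)) * np.sqrt(np.sum(ym**2))
    return np.sum(xm * ym) / r_den
```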
I can add a pull request with the code change and a failing test for review.
Sure, please go ahead. I don't know why that quantity is manually computed, but numpy.hypot is a better elementwise alternative for the manual size computation. I also don't know if there is a reason for not using the norm.
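As a quick illustration of why `numpy.hypot` is more robust than squaring and summing by hand (the values here are arbitrary):

```python
import numpy as np

x = np.array([3e200])
y = np.array([4e200])
print(np.hypot(x, y))        # [5.e+200] -- hypot avoids forming the squares
print(np.sqrt(x**2 + y**2))  # [inf], with a RuntimeWarning about overflow
```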
norm is also a solution, as you said. `_sum_of_squares` performs some additional checks on numpy arrays, so I kept it as it is. I just added a PR for this bug. In summary, it is just a reordering of the computations. Your reviews are welcome.
The calculation of r in pearsonr(x, y) is rewritten in a more robust form. Let

```
xm = x - mean(x)
ym = y - mean(y)
```

In this commit, the calculation of r is changed from (roughly)

```
r = dot(xm, ym)/sqrt((xm**2).sum() * (ym**2).sum())
```

to

```
r = dot(xm/norm(xm), ym/norm(ym))
```

where the norm is calculated using `scipy.linalg.norm`. `scipy.linalg.norm` avoids the underflow and overflow of intermediate results that was the source of a couple bug reports.

The function has the following additional argument validation:

* A ValueError is raised if the lengths of x and y are not the same.
* A ValueError is raised if the length is less than 2.

New warnings are generated in the following cases:

* If an input is *constant*, a PearsonRConstantInputWarning is generated, and nan is returned for both r and prob. The correlation coefficient is not defined in this case.
* If an input is *nearly* constant, a PearsonRNearConstantWarning is generated, but the calculation proceeds. The condition for x being classified as near constant is `norm(xm)/abs(mean(x)) < 1e-13`. (This is a heuristic condition, and it may generate false warnings.)

Another change is that inputs with length 2 are handled as a special case. The only possible values of r in this case are +1 or -1, and in both cases, the p-value is 1.

The calculation of the p-value has also been rewritten. It now has the simpler form

```
ab = len(x)/2 - 1
prob = 2*special.btdtr(ab, ab, 0.5*(1 - abs(r)))
```

`special.btdtr` is the CDF of the beta distribution.

There were two existing tests that did not pass with the new code. For the inputs x = [0.0, 1.0, 2.0] with y = x and y = -x, the previous version of pearsonr returned *exactly* 1 and -1, respectively, with p-value 0 in both cases. With the new code, we have:

```
In [14]: x = np.arange(3.0)

In [15]: pearsonr(x, x)
Out[15]: (0.9999999999999998, 1.3415758552508151e-08)

In [16]: pearsonr(x, -x)
Out[16]: (-0.9999999999999998, 1.3415758552508151e-08)
```

The r values differ from the exact values by 2 ULPs. The reason the p-values are then so large (well, large compared to the exact value 0) is that, for n=3, the p-value for r near 1 or -1 grows like the square root of 1 - abs(r), and np.spacing(1) is about 2.22e-16. When n is larger, the p-value is not so sensitive to small perturbations in r.

New tests have been added to verify that problems reported with the old code are now fixed.

Closes scipygh-3728. Closes scipygh-7730. Closes scipygh-8980. Closes scipygh-9353. Closes scipygh-9406.
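Putting the pieces of the commit message together, here is a minimal, self-contained sketch of the rewritten computation. The function name `robust_pearson_r` is hypothetical; the real implementation is `scipy.stats.pearsonr`, which additionally performs the validation and warning logic listed above:

```python
import numpy as np
from scipy import special
from scipy.linalg import norm

def robust_pearson_r(x, y):
    # Normalize each centered vector before the dot product, so every
    # intermediate value is O(1) and cannot overflow or underflow.
    xm = x - x.mean()
    ym = y - y.mean()
    r = np.dot(xm / norm(xm), ym / norm(ym))
    # Guard against rounding error pushing |r| slightly past 1.
    r = max(min(r, 1.0), -1.0)
    # Two-sided p-value from the CDF of a symmetric beta distribution.
    ab = len(x) / 2 - 1
    prob = 2 * special.btdtr(ab, ab, 0.5 * (1 - abs(r)))
    return r, prob
```

Because the normalization happens before any products are formed, this sketch returns the same r for inputs scaled by, say, 1e100 as for the unscaled inputs.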
Hi,
This issue reports that the implementation of pearsonr is not tolerant to overflow.
See `scipy/stats/stats.py`, line 3032, at commit b7865dd:
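Paraphrasing the computation around that line (not an exact copy of the scipy 1.1.0 source; `_sum_of_squares` is scipy's private helper, re-defined here so the sketch is self-contained):

```python
import numpy as np

def _sum_of_squares(a):
    # Stand-in for scipy's private helper of the same name.
    return np.sum(a * a)

def pearson_r_old_style(x, y):
    # The two sums of squares are multiplied *before* the square root,
    # so their product can overflow even when each factor is finite.
    xm = x - x.mean()
    ym = y - y.mean()
    r_num = np.add.reduce(xm * ym)
    r_den = np.sqrt(_sum_of_squares(xm) * _sum_of_squares(ym))
    return r_num / r_den
```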
pearsonr is mathematically invariant under multiplicative scaling of its inputs, but when the scale of the data is too large, the computation fails with an overflow warning. The product of the two sums of squares is too large to represent; the computation needs to be reordered so that this intermediate product is avoided.
To reproduce (starting from the documentation example in the master branch):
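A minimal sketch with illustrative inputs (any vectors of this magnitude trigger the same overflow):

```python
import numpy as np
from scipy import stats

# Scaling both inputs by a constant should leave r unchanged, but in
# scipy 1.1.0 the product of the two sums of squares (on the order of
# 1e401 here) overflows: an overflow RuntimeWarning is emitted and the
# returned r is spurious.
scale = 1e100
a = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]) * scale
b = np.arange(7.0) * scale
print(stats.pearsonr(a, b))
```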
Versions (output of `import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)`):

1.1.0 1.14.5 sys.version_info(major=3, minor=6, micro=5, releaselevel='final', serial=0)
I also searched to check whether a similar issue already exists.