BUG: Overflow in 'new' implementation of scipy.stats.kendalltau #18139

Closed · dennisdeh opened this issue Mar 12, 2023 · 3 comments · Fixed by #18193
Labels: defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Comments

@dennisdeh
Contributor

dennisdeh commented Mar 12, 2023

Describe your issue.

I have noticed an overflow in scipy.stats.kendalltau with the default parameters in scipy 1.10.0 when the given samples are sufficiently large.

Interestingly, this overflow does not occur with the older implementation of scipy.stats.kendalltau from scipy 0.15.1. The older implementation differs from the current one, which changed a couple of versions ago and by default employs an approximation for large samples. The parameter method="exact" does not work when there are ties, as in the example below.

Reproducing Code Example

from scipy.stats import kendalltau
import random

# generate lists of random classes 1-7 (multiclass)
random.seed(6272161)
classes = [1, 2, 3, 4, 5, 6, 7]
n_samples = 2*10**5
x = random.choices(classes, k=n_samples)
y = random.choices(classes, k=n_samples)

# Calculate ordinal association with Kendall's tau:
res_scipy_kt = kendalltau(x, y)

The outputs for scipy 1.10.0 and scipy 0.15.1 are:

                statistic               pvalue
scipy 1.10.0    0.0011816493905730343   0.4879408388218892
scipy 0.15.1    0.0011816493905730343   0.427971540028532

The statistics agree to machine precision even with the overflow, but the p-values do not. Some difference is expected, since scipy 1.10.0 employs an approximate algorithm to calculate the p-value, but the overflow suggests that the problem lies in the p-value calculation.

Error message

C:\Users\denni\.conda\envs\inv\lib\site-packages\scipy\stats\_stats_py.py:5283: RuntimeWarning: overflow encountered in longlong_scalars
  (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))
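
(Illustration added for context; not part of the original report.) The tie counts feeding that term are very large for this example: with 2*10**5 samples over 7 classes, each value repeats roughly 28,500 times. A rough check, reusing x and y from the reproducing example above (the tied_pairs helper is added here purely for illustration), shows that 2 * xtie * ytie exceeds the int64 maximum of about 9.2e18:

import numpy as np

def tied_pairs(values):
    # Number of tied pairs, essentially what kendalltau computes
    # internally for the tie correction (helper added for illustration).
    cnt = np.bincount(values)
    return int((cnt * (cnt - 1) // 2).sum())

# Reusing x and y from the reproducing example above.
xtie, ytie = tied_pairs(x), tied_pairs(y)
print(xtie, ytie)                                # each about 2.86e9
print(2 * xtie * ytie > np.iinfo(np.int64).max)  # True: the product overflows int64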

SciPy/NumPy/Python version and system information

1.10.0 1.23.5 sys.version_info(major=3, minor=8, micro=16, releaselevel='final', serial=0)
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
dennisdeh added the defect label on Mar 12, 2023
@mdhaber
Contributor

mdhaber commented Mar 13, 2023

The warning indicates the line it's coming from.

scipy/scipy/stats/_stats_py.py, lines 5885 to 5886 in 0a017f2:

var = ((m * (2*size + 5) - x1 - y1) / 18 +
       (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))

which is in the p-value calculation. In this case, the statistic is OK because the sample is not large enough for those parts of the calculation to overflow.

A quick fix could be to make sure that the NumPy integers (e.g. xtie, ytie) are Python ints, which won't overflow.
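
A minimal sketch of that idea (stand-in values, not the actual patch from the linked PR), showing how casting the int64 tie counts to Python ints keeps the intermediate product exact:

import numpy as np

# Stand-in values with the magnitudes from this report's example
# (about 2e5 samples over 7 classes); inside kendalltau these
# quantities arrive as NumPy int64 scalars.
size = 200_000
m = size * (size - 1) // 2            # number of pairs, a Python int here
xtie = np.int64(2_857_000_000)
ytie = np.int64(2_857_000_000)

# 2 * xtie * ytie is about 1.6e19, beyond the int64 maximum (~9.2e18),
# so the int64 product wraps around (NumPy also warns about it).
bad = (2 * xtie * ytie) / m

# Quick fix: cast to Python ints first; Python ints never overflow.
good = (2 * int(xtie) * int(ytie)) / m

print(bad, good)    # the first value is nonsense, the second is correct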

The older implementation differs from the current one, which changed a couple of versions ago and by default employs an approximation for large samples.

The "exact" p-value calculation is not exact when there are ties, so a normal approximation that corrects for ties is used. Whether this is actually more accurate than using the "exact" method (even when it is not truly exact) depends on the nature of the ties. In the meantime, there is permutation_test, and hopefully we can resolve the interface question (gh-18067) to allow the user to select it as a method for computing the p-value.

@mdhaber
Contributor

mdhaber commented Mar 19, 2023

@dennisdeh did you want to submit a PR or should I fix it?

@dennisdeh
Contributor Author

@dennisdeh did you want to submit a PR or should I fix it?

Thanks for fixing it, mdhaber
