BUG: Overflow in 'new' implementation of scipy.stats.kendalltau #18139

Closed · dennisdeh opened this issue Mar 12, 2023 · 3 comments · Fixed by #18193
Labels: defect (A clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Comments

@dennisdeh
Contributor

dennisdeh commented Mar 12, 2023

Describe your issue.

I have noticed an overflow in scipy.stats.kendalltau with the default parameters in scipy 1.10.0 when the given samples are sufficiently large.

Interestingly, this overflow does not occur with the older implementation of scipy.stats.kendalltau from scipy 0.15.1. The older implementation differs from the current one, which changed a couple of versions ago and by default employs an approximation for large samples. The parameter method="exact" does not work when there are ties, as in the example below.

Reproducing Code Example

from scipy.stats import kendalltau
import random

# generate lists of random classes 1-7 (multiclass)
random.seed(6272161)
classes = [1, 2, 3, 4, 5, 6, 7]
n_samples = 2*10**5
x = random.choices(classes, k=n_samples)
y = random.choices(classes, k=n_samples)

# Calculate ordinal association with Kendall's tau:
res_scipy_kt = kendalltau(x, y)

The outputs for scipy 1.10.0 and scipy 0.15.1 are:

                statistic               pvalue
scipy 1.10.0    0.0011816493905730343   0.4879408388218892
scipy 0.15.1    0.0011816493905730343   0.427971540028532

The statistics agree to machine precision even with the overflow, but the p-values do not. Some difference is expected, since scipy 1.10.0 employs an approximate algorithm to calculate the p-value, but the overflow suggests that the problem lies in the p-value calculation.

Error message

C:\Users\denni\.conda\envs\inv\lib\site-packages\scipy\stats\_stats_py.py:5283: RuntimeWarning: overflow encountered in longlong_scalars
  (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))
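
(Illustration added for context; not part of the original report.) The tie counts feeding that term are very large for this example: with 2*10**5 samples over 7 classes, each value repeats roughly 28,500 times. A rough check, reusing x and y from the reproducing example above (the tied_pairs helper is added here purely for illustration), shows that 2 * xtie * ytie exceeds the int64 maximum of about 9.2e18:

import numpy as np

def tied_pairs(values):
    # Number of tied pairs, essentially what kendalltau computes
    # internally for the tie correction (helper added for illustration).
    cnt = np.bincount(values)
    return int((cnt * (cnt - 1) // 2).sum())

# Reusing x and y from the reproducing example above.
xtie, ytie = tied_pairs(x), tied_pairs(y)
print(xtie, ytie)                                # each about 2.86e9
print(2 * xtie * ytie > np.iinfo(np.int64).max)  # True: the product overflows int64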

SciPy/NumPy/Python version and system information

1.10.0 1.23.5 sys.version_info(major=3, minor=8, micro=16, releaselevel='final', serial=0)
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/denni/.conda/envs/inv\\Library\\include']
dennisdeh added the defect label on Mar 12, 2023
@mdhaber
Contributor

mdhaber commented Mar 13, 2023

The warning indicates the line it's coming from.

scipy/scipy/stats/_stats_py.py, lines 5885 to 5886 in 0a017f2:

var = ((m * (2*size + 5) - x1 - y1) / 18 +
       (2 * xtie * ytie) / m + x0 * y0 / (9 * m * (size - 2)))

which is in the p-value calculation. In this case, the statistic is OK because the sample is not large enough for those parts of the calculation to overflow.

A quick fix could be to make sure that the NumPy integers (e.g. xtie, ytie) are Python ints, which won't overflow.
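
A minimal sketch of that idea (stand-in values, not the actual patch from the linked PR), showing how casting the int64 tie counts to Python ints keeps the intermediate product exact:

import numpy as np

# Stand-in values with the magnitudes from this report's example
# (about 2e5 samples over 7 classes); inside kendalltau these
# quantities arrive as NumPy int64 scalars.
size = 200_000
m = size * (size - 1) // 2            # number of pairs, a Python int here
xtie = np.int64(2_857_000_000)
ytie = np.int64(2_857_000_000)

# 2 * xtie * ytie is about 1.6e19, beyond the int64 maximum (~9.2e18),
# so the int64 product wraps around (NumPy also warns about it).
bad = (2 * xtie * ytie) / m

# Quick fix: cast to Python ints first; Python ints never overflow.
good = (2 * int(xtie) * int(ytie)) / m

print(bad, good)    # the first value is nonsense, the second is correct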

The older implementation differs from the current one, which changed a couple of versions ago and by default employs an approximation for large samples.

The "exact" p-value calculation is not exact when there are ties, so a normal approximation that corrects for ties is used. Whether this is actually more accurate than using the "exact" method (even when it is not truly exact) depends on the nature of the ties. In the meantime, there is permutation_test, and hopefully we can resolve the interface question (gh-18067) to allow the user to select it as a method for computing the p-value.

@mdhaber
Contributor

mdhaber commented Mar 19, 2023

@dennisdeh did you want to submit a PR or should I fix it?

@dennisdeh
Contributor Author

@dennisdeh did you want to submit a PR or should I fix it?

Thanks for fixing it, mdhaber
