
Possible bugs in the calculation of p-values of kstest and ks_2samp #12212

Closed
pablogr opened this issue May 25, 2020 · 5 comments
pablogr commented May 25, 2020

My issue is about the calculation of the p-values using the scipy.stats functions kstest and ks_2samp.

Reproducing code example:

import numpy as np
import pandas as pd
from scipy import stats

rvs1 = np.array([-0.0033, -0.0025, -0.0020, -0.00085, 0.0, 0.001, 0.0223, 1.05])
rvs1 = np.sort(rvs1)

result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='greater')
print("  Continuous scipy greater:   ", result_cont[0], result_cont[1])
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='less')
print("  Continuous scipy less:   ", result_cont[0], result_cont[1])
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='two-sided')
print("  Continuous scipy 2-sided:   ", result_cont[0], result_cont[1], "\n")


np.random.seed(seed=233423)
for siz in [10, 100, 1000, 10000, 100000, 1000000]:
    print("Size:", siz)
    result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22))
    print("  Continuous scipy: ", result_cont[0], result_cont[1])
    result_cont = result_cont[0]
    # ks_pvalue is a user-defined helper (not included in this report)
    print("; p-value from PGR formula", ks_pvalue(0, len(rvs1), result_cont))
    rvs2 = stats.cauchy.rvs(size=siz, loc=1.1, scale=2.22)
    rvs2 = np.sort(rvs2)
    result_discr = stats.ks_2samp(rvs1, rvs2)
    print("  Discrete scipy1:  ", result_discr)
    result_discr2 = stats.mstats.ks_twosamp(rvs1, rvs2)
    print("  Discrete scipy2:  ", result_discr2[0], result_discr2[1])
    # kolmogorov_test_empirical_distribs is also a user-defined helper (not included)
    tspgr, pvpgr = kolmogorov_test_empirical_distribs(pd.DataFrame(rvs1, columns=["residual"]), pd.DataFrame(rvs2, columns=["residual"]))
    print("  Discrete  PGR:    ", tspgr, "; p-value PGR:", pvpgr)

I obtain the output below:

Continuous scipy greater: 0.518857086814109 0.007904829594641602
Continuous scipy less: 0.3531858417269593 0.10657179549539793
Continuous scipy 2-sided: 0.518857086814109 0.015809659189283204

Size: 10
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.7, pvalue=0.012020659079482687)
Discrete scipy2: 0.7 0.025670559472125747
Discrete PGR: 0.7 ; p-value PGR: 0.025670559472125747
Size: 100
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.49, pvalue=0.03797240632133647)
Discrete scipy2: 0.4899999999999997 0.05704510190829596
Discrete PGR: 0.49 ; p-value PGR: 0.05704510190829575
Size: 1000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.493, pvalue=0.026752661574814707)
Discrete scipy2: 0.4929999999999997 0.0422233090117905
Discrete PGR: 0.493 ; p-value PGR: 0.04222330901179031
Size: 10000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
Discrete scipy2: 0.5207000000000227 0.026214482841187867
Discrete PGR: 0.5206999999999999 ; p-value PGR: 0.026214482841197786
Size: 100000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
Discrete scipy2: 0.5201199999997861 0.02638641717410287
Discrete PGR: 0.52012 ; p-value PGR: 0.026386417174008938
Size: 1000000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.5186660000000001, pvalue=0.027023415027206087)
Discrete scipy2: 0.5186660000026049 0.027023415026037792
Discrete PGR: 0.5186660000000001 ; p-value PGR: 0.027023415027206062

The problems are that:

  • The p-values do not seem to be consistent with Smirnov's formula (the results above labelled "p-value from PGR formula", which match the values given for large sizes by ks_twosamp, shown above as "Discrete scipy2").

  • The ks_2samp function seems to provide unstable p-values. In the example above (reproduced below) you can see that a tiny variation in the test statistic leads to a large change in the p-value:
    Size: 100: Discrete scipy1: Ks_2sampResult(statistic=0.49, pvalue=0.03797240632133647)
    Size: 1000: Discrete scipy1: Ks_2sampResult(statistic=0.493, pvalue=0.026752661574814707)
    Size: 10000: Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
    Size: 100000: Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
    Moreover, the test statistic increases monotonically with the sample size, but the p-values both decrease and increase.

  • The results of kstest and ks_2samp are inconsistent with each other. As can be seen above:
    Continuous scipy: 0.518857086814109 0.015809659189283204
    Size: 10000: Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
    Size: 100000: Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
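For reference, the "p-value from PGR formula" numbers above appear to correspond to Smirnov's classical asymptotic formula for the two-sided KS statistic, p ≈ 2·Σ_{k≥1} (−1)^(k−1)·exp(−2k²λ²) with λ = D·√(nm/(n+m)). A minimal sketch of that formula (the function name smirnov_asymptotic_pvalue is made up for illustration; the actual PGR helpers were not posted):

```python
import numpy as np

def smirnov_asymptotic_pvalue(d, n, m, terms=100):
    """Two-sided asymptotic KS p-value (Smirnov's limit law), hypothetical helper.

    p ~= 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * lam**2),
    with lam = d * sqrt(n*m/(n+m)) for samples of sizes n and m.
    """
    lam = d * np.sqrt(n * m / (n + m))
    k = np.arange(1, terms + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * lam) ** 2))
    return float(min(max(p, 0.0), 1.0))

# Size 100000 case from the output above: statistic 0.52012, n=8, m=100000
print(smirnov_asymptotic_pvalue(0.52012, 8, 100000))  # ≈ 0.0264, cf. "Discrete scipy2" above
```

Evaluated on the size-100000 case it reproduces the ks_twosamp p-value, which supports the reporter's claim that the "PGR formula" and ks_twosamp agree for large samples.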

Scipy/Numpy/Python version information:

Scipy version: 1.3.1
Numpy version: 1.16.5

@josef-pkt
Member

AFAICS, you never increase the sample size in the first case inside the loop, so the values never change:
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22))

@mdhaber
Contributor

mdhaber commented Nov 29, 2020

@josef-pkt I think there may have been a point here, but it looks like it has been resolved. Can you check my understanding before we close?

@pablogr lists three separate problems, but the first two:

  • The p-values produced by ks_2samp do not seem to be consistent with Smirnov's formula
  • The ks_2samp function seems to provide unstable p-values.

might be summarized: ks_2samp doesn't agree with mstats.ks_twosamp.

There might have been a bug here in SciPy 1.3, but there doesn't seem to be a problem now: stats.ks_2samp now agrees closely with stats.mstats.ks_twosamp.

The third problem was that the p-values of the two-sample test were not converging to the value of the continuous test: rvs1 was supposed to stay the same, while rvs2 was supposed to approximate the continuous Cauchy distribution.

This no longer seems to be a problem either. I think this script shows that these issues are resolved:

import numpy as np
from scipy.stats import ks_2samp as ks1, kstest as ks0, cauchy
from scipy.stats.mstats import ks_twosamp as ks2
np.random.seed(1234321)

rvs1 = np.array([-0.0033, -0.0025, -0.0020, -0.00085, 0.0, 0.001, 0.0223, 1.05])
rvs1 = np.sort(rvs1)

N = 100000
rvs2 = cauchy.rvs(size=N, loc=1.1, scale=2.22)
rvs2 = np.sort(rvs2)

res0 = ks0(rvs1, "cauchy", args=(1.1,2.22), alternative='two-sided')
res1 = ks1(rvs1, rvs2)
res2 = ks2(rvs1, rvs2)

print(res0)
print(res1)
print(res2)

I get:

KstestResult(statistic=0.518857086814109, pvalue=0.015809659189283204)
KstestResult(statistic=0.5182899999999999, pvalue=0.015984451764390086)
KstestResult(statistic=0.5182899999999999, pvalue=0.015984451764390086)

@mdhaber
Contributor

mdhaber commented Nov 29, 2020

@pvanmulbregt I saw your new PR about ks_2samp. Do you agree that this issue has been resolved already?

@pvanmulbregt
Contributor

stats.ks_2samp and stats.mstats.ks_2samp both resolve to the same code, so there should be no difference.
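A quick sanity check consistent with that statement (made-up data; this just verifies the two entry points return matching results in a current SciPy):

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.stats.mstats import ks_twosamp

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(loc=0.5, size=60)

d1, p1 = ks_2samp(x, y)       # masked-array-agnostic entry point
d2, p2 = ks_twosamp(x, y)     # mstats alias, dispatches to the same implementation
print(d1, p1)
print(d2, p2)
```

On unmasked arrays the two calls should print identical statistic/p-value pairs.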

@mdhaber
Contributor

mdhaber commented Dec 4, 2020

OK, thanks for submitting @pablogr, but we think this has been resolved. Please re-open if you disagree.

@mdhaber mdhaber closed this as completed Dec 4, 2020
5 participants