
Possible bugs in the calculation of p-values of kstest and ks_2samp #12212

Closed
pablogr opened this issue May 25, 2020 · 5 comments
pablogr commented May 25, 2020

My issue is about the calculation of the p-values using the scipy.stats functions kstest and ks_2samp.

Reproducing code example:

import numpy as np
import pandas as pd
from scipy import stats

rvs1 = np.array([-0.0033, -0.0025, -0.0020, -0.00085, 0.0, 0.001, 0.0223, 1.05])
rvs1 = np.sort(rvs1)

result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='greater')
print("  Continuous scipy greater:   ", result_cont[0], result_cont[1])
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='less')
print("  Continuous scipy less:   ", result_cont[0], result_cont[1])
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22), alternative='two-sided')
print("  Continuous scipy 2-sided:   ", result_cont[0], result_cont[1], "\n")


np.random.seed(seed=233423)
for siz in [10, 100, 1000, 10000, 100000, 1000000]:
    print("Size:", siz)
    result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22))
    print("  Continuous scipy: ", result_cont[0], result_cont[1])
    result_cont = result_cont[0]
    # ks_pvalue is a user-defined helper (not included in this report)
    print("; p-value from PGR formula", ks_pvalue(0, len(rvs1), result_cont))
    rvs2 = stats.cauchy.rvs(size=siz, loc=1.1, scale=2.22)
    rvs2 = np.sort(rvs2)
    result_discr = stats.ks_2samp(rvs1, rvs2)
    print("  Discrete scipy1:  ", result_discr)
    result_discr2 = stats.mstats.ks_twosamp(rvs1, rvs2)
    print("  Discrete scipy2:  ", result_discr2[0], result_discr2[1])
    # kolmogorov_test_empirical_distribs is also a user-defined helper (not included)
    tspgr, pvpgr = kolmogorov_test_empirical_distribs(pd.DataFrame(rvs1, columns=["residual"]), pd.DataFrame(rvs2, columns=["residual"]))
    print("  Discrete  PGR:    ", tspgr, "; p-value PGR:", pvpgr)

I obtain the output below:

Continuous scipy greater: 0.518857086814109 0.007904829594641602
Continuous scipy less: 0.3531858417269593 0.10657179549539793
Continuous scipy 2-sided: 0.518857086814109 0.015809659189283204

Size: 10
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.7, pvalue=0.012020659079482687)
Discrete scipy2: 0.7 0.025670559472125747
Discrete PGR: 0.7 ; p-value PGR: 0.025670559472125747
Size: 100
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.49, pvalue=0.03797240632133647)
Discrete scipy2: 0.4899999999999997 0.05704510190829596
Discrete PGR: 0.49 ; p-value PGR: 0.05704510190829575
Size: 1000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.493, pvalue=0.026752661574814707)
Discrete scipy2: 0.4929999999999997 0.0422233090117905
Discrete PGR: 0.493 ; p-value PGR: 0.04222330901179031
Size: 10000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
Discrete scipy2: 0.5207000000000227 0.026214482841187867
Discrete PGR: 0.5206999999999999 ; p-value PGR: 0.026214482841197786
Size: 100000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
Discrete scipy2: 0.5201199999997861 0.02638641717410287
Discrete PGR: 0.52012 ; p-value PGR: 0.026386417174008938
Size: 1000000
Continuous scipy: 0.518857086814109 0.015809659189283204
; p-value from PGR formula 0.02693690263781148
Discrete scipy1: Ks_2sampResult(statistic=0.5186660000000001, pvalue=0.027023415027206087)
Discrete scipy2: 0.5186660000026049 0.027023415026037792
Discrete PGR: 0.5186660000000001 ; p-value PGR: 0.027023415027206062

The problems are that:

  • The p-values do not seem to be consistent with Smirnov's formula (the results above labelled "p-value from PGR formula", which match the values given for large sizes by ks_twosamp, shown above as "Discrete scipy2").

  • The ks_2samp function seems to provide unstable p-values. In the example above (reproduced below) you can see that a tiny variation in the test statistic leads to a large change in the p-value:
    Size: 100: Discrete scipy1: Ks_2sampResult(statistic=0.49, pvalue=0.03797240632133647)
    Size: 1000: Discrete scipy1: Ks_2sampResult(statistic=0.493, pvalue=0.026752661574814707)
    Size: 10000: Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
    Size: 100000: Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
    Moreover, the test statistic increases monotonically with the sample size, but the p-values both decrease and increase.

  • The results of kstest and ks_2samp are inconsistent with each other. As can be seen above:
    Continuous scipy: 0.518857086814109 0.015809659189283204
    Size: 10000: Discrete scipy1: Ks_2sampResult(statistic=0.5207, pvalue=0.015325026615326087)
    Size: 100000: Discrete scipy1: Ks_2sampResult(statistic=0.52012, pvalue=0.026386417174008917)
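For reference, the "p-value from PGR formula" numbers above appear to correspond to Smirnov's classical asymptotic formula for the two-sided KS statistic, p ≈ 2·Σ_{k≥1} (−1)^(k−1)·exp(−2k²λ²) with λ = D·√(nm/(n+m)). A minimal sketch of that formula (the function name smirnov_asymptotic_pvalue is made up for illustration; the actual PGR helpers were not posted):

```python
import numpy as np

def smirnov_asymptotic_pvalue(d, n, m, terms=100):
    """Two-sided asymptotic KS p-value (Smirnov's limit law), hypothetical helper.

    p ~= 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * lam**2),
    with lam = d * sqrt(n*m/(n+m)) for samples of sizes n and m.
    """
    lam = d * np.sqrt(n * m / (n + m))
    k = np.arange(1, terms + 1)
    p = 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * (k * lam) ** 2))
    return float(min(max(p, 0.0), 1.0))

# Size 100000 case from the output above: statistic 0.52012, n=8, m=100000
print(smirnov_asymptotic_pvalue(0.52012, 8, 100000))  # ≈ 0.0264, cf. "Discrete scipy2" above
```

Evaluated on the size-100000 case it reproduces the ks_twosamp p-value, which supports the reporter's claim that the "PGR formula" and ks_twosamp agree for large samples.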

Scipy/Numpy/Python version information:

Scipy version: 1.3.1
Numpy version: 1.16.5

@josef-pkt
Member

AFAICS, you never increase the sample size in the first case inside the loop, so the values never change:
result_cont = stats.kstest(rvs1, "cauchy", args=(1.1, 2.22))

@mdhaber
Contributor

mdhaber commented Nov 29, 2020

@josef-pkt I think there may have been a point here, but it looks like it has been resolved. Can you check my understanding before we close?

@pablogr lists three separate problems, but the first two:

  • The p-values produced by ks_2samp do not seem to be consistent with Smirnov's formula
  • The ks_2samp function seems to provide unstable p-values.

might be summarized: ks_2samp doesn't agree with mstats.ks_twosamp.

There might have been a bug here in SciPy 1.3, but there doesn't seem to be a problem now: stats.ks_2samp now agrees closely with stats.mstats.ks_twosamp.

The third problem was that the p-values of the two-sample test were not converging to the value of the continuous test: rvs1 was supposed to stay the same, while rvs2 was supposed to approximate the continuous Cauchy distribution.

This no longer seems to be a problem either. I think this script shows that these issues are resolved:

import numpy as np
from scipy.stats import ks_2samp as ks1, kstest as ks0, cauchy
from scipy.stats.mstats import ks_twosamp as ks2
np.random.seed(1234321)

rvs1 = np.array([-0.0033, -0.0025, -0.0020, -0.00085, 0.0, 0.001, 0.0223, 1.05])
rvs1 = np.sort(rvs1)

N = 100000
rvs2 = cauchy.rvs(size=N, loc=1.1, scale=2.22)
rvs2 = np.sort(rvs2)

res0 = ks0(rvs1, "cauchy", args=(1.1,2.22), alternative='two-sided')
res1 = ks1(rvs1, rvs2)
res2 = ks2(rvs1, rvs2)

print(res0)
print(res1)
print(res2)

I get:

KstestResult(statistic=0.518857086814109, pvalue=0.015809659189283204)
KstestResult(statistic=0.5182899999999999, pvalue=0.015984451764390086)
KstestResult(statistic=0.5182899999999999, pvalue=0.015984451764390086)

@mdhaber
Contributor

mdhaber commented Nov 29, 2020

@pvanmulbregt I saw your new PR about ks_2samp. Do you agree that this issue has been resolved already?

@pvanmulbregt
Contributor

stats.ks_2samp and stats.mstats.ks_2samp both resolve to the same code, so there should be no difference.
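A quick sanity check consistent with that statement (made-up data; this just verifies the two entry points return matching results in a current SciPy):

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.stats.mstats import ks_twosamp

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(loc=0.5, size=60)

d1, p1 = ks_2samp(x, y)       # masked-array-agnostic entry point
d2, p2 = ks_twosamp(x, y)     # mstats alias, dispatches to the same implementation
print(d1, p1)
print(d2, p2)
```

On unmasked arrays the two calls should print identical statistic/p-value pairs.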

@mdhaber
Contributor

mdhaber commented Dec 4, 2020

OK, thanks for submitting @pablogr, but we think this has been resolved. Please re-open if you disagree.

@mdhaber mdhaber closed this as completed Dec 4, 2020
5 participants