stats.ks_2samp() misleading for small data sets. #9786

Closed
pvanmulbregt opened this issue Feb 8, 2019 · 1 comment

@pvanmulbregt pvanmulbregt commented Feb 8, 2019

scipy.stats.ks_2samp(x1, x2) implements the two-sided two-sample Kolmogorov-Smirnov test to compare two data sets. It computes the pvalue with an asymptotic formula, which is only accurate for large data sets. For small data sets, the returned pvalue can be far from the correct one.

Reproducing code example:

>>> import scipy.stats
>>> scipy.stats.ks_2samp([0],[1])
Ks_2sampResult(statistic=1.0, pvalue=0.28904142837082675)

The statistic=1.0 is correct, but the actual pvalue is 1.0.
The ECDF of a sample with a single observation is a step function that jumps from 0 to 1 at that observation. For two distinct observations, one ECDF has already reached 1 while the other is still 0 on the interval between them, so the maximum absolute difference is D_{1,1}=1.0. I.e. no matter the (distinct) observations, D_{1,1} is always 1.0, hence Pr(D_{1,1} >= 1) = 1, which is not close to 0.289.

Below are some more examples:

>>> scipy.stats.ks_2samp([0],[1,2])
Ks_2sampResult(1.0, 0.2013129706454185)
# Correct pvalue is 0.6666666666666667 (2/3)

>>> scipy.stats.ks_2samp([1,2], [3,4,5,6])
Ks_2sampResult(1.0, 0.04686590906054742)
# Correct pvalue is 0.1333333333333333 (2/15)

I.e. there are plenty of examples where the returned pvalue is small, even below 5% (a common significance threshold for experimental results), while the true pvalue is much greater.
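The correct pvalues quoted above can be checked by brute force. The following is a minimal sketch, not scipy's implementation: it assumes the pooled observations contain no ties, so that under the null hypothesis every interleaving of the two samples is equally likely. The helper names `ks_statistic_from_positions` and `exact_ks_2samp_pvalue` are made up for illustration. The code enumerates all C(n+m, n) interleavings and counts the fraction whose statistic is at least the observed one, which is only feasible for very small n and m.

```python
from fractions import Fraction
from itertools import combinations


def ks_statistic_from_positions(positions, n, m):
    """Two-sample KS statistic D for one interleaving of the samples.

    `positions` are the ranks (0-based, in the pooled sorted order)
    occupied by the n observations of sample 1; the other m ranks
    belong to sample 2.
    """
    in_first = set(positions)
    seen1 = seen2 = 0
    d = Fraction(0)
    for rank in range(n + m):
        if rank in in_first:
            seen1 += 1
        else:
            seen2 += 1
        # Difference of the two ECDFs just after this pooled observation.
        d = max(d, abs(Fraction(seen1, n) - Fraction(seen2, m)))
    return d


def exact_ks_2samp_pvalue(n, m, d_obs):
    """Exact two-sided Pr(D_{n,m} >= d_obs), assuming no ties."""
    total = 0
    extreme = 0
    for positions in combinations(range(n + m), n):
        total += 1
        if ks_statistic_from_positions(positions, n, m) >= d_obs:
            extreme += 1
    return Fraction(extreme, total)


# Sample sizes and observed statistic (D = 1) from the three examples above.
for n, m in [(1, 1), (1, 2), (2, 4)]:
    p = exact_ks_2samp_pvalue(n, m, 1)
    print(n, m, p, float(p))
```

Exact rational arithmetic is used so that the comparison `>= d_obs` is not affected by floating-point rounding; pass `d_obs` as an `int` or `Fraction` rather than a float for the same reason.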

Scipy/Numpy/Python version information:

1.3.0.dev0+e119b57 1.16.0 sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)

@tylerjereddy tylerjereddy commented Apr 25, 2019

Should be fixed by #9753, which I've now merged.
