# stats.ks_2samp() misleading for small data sets. #9786

Closed
opened this issue Feb 8, 2019 · 1 comment

### pvanmulbregt commented Feb 8, 2019

scipy.stats.ks_2samp(x1, x2) implements the two-sided two-sample Kolmogorov-Smirnov test to compare two data sets. It uses an asymptotic formula, which is best suited to large data sets. For small data sets, the returned p-values are quite far from the exact values.

### Reproducing code example:

```python
>>> import scipy.stats
>>> scipy.stats.ks_2samp([0],[1])
Ks_2sampResult(statistic=1.0, pvalue=0.28904142837082675)
```

The statistic=1.0 is correct. However, the actual pvalue is 1.0.
The ECDF of a sample with a single observation is a step function with one step at that observation. The difference between two such ECDFs (for distinct observations) has two steps, and its maximum absolute value is D_{1,1}=1.0. I.e. no matter what the observations are, D_{1,1} is always 1.0, hence Pr(D_{1,1} >= 1) = 1, which is not close to 0.289.
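
The same counting argument can be sketched for general sample sizes. This assumes continuous distributions (so ties have probability zero) and writes n and m for the two sample sizes, which is notation not used above. Under the null hypothesis all \binom{n+m}{n} interleavings of the pooled observations are equally likely, and D_{n,m}=1 only when one sample lies entirely below the other, which happens in exactly two interleavings:

```latex
\Pr(D_{n,m} \ge 1) = \frac{2}{\binom{n+m}{n}},
\qquad\text{e.g.}\quad
\Pr(D_{1,1} \ge 1) = \frac{2}{\binom{2}{1}} = 1 .
```

For n=1, m=2 this gives 2/3, and for n=2, m=4 it gives 2/15.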

Below are some more examples:

```python
>>> scipy.stats.ks_2samp([0],[1,2])
Ks_2sampResult(1.0, 0.2013129706454185)
# Correct is 0.6666666666666667

>>> scipy.stats.ks_2samp([1,2], [3,4,5,6])
Ks_2sampResult(1.0, 0.04686590906054742)
# Correct is 0.1333333333333333
```

I.e. there are plenty of examples where the returned pvalue is small, even below 5%, a common significance threshold for experimental results, while the true value is much greater.
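
One way to check the "correct" values quoted above is to enumerate every split of the pooled ranks by brute force. The sketch below is not scipy's implementation. The helper names `ks_statistic` and `exact_ks_pvalue` are made up for illustration, and it assumes there are no ties between the two samples. It is only practical for very small samples, since the number of splits grows combinatorially.

```python
from fractions import Fraction
from itertools import combinations


def ks_statistic(x, y):
    """Two-sided two-sample KS statistic: max |ECDF_x(t) - ECDF_y(t)|."""
    n, m = len(x), len(y)
    d = Fraction(0)
    # The ECDF difference only changes at observed values, so check those.
    for t in sorted(set(x) | set(y)):
        fx = Fraction(sum(v <= t for v in x), n)
        fy = Fraction(sum(v <= t for v in y), m)
        d = max(d, abs(fx - fy))
    return d


def exact_ks_pvalue(x, y):
    """Pr(D >= observed D), treating every split of the pooled ranks
    into groups of size len(x) and len(y) as equally likely.

    Hypothetical helper; assumes no ties between x and y.
    """
    n, m = len(x), len(y)
    d_obs = ks_statistic(x, y)
    ranks = range(n + m)
    hits = total = 0
    for chosen in combinations(ranks, n):
        chosen_set = set(chosen)
        rest = [r for r in ranks if r not in chosen_set]
        total += 1
        hits += ks_statistic(chosen, rest) >= d_obs
    return Fraction(hits, total)


print(exact_ks_pvalue([0], [1]))              # 1
print(exact_ks_pvalue([0], [1, 2]))           # 2/3
print(exact_ks_pvalue([1, 2], [3, 4, 5, 6]))  # 2/15
```

Exact fractions are used instead of floats so that the comparison `>= d_obs` is not affected by rounding.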

### Scipy/Numpy/Python version information:

1.3.0.dev0+e119b57 1.16.0 sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
added this to the 1.3.0 milestone Apr 24, 2019

### tylerjereddy commented Apr 25, 2019

 Should be fixed by #9753, which I've now merged.