# stats.ks_2samp() misleading for small data sets. #9786

Closed
opened this issue Feb 8, 2019 · 1 comment

### pvanmulbregt commented Feb 8, 2019

`scipy.stats.ks_2samp(x1, x2)` implements the two-sided two-sample Kolmogorov-Smirnov test to compare two data sets. It uses an asymptotic formula that is best suited to large data sets. When given small data sets, the returned pvalues are quite far from correct.

### Reproducing code example:

``````
>>> import scipy.stats
>>> scipy.stats.ks_2samp(,)
Ks_2sampResult(statistic=1.0, pvalue=0.28904142837082675)
``````

The statistic=1 is correct. However, the actual pvalue is 1.0.
The ECDF of a sample with a single observation is a step function at that observation, so the difference between any 2 such ECDFs (for distinct observations) has 2 steps and a maximum absolute difference of `D_{1,1}=1.0`. That is, no matter what the observations are, the value `D_{1,1}` is always 1.0, hence `Pr(D_{1,1} >= 1) = 1`, which is not close to 0.289.
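The ECDF argument can be checked with a minimal sketch that computes the statistic directly from its definition. The helper name `ks_statistic` is made up for illustration and is not part of scipy; it makes no attempt to be efficient.

```python
def ks_statistic(x, y):
    """Two-sample KS statistic: max |ECDF_x(t) - ECDF_y(t)| over observed t."""
    def ecdf(sample, t):
        # Fraction of the sample that is <= t.
        return sum(v <= t for v in sample) / len(sample)

    # The ECDFs only change at observed values, so checking those suffices.
    points = sorted(set(x) | set(y))
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in points)


# Any pair of distinct single observations gives the same statistic.
for a, b in [(1, 2), (-3.5, 10), (100, 99)]:
    print(ks_statistic([a], [b]))  # 1.0 each time
```

With tied observations (for example `ks_statistic([5], [5])`) the statistic is 0.0, so the argument assumes the two observations differ.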

Below are some more examples:

``````
>>> scipy.stats.ks_2samp(,[1,2])
Ks_2sampResult(1.0, 0.2013129706454185)
# Correct is 0.6666666666666667

>>> scipy.stats.ks_2samp([1,2], [3,4,5,6])
Ks_2sampResult(1.0, 0.04686590906054742)
# Correct is 0.1333333333333333
``````
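The "correct" values quoted above can be reproduced by brute force. The sketch below assumes no ties and the usual null hypothesis that every assignment of the pooled ranks to the two samples is equally likely. The function name `exact_ks_pvalue` is hypothetical, and this is not necessarily how the fix in scipy computes the exact value.

```python
from fractions import Fraction
from itertools import combinations


def exact_ks_pvalue(m, n, d_obs):
    """Pr(D_{m,n} >= d_obs) under the null, assuming no ties.

    Enumerates every way of choosing which m of the m+n pooled ranks
    belong to the first sample, so it is only practical for small m, n.
    """
    total = 0
    extreme = 0
    for first in combinations(range(m + n), m):
        chosen = set(first)
        cdf1 = cdf2 = Fraction(0)
        d = Fraction(0)
        # Walk the pooled ranks in order, stepping whichever ECDF owns each rank.
        for rank in range(m + n):
            if rank in chosen:
                cdf1 += Fraction(1, m)
            else:
                cdf2 += Fraction(1, n)
            d = max(d, abs(cdf1 - cdf2))
        total += 1
        extreme += d >= d_obs
    return Fraction(extreme, total)


print(exact_ks_pvalue(1, 1, 1))  # 1
print(exact_ks_pvalue(1, 2, 1))  # 2/3
print(exact_ks_pvalue(2, 4, 1))  # 2/15
```

Using `Fraction` keeps the results exact. For a statistic of 1, only the two fully separated arrangements count, which gives `2 / C(m+n, m)`.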

That is, there are plenty of examples where the returned pvalue is small, even below 5% (a common threshold for experimental results), but the true value is much greater.
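To see where the small values may come from, here is a rough sketch of an asymptotic approximation. It rests on an assumption not stated in this issue: that the formula is the limiting Kolmogorov distribution evaluated at `(en + 0.12 + 0.11/en) * D` with `en = sqrt(m*n/(m+n))`. The function names are made up for illustration. Under that assumption the results come out close to the three pvalues reported above.

```python
import math


def kolmogorov_sf(x, terms=100):
    """Survival function of the limiting Kolmogorov distribution:
    Q(x) = 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * x**2).
    """
    return 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
        for k in range(1, terms + 1)
    )


def asymptotic_pvalue(m, n, d):
    # ASSUMPTION: effective sample size plus a small-sample correction factor.
    en = math.sqrt(m * n / (m + n))
    return kolmogorov_sf((en + 0.12 + 0.11 / en) * d)


print(asymptotic_pvalue(1, 1, 1.0))  # roughly 0.289
print(asymptotic_pvalue(1, 2, 1.0))  # roughly 0.201
print(asymptotic_pvalue(2, 4, 1.0))  # roughly 0.047
```

The last case falls below 0.05 even though the exact value is 2/15, about 0.133.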

### Scipy/Numpy/Python version information:

``````
1.3.0.dev0+e119b57 1.16.0 sys.version_info(major=3, minor=6, micro=6, releaselevel='final', serial=0)
``````

### tylerjereddy commented Apr 25, 2019

 Should be fixed by #9753, which I've now merged.