In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats
from astropy.io import fits

In [None]:
# reset default plotting values
plt.rcParams['figure.figsize'] = (10, 7)
plt.rc('font', family='sans-serif')
plt.rc('axes', labelsize=14)
plt.rc('axes', labelweight='bold')
plt.rc('axes', titlesize=16)
plt.rc('axes', titleweight='bold')
plt.rc('axes', linewidth=2)
plt.rc('xtick', labelsize=14)
plt.rc('ytick', labelsize=14)

# Kolmogorov–Smirnov (KS) Testing
## Checking for inconsistent distributions of data


### Prof. Robert Quimby
&copy; 2019 Robert Quimby

## In this tutorial you will...

- Discuss how to determine if two samples are drawn from the same distribution
- Plot histograms of the data values
- See that plotting the cumulative distributions is more revealing
- Learn about the Kolmogorov–Smirnov or "KS" Test

## Weakness of histograms in comparing distributions

In [None]:
n = ???
mean = ???
std = ???
sample = np.random.normal(mean, std, size=n)

In [None]:
plt.hist(sample);

## Plotting the parent distribution on top is not revealing

In [None]:
low = mean - 5 * std
high = mean + 5 * std
plt.hist(sample, range=????, bins=????)
xs = np.linspace(low, high, 100)
ys = n / np.sqrt(2 * np.pi * std**2) * np.exp(-(xs - mean)**2 / 2 / std**2)
plt.plot(xs, ys);

## Check the sample moments

In [None]:
np.mean(sample), np.std(sample), scipy.stats.skew(sample)

## Cumulative Distributions
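
One way to build the cumulative distribution of a sample is to sort the values and give the i-th smallest value the cumulative fraction i/n. The following is a minimal sketch, not the notebook's intended solution: the values of `n`, `mean`, and `std` are illustrative assumptions, and the seeded generator `rng` and the name `sorted_sample` do not appear in the original cells.

```python
import numpy as np

# Illustrative values only; the notebook leaves n, mean, and std open.
n, mean, std = 100, 0.0, 1.0
rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Sort the sample so the cumulative fraction rises from left to right.
sorted_sample = np.sort(rng.normal(mean, std, size=n))

# The i-th smallest value (i = 1..n) has a fraction i/n of the sample
# at or below it.
cfracs = np.arange(1, n + 1) / n
```

Plotting `cfracs` against `sorted_sample` as a step curve then needs no binning choices, which is the main advantage over a histogram.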

In [None]:
????
cfracs = ????
plt.plot(sample, cfracs, drawstyle='steps-mid');

In [None]:
plt.plot(sample, cfracs, drawstyle='steps-mid')

# compare to the parent distribution
cdf = lambda x: scipy.stats.norm.cdf(x, loc=mean, scale=std)
plt.plot(xs, cdf(xs));


## Kolmogorov–Smirnov (KS) Test

Can you confidently reject the **null hypothesis** that the sample is drawn from the parent population?

* [Test explanation from NIST](https://itl.nist.gov/div898/handbook/eda/section3/eda35g.htm)
* [KS-Test Wikipedia entry](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test)

In [None]:
# in python
????

The "KS statistic" is the maximum absolute vertical distance between the two curves: the sample's cumulative distribution and the parent distribution's CDF.
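
The maximum distance can be computed by hand and checked against `scipy.stats.kstest`. This is a sketch under assumed values (`n`, `mean`, `std`, and the seed are illustrative, and names such as `d_manual` are hypothetical). Because the sample's cumulative distribution jumps at each data point, the gap is checked both just after each jump (`d_plus`) and just before it (`d_minus`).

```python
import numpy as np
import scipy.stats

# Illustrative values only; not taken from the notebook.
n, mean, std = 100, 0.0, 1.0
rng = np.random.default_rng(42)
sorted_sample = np.sort(rng.normal(mean, std, size=n))

# Parent CDF evaluated at each sorted data point.
model = scipy.stats.norm.cdf(sorted_sample, loc=mean, scale=std)

# Largest gap with the empirical curve above the model (just after each step)
# and with the model above the empirical curve (just before each step).
d_plus = np.max(np.arange(1, n + 1) / n - model)
d_minus = np.max(model - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

# SciPy's one-sample KS test against the same fully specified normal.
result = scipy.stats.kstest(sorted_sample, 'norm', args=(mean, std))
print(d_manual, result.statistic, result.pvalue)
```

The hand-computed value and `result.statistic` should agree to floating-point precision.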

In [None]:
# KS statistic


The p-value is the probability that a sample of the same size, drawn at random from the parent distribution, would have a KS statistic as large as or larger than the value obtained.
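
This definition can be checked by brute force: draw many samples from the parent distribution, compute the KS statistic of each, and count how often it reaches the observed value. The sketch below assumes illustrative values for `n`, `mean`, and `std`; the helper `ks_statistic` and the trial count `n_trials` are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
import scipy.stats

def ks_statistic(values, cdf):
    """Maximum vertical gap between the empirical CDF of `values` and `cdf`."""
    x = np.sort(values)
    m = len(x)
    model = cdf(x)
    d_plus = np.max(np.arange(1, m + 1) / m - model)
    d_minus = np.max(model - np.arange(0, m) / m)
    return max(d_plus, d_minus)

# Illustrative values only; not taken from the notebook.
n, mean, std = 50, 0.0, 1.0
rng = np.random.default_rng(1)
parent_cdf = lambda x: scipy.stats.norm.cdf(x, loc=mean, scale=std)

# The "observed" sample and its KS statistic.
observed = rng.normal(mean, std, size=n)
d_obs = ks_statistic(observed, parent_cdf)

# KS statistics of many same-size samples drawn from the parent distribution.
n_trials = 2000
d_null = np.array([
    ks_statistic(rng.normal(mean, std, size=n), parent_cdf)
    for _ in range(n_trials)
])

# Fraction of trials with a statistic at least as large as the observed one.
p_mc = np.mean(d_null >= d_obs)
p_scipy = scipy.stats.kstest(observed, 'norm', args=(mean, std)).pvalue
print(p_mc, p_scipy)
```

The simulated fraction `p_mc` should land close to the p-value reported by `scipy.stats.kstest`, with some scatter from the finite number of trials.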

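The relation between the KS statistic (D) and the p-value depends on
- sample size
- the parent population's distribution, but only in special cases: for a fully specified continuous parent (e.g. Gaussian) the relation is the same whatever the shape, while it differs for a discrete parent (e.g. Poisson) or when the parent's parameters are estimated from the sample itself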
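
To illustrate the sample-size dependence, the sketch below evaluates the p-value for one fixed value of D at several sample sizes. It assumes a recent SciPy that provides `scipy.stats.kstwo`, the distribution of the two-sided one-sample KS statistic under the null hypothesis; the value `d = 0.1` and the list `sizes` are arbitrary choices made for illustration.

```python
import scipy.stats

d = 0.1                   # one fixed, illustrative KS statistic
sizes = [10, 100, 1000]   # illustrative sample sizes

# Survival function: probability of a KS statistic >= d for sample size `size`.
p_values = [scipy.stats.kstwo.sf(d, size) for size in sizes]

for size, p in zip(sizes, p_values):
    print(f"n = {size:5d}   D = {d}   p-value = {p:.3g}")
```

The same distance between the curves is unremarkable for a small sample but becomes strong evidence against the null hypothesis as the sample grows.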