# Test 35: Kolmogorov–Smirnov test for goodness of fit

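**Note**: The simulated 95th percentile of $D$ (about 0.092) does not exactly match the tabulated critical value (about 0.086 for $n = 250$). The gap is probably overlooked in practice because the test needs so few assumptions. The simulation below also compares two empirical CDFs on a fixed grid rather than a sample against a theoretical CDF, which may account for part of the difference.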

## Objective

- I have a sample from some observed distribution
- I want to know whether it follows some pre-defined distribution

## Assumptions

- The population distribution is continuous

## Method

- You have a sample $x$ of size $n$
- Compute the empirical CDF of the sample, which we'll call $S_n(x)$
- Take the CDF of the reference distribution you want to compare it to, which we'll call $F(x)$
- Find the maximum absolute distance $D$ between the two functions, which is the test statistic
$$\begin{aligned}
    D &= \max_x | F(x) - S_n(x) |
\end{aligned}$$

- Compare $D$ against values in the Kolmogorov-Smirnov table (Table 16)

- If $D$ is greater than the critical value, we reject the null hypothesis that the distributions are the same
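
The steps above can be sketched in a few lines. This is a minimal illustration, not the table lookup itself: the helper name `ks_statistic`, the seed, and the standard normal reference are assumptions made for the example, and the critical value uses the large-sample approximation (about $1.36/\sqrt{n}$ at the 5% level) through SciPy's limiting Kolmogorov distribution `scipy.stats.kstwobign` instead of Table 16.

```python
import numpy as np
from scipy import stats


def ks_statistic(sample, cdf):
    """Return D = max_x |F(x) - S_n(x)| for a sample and a reference CDF F."""
    x = np.sort(np.asarray(sample))
    n = len(x)
    f = cdf(x)
    # S_n is a step function, so the largest gap occurs at a sample point:
    # check the gap just after each jump (d_plus) and just before it (d_minus).
    d_plus = np.max(np.arange(1, n + 1) / n - f)
    d_minus = np.max(f - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 250
sample = rng.normal(0, 1, n)    # hypothetical data for the example

d = ks_statistic(sample, stats.norm.cdf)

# Large-sample 5% critical value; close to 1.36 / sqrt(n)
critical_value = stats.kstwobign.ppf(0.95) / np.sqrt(n)

reject = d > critical_value
print(f"{d=:.4f}, {critical_value=:.4f}, {reject=}")
```

Evaluating the gap at the sorted sample points, rather than on a fixed grid, gives the exact value of $D$; `scipy.stats.kstest(sample, "norm").statistic` should return the same number.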

## Proof

In [1]:
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns

In [105]:
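n = 250
# Large-sample 5% critical value, used when n > 35
CRITICAL_VALUE_GT35 = 1.36/(n**0.5)
# CRITICAL_VALUE_35 = 0.23
print(f"{CRITICAL_VALUE_GT35=}")

# Fixed reference samples; their empirical CDFs stand in for F(x)
normal_distribution = np.random.normal(0,1,n)
# NOTE: despite its name, this is a Beta(1, 2) sample, the same distribution the test samples are drawn from
exponential_distribution = np.random.beta(1,2,n)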

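def get_test_statistic(comparison_distribution):
    """Approximate D between a fresh Beta(1, 2) sample and a fixed comparison sample."""
    sample = np.random.beta(1,2,n)
    # Grid on which both empirical CDFs are evaluated; D is only approximated on this grid
    val_range = np.linspace(-1,1,500)

    # Empirical CDF of the fresh sample, S_n(x)
    empirical_cdf_sample = np.array([
        np.sum(sample <= x) / len(sample)
        for x in val_range
    ])

    # Empirical CDF of the comparison sample, standing in for F(x)
    empirical_cdf_comparison = np.array([
        np.sum(comparison_distribution <= x) / len(comparison_distribution)
        for x in val_range
    ])

    # Alternative (unused) way to get the same CDFs via percentile ranks:
    # cdf_sample = np.array([scipy.stats.percentileofscore(sample, x)/100 for x in val_range])
    # comparison_dist = np.array([scipy.stats.percentileofscore(comparison_distribution, x)/100 for x in val_range])

    # Test statistic: largest absolute gap between the two CDFs on the grid
    return max(abs(empirical_cdf_comparison - empirical_cdf_sample))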

CRITICAL_VALUE_GT35=0.08601395235657992


In [106]:
# Reference is the normal sample, so the two distributions differ and D should be large
test_statistic_different_distribution = [get_test_statistic(normal_distribution) for _ in range(500)]
np.percentile(test_statistic_different_distribution, 95)

0.488

In [112]:
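# Reference is the Beta(1, 2) sample, so the null hypothesis holds and D should sit near the critical value
test_statistic_same_distribution = [get_test_statistic(exponential_distribution) for _ in range(500)]
np.percentile(test_statistic_same_distribution, 95)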

0.0921999999999999
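
The runs above measure the sample against the empirical CDF of another finite sample. One way to check whether that explains the gap between 0.092 and the tabulated 0.086 is to repeat the simulation against the theoretical Beta(1, 2) CDF. The sketch below does this with `scipy.stats.kstest`; the seed, the number of trials, and the helper name `simulated_d` are assumptions for the example, and no output is recorded here because it has not been run as part of these notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 250
n_trials = 500

# Theoretical reference distribution F(x), replacing the fixed empirical sample
reference = stats.beta(1, 2)


def simulated_d(rng, n):
    """Draw a Beta(1, 2) sample and return its exact one-sample KS statistic."""
    sample = rng.beta(1, 2, n)
    return stats.kstest(sample, reference.cdf).statistic


# Distribution of D under the null hypothesis
d_values = np.array([simulated_d(rng, n) for _ in range(n_trials)])

simulated_critical_value = np.percentile(d_values, 95)
tabulated_critical_value = 1.36 / np.sqrt(n)
print(f"{simulated_critical_value=:.4f}, {tabulated_critical_value=:.4f}")
```

If the empirical reference and the fixed grid are the cause of the mismatch, the simulated percentile here should land closer to the tabulated value. That is an expectation to verify, not a result.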