# Kolmogorov-Smirnov Test


The Kolmogorov-Smirnov (KS) test is a non-parametric test for the equality of two continuous, one-dimensional probability distributions. It quantifies the distance between the distributions as the maximum absolute difference between their cumulative distribution functions. The two distributions can come from two different samples, **or one can be a sample and the other a theoretical distribution**. Let us test whether our generated normal random variable follows a normal distribution. `st.kstest` is the function that performs the KS test.

![KS Hypothesis](KS_plot.png)

The graph shows two curves: the red line is the **CDF** (**cumulative distribution function**) of the theoretical distribution, while the blue one is the **empirical CDF**, that is, the cumulative distribution of the sample.

The test answers the question "How likely is it that we would see a collection of samples like this if they were drawn from that probability distribution?" or, in the two-sample case, "How likely is it that we would see two sets of samples like this if they were drawn from the same (but unknown) probability distribution?"
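
To make the "maximum vertical distance" concrete, the one-sample statistic can be computed by hand and compared with SciPy. This is a minimal sketch, not the library's own code: the fixed seed and the names `sample`, `d_plus`, `d_minus` and `D_manual` are assumptions introduced for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)  # fixed seed, assumed here only for reproducibility
sample = np.sort(rng.standard_normal(1000))
n = len(sample)

# Theoretical (standard normal) CDF evaluated at each sorted observation
F = st.norm.cdf(sample)

# The empirical CDF jumps from (i-1)/n to i/n at the i-th sorted observation,
# so the gap has to be checked on both sides of every jump.
d_plus = np.max(np.arange(1, n + 1) / n - F)
d_minus = np.max(F - np.arange(0, n) / n)
D_manual = max(d_plus, d_minus)

D_scipy = st.kstest(sample, 'norm').statistic
print(D_manual, D_scipy)
```

Both values should agree up to floating-point error, which shows that `D` is simply the largest gap between the empirical step function and the theoretical curve.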

In [5]:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import pandas as pd

In [6]:
# generate normally distributed random numbers, similar to R's rnorm(),
# with mean 0 and standard deviation 1
x = np.random.randn(1000) # or np.random.normal(), specifying mean and standard deviation

# Kolmogorov-Smirnov test
# D is the KS statistic: the greatest vertical distance between the empirical
# distribution of x and the theoretical 'norm' distribution
D, p = st.kstest(x, 'norm') # p -> p-value
print(p)

0.7186759915704102


![KS Hypothesis](hypothesis_Kolmogorov_Smirnov.png)

We get a p-value higher than the threshold, so we fail to reject the null hypothesis: the generated sample is consistent with a normal distribution (a high p-value does not prove normality). We can also test a uniformly distributed random variable against the normal distribution. In that case we get a p-value below the threshold, so we reject the hypothesis that those random numbers are normal.
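
The decision rule described above can be sketched as follows. The significance level `alpha = 0.05` is an assumption, since the text does not state which threshold is used; the seed and the `samples` and `decisions` names are also illustrative.

```python
import numpy as np
import scipy.stats as st

alpha = 0.05  # assumed significance level; the text does not state the threshold

rng = np.random.default_rng(1)  # fixed seed, assumed only for reproducibility
samples = {
    'normal': rng.standard_normal(1000),
    'uniform': rng.random(1000),
}

decisions = {}
for name, values in samples.items():
    # H0: the sample was drawn from a standard normal distribution
    D, p = st.kstest(values, 'norm')
    decisions[name] = 'reject H0' if p < alpha else 'fail to reject H0'
    print(f'{name}: D={D:.3f}, p={p:.3g} -> {decisions[name]}')
```

Note that a truly normal sample is still rejected in roughly a fraction `alpha` of runs, which is why "fail to reject" is the most that a large p-value supports.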

In [7]:
# generate random numbers in the interval [0, 1) from a uniform distribution
y = np.random.rand(1000)

D, p = st.kstest(y,'norm')
print(p)

8.165634881904524e-232


In [8]:
# two-sample KS test: compare the samples y and x directly

D,p = st.kstest(y,x)
print(D)
print(p)

0.494
3.817701049123875e-111
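
When the second argument of `st.kstest` is an array rather than a distribution name, a two-sample test is performed. The sketch below compares that call with the dedicated `st.ks_2samp` function; it assumes a SciPy version recent enough to accept an array as the second argument, and the seed is illustrative.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)  # fixed seed, assumed only for reproducibility
x = rng.standard_normal(1000)   # normal sample
y = rng.random(1000)            # uniform sample on [0, 1)

res_kstest = st.kstest(y, x)    # array as second argument -> two-sample test
res_2samp = st.ks_2samp(y, x)   # dedicated two-sample function

print(res_kstest.statistic, res_2samp.statistic)
print(res_kstest.pvalue, res_2samp.pvalue)
```

Calling `st.ks_2samp` makes the intent explicit, which may be preferable when both inputs are samples.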


In [10]:
# Generate two example data samples (you can replace them with your own data)
data1 = np.random.normal(0, 1, 1000)  # Sample 1 (standard normal distribution)
data2 = np.random.normal(0.5, 1, 1000)  # Sample 2 (normal distribution with the mean shifted to 0.5)

# Sort the data in both samples
data1 = np.sort(data1)
data2 = np.sort(data2)

# Compute the empirical CDF value at each sorted observation (i/n for the i-th smallest value)
cdf1 = np.arange(1, len(data1) + 1) / len(data1)
cdf2 = np.arange(1, len(data2) + 1) / len(data2)

In [16]:
# Create a DataFrame from the arrays; the column names are given
# as the keys of a dictionary
df = pd.DataFrame({
    'Muestra1': data1,
    'Muestra2': data2,
    'CDF1': cdf1,
    'CDF2': cdf2
})

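# Compute the KS statistic: the largest vertical gap between the two empirical CDFs.
# cdf1 and cdf2 are indexed by rank, not by value, so cdf1 - cdf2 is always zero;
# both empirical CDFs must be evaluated at the same x values instead.
grid = np.concatenate([data1, data2])
ecdf1 = np.searchsorted(data1, grid, side='right') / len(data1)
ecdf2 = np.searchsorted(data2, grid, side='right') / len(data2)
ks_statistic = np.max(np.abs(ecdf1 - ecdf2))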

In [None]:
# Draw the KS plot
plt.plot(data1, cdf1, label='Sample 1')
plt.plot(data2, cdf2, label='Sample 2')
plt.title('KS plot')
plt.xlabel('Values')
plt.ylabel('Cumulative distribution')
plt.legend(loc='best')

# Show the value of the KS statistic on the plot
plt.annotate(f'KS Statistic: {ks_statistic:.3f}', xy=(0.5, 0.8), xycoords='axes fraction')

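# Mark where the KS statistic occurs: draw a vertical line at the x value
# where the two empirical CDFs differ the most
xs = np.concatenate([data1, data2])
gap = np.abs(np.searchsorted(data1, xs, side='right') / len(data1)
             - np.searchsorted(data2, xs, side='right') / len(data2))
plt.axvline(xs[np.argmax(gap)], color='red', linestyle='--', label='Location of KS statistic')
plt.legend()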

# Show the plot
plt.show()
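
A hand-computed two-sample statistic can be cross-checked against SciPy. The helper `ks_two_sample_statistic` below is a hypothetical name introduced for this sketch, and the seed is an assumption; the idea is to evaluate both empirical CDFs at every observed value and take the largest absolute gap.

```python
import numpy as np
import scipy.stats as st

def ks_two_sample_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of a and b (illustrative helper)."""
    a = np.sort(a)
    b = np.sort(b)
    # Evaluate both step functions at every observed value of either sample
    grid = np.concatenate([a, b])
    ecdf_a = np.searchsorted(a, grid, side='right') / len(a)
    ecdf_b = np.searchsorted(b, grid, side='right') / len(b)
    return np.max(np.abs(ecdf_a - ecdf_b))

rng = np.random.default_rng(3)  # fixed seed, assumed only for reproducibility
data1 = rng.normal(0, 1, 1000)
data2 = rng.normal(0.5, 1, 1000)

D_manual = ks_two_sample_statistic(data1, data2)
D_scipy = st.ks_2samp(data1, data2).statistic
print(D_manual, D_scipy)
```

Subtracting two rank-indexed arrays such as `np.arange(1, n + 1) / n` would always give zero for samples of equal size, so evaluating both empirical CDFs on a shared set of x values is the step that makes the comparison meaningful.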