# Test 28: Tukey test for multiple comparison of K population means (unequal sample sizes)

## Objective

- You have $K$ populations
- You want to know whether any of the population means differ significantly from one another

## Assumptions

- Each of the $K$ populations is normally distributed
- Each of the $K$ populations has the same variance

## Method

- You have one sample from each of the $K$ populations, with sizes $n_1, n_2, \ldots, n_K$

- For each of the $K$ populations, compute the sample mean $\bar{x}_{.j}$

- For each of the $K$ populations, compute the sample variance $s_j^2$ using
$$\begin{aligned}
    s_j^2 &= \frac{\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_{.j})^2}{n_j - 1}
\end{aligned}$$

- Then, compute the overall (pooled) variance of the samples, where $N = \sum_{j=1}^{K} n_j$ is the total number of observations, by taking
$$\begin{aligned}
    s^2 &= \frac{\sum_{j=1}^{K} (n_j - 1) \cdot s_j^2}{N - K}
\end{aligned}$$

- The critical value comes from the Studentized range distribution, with degrees of freedom $v$ given by
$$\begin{aligned}
    v &= \left(\sum_{j=1}^{K} n_j\right) - K
\end{aligned}$$

- As such, find the value $q$ from the Studentized range distribution for $K$ groups and $v$ degrees of freedom at significance level $\alpha$ (Table 9)
    - e.g. For $\alpha = 0.05$, $q = 5.29$

- Finally, compute
$$\begin{aligned}
    W &= \frac{q \cdot s}{\sqrt{\eta}} \\
    \eta &= \frac{K}{\frac{1}{n_1} + \frac{1}{n_2} + \cdots + \frac{1}{n_K}}
\end{aligned}$$

- If the largest absolute difference between any two sample means satisfies $\max_{i,j} |\bar{x}_i - \bar{x}_j| < W$, then the population means do not differ significantly
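
The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `tukey_w`, the three samples, and the value `q = 3.77` (taken here as the 5% Studentized range point for $K = 3$ and $v = 12$; confirm it against Table 9) are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean, variance


def tukey_w(samples, q):
    """Return (W, pooled variance s^2, eta, degrees of freedom v).

    `samples` is a list of K lists of observations; `q` is the
    Studentized range critical value looked up for K groups and v.
    """
    k = len(samples)
    sizes = [len(x) for x in samples]
    v = sum(sizes) - k  # v = N - K
    # Pooled variance: sample variances (n_j - 1 denominator) weighted by n_j - 1
    pooled = sum((n - 1) * variance(x) for n, x in zip(sizes, samples)) / v
    # eta is the harmonic mean of the sample sizes
    eta = k / sum(1 / n for n in sizes)
    w = q * sqrt(pooled) / sqrt(eta)
    return w, pooled, eta, v


# Hypothetical data with unequal sample sizes (4, 5 and 6 observations)
samples = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 16.0, 17.0],
    [11.0, 12.0, 10.0, 13.0, 12.0, 14.0],
]
Q_5PCT = 3.77  # assumed table value for K = 3, v = 12, alpha = 0.05

w, pooled, eta, v = tukey_w(samples, Q_5PCT)
sample_means = [mean(x) for x in samples]
max_diff = max(sample_means) - min(sample_means)
significant = max_diff > w

print(f"v = {v}, s^2 = {pooled:.4f}, eta = {eta:.4f}, W = {w:.4f}")
print(f"largest difference = {max_diff:.4f}, significant = {significant}")
```

With these hypothetical numbers, $W$ comes out at roughly 2.47 while the largest difference between sample means is 3.5, so the means would be judged significantly different at the 5% level.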

## What distribution does $W$ follow?

In [1]:
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns

In [60]:
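K = 4
# All population means are equal, so the null hypothesis holds
MEAN = [5] * K
SIGMA = [2] * K
# Unequal sample sizes, drawn once and reused in every simulation
SAMPLE_SIZE = np.random.randint(1000, 1050, K)
# Upper 5% point of the Studentized range for K groups and v = N - K degrees of freedom
CRITICAL_VALUE_5PCT = scipy.stats.studentized_range.ppf(q=0.95, k=K, df=np.sum(SAMPLE_SIZE) - K)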

def get_max_diff_exceed_test_statistic_check():
    """Simulate one experiment; return True if the largest mean difference exceeds W."""
    samples = [np.random.normal(x,y,z) for x,y,z in zip(MEAN, SIGMA, SAMPLE_SIZE)]
    sample_means = [np.mean(x) for x in samples]
    sample_variance = [np.var(x, ddof=1) for x in samples]
    # Pooled variance s^2 with N - K degrees of freedom
    overall_variance = np.sum([(n-1)*s_sq for n, s_sq in zip(SAMPLE_SIZE, sample_variance)]) / (np.sum(SAMPLE_SIZE) - K)

    # eta is the harmonic mean of the sample sizes
    eta = K / np.sum([1/size for size in SAMPLE_SIZE])

    # W = q * s / sqrt(eta), the threshold the largest difference is compared against
    w = (CRITICAL_VALUE_5PCT * overall_variance**0.5) / eta**0.5
    max_diff = max(sample_means) - min(sample_means)

    return max_diff > w

In [63]:
# About 5% of simulations exceed W when all means are equal, as expected for a 5% critical value

proportion_exceeds_test_statistic_when_all_means_equal = np.mean([get_max_diff_exceed_test_statistic_check() for _ in range(3_000)])
proportion_exceeds_test_statistic_when_all_means_equal

0.049666666666666665
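
To judge how close the simulated proportion is to the nominal 5%, one can compare the gap against the Monte Carlo standard error of a proportion estimated from 3,000 trials. The sketch below assumes the trials are independent Bernoulli draws with success probability 0.05; the variable names are illustrative only.

```python
from math import sqrt

n_trials = 3_000
nominal = 0.05
observed = 0.049666666666666665  # proportion reported by the simulation above

# Standard error of a sample proportion under the nominal rate
standard_error = sqrt(nominal * (1 - nominal) / n_trials)
# Gap between observed and nominal rate, in standard-error units
z = (observed - nominal) / standard_error

print(f"standard error = {standard_error:.5f}, z = {z:.3f}")
```

The standard error is about 0.004, so the observed proportion sits well within one standard error of 0.05.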