# ANOVA and Experiments

A common goal of a statistician is to design an experiment that can be used to determine which independent variables are important in predicting an outcome. Examples of questions we might wish to study:

- What factors contribute to the sale price of a house?
- What factors contribute to a student being on academic probation?
- What treatments were successful in combating a disease?

Generally a statistician cannot hope to have all factors included (or even known), so in practice the experiment contains randomness that represents the effects of unknown factors. The goal is to determine which of the known factors matter most.

*Analysis of Variance* (ANOVA) is a procedure designed to elicit conclusions to such questions.

## A Simple Example

We will explain the method with a simple example for which we already have a procedure: we use two independent samples of equal size $n_1 = n_2 = n$ to compare the means of two normally distributed populations with means $\mu_1$ and $\mu_2$ and equal variances $\sigma_1^2 = \sigma_2^2 = \sigma^2$. We could address this question using a two-sample t-test (or a normal test if the samples are big enough). However, there is another approach:

The Total Variation of the response in the two samples is given by:
$$ \mbox{Total SS} = \sum_{i=1}^2 \sum_{j=1}^n (Y_{ij} - \bar{Y} )^2 $$
where $\bar{Y}$ is the mean of the total set of data from both samples.


Using a bit of algebra we can partition this Total Variation into two parts:

$$ \mbox{Total SS} = n \sum_{i=1}^2 ( \bar{Y}_i - \bar{Y} )^2 + \sum_{i=1}^2 \sum_{j=1}^n ( Y_{ij} - \bar{Y}_i )^2 $$
where $\bar{Y}_i$ is the sample mean of the $i$th sample.

We have assumed that the two population variances are equal and that sample sizes are the same so that:

$$ \mbox{SSE} = \sum_{i=1}^2 \sum_{j=1}^n ( Y_{ij} - \bar{Y}_i )^2 = (n - 1) S_1^2 + (n-1) S_2^2 $$

and recall that the pooled estimate for the variance is:

$$ S_p^2 = \frac{\mbox{SSE}}{2n - 2} $$

Let SST (for Sum of Squares of the Treatment) be:

$$ \mbox{SST} = n \sum_{i=1}^2 ( \bar{Y}_i - \bar{Y} )^2 = \frac{n}{2} ( \bar{Y}_1 - \bar{Y}_2 )^2  $$

Our partition is then:

$$ \mbox{Total SS} = \mbox{SST} + (2 n- 2) S_p^2 $$

Thus the total variation is larger when the two sample means are further apart.

A natural question is then: what is $\mbox{SST}$ an estimator for?
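The partition can be checked numerically. The sketch below uses two small made-up samples of equal size (the values and variable names are illustrative assumptions, not data from this text) and confirms both that Total SS equals SST plus $(2n-2) S_p^2$ and that SST reduces to $\frac{n}{2}(\bar{Y}_1 - \bar{Y}_2)^2$.

```python
import numpy as np

# Hypothetical data: two samples of equal size n (not taken from the text)
y1 = np.array([4.1, 5.3, 6.0, 5.5, 4.8])
y2 = np.array([6.2, 7.1, 6.8, 7.9, 6.5])
n = len(y1)

# Mean of the combined data from both samples
grand_mean = np.concatenate([y1, y2]).mean()

# Total variation about the grand mean
total_ss = ((y1 - grand_mean) ** 2).sum() + ((y2 - grand_mean) ** 2).sum()

# Variation of the sample means about the grand mean (treatment part)
sst = n * ((y1.mean() - grand_mean) ** 2 + (y2.mean() - grand_mean) ** 2)

# Variation within each sample about its own mean (error part)
sse = ((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum()
pooled_var = sse / (2 * n - 2)

# Both lines should print two matching numbers (up to rounding error)
print(total_ss, sst + (2 * n - 2) * pooled_var)
print(sst, n / 2 * (y1.mean() - y2.mean()) ** 2)
```

Moving the two samples further apart (for example adding a constant to `y2`) increases `sst` and `total_ss` but leaves `sse` unchanged.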

It turns out that we can show:

$$ E(\mbox{SST}) = \sigma^2 + \frac{n}{2} (\mu_1 - \mu_2)^2 $$

and thus if $\mu_1 = \mu_2$ then $\mbox{SST}$ is an unbiased estimator for $\sigma^2$. Moreover, under the hypothesis that $\mu_1 = \mu_2$ the variable $ Z = (\bar{Y}_1 - \bar{Y}_2) / \sqrt{2 \sigma^2 / n } $ has a standard normal distribution, and thus:

$$ Z^2 = \frac{n}{2} \frac{ (\bar{Y}_1 - \bar{Y}_2)^2 }{\sigma^2} = \frac{\mbox{SST}}{\sigma^2} $$

has a $\chi^2$ distribution with 1 degree of freedom.
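To see where the expectation of $\mbox{SST}$ stated above comes from, here is a short derivation sketch, assuming as before independent normal samples of common size $n$ and common variance $\sigma^2$. Each sample mean has variance $\sigma^2 / n$, so

$$ \bar{Y}_1 - \bar{Y}_2 \sim N\left( \mu_1 - \mu_2, \; \frac{2 \sigma^2}{n} \right) $$

and, since $E(X^2) = \mbox{Var}(X) + E(X)^2$ for any random variable $X$,

$$ E(\mbox{SST}) = \frac{n}{2} E\left[ (\bar{Y}_1 - \bar{Y}_2)^2 \right] = \frac{n}{2} \left[ \frac{2 \sigma^2}{n} + (\mu_1 - \mu_2)^2 \right] = \sigma^2 + \frac{n}{2} (\mu_1 - \mu_2)^2 $$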

--

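Since we generally do not know $\sigma^2$, we estimate it using $\mbox{SSE}$. The quantity $\mbox{SSE}/\sigma^2$ has a $\chi^2$ distribution with $2n-2$ degrees of freedom and is independent of $\mbox{SST}$, so under the null hypothesis the ratio:

$$ \frac{\mbox{SST}/1}{\mbox{SSE}/(2n-2)} $$

has an F distribution with 1 and $2n-2$ degrees of freedom. If we call a sum of squares divided by its degrees of freedom a *mean square*, so that $\mbox{MST} = \mbox{SST}/1$ is the mean square for treatments and $\mbox{MSE} = \mbox{SSE}/(2n-2)$ is the mean square error, we can rewrite this as: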

$$ F = \frac{\mbox{MST}}{\mbox{MSE}} $$

and we use this ratio of the mean square for treatments to the mean square error as our primary test statistic.

Disagreement between the data and the null hypothesis is then indicated by values of $F$ that are too large, so we use a rejection region given by $F > F_{\alpha}$ with $F_\alpha$ determined by the F-distribution with 1 and $2n-2$ degrees of freedom.
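As a quick numerical sanity check of this statistic: with two groups, $F$ should equal the square of the pooled two-sample t statistic, and the two tests should give the same p-value. The sketch below uses made-up data (an assumption for illustration) together with `ttest_ind` and `f_oneway` from `scipy.stats`.

```python
from scipy.stats import ttest_ind, f_oneway

# Hypothetical data: two samples of equal size (not taken from the text)
y1 = [4.1, 5.3, 6.0, 5.5, 4.8]
y2 = [6.2, 7.1, 6.8, 7.9, 6.5]

# Pooled-variance two-sample t-test (equal_var=True is the default)
t_stat, t_p = ttest_ind(y1, y2)

# One-way ANOVA F test on the same two groups
f_stat, f_p = f_oneway(y1, y2)

# The squared t statistic matches F, and the p-values agree
print(t_stat ** 2, f_stat)
print(t_p, f_p)
```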

## So What

Essentially the ANOVA test rejects the null hypothesis if the variation observed in the sample means is larger than could be explained by random samples drawn from populations with the same mean. You can show that the test we just devised is actually equivalent to a two-sample t-test. So why bother?

What we will show now is that it generalizes immediately to multiple treatments. We just expand our definitions of MST and MSE, and at the same time we relax our assumption that the sample sizes are equal. With $k$ treatments, where $\bar{Y}$ is again the mean of all the data and $\bar{Y}_i$ is the mean of the $i$th sample of size $n_i$, that looks like:

$$ \mbox{SST} = \sum_{i=1}^k n_i ( \bar{Y}_i - \bar{Y} )^2 $$ 

and 

$$ \mbox{SSE} = \mbox{Total SS} - \mbox{SST} $$ 

Then 

$$ \mbox{MST} = \frac{\mbox{SST}}{k-1} $$

and

$$ \mbox{MSE} = \frac{\mbox{SSE}}{(n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1) } $$

### ANOVA Hypothesis

Then the null hypothesis is that the means satisfy $\mu_1 = \mu_2 = \dots = \mu_k$, and we will reject it and conclude that at least one of these means differs from the others if

$$ F = \frac{\mbox{MST}}{\mbox{MSE}} > F_\alpha $$ 

from the F-distribution with $k-1$ and $(n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1)$ degrees of freedom.

Four groups of students were given four different sets of support in their entry-level English course at a university. They then took a common assessment, with the following scores:


In [5]:
import numpy as np

In [3]:
c1 = [65, 87, 73, 79, 81, 69]
c2 = [75, 69, 83, 81, 72, 79, 90]
c3 = [59, 78, 67, 62, 83, 76]
c4 = [94, 89, 80, 88]

In [4]:
n1 = len(c1)
n2 = len(c2)
n3 = len(c3)
n4 = len(c4)

In [10]:
# Grand mean of all scores across the four groups
ybar = (sum(c1) + sum(c2) + sum(c3)+sum(c4) )/(n1+n2+n3+n4)

In [11]:
# Sum of squared deviations of one group's scores from the grand mean ybar
def ss(c):
    return sum( [ (y- ybar)**2 for y in c])
TotalSS = ss(c1)+ss(c2)+ss(c3)+ss(c4)
TotalSS

1909.217391304348

In [13]:
# One group's contribution to SST: n_i * (group mean - grand mean)^2
def st(c):
    cbar = np.mean(c)
    return len(c) *(cbar - ybar)**2

SST = st(c1) + st(c2) + st(c3) + st(c4)
SST

712.5864389233957

In [14]:
SSE = TotalSS - SST
SSE

1196.6309523809523

In [15]:
# k - 1 = 3 degrees of freedom for the treatments
MST = SST / 3
MST

237.52881297446524

In [17]:
# (n1 - 1) + (n2 - 1) + (n3 - 1) + (n4 - 1) = 19 degrees of freedom for the errors
MSE = SSE / (n1 + n2 + n3 + n4 - 4)
MSE

62.980576441102755

In [18]:
# Test Statistic is

F = MST / MSE
F

3.7714613996363453

In [20]:
# p-value is then:

from scipy.stats import f

1 - f.cdf(F, 3, n1+n2+n3+n4-4 )

0.02804096198282069

So for any significance level above about 0.028 (for example the common choice $\alpha = 0.05$) we reject the null hypothesis and conclude that at least one of the four sets of interventions produced a different result (can you tell which one?).
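The same decision can be reached through the rejection region $F > F_\alpha$ instead of the p-value. The sketch below assumes a significance level of $\alpha = 0.05$ (a choice made here for illustration) and uses `f.ppf` for the critical value; `f.sf` is the survival function, which gives the same tail probability as `1 - f.cdf`.

```python
from scipy.stats import f, f_oneway

c1 = [65, 87, 73, 79, 81, 69]
c2 = [75, 69, 83, 81, 72, 79, 90]
c3 = [59, 78, 67, 62, 83, 76]
c4 = [94, 89, 80, 88]

alpha = 0.05                      # assumed significance level
k = 4                             # number of treatments
n_total = len(c1) + len(c2) + len(c3) + len(c4)
df_treat = k - 1                  # numerator degrees of freedom
df_error = n_total - k            # denominator degrees of freedom

f_stat = f_oneway(c1, c2, c3, c4).statistic

# Critical value F_alpha: the point with upper-tail area alpha
f_crit = f.ppf(1 - alpha, df_treat, df_error)

# Upper-tail p-value for the observed statistic
p_value = f.sf(f_stat, df_treat, df_error)

reject = f_stat > f_crit
print(f_stat, f_crit, p_value, reject)
```

Rejecting when `f_stat > f_crit` is the same decision as rejecting when `p_value < alpha`.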

## ANOVA Table

An ANOVA Table is a standardized way of laying out the computation of the results for an ANOVA analysis:

| Source | degrees of freedom | SS | MS | F |
| ----- | --- | --- | --- | --- |
| Treatment Effects | k-1 | SST | MST | MST / MSE |
| Errors | n-k | SSE | MSE| |
| Total | n-1 | Total SS | | |

Here $n = n_1 + n_2 + \dots + n_k$ is the total number of observations. For this problem:

| Source | degrees of freedom | SS | MS | F |
| ----- | --- | --- | --- | --- |
| Treatment Effects | 3 | 712.6 | 237.5 | 3.77 |
| Errors | 19 | 1196.6 | 62.98 | |
| Total | 22 | 1909.2 | | |

Essentially ANOVA is asking if the variation observed can be sufficiently explained by the differences arising from the treatments. This has become enough of a standard test that it is coded for us in scipy.stats:

In [22]:
from scipy.stats import f_oneway

In [23]:
f_oneway(c1, c2, c3, c4)

F_onewayResult(statistic=3.771461399636344, pvalue=0.0280409619828207)

## Randomized Block Design

So we now have a method of testing for variation between multiple treatments. A basic experiment design for 4 treatments would then be to divide our subjects randomly into 4 sets. However, problems remain: differences between the individual subjects can swamp the effect of the treatments.

### Example

We have developed two rapid tests for the concentration of COVID-19 viral load in subjects. One way to check the validity of our tests is to apply both of them to each member of a set of test subjects. This is called blocking: we think of the subjects as the blocks and the tests as the treatments.

The statistical model for this experiment is then that the response $Y_{ij}$ of the $j$th block to the $i$th treatment is given by:

$$ Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij} $$

where $\mu$ is the overall mean, $\tau_i$ is the nonrandom effect of the $i$th treatment, $\beta_j$ is the nonrandom effect of the $j$th block, and $\epsilon_{ij}$ is the random error. In the example above the blocks are the individual subjects, so $\beta_j$ includes effects like the underlying level of the measured variable that each subject brought to the experiment.

### ANOVA for Block Experiments

In a block experimental design we now have three sources contributing to the Total Variation:

$$ \mbox{Total SS} = \mbox{SSB} + \mbox{SST} + \mbox{SSE} $$

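These are, in order, the variation explained by the blocks, the variation explained by the treatments, and the remaining error.

With $k$ treatments and $b$ blocks, where $\bar{Y}_{\circ j}$ is the mean of the $j$th block and $\bar{Y}_{i \circ}$ is the mean of the $i$th treatment, the formulas for the first two are: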

$$ \mbox{SSB} = k \sum_{j=1}^b ( \bar{Y}_{\circ j} - \bar{Y} )^2 $$

$$ \mbox{SST} = b \sum_{i=1}^k ( \bar{Y}_{i \circ} - \bar{Y} )^2 $$

and then

$$ \mbox{SSE}  = \mbox{Total SS} - \mbox{SSB} - \mbox{SST} $$

We then get two test statistics: one that detects whether we can conclude that at least one of the treatments gives a different population mean from the other treatments across all blocks, and one that detects whether we can conclude that at least one of the blocks gives a different population mean from the other blocks across all treatments.

$$ F = \frac{\mbox{MST}}{\mbox{MSE}} $$ 

with $k-1$ and $n - b - k + 1$ degrees of freedom, where $n = bk$ is the total number of observations; and

$$ F = \frac{\mbox{MSB}}{\mbox{MSE}}$$

with $b-1$ and $n-b-k+1$ degrees of freedom. As before, each mean square is the corresponding sum of squares divided by its degrees of freedom. In particular, if blocking is really necessary for the experiment we expect this last statistic to be large.
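A minimal sketch of these computations follows, using invented measurements for $k = 2$ tests applied to $b = 6$ subjects. The numbers and variable names are assumptions for illustration only, not data from a real study.

```python
import numpy as np
from scipy.stats import f

# Hypothetical responses: rows are the k treatments (tests),
# columns are the b blocks (subjects)
Y = np.array([
    [12.1, 15.3,  9.8, 20.4, 17.6, 11.2],   # test A on subjects 1..6
    [12.9, 15.1, 10.9, 21.5, 18.0, 12.4],   # test B on subjects 1..6
])
k, b = Y.shape
n = k * b
grand_mean = Y.mean()

# Partition of the total variation
total_ss = ((Y - grand_mean) ** 2).sum()
sst = b * ((Y.mean(axis=1) - grand_mean) ** 2).sum()   # treatment means
ssb = k * ((Y.mean(axis=0) - grand_mean) ** 2).sum()   # block means
sse = total_ss - ssb - sst

# Degrees of freedom and mean squares
df_t, df_b, df_e = k - 1, b - 1, n - b - k + 1
mst, msb, mse = sst / df_t, ssb / df_b, sse / df_e

# Test statistics and upper-tail p-values
f_treat = mst / mse
f_block = msb / mse
p_treat = f.sf(f_treat, df_t, df_e)
p_block = f.sf(f_block, df_b, df_e)

print("treatments:", f_treat, p_treat)
print("blocks:    ", f_block, p_block)
```

With only two treatments, the treatment statistic is the square of the paired t statistic on the same data, which offers one way to check the arithmetic.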



## Suppose a Difference is Detected

ANOVA methods merely determine that at least one of the treatments produced a different result, and in particular that the treatments explain a significant portion of the variation in the samples. What they do not do is tell us which of the treatments was different. To go further one needs to develop appropriate null hypotheses about what is going on; typically the first thing to do is determine which treatments have the same population mean so that they can be eliminated from the problem. This process is called a post-hoc analysis, in the sense that it is done after the initial conclusion has been reached.
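One simple way to begin such a post-hoc analysis is to run pooled two-sample t-tests on every pair of treatments and apply a Bonferroni correction for the number of comparisons. The sketch below does this for the four groups of scores from the earlier example. It illustrates one common approach chosen here for simplicity, not a procedure prescribed by this text, and other post-hoc methods exist.

```python
from itertools import combinations
from scipy.stats import ttest_ind

groups = {
    "c1": [65, 87, 73, 79, 81, 69],
    "c2": [75, 69, 83, 81, 72, 79, 90],
    "c3": [59, 78, 67, 62, 83, 76],
    "c4": [94, 89, 80, 88],
}

pairs = list(combinations(groups, 2))
m = len(pairs)   # number of pairwise comparisons (6 for four groups)

adjusted = {}
for a, b in pairs:
    # Pooled-variance t-test of H0: the two group means are equal
    t_stat, p = ttest_ind(groups[a], groups[b])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    adjusted[(a, b)] = min(1.0, p * m)
    print(a, b, round(t_stat, 3), round(adjusted[(a, b)], 4))
```

Pairs whose adjusted p-value falls below the chosen significance level are the ones for which the data suggest different population means.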