# Analysis of Variance (ANOVA)

## Concept

The goal is to determine the effects of **discrete** independent variables on a **continuous** dependent variable.

We are mostly concerned with the main effect of each factor and the interactions between factors.

A **main effect** is the influence of one factor on the dependent variable when all other factors are ignored. For example, young people's symptoms improve faster than older people's symptoms, regardless of medication type.

An **interaction** occurs when the effect of one factor depends on the level of another factor. For example, medication A works better in older people.

The **intercept** captures whether the overall average of the dependent variable differs from 0. The intercept of an ANOVA is usually **ignored**. For example, symptoms improve for almost everyone after 10 days.

A **factor** is an independent variable. A **level** is one of the categorical values that a factor can take.

In an X-way ANOVA, X is the number of factors (independent variables).

**Repeated-measures ANOVA** means that at least one factor involves multiple measurements from the same individual, i.e. the same individual is measured at different levels of the same factor.

**MANOVA** is multivariate ANOVA: an ANOVA with multiple dependent variables. For example, testing the effects of medication type and age on Covid-19 symptoms and total medical expenses. Here there are 2 dependent variables: symptoms and expenses.

## Assumption

- The data are **sampled independently** of each other in the population.
- The **residuals** (the unexplained variance after fitting the model) are **normally (Gaussian) distributed**.
- The variance within each cell (a level, or a combination of levels across factors) is roughly the same (**homoscedasticity**, also called homogeneity of variance).

When the assumptions do not hold, we can use a non-parametric one-way ANOVA, the **Kruskal-Wallis test (KW-ANOVA)**. It is rarely needed in practice, because ANOVAs are **generally robust to violations of the assumptions**.
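As a rough illustration of what the Kruskal-Wallis test computes, the sketch below replaces the observations with their ranks in the pooled data and compares the rank sums across groups. The function names and the toy data are hypothetical. The $H$ formula used is the textbook form without a tie correction, so it is only exact when there are no tied values.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), no tie correction."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])  # R_j for this group
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# Hypothetical toy data: three completely separated groups.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 3))
```

A larger $H$ means the groups' rank sums are further apart than expected if all groups came from the same distribution.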

## Hypothesis

ANOVA tests the following hypotheses, where $\mu_i$ is the population mean of cell $i$ and $k$ is the number of cells.

$$
H_0: \mu_1 = \mu_2 = \dots = \mu_k
$$
$$
H_A: \mu_i \ne \mu_j \text{ for at least one pair } i \ne j
$$

The alternative hypothesis states that at least one mean differs from at least one other mean. So a significant ANOVA does not immediately tell us which means differ from which other means. We have to investigate further to find that out.

## Computation

We first define **sum of squares (SS)**.

$$
SS = \sum_{i = 1}^{n} (x_i - \bar{x})^2
$$
$$
\text{Total SS} = \text{Within-group SS} + \text{Between-group SS}
$$

This means that the total variation in the dataset is the sum of the variation across individuals within each group and the variation across the different levels.

$$
F = \frac{\text{Explained variance}}{\text{Unexplained variance}}
$$
$$
= \frac{\text{Variance due to factors}}{\text{Natural variation}}
$$

An ANOVA is statistically significant when the explained variance is sufficiently larger than the unexplained variance.

$$
\text{SS}_{\text{Total}} = \sum_{j = 1}^{\text{levels}} \sum_{i = 1}^{\text{individuals}} (x_{ij} - \bar{x})^2
$$

$\text{SS}_{\text{Total}}$ has $df_{\text{Total}} = N - 1$ degrees of freedom, where $N$ is the total number of observations.

$$
\text{SS}_{\text{Between}} = \sum_{j = 1}^{\text{levels}} (\bar{x}_j - \bar{x})^2 n_j
$$

$\text{SS}_{\text{Between}}$ has $df_{\text{Between}} = k - 1$ degrees of freedom, where $k$ is the number of levels.

$$
\text{SS}_{\text{Within}} = \sum_{j = 1}^{\text{levels}} \sum_{i = 1}^{\text{individuals}} (x_{ij} - \bar{x}_j)^2
$$

$\text{SS}_{\text{Within}}$ has $df_{\text{Within}} = N - k$ degrees of freedom.

$\text{SS}_{\text{Total}}$ is the entire variability present in the dataset, ignoring all the levels.

$\text{SS}_{\text{Between}}$ is the variability that we can attribute to the different levels of our factor, ignoring the individuals. $\bar{x}_j$ is the mean of level $j$, and $n_j$ is the number of individuals within level $j$.

$\text{SS}_{\text{Within}}$ is the variability of the individuals around the mean of their own level, summed over all levels.
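The three sums of squares can be checked numerically. The following is a minimal sketch on hypothetical toy data (one factor with three levels, three individuals per level); the variable names are chosen for illustration only.

```python
from statistics import mean

# Hypothetical toy data: one factor with three levels, three individuals each.
groups = {"A": [4, 5, 6], "B": [6, 7, 8], "C": [8, 9, 10]}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Total SS: every observation against the grand mean, ignoring the levels.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group SS: each level mean against the grand mean, weighted by n_j.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-group SS: every observation against the mean of its own level.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

print(ss_total, ss_between, ss_within)
```

For this data the total of 30 splits into 24 between groups and 6 within groups, which matches $\text{Total SS} = \text{Within-group SS} + \text{Between-group SS}$.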

$\text{MS}$ stands for mean squares: a sum of squares divided by its degrees of freedom.

$$
\text{MS}_{\text{Between}} = \frac{\text{SS}_{\text{Between}}}{df_{\text{Between}}}
$$
$$
\text{MS}_{\text{Within}} = \frac{\text{SS}_{\text{Within}}}{df_{\text{Within}}}
$$

Finally, compute the **test statistic** $F$:

$$
F_{k - 1, N - k} = \frac{\text{MS}_{\text{Between}}}{\text{MS}_{\text{Within}}}
$$

The F statistic is a ratio: the numerator is the variability that we can attribute to the experimental factor (differences between its levels), and the denominator is the variability that we can attribute to differences between individuals within levels. From the test statistic we can compute a **p-value**. The correct interpretation of $p < 0.05$ is that the mean of at least one group (level) is statistically significantly different from the mean of at least one other group. So **the p-value and the F statistic don't tell us which groups are different. They only tell us that there is a difference somewhere**. Further data visualization and follow-up t-tests are necessary to determine which groups differ.
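The computation from sums of squares to $F$ can be sketched in a few lines. The helper name `one_way_f` and the toy data are hypothetical; the function only returns the statistic and its degrees of freedom, not the p-value.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Hypothetical toy data: three levels, three individuals each.
f_stat, df_b, df_w = one_way_f([[4, 5, 6], [6, 7, 8], [8, 9, 10]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")  # F(2, 6) = 12.00
```

The p-value is the upper-tail probability of the F distribution with these degrees of freedom. If SciPy is available, `scipy.stats.f.sf(f_stat, df_b, df_w)` would give it; that step is left out here to keep the sketch dependency-free.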

The **omnibus F-test** is a general test of significance: it tells us that something differs, but not what.

In the context of ANOVA, post-hoc pairwise comparisons are commonly made with the **Tukey test**, which accounts for making multiple comparisons. Post-hoc means that we test individual conditions after the ANOVA has come out significant. To compare group $i$ with group $j$, the Tukey test statistic is

$$
q = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\text{MS}_{\text{Within}}} \sqrt{2 / n}}
$$

where $n$ is the number of individuals per group. This looks like a **t-test**: a difference between the means, scaled by a measure of variance and by the number of data points. Note that the studentized range statistic, as conventionally tabulated, uses $\sqrt{\text{MS}_{\text{Within}} / n}$ in the denominator and is therefore $\sqrt{2}$ times the value above. It is evaluated against the studentized range distribution with parameters $(k, N - k)$, where $k$ is the number of groups being compared and $N$ is the total number of observations, so the significance threshold depends on how many groups enter the comparisons. Post-hoc comparisons within ANOVA are conventionally run only when the omnibus F-test is significant.
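The pairwise statistics can be sketched as follows, assuming equal group sizes. The code computes the statistic exactly as written in the formula above, and, as an addition that is not part of that formula, the version scaled by $\sqrt{\text{MS}_{\text{Within}} / n}$ that studentized range tables expect; the two differ by a factor of $\sqrt{2}$. The data and variable names are hypothetical, and no critical values or p-values are computed.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Hypothetical toy data with equal group sizes (n = 3 per group).
groups = {"A": [4, 5, 6], "B": [6, 7, 8], "C": [8, 9, 10]}
n = 3
k = len(groups)
n_total = sum(len(v) for v in groups.values())

# MS_within pools the within-group variability across all groups.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
ms_within = ss_within / (n_total - k)

pairwise = {}
for (name_i, g_i), (name_j, g_j) in combinations(groups.items(), 2):
    diff = abs(mean(g_i) - mean(g_j))
    t_like = diff / (sqrt(ms_within) * sqrt(2 / n))  # formula as written above
    q_range = diff / sqrt(ms_within / n)             # studentized-range scaling
    pairwise[(name_i, name_j)] = (t_like, q_range)

for pair, (t_like, q_range) in pairwise.items():
    print(pair, round(t_like, 3), round(q_range, 3))
```

Each `q_range` value would then be compared with a critical value of the studentized range distribution for $k = 3$ groups and $N - k = 6$ degrees of freedom.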

## Two-way ANOVA

The total variation in the dataset is the sum of:

- the variation across individuals within each group,
- the variation across the different levels within each factor, and
- the variation at the **interaction** between the factors.

The two-way ANOVA table reports a p-value for each factor and for each interaction, so each p-value speaks only to that specific main effect or interaction.
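The decomposition into two main effects, an interaction, and within-cell variation can be sketched for a balanced design (the same number of individuals in every cell). The data, the factor labels, and the variable names below are hypothetical, and the formulas used are the standard ones for the balanced case only.

```python
from statistics import mean

# Hypothetical balanced design: 2 medications x 2 age groups, 2 people per cell.
cells = {
    ("med1", "young"): [10, 12],
    ("med1", "old"): [6, 8],
    ("med2", "young"): [9, 11],
    ("med2", "old"): [11, 13],
}
levels_a = sorted({a for a, _ in cells})
levels_b = sorted({b for _, b in cells})
n = 2  # observations per cell (balanced design assumed)

all_values = [x for v in cells.values() for x in v]
grand = mean(all_values)
cell_mean = {key: mean(v) for key, v in cells.items()}
mean_a = {a: mean([x for (ca, _), v in cells.items() if ca == a for x in v])
          for a in levels_a}
mean_b = {b: mean([x for (_, cb), v in cells.items() if cb == b for x in v])
          for b in levels_b}

# Main effects: marginal means against the grand mean.
ss_a = n * len(levels_b) * sum((mean_a[a] - grand) ** 2 for a in levels_a)
ss_b = n * len(levels_a) * sum((mean_b[b] - grand) ** 2 for b in levels_b)
# Interaction: what the cell means do beyond the two main effects.
ss_ab = n * sum((cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand) ** 2
                for a in levels_a for b in levels_b)
# Within: individuals against their own cell mean.
ss_within = sum((x - cell_mean[key]) ** 2 for key, v in cells.items() for x in v)
ss_total = sum((x - grand) ** 2 for x in all_values)

df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
df_within = len(all_values) - len(levels_a) * len(levels_b)
ms_within = ss_within / df_within

# One F statistic per row of the ANOVA table.
f_a = (ss_a / df_a) / ms_within
f_b = (ss_b / df_b) / ms_within
f_ab = (ss_ab / (df_a * df_b)) / ms_within
print(ss_a, ss_b, ss_ab, ss_within, ss_total)
print(f_a, f_b, f_ab)
```

Each of the three F statistics gets its own p-value, which is why the table has one row per factor and one per interaction. In this toy data the interaction term carries most of the explained variation.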






