# Analysis of Variance (ANOVA)

## Concept

The goal is to determine the effects of **discrete** independent variables on a **continuous** dependent variable.

We are mostly concerned with the main effects of factors and the interactions between factors.

A **main effect** is the influence of one factor on the dependent variable when all other factors are ignored. For example, young people's symptoms improve faster than older people's symptoms, regardless of medication type.

An **interaction** occurs when the effect of one factor depends on the level of another factor. For example, medication A works better in older people.

The **intercept** reflects the overall average of the dependent variable being different from 0. In ANOVA the intercept is usually **ignored**. For example, symptoms improve for almost everyone after 10 days.
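To make these three terms concrete, here is a minimal sketch on a 2x2 table of cell means. The numbers are hypothetical improvement scores invented for illustration (not real data), and the sketch assumes every cell has the same number of observations, so marginal means are plain averages of cell means.

```python
import numpy as np

# Hypothetical mean improvement scores (made-up numbers, not real data).
# Rows: age group (young, old); columns: medication (A, B).
cell_means = np.array([
    [7.0, 7.0],   # young: A, B
    [6.0, 2.0],   # old:   A, B
])

# Intercept: the grand mean across all cells.
grand_mean = cell_means.mean()

# Main effect of age: compare the row (marginal) means, ignoring medication.
age_means = cell_means.mean(axis=1)
age_effect = age_means[0] - age_means[1]

# Main effect of medication: compare the column (marginal) means, ignoring age.
med_means = cell_means.mean(axis=0)
med_effect = med_means[0] - med_means[1]

# Interaction: does the A-versus-B difference change with age?
diff_young = cell_means[0, 0] - cell_means[0, 1]
diff_old = cell_means[1, 0] - cell_means[1, 1]
interaction = diff_old - diff_young

print(grand_mean, age_effect, med_effect, interaction)
```

In this made-up table the A-versus-B difference is zero for young people and large for older people, which is the pattern described by "medication A works better in older people".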

**Factor** means an independent variable. **Level** means one of the categorical values of a factor.

In an X-way ANOVA, X is the number of factors (independent variables).

**Repeated-measures ANOVA** means that at least one factor involves multiple measurements (the same factor at different levels) taken from the same individual.

**MANOVA** is multivariate ANOVA, an ANOVA with multiple dependent variables. For example, testing the effects of medication type and age on Covid-19 symptoms and total medical expenses. Here there are 2 dependent variables: symptoms and expenses.

## Assumptions

- The data are **sampled independently** of each other from the population.
- The **residuals** (the unexplained variance after fitting the model) are **normally (Gaussian) distributed**.
- The variance within each cell (a level, or a level-by-level combination) is roughly the same (**homoscedasticity**, homogeneity of variance).

When the assumptions are violated, we can use the non-parametric one-way ANOVA, the **Kruskal-Wallis test (KW-ANOVA)**. It is rarely used, though, because ANOVAs are **generally robust to violations of the assumptions**.
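One way to look at these assumptions in practice is sketched below, assuming SciPy is available. The data are simulated (three hypothetical groups drawn from normal distributions), and the choice of the Shapiro-Wilk test for normality and Levene's test for equal variances is an assumption of this sketch, not something these notes prescribe.

```python
import numpy as np
from scipy import stats

# Simulated data: three hypothetical groups with different means (not real data).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 7.0)]

# Residuals of a one-way model: each observation minus its own group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of the residuals (Shapiro-Wilk test).
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance across groups (Levene's test).
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric alternative to the one-way ANOVA (Kruskal-Wallis test).
kw_stat, kw_p = stats.kruskal(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
print(f"Kruskal-Wallis p = {kw_p:.3g}")
```

A small p-value from the first two tests suggests that the corresponding assumption may not hold.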

## Hypothesis

ANOVA tests the following hypotheses, where $\mu_i$ is the mean of cell $i$ (a level, or a combination of levels) and $k$ is the number of cells.

$$
H_0: \mu_1 = \mu_2 = \dots = \mu_k
$$
$$
H_A: \mu_i \ne \mu_j \quad \text{for at least one pair } i \ne j
$$

The alternative hypothesis means that at least one mean differs from at least one other mean. So an ANOVA does not immediately tell us which means differ from which other means. We need further investigation (follow-up comparisons) to find that out.
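The point can be sketched as follows, assuming SciPy is available. The three groups are hypothetical numbers made up for illustration. The follow-up step uses pairwise t-tests with a Bonferroni correction, which is only one of several possible follow-up approaches and is an assumption of this sketch.

```python
from itertools import combinations
from scipy import stats

# Hypothetical measurements for three levels of one factor (not real data).
groups = [
    [4.1, 5.0, 4.8, 5.3, 4.6],
    [4.9, 5.2, 5.1, 4.7, 5.4],
    [7.8, 8.1, 7.5, 8.4, 7.9],
]

# The omnibus test gives a single F and p-value for H_0: all means are equal.
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.3g}")

# Follow-up: pairwise t-tests, Bonferroni-corrected for the number of pairs.
pairs = list(combinations(range(len(groups)), 2))
adjusted = {}
for i, j in pairs:
    t_stat, p_pair = stats.ttest_ind(groups[i], groups[j])
    adjusted[(i, j)] = min(1.0, p_pair * len(pairs))
    print(f"group {i} vs group {j}: adjusted p = {adjusted[(i, j)]:.3g}")
```

The ANOVA result alone only says that some difference exists. The pairwise comparisons show which groups it comes from.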

## Computation

We first define the **sum of squares (SS)**, where $\bar{x}$ is the mean of the $n$ values $x_i$.

$$
SS = \sum_{i = 1}^{n} (x_i - \bar{x})^2
$$
$$
\text{Total SS} = \text{Within-group SS} + \text{Between-group SS}
$$

This means that the total variation in the dataset is the sum of the variation across individuals within each group and the variation across the different levels (groups).
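The decomposition can be checked numerically. The sketch below uses NumPy on hypothetical data for a one-way design with three groups. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical measurements for three groups (not real data).
groups = [
    np.array([4.1, 5.0, 4.8, 5.3, 4.6]),
    np.array([4.9, 5.2, 5.1, 4.7, 5.4]),
    np.array([7.8, 8.1, 7.5, 8.4, 7.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total SS: every observation against the grand mean.
ss_total = np.sum((all_values - grand_mean) ** 2)

# Within-group SS: every observation against its own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Between-group SS: every group mean against the grand mean,
# weighted by the number of observations in that group.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

print(ss_total, ss_within + ss_between)
```

The two printed values agree up to floating-point rounding.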

$$
F = \frac{\text{Explained variance}}{\text{Unexplained variance}}
$$
$$
= \frac{\text{Due to factors}}{\text{Natural variation}}
$$

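Here each variance is a sum of squares divided by its degrees of freedom. The ANOVA is statistically significant when the explained variance is sufficiently larger than the unexplained variance, that is, when $F$ is large enough that its p-value falls below the chosen significance threshold.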
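As a minimal sketch of the ratio for a one-way design, the code below divides the between-group and within-group sums of squares by their degrees of freedom ($k - 1$ and $N - k$ for $k$ groups and $N$ observations) and compares the result with `scipy.stats.f_oneway`. The data are hypothetical and SciPy is assumed to be installed.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (not real data).
groups = [
    np.array([4.1, 5.0, 4.8, 5.3, 4.6]),
    np.array([4.9, 5.2, 5.1, 4.7, 5.4]),
    np.array([7.8, 8.1, 7.5, 8.4, 7.9]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares: sums of squares divided by their degrees of freedom.
df_between = k - 1
df_within = n_total - k
ms_between = ss_between / df_between  # explained variance
ms_within = ss_within / df_within     # unexplained variance

f_manual = ms_between / ms_within
# p-value: upper tail of the F distribution with (df_between, df_within).
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```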





# 160. Sum of squares 11:05