# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [1]:
# Your code here
import pandas as pd
df = pd.read_csv('ToothGrowth.csv')
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length:

In [2]:
#Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

formula = 'len ~ C(supp) + C(dose)'
lm = ols(formula, df).fit()
table = sm.stats.anova_lm(lm, typ=2)
print(table)

               sum_sq    df          F        PR(>F)
C(supp)    205.350000   1.0  14.016638  4.292793e-04
C(dose)   2426.434333   2.0  82.810935  1.871163e-17
Residual   820.425000  56.0        NaN           NaN


This code performs an *ANOVA (Analysis of Variance)* on a linear model with two categorical predictors. Let's break down the key parts:

### 1. **Formula: 'len ~ C(supp) + C(dose)'**

- len: This is the *dependent variable* (the outcome you're trying to explain), here tooth length.
- C(supp) and C(dose): These are *categorical independent variables*. The function C() tells the model to treat supp and dose as categorical (factors) and generate dummy variables for them.
- dose is stored as a number (0.5, 1.0, 2.0), so without C() it would enter the model as a single continuous predictor. Wrapping it in C() treats each dose level as its own category, which is why it has 2 degrees of freedom in the table.

This formula defines the model you're fitting, where len is predicted by the variables supp and dose.

### 2. **Model Fitting: lm = ols(formula, df).fit()**

- ols: This stands for *Ordinary Least Squares*, a method used to fit the linear model.
- formula: The formula you've defined earlier that indicates the relationship between the dependent and independent variables.
- df: This is the DataFrame containing your data.
- fit(): This function fits the linear model to the data based on the formula provided.

The result is stored in lm, which represents the fitted linear model.

### 3. **ANOVA Table: table = sm.stats.anova_lm(lm, typ=2)**

- sm.stats.anova_lm: This function computes an *ANOVA table* for the linear model you just fit.
- lm: This is the fitted linear model from the previous step.
- typ=2: This specifies the type of ANOVA. There are three types (I, II, and III), but *Type II* ANOVA is often used when you're interested in testing each predictor after accounting for others (i.e., to test the significance of each factor after adjusting for the others).

The result, stored in table, is the ANOVA table, which summarizes the significance of each predictor in the model.

### 4. **Printing the Table: print(table)**

The ANOVA table contains several important columns:
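- *Sum of Squares (sum_sq)*: Measures the variability in the dependent variable attributed to each factor. The Residual row holds the variability the model leaves unexplained.
- *Degrees of Freedom (df)*: The number of independent pieces of information for each factor (the number of categories minus one for a categorical factor).
- *F-statistic (F)*: The ratio of each factor's mean square (its sum of squares divided by its degrees of freedom) to the residual mean square.
- *p-value (PR(>F))*: The probability of seeing an F-statistic at least this large if the factor had no effect. Small values indicate that the factor is statistically significant.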

In summary, this code fits a linear model to predict len from the categorical variables supp and dose, then conducts an ANOVA to check the statistical significance of each predictor.
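
As a quick check of how the columns relate, the F column can be recomputed by hand from sum_sq and df. The sketch below simply copies the values from the table printed above; the dictionary names are chosen here for illustration and are not part of statsmodels.

```python
# Values copied from the ANOVA table printed above.
sum_sq = {'C(supp)': 205.350000, 'C(dose)': 2426.434333, 'Residual': 820.425000}
dof = {'C(supp)': 1.0, 'C(dose)': 2.0, 'Residual': 56.0}

# Residual mean square: unexplained variability per residual degree of freedom
mean_sq_residual = sum_sq['Residual'] / dof['Residual']

f_stats = {}
for term in ('C(supp)', 'C(dose)'):
    mean_sq = sum_sq[term] / dof[term]         # mean square for this factor
    f_stats[term] = mean_sq / mean_sq_residual  # F = factor MS / residual MS
    print(term, round(f_stats[term], 4))
```

The printed values should agree with the F column of the table to the displayed precision.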

## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

In [None]:
# Both dose and supplement type are statistically significant (p = 4.3e-04 for supp, p = 1.9e-17 for dose).
# Dose appears to be the more impactful of the two: its F-statistic is far larger (82.8 vs. 14.0).

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples:

In [None]:
# Your code here

Now run a t-test between these two groups and print the associated two-sided p-value: 

In [None]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
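
One possible way to approach these two steps is with `scipy.stats.ttest_ind`. This is only a sketch: the helper name `supplement_ttest_pvalue` is hypothetical, and it assumes the `supp` and `len` columns shown in the `df.head()` output above. By default `ttest_ind` pools the two sample variances (`equal_var=True`) and returns a two-sided p-value.

```python
from scipy import stats


def supplement_ttest_pvalue(df, equal_var=True):
    """Two-sided p-value for a t-test of tooth length, OJ vs. VC.

    Hypothetical helper; assumes df has 'supp' and 'len' columns.
    equal_var=True pools the variances (Student's t-test);
    equal_var=False runs Welch's t-test instead.
    """
    # Split tooth lengths into the two supplement samples
    oj_lengths = df.loc[df['supp'] == 'OJ', 'len']
    vc_lengths = df.loc[df['supp'] == 'VC', 'len']
    result = stats.ttest_ind(oj_lengths, vc_lengths, equal_var=equal_var)
    return result.pvalue


# Usage, once df has been loaded as in the first cell:
# print(supplement_ttest_pvalue(df))
```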


## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test! So, the p-value in the table should be identical to your calculation above.

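> Note: there may be a small fractional difference (<0.001) between the two values due to differences between implementations. The match is exact only when the t-test pools the two sample variances (Student's t-test); Welch's t-test, which does not assume equal variances, gives a slightly different p-value.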

In [None]:
# Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
# Compare the p-value to that of the t-test above. 
# They should match (there may be a tiny fractional difference due to rounding errors in varying implementations)
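
The reason the two tests agree is that, with two groups, the one-way ANOVA F-statistic equals the square of the pooled-variance t-statistic, so both lead to the same p-value. The sketch below checks this numerically using only the standard library. The function names are hypothetical and the two samples are made-up numbers for illustration, not the ToothGrowth data.

```python
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's t statistic for two samples, pooling their variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5


def one_way_f_statistic(*groups):
    """One-way ANOVA F statistic: between-group MS over within-group MS."""
    values = [x for g in groups for x in g]
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up numbers for illustration only; these are NOT the ToothGrowth data.
group_a = [4.0, 6.0, 8.0, 10.0]
group_b = [7.0, 9.0, 11.0, 15.0]

t_stat = pooled_t_statistic(group_a, group_b)
f_stat = one_way_f_statistic(group_a, group_b)
print(t_stat ** 2, f_stat)  # the two values agree up to floating-point error
```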

## Run multiple t-tests

While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem. To investigate this, look at the various sample groups you could create from the 2 features: 

In [7]:
for group_name, data in df.groupby(['supp', 'dose'])['len']:
    # group_name is the (supp, dose) pair; data holds that group's tooth lengths
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)


Although it is bad practice, examine the effect of running multiple t-tests on the various combinations of these groups. To do this, generate all pairwise combinations of the groups above. For each pair, calculate the p-value of a two-sided t-test, then print the group combination and its associated p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
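
To get a feel for the scale of the multiple comparisons problem, the sketch below counts the pairwise comparisons among the six groups listed above and estimates the chance of at least one false positive. It assumes a per-test significance level of 0.05 and, as a simplification, treats the tests as independent. Pairwise tests that share a group are not truly independent, so the result is a rough guide rather than an exact figure.

```python
from itertools import combinations

# Group labels copied from the groupby output above.
groups = [('OJ', 0.5), ('OJ', 1.0), ('OJ', 2.0),
          ('VC', 0.5), ('VC', 1.0), ('VC', 2.0)]

# Every unordered pair of distinct groups
pairs = list(combinations(groups, 2))

alpha = 0.05  # assumed per-test significance level

# Probability of at least one false positive if every null hypothesis is true
# and the tests are independent (a simplifying assumption)
family_wise_error = 1 - (1 - alpha) ** len(pairs)

print(len(pairs))                   # 15 pairwise comparisons
print(round(family_wise_error, 3))  # 0.537
```

The same `combinations` call can drive the loop over t-tests: each element of `pairs` names the two groups to compare.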

## Summary

In this lab, you used ANOVA to generalize hypothesis testing to multiple groups and factors, and compared its results to those of t-tests.