ANOVA (Analysis of Variance)

ANOVA is used to compare the means of more than two groups. It does this by looking at the variation in the data and where that variation is found (hence its name). Specifically, ANOVA compares the amount of variation between groups with the amount of variation within groups.

1 The null hypothesis is typically that all group means are equal
2 The independent variables are categorical
3 The dependent variable is continuous

F Value
F = variance between groups (mean square between) / variance within groups (mean square within)
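
As a rough illustration of this ratio, the sketch below computes F by hand and compares it with `scipy.stats.f_oneway`. The three small groups are hypothetical values chosen only to keep the arithmetic easy to follow, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)       # mean square between groups
ms_within = ss_within / (n_total - k)   # mean square within groups
f_manual = ms_between / ms_within

# The library routine should agree with the manual calculation
f_scipy, p_value = f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```

A large F means the group means are spread out much more than the noise within each group would explain, which is what leads to a small p-value.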

There are several types of ANOVA tests designed for different experimental designs and scenarios. Here are some of the common ANOVA tests:

One-way ANOVA

Used when comparing means from multiple independent groups (or treatments). Determines if there are significant differences in means among the groups.
Example: Comparing the effect of different teaching methods on student exam scores

Two-way ANOVA

Used when there are two independent categorical variables (factors) that may influence the dependent variable. Investigates the effect of each factor and their interaction effect.
Example: Studying the impact of both gender and treatment type on patient recovery time

One-way ANOVA

Example: ANOVA test for fertilizer effectiveness

Scenario: A botanist is conducting an experiment to compare the growth of plants under three different fertilizers: A, B and C. The experiment involves randomly assigning 31 plants to each fertilizer group and measuring the height of the plants after a certain period.

Hypothesis:

Null Hypothesis (H0): There is no significant difference in mean plant growth among the three types of fertilizers

Alternative Hypothesis (Ha): At least one fertilizer type leads to a different mean plant growth

In [2]:
import numpy as np
from scipy.stats import f_oneway

#Heights of plants for each fertilizer group
fertilizer_a = np.array([15.2, 16.5, 14.8,17.3,15.9,16.1,17.8,15.6,16.4,
                         14.7, 16.8, 16.2, 16.7, 15.5, 16.0, 16.6, 15.9, 15.3,
                         16.5, 16.9, 15.7, 16.3, 15.8, 17.1, 15.4, 16.5, 16.2,
                         17.0, 15.1, 16.4, 16.6])

fertilizer_b = np.array([14.3, 13.8, 13.5,14.7,13.9,14.5,14.0,14.9,13.7,
                         14.4, 14.2, 14.6, 14.1, 13.6, 14.3, 14.8, 14.2, 14.5,
                         14.7, 13.9, 14.4, 13.8, 14.6, 14.1, 13.5, 14.2, 13.7,
                         14.0, 14.3, 14.5, 14.1])

fertilizer_c = np.array([17.7, 18.3, 17.0,18.6,17.9,17.2,18.1,17.7,18.0,
                         17.3, 17.8, 18.2, 17.4, 18.5, 17.6, 18.4, 18.1, 17.1,
                         18.2, 18.3, 17.8, 18.0, 17.9, 18.1, 17.7, 18.2, 18.6,
                         17.3, 18.3, 18.4, 17.6])

#Perform one-way ANOVA
f_statistic, p_value = f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)

#Interpret the results
alpha = 0.05 #significance level
if p_value < alpha:
    print('Reject Null Hypothesis: There is a significant difference in the growth of plants across fertilizers')
else:
    print('Fail to reject Null Hypothesis: No significant difference in the growth of plants across fertilizers was detected')

Reject Null Hypothesis: There is a significant difference in the growth of plants across fertilizers


Two-way ANOVA

Scenario: Researchers are conducting a study to determine the effects of diet type (Factor A: Low Carb, Balanced, Low Fat) and exercise level (Factor B: Sedentary, Moderate, Active) on weight loss. They recruit participants and measure their weight before and after a specified period.

Hypothesis:

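Null Hypothesis (H0): Mean weight loss is the same across diet types, and the same across exercise levels

Alternative Hypothesis (Ha): Mean weight loss differs for at least one diet type or at least one exercise level

Note: a two-way ANOVA can also test for an interaction between diet type and exercise level, but that requires more than one observation per diet/exercise combination. The data below has exactly one observation per combination, so the model fitted here includes the two main effects only.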

In [3]:
import pandas as pd

#Create a dataframe with simulated data
data = {
    'Diet':['Low Carb', 'Low Carb', 'Low Carb', 'Balanced', 'Balanced', 'Balanced', 'Low Fat', 'Low Fat', 'Low Fat'],
    'Exercises': ['Sedentary', 'Moderate', 'Active','Sedentary', 'Moderate', 'Active','Sedentary', 'Moderate', 'Active'],
    'Weightloss': [2.3, 3.1, 4.2, 1.5, 2.0, 2.8, 1.0, 1.2, 1.5]
    }

df=pd.DataFrame(data)
df

       Diet  Exercises  Weightloss
0  Low Carb  Sedentary         2.3
1  Low Carb   Moderate         3.1
2  Low Carb     Active         4.2
3  Balanced  Sedentary         1.5
4  Balanced   Moderate         2.0
5  Balanced     Active         2.8
6   Low Fat  Sedentary         1.0
7   Low Fat   Moderate         1.2
8   Low Fat     Active         1.5


In [8]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#Fit two way ANOVA model
model = ols('Weightloss ~ Diet + Exercises', data=df).fit()
anova_table = sm.stats.anova_lm(model)

#Print the ANOVA table (a PR(>F) value below 0.05 indicates a significant effect of that factor)
print(anova_table)

            df    sum_sq   mean_sq          F    PR(>F)
Diet       2.0  5.828889  2.914444  23.419643  0.006190
Exercises  2.0  2.308889  1.154444   9.276786  0.031455
Residual   4.0  0.497778  0.124444        NaN       NaN


Nested ANOVA

Scenario: A study is conducted to assess the effects of different instructors (nested within schools) on student exam scores. Students are nested within instructors, and instructors are nested within schools.

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#Create a dataframe with simulated data
data ={
    'School': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Instructor': ['X', 'Y', 'X', 'Y', 'Z', 'Z', 'Y', 'Z', 'X'],
    'ExamScore': [85, 88, 82, 89, 86, 90, 91, 87, 88]
}

df = pd.DataFrame(data)
df

  School Instructor  ExamScore
0      A          X         85
1      A          Y         88
2      B          X         82
3      B          Y         89
4      B          Z         86
5      C          Z         90
6      C          Y         91
7      C          Z         87
8      C          X         88


In [8]:
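#Fit the ANOVA model
#Note: this formula treats School and Instructor as crossed factors with an interaction.
#A strictly nested specification would drop the Instructor main effect:
#'ExamScore ~ School + School:Instructor'
model = ols('ExamScore ~ School + Instructor + School:Instructor', data=df).fit()
anova_table = sm.stats.anova_lm(model)

#Print the ANOVA table
print(anova_table)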

                    df     sum_sq    mean_sq         F    PR(>F)
School             2.0  20.833333  10.416667  2.314815  0.421464
Instructor         2.0  28.433333  14.216667  3.159259  0.369648
School:Instructor  4.0  10.733333   2.683333  0.596296  0.734989
Residual           1.0   4.500000   4.500000       NaN       NaN
