# ANOVA

ANOVA (analysis of variance) is a statistical test commonly used to compare the means of three or more groups.

## Factors and Levels

1. Factor: the independent variable that defines the groups or conditions being compared.

2. Levels: the categories or values within a factor; each level defines one group or condition in the analysis.

3. Example: Suppose a researcher wants to investigate the effect of different doses of a drug. Factor A (Drug Dose) has three levels: Low Dose, Medium Dose, and High Dose.
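
As a minimal sketch of how a factor and its levels map onto data, the snippet below groups some made-up responses by dose level. The response values and variable names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical observations: (level of the factor "Drug Dose", measured response)
observations = [
    ("Low Dose", 4.1), ("Low Dose", 3.8),
    ("Medium Dose", 5.2), ("Medium Dose", 5.6),
    ("High Dose", 6.9), ("High Dose", 7.3),
]

# Collect the responses that belong to each level of the factor
groups = defaultdict(list)
for level, response in observations:
    groups[level].append(response)

# The distinct levels, in order of first appearance
levels = list(groups)

# ANOVA compares these group means across the levels
group_means = {level: sum(vals) / len(vals) for level, vals in groups.items()}

print("Levels:", levels)
print("Group means:", group_means)
```

Each key of `groups` is one level of the factor, and each list holds the observations for that group.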

## Assumptions

1. Normality of the sampling distribution of the mean: the sample means of each group are assumed to be normally distributed.
2. Absence of outliers: extreme scores can distort the group means and variances, so they should be identified and removed or otherwise handled before the analysis.
3. Homogeneity of variance: the population variances at the different levels of each factor (independent variable) are equal.
4. Independence: the samples are random and independent of one another.
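
Two of these assumptions are often screened with standard tests before running ANOVA. The sketch below uses `scipy.stats.shapiro` (Shapiro-Wilk) as a common proxy check for normality within each group and `scipy.stats.levene` for homogeneity of variance. The scores are hypothetical, and small p-values would only suggest, not prove, that an assumption is violated.

```python
from scipy import stats

# Hypothetical test scores for three groups (illustrative values only)
group_a = [85, 92, 88, 90, 86]
group_b = [78, 82, 80, 75, 81]
group_c = [90, 88, 95, 91, 89]
groups = [group_a, group_b, group_c]

# Shapiro-Wilk test per group: H0 = the sample comes from a normal distribution
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Levene's test: H0 = all groups have equal population variances
levene_result = stats.levene(*groups)

print("Shapiro-Wilk p-values:", shapiro_pvalues)
print("Levene p-value:", levene_result.pvalue)
```

A p-value below the chosen significance level in either test is a signal to inspect the data more closely or to consider a more robust alternative.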


## Types of ANOVA

#### 1. One-Way ANOVA:

1. Situation: One-Way ANOVA is used when you have one factor with two or more levels and you want to compare the means of a continuous dependent variable across those groups.
2. Example: Comparing the average test scores of students across different grade levels (e.g., 9th grade, 10th grade, 11th grade, and 12th grade).

#### 2. Two-Way ANOVA:

1. Situation: Two-Way ANOVA is used when you have two factors and one continuous dependent variable. It allows you to examine the main effects of each independent variable and their interaction effect on the dependent variable.
2. Example: Investigating the effect of both gender (male vs. female) and treatment (drug A vs. drug B) on blood pressure levels.

#### 3. Three-Way ANOVA:

1. Situation: Three-Way ANOVA is used when you have three categorical independent variables (factors) and one continuous dependent variable. It allows you to examine the main effects of each independent variable, as well as their interactions.
2. Example: Analyzing the impact of factors like temperature (low, medium, high), humidity (low, medium, high), and time of day (morning, afternoon, evening) on crop yield.

#### 4. Repeated Measures ANOVA:

1. Design: One factor with at least two levels, where the levels are dependent because the same subjects contribute a measurement at every level.
2. Situation: Also known as within-subjects ANOVA, it is used when the same subjects are measured under different conditions or at different time points.
3. Example: Suppose a researcher wants to study the effect of different training interventions on the running speed of athletes over three consecutive days. On each day, the running speed of each athlete is recorded. With a repeated measures design, the researcher can assess the changes in running speed within each athlete across the three days, accounting for individual performance and potential improvements resulting from the training interventions.
    
#### 5. Factorial ANOVA:

1. Design: Two or more factors, each with at least two levels; the levels of a factor can be independent (between-subjects) or dependent (within-subjects).
2. Situation: Factorial ANOVA is used when two or more independent variables (factors) are studied simultaneously to determine their effects on the dependent variable.
3. Example: Suppose a researcher wants to investigate the effect of two independent variables, "Type of Exercise" and "Gender", on participants' heart rate, measured after different types of exercise.

    1. Factor A, Type of Exercise: Cardiovascular Exercise, Strength Training
    2. Factor B, Gender: Male, Female

   The 40 participants are divided into four groups:

    1. Group 1: Cardiovascular Exercise, Male
    2. Group 2: Cardiovascular Exercise, Female
    3. Group 3: Strength Training, Male
    4. Group 4: Strength Training, Female

   Participants engage in the assigned exercise, and their heart rate is recorded immediately afterward.

## Partitioning in ANOVA 

1.  Partitioning is the decomposition of the total variation in the data into components attributable to different sources of variation.
2.  In a one-way ANOVA, the partition involves three quantities:

    1. Between-Group Variation: This component represents the variation in the dependent variable that is due to differences between the groups or levels of the independent variable. It measures the effect of the independent variable on the dependent variable.

    2. Within-Group Variation: Also known as the residual or error variation, this component represents the variation in the dependent variable that cannot be explained by the independent variable. It reflects the random variation or noise in the data that is not accounted for by the model.

    3. Total Variation: This is the overall variation in the dependent variable, which is the sum of the between-group and within-group variations. It represents the total variability in the data.
    
3. The F-statistic is the ratio of the between-group mean square to the within-group mean square, where each mean square is the corresponding sum of squares divided by its degrees of freedom. It is used to test the statistical significance of the group differences: if the between-group variation is large relative to the within-group variation, it suggests that at least some group means differ significantly.
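
The partition can be written as SS_total = SS_between + SS_within, with F = (SS_between / (k - 1)) / (SS_within / (N - k)) for k groups and N observations. The following sketch computes these quantities by hand in plain Python, using the three-school test scores from Example 1 later in this document; the variable names are illustrative.

```python
# Test scores for three groups (same values as the one-way ANOVA example)
groups = {
    "A": [85, 92, 88],
    "B": [78, 82, 80],
    "C": [90, 88, 95],
}

def mean(values):
    return sum(values) / len(values)

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# Between-group sum of squares: group size times squared distance of
# each group mean from the grand mean
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())

# Within-group sum of squares: squared distance of each score from its own group mean
ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

# Total sum of squares: squared distance of each score from the grand mean
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# Degrees of freedom and the F-statistic (ratio of mean squares)
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print("SS between:", ss_between)
print("SS within:", ss_within)
print("SS total:", ss_total)
print("F-statistic:", f_statistic)
```

The F-statistic computed this way should agree with the value reported by `stats.f_oneway` for the same data.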

## Main Effect and Interaction Effect 

Main effects and interaction effects are important concepts for understanding the relationships between the independent variables (factors) and the dependent variable.

#### 1. Main Effect:

1. A main effect in ANOVA is the effect of a single independent variable on the dependent variable, averaged over the levels of the other independent variables.
2. It represents the average difference in the dependent variable across the levels of that particular independent variable.
3. For example, in a two-way ANOVA with independent variables A and B, the main effect of A represents the difference in the dependent variable between the levels of A, regardless of the levels of B.

#### 2. Interaction Effect:

1. An interaction effect in ANOVA occurs when the effect of one independent variable on the dependent variable depends on the level of another independent variable.
2. In other words, the relationship between the independent variables and the dependent variable is not simply additive but varies with the combination of levels of the independent variables.
3. For example, in a two-way ANOVA, an interaction effect between variables A and B means that the effect of A on the dependent variable differs across the levels of B, and vice versa.
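
To make the distinction concrete, the sketch below works through a hypothetical 2x2 design with equal cell sizes. The cell means are invented for illustration. The main effect of a factor is the difference between its marginal means, and the interaction is the difference between the simple effects of A at each level of B.

```python
# Hypothetical cell means for a 2x2 design (factor A x factor B), equal cell sizes
cell_means = {
    ("A1", "B1"): 10.0, ("A1", "B2"): 12.0,
    ("A2", "B1"): 14.0, ("A2", "B2"): 22.0,
}

def marginal_mean(factor_index, level):
    """Average the cell means over the levels of the other factor."""
    values = [m for key, m in cell_means.items() if key[factor_index] == level]
    return sum(values) / len(values)

# Main effects: differences between marginal means
main_effect_a = marginal_mean(0, "A2") - marginal_mean(0, "A1")
main_effect_b = marginal_mean(1, "B2") - marginal_mean(1, "B1")

# Simple effects of A at each level of B
effect_a_at_b1 = cell_means[("A2", "B1")] - cell_means[("A1", "B1")]
effect_a_at_b2 = cell_means[("A2", "B2")] - cell_means[("A1", "B2")]

# Interaction contrast: zero would mean the effect of A is the same at both levels of B
interaction_contrast = effect_a_at_b2 - effect_a_at_b1

print("Main effect of A:", main_effect_a)
print("Main effect of B:", main_effect_b)
print("Effect of A at B1:", effect_a_at_b1)
print("Effect of A at B2:", effect_a_at_b2)
print("Interaction contrast:", interaction_contrast)
```

In this made-up example the effect of A is larger at B2 than at B1, which is the pattern an interaction term is designed to detect. Whether such a difference is statistically significant depends on the within-cell variation, which this sketch does not model.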

## Post Hoc Tests in ANOVA

### Importance
Post hoc tests follow a significant ANOVA result; they provide additional insight by clarifying which groups differ and by how much.
1. Identifying Specific Group Differences: Post hoc tests allow for pairwise comparisons between groups, helping to identify the specific groups that exhibit significant differences.

2. Avoiding Type I Errors: Post hoc tests adjust the p-values or confidence intervals for multiple comparisons, controlling the overall experimentwise error rate and reducing the risk of incorrectly concluding significant differences. 

3. Providing Detailed Insights: They offer mean differences, confidence intervals, and p-values for pairwise comparisons, allowing for a more detailed understanding of the magnitude and direction of the differences between groups.

4. Supporting Decision-Making: In various fields such as medicine, education, and business, post hoc tests assist in making informed decisions based on the specific group differences observed. 

5. Enhancing Interpretability: Pinpointing which groups differ makes a significant ANOVA result easier to interpret and report.

### Types of Post Hoc Tests

1. Tukey's Honestly Significant Difference (HSD) Test: This widely used test compares all possible pairs of means while controlling the familywise error rate.
2. Bonferroni Correction: This approach adjusts the significance level for each individual comparison, dividing it by the number of comparisons, to control the familywise error rate.
3. Dunnett's Test: This test is used when comparing multiple treatment groups against a control group. It controls the overall experimentwise error rate by comparing each treatment group to the control group while accounting for the correlation among the comparisons.
4. Fisher's Least Significant Difference (LSD) Test: The LSD test compares pairs of means using the standard error of the differences. It applies no adjustment for multiple comparisons, so it is the least conservative option listed here.
5. Scheffé's Test: This conservative test can be used for both planned and unplanned comparisons, including complex contrasts among more than two means.
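
As an illustration of the Bonferroni approach, the sketch below runs pairwise independent-samples t-tests with `scipy.stats.ttest_ind` and compares each p-value against the significance level divided by the number of comparisons. The data are the sales-revenue values from Example 5 later in this document. Pairwise t-tests are one common choice for the underlying comparison, not the only one.

```python
from itertools import combinations
from scipy import stats

# Sales revenue per marketing strategy (values from the factorial ANOVA example)
groups = {
    "A": [100, 110, 95, 120, 105, 115],
    "B": [90, 85, 95, 80, 75, 100],
    "C": [120, 130, 110, 135, 125, 130],
}

pairs = list(combinations(groups, 2))  # all pairwise comparisons
alpha = 0.05
adjusted_alpha = alpha / len(pairs)    # Bonferroni-adjusted significance level

results = {}
for g1, g2 in pairs:
    t_stat, p_value = stats.ttest_ind(groups[g1], groups[g2])
    # Equivalent view: multiply the p-value by the number of comparisons (capped at 1)
    p_adjusted = min(p_value * len(pairs), 1.0)
    reject = p_value < adjusted_alpha
    results[(g1, g2)] = (p_value, p_adjusted, reject)
    print(f"{g1} vs {g2}: p = {p_value:.4f}, adjusted p = {p_adjusted:.4f}, reject = {reject}")
```

With three comparisons, each test is evaluated at 0.05 / 3 rather than 0.05, which keeps the familywise error rate at or below 0.05.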

#### Example 1: One-Way ANOVA
A researcher wants to compare the average test scores of students from three different schools (School A, School B, and School C) to determine if there are any significant differences. The researcher collects test scores from a random sample of students from each school. Conduct a one-way ANOVA using Python to determine if there are any significant differences in test scores between the three schools.

In [1]:
import pandas as pd
import scipy.stats as stats

# Create a DataFrame with the test scores and corresponding schools
data = {'School': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'TestScore': [85, 92, 88, 78, 82, 80, 90, 88, 95]}
df = pd.DataFrame(data)

# Perform one-way ANOVA
anova_result = stats.f_oneway(df[df['School'] == 'A']['TestScore'],
                              df[df['School'] == 'B']['TestScore'],
                              df[df['School'] == 'C']['TestScore'])

# Print the ANOVA result
print("One-Way ANOVA Result:")
print("F-Statistic:", anova_result.statistic)
print("p-value:", anova_result.pvalue)


One-Way ANOVA Result:
F-Statistic: 10.102272727272725
p-value: 0.012003941180973709


Because the p-value (about 0.012) is less than 0.05, we reject the null hypothesis of equal means and conclude that mean test scores differ significantly between at least two of the schools.

#### Example 2: Two-Way ANOVA

A researcher wants to investigate the effect of two factors, gender and age group, on performance in a cognitive task. Participants are sampled from four groups: young males, old males, young females, and old females, and each participant's performance score is recorded. Conduct a two-way ANOVA using Python to determine whether gender and age group have main effects or an interaction effect on the performance scores.

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the performance scores, gender, and age group
data = {'Performance': [78, 84, 70, 76, 92, 88, 65, 70, 85, 82, 79, 80],
        'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female',
                   'Male', 'Male', 'Female', 'Female'],
        'AgeGroup': ['Young', 'Old', 'Young', 'Old', 'Young', 'Old', 'Young', 'Old',
                     'Young', 'Old', 'Young', 'Old']}
df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Performance ~ Gender + AgeGroup + Gender:AgeGroup', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print("Two-Way ANOVA Result:")
print(anova_table)


Two-Way ANOVA Result:
                     sum_sq   df          F    PR(>F)
Gender           396.750000  1.0  11.843284  0.008806
AgeGroup          10.083333  1.0   0.300995  0.598233
Gender:AgeGroup   14.083333  1.0   0.420398  0.534902
Residual         268.000000  8.0        NaN       NaN


1. Gender: p ≈ 0.009, which is below 0.05, so the main effect of gender on performance scores is statistically significant.
2. AgeGroup: p ≈ 0.598, which is above 0.05, so there is no significant difference in performance scores between age groups.
3. Gender:AgeGroup interaction: p ≈ 0.535, which is above 0.05, so there is no significant interaction effect.

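#### Example 3: Three-Way ANOVA

This example models yield as a function of treatment, gender, and age group, then applies Tukey's HSD test to the treatment groups.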

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data
data = {
    'Yield': [10, 12, 8, 11, 9, 13, 7, 10, 11, 9, 14, 12, 8, 10, 13],
    'Treatment': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Gender': ['Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'Age_Group': ['Young', 'Young', 'Young', 'Young', 'Young', 'Young', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Young', 'Young', 'Young']
}

df = pd.DataFrame(data)

# Perform three-way ANOVA
formula = 'Yield ~ C(Treatment) + C(Gender) + C(Age_Group) + C(Treatment):C(Gender):C(Age_Group)'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


                                        sum_sq   df         F    PR(>F)
C(Treatment)                         16.533333  2.0  1.503030  0.353018
C(Gender)                             7.626984  1.0  1.386724  0.323874
C(Age_Group)                          0.126984  1.0  0.023088  0.888872
C(Treatment):C(Gender):C(Age_Group)  19.061905  7.0  0.495114  0.800169
Residual                             16.500000  3.0       NaN       NaN


In [9]:
import statsmodels.stats.multicomp as mc
# Perform Tukey's HSD test
mc_results = mc.MultiComparison(df['Yield'], df['Treatment']).tukeyhsd()

# Create a DataFrame from the Tukey's HSD summary table (its first row holds the column names)
tukey_table = mc_results.summary().data
tukey_results = pd.DataFrame(data=tukey_table[1:], columns=tukey_table[0])

# Print the Tukey's HSD test results
print(tukey_results)

  group1 group2  meandiff   p-adj   lower   upper  reject
0      A      B       2.0  0.2573 -1.2014  5.2014   False
1      A      C       2.4  0.1546 -0.8014  5.6014   False
2      B      C       0.4  0.9409 -2.8014  3.6014   False


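1. All p-values are greater than 0.05, so there is no significant effect of treatment, gender, or age group on yield. Because the formula omits the two-way interactions, the `C(Treatment):C(Gender):C(Age_Group)` row (7 degrees of freedom) absorbs them and is not a pure three-way interaction test.
2. In the Tukey result, the `reject` column is `False` for every pair, indicating no significant difference in yield between any two treatments.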

#### Example 4: Repeated Measures ANOVA

In [5]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

#create data
df = pd.DataFrame({'patient': np.repeat([1, 2, 3, 4, 5], 4),
                   'drug': np.tile([1, 2, 3, 4], 5),
                   'response': [30, 28, 16, 34,
                                14, 18, 10, 22,
                                24, 20, 18, 30,
                                38, 34, 20, 44, 
                                26, 28, 14, 30]})

#perform the repeated measures ANOVA
print(AnovaRM(data=df, depvar='response', subject='patient', within=['drug']).fit())


              Anova
     F Value Num DF  Den DF Pr > F
----------------------------------
drug 24.7589 3.0000 12.0000 0.0000



A one-way repeated measures ANOVA was conducted on 5 patients to examine the effect of four different drugs on response time.
Since the p-value is less than 0.05, we reject the null hypothesis and conclude that mean response time differs significantly between the four drugs.
The type of drug used led to statistically significant differences in response time (F(3, 12) = 24.76, p < 0.001).

#### Example 5: Factorial ANOVA

A company wants to evaluate the effectiveness of three different marketing strategies (A, B, and C) in increasing sales revenue. They randomly select 60 customers and divide them equally into three groups. Each group is exposed to one of the marketing strategies (A, B, or C). After one month, they record the sales revenue generated by each customer. The company wants to determine if there is a significant difference in sales revenue among the three marketing strategies. The code below uses six revenue values per strategy for illustration. Because this design has a single factor, the model is equivalent to a one-way ANOVA; a factorial analysis would add further factors to the formula.

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sales revenue data for each marketing strategy
strategy_A = [100, 110, 95, 120, 105, 115]
strategy_B = [90, 85, 95, 80, 75, 100]
strategy_C = [120, 130, 110, 135, 125, 130]

# Creating a DataFrame for the data
data = pd.DataFrame({
    'Strategy': ['A'] * len(strategy_A) + ['B'] * len(strategy_B) + ['C'] * len(strategy_C),
    'Sales_Revenue': strategy_A + strategy_B + strategy_C
})

# Fitting the ANOVA model (Strategy is the only factor, so this is equivalent to a one-way ANOVA)
model = ols('Sales_Revenue ~ Strategy', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Printing the ANOVA table
print("Factorial ANOVA Results:")
print(anova_table)


Factorial ANOVA Results:
          sum_sq    df          F    PR(>F)
Strategy  4225.0   2.0  24.852941  0.000017
Residual  1275.0  15.0        NaN       NaN


Since the p-value is less than the significance level (0.05), there is a significant difference in sales revenue among the three marketing strategies (A, B, and C).

In [4]:
from statsmodels.stats.multicomp import MultiComparison
# Performing Tukey's HSD test
mc = MultiComparison(data['Sales_Revenue'], data['Strategy'])
tukey_result = mc.tukeyhsd()

# Printing the Tukey's HSD test results
print("Tukey's HSD Test Results:")
print(tukey_result)

Tukey's HSD Test Results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B    -20.0 0.0051 -33.8261 -6.1739   True
     A      C     17.5  0.013   3.6739 31.3261   True
     B      C     37.5    0.0  23.6739 51.3261   True
-----------------------------------------------------


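All three pairwise comparisons reject the null hypothesis, so the mean sales revenue of each strategy differs significantly from that of the other two. Strategy C has the highest mean revenue and strategy B the lowest.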

#### Example 6: Partitioning in ANOVA

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sales revenue data for each marketing strategy
strategy_A = [100, 110, 95, 120, 105, 115]
strategy_B = [90, 85, 95, 80, 75, 100]
strategy_C = [120, 130, 110, 135, 125, 130]

# Creating a DataFrame for the data
data = pd.DataFrame({
    'Strategy': ['A'] * len(strategy_A) + ['B'] * len(strategy_B) + ['C'] * len(strategy_C),
    'SalesRevenue': strategy_A + strategy_B + strategy_C
})

# Performing the partitioning in ANOVA
model = ols('SalesRevenue ~ Strategy', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extracting the sum of squares
ss_total = anova_table['sum_sq']['Residual'] + anova_table['sum_sq']['Strategy']
ss_strategy = anova_table['sum_sq']['Strategy']
ss_residual = anova_table['sum_sq']['Residual']

# Computing the proportion of variance explained by the strategy (eta squared)
prop_var_strategy = ss_strategy / ss_total

# Computing the proportion of variance unexplained (residual variance)
prop_var_residual = ss_residual / ss_total

# Printing the partitioning results
print("Partitioning Results:")
print("Total Sum of Squares:", ss_total)
print("Sum of Squares - Strategy:", ss_strategy)
print("Sum of Squares - Residual:", ss_residual)
print("Proportion of Variance - Strategy:", prop_var_strategy)
print("Proportion of Variance - Residual:", prop_var_residual)


Partitioning Results:
Total Sum of Squares: 5499.999999999995
Sum of Squares - Strategy: 4224.999999999995
Sum of Squares - Residual: 1275.0
Proportion of Variance - Strategy: 0.768181818181818
Proportion of Variance - Residual: 0.231818181818182


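The proportion of variance explained by the strategy (about 0.768, a quantity also known as eta squared) is the share of the total variability in the dependent variable (sales revenue) that is accounted for by the marketing strategy. The remaining share (about 0.232) is residual variation that the strategy does not explain.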