## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Answer:

ANOVA makes the following assumptions:

Independence of observations: Each data point should be collected independently.

Violation Example: Repeated measurements on the same subject analyzed as if they were independent observations.

Normality: The residuals (differences from the group mean) should be approximately normally distributed.

Violation Example: The data are highly skewed or contain extreme outliers, especially with small group sizes.

Homogeneity of variances (Homoscedasticity): The variance within each group should be approximately equal across groups.

Violation Example: One group has much larger spread than the others.
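
The normality and equal-variance assumptions can be checked before running ANOVA. The sketch below is one possible way to do it, using `scipy.stats.shapiro` on the residuals and `scipy.stats.levene` on the groups. The group values are illustrative only. Independence cannot be tested this way, since it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Illustrative scores for three groups (hypothetical values)
groups = {
    "A": np.array([20, 22, 23, 21, 24], dtype=float),
    "B": np.array([28, 27, 29, 26, 30], dtype=float),
    "C": np.array([35, 33, 32, 36, 34], dtype=float),
}

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# Shapiro-Wilk test: H0 = the residuals come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene's test: H0 = all groups have equal variance
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value (for example below 0.05) in either test suggests the corresponding assumption may be violated. With very small samples these tests have little power, so a plot of the residuals is a useful complement.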

## Q2. What are the three types of ANOVA, and in what situations would each be used?

Answer:

One-Way ANOVA: Compares means across one independent variable with multiple levels.

Use case: Comparing average scores of students across different classes.

Two-Way ANOVA: Compares means across two independent variables and can test for interaction effects.

Use case: Analyzing the effect of teaching method and gender on performance.

Repeated Measures ANOVA: Used when the same subjects are measured multiple times under different conditions.

Use case: Measuring blood pressure before, during, and after medication.



## Q3. What is the partitioning of variance in ANOVA, and why is it important?

Answer:

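In ANOVA, the total variability in the data is partitioned into a between-group part and a within-group part, so that SST = SSB + SSW:

SST (Total Sum of Squares): Total variation of the observations around the grand mean.

SSB (Sum of Squares Between, also called the explained sum of squares, SSE): Variation due to differences between group means.

SSW (Sum of Squares Within, also called the residual sum of squares, SSR): Variation of the observations around their own group mean.

The abbreviations vary between textbooks. Many use SSE for the error (within-group) sum of squares, so SSB and SSW are the less ambiguous labels.

The partition is important because the F-statistic is the ratio of the between-group mean square to the within-group mean square. A large ratio indicates that the group means differ by more than within-group noise alone would explain.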
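
In symbols, the partition can be written as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{y}_j$ and grand mean $\bar{y}$.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}\right)^2 \\
\mathrm{SSB} &= \sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2 \\
\mathrm{SSW} &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2 \\
\mathrm{SST} &= \mathrm{SSB} + \mathrm{SSW} \\
F &= \frac{\mathrm{SSB} / (k - 1)}{\mathrm{SSW} / (N - k)}
\end{aligned}
```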

## Q4. How would you calculate SST, SSE, and SSR in a one-way ANOVA using Python?

In [8]:
import pandas as pd
import scipy.stats as stats
import numpy as np

# Sample Data
data = {
    'Group': ['A']*5 + ['B']*5 + ['C']*5,
    'Scores': [20, 22, 23, 21, 24, 28, 27, 29, 26, 30, 35, 33, 32, 36, 34]
}
df = pd.DataFrame(data)

# Overall mean
grand_mean = df['Scores'].mean()

# Total Sum of Squares (SST)
SST = sum((df['Scores'] - grand_mean)**2)

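# Between-group sum of squares (SSB): weight each group mean by its own size,
# so the formula also holds when group sizes are unequal
group_means = df.groupby('Group')['Scores'].mean()
group_sizes = df.groupby('Group')['Scores'].count()
SSB = (group_sizes * (group_means - grand_mean)**2).sum()

# Within-group sum of squares (SSW), from the identity SST = SSB + SSW
SSW = SST - SSB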

print(f"SST: {SST}, SSB: {SSB}, SSW: {SSW}")


SST: 390.0, SSB: 360.0, SSW: 30.0


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [11]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the ToothGrowth dataset (columns used below: len, supp, dose)
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/ToothGrowth.csv")

# Two-way ANOVA: len ~ supp + dose + supp:dose
model = ols('len ~ C(supp) + C(dose) + C(supp):C(dose)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude?

Answer:

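At a 5% significance level (α = 0.05), the p-value of 0.02 is below 0.05, so we reject the null hypothesis that all group means are equal.

Conclusion: At least one group mean differs significantly from the others. The F-test alone does not show which groups differ; a post-hoc test is needed for that.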
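
The link between the F-statistic and the p-value can be illustrated with `scipy.stats.f`. The question does not give the degrees of freedom, so the values 2 and 14 below are purely hypothetical. They were picked only because they give a p-value close to the stated 0.02.

```python
from scipy import stats

f_stat = 5.23
alpha = 0.05

# ASSUMPTION: degrees of freedom are not stated in the question.
# 2 and 14 are hypothetical (e.g. 3 groups, 17 observations in total).
df_between, df_within = 2, 14

# Survival function: P(F >= f_stat) when the null hypothesis is true
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F-statistic must exceed this to reject at level alpha
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print(f"p-value: {p_value:.4f}, critical F: {f_crit:.2f}, reject H0: {reject_null}")
```

Comparing the p-value with α and comparing the F-statistic with the critical value are equivalent decision rules.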



## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences?

Answer:

Methods to handle missing data:

Listwise Deletion: Remove any subject with missing values.

Imputation: Fill in missing data (e.g., mean imputation, regression imputation).

Mixed Models: Fit a linear mixed-effects model, which uses all available observations and gives valid estimates when the data are missing at random (MAR).

Consequences:

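Biased results: Listwise deletion and simple imputation can bias the estimates unless the data are missing completely at random, and mean imputation also understates the variance.

Reduced power: Listwise deletion discards every measurement from a subject with any missing value, which shrinks the sample.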
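
The first two methods can be sketched with pandas as follows. The blood-pressure values are hypothetical and the column names are assumptions made for illustration. The last two prints show how mean imputation shrinks the spread of a column.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "before": [140.0, 152.0, 138.0, 147.0, 150.0],
    "during": [135.0, np.nan, 132.0, 141.0, 144.0],
    "after": [130.0, 145.0, np.nan, 136.0, 139.0],
})
measures = ["before", "during", "after"]

# Listwise deletion: drop every subject with at least one missing measurement
complete_cases = wide.dropna(subset=measures)

# Mean imputation: replace each missing value with the mean of its column
imputed = wide.copy()
imputed[measures] = imputed[measures].fillna(imputed[measures].mean())

print(f"Subjects before deletion: {len(wide)}, after: {len(complete_cases)}")
print(f"Std of 'during' (observed only): {wide['during'].std():.2f}")
print(f"Std of 'during' (mean-imputed):  {imputed['during'].std():.2f}")
```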

## Q8. What are some common post-hoc tests used after ANOVA?

Answer:

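Tukey's HSD: Compares all pairs of group means while controlling the family-wise error rate.

Bonferroni Correction: Multiplies each p-value by the number of comparisons (equivalently, divides α by it); simple but conservative when there are many comparisons.

Scheffe's Test: More conservative than Tukey's HSD for pairwise comparisons; covers all possible contrasts and works with unequal group sizes.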
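
As a minimal sketch, Tukey's HSD can be run with `scipy.stats.tukey_hsd`, which assumes SciPy 1.8 or newer. The three groups reuse the sample scores from Q4. The `labels` list is only a display helper added for this example.

```python
from scipy.stats import tukey_hsd

# Sample scores from Q4, one list per group
group_a = [20, 22, 23, 21, 24]
group_b = [28, 27, 29, 26, 30]
group_c = [35, 33, 32, 36, 34]

# Pairwise comparisons of all group means with family-wise error control
result = tukey_hsd(group_a, group_b, group_c)

labels = ["A", "B", "C"]  # display names, in the order the groups were passed
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        mean_diff = result.statistic[i, j]  # mean of group i minus mean of group j
        p_adj = result.pvalue[i, j]         # p-value adjusted for all pairs
        print(f"{labels[i]} vs {labels[j]}: diff = {mean_diff:.1f}, p = {p_adj:.4f}")
```

Each pair with an adjusted p-value below α is declared significantly different.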

## Q9. One-way ANOVA in Python – Comparing 3 Diets

In [19]:
from scipy.stats import f_oneway

# Sample Data
diet_A = [5, 7, 6, 8, 9]
diet_B = [4, 3, 6, 5, 4]
diet_C = [8, 9, 7, 8, 10]

# One-way ANOVA
f_stat, p_value = f_oneway(diet_A, diet_B, diet_C)
print(f"F-statistic: {f_stat:.2f}, P-value: {p_value:.4f}")


F-statistic: 12.12, P-value: 0.0013


## Q10. Two-way ANOVA – Software Program & Experience

In [31]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Illustrative data: 3 programs (A, B, C) × 2 experience levels × 5 repetitions = 30 rows
programs = ['A', 'B', 'C']
experiences = ['Novice', 'Experienced']

data = []

# Hand-picked values assigned in order, 5 per (program, experience) cell;
# the list holds 36 values but only the first 30 are used
values = [22, 24, 23, 25, 21, 20, 22, 21, 19, 20, 18, 19] * 3
i = 0

for program in programs:
    for exp in experiences:
        for _ in range(5):  # 5 repetitions
            data.append({'Program': program, 'Experience': exp, 'Time': values[i]})
            i += 1

df = pd.DataFrame(data)

# Two-way ANOVA
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_result = sm.stats.anova_lm(model, typ=2)
print(anova_result)


                             sum_sq    df         F    PR(>F)
C(Program)                 1.866667   2.0  0.269231  0.766244
C(Experience)              0.833333   1.0  0.240385  0.628381
C(Program):C(Experience)  39.466667   2.0  5.692308  0.009479
Residual                  83.200000  24.0       NaN       NaN


## Q11. Two-sample t-test for test score comparison

In [25]:
from scipy.stats import ttest_ind

# Sample Data
control = [70, 72, 68, 71, 69]
experimental = [75, 78, 80, 76, 77]

t_stat, p_val = ttest_ind(control, experimental)
print(f"T-statistic: {t_stat:.2f}, P-value: {p_val:.4f}")


T-statistic: -6.47, P-value: 0.0002


## Q12. Repeated Measures ANOVA – Daily Sales of 3 Stores

In [29]:
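import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated data: each 'Subject' row is one unit (here, a day) observed at all three stores.
# No random seed is set, so the output below changes from run to run.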
df = pd.DataFrame({
    'Subject': list(range(1, 31)),
    'StoreA': np.random.normal(500, 20, 30),
    'StoreB': np.random.normal(520, 20, 30),
    'StoreC': np.random.normal(510, 20, 30)
})

melted = pd.melt(df, id_vars=['Subject'], value_vars=['StoreA', 'StoreB', 'StoreC'],
                 var_name='Store', value_name='Sales')

anova = AnovaRM(melted, 'Sales', 'Subject', within=['Store']).fit()
print(anova)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  2.2704 2.0000 58.0000 0.1124

