### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
Ans: \
ANOVA is a statistical method used to compare the means of three or more groups. To ensure valid results, certain assumptions must be met:

 1. Independence of Observations\
Definition: Observations must be independent of one another, both within and across groups (i.e., one observation should not influence another).

Example of Violation: Repeated measurements from the same subjects across groups or time.

 2. Normality\
Definition: The data within each group (equivalently, the model residuals) should be approximately normally distributed.

Example of Violation: If the data in any group is heavily skewed or has outliers, it violates this assumption.
 Note: ANOVA is fairly robust to minor deviations from normality when sample sizes are equal and large.

 3. Homogeneity of Variances (Homoscedasticity)\
Definition: All groups should have similar variances.

Example of Violation: One group has a much larger spread of values (standard deviation) than others.

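You can test this using Levene’s Test or Bartlett’s Test. Levene’s Test is the safer choice when normality is in doubt, because Bartlett’s Test is sensitive to non-normal data.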
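
The normality and equal-variance checks above can be sketched with SciPy. This is a minimal illustration on simulated data: the group names, sizes, and distribution parameters are assumptions, not values from the assignment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 30 observations each.
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=15, size=30)  # deliberately larger spread

# Normality within each group (Shapiro-Wilk; H0: the data are normal).
for name, values in [("A", group_a), ("B", group_b), ("C", group_c)]:
    w_stat, p_normal = stats.shapiro(values)
    print(f"Group {name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_normal:.3f}")

# Homogeneity of variances (H0: all group variances are equal).
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
bartlett_stat, bartlett_p = stats.bartlett(group_a, group_b, group_c)
print(f"Levene:   W = {levene_stat:.3f}, p = {levene_p:.4f}")
print(f"Bartlett: T = {bartlett_stat:.3f}, p = {bartlett_p:.4f}")
```

A small p-value from Levene's or Bartlett's test suggests the equal-variance assumption is violated. Independence cannot be tested this way; it has to be judged from the study design.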

### Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans: \
### Types of ANOVA and When to Use Them

There are **three main types of ANOVA** (Analysis of Variance), each designed for different experimental designs and research questions:

---

###  1. **One-Way ANOVA**
- **Purpose**: Compares the means of **three or more independent groups** based on **one independent variable (factor)**.
- **Use Case**: Testing whether different teaching methods affect student scores.

####  Example:
You want to compare test scores of students taught using:
- Method A  
- Method B  
- Method C

> **Use One-Way ANOVA** to test if there is a significant difference in average scores among the three methods.

---

###  2. **Two-Way ANOVA (Factorial ANOVA)**
- **Purpose**: Examines the influence of **two independent variables** on a dependent variable and **checks for interaction** between them.
- **Use Case**: Testing how **teaching method** and **gender** affect student scores.

####  Example:
You test students by:
- Gender: Male, Female  
- Method: A, B, C  
And want to know:
- Is there a difference in scores by gender?
- Is there a difference by method?
- Does gender × method have a combined effect?

> **Use Two-Way ANOVA** when you're analyzing two categorical factors and their interaction.

---

###  3. **Repeated Measures ANOVA**
- **Purpose**: Used when the **same subjects are measured multiple times** (within-subjects design).
- **Use Case**: Testing if a workout program improves fitness over time in the **same group of participants**.

####  Example:
Measure heart rate of individuals:
- Before training  
- After 4 weeks  
- After 8 weeks  

> **Use Repeated Measures ANOVA** to assess changes across time or conditions within the same subjects.

---

###  Summary Table:

| **Type of ANOVA**          | **Factors** | **Subjects**              | **Use Case**                                      |
|---------------------------|-------------|---------------------------|--------------------------------------------------|
| One-Way ANOVA             | 1           | Independent groups        | Comparing means of 3+ groups                     |
| Two-Way ANOVA             | 2           | Independent groups        | Comparing means based on 2 factors + interaction |
| Repeated Measures ANOVA   | 1+          | Same group, multiple times| Tracking changes in same subjects over time      |

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans: \
**Partitioning of variance** in ANOVA refers to how the **total variation** in the data is broken down into components that help us understand the sources of that variation.

---

### Breakdown of Variance in ANOVA

In ANOVA, the **total variance** is split into:

1. **Between-Group Variance (SSB or SSA)**  
   → Variation **between the means** of different groups.  
   → Captures the effect of the treatment or independent variable.

2. **Within-Group Variance (SSW or SSE)**  
   → Variation **within each group** (i.e., due to individual differences, error, or noise).  
   → Also called **error variance**.

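3. **Total Variance (SST)**  
   → The sum of both. Strictly speaking, SSB, SSW, and SST are sums of squares; dividing each by its degrees of freedom gives the variance estimates (mean squares) used in the F-ratio:  
   $$
   \text{SST} = \text{SSB} + \text{SSW}
   $$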

---

### Formula Recap

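- **Total Sum of Squares (SST)**:  
  $$
  SST = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2
  $$
  
- **Between-Group Sum of Squares (SSB)**:  
  $$
  SSB = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2
  $$

- **Within-Group Sum of Squares (SSW)**:  
  $$
  SSW = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2
  $$

Here $X_{ij}$ is observation $i$ in group $j$, $\bar{X}_j$ is the mean of group $j$, $\bar{X}$ is the grand mean, $n_j$ is the size of group $j$, and $k$ is the number of groups.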

---

###  Why It’s Important

1. **Helps Identify the Source of Variability**:  
   By splitting variance, we can isolate whether differences in the dependent variable are **due to the treatment** or just **random error**.

2. **F-Ratio Calculation**:  
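   The F-statistic in ANOVA is:
   $$
   F = \frac{\text{Mean Square Between}}{\text{Mean Square Within}} = \frac{MSB}{MSW}, \qquad MSB = \frac{SSB}{k - 1}, \quad MSW = \frac{SSW}{N - k}
   $$
   where $k$ is the number of groups and $N$ is the total number of observations. A **high F-ratio** suggests that group means differ more than we'd expect by chance.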

3. **Guides Hypothesis Testing**:  
   If between-group variance is significantly higher than within-group variance, we **reject the null hypothesis** (that all group means are equal).
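
The partition and the F-ratio can be checked by hand. The sketch below uses made-up scores for three groups (illustrative values only), computes SST, SSB, and SSW directly from the formulas, and compares the resulting F with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (illustrative values only).
groups = [
    np.array([85.0, 88.0, 90.0, 92.0]),
    np.array([78.0, 80.0, 83.0, 79.0]),
    np.array([70.0, 75.0, 72.0, 71.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Sums of squares, following the formulas above.
sst = np.sum((all_values - grand_mean) ** 2)
ssb = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and F-ratio.
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_manual = msb / msw

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SST = {sst:.3f}, SSB + SSW = {ssb + ssw:.3f}")
print(f"F (manual) = {f_manual:.3f}, F (SciPy) = {f_scipy:.3f}, p = {p_scipy:.5f}")
```

The two printed F values should agree, and SST should equal SSB + SSW up to floating-point rounding.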

---

###  Real-World Analogy

Imagine you're comparing the test scores of students from 3 schools:
- **Between-group variance** reflects how much the **school averages differ from one another** (for example, because of differences in teaching quality).
- **Within-group variance** reflects how much **students in the same school vary due to individual factors**.

Understanding this helps you decide whether **the school (group)** truly makes a difference.

#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
Ans: \
Main Effects and Interaction Effect:
* Main Effect A: Impact of the first independent variable (e.g., Treatment).

* Main Effect B: Impact of the second independent variable (e.g., Gender).

* Interaction Effect (A × B): Whether the effect of one variable depends on the level of the other.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    'score': [88, 92, 85, 95, 70, 65, 78, 82, 80, 85, 75, 90],
    'teaching_method': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'M']
})

# Two-way ANOVA model with interaction
model = ols('score ~ C(teaching_method) * C(gender)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```
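
To see what the main effects and the interaction look like in the data itself, it can help to inspect the marginal and cell means alongside the ANOVA table. A minimal sketch using the same sample data (the variable names `method_means`, `gender_means`, `cell_means`, and `method_gap_by_gender` are introduced here for illustration):

```python
import pandas as pd

# Same sample data as in the ANOVA example.
data = pd.DataFrame({
    'score': [88, 92, 85, 95, 70, 65, 78, 82, 80, 85, 75, 90],
    'teaching_method': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'gender': ['M', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'F', 'M']
})

# Marginal means: these differences are what the main effects test.
method_means = data.groupby('teaching_method')['score'].mean()
gender_means = data.groupby('gender')['score'].mean()

# Cell means: one mean per method x gender combination.
cell_means = data.pivot_table(values='score', index='teaching_method',
                              columns='gender', aggfunc='mean')

# If the A-minus-B gap is very different for F and M, that points to an interaction.
method_gap_by_gender = cell_means.loc['A'] - cell_means.loc['B']

print(method_means, gender_means, cell_means, method_gap_by_gender, sep='\n\n')
```

The F-tests in the ANOVA table then tell you whether these observed differences are larger than expected from within-cell noise.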

#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
Ans: \
If you conducted a **one-way ANOVA** and obtained:

- **F-statistic = 5.23**  
- **p-value = 0.02**

And you’re testing at a **significance level (α) of 0.05**, here’s how to interpret the results:

---

### **Conclusion:**
Since the **p-value (0.02) < α (0.05)**, you **reject the null hypothesis**.

---

### **Interpretation:**

- The **null hypothesis** in one-way ANOVA states that **all group means are equal**.
- A **p-value of 0.02** indicates that there's only a **2% chance** of observing such an F-statistic (or more extreme) if the null hypothesis were true.
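- Therefore, you have **statistically significant evidence** to suggest that **at least one group mean is different** from the others.
- The F-test alone does not say **which** groups differ or how large the differences are; a post-hoc test (e.g., Tukey's HSD) is needed to identify the specific pairs.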

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
Ans: \
In **repeated measures ANOVA**, handling **missing data** is important because the same subjects are measured multiple times, and missing values can disrupt the within-subject comparisons. Here's how you can handle it and the consequences of different methods:

---

### **Ways to Handle Missing Data:**

1. **Listwise Deletion (Complete Case Analysis):**
   - Exclude subjects who have *any* missing values across the time points.
   - **Pros:** Easy to implement.
   - **Cons:** Reduces sample size and power; can bias results if data is not Missing Completely At Random (MCAR).

2. **Mean/Median Imputation:**
   - Replace missing values with the mean or median of that variable.
   - **Pros:** Simple and quick.
   - **Cons:** Underestimates variability; can bias estimates.

3. **Last Observation Carried Forward (LOCF):**
   - Use the participant’s last available value to fill in the missing one.
   - **Pros:** Maintains time-based structure.
   - **Cons:** Assumes stability, which may not be valid; can introduce bias.

4. **Multiple Imputation:**
   - Generates multiple plausible values based on the distribution of the data and combines results.
   - **Pros:** More accurate; maintains variability.
   - **Cons:** More complex and computationally intensive.

5. **Mixed-Effects Models / Linear Mixed Models (LMM):**
   - These models inherently handle missing data under the **Missing At Random (MAR)** assumption.
   - **Pros:** Robust, doesn't discard data unnecessarily.
   - **Cons:** Requires more advanced statistical knowledge.

---

### **Consequences of Each Method:**

| Method                     | Accuracy        | Bias Risk        | Sample Size Impact |
|---------------------------|-----------------|------------------|---------------------|
| Listwise Deletion         | Low (if much missing) | High (if not MCAR) | Decreases           |
| Mean/Median Imputation    | Low             | High             | None                |
| LOCF                      | Medium          | Medium to High   | None                |
| Multiple Imputation       | High            | Low              | None                |
| Mixed Models              | High            | Low (if MAR)     | None                |

---

###  Summary:
For repeated measures ANOVA:
- If data is **MCAR**, simple methods like listwise deletion may be acceptable.
- If data is **MAR**, use **multiple imputation** or **mixed-effects models** for best results.
- Avoid overly simplistic methods unless justified.
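
The contrast between listwise deletion and a mixed-effects model can be sketched as follows. Everything here is simulated: the column names (`subject`, `time`, `heart_rate`), the number of subjects, and the pattern of missing values are assumptions chosen only to illustrate the workflow.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_subjects = 20
time_points = ["baseline", "week4", "week8"]

# Hypothetical long-format data: one row per subject per time point.
subject_effect = rng.normal(0, 5, n_subjects)
rows = []
for subject_id in range(n_subjects):
    for shift, time in zip([0.0, -3.0, -6.0], time_points):
        rows.append({
            "subject": subject_id,
            "time": time,
            "heart_rate": 80 + shift + subject_effect[subject_id] + rng.normal(0, 3),
        })
long_df = pd.DataFrame(rows)

# Remove six follow-up measurements to mimic dropout.
missing_idx = long_df[long_df["time"] != "baseline"].sample(n=6, random_state=1).index
long_df.loc[missing_idx, "heart_rate"] = np.nan

# Option 1: listwise deletion keeps only subjects with all three measurements.
wide = long_df.pivot(index="subject", columns="time", values="heart_rate")
complete_subjects = wide.dropna().index
complete_cases = long_df[long_df["subject"].isin(complete_subjects)]

# Option 2: a mixed model with a random intercept per subject uses every observed row.
observed = long_df.dropna(subset=["heart_rate"]).reset_index(drop=True)
mixed_fit = smf.mixedlm("heart_rate ~ C(time)", data=observed,
                        groups=observed["subject"]).fit()

print(f"Rows used after listwise deletion: {len(complete_cases)}")
print(f"Rows used by the mixed model:      {len(observed)}")
print(mixed_fit.summary())
```

Listwise deletion discards every measurement from a subject with even one missing value, while the mixed model keeps the remaining measurements for that subject, which is why it retains more data.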

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
Ans: \
After conducting an **ANOVA** and finding a **significant result**, the next step is often to perform **post-hoc tests** to determine **which specific group means differ** from each other. Here are some common post-hoc tests and when you'd use them:

---

###  **Common Post-Hoc Tests:**

| Post-Hoc Test | When to Use | Key Features |
|---------------|-------------|--------------|
| **Tukey's HSD** (Honestly Significant Difference) | When comparing **all pairs of groups** (equal group sizes preferred, but can handle unequal) | Controls **family-wise error rate** well; widely used |
| **Bonferroni Correction** | When you have a **small number of comparisons** | Very **conservative**; reduces Type I error but increases Type II error |
| **Scheffé’s Test** | When you want to test **all possible linear combinations** of means | **Very flexible**, but **less powerful** than others |
| **Dunnett's Test** | When comparing **each group against a control group** | More **powerful** than Tukey when only comparing to a control |
| **Games-Howell Test** | When variances are **unequal** and group sizes differ | **Does not assume equal variances**; great when ANOVA assumptions are violated |

---

###  **When Is a Post-Hoc Test Necessary?**

Post-hoc tests are needed **only when the ANOVA F-test is significant**. If the ANOVA shows that **at least one group differs**, a post-hoc test pinpoints **which group(s)** differ.

---

###  **Example Scenario:**

Suppose you're testing the effectiveness of **three different diets** (A, B, and C) on weight loss over a month. After running one-way ANOVA, you get:

- **F-statistic**: 6.8  
- **p-value**: 0.003 (significant)

You now perform **Tukey's HSD** to see which of these pairs differ significantly in average weight loss:
- Diet A vs B
- Diet A vs C
- Diet B vs C
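
A minimal sketch of this post-hoc step with `pairwise_tukeyhsd` from statsmodels. The weight-loss values are simulated (group means, spread, and 15 participants per diet are assumptions), so the output will not match the F-statistic and p-value quoted in the scenario.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)

# Hypothetical weight loss (kg) for 15 participants per diet.
diet_data = pd.DataFrame({
    "diet": np.repeat(["A", "B", "C"], 15),
    "weight_loss": np.concatenate([
        rng.normal(3.0, 1.0, 15),  # Diet A
        rng.normal(4.0, 1.0, 15),  # Diet B
        rng.normal(5.5, 1.0, 15),  # Diet C
    ]),
})

# Tukey's HSD compares every pair of diets while controlling the family-wise error rate.
tukey = pairwise_tukeyhsd(endog=diet_data["weight_loss"],
                          groups=diet_data["diet"], alpha=0.05)
print(tukey.summary())
```

Each row of the summary corresponds to one pair of diets; the `reject` column indicates whether that pair differs significantly at the chosen alpha.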


### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.
Ans: \
**One-Way ANOVA Results**
- **F-statistic**: 5.84
- **p-value**: 0.00545

**Interpretation:**
Since the p-value (0.00545) is less than the typical significance level of 0.05, we reject the null hypothesis that all three diets have the same mean weight loss. At least one diet's mean differs from the others; a post-hoc test such as Tukey's HSD would be needed to identify which pairs of diets (A, B, or C) differ.
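
The original dataset is not shown, so the following is only a sketch of how such an analysis could be run with `scipy.stats.f_oneway`. The data are simulated under assumed group means and sizes (17, 17, and 16 participants, 50 in total), so the printed values will not reproduce the F-statistic and p-value reported above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical weight loss (kg) for 50 participants split across three diets.
diet_a = rng.normal(loc=4.0, scale=1.5, size=17)
diet_b = rng.normal(loc=5.0, scale=1.5, size=17)
diet_c = rng.normal(loc=6.0, scale=1.5, size=16)

# One-way ANOVA: H0 is that all three diets have the same mean weight loss.
f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value:     {p_value:.5f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: at least one diet has a different mean weight loss.")
else:
    print("Fail to reject H0: no significant difference detected.")
```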

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
Ans: \
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a mock dataset.
# Note: the question mentions 30 employees; this mock dataset uses 10 per
# Program x Experience cell, i.e. 60 rows in total.
np.random.seed(42)
n_per_cell = 10

# Generate synthetic task times (mean, standard deviation, count per cell)
data = {
    'Time': np.concatenate([
        np.random.normal(30, 5, n_per_cell),  # Program A - Novice
        np.random.normal(25, 5, n_per_cell),  # Program A - Experienced
        np.random.normal(28, 5, n_per_cell),  # Program B - Novice
        np.random.normal(24, 5, n_per_cell),  # Program B - Experienced
        np.random.normal(26, 5, n_per_cell),  # Program C - Novice
        np.random.normal(23, 5, n_per_cell)   # Program C - Experienced
    ]),
    'Program': ['A'] * (2 * n_per_cell) + ['B'] * (2 * n_per_cell) + ['C'] * (2 * n_per_cell),
    'Experience': (['Novice'] * n_per_cell + ['Experienced'] * n_per_cell) * 3
}

df = pd.DataFrame(data)

# Two-way ANOVA using ordinary least squares (OLS): two main effects plus their interaction
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```

Output:

| Source                   | sum_sq     | df   | F        | PR(>F)   |
|--------------------------|------------|------|----------|----------|
| C(Program)               | 61.139604  | 2.0  | 1.676565 | 0.196603 |
| C(Experience)            | 446.824314 | 1.0  | 24.50556 | 8e-06    |
| C(Program):C(Experience) | 281.013398 | 2.0  | 7.705927 | 0.001137 |
| Residual                 | 984.613819 | 54.0 |          |          |

**Interpretation:**
- **Program**: F = 1.68, p = 0.197. There is no significant main effect of software program on task time.
- **Experience**: F = 24.51, p = 0.000008. There is a highly significant effect of employee experience on task time; experienced users complete tasks faster.
- **Interaction**: F = 7.71, p = 0.0011. The significant interaction implies that the effect of the software depends on experience level. In other words, how fast someone completes a task with a given program may change depending on whether they are novice or experienced.

Because the interaction is significant, the main effects should be interpreted with caution: the effect of one factor is not the same at every level of the other.



### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.
Ans: \
```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulate long-format data: sales for 3 stores recorded on the same 30 days
np.random.seed(42)
n_days = 30
sales_long = pd.DataFrame({
    'Day': np.tile(np.arange(1, n_days + 1), 3),
    'Store': np.repeat(['A', 'B', 'C'], n_days),
    'Sales': np.concatenate([
        np.random.normal(200, 15, n_days),  # Store A
        np.random.normal(210, 15, n_days),  # Store B
        np.random.normal(220, 15, n_days)   # Store C
    ])
})

# Repeated measures ANOVA: each day acts as the "subject", Store is the within factor
anova_rm = AnovaRM(sales_long, depvar='Sales', subject='Day', within=['Store'])
anova_result = anova_rm.fit()
print(anova_result.summary())

# Post-hoc test (Tukey's HSD).
# Note: Tukey's HSD treats the three stores as independent groups and ignores the
# pairing by day; paired comparisons with a multiplicity correction are a common
# alternative for repeated measures designs.
tukey_result = pairwise_tukeyhsd(endog=sales_long['Sales'], groups=sales_long['Store'], alpha=0.05)
print(tukey_result.summary())
```

Output (repeated measures ANOVA):

```
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 20.7170 2.0000 58.0000 0.0000
```

**Interpretation:**
Since the p-value (displayed as 0.0000, i.e., below 0.0001) is less than 0.05, we reject the null hypothesis. This means there is a statistically significant difference in average daily sales among the three stores. The Tukey HSD table printed by the code then shows which pairs of stores differ.