## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### Assumptions for ANOVA:
1. **Independence of Observations**: Observations must be independent of each other, both within and across groups.
   - *Violation*: Repeated measurements on the same subjects are analyzed as if they were independent, ignoring the correlation between them.
2. **Normality**: The residuals (errors) should be approximately normally distributed.
   - *Violation*: The residuals depart markedly from normality, for example through heavy skew or extreme outliers.
3. **Homogeneity of Variances (Homoscedasticity)**: The variances across the groups should be equal.
   - *Violation*: One group has a much larger variance than the others, which can distort the Type I error rate of the F-test, especially when group sizes are unequal.
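
The normality and equal-variance assumptions can be screened with standard tests. The sketch below is one possible check, using simulated data that is purely hypothetical (the group means, spread, and sizes are assumptions made for illustration). Independence cannot be tested this way; it has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups of 30 observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10, 11, 12)]

# Residuals in a one-way ANOVA are deviations from each group's own mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of residuals: Shapiro-Wilk (null hypothesis: residuals are normal)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk: W={shapiro_stat:.3f}, p={shapiro_p:.3f}")
print(f"Levene:       W={levene_stat:.3f}, p={levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold, in which case a Welch-type ANOVA or a non-parametric alternative such as Kruskal-Wallis is worth considering.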


## Q2. What are the three types of ANOVA, and in what situations would each be used?

### Types of ANOVA:
1. **One-Way ANOVA**: Used when comparing the means of three or more independent groups based on one independent variable.
   - *Example*: Comparing the mean test scores of students from different schools.
2. **Two-Way ANOVA**: Used when examining the influence of two different categorical independent variables on one continuous dependent variable.
   - *Example*: Studying the effect of different teaching methods and gender on student performance.
3. **Repeated Measures ANOVA**: Used when the same subjects are used for each treatment (i.e., repeated measurements are taken).
   - *Example*: Measuring the blood pressure of patients at different times after administering a drug.


## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### Partitioning of Variance:
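- In ANOVA, the total variation observed in the data is partitioned into additive components:
  - **Total Sum of Squares (SST)**: The total variation of the observations around the grand mean.
  - **Explained Sum of Squares (SSE)**: The variation explained by the independent variable(s), i.e. the between-group variation.
  - **Residual Sum of Squares (SSR)**: The variation that remains unexplained, i.e. the within-group (error) variation.
- Naming varies between textbooks: many use SSE for the *error* sum of squares and SSR for the *regression* (explained) sum of squares, the reverse of the labels used here.
- Understanding this partitioning helps in determining how much of the variation in the dependent variable can be attributed to the independent variable(s) versus random error. The F-statistic is built from exactly these pieces: it is the ratio of explained to unexplained variation, each divided by its degrees of freedom.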
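
For a one-way design, the partition can be written out explicitly. The notation here is introduced only for illustration: $k$ groups, $n_j$ observations in group $j$, $y_{ij}$ the $i$-th observation in group $j$, $\bar{y}_j$ the mean of group $j$, and $\bar{y}$ the grand mean.

```latex
\begin{aligned}
\text{SST} &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}\right)^2 \\
\text{SSE} &= \sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2 \\
\text{SSR} &= \sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2 \\
\text{SST} &= \text{SSE} + \text{SSR}
\end{aligned}
```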


## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


In [1]:
import numpy as np
import pandas as pd

# Sample data
data = {'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Score': [23, 20, 21, 27, 29, 25, 22, 24, 23]}
df = pd.DataFrame(data)

# Overall mean
grand_mean = np.mean(df['Score'])

# Total Sum of Squares (SST)
sst = np.sum((df['Score'] - grand_mean) ** 2)

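# Explained Sum of Squares (SSE): group size times the squared distance
# of each group mean from the grand mean, summed over groups
sse = df.groupby('Group')['Score'].apply(lambda s: len(s) * (s.mean() - grand_mean) ** 2).sum()

# Residual Sum of Squares (SSR): squared distance of each score from its own group mean
ssr = df.groupby('Group')['Score'].apply(lambda s: np.sum((s - s.mean()) ** 2)).sum()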

print(f"SST: {sst}, SSE: {sse}, SSR: {ssr}")


SST: 65.55555555555556, SSE: 50.88888888888891, SSR: 14.666666666666668
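
The sums of squares lead directly to the F-statistic. The following sketch reuses the same nine scores, divides each sum of squares by its degrees of freedom, and cross-checks the result against `scipy.stats.f_oneway`. The variable names (`ms_between`, `ms_within`, and so on) are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Same scores as the example above, split by group
groups = {
    'A': np.array([23, 20, 21]),
    'B': np.array([27, 29, 25]),
    'C': np.array([22, 24, 23]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()
k = len(groups)             # number of groups
n_total = all_scores.size   # total number of observations

# Explained (between-group) and residual (within-group) sums of squares
sse = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ssr = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = sse / (k - 1)
ms_within = ssr / (n_total - k)

# F-statistic and its p-value from the F distribution
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print(f"Manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"SciPy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

Both routes should agree up to floating-point error, which confirms that the F-test is simply a ratio of the two mean squares.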


## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = {'Software': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'] * 2,
        'Experience': ['Novice'] * 9 + ['Experienced'] * 9,
        'Time': [12, 11, 13, 15, 14, 16, 13, 12, 14, 11, 10, 12, 14, 13, 15, 12, 11, 13]}
df = pd.DataFrame(data)

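# Fit a linear model with both main effects and their interaction
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
# Type II ANOVA table: one row per main effect, the interaction, and the residual
anova_table = sm.stats.anova_lm(model, typ=2)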

print(anova_table)


                                 sum_sq    df             F    PR(>F)
C(Software)                2.800000e+01   2.0  1.400000e+01  0.000729
C(Experience)              4.500000e+00   1.0  4.500000e+00  0.055405
C(Software):C(Experience)  6.197674e-29   2.0  3.098837e-29  1.000000
Residual                   1.200000e+01  12.0           NaN       NaN

### Interpretation:
- The `C(Software)` and `C(Experience)` rows give the main effects; the `C(Software):C(Experience)` row gives the interaction effect.
- The main effect of software is significant (F = 14.0, p ≈ 0.0007), while the main effect of experience is not significant at the 0.05 level (F = 4.5, p ≈ 0.055).
- The interaction sum of squares (about 6e-29) is floating-point noise for zero: in this sample data every Experienced time is exactly one unit below the corresponding Novice time, so the effect of experience is identical for every software program.


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

### Interpretation:
- The p-value of 0.02 is below the conventional 0.05 significance level, so an F-statistic of 5.23 would be unlikely if all group means were equal.
- We therefore reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others.
- The F-test does not say which groups differ or by how much; a post-hoc test and an effect size are needed for that.


## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

### Handling Missing Data:
1. **Listwise Deletion**: Remove every subject that has a missing value at any time point.
   - *Consequence*: Reduces the sample size and statistical power, and introduces bias unless the data are missing completely at random (MCAR).
2. **Mean Substitution**: Replace missing values with the mean of the observed values.
   - *Consequence*: Underestimates the variability, which shrinks the error variance and can inflate the F-statistic.
3. **Imputation Methods**: Use statistical techniques to estimate and replace missing values.
   - *Consequence*: More sophisticated methods (e.g., multiple imputation) preserve variability and make weaker assumptions about the missingness, but they require more complex modeling.
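
The first two consequences are easy to see on a small example. The sketch below uses a hypothetical wide-format table (one row per subject, one column per time point; all values are made up for illustration) and compares listwise deletion with mean substitution using plain pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [120.0, 130.0, 125.0, 140.0, 135.0],
    't2': [118.0, np.nan, 121.0, 138.0, 130.0],
    't3': [115.0, 126.0, np.nan, 133.0, 128.0],
}).set_index('subject')

# Listwise deletion: drop every subject with at least one missing value
listwise = wide.dropna()

# Mean substitution: fill each gap with that time point's observed mean
mean_filled = wide.fillna(wide.mean())

print(f"Subjects before: {len(wide)}, after listwise deletion: {len(listwise)}")
print("Std. dev. of observed values per time point:")
print(wide.std())
print("Std. dev. after mean substitution:")
print(mean_filled.std())
```

Listwise deletion discards two of the five subjects, and mean substitution keeps all five but lowers the standard deviation at every time point that had a gap.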


## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

### Common Post-hoc Tests:
1. **Tukey's HSD**: Used to compare all possible pairs of means while controlling the family-wise error rate. It is exact for equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.
   - *Example*: After finding a significant difference in mean test scores among three different teaching methods, use Tukey's HSD to identify which methods differ.
2. **Bonferroni Correction**: Divides the significance level by the number of comparisons to control the Type I error rate. It works best for a small set of planned comparisons and becomes very conservative as the number of comparisons grows.
   - *Example*: Comparing each of several drug treatments against a single control group.
3. **Scheffé Test**: The most conservative of the three. It covers all possible contrasts, not only pairwise ones, and works with unequal sample sizes.
   - *Example*: Testing whether the average of two new teaching methods differs from the traditional method, a contrast chosen after inspecting the data.
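
The first two procedures can be sketched with `statsmodels`. The data below are simulated and purely hypothetical (three teaching methods with 20 students each; means and spread are assumptions). The Scheffé test is left out because, as far as I know, neither SciPy nor statsmodels ships a ready-made function for it.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three teaching methods (20 students each)
rng = np.random.default_rng(1)
scores = {
    'Method1': rng.normal(70, 8, 20),
    'Method2': rng.normal(75, 8, 20),
    'Method3': rng.normal(82, 8, 20),
}

values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), 20)

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Bonferroni: one t-test per pair, then adjust the p-values
pairs = [('Method1', 'Method2'), ('Method1', 'Method3'), ('Method2', 'Method3')]
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: Bonferroni-adjusted p={p:.4f}, reject H0={r}")
```

With three comparisons, each Bonferroni-adjusted p-value is the raw p-value multiplied by three and capped at 1.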


## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [3]:
import pandas as pd
import numpy as np
from scipy import stats

# Simulated weight loss for the three diets.
# Assumption: 50 participants per diet (150 in total), as drawn below.
np.random.seed(42)
diet_A = np.random.normal(loc=5, scale=1.5, size=50)
diet_B = np.random.normal(loc=6, scale=1.5, size=50)
diet_C = np.random.normal(loc=7, scale=1.5, size=50)

data = {'Diet': ['A']*50 + ['B']*50 + ['C']*50,
        'Weight_Loss': np.concatenate([diet_A, diet_B, diet_C])}
df = pd.DataFrame(data)

# One-Way ANOVA
f_statistic, p_value = stats.f_oneway(df[df['Diet'] == 'A']['Weight_Loss'],
                                      df[df['Diet'] == 'B']['Weight_Loss'],
                                      df[df['Diet'] == 'C']['Weight_Loss'])

print(f"F-statistic: {f_statistic}, p-value: {p_value}")

# Interpretation
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-statistic: 32.885105397869786, p-value: 1.5717322025821263e-12
There is a significant difference between the mean weight loss of the three diets.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np

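# Simulated data: 30 employees in a balanced 3 x 2 design (5 per cell).
# No random seed is set, so the numbers differ on every run.
software = ['A', 'B', 'C'] * 10
experience = ['Novice', 'Experienced'] * 15
# Effects are tiled so that they line up with the labels above:
# software A/B/C shifts the time by 0/+5/-5, and novices take 10 longer.
time = np.random.normal(loc=50, scale=10, size=30) + np.tile([0, 5, -5], 10) + np.tile([10, 0], 15)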

# Ensure all arrays have the same length
assert len(software) == len(experience) == len(time), "Arrays must have the same length"

# Create DataFrame
df = pd.DataFrame({'Software': software, 'Experience': experience, 'Time': time})

# Convert 'Software' and 'Experience' to categorical variables
df['Software'] = pd.Categorical(df['Software'])
df['Experience'] = pd.Categorical(df['Experience'])

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


                                sum_sq    df         F    PR(>F)
C(Software)                 315.753173   2.0  1.838474  0.180766
C(Experience)               846.571759   1.0  9.858334  0.004442
C(Software):C(Experience)    18.816767   2.0  0.109561  0.896673
Residual                   2060.969093  24.0       NaN       NaN


## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.


In [10]:
import numpy as np
from scipy import stats

# Sample data
np.random.seed(42)
control_group = np.random.normal(loc=75, scale=10, size=50)
experimental_group = np.random.normal(loc=80, scale=10, size=50)

# Two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print(f"T-statistic: {t_statistic}, p-value: {p_value}")

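# Interpretation. With only two groups, the t-test already compares the only
# possible pair, so no separate post-hoc test is needed; the sign of the
# t-statistic shows which group scored higher.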
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")


T-statistic: -4.108723928204809, p-value: 8.261945608702611e-05
There is a significant difference in test scores between the control and experimental groups.
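
Since a post-hoc test adds nothing for two groups, a more useful follow-up is to report the direction and size of the difference. The sketch below regenerates the same simulated scores and adds two things that are not part of the original analysis: Welch's t-test (which does not assume equal variances) and Cohen's d computed with a pooled standard deviation.

```python
import numpy as np
from scipy import stats

# Same simulated scores as in the cell above
np.random.seed(42)
control_group = np.random.normal(loc=75, scale=10, size=50)
experimental_group = np.random.normal(loc=80, scale=10, size=50)

# Welch's t-test: a robustness check that drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(control_group, experimental_group, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
n1, n2 = len(control_group), len(experimental_group)
pooled_var = ((n1 - 1) * control_group.var(ddof=1)
              + (n2 - 1) * experimental_group.var(ddof=1)) / (n1 + n2 - 2)
mean_diff = experimental_group.mean() - control_group.mean()
cohens_d = mean_diff / np.sqrt(pooled_var)

print(f"Welch t-statistic: {t_welch:.4f}, p-value: {p_welch:.6f}")
print(f"Mean difference (experimental - control): {mean_diff:.2f}")
print(f"Cohen's d: {cohens_d:.2f}")
```

A positive mean difference means the experimental group scored higher. By common convention, values of d around 0.2, 0.5, and 0.8 are read as small, medium, and large effects.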


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.


In [11]:
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Sample data
np.random.seed(42)
store_A = np.random.normal(loc=200, scale=20, size=30)
store_B = np.random.normal(loc=210, scale=20, size=30)
store_C = np.random.normal(loc=220, scale=20, size=30)

data = {'Store': ['A']*30 + ['B']*30 + ['C']*30,
        'Sales': np.concatenate([store_A, store_B, store_C]),
        'Day': np.tile(np.arange(1, 31), 3)}
df = pd.DataFrame(data)

# Repeated Measures ANOVA
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()

print(res)

# Check the p-value of the Store effect; a significant result calls for a post-hoc test
if res.anova_table.loc['Store', 'Pr > F'] < 0.05:
    print("There are significant differences in sales between the stores.")
else:
    print("There are no significant differences in sales between the stores.")


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 12.6985 2.0000 58.0000 0.0000

There are significant differences in sales between the stores.
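
The cell above only reports that a difference exists. One common follow-up for a repeated measures design, sketched here as one option rather than the only valid choice, is a set of paired t-tests (paired because the same days are observed for every store) with a Bonferroni adjustment. The data are regenerated with the same seed and draw order as above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Rebuild the simulated sales in wide format: one row per day, one column per store
np.random.seed(42)
sales = pd.DataFrame({
    'A': np.random.normal(loc=200, scale=20, size=30),
    'B': np.random.normal(loc=210, scale=20, size=30),
    'C': np.random.normal(loc=220, scale=20, size=30),
}, index=pd.RangeIndex(1, 31, name='Day'))

# Paired t-test for every pair of stores
pairs = list(combinations(sales.columns, 2))
raw_p = [stats.ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Bonferroni adjustment for the three comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p, r in zip(pairs, adj_p, reject):
    diff = (sales[a] - sales[b]).mean()
    print(f"Store {a} vs Store {b}: mean difference={diff:.2f}, "
          f"adjusted p={p:.4f}, significant={r}")
```

Pairs flagged as significant are the stores whose average daily sales differ after accounting for multiple comparisons.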
