### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

**Assumptions:**
1. **Independence of Observations:** Each observation should be independent of every other observation, both within and across groups.
   - *Violation Example:* The same participants are measured in more than one group, or observations are clustered (e.g., students from the same classroom), so one value carries information about another.
2. **Normality:** The residuals (equivalently, the data within each group) should be approximately normally distributed.
   - *Violation Example:* The data within a group are heavily skewed or contain extreme outliers, which matters most when samples are small.
3. **Homogeneity of Variances (Homoscedasticity):** The variances among the groups should be approximately equal.
   - *Violation Example:* One group has a much larger variance than the others, which is especially problematic when group sizes are unequal.
4. **Random Sampling:** The samples should be randomly selected from the population.
   - *Violation Example:* If the samples are chosen based on convenience or other non-random methods, this assumption is violated.
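
The normality and equal-variance assumptions can be screened numerically before running ANOVA. The following is a minimal sketch, assuming three hypothetical groups of simulated data (the group names, means, and spreads are made up for illustration), using the Shapiro-Wilk test per group and Levene's test across groups:

```python
import numpy as np
from scipy import stats

# Hypothetical simulated groups; group C is given a deliberately larger spread
rng = np.random.default_rng(0)
groups = {
    'A': rng.normal(loc=50, scale=5, size=30),
    'B': rng.normal(loc=52, scale=5, size=30),
    'C': rng.normal(loc=55, scale=15, size=30),
}

# Normality: Shapiro-Wilk test within each group (small p suggests non-normality)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f'Shapiro-Wilk, group {name}: p = {p:.4f}')

# Homogeneity of variances: Levene's test across groups (small p suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups.values())
print(f'Levene: statistic = {levene_stat:.3f}, p = {levene_p:.4f}')
```

These tests are only a screen: independence and random sampling cannot be checked this way and have to be justified by the study design.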


### Q2. What are the three types of ANOVA, and in what situations would each be used?

1. **One-Way ANOVA:** Used to compare means of three or more independent (unrelated) groups based on one independent variable.
   - *Situation:* Comparing test scores of students from different schools.
2. **Two-Way ANOVA:** Used to examine the effect of two different independent variables on one dependent variable, and to understand if there is an interaction between them.
   - *Situation:* Studying the impact of teaching method and gender on student performance.
3. **Repeated Measures ANOVA:** Used when the same subjects are used for each treatment (e.g., multiple measurements of the same subjects over time).
   - *Situation:* Measuring blood pressure of patients before and after treatment at multiple intervals.


### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

**Partitioning of Variance:**
ANOVA works by partitioning the total variability in the data into components attributable to different sources of variation. In a one-way design the components add up exactly, SST = SSB + SSW:
- **Total Sum of Squares (SST):** Variability of all observations around the grand mean.
- **Between-Group Sum of Squares (SSB):** Variability of the group means around the grand mean.
- **Within-Group Sum of Squares (SSW):** Variability of the observations around their own group mean.

**Importance:**
Understanding the partitioning of variance shows how much of the total variability is explained by differences between the groups and how much is left as within-group (random error) variability. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so interpreting an ANOVA result depends directly on this decomposition.
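
Written out for a one-way design, the decomposition looks as follows. The notation is introduced here for illustration and is not part of the question: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

$$
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSB &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSW &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSB + SSW, \qquad F = \frac{SSB/(k-1)}{SSW/(N-k)}
\end{aligned}
$$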


### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {'Group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Value': [23, 20, 22, 30, 28, 27, 33, 35, 37]}
df = pd.DataFrame(data)

# Perform one-way ANOVA
model = ols('Value ~ C(Group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Calculate sums of squares
# Naming: SSE here is the *explained* (between-group) sum of squares, called SSB in Q3;
# SSR is the *residual* (within-group) sum of squares, called SSW in Q3.
grand_mean = df['Value'].mean()
SST = np.sum((df['Value'] - grand_mean)**2)
SSE = np.sum((model.fittedvalues - grand_mean)**2)
SSR = np.sum((df['Value'] - model.fittedvalues)**2)

print(f'Total Sum of Squares (SST): {SST}')
print(f'Explained Sum of Squares (SSE): {SSE}')
print(f'Residual Sum of Squares (SSR): {SSR}')

# Sanity checks: the partition should hold, and the sum_sq column of the
# ANOVA table should match SSE (row C(Group)) and SSR (row Residual)
print(f'SST equals SSE + SSR: {np.isclose(SST, SSE + SSR)}')
print(anova_table)
```


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {'Factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'Factor2': ['X', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'X', 'Y'],
        'Value': [23, 20, 22, 30, 28, 27, 33, 35, 37]}
df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Value ~ C(Factor1) + C(Factor2) + C(Factor1):C(Factor2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)
```


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

With an F-statistic of 5.23 and a p-value of 0.02, we reject the null hypothesis of equal group means at the 5% significance level (0.02 < 0.05). This means at least one group mean differs from the others; it does not tell us which groups differ or how large the differences are. Post-hoc tests (e.g., Tukey's HSD) are needed to identify the specific group differences, and an effect size should be reported alongside the p-value.
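
The link between the F-statistic and the p-value can be sketched with the F distribution. The question does not give the degrees of freedom, so the values below are an assumption chosen for illustration (3 groups and 17 observations, i.e. 2 and 14 degrees of freedom), for which the p-value comes out close to the stated 0.02:

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # assumption: 3 groups, so k - 1 = 2
df_within = 14   # assumption: 17 observations, so N - k = 14

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value at the 5% level for the same degrees of freedom
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f'p-value: {p_value:.4f}')
print(f'critical F at alpha = 0.05: {f_crit:.3f}')
print(f'reject H0: {f_stat > f_crit}')
```

With different degrees of freedom the same F-statistic would give a different p-value, which is why both are normally reported together.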


### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

**Handling Missing Data:**
1. **Listwise Deletion:** Excluding subjects with any missing data.
   - *Consequence:* Reduces sample size and power, and introduces bias unless the data are missing completely at random (MCAR).
2. **Mean Imputation:** Replacing missing values with the mean of the observed values.
   - *Consequence:* Artificially shrinks variability, which understates standard errors and can bias parameter estimates.
3. **Multiple Imputation:** Creating multiple datasets with imputed values and combining the results.
   - *Consequence:* Generally less biased estimates with standard errors that reflect the imputation uncertainty, but more complex to implement.
4. **Mixed-Effects Models:** Using models that use all available observations from each subject, including subjects with some missing time points.
   - *Consequence:* More flexible and can produce unbiased estimates when data are missing at random (MAR), but requires more complex modeling.

The choice of method depends on the amount and pattern of missing data, as well as the research context.
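
The effect of the two simplest methods can be seen on a small example. The following is a minimal sketch, assuming a hypothetical wide-format table of five subjects measured at three time points (all values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        't1': [120.0, 130.0, 125.0, 140.0, 135.0],
        't2': [118.0, np.nan, 122.0, 138.0, 131.0],
        't3': [115.0, 127.0, np.nan, 133.0, 129.0],
    },
    index=pd.Index(['s1', 's2', 's3', 's4', 's5'], name='subject'),
)

# Listwise deletion: drop every subject with at least one missing value
listwise = wide.dropna()

# Mean imputation: fill each missing value with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

print(f'Subjects before / after listwise deletion: {len(wide)} / {len(listwise)}')
print(f"Std of t2, observed values only: {wide['t2'].std():.2f}")
print(f"Std of t2, after mean imputation: {mean_imputed['t2'].std():.2f}")
```

Listwise deletion loses two of the five subjects here, while mean imputation keeps all subjects but lowers the standard deviation of the imputed column.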


### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Common Post-Hoc Tests:**
1. **Tukey's HSD (Honestly Significant Difference):** Used when comparing all possible pairs of group means.
   - *Situation:* After finding significant differences in test scores across multiple teaching methods.
2. **Bonferroni Correction:** Adjusts p-values (or the significance threshold) to control the Type I error rate across multiple comparisons; best suited to a small number of planned comparisons.
   - *Situation:* Comparing a few pre-specified pairs of diets on weight loss in a clinical trial.
3. **Scheffé Test:** More conservative and flexible, suitable for complex comparisons (any contrast among the group means, not only pairwise differences).
   - *Situation:* Comparing the average of two groups against a third, particularly when sample sizes are unequal.
4. **Dunnett's Test:** Compares each treatment group mean to a control group mean.
   - *Situation:* Comparing the effectiveness of several new drugs to a standard treatment.

**Example:**
After conducting a one-way ANOVA on test scores from students in different schools, a significant difference is found. A post-hoc Tukey's HSD test can be used to determine which specific schools' test scores differ from each other.
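
A Tukey's HSD follow-up of this kind can be sketched with `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later). The scores below are hypothetical values for three schools, made up for illustration:

```python
from scipy import stats

# Hypothetical test scores for three schools
school_a = [23, 20, 22]
school_b = [30, 28, 27]
school_c = [33, 35, 37]

# Tukey's HSD compares every pair of group means while controlling
# the family-wise error rate
res = stats.tukey_hsd(school_a, school_b, school_c)
print(res)

# res.pvalue is a square matrix; entry [i, j] is the adjusted p-value
# for the comparison between group i and group j
labels = ['A', 'B', 'C']
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f'School {labels[i]} vs {labels[j]}: adjusted p = {res.pvalue[i, j]:.4f}')
```

Pairs with an adjusted p-value below the chosen significance level are the ones whose means differ.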


### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example (simulated) data: 50 participants in total.
# The split 17/17/16 is an assumption; the question only gives the total.
np.random.seed(0)
group_sizes = [17, 17, 16]
data = {
    'Diet': np.repeat(['A', 'B', 'C'], group_sizes),
    # loc needs one mean per observation, so repeat each diet's mean by its group size
    'WeightLoss': np.random.normal(loc=np.repeat([5, 6, 7], group_sizes), scale=1)
}
df = pd.DataFrame(data)

# Perform one-way ANOVA
model = ols('WeightLoss ~ C(Diet)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Select the Diet row by label rather than by position
F_statistic = anova_table.loc['C(Diet)', 'F']
p_value = anova_table.loc['C(Diet)', 'PR(>F)']

print(f'F-statistic: {F_statistic}')
print(f'p-value: {p_value}')

# Interpretation:
# If the p-value is less than 0.05, we reject the null hypothesis of equal means and
# conclude that at least one diet's mean weight loss differs from the others.
# A post-hoc test is then needed to find out which diets differ.
```



### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example (simulated) data: 30 employees, 5 per Program x Experience cell.
# The balanced split and the cell means are assumptions made for illustration.
np.random.seed(0)
n_per_cell = 5
# Cell order: A-Novice, A-Experienced, B-Novice, B-Experienced, C-Novice, C-Experienced
cell_means = [20, 18, 15, 22, 20, 17]
data = {
    'Program': np.repeat(['A', 'B', 'C'], repeats=2 * n_per_cell),
    'Experience': np.tile(np.repeat(['Novice', 'Experienced'], repeats=n_per_cell), 3),
    # loc needs one mean per observation, so repeat each cell mean n_per_cell times
    'Time': np.random.normal(loc=np.repeat(cell_means, n_per_cell), scale=2)
}
df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

# Interpretation:
# The rows C(Program) and C(Experience) give the F-statistics and p-values for the main effects;
# the row C(Program):C(Experience) gives the interaction effect.
# A p-value below 0.05 indicates a significant effect. If the interaction is significant,
# interpret the main effects with caution, because the effect of one factor depends on the other.
```


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import pandas as pd
import numpy as np
from scipy import stats

# Example (simulated) data: 50 students per group
np.random.seed(0)
data = {
    'Group': np.repeat(['Control', 'Experimental'], 50),
    'Score': np.concatenate([np.random.normal(loc=70, scale=10, size=50), np.random.normal(loc=75, scale=10, size=50)])
}
df = pd.DataFrame(data)

# Perform two-sample t-test
# ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test.
control = df.loc[df['Group'] == 'Control', 'Score']
experimental = df.loc[df['Group'] == 'Experimental', 'Score']
t_stat, p_value = stats.ttest_ind(control, experimental)

print(f't-statistic: {t_stat}')
print(f'p-value: {p_value}')

# Interpretation:
# With only two groups, a significant t-test already identifies which groups differ,
# so no separate post-hoc test is needed; the mean difference gives the direction.
if p_value < 0.05:
    print("There is a significant difference in test scores between the two groups.")
    print(f'Mean difference (Experimental - Control): {experimental.mean() - control.mean():.2f}')
else:
    print("There is no significant difference in test scores between the two groups.")
```


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM

# Example (simulated) data: 30 days, each with one sales value per store
np.random.seed(0)
days = np.tile(np.arange(1, 31), 3)
store = np.repeat(['A', 'B', 'C'], 30)
# loc needs one mean per observation, so repeat each store's mean 30 times
sales = np.random.normal(loc=np.repeat([200, 220, 210], 30), scale=20)
data = {'Day': days, 'Store': store, 'Sales': sales}
df = pd.DataFrame(data)

# Perform repeated measures ANOVA
# Each Day acts as the "subject": it contributes exactly one observation per store,
# which is the balanced layout AnovaRM requires.
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()

print(res)
```
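
For the post-hoc step, one common option is pairwise paired t-tests with a Bonferroni correction. The following is a minimal, self-contained sketch under that assumption; the simulated sales, the store means, and the choice of Bonferroni are illustrative, not something the question prescribes:

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical simulated sales: one row per day, one column per store
rng = np.random.default_rng(0)
store_means = {'A': 200, 'B': 220, 'C': 210}  # assumed means, for illustration only
wide = pd.DataFrame(
    {name: rng.normal(loc=mean, scale=20, size=30) for name, mean in store_means.items()},
    index=pd.RangeIndex(1, 31, name='Day'),
)

# Paired t-test for every pair of stores (the same days are observed in each store),
# with Bonferroni adjustment: multiply each p-value by the number of comparisons
pairs = list(combinations(wide.columns, 2))
rows = []
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    rows.append({
        'pair': f'{a} vs {b}',
        't': t_stat,
        'p_raw': p_raw,
        'p_bonferroni': min(p_raw * len(pairs), 1.0),
    })

posthoc = pd.DataFrame(rows)
print(posthoc)
```

Pairs whose Bonferroni-adjusted p-value is below 0.05 are the stores whose mean daily sales differ.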
