Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Assumptions required to use ANOVA:

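Independence of observations: Each observation should be independent of all other observations. Example of a violation: measuring the same participants several times, or sampling students from the same classroom, and treating the values as unrelated.
Normality: The residuals (the differences between observed and predicted values) should be approximately normally distributed within each group. Example of a violation: strongly skewed data such as incomes or reaction times, especially when the samples are small.
Homogeneity of variances: The variance of the residuals should be constant across all groups (homoscedasticity). Example of a violation: one group's values are far more spread out than the others', which distorts the Type I error rate, particularly when group sizes are unequal.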
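
The normality and equal-variance assumptions can be checked before running the test. The following is a minimal sketch, assuming three small made-up groups and a 0.05 threshold (both are illustrative assumptions, not part of the question); it uses SciPy's `shapiro` and `levene` functions:

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups
groups = {
    "group1": np.array([10, 12, 15, 18, 14, 13]),
    "group2": np.array([13, 14, 16, 19, 15, 17]),
    "group3": np.array([11, 13, 14, 17, 12, 15]),
}

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups.values()])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups.values())

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
# A p-value below 0.05 would suggest that the corresponding assumption is violated
```

Independence is a property of the study design and cannot be verified with a test like these.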

Q2. What are the three types of ANOVA, and in what situations would each be used?

Three types of ANOVA and their uses:

=One-way ANOVA: Compares means across two or more groups for a single independent variable. It is used when there is only one categorical independent variable.
=Two-way ANOVA: Compares means across two or more groups for two independent variables. It is used when there are two categorical independent variables, allowing for the examination of main effects and interaction effects.
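=N-way ANOVA: Extends ANOVA to more than two categorical independent variables (factors). It is used when three or more factors, and possibly their interactions, may affect the dependent variable.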

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

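Partitioning of variance in ANOVA:

=Total Sum of Squares (SST): Represents the total variation in the dependent variable around the grand mean.
=Explained Sum of Squares (SSE): Represents the variation in the dependent variable explained by the independent variable(s), i.e. the variation between the group means. (Some textbooks use SSE for the error sum of squares; here it means explained.)
=Residual Sum of Squares (SSR): Represents the unexplained variation in the dependent variable after accounting for the effects of the independent variable(s), i.e. the variation within the groups.

Why it matters: the total variation splits exactly into these two parts (SST = SSE + SSR), and the F-statistic is the ratio of the explained to the residual variation, each divided by its degrees of freedom. The partition therefore shows how much of the variability the factor accounts for and why a large F-statistic is evidence against equal group means.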
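
The three quantities can be written out explicitly for a one-way design. The notation here is introduced for illustration only: k groups, n_i observations in group i, observation y_{ij}, group mean \bar{y}_i and grand mean \bar{y}.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSE &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSR &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SST &= SSE + SSR
\end{aligned}
```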

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
import numpy as np
import scipy.stats as stats

# Example data for three groups
group1 = np.array([10, 12, 15, 18])
group2 = np.array([13, 14, 16, 19])
group3 = np.array([11, 13, 14, 17])

# Calculate means
grand_mean = np.mean(np.concatenate([group1, group2, group3]))
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])

# Calculate SST
SST = np.sum((np.concatenate([group1, group2, group3]) - grand_mean) ** 2)

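# Calculate SSE (explained, between-group): size-weighted squared deviations of the group means from the grand mean
group_sizes = np.array([len(group1), len(group2), len(group3)])
SSE = np.sum(group_sizes * (group_means - grand_mean) ** 2)

# Calculate SSR (residual, within-group): squared deviations of each observation from its own group mean
SSR = np.sum((group1 - group_means[0]) ** 2) + np.sum((group2 - group_means[1]) ** 2) + np.sum((group3 - group_means[2]) ** 2)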

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)
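
As a sanity check, the sums of squares can be turned into an F-statistic by hand and compared with SciPy's `f_oneway`. This is a sketch under the same example data; the variable names (`ss_between`, `ss_within`, and so on) are chosen here for illustration:

```python
import numpy as np
from scipy import stats

groups = [np.array([10, 12, 15, 18]),
          np.array([13, 14, 16, 19]),
          np.array([11, 13, 14, 17])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# The F-statistic is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Reference result from SciPy
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual F:", f_manual, "SciPy F:", f_scipy)
print("Manual p:", p_manual, "SciPy p:", p_scipy)
```

If the two pairs of numbers agree, the explained and residual terms have been assigned correctly.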

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
import numpy as np
import scipy.stats as stats

# Example data: rows are the 3 levels of factor A, columns are the 4 levels of factor B,
# with one observation per cell
observations = np.array([[10, 12, 15, 18],
                         [13, 14, 16, 19],
                         [11, 13, 14, 17]])
n_levels_A, n_levels_B = observations.shape

# Calculate main effects as sums of squares of the level means around the grand mean
mean_obs = np.mean(observations)
means_A = np.mean(observations, axis=1)  # one mean per level of factor A
means_B = np.mean(observations, axis=0)  # one mean per level of factor B
main_effect_A = n_levels_B * np.sum((means_A - mean_obs) ** 2)
main_effect_B = n_levels_A * np.sum((means_B - mean_obs) ** 2)

# Calculate interaction effect: what remains after removing both main effects.
# With a single observation per cell this term cannot be separated from random error.
interaction_effect = np.sum((observations - means_B - means_A[:, np.newaxis] + mean_obs) ** 2)

print("Main effect of Factor A (sum of squares):", main_effect_A)
print("Main effect of Factor B (sum of squares):", main_effect_B)
print("Interaction effect (sum of squares):", interaction_effect)

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

With an F-statistic of 5.23 and a p-value of 0.02 obtained from a one-way ANOVA:

=Conclusions: At the conventional 0.05 significance level, the p-value of 0.02 is below the threshold, so the differences between the group means are statistically significant.
=Interpretation: If the null hypothesis (all group means are equal) were true, the probability of obtaining an F-statistic of 5.23 or larger would be 0.02. We therefore reject the null hypothesis and conclude that at least one group mean differs from the others. The F-test does not show which groups differ; a post-hoc test is needed for that.
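
The link between the F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not give the degrees of freedom, so the values below (three groups, 15 observations) are assumptions chosen only because they yield a p-value close to the reported 0.02:

```python
from scipy import stats

f_statistic = 5.23
df_between = 2    # assumed: 3 groups - 1
df_within = 12    # assumed: 15 observations - 3 groups
alpha = 0.05      # conventional significance level

# The survival function gives P(F >= observed value) under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that F must exceed for significance at the chosen alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = p_value < alpha
print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject the null hypothesis:", reject_null)
```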

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA:

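=One common approach is to use pairwise deletion, where each comparison uses all observations available for it. Different comparisons then rest on different subsets of subjects, which makes the results harder to compare. Listwise deletion, which drops every subject with any missing value, is simpler but discards more data and loses more statistical power.
=Another approach is to use mean substitution, where missing values are replaced with the mean of the available data. This keeps the sample size but artificially shrinks the variance, so standard errors are understated, and it may bias the results, especially if the values are not missing completely at random.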
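
The trade-off can be illustrated with pandas. This is a minimal sketch on made-up wide-format data (one row per subject, one column per time point; all names and numbers are hypothetical) that contrasts listwise deletion with mean substitution:

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 16.0],
    "t2": [11.0, 13.0, np.nan, 15.0, 18.0],
    "t3": [12.0, np.nan, 13.0, 17.0, 19.0],
})

# Listwise deletion: drop every subject with any missing measurement
complete_cases = scores.dropna()

# Mean substitution: fill each gap with the observed mean of that time point
mean_filled = scores.fillna(scores.mean())

print("Subjects kept after listwise deletion:", len(complete_cases), "of", len(scores))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2, after mean substitution:", mean_filled["t2"].var())
```

Listwise deletion loses two of the five subjects here, while mean substitution keeps all five but reports a smaller variance than the observed values actually have.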

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Common post-hoc tests used after ANOVA:

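=Tukey's Honestly Significant Difference (HSD) test: Used to compare every pair of groups and determine which specific groups differ significantly from each other. It controls the familywise error rate.
=Bonferroni correction: Adjusts the significance level for multiple comparisons to control the familywise Type I error rate. It is best suited to a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.
=Scheffe's method: More conservative than Tukey's HSD test. It is suited to testing complex contrasts (not only pairwise comparisons) and works with unequal sample sizes, but it still assumes equal variances.

Example: if a one-way ANOVA comparing three diets gives a significant F-statistic, a post-hoc test is necessary to find out which pairs of diets actually differ.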
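
The Bonferroni idea can be sketched with pairwise t-tests from SciPy. The data below are illustrative numbers only, and multiplying each raw p-value by the number of comparisons is one common way of applying the correction (equivalent to dividing the significance level):

```python
from itertools import combinations
from scipy import stats

# Hypothetical example data for three groups
samples = {
    "A": [3, 5, 4, 6, 7, 2, 3, 4, 5, 6],
    "B": [2, 4, 3, 5, 6, 1, 2, 3, 4, 5],
    "C": [1, 3, 2, 4, 5, 0, 1, 2, 3, 4],
}

pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)

results = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(samples[first], samples[second])
    # Bonferroni: multiply the raw p-value by the number of comparisons, capped at 1
    p_adjusted = min(p_raw * n_comparisons, 1.0)
    results[(first, second)] = (p_raw, p_adjusted)
    print(f"{first} vs {second}: raw p = {p_raw:.4f}, adjusted p = {p_adjusted:.4f}")
```

A pair is declared different only if its adjusted p-value falls below the chosen significance level.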

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
import scipy.stats as stats

# Example data (illustrative only: 10 participants per diet rather than the full 50)
weight_loss_A = [3, 5, 4, 6, 7, 2, 3, 4, 5, 6]
weight_loss_B = [2, 4, 3, 5, 6, 1, 2, 3, 4, 5]
weight_loss_C = [1, 3, 2, 4, 5, 0, 1, 2, 3, 4]

# Conduct one-way ANOVA
f_statistic, p_value = stats.f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("There are no significant differences between the mean weight loss of the three diets.")

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
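import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic placeholder data (illustrative only): 30 employees, 5 per Program x Experience cell
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'Program': np.repeat(['A', 'B', 'C'], 10),                          # factor A
    'Experience': np.tile(np.repeat(['novice', 'experienced'], 5), 3),  # factor B
})
data['Time'] = rng.normal(loc=30, scale=5, size=len(data))  # dependent variable (minutes)

# Perform two-way ANOVA with both main effects and their interaction
model = ols('Time ~ Program + Experience + Program:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Each row reports F and PR(>F); a p-value below 0.05 indicates a significant effect
print(anova_table)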


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
import numpy as np
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
control_group = np.array([80, 85, 78, 82, 87, 79, 81, 83, 86, 84])
experimental_group = np.array([88, 90, 84, 92, 89, 86, 91, 87, 85, 93])

# Conduct two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
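    # With only two groups the t-test already identifies the pair that differs;
    # Tukey's HSD is run here only because the exercise asks for a follow-up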
    tukey_results = pairwise_tukeyhsd(np.concatenate([control_group, experimental_group]),
                                      np.concatenate([np.repeat('Control', len(control_group)),
                                                      np.repeat('Experimental', len(experimental_group))]))
    print(tukey_results)
else:
    print("There is no significant difference in test scores between the control and experimental groups.")

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
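import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic placeholder data (illustrative only): 30 days, one sales figure per store per day
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Day': np.repeat(np.arange(1, 31), 3),   # subject identifier (the repeated unit)
    'Store': np.tile(['A', 'B', 'C'], 30),   # within-subjects factor
})
df['Sales'] = rng.normal(loc=1000, scale=100, size=len(df))  # dependent variable

# Perform repeated measures ANOVA (AnovaRM requires the subject column)
model = AnovaRM(data=df, depvar='Sales', subject='Day', within=['Store']).fit()
anova_table = model.anova_table

print(anova_table)

# Follow up with a post-hoc test only if the ANOVA is significant.
# Tukey's HSD treats the observations as independent; paired t-tests with a
# Bonferroni correction are a common alternative for repeated measures.
if anova_table['Pr > F'].iloc[0] < 0.05:
    tukey_results = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(tukey_results)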