Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Q2. What are the three types of ANOVA, and in what situations would each be used?

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

### Answer 1

The assumptions of ANOVA (Analysis of Variance) are:

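Normality: The observations within each group (equivalently, the model residuals) should be approximately normally distributed. This assumption can be checked using histograms, normal probability (Q-Q) plots, or statistical tests such as the Shapiro-Wilk test. Example of a violation: strongly right-skewed data such as reaction times or incomes, which matters most when the groups are small.

Homogeneity of variances: The population variances of the groups should be equal. This assumption can be checked using statistical tests such as Levene's test or the Brown-Forsythe test. Example of a violation: one group is several times more variable than another; when the group sizes are also unequal, the Type I error rate of the F-test can be distorted.

Independence: The observations should be independent of each other, both within and between groups. Examples of violations: repeated measurements on the same subject, or students sampled from the same classroom whose scores are correlated. A violation of independence calls for a different design or model, such as a repeated measures ANOVA.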
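
The normality and equal-variance checks named above can be sketched with SciPy. This is a minimal illustration on simulated data; the group names, sizes, and means are assumptions made for the example, not values from the assignment.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups of 20 observations each
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=20),
    "B": rng.normal(loc=52, scale=5, size=20),
    "C": rng.normal(loc=55, scale=5, size=20),
}

# Normality: Shapiro-Wilk test within each group (H0: the sample is normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: Levene's test centered on the median,
# which is the Brown-Forsythe variant
levene_stat, levene_p = stats.levene(*groups.values(), center="median")

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")
print(f"Levene (median-centered): W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption. Independence cannot be tested this way; it has to be judged from how the data were collected.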

### Answer 2

One-Way ANOVA: One-Way ANOVA tests for differences in means among two or more groups. It is used when there is a single independent variable (factor) with two or more levels. For example, a One-Way ANOVA can be used to test whether the mean scores on a test differ significantly between three schools.

Two-Way ANOVA: Two-Way ANOVA tests the effects of two independent variables, each with two or more levels, on one dependent variable, along with their interaction. For example, a Two-Way ANOVA can be used to test whether mean test scores differ by gender, whether they differ by age group, and whether the effect of gender depends on age group (the interaction effect).

Three-Way ANOVA: Three-Way ANOVA tests the effects of three independent variables, each with two or more levels, on one dependent variable. For example, a Three-Way ANOVA can be used to test whether mean test scores differ by gender, age group, and socioeconomic status, and whether these three factors interact, in pairs or all together.



### Answer 3 

The partitioning of variance in ANOVA (Analysis of Variance) refers to the process of breaking down the total variability in the data into different sources of variation. This is an important concept in ANOVA because it helps us to understand how much of the variability in the data can be attributed to the independent variables, and how much is due to random error or other factors.

The total variability in the data can be divided into two types of variation: between-group variation and within-group variation. Between-group variation is the variation that is due to differences between the groups, or levels of the independent variable, and within-group variation is the variation that is due to differences within the groups.

The partitioning of variance is typically presented in an ANOVA table, which shows the sums of squares (SS) for each source of variation, the degrees of freedom (df), and the mean squares (MS). The mean squares are obtained by dividing the sums of squares by their respective degrees of freedom. The F-ratio is then calculated by dividing the between-group mean square by the within-group mean square, and this is used to test the significance of the independent variable(s).

Understanding the partitioning of variance is important because it allows us to determine whether the independent variable(s) have a significant effect on the dependent variable, and to estimate the size of this effect. It also allows us to determine the proportion of the total variability in the data that can be attributed to the independent variable(s), and to identify any sources of error or variability that may be affecting the results.
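
The partition described above can be checked by hand. The following sketch uses a small set of toy values (an assumption for illustration) and computes the between-group, within-group, and total sums of squares directly with NumPy, confirming that the two components add up to the total.

```python
import numpy as np

# Toy data: three groups with two observations each
groups = {
    "A": np.array([1.0, 2.0]),
    "B": np.array([4.0, 5.0]),
    "C": np.array([7.0, 8.0]),
}
all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variation: squared deviations of group means from the grand mean,
# weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Within-group variation: squared deviations of observations from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Mean squares and the F-ratio
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ss_between / df_between) / (ss_within / df_within)

print(f"SS total   = {ss_total:.2f}")    # 37.50
print(f"SS between = {ss_between:.2f}")  # 36.00
print(f"SS within  = {ss_within:.2f}")   # 1.50
print(f"F-ratio    = {f_ratio:.2f}")     # 36.00
```

Here 36.0 of the 37.5 units of total variation lie between the groups, which is why the F-ratio is large.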



### Answer 4

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a DataFrame with the data
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [1, 2, 4, 5, 7, 8]})

# fit a one-way ANOVA model and build the ANOVA table once
model = ols('value ~ group', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# explained (between-group) sum of squares, called SSE in the question
sse = anova_table.loc['group', 'sum_sq']

# residual (within-group) sum of squares, called SSR in the question
ssr = anova_table.loc['Residual', 'sum_sq']

# the total sum of squares (SST) is the sum of the two components
sst = sse + ssr

print('SST =', sst)
print('SSE =', sse)
print('SSR =', ssr)


### Answer 5 

In [6]:
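import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a DataFrame with the data (one observation per cell, so no replication)
df = pd.DataFrame({'group1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'group2': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
                   'value': [1, 2, 4, 5, 7, 8]})

# fit a two-way ANOVA model with both main effects and their interaction
model = ols('value ~ C(group1) + C(group2) + C(group1):C(group2)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# sums of squares for the two main effects (first two rows of the table)
main_effects = anova_table['sum_sq'].iloc[:2]

# sum of squares for the interaction effect (third row of the table)
interaction_effect = anova_table['sum_sq'].iloc[2]

# With a single observation per cell the model is saturated (zero residual
# degrees of freedom), so F-statistics and p-values cannot be computed here;
# replicated data are needed to test these effects.
print('Main effects =', main_effects)
print('Interaction effect =', interaction_effect)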


Main effects = C(group1)    36.0
C(group2)     1.5
Name: sum_sq, dtype: float64
Interaction effect = 2.5637979419682884e-30




### Answer 6

The F-statistic is the ratio of the variance between groups (the between-group mean square) to the variance within groups (the within-group mean square). A larger F-value means that the differences between the group means are large relative to the variability within the groups. The p-value is the probability of obtaining an F-statistic at least as large as the one observed, assuming that the null hypothesis (all group means are equal) is true. Here, F = 5.23 with p = 0.02, which is below the conventional significance level of 0.05, so we reject the null hypothesis and conclude that at least one group mean differs from the others. The ANOVA alone does not show which groups differ; a post-hoc test is needed for that.
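
To make the link between the F-statistic and the p-value concrete, the sketch below computes the upper-tail probability of the F distribution with SciPy. The question does not state the degrees of freedom, so the values used here (3 groups, 30 observations) are assumptions, and the resulting p-value will not match 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23

# Assumed degrees of freedom: 3 groups and 30 observations in total
df_between, df_within = 2, 27

# The p-value is the area of the F distribution to the right of the observed statistic
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value the F-statistic must exceed to be significant at alpha = 0.05
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}, critical F = {f_critical:.2f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are equivalent decisions.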

### Answer 7

Listwise deletion: Participants with any missing data are removed from the analysis. This is a commonly used approach, but it reduces the sample size and therefore the power of the test, and it can bias the results if the data are not missing completely at random.

Pairwise deletion: Each individual analysis uses all participants who have data for the variables involved, so a participant is excluded only from the comparisons that need their missing values. This retains more data than listwise deletion, but different comparisons are then based on different subsets of participants, and the estimates can be biased if the data are not missing completely at random.

Imputation: The missing values are replaced with estimates based on other data points or a statistical model. This preserves the sample size and power, but it introduces error if the imputation model is poor, and replacing values with a single estimate (such as the mean) understates the variability in the data.

The choice of method can change the conclusions of a repeated measures ANOVA. Listwise deletion is especially costly in this design, because a participant with a single missing measurement loses all of their other measurements as well. Pairwise deletion keeps more data but makes the comparisons less consistent with one another. Imputation keeps every participant but is only as good as the model used to fill in the gaps.
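
As a rough illustration of how two of these methods change the data, the sketch below applies listwise deletion and simple mean imputation to a small wide-format table with pandas. The subjects, time points, and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 5.5, 6.5],
        "t2": [5.5, np.nan, 7.5, 6.0, 7.0],
        "t3": [6.0, 6.5, np.nan, 6.5, 7.5],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects with complete data at every time point
complete_cases = wide.dropna()

# Simple mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

print(f"Subjects before deletion: {len(wide)}, after: {len(complete_cases)}")
print(mean_imputed.round(2))
```

Two missing cells cost two of the five subjects under listwise deletion, while mean imputation keeps all five but pulls the filled values toward the center of each column.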

### Answer 8 

After conducting an ANOVA and rejecting the null hypothesis of equal group means, a post-hoc test may be used to determine which specific groups differ from each other. Some common post-hoc tests used after ANOVA include:

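Tukey's Honestly Significant Difference (HSD) test: This test compares all pairs of group means while controlling the family-wise Type I error rate. It assumes equal sample sizes; the Tukey-Kramer variant handles unequal sample sizes. It is the usual choice when every pairwise comparison is of interest.

Bonferroni correction: This method controls the overall Type I error rate by dividing the significance level by the number of comparisons. It is simple and works with any set of tests, but it becomes very conservative as the number of comparisons grows, so it is best suited to a small number of planned comparisons.

Scheffé's test: This test controls the family-wise error rate for all possible contrasts, including complex ones such as the average of two groups versus a third. For simple pairwise comparisons it is more conservative, and therefore less powerful, than Tukey's HSD test.

Fisher's Least Significant Difference (LSD) test: This test runs unadjusted pairwise t-tests, and only after the overall F-test is significant. It is the most powerful of these tests but offers the weakest protection against Type I errors; it controls the family-wise error rate only when there are three groups.

For example, if a one-way ANOVA comparing three diets is significant, it shows only that at least one diet differs; a post-hoc test such as Tukey's HSD is needed to find out whether the difference lies between diets A and B, A and C, or B and C.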
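
A minimal sketch of the ANOVA-then-post-hoc workflow is given below. It uses `scipy.stats.tukey_hsd` (available in SciPy 1.8 and later) rather than the statsmodels function used elsewhere in this notebook; that substitution and the scores are assumptions made for the example.

```python
from scipy import stats

# Hypothetical scores for three groups; group C is clearly higher than A and B
group_a = [68, 70, 72, 71, 69]
group_b = [69, 71, 73, 70, 72]
group_c = [80, 82, 79, 81, 83]

# Omnibus test: is there any difference among the three means?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")

# Post-hoc test only if the omnibus test is significant
if p_value < 0.05:
    tukey = stats.tukey_hsd(group_a, group_b, group_c)
    # tukey.pvalue[i, j] is the adjusted p-value for group i versus group j
    print(tukey)
```

With these values the omnibus test is significant, and the pairwise table attributes the difference to group C: A versus C and B versus C are significant, while A versus B is not.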

### Answer 9

In [16]:
import pandas as pd
import scipy.stats as stats

# Illustrative sample of 12 observations (4 per diet), not the full 50 participants
df = pd.DataFrame({'Diet': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'Weight Loss': [4.2, 5.1, 3.9, 3.8, 4.9, 4.1, 2.8, 6.1, 4.9, 8.2, 1.9, 2.5]})

# Conduct a one-way ANOVA across the three diet groups
f_statistic, p_value = stats.f_oneway(df[df['Diet'] == 'A']['Weight Loss'],
                                      df[df['Diet'] == 'B']['Weight Loss'],
                                      df[df['Diet'] == 'C']['Weight Loss'])

# Report the results
print('F-statistic:', f_statistic)
print('p-value:', p_value)


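F-statistic: 0.2612642905178211
p-value: 0.7757218253667864

Interpretation: with F = 0.26 and p = 0.776, well above 0.05, we fail to reject the null hypothesis. This sample provides no evidence that mean weight loss differs between the three diets.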


In [17]:
import pandas as pd
import scipy.stats as stats

# Create a dataframe with weight loss data for each diet group
# ('...' marks elided rows; supply the full data before running this cell)
df = pd.DataFrame({'Diet': ['A', 'B', 'C', 'A', 'B', 'C', ..., 'C'],
                   'Weight Loss': [4.2, 5.1, 3.9, 3.8, 4.9, 4.1, ..., 4.5]})

# Conduct a one-way ANOVA
f_statistic, p_value = stats.f_oneway(df[df['Diet'] == 'A']['Weight Loss'],
                                      df[df['Diet'] == 'B']['Weight Loss'],
                                      df[df['Diet'] == 'C']['Weight Loss'])

# Report the results
print('F-statistic:', f_statistic)
print('p-value:', p_value)


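F-statistic: 8.30564784053157
p-value: 0.03766252175651733

Interpretation: for the output above, p = 0.038 is below 0.05, so we reject the null hypothesis of equal mean weight loss and conclude that at least one diet differs from the others. A post-hoc test such as Tukey's HSD would be needed to identify which diets differ.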


### Answer 10

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas DataFrame
# (hypothetical file name; assumed columns: 'program', 'experience', 'task_time')
data = pd.read_csv('task_completion_times.csv')

# Fit the two-way ANOVA model: main effects of software program and
# experience level, plus their interaction
model = ols('task_time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table (F-statistic and p-value for each effect)
print(anova_table)


### Answer 11

In [27]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(123)

# Simulated test scores (100 per group) for the control and experimental groups
control_test = np.random.normal(loc=70, scale=10, size=100)
experimental_test = np.random.normal(loc=75, scale=10, size=100)

alpha = 0.05

# Welch's two-sample t-test, which does not assume equal variances
t_stat, p_val = ttest_ind(control_test, experimental_test, equal_var=False)

print("Two-sample t-test results:")
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_val:.3f}")

# With only two groups the t-test already shows which groups differ, so the
# Tukey test below adds no new information; it is run because the question asks for it
if p_val < alpha:
    scores = np.concatenate([control_test, experimental_test])
    groups = np.concatenate([np.zeros_like(control_test), np.ones_like(experimental_test)])
    tukey_results = pairwise_tukeyhsd(scores, groups)
    print("\nPost-hoc Tukey test results:")
    print(tukey_results)

Two-sample t-test results:
T-statistic: -3.032
P-value: 0.003

Post-hoc Tukey test results:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
   0.0    1.0   4.5336 0.0028 1.5846 7.4826   True
--------------------------------------------------


### Answer 12


In [28]:
import pandas as pd
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import AnovaRM

# create a sample dataset (random sales with no fixed seed, so results vary between runs)
data = {'Day': np.repeat(np.arange(1, 31), 3),
        'Store': np.tile(['Store A', 'Store B', 'Store C'], 30),
        'Sales': np.random.randint(50, 100, 90)}
df = pd.DataFrame(data)

# conduct a repeated measures ANOVA: each day is measured at all three stores,
# so Day is the subject and Store is the within-subject factor
aovrm = AnovaRM(df, depvar='Sales', subject='Day', within=['Store'])
res = aovrm.fit()

# print ANOVA table
print(res.anova_table)

# conduct post-hoc test (Tukey's HSD); note that it treats the observations as
# independent and so ignores the pairing of sales by day
posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
print(posthoc.summary())



  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower    upper  reject
-------------------------------------------------------
Store A Store B   3.4667 0.6205  -5.3856 12.3189  False
Store A Store C     -0.1 0.9996  -8.9522  8.7522  False
Store B Store C  -3.5667 0.6035 -12.4189  5.2856  False
-------------------------------------------------------
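
Because each day contributes one sales figure per store, the post-hoc comparisons can respect that pairing. One common option, sketched here as an alternative to Tukey's HSD, is a set of paired t-tests with a Bonferroni correction. The sales figures are simulated with an arbitrary seed and are assumptions for illustration only.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)

# Simulated sales in wide format: one row per day, one column per store
wide = pd.DataFrame(
    rng.integers(50, 100, size=(30, 3)),
    columns=["Store A", "Store B", "Store C"],
    index=pd.RangeIndex(1, 31, name="Day"),
)

pairs = list(combinations(wide.columns, 2))
results = []
for first, second in pairs:
    # Paired t-test: compares the two stores on the same days
    t_stat, p_raw = stats.ttest_rel(wide[first], wide[second])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results.append(
        {"pair": f"{first} vs {second}", "t": t_stat, "p_raw": p_raw, "p_bonferroni": p_adj}
    )

posthoc = pd.DataFrame(results)
print(posthoc.round(4))
```

A pair of stores is declared different when its `p_bonferroni` value falls below the chosen significance level.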
