In [None]:
 In ANOVA (Analysis of Variance), there are several key assumptions:

1. **Independence:** Data points within and between groups must be independent.

2. **Homogeneity of Variance:** Variance should be roughly equal across all groups.

3. **Normality:** Data within each group should follow a normal distribution.

4. **Independence of Groups:** In a between-subjects design the groups should be unrelated, meaning no subject contributes data to more than one group. Related groups call for a repeated measures design instead.

5. **Random Sampling:** Samples within groups should be randomly selected.
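The normality and equal-variance assumptions above can be checked before running the test. The sketch below is one possible way to do that with SciPy's Shapiro-Wilk and Levene tests; the three groups are simulated, hypothetical data, not data from this notebook.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups (simulated, illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=55, scale=5, size=30)
samples = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = []
for sample in samples:
    _, p = stats.shapiro(sample)
    shapiro_p.append(p)

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*samples)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence and random sampling cannot be tested this way; they depend on how the data were collected.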

In [None]:
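One-Way ANOVA: This type of ANOVA is used when you have one independent variable (factor) with two or more levels or groups, most commonly three or more, since with two groups it is equivalent to an equal-variance two-sample t-test. 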
It assesses whether there are any statistically significant differences between the group means.

Example: Suppose you want to compare the average test scores of students from three different schools (School A, School B, and School C) 
to determine if there's a significant difference in performance among these schools.

Two-Way ANOVA: This ANOVA extends the one-way ANOVA to situations where there are two independent variables, 
and you want to assess their individual and interactive effects on the dependent variable. It's used when you want to explore the influence of two factors simultaneously.

Example: Imagine you are studying the effect of two factors, such as diet (Factor A: Low Fat, High Fat) and
exercise (Factor B: Sedentary, Active), on weight loss. Two-way ANOVA can determine if there are significant
effects of diet, exercise, and their interaction on weight loss.

Repeated Measures ANOVA: This type of ANOVA is used when you have repeated measurements or observations on the same subjects or 
items over multiple time points or conditions. It assesses whether there are any significant differences in the means of these repeated measurements.

Example: Suppose you are testing the effect of a new drug on patients' blood pressure, 
measuring their blood pressure before treatment, one week after treatment, and two weeks after treatment. 
Repeated Measures ANOVA can determine if there are significant changes in blood pressure over time due to the drug.
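As a rough illustration of the repeated measures case, the sketch below computes a one-way repeated measures ANOVA by hand with NumPy. The blood pressure values are hypothetical, and the sphericity assumption is not checked.

```python
import numpy as np
from scipy import stats

# Hypothetical systolic blood pressure (mmHg), illustrative values only.
# Rows = patients; columns = before treatment, one week after, two weeks after.
bp = np.array([
    [150, 142, 138],
    [160, 151, 147],
    [145, 140, 139],
    [155, 149, 141],
    [148, 143, 137],
], dtype=float)

n_subjects, n_times = bp.shape
grand_mean = bp.mean()

# Partition the total variation into time, subject, and error components
ss_total = np.sum((bp - grand_mean) ** 2)
ss_time = n_subjects * np.sum((bp.mean(axis=0) - grand_mean) ** 2)
ss_subjects = n_times * np.sum((bp.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_time - ss_subjects

# Degrees of freedom for the time effect and its error term
df_time = n_times - 1
df_error = (n_times - 1) * (n_subjects - 1)

# F-statistic for the time effect and its upper-tail p-value
f_stat = (ss_time / df_time) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_time, df_error)

print("F-statistic:", f_stat)
print("p-value:", p_value)
```

Removing the between-subject variation (`ss_subjects`) from the error term is what distinguishes this from a one-way ANOVA on the same numbers.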


In [None]:
The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps to understand 
how the total variance in a dataset is broken down into different components. It is essential in ANOVA because
it allows researchers to determine the sources of variation and assess whether the observed differences between 
group means are statistically significant.
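
For a one-way design the partition can be written out explicitly. In the sketch below, $k$ is the number of groups, $n_j$ the size of group $j$, $N$ the total number of observations, $\bar{y}_j$ the mean of group $j$, and $\bar{y}$ the overall mean. The names SST, SSE (explained, between groups), and SSR (residual, within groups) follow the code in the next cell; many textbooks call the last two SSB and SSW instead.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2}_{\mathrm{SST}}
= \underbrace{\sum_{j=1}^{k} n_j\,(\bar{y}_j - \bar{y})^2}_{\mathrm{SSE}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2}_{\mathrm{SSR}},
\qquad
F = \frac{\mathrm{SSE}/(k-1)}{\mathrm{SSR}/(N-k)}
```

The F-statistic is the ratio of the two mean squares, each a sum of squares divided by its degrees of freedom.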

In [1]:
import numpy as np

# Sample data and group labels (NumPy arrays allow boolean masking by group)
data = np.array([23, 25, 27, 30, 32, 35, 20, 22, 24, 18, 16, 19])
groups = np.array(['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'])

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate the Total Sum of Squares (SST)
SST = np.sum((data - overall_mean) ** 2)

# Calculate the group means
group_means = {group: np.mean(data[groups == group]) for group in np.unique(groups)}

# Calculate the Explained (between-group) Sum of Squares.
# Note: "SSE" here means "explained"; many texts call this SSB and
# reserve SSE for the error (within-group) sum of squares.
SSE = np.sum([len(data[groups == group]) * (group_means[group] - overall_mean) ** 2 for group in np.unique(groups)])

# Calculate the Residual (within-group) Sum of Squares (SSR)
SSR = SST - SSE

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)


Total Sum of Squares (SST): 376.25
Explained Sum of Squares (SSE): 342.91666666666674
Residual Sum of Squares (SSR): 33.33333333333326


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a Pandas DataFrame with illustrative data.
# Each factor has two levels and every A x B cell has two replicates. If every
# value of A and B were unique, the model would be saturated and no residual
# variance would be left to form F-tests.
data = pd.DataFrame({
    'A': ['a1', 'a1', 'a1', 'a1', 'a2', 'a2', 'a2', 'a2'],
    'B': ['b1', 'b1', 'b2', 'b2', 'b1', 'b1', 'b2', 'b2'],
    'Y': [15, 25, 35, 45, 55, 65, 75, 85]
})

# Perform two-way ANOVA
formula = 'Y ~ C(A) + C(B) + C(A):C(B)'
model = ols(formula, data=data).fit()
aov_table = sm.stats.anova_lm(model, typ=2)

# Mean squares (sum of squares / degrees of freedom) for each term
mean_sq_A = aov_table.loc['C(A)', 'sum_sq'] / aov_table.loc['C(A)', 'df']
mean_sq_B = aov_table.loc['C(B)', 'sum_sq'] / aov_table.loc['C(B)', 'df']
mean_sq_interaction = aov_table.loc['C(A):C(B)', 'sum_sq'] / aov_table.loc['C(A):C(B)', 'df']

# The F and PR(>F) columns of the table test each effect for significance
print(aov_table)
print("Mean square, factor A:", mean_sq_A)
print("Mean square, factor B:", mean_sq_B)
print("Mean square, interaction:", mean_sq_interaction)


In [None]:
F-Statistic (5.23):

The F-statistic is the ratio of the variation between group means to the variation within groups. An F-statistic of 5.23 means the between-group variation is about 5.23 times what within-group variation alone would produce; values near 1 are expected when all group means are equal.

P-Value (0.02):

The p-value associated with the F-statistic is the probability of observing a result at least this extreme if the null hypothesis were true. The hypotheses being tested are:

Null Hypothesis (H0): All group means are equal.

Alternative Hypothesis (Ha): At least one group mean differs from the others.

In this case, the p-value of 0.02 is less than the commonly used significance level (alpha) of 0.05. Therefore, there is sufficient evidence to reject the null hypothesis.

Interpretation:

Based on the F-statistic and p-value:

We reject the null hypothesis (H0) that all group means are equal.

We conclude that at least one group mean differs significantly from another.

However, the ANOVA test itself does not tell you which specific groups are different from each other. To determine which groups are different, you may need to perform post hoc tests or pairwise comparisons.
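
To show how a p-value follows from an F-statistic, the sketch below evaluates the upper tail of the F distribution with SciPy. The degrees of freedom are not given in the text, so the design used here (3 groups, 17 observations) is purely an assumption chosen for illustration.

```python
from scipy import stats

f_statistic = 5.23

# Assumed design (not stated in the text): 3 groups, 17 observations in total
k, n_total = 3, 17
df_between = k - 1        # numerator degrees of freedom
df_within = n_total - k   # denominator degrees of freedom

# Upper-tail probability of the F distribution: P(F >= f_statistic) under H0
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at alpha = 0.05 for these degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", round(p_value, 3))
print("Critical F at alpha = 0.05:", round(f_critical, 2))
```

The same F-statistic gives a different p-value under different degrees of freedom, which is why both should be reported together.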


In [None]:
Handling missing data in a repeated measures ANOVA:

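1. **Complete Case Analysis:** Removes subjects with any missing measurement; simple, but it reduces power and can bias results unless data are missing completely at random.

2. **Imputation:** Replaces missing values with estimates such as a mean; can maintain power, but single imputation may introduce bias and understate variability.

3. **Model-Based Methods:** Use statistical models, such as linear mixed-effects models, that draw on all available observations; potentially accurate but more complex.

4. **Multiple Imputation:** Creates multiple datasets with imputed values and pools the results; gives valid estimates when data are missing at random, but requires more effort.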
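
The first two options above can be sketched with pandas. The wide-format table below (one row per subject, one column per time point) is hypothetical, and mean imputation is shown only as the simplest form of imputation, not as a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [10.0, 12.0, 11.0, 13.0, 9.0],
    't2': [11.0, np.nan, 12.0, 14.0, 10.0],
    't3': [12.0, 13.0, np.nan, 15.0, 11.0],
})
time_cols = ['t1', 't2', 't3']

# 1. Complete case analysis: keep only subjects observed at every time point
complete_cases = wide.dropna(subset=time_cols)

# 2. Single imputation: fill each missing value with that time point's mean
mean_imputed = wide.copy()
mean_imputed[time_cols] = wide[time_cols].fillna(wide[time_cols].mean())

print("Subjects kept after listwise deletion:", len(complete_cases))
print(mean_imputed)
```

Listwise deletion discards two of the five subjects here, even though each is missing only one value, which shows how quickly this approach loses data.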

In [None]:
Common post-hoc tests after ANOVA:

Tukey's HSD: Compares all pairs of group means and identifies which pairs differ significantly, while controlling the familywise error rate.

Bonferroni Correction: Controls familywise error rate in multiple comparisons.

Dunnett's Test: Compares treatment groups to a control group.

Scheffé's Test: Conservative; covers all possible contrasts, including complex comparisons beyond simple pairs.

Games-Howell Test: Deals with unequal variances and sample sizes.
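
As a minimal sketch of the Bonferroni approach, the code below runs all pairwise t-tests on the four groups from the sums-of-squares example and multiplies each p-value by the number of comparisons. It assumes equal variances (the default of `scipy.stats.ttest_ind`); the variable names are illustrative.

```python
from itertools import combinations
from scipy import stats

# Group data taken from the sums-of-squares example (groups A to D)
samples = {
    'A': [23, 25, 27],
    'B': [30, 32, 35],
    'C': [20, 22, 24],
    'D': [18, 16, 19],
}

alpha = 0.05
pairs = list(combinations(samples, 2))  # all 6 unordered pairs of groups
adjusted = {}

for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(samples[g1], samples[g2])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    adjusted[(g1, g2)] = p_adj
    print(f"{g1} vs {g2}: adjusted p = {p_adj:.4f}, significant = {p_adj < alpha}")
```

Bonferroni is simple but becomes conservative as the number of comparisons grows, which is one reason Tukey's HSD is often preferred for all-pairs comparisons.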

In [1]:
import numpy as np
import scipy.stats as stats

# Sample data for weight loss for each diet group
diet_A = np.array([2.1, 1.8, 2.5, 2.0, 1.7, 1.9, 2.2, 2.3, 1.5, 1.8,
                   2.0, 1.9, 2.1, 1.7, 2.3, 2.4, 1.8, 2.2, 2.5, 2.1,
                   1.8, 2.0, 1.9, 2.2, 2.3, 1.6, 1.8, 2.1, 2.4, 2.0,
                   1.9, 2.1, 2.3, 2.5, 2.2, 1.7, 1.8, 1.6, 2.0, 2.3,
                   2.2, 2.4, 1.9, 2.1, 2.5, 1.7, 1.8, 2.3, 1.6])

diet_B = np.array([1.5, 1.2, 1.8, 1.3, 1.7, 1.6, 1.9, 1.4, 1.7, 1.5,
                   1.8, 1.3, 1.9, 1.2, 1.4, 1.6, 1.7, 1.3, 1.5, 1.8,
                   1.6, 1.7, 1.4, 1.9, 1.3, 1.6, 1.8, 1.7, 1.2, 1.5,
                   1.4, 1.6, 1.8, 1.3, 1.7, 1.5, 1.9, 1.4, 1.2, 1.6,
                   1.8, 1.3, 1.5, 1.7, 1.4, 1.9, 1.6, 1.8, 1.3, 1.7])

diet_C = np.array([2.8, 2.5, 2.7, 2.4, 2.9, 2.6, 2.7, 2.3, 2.4, 2.8,
                   2.5, 2.9, 2.6, 2.4, 2.7, 2.5, 2.3, 2.6, 2.9, 2.4,
                   2.7, 2.5, 2.8, 2.6, 2.4, 2.7, 2.9, 2.5, 2.8, 2.3,
                   2.6, 2.7, 2.4, 2.9, 2.5, 2.7, 2.8, 2.6, 2.4, 2.7,
                   2.9, 2.3, 2.5, 2.8, 2.6, 2.4, 2.7, 2.9, 2.5])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Report the results
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There are significant differences in mean weight loss among the three diets.")
else:
    print("There are no significant differences in mean weight loss among the three diets.")


F-Statistic: 257.97540322898794
p-value: 1.7255110397929946e-48
There are significant differences in mean weight loss among the three diets.


In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [12.3, 13.1, 11.8, 14.2, 13.8, 12.9, 11.4, 12.6, 13.7, 14.5,
             10.9, 11.8, 12.5, 10.7, 11.1, 15.0, 16.2, 14.8, 15.5, 13.7,
             9.6, 10.2, 9.8, 10.6, 11.4, 10.9, 9.3, 9.7, 10.3, 10.1]
})

# Perform two-way ANOVA
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Report the results
print(anova_table)

# Interpret the results
alpha = 0.05  # Significance level

# Check main effects and interaction effect
if anova_table['PR(>F)']['C(Software)'] < alpha:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

if anova_table['PR(>F)']['C(Experience)'] < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

if anova_table['PR(>F)']['C(Software):C(Experience)'] < alpha:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")


                              sum_sq    df         F    PR(>F)
C(Software)                 6.092667   2.0  0.748362  0.483863
C(Experience)               3.468000   1.0  0.851949  0.365189
C(Software):C(Experience)   2.258000   2.0  0.277350  0.760185
Residual                   97.696000  24.0       NaN       NaN
There is no significant main effect of Software.
There is no significant main effect of Experience.
There is no significant interaction effect between Software and Experience.


In [3]:
import numpy as np
import scipy.stats as stats
import statsmodels.stats.multicomp as mc

# Sample data
control_group_scores = np.array([85, 88, 82, 79, 90, 92, 78, 85, 88, 81,
                                 84, 86, 89, 80, 83, 87, 82, 84, 86, 88,
                                 79, 81, 85, 87, 90, 82, 88, 84, 85, 86,
                                 88, 81, 87, 83, 79, 80, 85, 88, 82, 84,
                                 89, 86, 83, 80, 82, 85, 87, 88, 84, 81])

experimental_group_scores = np.array([92, 94, 91, 88, 95, 96, 89, 93, 94, 90,
                                      92, 95, 97, 89, 91, 94, 90, 92, 94, 96,
                                      88, 90, 92, 94, 96, 89, 95, 91, 92, 94,
                                      95, 90, 94, 91, 88, 89, 93, 94, 91, 92,
                                      97, 95, 91, 88, 90, 93, 94, 95, 91, 90])

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Report the results
print("Two-Sample T-Test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Significance level

if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
    # With only two groups, the t-test already identifies which groups differ,
    # so no post-hoc test (e.g., Tukey's HSD) is needed.
else:
    print("There is no significant difference in test scores between the two groups.")


Two-Sample T-Test:
t-statistic: -12.958628817386803
p-value: 5.745592062079566e-23
There is a significant difference in test scores between the two groups.


In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Step 1: Load the data
df = pd.read_csv("sales_data.csv")

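# Step 2: Fit a one-way ANOVA comparing mean sales across stores.
# Note: this model treats all observations as independent, so it is not a true
# repeated measures ANOVA. That would also need a column identifying the
# repeated unit (for example, the day), which is not assumed to exist here.
anova_model = ols('Sales ~ C(Store)', data=df).fit()
anova_table = sm.stats.anova_lm(anova_model, typ=2)

# Print the ANOVA results
print("ANOVA Results:")
print(anova_table)

# Step 3: Perform a post-hoc test (Tukey's HSD)
if anova_table.loc['C(Store)', 'PR(>F)'] < 0.05: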
    print("\nPost-Hoc Test Results (Tukey's HSD):")
    posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
    print(posthoc)
else:
    print("\nNo significant differences found between stores.")
