<a href="https://colab.research.google.com/github/sivanujands/StatisticalTests/blob/main/UnrelatedSamples/ParametricTests/ANOVA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# 1. Data
data = {
    'Method': ['X'] * 10 + ['Y'] * 10 + ['Z'] * 10,
    'Sales': [120, 115, 125, 118, 122, 110, 128, 119, 123, 117,
              105, 110, 100, 108, 112, 103, 115, 107, 109, 102,
              130, 135, 128, 132, 138, 125, 140, 133, 129, 131]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df.head()) # Show first few rows
print("\n")

# Optional: Check descriptive statistics for each group
print("Descriptive Statistics by Method:")
print(df.groupby('Method')['Sales'].describe())
print("\n")

# 2. Check Homogeneity of Variances using Levene's Test
# H0: Variances are equal across groups
# H1: Variances are not equal across groups
levene_statistic, levene_p_value = stats.levene(df['Sales'][df['Method'] == 'X'],
                                                  df['Sales'][df['Method'] == 'Y'],
                                                  df['Sales'][df['Method'] == 'Z'])

print(f"Levene's Test Statistic: {levene_statistic:.3f}")
print(f"Levene's Test P-value: {levene_p_value:.3f}")

alpha = 0.05
if levene_p_value < alpha:
    print("Conclusion: Variances are significantly different (p < 0.05). Consider Welch's ANOVA or non-parametric test.")
    # For now, we proceed with standard ANOVA; in a real analysis you might use a different approach.
else:
    print("Conclusion: Variances are not significantly different (p >= 0.05). Standard ANOVA assumptions likely met.")

print("\n")

# 3. Perform One-Way ANOVA
# Using statsmodels for a more comprehensive ANOVA table
# The formula 'Sales ~ C(Method)' indicates 'Sales' is the dependent variable
# and 'Method' is the categorical independent variable. 'C()' explicitly treats 'Method' as categorical.
model = ols('Sales ~ C(Method)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2) # Type 2 ANOVA sum of squares

print("One-Way ANOVA Results:")
print(anova_table)
print("\n")

# 4. Extract the p-value for the 'Method' effect
# The p-value for our main effect (Method) is in the 'PR(>F)' column for the 'C(Method)' row.
p_value_method_effect = anova_table.loc['C(Method)', 'PR(>F)']

print(f"P-value for the 'Method' effect: {p_value_method_effect:.3f}")
print(f"Significance Level (alpha): {alpha}")

# 5. Make a Decision and Draw a Conclusion (Overall ANOVA)
if p_value_method_effect < alpha:
    print(f"Since p-value ({p_value_method_effect:.3f}) < alpha ({alpha}), we reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in average sales performance among the three training methods.")
    print("Proceeding with post-hoc tests to identify specific differences.")

    # 6. Perform Post-hoc Analysis (Tukey HSD)
    # Tukey HSD is a common post-hoc test when ANOVA's assumption of equal variance is met.
    # It controls the family-wise error rate.
    tukey_results = pairwise_tukeyhsd(endog=df['Sales'], groups=df['Method'], alpha=alpha)

    print("\nTukey HSD Post-hoc Test Results:")
    print(tukey_results)

    # Interpretation of Tukey HSD:
    # Look at the 'reject' column. If True, the difference between that pair of groups is significant.

else:
    print(f"Since p-value ({p_value_method_effect:.3f}) >= alpha ({alpha}), we fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in average sales performance among the three training methods.")

Original DataFrame:
  Method  Sales
0      X    120
1      X    115
2      X    125
3      X    118
4      X    122


Descriptive Statistics by Method:
        count   mean       std    min     25%    50%     75%    max
Method                                                             
X        10.0  119.7  5.165054  110.0  117.25  119.5  122.75  128.0
Y        10.0  107.1  4.677369  100.0  103.50  107.5  109.75  115.0
Z        10.0  132.1  4.581363  125.0  129.25  131.5  134.50  140.0


Levene's Test Statistic: 0.049
Levene's Test P-value: 0.952
Conclusion: Variances are not significantly different (p >= 0.05). Standard ANOVA assumptions likely met.


One-Way ANOVA Results:
                sum_sq    df          F        PR(>F)
C(Method)  3125.066667   2.0  67.404378  3.176023e-11
Residual    625.900000  27.0        NaN           NaN


P-value for the 'Method' effect: 0.000
Significance Level (alpha): 0.05
Since p-value (0.000) < alpha (0.05), we reject the null hypothesis.
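Conclusion: There is a statistically significant difference in average sales performance among the three training methods.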
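
The `sum_sq`, `df`, and `F` entries in the ANOVA table can be checked by hand. The following is a minimal sketch in plain Python, using the same sales figures as the script; the names `groups`, `mean`, `ss_between`, and `ss_within` are illustrative and not part of statsmodels.

```python
# Recompute the one-way ANOVA table entries from the raw data (no libraries needed).
groups = {
    "X": [120, 115, 125, 118, 122, 110, 128, 119, 123, 117],
    "Y": [105, 110, 100, 108, 112, 103, 115, 107, 109, 102],
    "Z": [130, 135, 128, 132, 138, 125, 140, 133, 129, 131],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)

# Between-group sum of squares: group size times the squared distance
# of each group mean from the grand mean.
ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values())

# Within-group (residual) sum of squares: squared distance of each
# observation from its own group mean.
ss_within = sum((v - mean(vals)) ** 2 for vals in groups.values() for v in vals)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_stat = ms_between / ms_within

print(f"sum_sq: {ss_between:.3f} (between), {ss_within:.3f} (within)")
print(f"df:     {df_between} (between), {df_within} (within)")
print(f"F:      {f_stat:.3f}")
```

The printed values should agree with the `C(Method)` and `Residual` rows of the table above, up to rounding.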

**Explanation of the Output:**

* After the data preview and the per-group descriptive statistics, the output shows Levene's test results, which help you decide whether the equal-variance assumption is met. Here the p-value is 0.952, so there is no evidence of unequal variances. If the p-value were below alpha, you might consider alternatives such as Welch's ANOVA (which statsmodels can also perform, though not through the `ols` / `anova_lm` route used here) or a non-parametric test.

* Next, the One-Way ANOVA Results table (from statsmodels) is displayed.

* Look at the row corresponding to your independent variable (e.g., C(Method)).

* df (Degrees of Freedom): For the method, it's k-1 (3-1=2). For residuals, it's N-k (30-3=27).

* sum_sq (Sum of Squares): Shows the variability attributed to the method and residuals.

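* mean_sq (Mean Square): Sum of squares divided by df. The table printed above has no mean_sq column, but the values follow directly from it: 3125.07 / 2 ≈ 1562.53 for C(Method) and 625.90 / 27 ≈ 23.18 for Residual.

* F (F-statistic): The ratio of mean_sq for 'C(Method)' to mean_sq for 'Residual', here 1562.53 / 23.18 ≈ 67.40.

* PR(>F) (P-value): This is the crucial p-value for the overall ANOVA test. Here it is about 3.18e-11; the script's `.3f` formatting displays it as 0.000, which should be read as p < 0.001 rather than as exactly zero.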

* Based on the PR(>F) value: if p_value_method_effect < alpha, you reject H_0. This means there is an overall significant difference in average sales performance across the training methods. Otherwise you fail to reject H_0, and the code skips the post-hoc step.

* If you reject H_0, the code then proceeds to perform a Tukey HSD Post-hoc Test. This test provides pairwise comparisons between all combinations of your groups (e.g., Method X vs. Method Y, Method X vs. Method Z, Method Y vs. Method Z).

* The reject column in the Tukey HSD output indicates whether the difference between that specific pair of means is statistically significant at the chosen alpha level.

* This helps you pinpoint exactly which methods differ from each other, which the initial ANOVA doesn't do.
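
The `reject` column described above can also be read programmatically rather than from the printed table. The following is a minimal sketch, assuming the `reject`, `meandiffs`, and `groupsunique` attributes of the statsmodels Tukey result object, and assuming the pairs are ordered like `itertools.combinations` over the sorted group labels; the names `pairs`, `summary`, and `significant_pairs` are illustrative.

```python
from itertools import combinations

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same data as in the script above.
df = pd.DataFrame({
    "Method": ["X"] * 10 + ["Y"] * 10 + ["Z"] * 10,
    "Sales": [120, 115, 125, 118, 122, 110, 128, 119, 123, 117,
              105, 110, 100, 108, 112, 103, 115, 107, 109, 102,
              130, 135, 128, 132, 138, 125, 140, 133, 129, 131],
})

tukey_results = pairwise_tukeyhsd(endog=df["Sales"], groups=df["Method"], alpha=0.05)

# Assumption: one entry per pair, in the order (X, Y), (X, Z), (Y, Z).
pairs = list(combinations(tukey_results.groupsunique, 2))

summary = pd.DataFrame({
    "group1": [a for a, _ in pairs],
    "group2": [b for _, b in pairs],
    "meandiff": tukey_results.meandiffs,
    "reject": tukey_results.reject,
})
print(summary)

# Keep only the pairs whose difference is significant at the chosen alpha.
significant_pairs = summary.loc[summary["reject"], ["group1", "group2"]]
print(significant_pairs)
```

Working from a DataFrame like this makes it straightforward to filter or report only the pairs that differ, instead of scanning the printed table by eye.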