
In [4]:
#%pip install scikit-posthocs

In [3]:
import pandas as pd
from scipy import stats
import scikit_posthocs as sp # For post-hoc tests

# 1. Data
young_ratings = [4, 3, 5, 4, 3, 2, 4, 5, 3, 4]
middle_aged_ratings = [3, 4, 4, 3, 2, 3, 4, 2, 3, 4]
senior_ratings = [2, 1, 3, 2, 1, 2, 3, 1, 2, 1]

print(f"Young Group Ratings (n={len(young_ratings)}): {young_ratings}")
print(f"Middle-aged Group Ratings (n={len(middle_aged_ratings)}): {middle_aged_ratings}")
print(f"Senior Group Ratings (n={len(senior_ratings)}): {senior_ratings}")
print("\n")

# Optional: Check descriptive statistics (medians are often more appropriate for non-normal data)
print("Median Young Group:", pd.Series(young_ratings).median())
print("Median Middle-aged Group:", pd.Series(middle_aged_ratings).median())
print("Median Senior Group:", pd.Series(senior_ratings).median())
print("\n")

# 2. Perform the Kruskal-Wallis H-test
# stats.kruskal takes each group's data as separate arguments
statistic, p_value = stats.kruskal(young_ratings, middle_aged_ratings, senior_ratings)

print(f"Kruskal-Wallis H-statistic: {statistic:.3f}")
print(f"P-value: {p_value:.3f}")
print("\n")

# 3. Set the Significance Level
alpha = 0.05

# 4. Make a Decision and Draw a Conclusion (Overall H-test)
print(f"Significance Level (alpha): {alpha}")

if p_value < alpha:
    print(f"Since p-value ({p_value:.3f}) < alpha ({alpha}), we reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in customer satisfaction ratings among the three age groups.")
    print("Proceeding with post-hoc tests to identify specific differences.")

    # 5. Post-hoc Analysis (Dunn's test is a common choice for Kruskal-Wallis)
    # For post-hoc tests, it's often easiest to put the data into a single DataFrame.
    all_ratings = young_ratings + middle_aged_ratings + senior_ratings
    groups = ['Young'] * len(young_ratings) + ['Middle-aged'] * len(middle_aged_ratings) + ['Senior'] * len(senior_ratings)
    df_long = pd.DataFrame({'Rating': all_ratings, 'Group': groups})

    print("\nDataFrame in long format for post-hoc analysis:")
    print(df_long.head()) # Show first few rows
    print("\n")

    # Perform Dunn's post-hoc test
    # posthoc_dunn takes the long-format DataFrame, 'val_col' for the dependent variable,
    # and 'group_col' for the column holding the group labels.
    post_hoc_results = sp.posthoc_dunn(a=df_long, val_col='Rating', group_col='Group', p_adjust='bonferroni')
    # 'p_adjust' can be 'bonferroni', 'sidak', 'holm', etc. Bonferroni is conservative.

    print("Dunn's Post-hoc Test Results (Bonferroni-adjusted p-values):")
    print(post_hoc_results)

    # Interpretation of Dunn's post-hoc test:
    # The p-values in the table are already Bonferroni-adjusted, so compare each one
    # directly with alpha (0.05); do not also shrink alpha.

else:
    print(f"Since p-value ({p_value:.3f}) >= alpha ({alpha}), we fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in customer satisfaction ratings among the three age groups.")

Young Group Ratings (n=10): [4, 3, 5, 4, 3, 2, 4, 5, 3, 4]
Middle-aged Group Ratings (n=10): [3, 4, 4, 3, 2, 3, 4, 2, 3, 4]
Senior Group Ratings (n=10): [2, 1, 3, 2, 1, 2, 3, 1, 2, 1]


Median Young Group: 4.0
Median Middle-aged Group: 3.0
Median Senior Group: 2.0


Kruskal-Wallis H-statistic: 14.354
P-value: 0.001


Significance Level (alpha): 0.05
Since p-value (0.001) < alpha (0.05), we reject the null hypothesis.
Conclusion: There is a statistically significant difference in customer satisfaction ratings among the three age groups.
Proceeding with post-hoc tests to identify specific differences.

DataFrame in long format for post-hoc analysis:
   Rating  Group
0       4  Young
1       3  Young
2       5  Young
3       4  Young
4       3  Young


Dunn's Post-hoc Test Results (Bonferroni-adjusted p-values):
             Middle-aged    Senior     Young
Middle-aged     1.000000  0.019289  1.000000
Senior          0.019289  1.000000  0.000811
Young           1.000000  0.000811  1.000000

**Explanation of the Output:**

* After echoing the raw ratings and the group medians, the output shows the Kruskal-Wallis H-statistic and its p-value.

* The H-statistic is a rank-based (non-parametric) measure of how much the groups' mean ranks differ from one another; larger values indicate greater separation between the groups' distributions.

* The p-value is the probability of observing an H-statistic at least as large as the one calculated, assuming there is no true difference among the groups' satisfaction ratings. Here H = 14.354 gives p ≈ 0.001.

* The decision comes from comparing the p-value with the chosen alpha (0.05):

  * If p_value < alpha, you reject the null hypothesis: there is a statistically significant overall difference in customer satisfaction ratings among the three age groups. That is the case here (0.001 < 0.05). Because Kruskal-Wallis is an omnibus test, it does not tell you which specific groups differ.

  * If p_value >= alpha, you fail to reject the null hypothesis, and the code does not run a post-hoc test.


* When H_0 is rejected, the code runs Dunn's post-hoc test (with a Bonferroni adjustment in this example), which compares every pair of groups: Young vs. Middle-aged, Young vs. Senior, and Middle-aged vs. Senior.

* The post_hoc_results table holds the p-value for each pairwise comparison (the diagonal is 1 because a group is compared with itself). Because p_adjust='bonferroni' was passed, these p-values are already adjusted, so compare each one directly with alpha = 0.05 rather than with a Bonferroni-reduced alpha. In this output, Senior differs significantly from both Young (p ≈ 0.0008) and Middle-aged (p ≈ 0.019), while Young and Middle-aged do not differ significantly (p = 1.0). This pinpoints which age groups have significantly different satisfaction ratings.
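
To make the H-statistic described above less of a black box, here is a minimal sketch that recomputes it by hand from the pooled average ranks, using only the Python standard library. It assumes the textbook formula with the standard tie correction, and it relies on the fact that with three groups (2 degrees of freedom) the chi-square tail probability reduces to exp(-H/2). The helper name `average_ranks` and the variable names are illustrative and not part of SciPy.

```python
import math
from collections import Counter

young = [4, 3, 5, 4, 3, 2, 4, 5, 3, 4]
middle_aged = [3, 4, 4, 3, 2, 3, 4, 2, 3, 4]
senior = [2, 1, 3, 2, 1, 2, 3, 1, 2, 1]
groups = [young, middle_aged, senior]


def average_ranks(values):
    """Map each distinct value to its average rank; also return the tie counts."""
    counts = Counter(values)
    rank_of, start = {}, 1
    for value in sorted(counts):
        t = counts[value]
        # Tied observations occupy ranks start .. start + t - 1; use their mean.
        rank_of[value] = start + (t - 1) / 2
        start += t
    return rank_of, counts


pooled = [x for g in groups for x in g]
n_total = len(pooled)
rank_of, counts = average_ranks(pooled)

# Uncorrected H from the per-group rank sums R_i.
rank_sums = [sum(rank_of[x] for x in g) for g in groups]
h_raw = (
    12 / (n_total * (n_total + 1))
    * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    - 3 * (n_total + 1)
)

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
tie_sum = sum(t ** 3 - t for t in counts.values())
h = h_raw / (1 - tie_sum / (n_total ** 3 - n_total))

# Assumption: exactly three groups, so df = 2 and the chi-square
# survival function has the closed form exp(-h / 2).
p_value = math.exp(-h / 2)

print(f"Rank sums: {rank_sums}")
print(f"H (tie-corrected): {h:.3f}")
print(f"p-value: {p_value:.4f}")
```

Run on the ratings above, this gives rank sums of 213, 178 and 74 and a tie-corrected H that rounds to 14.354, in line with the `stats.kruskal` output. The tie correction matters here because ratings on a 1 to 5 scale produce many ties.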
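
The adjusted pairwise p-values in the table can also be reproduced by hand. The sketch below assumes the usual formulation of Dunn's test (a z-statistic on the difference in mean ranks, a tie-corrected variance, and a two-sided normal p-value), followed by a Bonferroni adjustment that multiplies each raw p-value by the number of comparisons and caps it at 1. It is a hand-rolled illustration using only the standard library, not scikit-posthocs' own code, and all variable names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

data = {
    "Young": [4, 3, 5, 4, 3, 2, 4, 5, 3, 4],
    "Middle-aged": [3, 4, 4, 3, 2, 3, 4, 2, 3, 4],
    "Senior": [2, 1, 3, 2, 1, 2, 3, 1, 2, 1],
}

pooled = [x for g in data.values() for x in g]
n_total = len(pooled)

# Average ranks over the pooled sample (tied values share their mean rank).
counts = Counter(pooled)
rank_of, start = {}, 1
for value in sorted(counts):
    t = counts[value]
    rank_of[value] = start + (t - 1) / 2
    start += t

mean_rank = {name: sum(rank_of[x] for x in g) / len(g) for name, g in data.items()}

# Tie-corrected variance factor shared by every pairwise comparison.
tie_term = sum(t ** 3 - t for t in counts.values()) / (12 * (n_total - 1))
base_var = n_total * (n_total + 1) / 12 - tie_term

pairs = list(combinations(data, 2))
m = len(pairs)  # number of comparisons used by the Bonferroni adjustment
alpha = 0.05

adjusted = {}
for a, b in pairs:
    se = math.sqrt(base_var * (1 / len(data[a]) + 1 / len(data[b])))
    z = abs(mean_rank[a] - mean_rank[b]) / se
    p_raw = math.erfc(z / math.sqrt(2))  # two-sided normal p-value
    adjusted[(a, b)] = min(1.0, p_raw * m)
    verdict = "significant" if adjusted[(a, b)] < alpha else "not significant"
    print(f"{a} vs. {b}: z = {z:.3f}, adjusted p = {adjusted[(a, b)]:.6f} ({verdict})")
```

For these ratings the adjusted values agree with the table above to roughly the printed precision: about 0.0008 for Young vs. Senior, about 0.019 for Middle-aged vs. Senior, and 1.0 for Young vs. Middle-aged. Comparing an adjusted p-value with alpha is equivalent to comparing the raw p-value with alpha / m, which is why only one of the two corrections should be applied.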