### Hypothesis testing on real-world data

In [23]:
import numpy as np
import pandas as pd

## Example 1: Are men taller than women?
data source: https://www.kaggle.com/datasets/mustafaali96/weight-height

In [24]:
hw = pd.read_csv('data/weight-height.csv')

In [25]:
hw

Unnamed: 0,Gender,Height,Weight
0,Male,73.847017,241.893563
1,Male,68.781904,162.310473
2,Male,74.110105,212.740856
3,Male,71.730978,220.042470
4,Male,69.881796,206.349801
...,...,...,...
9995,Female,66.172652,136.777454
9996,Female,67.067155,170.867906
9997,Female,63.867992,128.475319
9998,Female,69.034243,163.852461


In [26]:
hw_m = hw.loc[hw['Gender']=='Male']
hw_f = hw.loc[hw['Gender']=='Female']

### Descriptive statistics

In [27]:
hw_m['Height'].describe()

count    5000.000000
mean       69.026346
std         2.863362
min        58.406905
25%        67.174679
50%        69.027709
75%        70.988744
max        78.998742
Name: Height, dtype: float64

In [28]:
hw_f['Height'].describe()

count    5000.000000
mean       63.708774
std         2.696284
min        54.263133
25%        61.894441
50%        63.730924
75%        65.563565
max        73.389586
Name: Height, dtype: float64

In [29]:
import plotly.express as px
fig = px.box(hw, x="Gender", y="Height")
fig.show()

### Independent two-sample t-test  
H0: There is no difference in mean height between men and women  
Ha: There is a difference in mean height between men and women

#### check assumption

**Assumptions**
- Independence: The observations within each group are independent of each other, and the values in one group are not influenced by the values in the other group.
- Normality: The population distributions of both groups are normal. This assumption can be checked by examining the distribution of values in each group using histograms or normal probability plots, or with a formal test such as the Shapiro-Wilk test.
- Homogeneity of variances: The variance of the values is equal across groups. This assumption can be tested using a statistical test such as Levene's test.

In [30]:
from scipy.stats import shapiro, levene, ttest_ind

Each row represents a single individual. Based on the description of the data source, all individuals were sampled independently and at random from a general population, so the independence assumption is met.

Use the Shapiro-Wilk test to check normality.

In [31]:
shapiro(hw_f['Height'])

ShapiroResult(statistic=0.9997759461402893, pvalue=0.9047888517379761)

In [32]:
shapiro(hw_m['Height'])

ShapiroResult(statistic=0.999444842338562, pvalue=0.14202836155891418)

The p-values of 0.90 and 0.14 are both greater than 0.05, so we fail to reject the null hypothesis of normality for either group. The normality assumption is treated as met for both groups.

In [33]:
levene(hw_f['Height'],hw_m['Height'])

LeveneResult(statistic=12.284910854677701, pvalue=0.0004586349895436178)

The p-value of about 0.0005 is far below 0.05, so we reject the null hypothesis that the variances of the two groups are equal. The homogeneity-of-variances assumption is violated. This is not a problem: we can use Welch's t-test, which does not assume equal variances, instead of Student's t-test.
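
To make the difference from Student's t-test concrete, here is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand. It uses only the summary statistics printed by `describe()` above (rounded to six decimals), so the results should agree with `ttest_ind(..., equal_var=False)` only up to rounding. The variable names are illustrative.

```python
import math

# Summary statistics copied from the describe() outputs above.
n_f, mean_f, sd_f = 5000, 63.708774, 2.696284  # female heights
n_m, mean_m, sd_m = 5000, 69.026346, 2.863362  # male heights

# Squared standard error contributed by each group.
# Welch's test keeps these separate instead of pooling the variances.
se2_f = sd_f**2 / n_f
se2_m = sd_m**2 / n_m

# Welch's t statistic (female minus male, matching the argument order used below).
t_stat = (mean_f - mean_m) / math.sqrt(se2_f + se2_m)

# Welch-Satterthwaite approximation of the degrees of freedom.
df = (se2_f + se2_m) ** 2 / (se2_f**2 / (n_f - 1) + se2_m**2 / (n_m - 1))

print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Because the two groups have equal sizes and similar standard deviations, the Welch degrees of freedom come out close to the pooled value of 9998, so the two tests would lead to the same conclusion here.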

**What if the normality assumption is violated?**
- Apply a transformation, for example a log transformation, to correct skewness in the data.
- Use a non-parametric test, for example the Mann-Whitney U test.
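
Both options can be sketched as follows. The height data does not need this treatment, so the example uses synthetic, right-skewed (log-normal) samples. The names `group_a` and `group_b` and the distribution parameters are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

# Synthetic right-skewed samples (illustrative only, not from the notebook's data).
rng = np.random.default_rng(0)
group_a = rng.lognormal(mean=0.0, sigma=0.5, size=200)
group_b = rng.lognormal(mean=0.5, sigma=0.5, size=200)

# Option 1: log-transform the data, then run Welch's t-test on the transformed values.
# The log of log-normal data is normally distributed, which removes the skew.
t_res = ttest_ind(np.log(group_a), np.log(group_b), equal_var=False)

# Option 2: Mann-Whitney U test, which is rank-based and makes no normality assumption.
u_res = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test on log values: t = {t_res.statistic:.2f}, p = {t_res.pvalue:.3g}")
print(f"Mann-Whitney U: U = {u_res.statistic:.1f}, p = {u_res.pvalue:.3g}")
```

Note that the two options test slightly different hypotheses: the t-test on log values compares means on the log scale, while the Mann-Whitney U test compares the two distributions through their ranks.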

#### run test

In [34]:
display(ttest_ind(hw_f['Height'],hw_m['Height'], equal_var=False))

TtestResult(statistic=-95.60271449148824, pvalue=0.0, df=9962.077121426351)

#### reporting

We conducted a two-sample unpaired t-test (Welch's t-test) to compare the mean height of male and female participants. The sample consisted of 5000 male participants with a mean height of 69.03 inches (SD = 2.86 inches) and 5000 female participants with a mean height of 63.71 inches (SD = 2.70 inches).

Given the test result (t(9962.08) = -95.60, p < .001), we have statistically significant evidence to reject H0 at the .05 significance level, suggesting that mean height differs between males and females in the population, with males being taller on average than females.

## Example 2: Does weekly study time affect students' academic grades?
data source: https://archive.ics.uci.edu/ml/datasets/Student+Performance#

In [35]:
stu = pd.read_csv("data/student-mat.csv",sep=';')

In [36]:
stu.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


- studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
- G3: final grade (numeric: from 0 to 20, output target)

### Descriptive statistics

In [37]:
stu['studytime'].value_counts()

studytime
2    198
1    105
3     65
4     27
Name: count, dtype: int64

In [38]:
stu.groupby('studytime')['G3'].describe()

studytime,count,mean,std,min,25%,50%,75%,max
1,105.0,10.047619,4.956311,0.0,8.0,10.0,13.0,19.0
2,198.0,10.171717,4.217537,0.0,8.0,11.0,13.0,19.0
3,65.0,11.4,4.639504,0.0,10.0,12.0,15.0,19.0
4,27.0,11.259259,5.281263,0.0,9.0,12.0,14.5,20.0


In [39]:
fig = px.box(stu, x="studytime", y="G3")
fig.show()

### One-way between-subjects ANOVA (Analysis of Variance)
H0: No difference in average grade between student groups with different study time  
Ha: Difference exists in average grade between student groups with different study time

#### check assumption

In [40]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('G3 ~ C(studytime)', data=stu).fit()

**Assumptions**
- Independence: Met
- Normality of residuals: The distribution of the residuals (the differences between the observed values and the fitted values) within each group should be approximately normal.
- Homogeneity of variances

Use the Shapiro-Wilk test to check normality of the residuals.

In [41]:
shapiro(model.resid)

ShapiroResult(statistic=0.9356271028518677, pvalue=4.891193353118162e-12)

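The p-value (about 4.9e-12) is far below 0.05, so we reject the null hypothesis that the residuals are normally distributed. The normality assumption is violated, and the ANOVA result should be interpreted with that caveat in mind.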

In [42]:
# Pass one G3 sample per studytime level (1 to 4) to Levene's test
levene(*[grp['G3'] for _, grp in stu.groupby('studytime')])

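LeveneResult(statistic=1.291279452281806, pvalue=0.27701862882516815)

The p-value of 0.28 is greater than 0.05, so we fail to reject the null hypothesis of equal variances. The homogeneity-of-variances assumption is treated as met.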
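
Since the Shapiro-Wilk test rejects normality of the residuals, one possible robustness check is a rank-based alternative to one-way ANOVA, the Kruskal-Wallis H test. The sketch below is not part of the original analysis. The helper `kruskal_by_group` is a hypothetical name, and the demo DataFrame is synthetic, reusing the column names `studytime` and `G3` only so that the call mirrors the real data.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

def kruskal_by_group(df, group_col, value_col):
    """Run a Kruskal-Wallis H test with one sample per level of group_col."""
    samples = [grp[value_col].to_numpy() for _, grp in df.groupby(group_col)]
    return kruskal(*samples)

# Synthetic stand-in for the student data (illustrative only).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=200),  # levels 1 to 4
    "G3": rng.integers(0, 21, size=200),        # grades 0 to 20
})

res = kruskal_by_group(demo, "studytime", "G3")
print(f"H = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```

On the real data the call would be `kruskal_by_group(stu, 'studytime', 'G3')`. A p-value above 0.05 there would agree with a non-significant ANOVA result.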

#### run test

In [43]:
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table

Unnamed: 0,sum_sq,df,F,PR(>F)
C(studytime),108.200155,3.0,1.727835,0.160723
Residual,8161.708706,391.0,,
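
To see where the F value in the table comes from, the following sketch recomputes it from the `sum_sq` and `df` columns: F is the ratio of the between-group mean square to the residual mean square. It also derives eta squared, a common effect-size measure that the original table does not report. It is included here as an assumption about what a reader might want to see next.

```python
# Values copied from the ANOVA table above.
ss_between, df_between = 108.200155, 3   # C(studytime) row
ss_within, df_within = 8161.708706, 391  # Residual row

# Mean squares: sum of squares divided by degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic: between-group variability relative to within-group variability.
f_stat = ms_between / ms_within

# Eta squared: share of the total variance in G3 explained by studytime.
eta_sq = ss_between / (ss_between + ss_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, eta^2 = {eta_sq:.3f}")
# prints: F(3, 391) = 1.73, eta^2 = 0.013
```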


#### confidence interval
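- A 0.05 significance level corresponds to a 95% confidence interval (C.I.).
- Interpretation: if the study were repeated many times, about 95% of the intervals constructed this way would contain the true population parameter (for example, the mean difference between the two groups being compared).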
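
As a worked example, the sketch below builds an approximate 95% confidence interval for the difference in mean height between men and women from Example 1, using only the rounded summary statistics from `describe()`. It uses the normal critical value instead of the t critical value. That is an assumption, justified here by the very large degrees of freedom (about 9962), where the two are nearly identical.

```python
import math
from statistics import NormalDist

# Summary statistics copied from the describe() outputs in Example 1.
n_m, mean_m, sd_m = 5000, 69.026346, 2.863362  # male heights
n_f, mean_f, sd_f = 5000, 63.708774, 2.696284  # female heights

# Point estimate and standard error of the difference in means (unpooled variances).
diff = mean_m - mean_f
se = math.sqrt(sd_m**2 / n_m + sd_f**2 / n_f)

# Two-sided 95% critical value from the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower, upper = diff - z * se, diff + z * se
print(f"difference = {diff:.2f} inches, 95% C.I. = [{lower:.2f}, {upper:.2f}]")
```

The interval does not contain 0, which is consistent with rejecting H0 at the 0.05 significance level in Example 1.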

#### reporting

We conducted a one-way ANOVA to determine whether final math grades (G3) differed significantly among four groups of students: those whose weekly study time is <2 hours (M = 10.05, SD = 4.96), between 2 and 5 hours (M = 10.17, SD = 4.22), between 5 and 10 hours (M = 11.40, SD = 4.64), and >10 hours (M = 11.26, SD = 5.28).

Based on the test result, F(3, 391) = 1.73, p = .16, we fail to reject the null hypothesis: there is no statistically significant evidence of a difference in average grade between student groups with different weekly study times.
