## Hypothesis Testing

In [1]:
import statsmodels.api as sm
import numpy as np
import pandas as pd
import scipy.stats.distributions as dist

In [3]:
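# One-sample, one-sided z-test for a proportion:
# H0: p = 0.52 versus Ha: p > 0.52
n = 1018      # sample size
pnull = 0.52  # proportion under the null hypothesis
phat = 0.56   # observed sample proportion
# First argument is the number of successes (phat * n); returns (z statistic, p-value)
sm.stats.proportions_ztest(phat * n, n, pnull, alternative='larger')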

(2.571067795759113, 0.005069273865860533)
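To see where the statistic above comes from, the calculation can be reproduced by hand. The following is a minimal sketch using only the Python standard library; the helper name `one_sample_prop_ztest` and its `use_null_variance` flag are illustrative and not part of statsmodels. It assumes that `proportions_ztest` builds the standard error from the sample proportion by default (its `prop_var` argument defaults to `False`); the second call imitates what using the null proportion in the standard error would give.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_prop_ztest(phat, n, pnull, use_null_variance=False):
    """Upper-tailed z-test for a single proportion (illustrative helper)."""
    # Choose which proportion goes into the standard error.
    p_for_se = pnull if use_null_variance else phat
    se = sqrt(p_for_se * (1 - p_for_se) / n)
    z = (phat - pnull) / se
    # Upper-tail p-value, matching alternative='larger'.
    pvalue = 1 - NormalDist().cdf(z)
    return z, pvalue


# Standard error based on the sample proportion.
z_sample, p_sample = one_sample_prop_ztest(0.56, 1018, 0.52)
# Standard error based on the null proportion.
z_null, p_null = one_sample_prop_ztest(0.56, 1018, 0.52, use_null_variance=True)

print(z_sample, p_sample)
print(z_null, p_null)
```

The first pair should agree with the statsmodels output above. The second pair uses the null proportion in the standard error and gives a slightly smaller statistic, though the conclusion at the 5% level is the same.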

Is there a significant difference between the population proportions of parents of black children and parents of Hispanic children who report that their child has had some swimming lessons?

Populations: All parents of black children age 6-18 and all parents of Hispanic children age 6-18
Parameter of Interest: p1 - p2, where p1 is the proportion for parents of black children and p2 is the proportion for parents of Hispanic children
Null Hypothesis: p1 - p2 = 0
Alternative Hypothesis: p1 - p2 ≠ 0

91 out of 247 (36.8%) sampled parents of black children report that their child has had some swimming lessons.

120 out of 308 (38.9%) sampled parents of Hispanic children report that their child has had some swimming lessons.

In [5]:
n1 = 247
n2 = 308

y1 = 91
y2 = 120

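# Sample proportions, rounded to two decimal places (0.37 and 0.39).
# The rounding carries into test_stat, so the p-value printed below
# differs slightly from the one obtained with unrounded proportions.
p1 = round(y1/n1, 2)
p2 = round(y2/n2, 2)

# Pooled proportion under H0: p1 = p2
phat = (y1 + y2) / (n1 + n2)

# Standard error of p1 - p2 using the pooled variance
va = phat * (1-phat)
se = np.sqrt(va * (1/n1 + 1/n2))

# Two-sided p-value from the standard normal distribution
test_stat = (p1 - p2) / se
pvalue = 2 * dist.norm.cdf(-np.abs(test_stat))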
print(pvalue)

0.6295434573871281
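The cell above rounds each sample proportion to two decimals before forming the test statistic. As a cross-check, here is a minimal sketch of the same pooled two-proportion z-test with unrounded proportions, using only the standard library. The variable names are illustrative, and the counts are the ones quoted in the text.

```python
from math import sqrt
from statistics import NormalDist

n1, y1 = 247, 91    # parents of black children: sample size, "yes" count
n2, y2 = 308, 120   # parents of Hispanic children: sample size, "yes" count

# Unrounded sample proportions.
p1 = y1 / n1
p2 = y2 / n2

# Pooled proportion and standard error under H0: p1 - p2 = 0.
p_pooled = (y1 + y2) / (n1 + n2)
se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

# Two-sided p-value from the standard normal distribution.
z = (p1 - p2) / se
pvalue = 2 * NormalDist().cdf(-abs(z))
print(z, pvalue)
```

Without rounding, the p-value comes out at roughly 0.61 rather than 0.63. Either way, the data give no evidence of a difference between the two population proportions.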


In [7]:
df = pd.read_csv('Cartwheeldata.csv')
df.head()

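|   | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|----|-----|--------|-------------|---------|--------------|--------|----------|------------|----------|---------------|-------|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |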


In [12]:
n = len(df)
mean = df['CWDistance'].mean()
std = df['CWDistance'].std()
print(n, mean, std)

# One-sample z-test of H0: mean = 80 versus Ha: mean > 80; returns (z statistic, p-value)
sm.stats.ztest(df['CWDistance'], value=80, alternative='larger')

25 82.48 15.058552387264852


(0.8234523266982029, 0.20512540845395266)

In [13]:
cwdata = np.array([80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21, 82.7 , 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2 , 85.53, 79.08, 84.3 , 89.32, 86.35, 78.98, 92.26, 87.01])

In [14]:
n = len(cwdata)
mean = cwdata.mean()
std = cwdata.std()  # NumPy default ddof=0 (population SD); sm.stats.ztest below uses ddof=1
print(n, mean, std)

sm.stats.ztest(cwdata, value=80, alternative='larger')

25 83.84320000000001 10.716018932420752


(1.756973189172546, 0.039461189601168366)
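The one-sample z statistic for a mean is the distance between the sample mean and the hypothesized value, measured in standard errors. The sketch below recomputes it for `cwdata` with the standard library only. It assumes, consistent with the output above, that `sm.stats.ztest` uses the sample standard deviation (divisor n - 1), whereas `cwdata.std()` in NumPy defaults to the population form (divisor n). Names such as `mu0` and `sd_sample` are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev, stdev

cwdata = [80.57, 98.96, 85.28, 83.83, 69.94, 89.59, 91.09, 66.25, 91.21,
          82.7, 73.54, 81.99, 54.01, 82.89, 75.88, 98.32, 107.2, 85.53,
          79.08, 84.3, 89.32, 86.35, 78.98, 92.26, 87.01]

mu0 = 80                    # hypothesized mean under H0
n = len(cwdata)
xbar = mean(cwdata)
sd_pop = pstdev(cwdata)     # divisor n, like NumPy's default std()
sd_sample = stdev(cwdata)   # divisor n - 1

# Standard error of the mean and the z statistic.
se = sd_sample / sqrt(n)
z = (xbar - mu0) / se

# Upper-tail p-value, matching alternative='larger'.
pvalue = 1 - NormalDist().cdf(z)

print(n, xbar, sd_pop, sd_sample)
print(z, pvalue)
```

The population-form standard deviation matches the value printed by the cell above, while the z statistic and p-value match the `ztest` result only when the sample standard deviation is used.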

## Difference in Population Means

Considering adults in the NHANES data, is there a significant difference between the mean Body Mass Index of males and females?

Population: Adults in the NHANES data.
Parameter of Interest: $\mu_1 - \mu_2$, the difference in mean Body Mass Index (group 1 = females, group 2 = males).
Null Hypothesis: $\mu_1 = \mu_2$
Alternative Hypothesis: $\mu_1 \neq \mu_2$

2976 female adults:
$\mu_1 = 29.94$
$\sigma_1 = 7.75$

2759 male adults:
$\mu_2 = 28.78$
$\sigma_2 = 6.25$

$\mu_1 - \mu_2 = 1.16$

In [17]:
data = pd.read_csv('nhanes_2015_2016.csv')
data.head()

females = data[data['RIAGENDR'] == 2]
males = data[data['RIAGENDR'] == 1]

n1 = len(females)
mu1 = females['BMXBMI'].mean()
std1 = females['BMXBMI'].std()
print(n1, mu1, std1)

n2 = len(males)
mu2 = males['BMXBMI'].mean()
std2 = males['BMXBMI'].std()
print(n2, mu2, std2)

2976 29.93994565217392 7.753318809545674
2759 28.778072111846942 6.2525676168014614


In [18]:
# Two-sample, two-sided z-test for a difference in means; rows with missing BMI are dropped first
sm.stats.ztest(females['BMXBMI'].dropna(), males['BMXBMI'].dropna())

(6.1755933531383205, 6.591544431126401e-10)
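A rough check of this result can be made from the summary statistics printed earlier. The sketch below is not the statsmodels computation. It uses an unpooled standard error, which is an assumption made here for simplicity, whereas `sm.stats.ztest` pools the variances by default. It also takes the group sizes from `len()`, which counts any rows with a missing BMI that the `ztest` call drops. The statistic therefore will not reproduce the output above exactly.

```python
from math import erfc, sqrt

# Summary statistics printed by the notebook (sizes include any missing-BMI rows).
n1, mean1, sd1 = 2976, 29.93994565217392, 7.753318809545674    # females
n2, mean2, sd2 = 2759, 28.778072111846942, 6.2525676168014614  # males

# Difference in sample means.
diff = mean1 - mean2

# Unpooled standard error of the difference (assumption; not the statsmodels default).
se = sqrt(sd1**2 / n1 + sd2**2 / n2)
z = diff / se

# Two-sided p-value: 2 * P(Z < -|z|) equals erfc(|z| / sqrt(2)).
pvalue = erfc(abs(z) / sqrt(2))
print(diff, z, pvalue)
```

The statistic lands close to the statsmodels value and the p-value is again far below any conventional significance level, so the conclusion does not depend on these choices.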