# Hypothesis Testing

In this notebook we demonstrate formal hypothesis testing using the NHANES data.

It is important to note that the NHANES data are a "complex survey".  The data are not an independent and representative sample from the target population.  Proper analysis of complex survey data should make use of additional information about how the data were collected.  Since complex survey analysis is a somewhat specialized topic, we ignore this aspect of the data here, and analyze the NHANES data as if they were an independent and identically distributed sample from a population.

In [0]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg') # workaround, there may be a better way
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats.distributions as dist

In [0]:
url = "/content/cartwheel.csv"
da = pd.read_csv(url)

da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})

In [7]:
da["SMQ020x"].head()

0    Yes
1    Yes
2    Yes
3     No
4     No
Name: SMQ020x, dtype: object

In [8]:
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})

da["RIAGENDRx"].head()

0      Male
1      Male
2      Male
3    Female
4    Female
Name: RIAGENDRx, dtype: object

### Hypothesis Tests for One Proportion

The most basic hypothesis test may be the one-sample test for a proportion.  This test is used if we have specified a particular value as the null value for the proportion, and we wish to assess whether the data are compatible with the true parameter value being equal to this specified value.  One-sample tests are not used very often in practice, because we rarely have a specific fixed value to use for comparison.  For illustration, imagine that the rate of lifetime smoking in another country was known to be 40%, and we wished to assess whether the rate of lifetime smoking in the US differs from 40%.  In the following notebook cells, we carry out the (two-sided) one-sample test of the null hypothesis that the population proportion of smokers is 0.4, and obtain a p-value of 0.43.  Since this p-value is large, the NHANES data are compatible with the proportion of (ever) smokers in the US being 40%.

In [9]:
smoker_df= da[da['SMQ020x']=='Yes']
smoker_df.head()

Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,DMDMARTL,DMDHHSIZ,WTINT2YR,SDMVPSU,SDMVSTRA,INDFMPIR,BPXSY1,BPXDI1,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210,SMQ020x,RIAGENDRx
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,1.0,2,134671.37,1,125,4.39,128.0,70.0,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0,Yes,Male
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,3.0,1,24328.56,1,125,1.32,146.0,88.0,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,,Yes,Male
2,83734,1.0,,,1,1,78,3,1.0,3.0,1.0,2,12400.01,1,131,1.51,138.0,46.0,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0,Yes,Male
6,83741,1.0,,8.0,1,1,22,4,1.0,4.0,5.0,3,37043.09,2,128,2.08,110.0,70.0,112.0,74.0,76.6,165.4,28.0,38.8,38.0,34.0,86.6,,Yes,Male
10,83747,1.0,,1.0,1,1,46,3,1.0,5.0,6.0,2,34513.08,1,121,0.75,144.0,94.0,150.0,90.0,86.2,176.7,27.6,41.0,38.0,33.6,104.3,2.0,Yes,Male


In [10]:
len(smoker_df)

2319

In [11]:
len(da)

5735

In [0]:
phat = 2319/5735

In [0]:
pnull=0.4


In [14]:
sm.stats.proportions_ztest(2319, 5725, pnull)

(0.7807518954896244, 0.43494843171868214)
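
The statistic returned above can be reproduced by hand. The following is a minimal sketch, assuming the default behaviour of `proportions_ztest`, which estimates the standard error from the sample proportion rather than from the null value. The helper name `one_sample_prop_ztest` is ours, not part of statsmodels, and the counts are the ones passed to the call above.

```python
import math

def one_sample_prop_ztest(count, nobs, p_null):
    """Two-sided one-sample z-test for a proportion (hypothetical helper)."""
    p_hat = count / nobs
    # Standard error estimated from the sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / nobs)
    z = (p_hat - p_null) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z, p_value = one_sample_prop_ztest(2319, 5725, 0.4)
print(z, p_value)
```

The z-statistic is the distance between the observed proportion and 0.4, measured in standard errors.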

### Hypothesis Tests for Two Proportions

Comparative tests tend to be used much more frequently than tests comparing one population to a fixed value.  A two-sample test of proportions is used to assess whether the proportion of individuals with some trait differs between two sub-populations.  For example, we can compare the smoking rates between females and males.  Since smoking rates vary strongly with age, we do this in the subpopulation of people between 20 and 25 years of age.  In the cells below, we first build the test statistic by hand from the group proportions and sample sizes, following the procedures covered elsewhere in the course, and then check the result with a library call.  We find that the smoking rate for males is around 11 percentage points greater than the smoking rate for females (36% versus 25%), and this difference is statistically significant (the p-value is around 0.01).

In [15]:
da.columns

Index(['SEQN', 'ALQ101', 'ALQ110', 'ALQ130', 'SMQ020', 'RIAGENDR', 'RIDAGEYR',
       'RIDRETH1', 'DMDCITZN', 'DMDEDUC2', 'DMDMARTL', 'DMDHHSIZ', 'WTINT2YR',
       'SDMVPSU', 'SDMVSTRA', 'INDFMPIR', 'BPXSY1', 'BPXDI1', 'BPXSY2',
       'BPXDI2', 'BMXWT', 'BMXHT', 'BMXBMI', 'BMXLEG', 'BMXARML', 'BMXARMC',
       'BMXWAIST', 'HIQ210', 'SMQ020x', 'RIAGENDRx'],
      dtype='object')

In [16]:
dx = da[['RIAGENDRx', 'SMQ020x', 'RIDAGEYR']].dropna()
dx.head()

Unnamed: 0,RIAGENDRx,SMQ020x,RIDAGEYR
0,Male,Yes,62
1,Male,Yes,53
2,Male,Yes,78
3,Female,No,56
4,Female,No,42


In [17]:
dx['RIAGENDRx'].unique()

array(['Male', 'Female'], dtype=object)

In [0]:
# Bin ages into groups (intervals are closed on the right)
dx['agegrp'] = pd.cut(dx['RIDAGEYR'], bins=[18, 20, 25.01, 30, 40, 50, 60, 70, 80], include_lowest=True)

In [0]:
new_df = dx.groupby(['agegrp','RIAGENDRx']).agg({"SMQ020x": lambda x: np.mean(x=="Yes")}).unstack()

In [79]:
new_df

Unnamed: 0_level_0,SMQ020x,SMQ020x
RIAGENDRx,Female,Male
agegrp,Unnamed: 1_level_2,Unnamed: 2_level_2
"(17.999, 20.0]",0.09697,0.171429
"(20.0, 25.01]",0.248927,0.358491
"(25.01, 30.0]",0.232143,0.409091
"(30.0, 40.0]",0.287526,0.503282
"(40.0, 50.0]",0.268924,0.448878
"(50.0, 60.0]",0.422175,0.572687
"(60.0, 70.0]",0.37415,0.655963
"(70.0, 80.0]",0.325183,0.655779


In [0]:
new_df.columns = ['Female', 'Male']

In [81]:
new_df

Unnamed: 0_level_0,Female,Male
agegrp,Unnamed: 1_level_1,Unnamed: 2_level_1
"(17.999, 20.0]",0.09697,0.171429
"(20.0, 25.01]",0.248927,0.358491
"(25.01, 30.0]",0.232143,0.409091
"(30.0, 40.0]",0.287526,0.503282
"(40.0, 50.0]",0.268924,0.448878
"(50.0, 60.0]",0.422175,0.572687
"(60.0, 70.0]",0.37415,0.655963
"(70.0, 80.0]",0.325183,0.655779


In [0]:
dn = dx.groupby(["agegrp", "RIAGENDRx"]).agg({"SMQ020x": np.size}).unstack()
dn.columns = ["Female", "Male"]

In [83]:

dn

Unnamed: 0_level_0,Female,Male
agegrp,Unnamed: 1_level_1,Unnamed: 2_level_1
"(17.999, 20.0]",165,175
"(20.0, 25.01]",233,212
"(25.01, 30.0]",280,220
"(30.0, 40.0]",473,457
"(40.0, 50.0]",502,401
"(50.0, 60.0]",469,454
"(60.0, 70.0]",441,436
"(70.0, 80.0]",409,398


In [84]:
new_df1 = pd.DataFrame(new_df*(1-new_df)/dn)
new_df1

Unnamed: 0_level_0,Female,Male
agegrp,Unnamed: 1_level_1,Unnamed: 2_level_1
"(17.999, 20.0]",0.000531,0.000812
"(20.0, 25.01]",0.000802,0.001085
"(25.01, 30.0]",0.000637,0.001099
"(30.0, 40.0]",0.000433,0.000547
"(40.0, 50.0]",0.000392,0.000617
"(50.0, 60.0]",0.00052,0.000539
"(60.0, 70.0]",0.000531,0.000518
"(70.0, 80.0]",0.000537,0.000567


In [0]:
new_df2 = pd.DataFrame(new_df1['Female']+new_df1['Male'])

In [0]:
new_df2.columns=['sum']

In [87]:
new_df2

Unnamed: 0_level_0,sum
agegrp,Unnamed: 1_level_1
"(17.999, 20.0]",0.001342
"(20.0, 25.01]",0.001887
"(25.01, 30.0]",0.001735
"(30.0, 40.0]",0.00098
"(40.0, 50.0]",0.001009
"(50.0, 60.0]",0.001059
"(60.0, 70.0]",0.001049
"(70.0, 80.0]",0.001104


In [0]:
se=pd.DataFrame(np.sqrt(new_df2))

In [89]:
se.columns=['se']
se

Unnamed: 0_level_0,se
agegrp,Unnamed: 1_level_1
"(17.999, 20.0]",0.036638
"(20.0, 25.01]",0.043442
"(25.01, 30.0]",0.041658
"(30.0, 40.0]",0.031307
"(40.0, 50.0]",0.031758
"(50.0, 60.0]",0.032545
"(60.0, 70.0]",0.032382
"(70.0, 80.0]",0.033222


In [90]:
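# z-statistic for the (20.0, 25.01] age group: difference in smoking rates
# (0.358491 - 0.248927 = 0.109564) divided by the standard error (0.043442, rounded to 0.043)
z_score = 0.109564/0.043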
z_score

2.548
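
The hand computation above can also be written out in full. This is a sketch only: the proportions and group sizes are copied from the `(20.0, 25.01]` rows of the tables above, the standard error is the unpooled one used in those tables, and `two_sample_prop_ztest` is a hypothetical helper name. Because it uses the unrounded standard error, its z-statistic comes out slightly below the 2.548 obtained with 0.043.

```python
import math

def two_sample_prop_ztest(p1, n1, p2, n2):
    """Two-sided z-test for a difference in proportions, unpooled SE (hypothetical helper)."""
    # Variance of each sample proportion, summed, then square-rooted
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_value

# Males first, females second, so a positive z means a higher male smoking rate
se, z, p_value = two_sample_prop_ztest(0.358491, 212, 0.248927, 233)
print(se, z, p_value)
```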

In [92]:
dx.head()

Unnamed: 0,RIAGENDRx,SMQ020x,RIDAGEYR,agegrp
0,Male,Yes,62,"(60.0, 70.0]"
1,Male,Yes,53,"(50.0, 60.0]"
2,Male,Yes,78,"(70.0, 80.0]"
3,Female,No,56,"(50.0, 60.0]"
4,Female,No,42,"(40.0, 50.0]"


In [0]:
dx_=dx[dx['RIDAGEYR']>=20.0]

In [0]:
dx_=dx_[dx_['RIDAGEYR']<=25.0]

In [95]:
dx_

Unnamed: 0,RIAGENDRx,SMQ020x,RIDAGEYR,agegrp
6,Male,Yes,22,"(20.0, 25.01]"
17,Female,No,24,"(20.0, 25.01]"
26,Male,Yes,22,"(20.0, 25.01]"
38,Female,No,20,"(17.999, 20.0]"
40,Male,Yes,24,"(20.0, 25.01]"
...,...,...,...,...
5688,Male,No,25,"(20.0, 25.01]"
5701,Male,No,25,"(20.0, 25.01]"
5707,Female,No,25,"(20.0, 25.01]"
5729,Male,No,25,"(20.0, 25.01]"


In [103]:
# Take a copy so that the recoding below does not write into a slice of dx_
female = dx_[dx_['RIAGENDRx']=='Female'].copy()

# Recode the smoking indicator numerically (Yes=1, No=2)
female["SMQ020x"] = female.SMQ020x.replace({"Yes": 1, "No": 2})
female = np.array(female['SMQ020x'])
print(len(female))
female

272


array([2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2,
       1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1,
       2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2,
       2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1,
       2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2,
       2, 1, 2, 2, 2, 1, 2, 2])

In [104]:
# Take a copy so that the recoding below does not write into a slice of dx_
male = dx_[dx_['RIAGENDRx']=='Male'].copy()

# Recode the smoking indicator numerically (Yes=1, No=2)
male["SMQ020x"] = male.SMQ020x.replace({"Yes": 1, "No": 2})
male = np.array(male['SMQ020x'])
print(len(male))
male

252


array([1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2,
       2, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2,
       1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2,
       1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2,
       1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2, 1,
       1, 1, 1, 2, 2, 2, 1, 2, 2, 2])

In [106]:
# With the coding Yes=1, No=2, a larger mean corresponds to a lower smoking rate,
# so the positive statistic below indicates that females smoke less than males.
sm.stats.ttest_ind(female, male)

(2.594973144626931, 0.00972590232121263, 522.0)
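
The sign of a two-sample statistic depends on how the responses are coded. As a small illustration on made-up data (not NHANES), the sketch below computes a pooled-variance t-statistic in plain Python and shows that switching the coding from Yes=1, No=2 to Yes=1, No=0 flips its sign without changing its size. `pooled_t` is a hypothetical helper.

```python
import math
from statistics import mean, variance

def pooled_t(x, y):
    """Pooled-variance two-sample t-statistic for mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    # Pooled estimate of the common variance
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# Made-up responses coded Yes=1, No=2
group_a = [2, 2, 1, 2, 2, 2, 1, 2]
group_b = [1, 2, 1, 1, 2, 1, 2, 1]

t_12 = pooled_t(group_a, group_b)
# Same responses recoded as Yes=1, No=0
t_10 = pooled_t([2 - v for v in group_a], [2 - v for v in group_b])
print(t_12, t_10)
```

The p-value of a two-sided test is the same under either coding, so only the direction of the reported statistic needs care when interpreting it.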

### Hypothesis Tests Comparing Means

Tests of means are similar in many ways to tests of proportions.  Just as with proportions, for comparing means there are one-sample and two-sample tests, z-tests and t-tests, and one-sided and two-sided tests.  As with tests of proportions, one-sample tests of means are not very common, but we illustrate a one-sample test in the cells below.  We compare the systolic blood pressure of men between the ages of 40 and 50 to the fixed value 120 (which is the lower threshold for "pre-hypertension"), and find that the mean is significantly different from 120 (the point estimate of the mean is 126).

In [109]:
dx = da[["BPXSY1", "RIDAGEYR", "RIAGENDRx"]].dropna()
dx.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
0,128.0,62,Male
1,146.0,53,Male
2,138.0,78,Male
3,132.0,56,Female
4,100.0,42,Female


In [110]:
dx = dx.loc[(dx.RIDAGEYR >= 40) & (dx.RIDAGEYR <= 50) & (dx.RIAGENDRx == "Male"), :]
dx.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
10,144.0,46,Male
11,116.0,45,Male
20,110.0,49,Male
42,128.0,42,Male
51,118.0,50,Male


In [111]:
print(dx.BPXSY1.mean())

125.86698337292161


In [112]:
sm.stats.ztest(dx.BPXSY1, value=120)

(7.469764137102597, 8.033869113167905e-14)
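
The z-statistic for a mean is the distance of the sample mean from the null value, measured in standard errors. Below is a minimal sketch on made-up readings (not NHANES values), using the sample standard deviation with an n - 1 denominator. `one_sample_ztest` is a hypothetical helper, not the statsmodels function.

```python
import math
from statistics import mean, stdev

def one_sample_ztest(values, null_value):
    """Two-sided one-sample z-test for a mean (hypothetical helper)."""
    # Standard error of the mean from the sample standard deviation
    se = stdev(values) / math.sqrt(len(values))
    z = (mean(values) - null_value) / se
    # Two-sided p-value from the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up systolic readings for illustration only
readings = [118, 122, 130, 126, 134, 120, 128, 124]
z, p_value = one_sample_ztest(readings, 120)
# Testing against the sample mean itself gives z = 0 and p = 1
z0, p0 = one_sample_ztest(readings, mean(readings))
print(z, p_value, z0, p0)
```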

In the cells below, we carry out a formal test of the null hypothesis that the mean blood pressure for women between the ages of 50 and 60 is equal to the mean blood pressure of men between the ages of 50 and 60.  The results indicate that while the mean systolic blood pressure for men is slightly greater than that for women (129 mm Hg versus 128 mm Hg), this difference is not statistically significant.

There are a number of different variants of the two-sample t-test.  Two often-encountered variants are the t-test carried out using the t-distribution, and the t-test carried out using the normal approximation to the reference distribution of the test statistic, often called a z-test.  Below we display results from both these testing approaches.  When the sample size is large, the difference between the t-test and z-test is very small; here the two p-values agree to three decimal places (0.269).

In [0]:
dx = da[["BPXSY1", "RIDAGEYR", "RIAGENDRx"]].dropna()

In [115]:
dx.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
0,128.0,62,Male
1,146.0,53,Male
2,138.0,78,Male
3,132.0,56,Female
4,100.0,42,Female


In [0]:
dx=dx.loc[(dx['RIDAGEYR']>=50) & (dx['RIDAGEYR']<=60 ),:]

In [120]:
dx.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
1,146.0,53,Male
3,132.0,56,Female
9,178.0,56,Male
15,134.0,57,Female
19,136.0,54,Female


In [0]:
dx_male = dx[dx['RIAGENDRx']=='Male']
dx_female =dx[dx['RIAGENDRx']=='Female']

In [123]:
dx_male.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
1,146.0,53,Male
9,178.0,56,Male
24,136.0,56,Male
28,132.0,51,Male
32,114.0,56,Male


In [124]:
dx_female.head()

Unnamed: 0,BPXSY1,RIDAGEYR,RIAGENDRx
3,132.0,56,Female
15,134.0,57,Female
19,136.0,54,Female
23,116.0,58,Female
27,142.0,60,Female


In [125]:
dx_male.describe()

Unnamed: 0,BPXSY1,RIDAGEYR
count,470.0,470.0
mean,129.238298,55.191489
std,18.283442,3.293971
min,92.0,50.0
25%,116.0,52.0
50%,126.0,55.0
75%,138.0,58.0
max,236.0,60.0


In [126]:
dx_female.describe()

Unnamed: 0,BPXSY1,RIDAGEYR
count,484.0,484.0
mean,127.92562,54.952479
std,18.388341,3.221276
min,84.0,50.0
25%,114.0,52.0
50%,126.0,55.0
75%,138.0,58.0
max,218.0,60.0


In [127]:
sm.stats.ztest(dx_female["BPXSY1"].dropna(), dx_male["BPXSY1"].dropna())

(-1.105435895556249, 0.2689707570859362)

In [128]:
print(sm.stats.ttest_ind(dx_female["BPXSY1"].dropna(), dx_male["BPXSY1"].dropna()))

(-1.105435895556249, 0.26925004137768577, 952.0)
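
The test statistic above can be recovered from the `describe()` summaries alone. The sketch below assumes the pooled-variance form of the two-sample statistic, which matches the output above to the precision of the rounded summaries. The helper name `two_sample_z_from_summary` is hypothetical.

```python
import math

def two_sample_z_from_summary(m1, s1, n1, m2, s2, n2):
    """Two-sided two-sample z-test from means, SDs and sizes (hypothetical helper)."""
    # Pooled estimate of the common variance
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    z = (m1 - m2) / se
    # Two-sided p-value from the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Females first, males second, as in the calls above
z, p_value = two_sample_z_from_summary(127.92562, 18.388341, 484,
                                       129.238298, 18.283442, 470)
print(z, p_value)
```

The t-test uses the same statistic and differs only in its reference distribution (a t-distribution with 952 degrees of freedom), which is why the two p-values are nearly identical.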
