# Practice notebook for hypothesis tests using NHANES data

This notebook will give you the opportunity to perform some hypothesis tests with the NHANES data that are similar to
what was done in the week 3 case study notebook.

You can enter your code into the cells that say "insert your code here", and you can type responses to the questions into the cells that say "Type Markdown and LaTeX".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels.api as sm
import numpy as np

da = pd.read_csv("nhanes_2015_2016.csv")

In [8]:
nhanes = da.copy()

In [9]:
nhanes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5735 entries, 0 to 5734
Data columns (total 28 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      5735 non-null   int64  
 1   ALQ101    5208 non-null   float64
 2   ALQ110    1731 non-null   float64
 3   ALQ130    3379 non-null   float64
 4   SMQ020    5735 non-null   int64  
 5   RIAGENDR  5735 non-null   int64  
 6   RIDAGEYR  5735 non-null   int64  
 7   RIDRETH1  5735 non-null   int64  
 8   DMDCITZN  5734 non-null   float64
 9   DMDEDUC2  5474 non-null   float64
 10  DMDMARTL  5474 non-null   float64
 11  DMDHHSIZ  5735 non-null   int64  
 12  WTINT2YR  5735 non-null   float64
 13  SDMVPSU   5735 non-null   int64  
 14  SDMVSTRA  5735 non-null   int64  
 15  INDFMPIR  5134 non-null   float64
 16  BPXSY1    5401 non-null   float64
 17  BPXDI1    5401 non-null   float64
 18  BPXSY2    5535 non-null   float64
 19  BPXDI2    5535 non-null   float64
 20  BMXWT     5666 non-null   floa

## Question 1

Conduct a hypothesis test (at the 0.05 level) for the null hypothesis that the proportion of women who smoke is equal to the proportion of men who smoke.

In [10]:
# insert your code here
nhanes["Smoker"] = nhanes.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})  # np.nan represents a missing value
nhanes["Gender"] = nhanes.RIAGENDR.replace({1: "Male", 2: "Female"})
df_smoker = nhanes[["Smoker", "Gender"]].dropna()  # dropna drops cases where either variable is missing
pd.crosstab(df_smoker.Smoker, df_smoker.Gender)

Gender  Female  Male
Smoker
No        2066  1340
Yes        906  1413


Null hypothesis: the proportion of men who smoke equals the proportion of women who smoke.
Alternative hypothesis: the two proportions are not equal.

This calls for a two-sample test for a difference in population proportions (a two-proportion z-test).
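
One way to carry out this test by hand is a pooled two-proportion z-test. The sketch below is a minimal illustration, not the case study's own code: it hard-codes the counts from the crosstab above, uses only the standard library, and all variable names are chosen here for illustration. statsmodels also provides `proportions_ztest` in `statsmodels.stats.proportion`, which could serve as a cross-check.

```python
import math

# Counts taken from the crosstab above.
female_smokers, female_total = 906, 906 + 2066
male_smokers, male_total = 1413, 1413 + 1340

p_female = female_smokers / female_total
p_male = male_smokers / male_total

# Under the null hypothesis both groups share one smoking rate, so pool them.
p_pooled = (female_smokers + male_smokers) / (female_total + male_total)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / female_total + 1 / male_total))

z_stat = (p_female - p_male) / se_pooled
# Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
p_value = math.erfc(abs(z_stat) / math.sqrt(2))

print(f"z = {z_stat:.2f}, p-value = {p_value:.3g}")
```

A p-value below 0.05 would lead to rejecting the null hypothesis of equal smoking rates at the stated level.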

In [11]:
# group by gender; compute the proportion of smokers and the size of each group
df_prob_smoker = df_smoker.groupby(df_smoker.Gender).agg({'Smoker': [lambda x: np.mean(x=='Yes'), np.size]})
df_prob_smoker.columns = ['smokers','size']
df_prob_smoker

         smokers  size
Gender
Female  0.304845  2972
Male    0.513258  2753


In [13]:
#drop missing values
df_smoker_new = df_prob_smoker[["smokers", "size"]].dropna()

In [14]:
def agg_fun(x):
    print(f"--- called with {type(x).__name__} ---:\n{x}\n{'='*27}")
    return np.mean(x=='Yes')

df_smoker.agg(agg_fun)

--- called with Series ---:
0       Yes
1       Yes
2       Yes
3        No
4        No
       ... 
5730    Yes
5731     No
5732    Yes
5733    Yes
5734     No
Name: Smoker, Length: 5725, dtype: object
--- called with Series ---:
0         Male
1         Male
2         Male
3       Female
4       Female
         ...  
5730    Female
5731      Male
5732    Female
5733      Male
5734    Female
Name: Gender, Length: 5725, dtype: object


Smoker    0.405066
Gender    0.000000
dtype: float64

In [None]:
# conduct a test comparing the smoking proportions of the female and male populations


__Q1a.__ Write 1-2 sentences explaining the substance of your findings to someone who does not know anything about statistical hypothesis tests.

__Q1b.__ Create three 95% confidence intervals: one for the proportion of women who smoke, one for the proportion of men who smoke, and one for the difference in the rates of smoking between women and men.

In [None]:
#Building confidence intervals manually

In [None]:
# Building confidence interval for difference
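
The three intervals can be built manually from the counts in the crosstab above. The following is a sketch under stated assumptions: it uses the normal-approximation (Wald) interval with a multiplier of 1.96, an unpooled standard error for the difference, and helper names (`proportion_ci`, `difference_ci`) that are hypothetical and not taken from the case study.

```python
import math

Z_95 = 1.96  # approximate 97.5th percentile of the standard normal

def proportion_ci(successes, n, z=Z_95):
    """Return (estimate, lower, upper) for a Wald interval on a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def difference_ci(s1, n1, s2, n2, z=Z_95):
    """Return (estimate, lower, upper) for p1 - p2 using the unpooled standard error."""
    p1, p2 = s1 / n1, s2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Smoker counts and group sizes from the crosstab above.
women = proportion_ci(906, 2972)
men = proportion_ci(1413, 2753)
diff = difference_ci(906, 2972, 1413, 2753)

for label, (est, lo, hi) in [("women", women), ("men", men), ("women - men", diff)]:
    print(f"{label}: {est:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

If the interval for the difference excludes zero, that agrees with a two-sided test rejecting equality at the 0.05 level.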


__Q1c.__ Comment on any ways in which the confidence intervals that you found in part b reinforce, contradict, or add support to the hypothesis test conducted in part a.

## Question 2

Partition the population into two groups based on whether a person has graduated college or not, using the educational attainment variable [DMDEDUC2](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDEDUC2).  Then conduct a test of the null hypothesis that the average heights (in centimeters) of the two groups are equal.  Next, convert the heights from centimeters to inches, and conduct a test of the null hypothesis that the average heights (in inches) of the two groups are equal.

In [None]:
# insert your code here

In [None]:
#drop nans in sample of grads

In [None]:
# two-sample test of the null hypothesis that the mean height of grads equals the mean height of non-grads
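
Before running the test twice, it may help to see why a change of units should not matter. The sketch below uses made-up heights (not NHANES values) and a hypothetical helper, `unpooled_z`. Dividing every height by 2.54 rescales the mean difference and its standard error by the same factor, so the test statistic is unchanged. In the real analysis the groups would come from `DMDEDUC2`, where the linked codebook lists code 5 as "College graduate or above".

```python
import math
import statistics

def unpooled_z(x, y):
    """z statistic for the difference in means, with an unpooled standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

# Made-up heights in centimeters, for illustration only.
grads_cm = [172.1, 165.4, 180.3, 158.9, 169.7, 175.2]
nongrads_cm = [168.0, 161.5, 177.8, 155.2, 166.3, 171.9, 163.4]

CM_PER_INCH = 2.54
grads_in = [h / CM_PER_INCH for h in grads_cm]
nongrads_in = [h / CM_PER_INCH for h in nongrads_cm]

z_cm = unpooled_z(grads_cm, nongrads_cm)
z_in = unpooled_z(grads_in, nongrads_in)
print(f"z (cm) = {z_cm:.4f}, z (inches) = {z_in:.4f}")
```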

__Q2a.__ Based on the analysis performed here, are you confident that people who graduated from college have a different average height compared to people who did not graduate from college?

__Q2b.__ How do the results obtained using the heights expressed in inches compare to the results obtained using the heights expressed in centimeters?

## Question 3

Conduct a hypothesis test of the null hypothesis that the average BMI for men between 30 and 40 is equal to the average BMI for men between 50 and 60.  Then carry out this test again after log transforming the BMI values.

In [None]:
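# insert your code here

# filter two series for the target populations
# (assumes the BMI column is named BMXBMI, which falls in the truncated part of the
# info() listing above, and treats both age bands as inclusive)
men = nhanes[nhanes.RIAGENDR == 1]  # 1 codes "Male", as in Question 1
bmi_m_30_40 = men.loc[men.RIDAGEYR.between(30, 40), "BMXBMI"].dropna()
bmi_m_50_60 = men.loc[men.RIDAGEYR.between(50, 60), "BMXBMI"].dropna()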

In [None]:
# print out mean and std of BMI for each group


In [None]:
from scipy import stats
stats.probplot(bmi_m_30_40, plot=plt, fit=True)
plt.show()

In [None]:
stats.probplot(bmi_m_50_60, plot=plt, fit=True)
plt.show()

In [None]:
# test of two means using an unpooled z-test (ztest_ind), since the standard deviations differ and the Q-Q plots look non-normal

bmi_30 = sm.stats.DescrStatsW(bmi_m_30_40)
bmi_50 = sm.stats.DescrStatsW(bmi_m_50_60)

sm.stats.CompareMeans(bmi_30, bmi_50).ztest_ind(usevar='unequal')
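
Unlike a change of units, a log transform is nonlinear, so the test statistic on log BMI will generally differ from the one on raw BMI. The sketch below illustrates this with made-up BMI values (not NHANES data) and a hypothetical helper, `unpooled_z`. In the notebook itself, the same comparison could be made by applying `np.log` to the two BMI series before building the `DescrStatsW` objects.

```python
import math
import statistics

def unpooled_z(x, y):
    """z statistic for the difference in means, with an unpooled standard error."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

# Made-up BMI values for two age bands, for illustration only.
bmi_younger = [24.1, 31.8, 27.5, 22.9, 35.2, 28.4, 26.0]
bmi_older = [26.7, 29.9, 41.3, 25.1, 30.6, 27.8, 33.4]

z_raw = unpooled_z(bmi_younger, bmi_older)
z_log = unpooled_z([math.log(v) for v in bmi_younger],
                   [math.log(v) for v in bmi_older])
print(f"z (raw BMI) = {z_raw:.3f}, z (log BMI) = {z_log:.3f}")
```

Because BMI tends to be right-skewed, the log scale reduces the influence of very large values, which is one reason the two tests can give somewhat different strengths of evidence.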

__Q3a.__ How would you characterize the evidence that mean BMI differs between these age bands, and how would you characterize the evidence that mean log BMI differs between these age bands?

# The remaining questions are optional

## Question 4

Suppose we wish to compare the mean BMI between college graduates and people who have not graduated from college, focusing on women between the ages of 30 and 40.  First, consider the variance of BMI within each of these subpopulations using graphical techniques, and through the estimated subpopulation variances.  Then, calculate pooled and unpooled estimates of the standard error for the difference between the mean BMI in the two populations being compared.  Finally, test the null hypothesis that the two population means are equal, using each of the two different standard errors.

In [None]:
# insert your code here
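
The two standard errors asked for here differ only in how the group variances are combined. The sketch below is an illustration under assumptions: the BMI values are made up, and `standard_errors` is a hypothetical helper rather than a statsmodels function. The pooled version assumes a common variance and weights each sample variance by its degrees of freedom. The unpooled version adds the two per-group variances of the mean.

```python
import math
import statistics

def standard_errors(x, y):
    """Return (pooled, unpooled) standard errors for mean(x) - mean(y)."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    # Unpooled: each group contributes its own variance of the mean.
    unpooled = math.sqrt(v1 / n1 + v2 / n2)
    # Pooled: one shared variance estimate, weighted by degrees of freedom.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return pooled, unpooled

# Made-up BMI values, for illustration only.
grads_bmi = [23.4, 27.1, 25.8, 30.2, 24.6, 26.9]
nongrads_bmi = [29.5, 35.8, 24.2, 41.0, 27.3, 33.6, 22.8, 38.1]

pooled_se, unpooled_se = standard_errors(grads_bmi, nongrads_bmi)
mean_diff = statistics.mean(grads_bmi) - statistics.mean(nongrads_bmi)

print(f"pooled SE = {pooled_se:.3f}, test statistic = {mean_diff / pooled_se:.3f}")
print(f"unpooled SE = {unpooled_se:.3f}, test statistic = {mean_diff / unpooled_se:.3f}")
```

When the two groups have equal sizes, the two formulas coincide. They diverge most when both the sample sizes and the variances differ.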

__Q4a.__ Comment on the strength of evidence against the null hypothesis that these two populations have equal mean BMI.

__Q4b.__ Comment on the degree to which the two populations have different variances, and on the extent to which the results using different approaches to estimating the standard error of the mean difference give divergent results.

## Question 5

Conduct a test of the null hypothesis that the first and second diastolic blood pressure measurements within a subject have the same mean values.

In [None]:
# insert your code here

__Q5a.__ Briefly describe your findings for an audience that is not familiar with statistical hypothesis testing.

__Q5b.__ Pretend that the first and second diastolic blood pressure measurements were taken on different people.  Modify the analysis above as appropriate for this setting.

In [None]:
# insert your code here
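
The contrast between the paired analysis and the independent-samples analysis can be sketched as follows. The readings are made up (not NHANES values), and `paired_z` and `independent_z` are hypothetical helpers. The paired version works on within-subject differences, which removes between-subject variation. The independent version treats the two columns as unrelated samples. With the real data, the columns would be `BPXDI1` and `BPXDI2`, and the paired test would need to keep only subjects with both readings present.

```python
import math
import statistics

def paired_z(first, second):
    """z statistic based on within-subject differences."""
    diffs = [a - b for a, b in zip(first, second)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def independent_z(first, second):
    """z statistic treating the two sets of readings as independent samples."""
    se = math.sqrt(statistics.variance(first) / len(first)
                   + statistics.variance(second) / len(second))
    return (statistics.mean(first) - statistics.mean(second)) / se

# Made-up diastolic readings (mm Hg); position i is the same subject in both lists.
first_reading = [70, 82, 64, 90, 76, 68, 84, 72]
second_reading = [68, 80, 64, 86, 75, 66, 82, 70]

z_paired = paired_z(first_reading, second_reading)
z_indep = independent_z(first_reading, second_reading)
print(f"paired z = {z_paired:.2f}, independent z = {z_indep:.2f}")
```

Both statistics share the same numerator, the difference in means. Only the standard error changes, and it is typically much smaller in the paired case when readings within a subject are strongly correlated.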

__Q5c.__ Briefly describe how the approaches used and the results obtained in the preceding two parts of the question differ.