# Practice notebook for confidence intervals using NHANES data

This notebook will give you the opportunity to practice working with confidence intervals using the NHANES data.

You can enter your code into the cells that say "enter your code here", and you can type responses to the questions into the cells that say "Type Markdown and Latex".

Note that most of the code that you will need to write below is very similar to code that appears in the case study notebook.  You will need to edit code from that notebook in small ways to adapt it to the prompts below.

To get started, we will use the same module imports and read the data in the same way as we did in the case study:

In [1]:
%matplotlib inline
!pip install statsmodels
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm

da = pd.read_csv("nhanes_2015_2016.csv")
da


Unnamed: 0,SEQN,ALQ101,ALQ110,ALQ130,SMQ020,RIAGENDR,RIDAGEYR,RIDRETH1,DMDCITZN,DMDEDUC2,...,BPXSY2,BPXDI2,BMXWT,BMXHT,BMXBMI,BMXLEG,BMXARML,BMXARMC,BMXWAIST,HIQ210
0,83732,1.0,,1.0,1,1,62,3,1.0,5.0,...,124.0,64.0,94.8,184.5,27.8,43.3,43.6,35.9,101.1,2.0
1,83733,1.0,,6.0,1,1,53,3,2.0,3.0,...,140.0,88.0,90.4,171.4,30.8,38.0,40.0,33.2,107.9,
2,83734,1.0,,,1,1,78,3,1.0,3.0,...,132.0,44.0,83.4,170.1,28.8,35.6,37.0,31.0,116.5,2.0
3,83735,2.0,1.0,1.0,2,2,56,3,1.0,5.0,...,134.0,68.0,109.8,160.9,42.4,38.5,37.7,38.3,110.1,2.0
4,83736,2.0,1.0,1.0,2,2,42,4,1.0,4.0,...,114.0,54.0,55.2,164.9,20.3,37.4,36.0,27.2,80.4,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5730,93695,2.0,2.0,,1,2,76,3,1.0,3.0,...,112.0,46.0,59.1,165.8,21.5,38.2,37.0,29.5,95.0,2.0
5731,93696,2.0,2.0,,2,1,26,3,1.0,5.0,...,116.0,76.0,112.1,182.2,33.8,43.4,41.8,42.3,110.2,2.0
5732,93697,1.0,,1.0,1,2,80,3,1.0,4.0,...,146.0,58.0,71.7,152.2,31.0,31.3,37.5,28.8,,2.0
5733,93700,,,,1,1,35,3,2.0,1.0,...,106.0,66.0,78.2,173.3,26.0,40.3,37.5,30.6,98.9,2.0


## Question 1

Restrict the sample to women between 35 and 50, then use the marital status variable [DMDMARTL](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm#DMDMARTL) to partition this sample into two groups - women who are currently married, and women who are not currently married.  Within each of these groups, calculate the proportion of women who have completed college.  Calculate 95% confidence intervals for each of these proportions.

In [2]:
# enter your code here
# Keep only the female participants; .copy() avoids SettingWithCopyWarning on later assignments
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
dx = da[da.RIAGENDRx == 'Female'].copy()

# Restrict the sample to women between 35 and 50.
# pd.cut bins are right-closed, so this keeps ages in (35, 50] and excludes age 35.
dx['age'] = pd.cut(dx.RIDAGEYR,[35,50])
dx_female = dx[dx['age'].notna()].copy()

# Recode marital status: 1 = married, all other valid codes = not married
dx_female['DMDMARTLx'] = dx_female.DMDMARTL.replace({1:'married',5:'not married',2:'not married',3:'not married',4:'not married',6:'not married',
77:np.nan,99:np.nan})

# Recode education: 5 = college graduate, other codes = not graduated
dx_female['DMDEDUC2x'] = dx_female.DMDEDUC2.replace({1.0:'not graduated',5.0:'graduated',2.0:'not graduated',3.0:'not graduated',4.0:'not graduated',
6.0:'not graduated',7.0:'not graduated',9.0:np.nan})

# Cross-tabulate marital status against education, dropping missing values
dz = dx_female[['DMDMARTLx','DMDEDUC2x']].dropna()
x = pd.crosstab(dz.DMDMARTLx,dz.DMDEDUC2x)
print(x)

# 95% CI for the proportion of married women who completed college
married_graduated = sm.stats.proportion_confint(154,154+265)

# 95% CI for the proportion of not married women who completed college
not_married_graduated = sm.stats.proportion_confint(67,67+254)

print('95% CI for the proportion of married women who completed college: ',married_graduated,'\n')
print('95% CI for the proportion of not married women who completed college:',not_married_graduated)

DMDEDUC2x    graduated  not graduated
DMDMARTLx                            
married            154            265
not married         67            254
95% CI for the proportion of married women who completed college:  (0.3213770303614961, 0.41370650185807434) 

95% CI for the proportion of not married women who completed college: (0.16426526549807702, 0.2531802173679666)

__Q1a.__ Identify which of the two confidence intervals is wider, and explain why this is the case. 

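The 95% CI for married women is slightly wider (about 0.092, versus about 0.089 for women who are not married). The width of the interval is proportional to the standard error, $\sqrt{\hat{p}(1-\hat{p})/n}$. The married group is larger (419 versus 321 women), which by itself would narrow its interval, but its estimated proportion (about 0.37) is closer to 0.5 than that of the not married group (about 0.21). This makes $\hat{p}(1-\hat{p})$ larger, and that effect outweighs the larger sample size.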
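
The width comparison can be checked directly from the crosstab counts. The sketch below assumes the normal-approximation (Wald) interval, which is the default method of `sm.stats.proportion_confint`. The helper name `wald_ci` is introduced here for illustration only.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Counts taken from the crosstab printed for Question 1
married = wald_ci(154, 154 + 265)
not_married = wald_ci(67, 67 + 254)

# Interval width = upper limit minus lower limit
width_married = married[1] - married[0]
width_not_married = not_married[1] - not_married[0]
print(round(width_married, 3), round(width_not_married, 3))
```

The two widths differ only in the third decimal place, so the difference between the groups is small.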

__Q1b.__ Write 1-2 sentences summarizing these findings for an audience that does not know what a confidence interval is (the goal here is to report the substance of what you learned about how marital status and educational attainment are related, not to teach a person what a confidence interval is).

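Among women aged 35 to 50 in this sample, about 37% of those who are currently married have completed college, compared with about 21% of those who are not currently married. Allowing for sampling uncertainty, the college completion rate is plausibly between 32% and 41% for married women and between 16% and 25% for women who are not married, so married women in this age range appear more likely to have completed college.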

## Question 2

Construct 95% confidence intervals for the proportion of smokers who are female, and for the proportion of smokers who are male.  Then construct a 95% confidence interval for the difference between these proportions.

In [3]:
# enter your code here
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
dx = da[["SMQ020x","RIAGENDRx"]].dropna()
print(pd.crosstab(dx.SMQ020x,dx.RIAGENDRx))
print()
# 95% CIs for the proportion of females who smoke and the proportion of males who smoke
female = sm.stats.proportion_confint(906,2066+906)
male = sm.stats.proportion_confint(1413,1340+1413)
print('95% CI for the proportion of females who smoke: ',female,'\n')
print('95% CI for the proportion of males who smoke:',male)

RIAGENDRx  Female  Male
SMQ020x                
No           2066  1340
Yes           906  1413

95% CI for the proportion of females who smoke:  (0.2882949879861214, 0.32139545615923526) 

95% CI for the proportion of males who smoke: (0.49458749263718593, 0.5319290347874418)


__Q2a.__ Discuss why it may be relevant to report the proportions of smokers who are female and male, and contrast this to reporting the proportions of males and females who smoke.

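The two quantities condition on different groups and answer different questions. The proportion of smokers who are female (906 of 2319 smokers, about 39%) describes the makeup of the smoking population, which is relevant when the focus is on smokers as a group. The proportion of females who smoke (906 of 2972 women, about 30%) describes how common smoking is among women, and can be compared directly with the corresponding rate among men (about 51%). Partitioning the data by gender first and then calculating the percentage who smoke within each group makes that comparison clearer and leads to better estimates of how smoking differs between men and women.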
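
The distinction between the two conditional proportions can be made concrete with the counts from the crosstab. This is a minimal sketch that only re-uses the four printed counts. The variable names are chosen here for illustration.

```python
# Counts from the smoking-by-gender crosstab printed for Question 2
female_smokers, male_smokers = 906, 1413
female_nonsmokers, male_nonsmokers = 2066, 1340

# Proportion of smokers who are female: the denominator is all smokers
p_female_given_smoker = female_smokers / (female_smokers + male_smokers)

# Proportion of females who smoke: the denominator is all females
p_smoker_given_female = female_smokers / (female_smokers + female_nonsmokers)

print(round(p_female_given_smoker, 3), round(p_smoker_given_female, 3))
```

The numerator is the same in both cases. Only the denominator, and therefore the population being described, changes.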

__Q2b.__ How does the width of the confidence interval for the difference of the two proportions compare to the widths of the confidence intervals for each proportion separately?
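
One way to approach this numerically is to compute the standard error of the difference from the two separate standard errors. The sketch below assumes the female and male samples are independent and uses the normal-approximation interval. The helper `prop_and_se` is a hypothetical name, not part of statsmodels.

```python
import math

def prop_and_se(successes, n):
    """Return the sample proportion and its standard error."""
    p = successes / n
    return p, math.sqrt(p * (1 - p) / n)

# Counts from the smoking-by-gender crosstab printed for Question 2
p_f, se_f = prop_and_se(906, 2066 + 906)     # females who smoke
p_m, se_m = prop_and_se(1413, 1340 + 1413)   # males who smoke

# For independent estimates, the variances add
diff = p_f - p_m
se_diff = math.sqrt(se_f**2 + se_m**2)
ci_diff = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

print("difference CI:", ci_diff)
print("widths:", 2 * 1.96 * se_f, 2 * 1.96 * se_m, 2 * 1.96 * se_diff)
```

Because the variances add, the interval for the difference is wider than either individual interval, but narrower than the sum of their widths.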

## Question 3

Construct a 95% confidence interval for height ([BMXHT](https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/BMX_I.htm#BMXHT)) in centimeters.  Then convert height from centimeters to inches by dividing by 2.54, and construct a 95% confidence interval for height in inches.  Finally, convert the endpoints (the lower and upper confidence limits) of the confidence interval from inches back to centimeters.

In [4]:
# enter your code here
print(da.groupby("RIAGENDRx").agg({"BMXHT": [np.mean, np.std, np.size]}))

sem_female = 7.193736 / np.sqrt(2976)
sem_male = 7.834110 / np.sqrt(2759)
print('standard error mean:',sem_female, sem_male)
print()

lcb_female = 159.673184 - 1.96 * 7.193736 / np.sqrt(2976)
ucb_female = 159.673184 + 1.96 * 7.193736 / np.sqrt(2976)
print('95% interval for height (BMXHT) in centimeters:',lcb_female, ucb_female)
print()

lcb_male = 173.132050 - 1.96 *  7.834110 / np.sqrt(2759)
ucb_male = 173.132050 + 1.96 *  7.834110 / np.sqrt(2759)
print('95% interval for height (BMXHT) in centimeters:',lcb_male, ucb_male)


                BMXHT                  
                 mean       std    size
RIAGENDRx                              
Female     159.673184  7.193736  2976.0
Male       173.132050  7.834110  2759.0
standard error mean: 0.13186757882815625 0.14914675712884765

95% interval for height (BMXHT) in centimeters: 159.4147235454968 159.93164445450319

95% interval for height (BMXHT) in centimeters: 172.83972235602744 173.42437764397255


In [10]:
# Convert height from centimeters to inches
da['BMXHTx'] = da['BMXHT']/2.54
print(da.groupby("RIAGENDRx").agg({"BMXHTx": [np.mean, np.std, np.size]}))
sem_female = 2.832179 / np.sqrt(2976)
sem_male = 3.084295 / np.sqrt(2759)
print('standard error mean:',sem_female, sem_male)
print()

# 95% CI for mean female height in inches
lcb_female = 62.863458 - 1.96 * 2.832179 / np.sqrt(2976)
ucb_female = 62.863458 + 1.96 * 2.832179 / np.sqrt(2976)
print('95% interval for height (BMXHT) in inches:',lcb_female, ucb_female)
print()

# 95% CI for mean male height in inches
lcb_male = 68.162224 - 1.96 * 3.084295 / np.sqrt(2759)
ucb_male = 68.162224 + 1.96 * 3.084295 / np.sqrt(2759)
print('95% interval for height (BMXHT) in inches:',lcb_male, ucb_male)

              BMXHTx                  
                mean       std    size
RIAGENDRx                             
Female     62.863458  2.832179  2976.0
Male       68.162224  3.084295  2759.0
standard error mean: 0.05191635994675767 0.05871919047329169

95% interval for height (BMXHT) in inches: 62.76170193450436 62.965214065495644

95% interval for height (BMXHT) in inches: 68.04713438667234 68.27731361332765


__Q3a.__ Describe how the confidence interval constructed in centimeters relates to the confidence interval constructed in inches.
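
The relationship can be checked by converting the inch-scale limits back to centimeters. The sketch below re-uses the female summary statistics printed in the centimeter table. The helper `mean_ci` and the constant `CM_PER_INCH` are names introduced here for illustration.

```python
import math

CM_PER_INCH = 2.54

def mean_ci(mean, sd, n, z=1.96):
    """Normal-approximation 95% interval for a mean from summary statistics."""
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se

# Female height summary in centimeters, copied from the groupby output
mean_cm, sd_cm, n = 159.673184, 7.193736, 2976

ci_cm = mean_ci(mean_cm, sd_cm, n)

# Rescaling the data rescales the mean and the standard deviation by the same factor
ci_in = mean_ci(mean_cm / CM_PER_INCH, sd_cm / CM_PER_INCH, n)

# Converting the inch limits back to centimeters recovers the original interval
ci_back = tuple(limit * CM_PER_INCH for limit in ci_in)
print(ci_cm)
print(ci_back)
```

Since both the mean and the standard error scale linearly with the unit, the interval in inches is the interval in centimeters divided by 2.54.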

## Question 4

Partition the sample based on 10-year age bands, i.e. the resulting groups will consist of people with ages from 18-28, 29-38, etc. Construct 95% confidence intervals for the difference between the mean BMI for females and for males within each age band.

In [22]:
# enter your code here
# Use 10-year age bands; include_lowest=True keeps 18-year-olds in the first band
da['RIDAGEYRx'] = pd.cut(da.RIDAGEYR,[18,28,38,48,58,68,80],include_lowest=True)
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
# "count" gives the number of non-missing BMI values in each age band and gender
dx = da.groupby(['RIDAGEYRx','RIAGENDRx']).agg({"BMXBMI":["mean","std","count"]}).unstack()

# Calculate the SEM for females and for males within each age band
dx['BMXBMI','sem','Female'] = dx['BMXBMI','std','Female']/np.sqrt(dx['BMXBMI','count','Female'])
dx['BMXBMI','sem','Male'] = dx['BMXBMI','std','Male']/np.sqrt(dx['BMXBMI','count','Male'])

# Calculate the mean difference of BMI between females and males within each age band
dx['BMXBMI','mean_dif',''] = dx['BMXBMI','mean','Female'] - dx['BMXBMI','mean','Male']
print('The mean difference of BMI between females and males within each age band: ',dx['BMXBMI','mean_dif',''])
print()

# Calculate the SE of the difference (the two groups are independent, so the variances add)
dx["BMXBMI", "sem_dif", ""] = np.sqrt(dx["BMXBMI", "sem", "Female"]**2 + dx["BMXBMI", "sem", "Male"]**2) 
print('Its SE: ',dx["BMXBMI", "sem_dif", ""])
print()

# The lower and upper limits of its 95% CI
dx["BMXBMI", "lcb_dif", ""] = dx["BMXBMI", "mean_dif", ""] - 1.96 * dx["BMXBMI", "sem_dif", ""] 
dx["BMXBMI", "ucb_dif", ""] = dx["BMXBMI", "mean_dif", ""] + 1.96 * dx["BMXBMI", "sem_dif", ""]
print('The lower limits of its 95% CI: ',dx["BMXBMI", "lcb_dif", ""])
print()
print('The upper limits of its 95% CI: ',dx["BMXBMI", "ucb_dif", ""])

__Q4a.__ How do the widths of these confidence intervals differ?  Provide an explanation for any substantial differences in the confidence interval widths that you see.
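
A useful way to reason about the widths is that each interval spans 2 × 1.96 standard errors, and the standard error of a difference in means shrinks as the group sizes grow. The sketch below uses hypothetical standard deviations and sample sizes (not NHANES values) to show how a band with fewer people produces a wider interval. `se_mean_diff` is a name introduced here for illustration.

```python
import math

def se_mean_diff(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent sample means."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Hypothetical values for illustration only (not taken from the NHANES data)
sd = 7.0
large_band = se_mean_diff(sd, 500, sd, 500)   # many people per gender in the band
small_band = se_mean_diff(sd, 50, sd, 50)     # ten times fewer people per gender

print("CI width, large band:", 2 * 1.96 * large_band)
print("CI width, small band:", 2 * 1.96 * small_band)
print("ratio of widths:", small_band / large_band)
```

With equal standard deviations, cutting the sample size by a factor of ten widens the interval by a factor of √10, so bands with few participants or more variable BMI values would be expected to have the widest intervals.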

## Question 5

Construct a 95% confidence interval for the first and second systolic blood pressure measures, and for the difference between the first and second systolic blood pressure measurements within a subject.

In [7]:
# enter code here
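
As a starting point for this cell, the sketch below shows the structure of the calculation on synthetic data. The simulated lists `sy1` and `sy2` are stand-ins for the NHANES columns `BPXSY1` and `BPXSY2` and are not real measurements. The helper `mean_ci` is a name introduced here. With the real data, the same steps would presumably be applied to the rows of `da[["BPXSY1", "BPXSY2"]].dropna()`, so that both measurements are present for every subject.

```python
import math
import random
import statistics

def mean_ci(values, z=1.96):
    """Normal-approximation 95% interval for the mean of a list of numbers."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m - z * se, m + z * se

# Synthetic stand-in data (NOT NHANES): each subject has a personal blood
# pressure level plus independent measurement noise on each reading
random.seed(0)
subject_level = [random.gauss(125, 18) for _ in range(500)]
sy1 = [level + random.gauss(0.5, 4) for level in subject_level]
sy2 = [level + random.gauss(0.0, 4) for level in subject_level]

# Within-subject (paired) differences: first reading minus second reading
paired_diff = [a - b for a, b in zip(sy1, sy2)]

ci1 = mean_ci(sy1)
ci2 = mean_ci(sy2)
ci_diff = mean_ci(paired_diff)
print("first measure: ", ci1)
print("second measure:", ci2)
print("paired diff:   ", ci_diff)
```

The interval for the difference is built from the per-subject differences, not from the two separate intervals, because the two readings on the same person are not independent.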

__Q5a.__ Based on these confidence intervals, would you say that a difference of zero between the population mean values of the first and second systolic blood pressure measures is consistent with the data?


__Q5b.__ Discuss how the width of the confidence interval for the within-subject difference compares to the widths of the confidence intervals for the first and second measures.
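
The comparison can be reasoned through with the variance of a paired difference. The derivation below is a sketch under the assumption that each of the $n$ subjects contributes both measurements; the symbols $\sigma_1$, $\sigma_2$ (standard deviations of the first and second measures) and $\rho$ (their within-subject correlation) are introduced here for illustration.

$$
D_i = X_{1i} - X_{2i}, \qquad
\mathrm{Var}(D_i) = \sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2, \qquad
\mathrm{SE}(\bar{D}) = \sqrt{\frac{\sigma_1^2 + \sigma_2^2 - 2\rho\,\sigma_1\sigma_2}{n}}
$$

When the two measures are strongly positively correlated ($\rho$ close to 1) and $\sigma_1 \approx \sigma_2$, the term $2\rho\,\sigma_1\sigma_2$ cancels most of $\sigma_1^2 + \sigma_2^2$, so $\mathrm{SE}(\bar{D})$ can be much smaller than either $\sigma_1/\sqrt{n}$ or $\sigma_2/\sqrt{n}$, and the interval for the within-subject difference is correspondingly narrower.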

## Question 6

Construct a 95% confidence interval for the mean difference between the average age of a smoker, and the average age of a non-smoker.

In [44]:
# insert your code here
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
dx = da[['SMQ020x','RIDAGEYR']].dropna()

# Smokers: mean, sample standard deviation, and group size of age
age_smoker = dx.loc[dx.SMQ020x == 'Yes', 'RIDAGEYR']
mean_smoker = age_smoker.mean()
print(mean_smoker)
std_smoker = age_smoker.std()
size_smoker = age_smoker.size

# Non-smokers: the same summaries
age_non_smoker = dx.loc[dx.SMQ020x == 'No', 'RIDAGEYR']
mean_non_smoker = age_non_smoker.mean()
std_non_smoker = age_non_smoker.std()
size_non_smoker = age_non_smoker.size

# 95% CI for the mean age of smokers, using that group's own sample size
sem_smoker = std_smoker / np.sqrt(size_smoker)
low_smoker = mean_smoker - 1.96 * sem_smoker
print(low_smoker)
upp_smoker = mean_smoker + 1.96 * sem_smoker
print(upp_smoker)

# 95% CI for the mean age of non-smokers
sem_non_smoker = std_non_smoker / np.sqrt(size_non_smoker)
low_non_smoker = mean_non_smoker - 1.96 * sem_non_smoker
print(low_non_smoker)
upp_non_smoker = mean_non_smoker + 1.96 * sem_non_smoker
print(upp_non_smoker)

# 95% CI for the difference in mean age (smokers minus non-smokers)
sem_diff = np.sqrt(sem_smoker**2 + sem_non_smoker**2)
mean_diff = mean_smoker - mean_non_smoker
print(mean_diff - 1.96 * sem_diff, mean_diff + 1.96 * sem_diff)


__Q6a.__ Use graphical and numerical techniques to compare the variation in the ages of smokers to the variation in the ages of non-smokers.  

In [9]:
# insert your code here


__Q6b.__ Does it appear that uncertainty about the mean age of smokers, or uncertainty about the mean age of non-smokers contributed more to the uncertainty for the mean difference that we are focusing on here?
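
One way to frame this question is to split the squared standard error of the difference into the part contributed by each group. In the sketch below, the group sizes are the smoker and non-smoker totals from the Question 2 crosstab (assuming no ages are missing), while the two standard deviations are hypothetical placeholders that should be replaced with the values computed from the data.

```python
import math

# Group sizes: column totals of the Question 2 crosstab (906 + 1413 and 2066 + 1340)
n_smoker, n_non_smoker = 2319, 3406

# Hypothetical placeholder standard deviations of age (NOT computed from the data)
sd_smoker, sd_non_smoker = 17.5, 18.5

# Each group's contribution to the variance of the difference in means
var_smoker = sd_smoker**2 / n_smoker
var_non_smoker = sd_non_smoker**2 / n_non_smoker

se_diff = math.sqrt(var_smoker + var_non_smoker)
share_smoker = var_smoker / (var_smoker + var_non_smoker)
share_non_smoker = var_non_smoker / (var_smoker + var_non_smoker)

print("SE of the difference:", se_diff)
print("share from smokers:", share_smoker)
print("share from non-smokers:", share_non_smoker)
```

When the two standard deviations are similar, the smaller group contributes the larger share, because its variance is divided by a smaller sample size.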