# Confidence Intervals


This tutorial demonstrates how to load data, clean and manipulate a dataset, and construct confidence intervals for the difference between two population proportions and for the difference between two population means.

We will use the 2015-2016 wave of the NHANES data for our analysis.

*Note: A companion notebook includes further analysis, with examples of confidence intervals for a single population proportion and a single population mean, in addition to the analysis shown in this tutorial. We highly recommend checking it out!*

For population proportions, we will analyze the difference between the proportions of females and males who smoke. The column that distinguishes smokers from non-smokers is "SMQ020" in our dataset.

For population means, we will analyze the difference in mean body mass index (BMI) between females and males. The column that holds the BMI value is "BMXBMI".

Gender is specified in the column "RIAGENDR".

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [26]:
url = "data/nhanes_2015_2016.csv"
da = pd.read_csv(url)

### Investigating and Cleaning Data

In [27]:
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020x"]

0       Yes
1       Yes
2       Yes
3        No
4        No
5        No
6       Yes
7        No
8        No
9        No
10      Yes
11      Yes
12      Yes
13       No
14       No
15       No
16       No
17       No
18      Yes
19       No
20       No
21       No
22      Yes
23       No
24       No
25       No
26      Yes
27      Yes
28       No
29       No
       ... 
5705    Yes
5706    Yes
5707     No
5708     No
5709    Yes
5710     No
5711    Yes
5712     No
5713     No
5714     No
5715     No
5716    Yes
5717    Yes
5718     No
5719    Yes
5720     No
5721     No
5722     No
5723    Yes
5724     No
5725     No
5726    Yes
5727     No
5728     No
5729     No
5730    Yes
5731     No
5732    Yes
5733    Yes
5734     No
Name: SMQ020x, Length: 5735, dtype: object

In [28]:
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["RIAGENDRx"]

0         Male
1         Male
2         Male
3       Female
4       Female
5       Female
6         Male
7       Female
8         Male
9         Male
10        Male
11        Male
12      Female
13      Female
14        Male
15      Female
16      Female
17      Female
18      Female
19      Female
20        Male
21      Female
22      Female
23      Female
24        Male
25      Female
26        Male
27      Female
28        Male
29      Female
         ...  
5705      Male
5706      Male
5707    Female
5708    Female
5709      Male
5710    Female
5711      Male
5712    Female
5713      Male
5714      Male
5715    Female
5716    Female
5717      Male
5718      Male
5719    Female
5720      Male
5721    Female
5722    Female
5723    Female
5724    Female
5725      Male
5726      Male
5727    Female
5728      Male
5729      Male
5730    Female
5731      Male
5732    Female
5733      Male
5734    Female
Name: RIAGENDRx, Length: 5735, dtype: object
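
One way to confirm that a recode behaved as intended is to tabulate the new column with missing values included. The sketch below uses a small made-up Series rather than the NHANES file, so the values in `raw` are illustrative only. It assumes the usual NHANES convention that codes 7 and 9 mean "refused" and "don't know".

```python
import numpy as np
import pandas as pd

# Hypothetical raw codes in the style of SMQ020 (illustrative data, not from NHANES)
raw = pd.Series([1, 2, 2, 7, 1, 9, 2])

# Same mapping as used above for SMQ020x
recoded = raw.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})

# dropna=False keeps the NaN group so the missing responses stay visible
counts = recoded.value_counts(dropna=False)
print(counts)
```

Leaving out `dropna=False` would hide the missing responses, which makes it easy to overlook rows that should be excluded before computing proportions.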

In [29]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
ctab = pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)
ctab

RIAGENDRx  Female  Male
SMQ020x                
No           2066  1340
Yes           906  1413
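
The smoking proportions within each gender can also be read from a column-normalized version of this table. The sketch below rebuilds the counts by hand (the `counts` frame is constructed for illustration, not loaded from the data) and divides each column by its total. With the full data, `pd.crosstab(dx.SMQ020x, dx.RIAGENDRx, normalize="columns")` should give the same proportions.

```python
import pandas as pd

# Counts copied from the cross-tabulation above
counts = pd.DataFrame(
    {"Female": [2066, 906], "Male": [1340, 1413]},
    index=pd.Index(["No", "Yes"], name="SMQ020x"),
)

# Divide each column by its total to get within-gender proportions
props = counts / counts.sum(axis=0)
print(props.round(4))
```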


In [52]:
# One approach to computing the best estimate of each proportion:
# p_1 is the proportion of males who smoke among all males
p_1 = ctab.loc['Yes', 'Male'] / ctab.loc[:, 'Male'].sum()
# p_2 is the proportion of females who smoke among all females
p_2 = ctab.loc['Yes', 'Female'] / ctab.loc[:, 'Female'].sum()
print(p_1, p_2)

# Another approach that gives the same result.
# Dropping rows with a missing SMQ020x first is crucial: otherwise
# np.mean(x == 'Yes') would count the NaN rows in the denominator.
# A separate frame is used so that da itself is left unchanged.
dz = da[~pd.isnull(da["SMQ020x"])]
dz = dz.groupby("RIAGENDRx").agg({"SMQ020x": [lambda x: np.mean(x == 'Yes'), 'count']})
dz.columns = dz.columns.droplevel(level=0)  # drop the outer SMQ020x level of the column MultiIndex
dz = dz.rename(columns={
    "<lambda>": "proportions"
})
print(dz)

0.5132582637123139 0.30484522207267833
           proportions  count
RIAGENDRx                    
Female        0.304845   2972
Male          0.513258   2753


In [36]:
np.sqrt(p_1*(1-p_1)/2753)

0.009526078653689868

### Constructing Confidence Intervals

Now that we have the sample proportions of male and female smokers, we can begin to calculate confidence intervals. From lecture, we know that the equation is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed proportion or mean** from the sample and the *Margin of Error* is the **t-multiplier** times the **standard error**.

The equation to create a 95% confidence interval can also be written as:

$$Population\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier \times Standard\ Error)$$
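
The worked examples later in this tutorial use 1.96 as the multiplier. As a rough check, that value is the 97.5th percentile of the standard normal distribution, which the t-multiplier approaches when the samples are large. A minimal sketch using only the Python standard library (the variable names are chosen here for illustration):

```python
from statistics import NormalDist

confidence = 0.95
# Two-sided interval: half of the remaining 5% goes in each tail
upper_tail = 1 - (1 - confidence) / 2

multiplier = NormalDist().inv_cdf(upper_tail)
print(round(multiplier, 2))
```

For small samples the t-multiplier is larger than this normal value, so 1.96 should be read as a large-sample approximation.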

The Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$
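
The two formulas above translate directly into small helper functions. This is a sketch only: `se_proportion` and `se_mean` are names introduced here for illustration, and the inputs to `se_mean` are made-up values.

```python
import math

def se_proportion(p, n):
    """Standard error of a sample proportion p based on n observations."""
    return math.sqrt(p * (1 - p) / n)

def se_mean(sd, n):
    """Standard error of a sample mean, given the standard deviation sd and n observations."""
    return sd / math.sqrt(n)

# Male smoking proportion and sample size from the table earlier in this tutorial
print(se_proportion(0.513258, 2753))

# Hypothetical values for illustration: standard deviation 10, 400 observations
print(se_mean(10.0, 400))
```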

Lastly, the standard error for the difference of two population proportions or two population means combines the standard errors of the two groups, $SE_1$ and $SE_2$:

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{SE_{1}^2 + SE_{2}^2}$$
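
The squared standard errors add because the variance of a difference between two independent estimates is the sum of their variances. Independence of the two samples is an assumption made here. Writing $\hat{\theta}_1$ and $\hat{\theta}_2$ for the two sample estimates (notation introduced for this sketch) and $SE_1$, $SE_2$ for their standard errors:

$$Var(\hat{\theta}_1 - \hat{\theta}_2) = Var(\hat{\theta}_1) + Var(\hat{\theta}_2) = SE_1^2 + SE_2^2$$

Taking the square root of both sides gives the standard error of the difference shown above.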

#### Difference of Two Population Proportions

In [None]:
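# Standard error of the female smoking proportion
p = .304845
n = 2972
se_female = np.sqrt(p * (1 - p)/n)
se_female

In [None]:
# Standard error of the male smoking proportion
p = .513258
n = 2753
se_male = np.sqrt(p * (1 - p)/ n)
se_male

In [None]:
# Standard error of the difference between the two proportions
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff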

In [None]:
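# Difference in smoking proportions, female minus male
d = .304845 - .513258
# Lower and upper 95% confidence bounds, using the multiplier 1.96
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)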
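
The same steps can be collected into one function that starts from the raw counts, which avoids retyping rounded proportions. This is a sketch under the assumptions used above (independent samples, multiplier 1.96); `diff_proportions_ci` is a name introduced here, not part of any library.

```python
import math

def diff_proportions_ci(yes_1, n_1, yes_2, n_2, multiplier=1.96):
    """Confidence interval for p_1 - p_2 from two independent samples."""
    p_1, p_2 = yes_1 / n_1, yes_2 / n_2
    se_1 = math.sqrt(p_1 * (1 - p_1) / n_1)
    se_2 = math.sqrt(p_2 * (1 - p_2) / n_2)
    se_diff = math.sqrt(se_1**2 + se_2**2)
    d = p_1 - p_2
    return d - multiplier * se_diff, d + multiplier * se_diff

# Female minus male, using the smoker counts and group sizes from the cross-tabulation
lcb, ucb = diff_proportions_ci(906, 2972, 1413, 2753)
print(lcb, ucb)

# Check whether zero falls inside the interval
print(lcb < 0 < ucb)
```

When zero lies outside the interval, as it does here, the data are not consistent with equal smoking proportions in the two populations at the 95% confidence level.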

#### Difference of Two Population Means

In [None]:
da["BMXBMI"].head()

In [None]:
# Summary statistics by gender; "size" counts every row, including rows with a missing BMI
da.groupby("RIAGENDRx").agg({"BMXBMI": ["mean", "std", "size"]})

In [None]:
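# Standard error of the mean BMI for each gender: standard deviation / sqrt(group size)
sem_female = 7.753319 / np.sqrt(2976)
sem_male = 6.252568 / np.sqrt(2759)
(sem_female, sem_male)

In [None]:
# Standard error of the difference between the two means
sem_diff = np.sqrt(sem_female**2 + sem_male**2)
sem_diff

In [None]:
# Difference in mean BMI, female minus male
d = 29.939946 - 28.778072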

In [None]:
lcb = d - 1.96 * sem_diff
ucb = d + 1.96 * sem_diff
(lcb, ucb)
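
As with the proportions, the calculation for means can be wrapped in a single function that takes the summary statistics for each group. This is a sketch under the same assumptions (independent samples, multiplier 1.96); `diff_means_ci` is a name introduced here for illustration, and the inputs are the values typed into the cells above.

```python
import math

def diff_means_ci(mean_1, sd_1, n_1, mean_2, sd_2, n_2, multiplier=1.96):
    """Confidence interval for mean_1 - mean_2 from two independent samples."""
    sem_1 = sd_1 / math.sqrt(n_1)
    sem_2 = sd_2 / math.sqrt(n_2)
    sem_diff = math.sqrt(sem_1**2 + sem_2**2)
    d = mean_1 - mean_2
    return d - multiplier * sem_diff, d + multiplier * sem_diff

# Female group first, then male: mean BMI, standard deviation, group size
lcb, ucb = diff_means_ci(29.939946, 7.753319, 2976, 28.778072, 6.252568, 2759)
print(lcb, ucb)
```

Because the group sizes used here count every row, including any with a missing BMI, the standard errors may be slightly understated. Using the number of non-missing BMI values would be the more careful choice.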