
## Statistical Inference with Confidence Intervals

In this notebook we will explore the concept of confidence intervals, how to calculate them, interpret them, and what confidence really means.  

More specifically, we're going to review how to calculate confidence intervals of population proportions and means.

### Why Confidence Intervals?

A confidence interval is a range around a sample estimate, constructed so that, at a stated level of confidence, it captures the population parameter. For example, we previously estimated, with 95% confidence, that the population proportion of parents with a toddler who use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.

This is *__different__* from having a 95% probability that the true population proportion is within our confidence interval. The true proportion is a fixed value, so any one computed interval either contains it or does not.

Essentially, if we were to repeat the sampling process many times, about 95% of our calculated confidence intervals would contain the true proportion.
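
The repeated-sampling interpretation can be checked with a small simulation. The sketch below assumes a hypothetical "true" proportion of 0.85 and a sample size of 659 (chosen only for illustration), draws many simulated surveys, builds a 95% interval from each one, and counts how often the interval contains the true value.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_p = 0.85       # assumed "true" population proportion (hypothetical)
n = 659             # sample size of each simulated survey
n_repeats = 10_000  # number of simulated surveys
z = 1.96            # multiplier for a 95% interval

# Number of "successes" observed in each simulated survey
counts = rng.binomial(n, true_p, size=n_repeats)
p_hat = counts / n

# Build a 95% interval around each sample proportion
se = np.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - z * se
upper = p_hat + z * se

# Fraction of intervals that contain the true proportion
coverage = np.mean((lower <= true_p) & (true_p <= upper))
print(f"Coverage: {coverage:.3f}")
```

The printed coverage should land close to 0.95. Each individual interval either contains 0.85 or it does not; the 95% describes the long-run behavior of the procedure.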

### How are Confidence Intervals Calculated?

Our equation for calculating confidence intervals is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

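Where the *Best Estimate* is the **observed sample proportion or mean** (our estimate of the population value) and the *Margin of Error* is the **t-multiplier times the standard error**.

The t-multiplier is calculated based on the degrees of freedom (the sample size minus one) and the desired confidence level.  As a rule of thumb, for samples with more than 30 observations and a confidence level of 95%, the t-multiplier is close to 1.96, the corresponding value from the standard normal distribution. Smaller samples need a larger multiplier.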
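
As a rough illustration of where these multipliers come from, the sketch below looks them up with `scipy.stats` (an assumption here, since the notebook itself only imports NumPy and statsmodels). It shows the t-multiplier shrinking toward the normal value as the sample size grows.

```python
from scipy import stats

confidence = 0.95
alpha = 1 - confidence

# Two-sided interval: alpha / 2 of the probability sits in each tail
z_star = stats.norm.ppf(1 - alpha / 2)

for n in (10, 25, 100, 1000):
    # Degrees of freedom for a one-sample mean are n - 1
    t_star = stats.t.ppf(1 - alpha / 2, df=n - 1)
    print(f"n = {n:>4}: t* = {t_star:.3f}")

print(f"normal:   z* = {z_star:.3f}")
```

With 24 degrees of freedom (a sample of 25) the multiplier is about 2.064, noticeably larger than 1.96.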

The equation to create a 95% confidence interval can also be shown as:

$$Population\ Proportion\ or\ Mean\ \pm (t-multiplier *\ Standard\ Error)$$

Lastly, the Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$


In [0]:
import numpy as np

In [0]:
tstar = 1.96
p = .85
n = 659

se = np.sqrt((p * (1 - p))/n)
se

In [0]:
lcb = p - tstar * se
ucb = p + tstar * se
(lcb, ucb)

In [0]:
import statsmodels.api as sm

In [0]:
sm.stats.proportion_confint(n * p, n)

Now, let's load the Cartwheel dataset and calculate a confidence interval for our mean cartwheel distance:

In [0]:
! git clone https://github.com/arbi11/YCBS-272.git

In [0]:
! ls

In [0]:
%cd YCBS-272/

In [0]:
! ls

In [0]:
import pandas as pd

df = pd.read_csv("Cartwheeldata.csv")

In [0]:
df.head()

In [0]:
mean = df["CWDistance"].mean()
sd = df["CWDistance"].std()
n = len(df)

n

In [0]:
# 2.064 is the 95% t-multiplier for 24 degrees of freedom, i.e. n = 25
tstar = 2.064

se = sd/np.sqrt(n)

se

In [0]:
lcb = mean - tstar * se
ucb = mean + tstar * se
(lcb, ucb)

In [0]:
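# Note: zconfint_mean() uses the normal multiplier (about 1.96), not the t-multiplier of 2.064 used above, so this interval is slightly narrower
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()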

# Confidence Intervals


This tutorial is going to demonstrate how to load data, clean/manipulate a dataset, and construct a confidence interval for the difference between two population proportions and means.

We will use the 2015-2016 wave of the NHANES data for our analysis.

*Note: We have provided a notebook that includes more analysis, with examples of confidence intervals for a single population proportion and mean, in addition to the analysis shown in this tutorial.  I highly recommend checking it out!*

For our population proportions, we will analyze the difference between the proportions of females and males who smoke.  The column that specifies smoker and non-smoker is "SMQ020" in our dataset.

For our population means, we will analyze the difference in mean body mass index between our female and male populations.  The column that includes the body mass index value is "BMXBMI".

Additionally, gender is specified in the column "RIAGENDR".

In [0]:
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [0]:
url = "nhanes_2015_2016.csv"
da = pd.read_csv(url)

### Investigating and Cleaning Data

In [0]:
da["SMQ020x"] = da.SMQ020.replace({1: "Yes", 2: "No", 7: np.nan, 9: np.nan})
da["SMQ020x"]

In [0]:
da["RIAGENDRx"] = da.RIAGENDR.replace({1: "Male", 2: "Female"})
da["RIAGENDRx"]

In [0]:
dx = da[["SMQ020x", "RIAGENDRx"]].dropna()
pd.crosstab(dx.SMQ020x, dx.RIAGENDRx)

In [0]:
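# Recode smoking status as 1/0 so that the group mean is the proportion of smokers
dx = dx.assign(SMQ020x=dx.SMQ020x.map({"Yes": 1, "No": 0}))
dz = dx.groupby("RIAGENDRx").agg({"SMQ020x": ["mean", "size"]})
dz.columns = ["Proportion", "Total n"]
dz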

### Constructing Confidence Intervals

Now that we have the sample proportions of male and female smokers, we can begin to calculate confidence intervals. We know that the equation is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed proportion or mean** from the sample and the *Margin of Error* is the **t-multiplier times the standard error**.

The equation to create a 95% confidence interval can also be shown as:

$$Population\ Proportion\ or\ Mean\ \pm (t-multiplier *\ Standard\ Error)$$

The Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Population\ Proportion * (1 - Population\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

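Lastly, the standard error for the difference of two population proportions or means combines the standard errors of the two groups:

$$Standard\ Error\ for\ Difference\ of\ Two\ Population\ Proportions\ Or\ Means = \sqrt{SE_{1}^2 + SE_{2}^2}$$

Here $SE_{1}$ and $SE_{2}$ are the standard errors of the first and second group, each calculated with the proportion or mean formula above.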

#### Difference of Two Population Proportions

In [0]:
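# Proportion of smokers and sample size for females, taken from the table above
p = .304845
n = 2972
se_female = np.sqrt(p * (1 - p)/n)
se_female

In [0]:
# Proportion of smokers and sample size for males, taken from the table above
p = .513258
n = 2753
se_male = np.sqrt(p * (1 - p)/ n)
se_male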

In [0]:
se_diff = np.sqrt(se_female**2 + se_male**2)
se_diff

In [0]:
# Difference in proportions (female minus male) and its 95% confidence bounds
d = .304845 - .513258
lcb = d - 1.96 * se_diff
ucb = d + 1.96 * se_diff
(lcb, ucb)
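
The cells above can be folded into a single reusable function. This is a minimal sketch; `diff_proportions_ci` is a hypothetical helper name, not something provided by NumPy or statsmodels, and it takes the same proportions and sample sizes used above.

```python
import numpy as np

def diff_proportions_ci(p1, n1, p2, n2, multiplier=1.96):
    """Confidence interval for p1 - p2 using unpooled standard errors.

    Hypothetical helper; multiplier=1.96 gives an approximate 95% interval.
    """
    se1 = np.sqrt(p1 * (1 - p1) / n1)
    se2 = np.sqrt(p2 * (1 - p2) / n2)
    se_diff = np.sqrt(se1**2 + se2**2)  # combine the two standard errors
    d = p1 - p2
    return d - multiplier * se_diff, d + multiplier * se_diff

# Female proportion and n first, then male, as in the cells above
lcb, ucb = diff_proportions_ci(0.304845, 2972, 0.513258, 2753)
print(lcb, ucb)
```

With these inputs both bounds are negative, so the interval does not contain zero; the data are consistent with a lower smoking proportion among females than among males.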

#### Difference of Two Population Means

In [0]:
da["BMXBMI"].head()

In [0]:
# "size" counts every row in the group, including rows where BMXBMI is missing
da.groupby("RIAGENDRx").agg({"BMXBMI": ["mean", "std", "size"]})

In [0]:
# Standard error of the mean = standard deviation / sqrt(n), using the values from the table above
sem_female = 7.753319 / np.sqrt(2976)
sem_male = 6.252568 / np.sqrt(2759)
(sem_female, sem_male)

In [0]:
sem_diff = np.sqrt(sem_female**2 + sem_male**2)
sem_diff

In [0]:
# Difference in mean BMI (female minus male), using the means from the table above
d = 29.939946 - 28.778072

In [0]:
lcb = d - 1.96 * sem_diff
ucb = d + 1.96 * sem_diff
(lcb, ucb)
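
The same calculation can be done directly from the raw values instead of from numbers copied out of a table. The sketch below uses a hypothetical helper, `diff_means_ci`, and a small made-up dataset. It drops missing values before counting observations, which matters whenever a column contains NaNs, because a row count that includes them overstates n.

```python
import numpy as np

def diff_means_ci(x1, x2, multiplier=1.96):
    """Confidence interval for mean(x1) - mean(x2), unpooled standard errors.

    Hypothetical helper; multiplier=1.96 gives an approximate 95% interval
    and is only appropriate for large samples.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Drop missing values so n counts only observed measurements
    x1 = x1[~np.isnan(x1)]
    x2 = x2[~np.isnan(x2)]
    se1 = x1.std(ddof=1) / np.sqrt(len(x1))
    se2 = x2.std(ddof=1) / np.sqrt(len(x2))
    se_diff = np.sqrt(se1**2 + se2**2)
    d = x1.mean() - x2.mean()
    return d - multiplier * se_diff, d + multiplier * se_diff

# Made-up values with one missing entry per group (not NHANES data)
a = [29.1, 31.4, np.nan, 27.8, 30.2, 33.0]
b = [28.0, 26.5, 29.3, np.nan, 27.1, 28.8]
lcb, ucb = diff_means_ci(a, b)
print(lcb, ucb)
```

In this notebook the inputs could be, for example, `da.loc[da.RIAGENDRx == "Female", "BMXBMI"]` and the corresponding male selection.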