## Statistical Inference with Confidence Intervals

Throughout week 2, we have explored the concept of confidence intervals: how to calculate them, how to interpret them, and what confidence really means.

In this tutorial, we're going to review how to calculate confidence intervals of population proportions and means.

To begin, let's go over some of the material from this week and why confidence intervals are useful tools when deriving insights from data.

### Why Confidence Intervals?

A confidence interval is a calculated range around a sample statistic that is expected, with a stated level of confidence, to contain the population parameter being estimated. For example, in the lecture, we estimated, with 95% confidence, that the population proportion of parents with a toddler who use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.

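This is *__different__* from saying there is a 95% probability that the true population proportion lies within our particular confidence interval. The true proportion is a fixed value, so any single interval either contains it or does not.

Essentially, if we were to repeat the sampling process many times, about 95% of the confidence intervals we calculate would contain the true proportion.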
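
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: it assumes a hypothetical "true" proportion of 0.85 (in practice this value is unknown), reuses the lecture's sample size of 659, and the variable names and the number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

true_p = 0.85       # assumed "true" population proportion (hypothetical)
n = 659             # sample size, as in the car seat example
n_repeats = 10_000  # number of simulated samples (arbitrary)
multiplier = 1.96   # multiplier for a 95% confidence level

# Simulate n_repeats samples and compute each sample proportion.
p_hat = rng.binomial(n, true_p, size=n_repeats) / n

# Build one confidence interval per simulated sample.
se = np.sqrt(p_hat * (1 - p_hat) / n)
lcb = p_hat - multiplier * se
ucb = p_hat + multiplier * se

# Fraction of intervals that contain the true proportion.
coverage = np.mean((lcb <= true_p) & (true_p <= ucb))
print(coverage)
```

The printed fraction should land close to 0.95, although it varies a little with the random seed.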

### How are Confidence Intervals Calculated?

Our equation for calculating confidence intervals is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed sample proportion or mean** and the *Margin of Error* is the **t-multiplier times the standard error**.

The t-multiplier is calculated based on the degrees of freedom and the desired confidence level. For large samples (more than 30 observations is the usual rule of thumb) and a confidence level of 95%, the t-multiplier is close to 1.96, the corresponding value from the standard normal distribution.
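
As a rough sketch of how the multiplier depends on the degrees of freedom, it can be computed instead of looked up in a table. This assumes SciPy is installed; `t_multiplier` is a hypothetical helper name, not something from the lecture.

```python
from scipy import stats

def t_multiplier(df, confidence=0.95):
    """Two-sided t-multiplier for the given degrees of freedom."""
    # A 95% two-sided interval leaves 2.5% in each tail,
    # so we need the 97.5th percentile of the t distribution.
    return stats.t.ppf(1 - (1 - confidence) / 2, df)

# Normal-distribution multiplier that the t-multiplier approaches.
z_star = stats.norm.ppf(0.975)
print("normal:", round(z_star, 3))

for df in (5, 24, 30, 100, 1000):
    print(df, round(t_multiplier(df), 3))
```

The multiplier shrinks toward the normal value as the degrees of freedom grow, which is why 1.96 is a reasonable choice only for larger samples.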

The equation to create a 95% confidence interval can also be shown as:

$$Sample\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier *\ Standard\ Error)$$

Lastly, the Standard Error is calculated differently for a population proportion and a mean:

$$Standard\ Error \ for\ Population\ Proportion = \sqrt{\frac{Sample\ Proportion * (1 - Sample\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$
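
The two standard error formulas and the interval equation can be wrapped in small helpers. This is a minimal sketch: the function names are hypothetical, and the numbers in the usage lines are made up purely for illustration.

```python
import numpy as np

def se_proportion(p_hat, n):
    """Standard error of a sample proportion."""
    return np.sqrt(p_hat * (1 - p_hat) / n)

def se_mean(sd, n):
    """Standard error of a sample mean, given the sample standard deviation."""
    return sd / np.sqrt(n)

def confidence_interval(estimate, se, multiplier=1.96):
    """Best estimate plus or minus (multiplier * standard error)."""
    margin = multiplier * se
    return estimate - margin, estimate + margin

# Illustrative (made-up) inputs: a sample proportion of 0.6 from 400
# observations, and a sample mean of 50 with standard deviation 10
# from 100 observations.
prop_ci = confidence_interval(0.6, se_proportion(0.6, 400))
mean_ci = confidence_interval(50, se_mean(10, 100))
print(prop_ci)
print(mean_ci)
```

The default multiplier of 1.96 suits large samples at 95% confidence; for small samples, pass the t-multiplier for the appropriate degrees of freedom instead.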

Let's replicate the car seat example from lecture:

In [1]:
import numpy as np

In [2]:
tstar = 1.96
p = .85
n = 659

se = np.sqrt((p * (1 - p))/n)
se

0.01390952774409444

In [3]:
lcb = p - tstar * se
ucb = p + tstar * se
(lcb, ucb)

(0.8227373256215749, 0.8772626743784251)

In [4]:
import statsmodels.api as sm

In [5]:
sm.stats.proportion_confint(n * p, n)

(0.8227378265796143, 0.8772621734203857)

Now, let's take our Cartwheel dataset introduced in lecture and calculate a confidence interval for our mean cartwheel distance:

In [6]:
import pandas as pd

df = pd.read_csv("Cartwheeldata.csv")

In [7]:
df.head()

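| | ID | Age | Gender | GenderGroup | Glasses | GlassesGroup | Height | Wingspan | CWDistance | Complete | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 56 | F | 1 | Y | 1 | 62.0 | 61.0 | 79 | Y | 1 | 7 |
| 1 | 2 | 26 | F | 1 | Y | 1 | 62.0 | 60.0 | 70 | Y | 1 | 8 |
| 2 | 3 | 33 | F | 1 | Y | 1 | 66.0 | 64.0 | 85 | Y | 1 | 7 |
| 3 | 4 | 39 | F | 1 | N | 0 | 64.0 | 63.0 | 87 | Y | 1 | 10 |
| 4 | 5 | 27 | M | 2 | N | 0 | 73.0 | 75.0 | 72 | N | 0 | 4 |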


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 12 columns):
ID               25 non-null int64
Age              25 non-null int64
Gender           25 non-null object
GenderGroup      25 non-null int64
Glasses          25 non-null object
GlassesGroup     25 non-null int64
Height           25 non-null float64
Wingspan         25 non-null float64
CWDistance       25 non-null int64
Complete         25 non-null object
CompleteGroup    25 non-null int64
Score            25 non-null int64
dtypes: float64(2), int64(7), object(3)
memory usage: 2.4+ KB


In [9]:
df.describe()

| | ID | Age | GenderGroup | GlassesGroup | Height | Wingspan | CWDistance | CompleteGroup | Score |
|---|---|---|---|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 13.0 | 28.24 | 1.52 | 0.56 | 67.65 | 66.26 | 82.48 | 0.76 | 6.4 |
| std | 7.359801 | 6.989754 | 0.509902 | 0.506623 | 4.431187 | 5.492647 | 15.058552 | 0.43589 | 2.533114 |
| min | 1.0 | 22.0 | 1.0 | 0.0 | 61.5 | 57.5 | 63.0 | 0.0 | 2.0 |
| 25% | 7.0 | 24.0 | 1.0 | 0.0 | 64.0 | 62.0 | 70.0 | 1.0 | 4.0 |
| 50% | 13.0 | 26.0 | 2.0 | 1.0 | 68.0 | 66.0 | 81.0 | 1.0 | 6.0 |
| 75% | 19.0 | 29.0 | 2.0 | 1.0 | 71.0 | 71.0 | 92.0 | 1.0 | 8.0 |
| max | 25.0 | 56.0 | 2.0 | 1.0 | 75.0 | 76.0 | 115.0 | 1.0 | 10.0 |


In [10]:
mean = df["CWDistance"].mean()
mean

82.48

In [11]:
sd = df["CWDistance"].std()
sd

15.058552387264855

In [12]:
n = len(df)
n

25

In [13]:
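# In mathematics, degrees of freedom (df) is the number of variables that are free to vary. For example, with three variables x, y, z constrained by x + y + z = 18, there are 2 degrees of freedom.
# In statistics, degrees of freedom is the number of values that are unrestricted when a statistic is computed. Usually df = n - k, where n is the sample size and k is the number of constraints or restricted variables, or the number of other independent statistics used in computing the statistic. Degrees of freedom are commonly used with sampling distributions.
# Here n = 25 and k = 1, so df = 25 - 1 = 24. Look up df = 24 in the t-table.

* From the t-table (t-table.pdf), at a 95% confidence level with df = 24, the t-multiplier is 2.064.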
![T-table](pic-t-table-snapshot.png)

In [18]:
tstar = 2.064

In [19]:
se = sd/np.sqrt(n)
se

3.0117104774529713

In [20]:
lcb = mean - tstar * se
ucb = mean + tstar * se
(lcb, ucb)

(76.26382957453707, 88.69617042546294)

In [21]:
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()

(76.57715593233026, 88.38284406766975)
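
The interval from `zconfint_mean()` is slightly narrower than the one computed by hand because it uses the normal multiplier (about 1.96) rather than the t-multiplier of 2.064 for df = 24. The sketch below reproduces both intervals from the summary statistics printed above, assuming SciPy is available. statsmodels' `DescrStatsW` also provides `tconfint_mean()`, which should give the t-based interval directly.

```python
import numpy as np
from scipy import stats

# Summary statistics for CWDistance, copied from the outputs above.
mean = 82.48
sd = 15.058552387264855
n = 25

se = sd / np.sqrt(n)

# Multipliers for a two-sided 95% interval.
z_star = stats.norm.ppf(0.975)       # normal-based
t_star = stats.t.ppf(0.975, n - 1)   # t-based, df = n - 1 = 24

z_ci = (mean - z_star * se, mean + z_star * se)
t_ci = (mean - t_star * se, mean + t_star * se)

print("z-based:", z_ci)
print("t-based:", t_ci)
```

With only 25 observations, the t-based interval is the more cautious choice, since it accounts for the extra uncertainty in estimating the standard deviation from a small sample.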