**Statistical Inference with Confidence Intervals**

A confidence interval is a range of values, calculated from sample data, that is expected to contain an unknown population parameter at a stated level of confidence. For example, in the lecture we estimated, with 95% confidence, that the population proportion of parents who use a car seat for all travel with their toddler was somewhere between 82.2% and 87.7%.

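This is different from saying there is a 95% probability that the true population proportion lies within our particular interval. The true proportion is a fixed number; it is the interval that changes from sample to sample.

Essentially, if we were to repeat this process many times, about 95% of the calculated confidence intervals would contain the true proportion.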
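
The repeated-sampling interpretation can be checked with a small simulation. The sketch below is only an illustration: it assumes a true proportion of 0.85 and a sample size of 659 (borrowed from the car seat example), and the function names `proportion_ci` and `coverage` are hypothetical helpers, not part of the lecture code.

```python
import math
import random


def proportion_ci(successes, n, z=1.96):
    """Return (lower, upper) bounds of a z-based interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def coverage(true_p=0.85, n=659, repeats=2000, seed=0):
    """Fraction of simulated intervals that contain the assumed true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Draw one sample of size n and count the "successes"
        successes = sum(rng.random() < true_p for _ in range(n))
        lower, upper = proportion_ci(successes, n)
        hits += lower <= true_p <= upper
    return hits / repeats


coverage_rate = coverage()
print(f"Share of intervals containing the true proportion: {coverage_rate:.3f}")
```

The printed share should land close to 0.95, though it will vary a little with the seed and the number of repeats.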

Probability measures how likely an outcome is to occur, on a scale from 0 to 1. So if I am flipping a fair coin, the probability of flipping a head is 1/2.

The idea of a confidence level can be illustrated the same way. Say I am flipping 10 coins and I wonder how many of them are likely to come up heads, anywhere from 0 to 10.

I am 100% confident of an interval of 0 to 10 heads, but that is such a broad interval that it is not much help. Instead, let us look at 5 ± 3, from 2 to 8. This is still a fairly broad interval, and my confidence level is almost 98%, still very high. Now let's look at 5 ± 1, from 4 to 6. My confidence level drops to approximately 65.6%. A confidence level describes how sure we are that a result will fall within a certain range of options, like 4 to 6 heads or 2 to 8 heads.
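
The confidence levels quoted for the coin example can be verified by counting outcomes. This is a minimal sketch assuming a fair coin and independent flips; `prob_heads_between` is a hypothetical helper name.

```python
from math import comb


def prob_heads_between(low, high, flips=10):
    """P(low <= number of heads <= high) for `flips` tosses of a fair coin."""
    favorable = sum(comb(flips, k) for k in range(low, high + 1))
    return favorable / 2 ** flips


for low, high in [(0, 10), (2, 8), (4, 6)]:
    print(f"{low} to {high} heads: {prob_heads_between(low, high):.2%}")
```

The three ranges should come out at 100%, a little under 98%, and a little under 66%, matching the figures in the text.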



**How are Confidence Intervals Calculated?**

Our equation for calculating confidence intervals is as follows:

$$Best\ Estimate \pm Margin\ of\ Error$$

Where the *Best Estimate* is the **observed sample proportion or mean** and the *Margin of Error* is the **t-multiplier times the standard error**.

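The t-multiplier is determined by the degrees of freedom (the sample size minus one) and the desired confidence level. For samples with more than 30 observations and a confidence level of 95%, the t-multiplier is close to 1.96, the corresponding value from the normal distribution, and it approaches that value as the sample grows.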
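
To see how the multiplier depends on the sample size, it can be looked up from the t distribution. The sketch below assumes SciPy is available (it is not used elsewhere in this notebook), and `t_multiplier` is a hypothetical helper name.

```python
from scipy import stats


def t_multiplier(n_obs, confidence=0.95):
    """Two-sided t multiplier for n_obs observations (df = n_obs - 1)."""
    upper_tail = 1 - (1 - confidence) / 2
    return stats.t.ppf(upper_tail, df=n_obs - 1)


# Normal-distribution multiplier, the large-sample limit of the t multiplier
z_multiplier = stats.norm.ppf(0.975)

for n_obs in (10, 25, 31, 100, 1000):
    print(f"n = {n_obs:>4}: t multiplier = {t_multiplier(n_obs):.3f}")
print(f"normal (z) multiplier = {z_multiplier:.3f}")
```

For small samples the multiplier is noticeably larger than 1.96, which widens the interval to reflect the extra uncertainty in the estimated standard deviation.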

The equation to create a 95% confidence interval can also be shown as:

$$Sample\ Proportion\ or\ Mean\ \pm (t\text{-}multiplier \times Standard\ Error)$$

Lastly, the Standard Error is calculated differently for a proportion and for a mean:

$$Standard\ Error \ for\ Proportion = \sqrt{\frac{Sample\ Proportion \times (1 - Sample\ Proportion)}{Number\ Of\ Observations}}$$

$$Standard\ Error \ for\ Mean = \frac{Standard\ Deviation}{\sqrt{Number\ Of\ Observations}}$$

Let's replicate the car seat example from lecture:

In [0]:
import numpy as np


In [0]:
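zstar = 1.96  # multiplier for a 95% confidence level
p = 0.85      # sample proportion (best estimate)
n = 659       # sample size

# standard error of the sample proportion
se = np.sqrt(p * (1 - p) / n)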

In [4]:
se

0.01390952774409444

In [5]:
# lower and upper confidence bounds: best estimate -/+ margin of error
lcb = p - zstar * se
ucb = p + zstar * se
(lcb, ucb)

(0.8227373256215749, 0.8772626743784251)
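
The manual steps can also be wrapped in small reusable helpers that mirror the formulas from the previous section. This is a sketch using only the standard library; the function names `se_proportion`, `se_mean`, and `confidence_interval` are hypothetical and do not come from the lecture.

```python
import math


def se_proportion(p_hat, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def se_mean(sample_sd, n):
    """Standard error of a sample mean."""
    return sample_sd / math.sqrt(n)


def confidence_interval(best_estimate, multiplier, standard_error):
    """Best estimate plus or minus (multiplier * standard error)."""
    margin_of_error = multiplier * standard_error
    return best_estimate - margin_of_error, best_estimate + margin_of_error


# Car seat example: sample proportion 0.85, n = 659, multiplier 1.96
car_seat_ci = confidence_interval(0.85, 1.96, se_proportion(0.85, 659))
print(car_seat_ci)
```

Keeping the standard error separate from the interval calculation means the same `confidence_interval` function works for a mean by passing in `se_mean(...)` and a t-multiplier instead.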

**Using the statsmodels module**

In [6]:
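import statsmodels.api as sm
# count of successes = n * p; defaults are alpha=0.05 and method='normal'.
# statsmodels uses the unrounded z-value (about 1.959964), hence the tiny difference.
sm.stats.proportion_confint(n * p, n)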

(0.8227378265796143, 0.8772621734203857)

In [8]:
import pandas as pd
# Cartwheel data, used below to build a confidence interval for a mean
df = pd.read_csv('/content/cartwheeldata.csv', index_col=0)
df.head()

ID,Age,Gender,GenderGroup,Glasses,GlassesGroup,Height,Wingspan,CWDistance,Complete,CompleteGroup,Score
1,56,F,1,Y,1,62.0,61.0,79,Y,1,7
2,26,F,1,Y,1,62.0,60.0,70,Y,1,8
3,33,F,1,Y,1,66.0,64.0,85,Y,1,7
4,39,F,1,N,0,64.0,63.0,87,Y,1,10
5,27,M,2,N,0,73.0,75.0,72,N,0,4


In [29]:
standard_dev=df['CWDistance'].std()
standard_dev

15.058552387264855

In [30]:
df['CWDistance'].describe()

count     25.000000
mean      82.480000
std       15.058552
min       63.000000
25%       70.000000
50%       81.000000
75%       92.000000
max      115.000000
Name: CWDistance, dtype: float64

In [34]:
num=df['CWDistance'].count()
se_mean = standard_dev/np.sqrt(num)
se_mean

3.0117104774529713

In [0]:
tstar = 2.064  # t-multiplier for 95% confidence with n - 1 = 24 degrees of freedom

In [36]:
lcb_mean = df['CWDistance'].mean() - tstar*se_mean
ucb_mean = df['CWDistance'].mean() + tstar*se_mean
(lcb_mean, ucb_mean)

(76.26382957453707, 88.69617042546294)

In [37]:
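# zconfint_mean() uses the normal multiplier (about 1.96), not tstar, so this interval is slightly narrower; tconfint_mean() is the t-based counterpart
sm.stats.DescrStatsW(df["CWDistance"]).zconfint_mean()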

(76.57715593233024, 88.38284406766977)
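
The two cartwheel intervals differ only in the multiplier. As a rough check, both can be rebuilt from the summary statistics printed earlier (mean 82.48, standard deviation 15.058552387264855, 25 observations). The sketch uses only the standard library; the variable names are illustrative, and the t-multiplier 2.064 is taken from the cell above rather than computed.

```python
import math
from statistics import NormalDist

# Summary statistics of CWDistance, copied from the describe() output
mean = 82.48
sd = 15.058552387264855
n = 25

se = sd / math.sqrt(n)                # standard error of the mean
z_star = NormalDist().inv_cdf(0.975)  # normal multiplier, about 1.96
t_star = 2.064                        # t multiplier for df = 24, from the notebook

z_interval = (mean - z_star * se, mean + z_star * se)
t_interval = (mean - t_star * se, mean + t_star * se)

print("z-based interval:", z_interval)
print("t-based interval:", t_interval)
print("extra width from using t:", 2 * (t_star - z_star) * se)
```

With only 25 observations the t-based interval is the more cautious choice, since it accounts for the standard deviation being estimated from a small sample.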