# Confidence intervals

## Outline

- Introduction
- Confidence Intervals
- Wrap up


## Introduction 

A **confidence interval** is a range of values, computed from sample statistics, that is likely to contain an unknown population parameter. The **confidence level** describes how reliable the procedure is: if we drew many samples and built a 99% confidence interval from each one, about 99% of those intervals would contain the true population parameter.

**Z-score** - The z-score tells us how many standard deviations a value lies from the mean of a distribution. For example, in investing, a z-score measures how far an observation deviates from its average, and traders can use it to gauge market volatility.

**t-score** - The t-score standardizes a sample statistic in the same way as the z-score, but it uses the sample standard deviation in place of the unknown population standard deviation, so it follows a Student's t distribution. In a t-test, the t-score is the ratio of the difference between groups or samples to the variability within them.

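A **z-test** is appropriate when the population standard deviation is known or, as a rule of thumb, when the sample size is more than **30**. A **t-test** is used when the standard deviation must be estimated from a sample of fewer than **30** observations. Even when the data are non-normal, the z-test remains approximately valid for samples of more than 30, because the central limit theorem makes the sample mean approximately normal (heavily skewed data may need a larger sample).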
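The difference between the two tests shows up in their critical values. The sketch below assumes a two-sided 95% level and a handful of illustrative sample sizes (neither is taken from the text above) and compares the standard normal critical value with Student's t critical values for `n - 1` degrees of freedom.

```python
from scipy import stats

confidence = 0.95                      # assumed two-sided confidence level
upper_tail = 1 - (1 - confidence) / 2  # cumulative probability at the upper cutoff

# Critical value from the standard normal distribution
z_crit = stats.norm.ppf(upper_tail)

# Critical values from Student's t for a few illustrative sample sizes
sample_sizes = (5, 10, 30, 100, 1000)
t_crits = {n: stats.t.ppf(upper_tail, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:4d}: t critical value = {t_crit:.3f}")
```

The t critical value is always larger than the z value and approaches it as `n` grows, which is why the two tests give nearly the same answer for large samples.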


## Confidence Intervals

A confidence interval is an interval that contains the unknown parameter (such as the population mean $\mu$) with a certain degree of confidence.

![](images/conf_interval.png)

Given a standard normal variable $Z$ (the "standard bell curve") with mean 0 and standard deviation 1, the value $z_r$ is the z-score such that $P(-z_r \le Z \le z_r) = r$. That is, there is probability $r$ between the points $-z_r$ and $+z_r$.
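As a quick numerical check of this definition, the sketch below finds $z_r$ for a few illustrative values of $r$ (chosen here, not taken from the text) and confirms that the central probability is $r$. It relies on the symmetry of the normal curve: the upper cutoff sits at cumulative probability $(1 + r)/2$.

```python
from scipy import stats

z_values = {}
central_probs = {}
for r in (0.90, 0.95, 0.99):
    # By symmetry, z_r is the point with cumulative probability (1 + r) / 2
    z_r = stats.norm.ppf((1 + r) / 2)
    # Probability mass between -z_r and +z_r under the standard normal curve
    covered = stats.norm.cdf(z_r) - stats.norm.cdf(-z_r)
    z_values[r] = z_r
    central_probs[r] = covered
    print(f"r = {r:.2f}: z_r = {z_r:.4f}, P(-z_r <= Z <= z_r) = {covered:.4f}")
```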

We would like to say that $\mu$ is about $\bar{x}$ ± some margin of error. We can never be 100% sure that the unknown $\mu$ really lies within our margin of error. But with larger sample sizes, the margin of error shrinks for the same level of confidence, so we can pin $\mu$ down more precisely.
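To see how the sample size affects the margin of error at a fixed confidence level, the sketch below evaluates $z \cdot \sigma / \sqrt{n}$ for several sample sizes. The value $\sigma = 25$ and the sample sizes are illustrative assumptions, and the 95% level is fixed for simplicity.

```python
import numpy as np
from scipy import stats

sigma = 25                   # assumed known population standard deviation
z = stats.norm.ppf(0.975)    # critical value for a two-sided 95% interval

# Margin of error z * sigma / sqrt(n) for a few illustrative sample sizes
sample_sizes = [25, 100, 400, 1600]
margins = {n: z * sigma / np.sqrt(n) for n in sample_sizes}

for n, margin in margins.items():
    print(f"n = {n:5d}: margin of error = {margin:.2f}")
```

Because the margin scales with $1/\sqrt{n}$, quadrupling the sample size halves the margin of error.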

We can define the z-score of an observation $x$ from a distribution with mean $\mu$ and standard deviation $\sigma$ as

$$
z_{score} = \frac{x-\mu}{\sigma}
$$
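The formula translates directly into code. In this minimal sketch the helper name `z_score` and the numbers (an observation of 160 from a distribution with mean 135 and standard deviation 25) are hypothetical, chosen only for illustration.

```python
from scipy import stats

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical observation and distribution parameters
z = z_score(160, mu=135, sigma=25)

# Share of a normal distribution that lies below this z-score
share_below = stats.norm.cdf(z)

print(f"z-score: {z:.2f}, share of values below: {share_below:.4f}")
```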

Under the normality assumption

$$P\Big( \bar{X} - 1.96\cdot\frac{\sigma}{\sqrt{n}}\ \le \mu \le \ \bar{X} + 1.96\cdot\frac{\sigma}{\sqrt{n}} \Big) = 0.95 $$

defines a 95% confidence interval. By the central limit theorem (CLT), the equation above remains approximately true for reasonably large sample sizes $n$, even when the data are not normal.
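The CLT claim can be checked with a small simulation. The sketch below is an illustration under assumed settings (exponential data with mean 1 and standard deviation 1, samples of size 50, 10,000 repetitions, a fixed seed); it counts how often the interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

mu, sigma = 1.0, 1.0             # an exponential(scale=1) variable has mean 1 and sd 1
n, trials = 50, 10_000           # assumed sample size and number of repetitions
z = stats.norm.ppf(0.975)        # about 1.96

# Draw `trials` independent samples of size n from a skewed, non-normal distribution
samples = rng.exponential(scale=1.0, size=(trials, n))
x_bars = samples.mean(axis=1)

# Build the known-sigma interval around each sample mean and check coverage
margin = z * sigma / np.sqrt(n)
covered = (x_bars - margin <= mu) & (mu <= x_bars + margin)
coverage = covered.mean()

print(f"Share of intervals containing mu: {coverage:.3f}")
```

The share should land close to 0.95, even though the underlying data are far from normal.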

Then we can say that a 95% CI for $\mu$ when $\sigma$ is known is:

$$\Big( \bar{X} - 1.96\cdot\frac{\sigma}{\sqrt{n}}\ ,\ \bar{X} + 1.96\cdot\frac{\sigma}{\sqrt{n}} \Big)$$  

More generally, if we define $z_{\alpha/2}$ as the value that cuts off an area of $\alpha/2$ in the upper tail of the standard normal distribution, we can define a $1-\alpha$ confidence interval for the population mean $\mu$ as:

$$\Big( \bar{X} - z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}}\ ,\ \bar{X} + z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}} \Big)$$


To calculate $z_{\alpha/2}$ for a 95% confidence interval we can use SciPy's `stats.norm.interval()` function or its `stats.norm.ppf()` function.

### Include the Python libraries we will use

In [54]:
# import the stats functions from SciPy, and import NumPy
import scipy.stats as stats
import numpy as np

### Using an alpha value (significance level)
In our case, an alpha value of .05 (5%) corresponds to a 95% confidence level: the procedure produces intervals that contain the true parameter 95% of the time and miss it 5% of the time. In hypothesis-testing terms, alpha is the probability of rejecting the null hypothesis when it is actually true.

### Let's take a look at the 95% interval for the normal distribution  

In [55]:
print(stats.norm.interval(.95)) # interval
# We can also get the same critical values from the ppf (percent point function)
print(stats.norm.ppf( 1- ((1 - .95)/2))) # upper Z value
print(stats.norm.ppf( (1 - .95)/2)) # lower Z value

(-1.959963984540054, 1.959963984540054)
1.959963984540054
-1.959963984540054


### We can calculate the confidence interval in the following manner (refer back to our equation):

In [56]:
alpha = .05
interval_end = 1 - (alpha/2) # cumulative probability at the upper cutoff: 0.975 for a 95% interval
z_mult = stats.norm.ppf(interval_end) # using the percent point function (or you can look it up)
# standard deviation
sd = 25
# mean of all values
x_bar = 135
# number of values
n = 56
# let's see the variables
print('alpha:', alpha, 'interval_end:', interval_end, 'z_mult:', z_mult)

alpha: 0.05 interval_end: 0.975 z_mult: 1.959963984540054


In [57]:
# margin of error: z multiplier times the standard error of the mean
margin = z_mult*(sd/np.sqrt(n))
print(f'Conventional Calculation: ({x_bar - margin:.2f}, {x_bar + margin:.2f})')

Conventional Calculation: (128.45, 141.55)


In [58]:
# the first argument is the confidence level (1 - alpha), passed positionally;
# 'loc' is the mean of the distribution, 'scale' is the standard error of the mean
lower, upper = stats.norm.interval(1 - alpha, loc = x_bar, scale = sd/np.sqrt(n))
print(f'Calculation with .interval(): ({lower:.2f}, {upper:.2f})')

Calculation with .interval(): (128.45, 141.55)
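The same calculation can be wrapped in a small helper that works for any confidence level. This is a minimal sketch: the function name `z_interval` and its signature are hypothetical, and it assumes the population standard deviation is known, as in the equation for the $1-\alpha$ interval.

```python
import numpy as np
from scipy import stats

def z_interval(x_bar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sd is known.

    Hypothetical helper: returns (lower, upper) for the given confidence level.
    """
    alpha = 1 - confidence
    z = stats.norm.ppf(1 - alpha / 2)   # cuts off alpha/2 in the upper tail
    margin = z * sigma / np.sqrt(n)     # z multiplier times the standard error
    return x_bar - margin, x_bar + margin

# Same inputs as the worked example: mean 135, sd 25, n = 56
intervals = {level: z_interval(135, 25, 56, confidence=level)
             for level in (0.90, 0.95, 0.99)}

for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: ({lower:.2f}, {upper:.2f})")
```

A higher confidence level gives a wider interval, since a larger share of the sampling distribution has to be covered.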


## Wrap up

- Confidence Intervals
