# Goals for today
- Understand the central limit theorem and why it's important for statistical analysis
- Understand sampling 
- Understand confidence intervals


# Central Limit Theorem
**The central limit theorem states that if a population has mean μ and standard deviation σ, and you repeatedly draw sufficiently large random samples from it with replacement (so that every draw is independent and comes from the same distribution), then the distribution of the sample means will be approximately normal, whatever the shape of the population itself.**

![](images/central-limit-theorem.png)
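
In symbols, one common way to write the statement above is the following. The notation $\bar{X}_n$ for the mean of a sample of size $n$ is an assumption introduced here for illustration; it is not used elsewhere in these notes.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
\;\overset{\text{approx.}}{\sim}\;
\mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
\quad \text{for sufficiently large } n
```

The spread of the sample means, $\sigma / \sqrt{n}$, is usually called the standard error: larger samples give sample means that sit closer to μ.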

Essentially, this tells us that sample means cluster around the true population mean: as we take more samples, the average of the sample means gets closer to it.

We can take our one sample and compare its mean to the population mean we expect (usually estimated from the distribution of sample means).

If our sample mean is very different, we are either very lucky/unlucky or there is something fundamentally different about our sample!

![](images/something-is-different.png)
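
The comparison described above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: the skewed `population`, the sample size `n`, and the helper `z_score` are all hypothetical, and it uses only the standard library so it runs without any data file.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical skewed population: exponential with mean 10 (clearly not normal)
population = [random.expovariate(1 / 10) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50  # size of each sample
sample_means = [
    statistics.fmean(random.choices(population, k=n))  # draw with replacement
    for _ in range(2_000)
]

# The CLT predicts the sample means centre on mu with spread sigma / sqrt(n)
standard_error = sigma / n ** 0.5
mean_of_means = statistics.fmean(sample_means)
observed_spread = statistics.stdev(sample_means)
print(f"population mean {mu:.2f}, mean of sample means {mean_of_means:.2f}")
print(f"predicted spread {standard_error:.2f}, observed spread {observed_spread:.2f}")


def z_score(sample_mean):
    """How many standard errors a sample mean sits from the population mean."""
    return (sample_mean - mu) / standard_error


# A sample mean several standard errors away is either very unlucky or different
print(f"z-score of a sample mean of 15: {z_score(15):.1f}")
```

A z-score beyond roughly ±2 is the "very lucky/unlucky, or something is different" situation pictured above.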

# Our View of the World Isn't Perfect

We don't have **perfect** information; life doesn't have an answer key

<br/>

![no answer in the back of the book meme](images/no-answers-in-back-of-book.jpg)

## Why can't we look at the whole population?

1. Expensive
2. Unrealistic
3. We don't need it to gain insights!

## A scenario
[Video Explanation](https://www.youtube.com/watch?v=jvoxEYmQHNM)

# Sampling

From a sample, we can get **point estimates**: estimates of population parameters (values that describe a characteristic of an entire population, such as the population mean).

- Each sample will have its own mean
- Plotting the means of many samples gives an approximately normal distribution!

Let's look at an example taken from the ubiquitous Iris dataset. This histogram shows the distribution of sepal length:


[probgif](http://localhost:8888/view/seattle_beta/distributions_and_clt_seattle-ds/img/probability-basics.gif)

##### Population vs. Sample Terminology
Characteristics of populations are called *parameters*

Characteristics of a sample are called *statistics*

![](https://media.cheggcdn.com/media/7ac/7ac1a812-3b41-4873-8413-b6a7b8fab530/CL-26481V_image_006.png)
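
To make the terminology concrete, here is a small sketch that computes parameters from a whole population and statistics from one sample of it. The population is made-up data and the variable names (`mu`, `sigma`, `x_bar`, `s`) follow a common convention rather than anything defined in these notes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 ages (made-up numbers, not real data)
population = [random.gauss(30, 12) for _ in range(10_000)]
sample = random.sample(population, k=100)  # one sample, drawn without replacement

# Parameters describe the whole population
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population formula, divides by N

# Statistics describe one sample and serve as point estimates of the parameters
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample formula, divides by n - 1

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

In real work only the statistics are available; the parameters are what we are trying to estimate.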

## Coding some samples

In [2]:
import pandas as pd
import numpy as np

# Load the raw CSV. The github.com/.../blob/... page URL returns HTML, which pandas cannot parse.
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')


In [3]:
df.info()  # prints its summary directly and returns None, so display() is not needed
df.head()

In [None]:
# Treat every passenger with a recorded age as our "population"
all_ages = df['Age'].dropna()
mean = all_ages.mean()
print(f'There are {all_ages.size} people, average age is {mean:.1f}')

In [None]:
# Take a random sample of 100 people (pass random_state=27 to make it reproducible)
sample = all_ages.sample(n=100)
mean_s = sample.mean()

def calc_percent_error(pop_mean, sample_mean):
    """Relative error of the sample mean, as a fraction of the population mean."""
    return np.abs(sample_mean - pop_mean) / pop_mean

percent_err = calc_percent_error(mean, mean_s)

print(f'The sample mean was {mean_s:.1f} with a percent error of {percent_err*100:.2f}%')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Repeatedly take samples of 30 people and record each sample mean
sample_means = []
for _ in range(10**3):
    sample = all_ages.sample(n=30)
    sample_means.append(sample.mean())

# Plot the 1,000 sample means; 30 bins keeps the bell shape visible
plt.hist(sample_means, bins=30);



In [None]:
mean_of_means = np.mean(sample_means)

print(f'The mean of sample means was {mean_of_means:.3f} vs the actual population mean {mean:.3f}')

In [None]:
print(f'Percent error: {100*calc_percent_error(mean, mean_of_means):.2f}%')

### Confidence Interval Interpretation 
![](https://thepsychologist.bps.org.uk/sites/thepsychologist.bps.org.uk/files/methodsfig1.jpg)
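**A 95% confidence interval is a range computed from a sample, centered on the sample mean, using a procedure that captures the true population mean in about 95% of repeated samples. It works because about 95% of sample means fall within roughly two standard errors of the population mean, so a sample mean farther away than that occurs less than 5% of the time. Relating this to p-values: a hypothesized population mean that falls outside the 95% confidence interval has a two-sided p-value below 0.05 and is therefore statistically significant at the 5% level.**

"we found our 95% confidence interval for ages to be from 26.3 to 28.3"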

OR

"we are 95% confident that the average age falls between 26.3 and 28.3"
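
As a rough sketch of where such an interval comes from, the code below builds a normal-approximation interval, `x_bar ± z * s / sqrt(n)`, and then checks how often that procedure captures the true mean. The helper `mean_confidence_interval` and the synthetic `population` are hypothetical, chosen for illustration; for small samples a t-based interval would be the more usual choice.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# z value that leaves 2.5% in each tail of a standard normal (about 1.96)
Z_95 = statistics.NormalDist().inv_cdf(0.975)


def mean_confidence_interval(sample, z=Z_95):
    """Normal-approximation CI for the mean: x_bar +/- z * s / sqrt(n)."""
    x_bar = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / len(sample) ** 0.5
    return x_bar - margin, x_bar + margin


# Hypothetical population of ages (made-up numbers, not the Titanic data)
population = [random.gauss(30, 14) for _ in range(20_000)]
mu = statistics.fmean(population)

first_low, first_high = mean_confidence_interval(random.sample(population, k=100))
print(f"95% CI from one sample: {first_low:.1f} to {first_high:.1f}")

# Repeat the procedure many times and count how often the interval captures mu
trials = 1_000
hits = 0
for _ in range(trials):
    low, high = mean_confidence_interval(random.sample(population, k=100))
    if low <= mu <= high:
        hits += 1
coverage = hits / trials
print(f"intervals that contained the true mean: {coverage:.1%}")
```

The "95%" describes the procedure: across many repeated samples, roughly 95% of the intervals built this way should contain the true mean.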