### Confidence Interval - Difference In Means

Here you will work through the example from the last video, then go a couple of steps further to see what might actually be going on with this data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)

In [2]:
sample_data.shape

(200, 4)

`1.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between coffee drinkers and non-coffee drinkers. Build a 99% confidence interval from the resulting sampling distribution. Use your interval to start answering the first quiz question below.

In [3]:
diff = []
for _ in range(10000):
    # Resample the 200 rows with replacement
    boot_sample = sample_data.sample(200, replace=True)
    drinks = boot_sample[boot_sample['drinks_coffee']==True]['height'].mean()
    nondrinks = boot_sample[boot_sample['drinks_coffee']==False]['height'].mean()
    # Drinkers minus non-drinkers: positive values mean coffee drinkers are taller
    diff.append(drinks-nondrinks)

In [4]:
np.mean(diff)

1.3362713474811618

In [5]:
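# A central 99% interval leaves 0.5% in each tail, so it uses the 0.5th and
# 99.5th percentiles. The output recorded below came from percentiles 1 and 100.
np.percentile(diff, 0.5), np.percentile(diff, 99.5)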

(0.23034449860735282, 3.0795905671682817)
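
The mapping from a confidence level to a pair of percentiles can be sketched as a small helper. The name `percentile_interval` is hypothetical and not part of the original notebook; it assumes a central (equal-tailed) interval, the same convention as the 2.5 and 97.5 percentiles used for the 95% intervals in steps 3 and 4.

```python
import numpy as np

def percentile_interval(values, confidence=0.99):
    """Central percentile interval (hypothetical helper, not from the notebook)."""
    tail = (1 - confidence) / 2 * 100   # percent left in each tail
    lower = np.percentile(values, tail)
    upper = np.percentile(values, 100 - tail)
    return lower, upper

# Illustration on synthetic values 0..100, where the q-th percentile equals q
demo = np.arange(101)
low_99, high_99 = percentile_interval(demo, 0.99)   # about (0.5, 99.5)
low_95, high_95 = percentile_interval(demo, 0.95)   # about (2.5, 97.5)
```

With the bootstrap differences from step 1, the call would be `percentile_interval(diff, 0.99)`.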

`2.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the difference in average height between those 21 or older and those younger than 21. Build a 99% confidence interval from the resulting sampling distribution. Use your interval to finish answering the first quiz question below.

In [6]:
diff_age_height = []
for _ in range(10000):
    # Resample the 200 rows with replacement
    boot_sample = sample_data.sample(200, replace=True)
    elder = boot_sample[boot_sample['age'] == '>=21']['height'].mean()
    younger = boot_sample[boot_sample['age'] == '<21']['height'].mean()
    # Older minus younger: positive values mean the 21-and-over group is taller
    diff_age_height.append(elder - younger)

In [7]:
np.mean(diff_age_height)

4.2473246928677675

In [8]:
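# A central 99% interval leaves 0.5% in each tail, so it uses the 0.5th and
# 99.5th percentiles. The output recorded below came from percentiles 1 and 100.
np.percentile(diff_age_height, 0.5), np.percentile(diff_age_height, 99.5)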

(3.4357212278738629, 5.5130868057033666)
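
Steps 1 and 2 repeat the same resampling loop with a different grouping column, so the pattern can be sketched as one reusable function. Everything here is an assumption for illustration: `bootstrap_mean_diff` is a hypothetical helper, and the `toy` frame holds made-up heights standing in for `coffee_dataset.csv`, which is not reproduced here. It uses a NumPy random generator rather than `DataFrame.sample`, which is a substitute for the notebook's approach, not the same call.

```python
import numpy as np
import pandas as pd

def bootstrap_mean_diff(df, group_col, value_col, first, second,
                        n_boot=1000, seed=0):
    """Bootstrap the difference mean(first) - mean(second).

    Hypothetical helper, not from the notebook.
    """
    rng = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample row positions with replacement, same size as the sample
        idx = rng.integers(0, len(df), size=len(df))
        boot = df.iloc[idx]
        means = boot.groupby(group_col)[value_col].mean()
        diffs[i] = means[first] - means[second]
    return diffs

# Synthetic stand-in for the coffee data (values are made up for illustration)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'age': np.repeat(['>=21', '<21'], 100),
    'height': np.concatenate([rng.normal(70, 3, 100), rng.normal(66, 3, 100)]),
})
toy_diffs = bootstrap_mean_diff(toy, 'age', 'height', '>=21', '<21')
toy_interval = (np.percentile(toy_diffs, 0.5), np.percentile(toy_diffs, 99.5))
```

The sketch assumes every bootstrap sample contains both groups; with very small or very unbalanced groups, a resample could miss one and the lookup would fail.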

`3.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** in average height between coffee drinkers and non-coffee drinkers for individuals **under** 21 years old. Build a 95% confidence interval from the resulting sampling distribution. Use your interval to start answering question 2 below.

In [38]:
diff_height_under_21 = []
for _ in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    drinks = boot_sample[boot_sample['drinks_coffee']==True]
    drinks_under_21 = drinks[drinks['age']=='<21']['height'].mean()
    
    non_drinks = boot_sample[boot_sample['drinks_coffee']==False]
    non_drinks_under_21 = non_drinks[non_drinks['age'] == '<21']['height'].mean()
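    # Note the order: non-drinkers minus drinkers (the reverse of step 1), so
    # positive values mean non-drinkers are taller within this age group
    diff_height_under_21.append(non_drinks_under_21 - drinks_under_21)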

In [39]:
np.mean(diff_height_under_21)

1.8451513244318667

In [40]:
np.percentile(diff_height_under_21,2.5),np.percentile(diff_height_under_21, 97.5)

(1.0697115403286301, 2.6095761316556154)

`4.` For 10,000 iterations, draw a bootstrap sample from your sample data and compute the **difference** in average height between coffee drinkers and non-coffee drinkers for individuals **21 or older**. Build a 95% confidence interval from the resulting sampling distribution. Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [41]:
diff_height_greater_21 = []
for _ in range(10000):
    boot_sample = sample_data.sample(200, replace=True)
    drinks = boot_sample[boot_sample['drinks_coffee']==True]
    drinks_greater_21 = drinks[drinks['age']=='>=21']['height'].mean()
    
    non_drinks = boot_sample[boot_sample['drinks_coffee']==False]
    non_drinks_greater_21 = non_drinks[non_drinks['age'] == '>=21']['height'].mean()
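    # Note the order: non-drinkers minus drinkers (the reverse of step 1), so
    # positive values mean non-drinkers are taller within this age group
    diff_height_greater_21.append(non_drinks_greater_21 - drinks_greater_21)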

In [42]:
np.mean(diff_height_greater_21)

3.1288407754803331

In [44]:
np.percentile(diff_height_greater_21,2.5),np.percentile(diff_height_greater_21, 97.5)

(1.8354657110658437, 4.3981485868571255)
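
The overall difference in step 1 and the within-group differences in steps 3 and 4 point in opposite directions: coffee drinkers are taller overall, yet non-drinkers are taller within each age group. One way this can happen is when the groups are unevenly mixed across age. The sketch below uses a tiny made-up table (not the coffee data) in which most coffee drinkers fall in the older, taller group. A reversal of this kind is commonly called Simpson's paradox.

```python
import pandas as pd

# Made-up heights chosen only to illustrate the reversal
toy = pd.DataFrame({
    'age':           ['<21'] * 10 + ['>=21'] * 10,
    'drinks_coffee': [False] * 8 + [True] * 2 + [False] * 2 + [True] * 8,
    'height':        [66.0] * 8 + [64.0] * 2 + [72.0] * 2 + [70.0] * 8,
})

# Overall: mean height of drinkers minus mean height of non-drinkers
drinks = toy['drinks_coffee']
overall_diff = toy[drinks]['height'].mean() - toy[~drinks]['height'].mean()

# Within each age group: the same difference, computed per group
within_diff = {}
for age, group in toy.groupby('age'):
    d = group['drinks_coffee']
    within_diff[age] = group[d]['height'].mean() - group[~d]['height'].mean()

# overall_diff is about 1.6 (drinkers taller overall), while within_diff
# is -2.0 for both '<21' and '>=21' (drinkers shorter within each group)
```

In this toy table the overall gap comes entirely from age: drinkers are mostly 21 or older, and that group is taller regardless of coffee.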