### Confidence Interval - Difference In Means

Here you will work through the example from the last video, then go a couple of steps further to see what might actually be going on with this data.

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
np.random.seed(42)

full_data = pd.read_csv('coffee_dataset.csv')
sample_data = full_data.sample(200)
sample_data.head()

| | user_id | age | drinks_coffee | height |
|---|---|---|---|---|
| 2402 | 2874 | <21 | True | 64.357154 |
| 2864 | 3670 | >=21 | True | 66.859636 |
| 2167 | 7441 | <21 | False | 66.659561 |
| 507 | 2781 | >=21 | True | 70.166241 |
| 1817 | 2875 | >=21 | True | 71.36912 |


`1.` For 10,000 iterations, bootstrap sample your sample data and compute the difference in the average heights for coffee and non-coffee drinkers. Build a 99% confidence interval using your sampling distribution. Use your interval to start answering the first quiz question below.

In [18]:
# Bootstrap distribution of (coffee mean height - non-coffee mean height)
diff = []
for _ in range(10000):
    # Resample all 200 rows with replacement
    boot_samp = sample_data.sample(200, replace=True)
    coff_mean = boot_samp[boot_samp['drinks_coffee'] == True]['height'].mean()
    not_coffee_mean = boot_samp[boot_samp['drinks_coffee'] == False]['height'].mean()
    diff.append(coff_mean - not_coffee_mean)

In [19]:
np.percentile(diff, 0.5), np.percentile(diff, 99.5)

(0.10258900080919674, 2.5388333707966284)
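The loop above can be wrapped in a small reusable helper. The following is only a sketch: the function name `bootstrap_mean_diff_ci`, its parameters, and the synthetic `toy` data are assumptions made for illustration and are not part of the original notebook or of `coffee_dataset.csv`. It resamples row positions with NumPy rather than calling `DataFrame.sample`, which should give an equivalent percentile interval.

```python
import numpy as np
import pandas as pd


def bootstrap_mean_diff_ci(df, group_col, group_a, group_b, value_col,
                           n_boot=2000, level=0.99, seed=42):
    """Percentile bootstrap CI for mean(value | group_a) - mean(value | group_b).

    Hypothetical helper, not taken from the notebook.
    """
    rng = np.random.default_rng(seed)
    groups = df[group_col].to_numpy()
    values = df[value_col].to_numpy()
    n = len(df)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # row positions drawn with replacement
        g, v = groups[idx], values[idx]
        diffs[i] = v[g == group_a].mean() - v[g == group_b].mean()
    tail = 100 * (1 - level) / 2  # e.g. 0.5 for a 99% interval
    lower, upper = np.percentile(diffs, [tail, 100 - tail])
    return lower, upper


# Synthetic stand-in data (assumption): 200 people, heights in inches.
data_rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "drinks_coffee": data_rng.random(200) < 0.6,
    "height": data_rng.normal(67, 3, size=200),
})

low, high = bootstrap_mean_diff_ci(toy, "drinks_coffee", True, False, "height")
print(low, high)
```

With the real data, the equivalent call would pass `sample_data` in place of `toy`; the later questions only change the grouping column and the confidence level.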

`2.` For 10,000 iterations, bootstrap sample your sample data and compute the difference in the average heights for those 21 or older and those under 21. Build a 99% confidence interval using your sampling distribution. Use your interval to finish answering the first quiz question below.

In [20]:
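# Bootstrap distribution of (21-or-older mean height - under-21 mean height)
diff_age = []
for _ in range(10000):
    boot_samp = sample_data.sample(200, replace=True)
    under_21 = boot_samp[boot_samp['age'] == '<21']['height'].mean()
    # Every row that is not '<21' is treated as 21 or older
    over_21 = boot_samp[boot_samp['age'] != '<21']['height'].mean()
    diff_age.append(over_21 - under_21)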

In [21]:
np.percentile(diff_age, 0.5), np.percentile(diff_age, 99.5)

(3.3652749452554089, 5.0932450670660936)
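One common way to read intervals like the two above is to check whether they contain zero: an interval that lies entirely on one side of zero suggests a real difference in average height at that confidence level. Here is a minimal sketch of that check. The helper name `excludes_zero` is made up for illustration, and the endpoints are copied from the two outputs above.

```python
def excludes_zero(interval):
    """Return True when both endpoints lie on the same side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Endpoints copied from the two 99% intervals computed above.
coffee_ci = (0.10258900080919674, 2.5388333707966284)
age_ci = (3.3652749452554089, 5.0932450670660936)

for name, ci in [("coffee vs. no coffee", coffee_ci), ("21+ vs. under 21", age_ci)]:
    print(f"{name}: {ci[0]:.2f} to {ci[1]:.2f}, excludes zero: {excludes_zero(ci)}")
```

Both intervals sit above zero, so in this sample coffee drinkers and people 21 or older both appear taller on average. The coffee interval only just clears zero, while the age interval is much further away.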

`3.` For 10,000 iterations, bootstrap your sample data and compute the **difference** in average height between coffee drinkers and non-coffee drinkers for individuals **under** 21 years old (the code below takes non-coffee minus coffee). Using your sampling distribution, build a 95% confidence interval. Use your interval to start answering question 2 below.

In [22]:
# Under-21 group only: bootstrap distribution of (non-coffee mean - coffee mean)
diff_coffee_under_21 = []
for _ in range(10000):
    boot_samp = sample_data.sample(200, replace=True)
    under_21_coffee = boot_samp.query("age == '<21' and drinks_coffee == True")['height'].mean()
    under_21_no_coffee = boot_samp.query("age == '<21' and drinks_coffee == False")['height'].mean()
    # Note the order: a positive value means non-coffee drinkers are taller
    diff_coffee_under_21.append(under_21_no_coffee - under_21_coffee)

In [23]:
np.percentile(diff_coffee_under_21, 2.5), np.percentile(diff_coffee_under_21, 97.5)

(1.0593651244624331, 2.5931557940679251)

`4.` For 10,000 iterations, bootstrap your sample data and compute the **difference** in average height between coffee drinkers and non-coffee drinkers for individuals **21 or older** (the code below takes non-coffee minus coffee). Using your sampling distribution, build a 95% confidence interval. Use your interval to finish answering the second quiz question below, as well as the questions that follow.

In [27]:
# 21-or-older group only: bootstrap distribution of (non-coffee mean - coffee mean)
diff_coffee_over_21 = []
for _ in range(10000):
    boot_samp = sample_data.sample(200, replace=True)
    over_21_coffee = boot_samp.query("age != '<21' and drinks_coffee == True")['height'].mean()
    over_21_no_coffee = boot_samp.query("age != '<21' and drinks_coffee == False")['height'].mean()
    # Note the order: a positive value means non-coffee drinkers are taller
    diff_coffee_over_21.append(over_21_no_coffee - over_21_coffee)

In [28]:
np.percentile(diff_coffee_over_21, 2.5), np.percentile(diff_coffee_over_21, 97.5)

(1.8299236976994853, 4.414728838621194)
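Taken together, the intervals show coffee drinkers looking taller overall while non-coffee drinkers look taller within each age group. This kind of reversal is commonly called Simpson's paradox, and it can arise when the groups are unevenly mixed: if most coffee drinkers are 21 or older, and older people are taller, the overall comparison mostly reflects age. The toy data below is invented purely to reproduce the pattern with exact numbers; it is an assumption for illustration and not drawn from `coffee_dataset.csv`. Note that `coffee_gap` (a hypothetical helper) uses coffee minus non-coffee throughout.

```python
import pandas as pd

# Invented data: under-21s are shorter and mostly do not drink coffee,
# while people 21 or older are taller and mostly do.
toy_heights = pd.DataFrame({
    "age": ["<21"] * 5 + [">=21"] * 5,
    "drinks_coffee": [False] * 4 + [True] + [False] + [True] * 4,
    "height": [66.0] * 4 + [64.0] + [71.0] + [69.0] * 4,
})


def coffee_gap(df):
    """Mean height of coffee drinkers minus mean height of non-coffee drinkers."""
    coffee = df.loc[df["drinks_coffee"], "height"].mean()
    no_coffee = df.loc[~df["drinks_coffee"], "height"].mean()
    return coffee - no_coffee


overall_gap = coffee_gap(toy_heights)
gap_by_age = {age: coffee_gap(group) for age, group in toy_heights.groupby("age")}

print("overall:", overall_gap)     # positive: coffee drinkers look taller
print("within age groups:", gap_by_age)  # negative in both groups
```

In this toy example the overall gap is positive only because the coffee drinkers are mostly in the taller age group. Once age is held fixed, the sign flips in both groups.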