# Lambda School Data Science - Unit 1 Sprint 2 Module 3

---

## Assignment: Sampling and Confidence Intervals

### Objectives

* Objective 01 - explain the concepts of statistical estimate, precision, and standard error as they apply to inferential statistics
* Objective 02 - explain the implications of the central limit theorem in inferential statistics
* Objective 03 - explain the purpose of and identify applications for confidence intervals, and demonstrate how to build a confidence interval around a sample estimate
* Objective 04 - visualize a confidence interval in order to communicate the precision of sample estimates

## Confidence Intervals

Soft drinks like Coke and Pepsi are manufactured to have a standard caffeine content. For example, a 12-oz serving of Coke has 34mg of caffeine, and a 12-oz serving of Pepsi has 37.6mg of caffeine. However, fountain soft drinks are typically mixed in individual restaurant dispensers, so it is more difficult to maintain a standard level of caffeine per serving. In this study, researchers randomly sampled Coke, Diet Coke, Pepsi, and Diet Pepsi at a set of franchise restaurants and measured the caffeine content in 12oz of each soft drink. The data is found in the Soda.xlsx dataset.

Because individuals can be sensitive to caffeine – and because the manufacturers are interested in product consistency – we wish to estimate the mean caffeine content in 12oz of Coke served in franchise restaurants using a 95% confidence interval. 

You can find the Coke data here: 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv'

The first variable is the sample ID and the second variable is the caffeine content in the 12-oz sample measured in mg.

Source: A.N. Garand and L.N. Bell (1997). "Caffeine Content of Fountain and Private-Label Store Brand Carbonated Beverages," Journal of the American Dietetic Association, Vol. 97, #2, pp. 179-182.


### 1) Load the dataset and print the first few rows.

* name your DataFrame `coke_df`
* set `skipinitialspace=True`
* set `header=0`

In [None]:
# Import your libraries and load the data
import pandas as pd
import numpy as np
### your code here ###
coke_df = pd.read_csv('https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Coke.csv', skipinitialspace=True, header=0)

In [None]:
# Take a look at your data

### your code here ###
coke_df

Unnamed: 0,Drink,Caffeine
0,1,47.32
1,2,43.78
2,3,48.12
3,4,43.25
4,5,46.42
5,6,45.16
6,7,45.17
7,8,42.48
8,9,47.7
9,10,36.62


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check if the DataFrame was created
assert not coke_df.empty, 'Make sure the df name is accurate and you loaded the correct URL.'
# check the shape of the DataFrame
assert coke_df.shape == (50, 2), 'Is your data loaded with the correct argument?'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### 2) Calculate the mean, standard deviation (SD), and standard error (SE) of the caffeine content, and n for the sample size.

Label your variables as follows:

* `mean_caffeine`
* `sd_caffeine`
* `n_caffeine`
* `se_caffeine`

In [None]:
### your code here ###

mean_caffeine = coke_df['Caffeine'].mean()
sd_caffeine = coke_df['Caffeine'].std()
n_caffeine = coke_df['Caffeine'].count()
se_caffeine = sd_caffeine / (n_caffeine**(1/2))

print ('Mean :', mean_caffeine)
print ('SD :', sd_caffeine)
print ('Number of Samples :', n_caffeine)
print ('Standard Error :', se_caffeine)

Mean : 37.9402
SD : 5.243756828216712
Number of Samples : 50
Standard Error : 0.7415792024250598


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check statistics calculations
assert round(mean_caffeine) == 38, 'Check your mean value'
assert round(sd_caffeine) == 5, 'Check your standard deviation value'
assert n_caffeine == 50, 'Check the sample number'
assert round(se_caffeine, 2) == 0.74, 'Check the standard error value'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Summarize your results from above in a sentence or two.

Answer-->

The mean caffeine content of the 50 sampled 12-oz fountain Cokes is about 37.94 mg, with a standard deviation of about 5.24 mg (roughly 14% of the mean), so individual servings vary noticeably around the mean. The standard error of the mean is about 0.74 mg.
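
The SD and SE calculation above can be cross-checked as in the sketch below. It uses a small set of made-up measurements (not the Coke data) and assumes NumPy and SciPy are available. The point it illustrates is that pandas' `.std()` divides by n - 1 by default, while NumPy's `.std()` divides by n unless `ddof=1` is passed, so the two can disagree.

```python
import numpy as np
from scipy import stats

# Hypothetical caffeine measurements in mg (illustrative values, not the Coke data)
caffeine = np.array([36.2, 41.5, 33.8, 39.1, 44.0, 35.7, 38.4, 40.3])

n = caffeine.size
sd_sample = caffeine.std(ddof=1)      # sample SD, divides by n - 1 (pandas' default)
sd_population = caffeine.std(ddof=0)  # NumPy's default, divides by n

se_manual = sd_sample / np.sqrt(n)    # standard error of the mean
se_scipy = stats.sem(caffeine)        # scipy.stats.sem uses ddof=1 by default

print('Sample SD     :', sd_sample)
print('Population SD :', sd_population)
print('SE (manual)   :', se_manual)
print('SE (scipy)    :', se_scipy)
```

The manual SE and `scipy.stats.sem` should agree, and the sample SD should come out slightly larger than the population SD.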

### 3) Find t* for a 95% confidence interval.

Use the starter code below and *fill in the degrees of freedom*. The `t_star` variable has been created for you.

In [None]:
# Import the stats library
from scipy.stats import t

# Don't worry too much about where the 0.975 comes from. It has to do
# with wanting to determine the *middle* 95% of the t-distribution.
# We're going to learn an easier way to calculate a 95% CI in just a minute.

# Hint: Recall that n = 223 for the body temp problem. What was the dof for that problem?

### your code here: fill in the degrees of freedom ###
t_star = t.ppf(0.975, df = n_caffeine - 1)
print ('t_star =', t_star)

t_star = 2.009575234489209


In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check statistics calculations
assert round(t_star) == 2, 'Check the dof you entered!'
print('Correct! Continue to the next question')

Correct! Continue to the next question
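
For readers curious about the 0.975 in the starter code above, here is a minimal sketch of where it comes from. A 95% interval leaves 5% of the area outside, split evenly between the two tails, so the upper cut-off sits at the 1 - 0.05/2 = 0.975 quantile. The variable names here are illustrative and not part of the assignment.

```python
from scipy.stats import t

confidence = 0.95
alpha = 1 - confidence          # total area left in the two tails (0.05)
upper_quantile = 1 - alpha / 2  # 0.975: area to the left of +t*

df = 50 - 1                     # degrees of freedom for the Coke sample (n = 50)
t_star = t.ppf(upper_quantile, df=df)

# The area between -t* and +t* should equal the confidence level
middle_area = t.cdf(t_star, df=df) - t.cdf(-t_star, df=df)

print('upper quantile =', upper_quantile)
print('t_star =', t_star)
print('middle area =', middle_area)
```

The same pattern gives 0.95 for a 90% interval and 0.995 for a 99% interval.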


### 4) Calculate the margin of error for a 95% confidence interval for the mean caffeine content in a 12-oz Coke. Name your variable `m_err`.



In [None]:
### your code here ###
# margin of error formula = t* x s / sqrt(n)
m_err = t_star * sd_caffeine / (n_caffeine**(1/2))
m_err

1.49025919960566

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check margin of error calculation
assert round(m_err, 2) == 1.49, 'Did you multiply m_err by the correct value?'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### State the margin of error answer with the correct units. (example: The margin of error is 5 pounds per bag of cat food).

Answer-->
The margin of error is approximately 1.5 mg of caffeine per 12-oz fountain Coke.
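
As a worked check of that figure, plugging the rounded summary statistics from question 2 and the t* from question 3 into the margin of error formula gives roughly:

$$\text{ME} = t^* \frac{s}{\sqrt{n}} \approx 2.0096 \times \frac{5.2438}{\sqrt{50}} \approx 2.0096 \times 0.7416 \approx 1.49 \text{ mg}$$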

### 5) Calculate a 95% CI for the mean caffeine content in a 12-oz fountain Coke with the CI formula using the summary statistics and t* that you calculated above.

Name your variables as follows:

* lower confidence limit: `lower_CL`
* upper confidence limit: `upper_CL`

In [None]:
### your code here ###
lower_CL = mean_caffeine - m_err
upper_CL = mean_caffeine + m_err

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(lower_CL, 2) == 36.45, 'Check your lower CL calculation.'
assert round(upper_CL, 2) == 39.43, 'Check your upper CL calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### 6) Calculate a **95% confidence interval** for the mean caffeine content in a 12-oz fountain Coke using the t-interval function in Python. Name your variable `t_int_95`.

In [None]:
### your code here ###
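# Pass the confidence level positionally: newer SciPy releases renamed the
# `alpha` keyword to `confidence`, so `alpha=` fails there.
t_int_95 = t.interval(0.95, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)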
t_int_95

(36.44994080039434, 39.43045919960566)

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_95[0], 2) == 36.45, 'Check your interval calculation.'
assert round(t_int_95[1], 2) == 39.43, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### 7) Compare the two confidence intervals you calculated.  Do they match?  Should they?

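They do match, and they should! Both are computed as the sample mean plus or minus t* times the standard error, using the same 49 degrees of freedom.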
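
The comparison can also be made programmatically. The sketch below re-enters the summary statistics printed earlier in this notebook as constants (so it runs without the CSV) and checks that the formula-based interval and `t.interval` agree to floating-point precision.

```python
import numpy as np
from scipy.stats import t

# Summary statistics printed earlier in this notebook
mean_caffeine = 37.9402
sd_caffeine = 5.243756828216712
n_caffeine = 50

se_caffeine = sd_caffeine / np.sqrt(n_caffeine)

# Method 1: CI formula, mean +/- t* x SE
t_star = t.ppf(0.975, df=n_caffeine - 1)
manual_ci = (mean_caffeine - t_star * se_caffeine,
             mean_caffeine + t_star * se_caffeine)

# Method 2: t.interval, with the confidence level passed positionally
function_ci = t.interval(0.95, df=n_caffeine - 1,
                         loc=mean_caffeine, scale=se_caffeine)

print('Formula   :', manual_ci)
print('t.interval:', function_ci)
print('Match     :', np.allclose(manual_ci, function_ci))
```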

### 8) Interpret the meaning of the 95% confidence interval for the mean caffeine content in a 12-oz fountain Coke in a sentence or two.

ANSWER-->
We are 95% confident that the population mean caffeine content of a 12-oz fountain Coke is between 36.45 mg and 39.43 mg.
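
One way to see what "95% confident" means is a small simulation. The sketch below assumes a hypothetical normal population (mean 38 mg, SD 5 mg, chosen only for illustration). It repeatedly draws samples of 50, builds a 95% t-interval from each, and counts how often the interval contains the true mean. This illustrates the idea and is not part of the assignment.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)   # fixed seed so the run is repeatable

# Hypothetical population, chosen only for illustration
true_mean, true_sd = 38.0, 5.0
n, n_repeats = 50, 2000

t_star = t.ppf(0.975, df=n - 1)
covered = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lower = sample.mean() - t_star * se
    upper = sample.mean() + t_star * se
    if lower <= true_mean <= upper:
        covered += 1

coverage = covered / n_repeats
print('Share of intervals containing the true mean:', coverage)
```

The share should land close to 0.95: the procedure captures the true mean in about 95% of repeated samples, which is what the confidence level describes.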

### 9) Using the t-interval Python function, calculate a **90% confidence interval** for the mean caffeine content in a 12-oz Coke. Name your variable `t_int_90` (make sure to use `90` at the end!).


In [None]:
### your code here ###
t_int_90 = t.interval(0.90, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)
t_int_90

(36.696904726749196, 39.1834952732508)

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_90[0], 2) == 36.70, 'Check your interval calculation.'
assert round(t_int_90[1], 2) == 39.18, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Is this estimate *more accurate* or *more precise* (pick one) than the 95% confidence interval?


ANSWER-->

The 90% confidence interval is more **precise** than the 95% confidence interval: it is narrower (36.70 mg to 39.18 mg versus 36.45 mg to 39.43 mg), at the cost of lower confidence that it contains the population mean.


### 10) Using the t-interval Python function, calculate a **99% confidence interval** for the mean caffeine content in a 12-oz Coke.  Name your variable `t_int_99` (make sure to use `99` at the end!).




In [None]:
### your code here ###
t_int_99 = t.interval(0.99, df=n_caffeine - 1, loc=mean_caffeine, scale=se_caffeine)
t_int_99

(35.95280335285685, 39.92759664714315)

In [None]:
# This is an ANSWER CHECK cell.
# Don't alter this cell if you want accurate feedback
#------------------------------------------------------------------------------#

# check confidence level calculations
assert round(t_int_99[0], 2) == 35.95, 'Check your interval calculation.'
assert round(t_int_99[1], 2) == 39.93, 'Check your interval calculation.'
print('Correct! Continue to the next question')

Correct! Continue to the next question


### Is this estimate more *accurate* or more *precise* (pick one) than the 95% confidence interval?

ANSWER-->
We are 99% confident that the population mean caffeine content of a 12-oz Coke is between 35.95 mg and 39.93 mg.

The 99% confidence interval is more **accurate** than the 95% confidence interval: because it is wider, it is more likely to contain the true population mean. The trade-off is that it is less precise.
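
The trade-off between confidence level and precision can be seen by putting the three intervals side by side. The sketch below re-enters the mean and standard error printed earlier as constants and reports each interval's width.

```python
from scipy.stats import t

# Summary statistics printed earlier in this notebook
mean_caffeine = 37.9402
se_caffeine = 0.7415792024250598
df = 50 - 1

widths = {}
for confidence in (0.90, 0.95, 0.99):
    lower, upper = t.interval(confidence, df=df, loc=mean_caffeine, scale=se_caffeine)
    widths[confidence] = upper - lower
    print(f'{confidence:.0%} CI: ({lower:.2f}, {upper:.2f}), width = {widths[confidence]:.2f} mg')
```

The width grows as the confidence level rises: the 90% interval is the narrowest (most precise), and the 99% interval is the widest.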

## Stretch goals:

### 1) The correspondence between confidence intervals and hypothesis tests.

Read [this](https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-confidence-intervals-and-confidence-levels#:~:text=If%20a%20hypothesis%20test%20produces,corresponding%20confidence%20level%20is%2095%25.&text=If%20the%20confidence%20interval%20does,the%20results%20are%20statistically%20significant.) article about the correspondence between confidence intervals and hypothesis tests.  Feel free to read the whole article, but the relevant part can be found under the heading "Why P Values and Confidence Intervals Always Agree About Statistical Significance".

Imagine you work for quality control at Coke and are tasked with making sure that the caffeine content in the fountain beverages served in restaurants is the same as in a 12-oz can of Coke (34 mg).  If you believe that the mean caffeine content in fountain Coke is not 34 mg, you must re-train the franchise managers to make sure the Coke served has the correct caffeine level.

Based on the confidence interval you calculated in the assignment, do you believe that the mean caffeine content is statistically significantly different from 34 mg in a 12-oz serving?


Because 34 mg lies outside the 95% confidence interval, we can reject, at the 5% significance level, the null hypothesis that the mean caffeine content in 12 oz of fountain Coke is equal to 34 mg.  Instead, we estimate that it is between about 36.45 mg and 39.43 mg.
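
The agreement between the confidence interval and the hypothesis test can be checked directly. The sketch below is a minimal one-sample, two-sided t-test computed from the summary statistics printed earlier (re-entered as constants). It compares "34 mg is outside the 95% CI" with "p-value below 0.05".

```python
from scipy.stats import t

# Summary statistics printed earlier in this notebook
mean_caffeine = 37.9402
se_caffeine = 0.7415792024250598
df = 50 - 1
mu_0 = 34.0  # null-hypothesis mean: caffeine in a 12-oz can of Coke (mg)

t_stat = (mean_caffeine - mu_0) / se_caffeine
p_value = 2 * t.sf(abs(t_stat), df=df)   # two-sided p-value

lower, upper = t.interval(0.95, df=df, loc=mean_caffeine, scale=se_caffeine)
outside_ci = not (lower <= mu_0 <= upper)

print('t statistic:', t_stat)
print('p-value    :', p_value)
print('34 mg outside the 95% CI:', outside_ci)
print('p-value below 0.05      :', p_value < 0.05)
```

The last two lines should always agree, because both ask whether the sample mean is more than t* standard errors away from 34 mg.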

### 2) If we increased the sample size from 50 to 100 but the sample mean and SD remained the same, describe **two** ways the margin of error would change.  Would the margin of error become smaller or larger?

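Both t* and n would change.  Therefore, the margin of error would become smaller: t* decreases slightly, and the $\sqrt{n}$ in the denominator grows.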

In [None]:
t_star_n100 = t.ppf(0.975,df=100-1)
print ('t_star at n = 100', t_star_n100)
### your code here ###

t_star_n50 = t.ppf(0.975,df=50-1)
print ('t_star at n = 50', t_star_n50)
### your code here ###

t_star at n = 100 1.9842169515086827
t_star at n = 50 2.009575234489209


ANSWER-->
Consider the formula for the margin of error:

$t^* \frac{s}{\sqrt{n}}$ 

For a fixed t*, the margin of error is inversely proportional to the square root of the sample size. As n **increases**, the margin of error **decreases** if the sample SD is held constant.

In addition, the calculation above shows that t* depends on the degrees of freedom: for a 95% confidence level, t* is about 1.9842 at n = 100 (df = 99) versus about 2.0096 at n = 50 (df = 49). So t* also **decreases** as the sample size increases, which shrinks the margin of error a little further.
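
To put numbers on both effects, the sketch below computes the margin of error at n = 50 and n = 100 with the sample SD held at the value from the assignment. The helper `margin_of_error` is a hypothetical function written for this illustration, not part of the assignment or of SciPy.

```python
import math
from scipy.stats import t

sd_caffeine = 5.243756828216712   # sample SD from the assignment, held fixed


def margin_of_error(sd, n, confidence=0.95):
    """Margin of error for a t-based CI around a sample mean (illustrative helper)."""
    t_star = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    return t_star * sd / math.sqrt(n)


m_err_50 = margin_of_error(sd_caffeine, 50)
m_err_100 = margin_of_error(sd_caffeine, 100)

print('Margin of error at n = 50 :', m_err_50)
print('Margin of error at n = 100:', m_err_100)
print('Ratio (n = 100 / n = 50)  :', m_err_100 / m_err_50)
```

If only $\sqrt{n}$ changed, doubling n would scale the margin of error by $1/\sqrt{2} \approx 0.707$. The ratio comes out slightly below that because t* also shrinks.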