# Homework 2: Introduction to Inference

!!! **IMPORTANT, DO NOT PROCEED BEFORE COMPLETING THE STEP BELOW** !!!

If you haven't already, please make a copy of this notebook and save it to your Google Drive.
This is imperative so that your work is saved as you go.

**Due Date**: Thursday, April 17th at 11:59pm.

**Submission Instructions**:
- Download the notebook: Go to File --> Download --> Download .ipynb.
- Upload the notebook: Click the Files icon (left side under the Key icon) --> Click the Upload icon (leftmost of 4) --> Select the file you just downloaded.
- Run the last cell in this notebook.
- Find the new pdf file in the same location as your uploaded notebook.
- Click the 3 vertical dots for this pdf file --> Click Download.
- IMPORTANT: check that your pdf file has not cut off any work from your notebook.
- Upload the pdf to Gradescope.

**Learning Outcomes**:
- Understand statistical inference and uncertainty using Python.
- Critically assess assumptions in common inference methods.
- Construct and interpret confidence intervals.

## Poll Link

Please put a link to your poll here! Remember to make sure that your link is publicly accessible, and is also pasted in the Google Sheet.

Answer here!

------------------------------------------------

------------------------------------------------

## Set up

Run the cell below to import the libraries we are going to use.

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

## Inference with M&Ms


### M&Ms data generating process

Building on the experiment we did in class, this exercise explores the question: On average, what proportion of M&M's are primary colored (yellow, red, or blue)?

You'll use data from a single bag of M&M's (the *sample*) to make inferences about the larger, unseen M&M's bagging process (the *population* or *data-generating process*).

> At this point, it's natural to worry that filling bags with candy is totally unrelated to your future career prospects. However, this setup is surprisingly common in industry settings.
>
> For example, suppose you're a product manager who is interested in understanding your customer base. If we survey a random sample of customers, we can think of the aggregate opinion of the entire customer base as the properties of the unobserved M&M's bagging process, and our survey results as the observed bag of M&M's.
>
> Same idea if you're a pollster trying to understand the fraction of all voters who identify as Republicans when all you get to observe is a small sample of voters.
>
> The methods taught in this notebook are used *constantly* by practitioners.

### Exercise 0

If you attended Lecture 3 in person, you received a bag of M&Ms. Before eating your M&Ms, you reported the count of M&Ms of each color. Report the count of your M&Ms in the code cell below. If you do not remember your count, please locate it in the [aggregated data](https://docs.google.com/spreadsheets/d/1ltmPLWrfBs7JWcOLLq1GFQiibN9W0OfWYFzEiUFohrU/edit?resourcekey=&gid=469698708#gid=469698708).

In [24]:
# Code here!
# --------------------------------- #

# --------------------------------- #

### Point estimates

Our first objective is to provide a *point estimate*, or single best guess, of the *population* proportion of M&M's that are primary colored.

### Exercise 1

In the code cell below, provide a point estimate for the population proportion of M&M's that are primary colored. **Make sure you print the result.**

In [None]:
# Code here!
# --------------------------------- #

# --------------------------------- #

print("Point estimate: ", my_point_estimate)

### Uncertainty

Point estimates are often straightforward to calculate.

Here's the problem: With only one bag of M&M's, how sure are you of your point estimate?

- If you were instead given a different bag of M&M's, would you have had the same point estimate?

- If you were instead given a smaller bag of M&M's, would you be less confident of your point estimate?

- If you were instead given a Costco-sized plastic tub of M&M's, would you be more confident of your point estimate?

What's going on here is **counterfactual reasoning**. In statistical inference, we need to think about **what could have happened in parallel universes**. The (unobserved) distribution of point estimates across these parallel universes is called a **sampling distribution**. This idea powers [frequentist statistical inference](https://en.wikipedia.org/wiki/Frequentist_inference#Relationship_with_other_approaches)!
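To make the "parallel universes" idea concrete, here is a minimal simulation sketch. It assumes a hypothetical population proportion (`true_p = 0.5`) and a hypothetical bag size of 10; neither value comes from the class data. Each simulated bag plays the role of one universe, and the collection of resulting point estimates approximates a sampling distribution.

```python
import numpy as np

# Assumed values for illustration only (not taken from the class data).
true_p = 0.5          # hypothetical population proportion of primary colored M&M's
bag_size = 10         # hypothetical number of M&M's per bag
n_universes = 10_000  # number of simulated "parallel universes"

rng = np.random.default_rng(seed=0)

# Each universe draws one bag; the binomial draw counts its primary colored M&M's.
primary_counts = rng.binomial(n=bag_size, p=true_p, size=n_universes)
simulated_estimates = primary_counts / bag_size

# The spread of these point estimates approximates the sampling distribution.
print("Mean of simulated point estimates:", simulated_estimates.mean())
print("Std. dev. of simulated point estimates:", simulated_estimates.std())
```

Changing `bag_size` in this sketch is one way to explore the smaller-bag and Costco-tub questions above.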

### Observing parallel universes?!

We're in an exciting scenario where we can actually *observe parallel universes* where other point estimates were generated. Most of your classmates have also gathered data from their own small random sample of M&Ms.

It's important to stress that **this is an unrealistic scenario**. We normally only see one sample of data.

If we plot the point estimates from all students in the course, we can get an approximation of the theoretical sampling distribution.

### Exercise 2



Plot the distribution of your and your classmates' point estimates for the proportion of M&M's that are primary colored. Draw a vertical line on your plot indicating the value of your own point estimate.

*Reminder*: Starting with this homework, we will expect to see plots that are appropriately formatted for readability. The plotting tips in Lecture 2 are a helpful reference.

In [38]:
# The classwide M&Ms count data is stored at this URL
csv_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vTdZfVH8hX4t0TGGaHQcShiV4Tk-OLj1xcD4GUeCz2M-OKfqXn-4O1593LfWdUsnr9Dun4CV5ivjbWi/pub?gid=469698708&single=true&output=csv"
count_data = pd.read_csv(csv_url)

# Code here!
# --------------------------------- #

# --------------------------------- #

### Exercise 3

Aggregate all of your classmates' data into one giant "super sample" of M&Ms. Calculate the proportion of the "super sample" that is primary colored in the code cell below. **Make sure you print your result.**

Are you at all surprised by the result? If yes, what do you think could account for the discrepancy between your expectation and reality? **Answer in the cell below, in no more than three sentences.**

> Given approximately 80 rows of data and 10 M&Ms per bag, our super sample has nearly 800 M&Ms.
>
> Under an assumption of true randomness, the proportion you calculate in this exercise should be very close to the true proportion of M&Ms that are primary colored.
>
> For those interested in going down an M&Ms counting rabbit hole, [this article](https://qz.com/918008/the-color-distribution-of-mms-as-determined-by-a-phd-in-statistics) is a good start.

In [42]:
# Code here!
# --------------------------------- #

# --------------------------------- #

Answer here!

------------------------------------------------

------------------------------------------------

### Constructing parallel universes with statistics

Let's recap some notation we saw in class:

- $p$: the *population* proportion of M&M's that are primary colored

- $\hat{p}$: the *sample* proportion of M&M's that are primary colored

- Let's assume that each M&M's color is a random variable $X_i$, where the $X_i$ are generated i.i.d. (independent and identically distributed) from a Bernoulli distribution with probability of success $p$. In other words,

$$ X_i \sim \text{Bernoulli}(p) $$

where $x_i=1$ denotes a primary colored M&M and $x_i=0$ denotes a non-primary colored M&M.
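For reference, the standard results for the sample proportion under this i.i.d. Bernoulli model are written out below. Here $n$ denotes the number of M&M's in the sample, a symbol introduced for this summary; the expressions are presumably the same ones derived in class.

$$ \hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad E[\hat{p}] = p, \qquad Var(\hat{p}) = \frac{p(1-p)}{n} $$

The standard error of $\hat{p}$ is the square root of this variance.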

### The theoretical sampling distribution of $\hat{p}$


### Exercise 4

Taking into account the formulas derived in class for the expected value and variance of $\hat{p}$, answer the following four subquestions.


**Part (a)**: In the M&Ms counting setting, what is the plain language interpretation of the standard error? **Answer below in no more than two sentences.**

Answer here!

------------------------------------------------

------------------------------------------------

**Part (b)**: Using just your own sample of data, calculate the *estimated standard error* of the sampling distribution of $\hat{p}$ (*Hint*: you do not need to use the empirical variance formula for your estimate). Then, calculate the *true standard error* using our purported value of *p* obtained from the "super sample" (remember, in a realistic setting, *p* is not observed!). Finally, calculate the *empirical standard error* of the sampling distribution, using the collection of all point estimates. **Make sure you print the three results with the corresponding labels.**

In [53]:
# Code here!
# --------------------------------- #

# --------------------------------- #

**Part (c)**: Based on your results, do you feel comfortable using the estimated standard error from your single sample as an approximation of the true standard error? Why or why not? In a realistic setting, would we be able to compare the estimated standard error with the true standard error? **Answer below in no more than three sentences.**

Answer here!

------------------------------------------------

------------------------------------------------

**Part (d)**: Why do you think the empirical standard error is different from the true standard error? **Answer below in no more than two sentences.**

Answer here!

------------------------------------------------

------------------------------------------------

### The central limit theorem (CLT)

### Exercise 5

**Part (a)** Look up the conditions required for the CLT to apply to a Bernoulli distribution. In particular, find the commonly used rule of thumb involving $n$, $p$, and $(1 - p)$. Then, using what you found, determine whether each of the CLT's conditions is satisfied in the M&Ms context. **Address each condition briefly (no more than 1–2 sentences per bullet point).**

Answer here!

------------------------------------------------

------------------------------------------------

**Part (b)** Go back to the histogram you created in Exercise 2, which approximates the sampling distribution of the sample proportion. Based on the shape of the histogram, do you think the CLT applies? Does this visual evidence match your conclusion in part (a)? **Explain in 2–3 sentences.**

Answer here!

------------------------------------------------

------------------------------------------------

### Putting it all together

For estimators that satisfy its conditions, the CLT allows us to construct confidence intervals based on a normal approximation.

> We have arrived at why we should care about everything we have learned above: With confidence intervals in hand, we can make statistically-informed industry decisions.

In lecture, we saw how to derive approximate confidence intervals using properties of the normal distribution.
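As a rough sketch of how such an interval can be computed, the helper below builds a normal-approximation (Wald) interval for a proportion. The function name `normal_approx_ci` and the survey numbers are hypothetical and chosen only for illustration. The sketch assumes the sample is large enough for the normal approximation to be reasonable.

```python
import math
from statistics import NormalDist


def normal_approx_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Estimated standard error: plug p_hat into sqrt(p * (1 - p) / n).
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z * se_hat, p_hat + z * se_hat


# Hypothetical customer survey: 120 of 400 respondents answer "yes".
lower, upper = normal_approx_ci(successes=120, n=400)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Using `statistics.NormalDist` from the standard library avoids an extra dependency; `scipy.stats.norm.ppf` would serve the same purpose.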

### Exercise 6

**Part (a)**: Using your single M&M's sample and the formula for the estimated standard error of $\hat{p}$, use the normal approximation to construct a 95% confidence interval for $p$ in the code cell below. **Make sure to print the bounds of your confidence interval**.

In [69]:
# Code here!
# --------------------------------- #

# --------------------------------- #

**Part (b)**: Interpret your confidence interval in no more than one sentence. Does your interval contain the value of $p$ that we calculated from the super sample? **Write your answer in the cell below.**

Answer here!

------------------------------------------------

------------------------------------------------


### Exercise 7

Repeat the previous exercise for each of your classmates' samples (i.e., construct $N$ normally-approximated confidence intervals, where $N$ is the number of students who submitted M&Ms data).

**Part (a)**: What fraction of the confidence intervals contain the purported value of $p$ that we calculated from the "super sample"? **Do your work in the code cell below and make sure to print out the fraction you calculated**.

In [75]:
# Code here!
# --------------------------------- #

# --------------------------------- #

**Part (b)**: If the data generating process (DGP) were correct and all assumptions of the CLT were sufficiently satisfied, what would you expect this fraction to be? **Answer in no more than two sentences in the cell below.**

Answer here!

------------------------------------------------

------------------------------------------------

### Exercise 8

**Part (a)**: Repeat the coding portions of the previous exercise (7), but instead of using the *estimated standard error* for each confidence interval, use the *empirical standard error* calculated from the sampling distribution of the entire class's estimates. **Make sure you print your result.**

In [81]:
# Code here!
# --------------------------------- #

# --------------------------------- #

**Part (b)**: How does the fraction calculated in this exercise compare with the fraction calculated in the previous exercise? What might account for the discrepancy? **Answer in no more than two sentences in the cell below.**

Answer here!

------------------------------------------------

------------------------------------------------

### Exercise 9

Repeat the coding portions of Exercises 7 and 8 above, but construct 80% confidence intervals instead of 95% confidence intervals. How have your results changed? **Answer in no more than three sentences.**
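The only ingredient of a normal-approximation interval that depends on the confidence level is the critical value. The sketch below, which uses a hypothetical helper named `z_critical`, shows one way to look it up for each level with the standard library.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided standard normal critical value for a given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


for level in (0.80, 0.95):
    print(f"{level:.0%} confidence -> z = {z_critical(level):.3f}")
```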

In [87]:
# Code here for repeating Exercise 7 with 80% CI!
# --------------------------------- #

# --------------------------------- #

In [89]:
# Code here for repeating Exercise 8 with 80% CI!
# --------------------------------- #

# --------------------------------- #

Answer here!

------------------------------------------------

------------------------------------------------

## Converting to PDF

Use the cell below to convert your notebook to PDF, following the instructions at the beginning of the notebook.

In [93]:
!apt-get update -qq > /dev/null
!apt-get install -qq --fix-missing pandoc texlive-latex-base texlive-latex-extra > /dev/null
!jupyter nbconvert --to latex "/content/HW2.ipynb" > /dev/null
!sed -i 's/❗/!/g' /content/HW2.tex
!pdflatex -interaction=nonstopmode -halt-on-error "/content/HW2.tex" > /dev/null

