# Homework 3: Confidence Intervals and Bootstrap

**Due**: Tuesday Apr 25th

### HW Logistics

- **Release**: Every week there will be a HW assignment released on *Wednesday* and due the following *Tuesday at 11:59pm*. 
Homework will be posted to the course website. 
- **Format**: We expect students to complete the homework notebooks using Google Colab (see Discussion 1), but this is not explicitly required and you may use whatever software you would like to run notebooks. 
- **Answers**: As a general guiding policy, you should always try to make it as clear as possible what your answer to each question is, and how you arrived at your answer. Generally speaking, this will mean including all code used to generate results, outputting the actual results to the notebook, and (when necessary) including written answers to support your code.
- **Submission**: Homeworks will be *submitted to Gradescope*, and we expect all students to do question matching on Gradescope upon submission.
- **Late Policy**: All students are allowed 7 total slip days for the quarter, and at most 5 can be used for a single HW assignment. There will be no late credit if you have used up all your slip days. Also, your lowest HW grade will be dropped.

In [75]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()
from scipy import stats

## Question 1: Time Spent on HW

Course staff is interested in understanding how much time students spend on homework assignments. Based on data from previous years, you are told that the true distribution for hours spent on homework is the following discrete distribution:

<center>

| Hours | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Probability | 0.05 | 0.1 | 0.25 | 0.45 | 0.15 |

</center>

**Part (a)**: What is the expected value of the above distribution? What is the variance? Write code to show the calculations used to arrive at your answers.

In [77]:
# The mean is sum_j x_j p_j
hours = np.array([1, 2, 3, 4, 5])
probs = np.array([0.05, 0.1, 0.25, 0.45, 0.15])
mean_hours = np.sum(hours * probs)
print(f"Mean hours: {mean_hours}")

# The variance is EX^2 - (EX)^2
var_hours = np.sum(hours**2 * probs) - mean_hours**2
print(f"Variance of hours: {np.round(var_hours, 2)}")

Mean hours: 3.55
Variance of hours: 1.05


**Part (b)**: Generate a sample of size $n = 100$ (with replacement) from the above distribution. Use your sample to calculate a point estimate for the mean and variance of the distribution. Lastly, calculate a 95% confidence interval for the mean of the distribution, using the analytical formula to estimate the standard error.

In [86]:
# Set seed
np.random.seed(10)

# -------  Write code below ---------- #
hours_df = pd.DataFrame({
    "hours": hours,
    "probs": probs
})

samples = hours_df.sample(n=100, weights="probs", replace=True)["hours"]
print(f"Sample mean: {samples.mean()}")
print(f"Sample var: {np.round(samples.var(), 2)}")

stderr = np.sqrt(samples.var() / 100)
print(f"CI: ({np.round(samples.mean() - 1.96*stderr, 4)}, {np.round(samples.mean() + 1.96*stderr, 4)})")

# Also valid
# print(f"CI: ({stats.norm.interval(0.95, loc=samples.mean(), scale=stderr)}")

Sample mean: 3.49
Sample var: 1.06
CI: (3.2882, 3.6918)


**Part (c)**: Repeat the process from part (b) for 1000 trials, with each trial generating a new sample and corresponding point estimates and confidence intervals. For each trial, calculate whether or not the confidence interval contains the true mean calculated in part (a). In what percent of trials did the confidence interval contain the true mean? Does this align with expectations?

In [87]:
# Set seed
np.random.seed(10)

# -------  Write code below ---------- #

n_trials = 1000
covers = np.zeros(n_trials)
for i in range(n_trials):
    samples = hours_df.sample(n=100, weights="probs", replace=True)["hours"]
    mean = samples.mean()
    stderr = np.sqrt(samples.var() / 100)
    lower = samples.mean() - 1.96*stderr
    upper = samples.mean() + 1.96*stderr
    covers[i] = mean_hours >= lower and mean_hours <= upper

print(f"Coverage: {covers.mean()}")

Coverage: 0.946


We get coverage on 94.6% of trials. If we constructed our confidence interval correctly, then we should expect to get coverage close to 95%, since are construction 95% confidence intervals, and indeed our coverage is close to 95%.

## Question 2: Clickthrough rate

Suppose a company has a website with an advertisement. The company believes that the ad is working if at least 30% of website visitors click on the ad. In order to access whether or not the ad is successful, the company collects data for 50 website visitors, of which 18 click on the advertisement.

**Part (a)**: Calculate a point estimate for the clickthrough rate, i.e. the proportion of website visitors that clicked on the ad. Then, calculate a 95% confidence interval for the clickthrough rate using the analytical formula to estimate the standard error. Based on your CI, do you think you should tell the company that the ad is work?

In [91]:
phat = 18 / 50
stderr = np.sqrt(phat * (1 - phat) / 50)
print(f"CI: {(np.round(phat - 1.96*stderr, 4), np.round(phat + 1.96*stderr, 4))}")
# Also valid
# print(f"CI: ({stats.norm.interval(0.95, loc=phat, scale=stderr)}")

CI: (0.227, 0.493)


The confidence interval contains 0.30, and so we cannot be very confident that the clickthrough rate is actually greater than 30%. Thus, we should tell the company that we do not know that the clickthrough rate is >30%. However, since the point estimate is >0.30, it would be reasonable to suggest that we should collect more data, and with more data, we may be able to more effectively conclude that the ad is working.

**Part (b)**: Calculate a 95% confidence interval for the clickthrough rate based on *bootstrap* samples from the observed data, using both the percentile method and the normal approximation method. For each, use 1000 bootstrap replications. Is your recommendation to the company the same as in part (a)?

In [92]:
# Set seed
np.random.seed(10)

# -------  Write code below ---------- #

df = pd.DataFrame({
    "outcome": ["clicked", "not clicked"],
    "prob": [0.35, 0.65]
})

n_trials = 1000
phats = np.zeros(n_trials)
for i in range(n_trials):
    samples = df.sample(n=50, weights="prob", replace=True)
    phats[i] = samples.groupby("outcome").size()["clicked"] / 50 

print(f"Percentile CI: ({np.percentile(phats, 2.5)}, {np.percentile(phats, 97.5)})")
boot_mean = phats.mean()
boot_std = phats.std()
# print(f"Normal Approx CI: ({stats.norm.ppf(0.025, boot_mean, boot_std)}, {stats.norm.ppf(0.975, boot_mean, boot_std)})")
print(f"CI: {(np.round(boot_mean - 1.96*boot_std, 4), np.round(boot_mean + 1.96*boot_std, 4))}")

Percentile CI: (0.22, 0.48)
CI: (0.2212, 0.4808)


Both confidence intervals are very close to the confidence interval we obtained in part (a), and thus the recommendation is the same.

## Question 3: SAT Scores

The file `sat_data.csv` contains a collection of SAT math, reading, and writing scores. Use the following code to read in the dataset.

In [67]:
sat_data = pd.read_csv("https://raw.githubusercontent.com/stanford-mse-125/homework/main/data/sat_data.csv")

**Part (a)**: Calculate a point estimate for the 80th percentile score in each subject area using the sat scores dataset.

In [63]:
sat_data.quantile(0.8)

Critical_Reading_Score    528.2
Math_Score                549.0
Writing_Score             520.0
Name: 0.8, dtype: float64

**Part (b)**: Using the bootstrap percentile method, calculate 95\% confidence intervals for the 80th percentile of SAT scores for each subject area. Print your resulting confidence intervals.

In [93]:
# Set seed
np.random.seed(10)

# -------  Write code below ---------- #

n_trials = 1000
percentiles = np.zeros((n_trials, 3))
for i in range(n_trials):
    samples = sat_data.sample(n=sat_data.shape[0], replace=True)
    percentiles[i, :] = samples.quantile(0.8)

bootstrap_data = pd.DataFrame(percentiles, columns=sat_data.columns)

for c in bootstrap_data.columns:
    lower = bootstrap_data[c].quantile(0.025)
    upper = bootstrap_data[c].quantile(0.975)
    print(f"CI for {c}: ({np.round(lower, 4), np.round(upper, 4)})")

CI for Critical_Reading_Score: ((517.59, 536.0))
CI for Math_Score: ((538.39, 563.4))
CI for Writing_Score: ((511.4, 528.2))


**Part (c)**: If you look [here](https://blog.prepscholar.com/sat-historical-percentiles-for-2014-2013-2012-and-2011), you should notice that the actual 80th percentile scores from 2015 (the last year with 3 subject areas) are higher than the point estimates we obtained in part (a). List 2 reasons why the percentiles in our dataset may be lower. 

*Answer* A few examples:
- The data is biased demographically, e.g. all samples come from a county with historically lower SAT scores.
- The data comes from a year with abnormally low SAT scores.

**Part (d)**: Do you think the percentiles listed in the linked website from part (c) should have confidence intervals, like the ones we calculated in part (d)? Why or why not?

*Answer*: They should not, because they are population statistics and thus there is not uncertainty.