In [None]:
#: the usual imports
import babypandas as bpd
import numpy as np
import sys
sys.path.append('./data')

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 13

### The Bootstrap and Confidence Intervals

## Question

- What is the median salary of San Diego city employees?
- All city employee salary data is public.

In [None]:
#: read in the data
population = bpd.read_csv('data/salaries.csv')
population

## Only need the total pay...

In [None]:
population = population.get(['Total Pay'])
population

## The median salary

- We can use `.median()`:

In [None]:
#: the median of the "Total Pay" column
population_median = population.get('Total Pay').median()
population_median

## But now...

- ...suppose we don't have access to this data.
- It is costly and time-consuming to survey *all* 11,000+ employees.
- So we gather salaries for a random sample of, say, 500 people.
- Hope the median of the sample $\approx$ median of the population.

## In the language of statistics...

- The full table of salaries is the **population**.
- We observe a **sample** of 500 salaries from the population.
- We really want the population median, but we don't have the whole population.
- So we compute the sample median as an **estimate**.
- Hopefully the sample median is close to the population median.

## The sample median

- Let's survey 500 employees at random.
- We can use `.sample()`:

In [None]:
#: take a sample of size 500
sample = population.sample(500, replace=False)

In [None]:
#: compute the sample median
sample_median = sample.get('Total Pay').median()
sample_median

## How confident are we?

- Our estimate depended on a random sample.
- If our sample had been different, our estimate would have been different, too.
- **How different could the estimate have been?**
- Our confidence in the estimate depends on the answer.

## The sample median is random

- The sample median is a random number.
- It comes from some distribution, which we don't know.
- How different could the estimate have been?
    - "Narrow" distribution $\Rightarrow$ not too different
    - "Wide" distribution $\Rightarrow$ quite different
- **What is the distribution of the sample median?**

## A (costly) approach

- Every sample of 500 people gives me one observation of the sample median.
- So draw a bunch of samples, compute medians.
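
The repeated-sampling idea above can be sketched in plain NumPy. This is only an illustration under stated assumptions: `fake_population` is synthetic lognormal data standing in for the salary table (not the real data), and the names and seed are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the salary population (NOT the real data):
# 11,000 right-skewed values drawn from a lognormal distribution.
fake_population = rng.lognormal(mean=11, sigma=0.5, size=11_000)

n_samples = 128      # number of fresh samples to draw
sample_size = 500    # people per sample

# Each fresh sample (drawn without replacement) yields one sample median.
fake_sample_medians = np.array([
    np.median(rng.choice(fake_population, size=sample_size, replace=False))
    for _ in range(n_samples)
])

# Compare the spread of the sample medians to the population median.
print(fake_sample_medians.min(), np.median(fake_population), fake_sample_medians.max())
```

Every entry of `fake_sample_medians` costs a brand new survey of 500 people, which is exactly why this approach is expensive in practice.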

In [None]:
#: imports for animation
from lecture import sampling_animation
from IPython.display import HTML

In [None]:
%%capture
anim, sample_medians = sampling_animation(population)

In [None]:
#: display animation
HTML(anim.to_jshtml())

## Visualize the distribution

- We can plot the distribution of the sample median with a histogram.
- This is an approximation using 128 samples.
- Sample median is usually in [62,000, 70,000].

In [None]:
#: plot a histogram
bpd.DataFrame().assign(SampleMedians=sample_medians).plot(kind='hist', bins=15, density=True)

## The problem

- Drawing new samples like this is costly (why not just do a census?)
- Often, we can't ask for new samples from the population.
- What about sampling the sample?
- The **bootstrap**:
    - the sample itself looks like the population.
    - so re-sampling from the sample is like drawing from the population.

In [None]:
fig, ax = plt.subplots()
bins=np.arange(10_000, 300_000, 10_000)
population.plot(kind='hist', y='Total Pay', ax=ax, density=True, alpha=.75, bins=bins)
sample.plot(kind='hist', y='Total Pay', ax=ax, density=True, alpha=.75, bins=bins)
plt.legend(['Population', 'Sample'])

## The bootstrap

- We have a sample of 500 salaries, and we want another.
- Can't draw from the population.
- But the original sample looks like the population.
- So we re-sample the sample.

## Discussion question

Which of these effectively resamples the sample, simulating the drawing of a new sample of 500 people?

- A) `np.random.choice(sample, 500, replace=True)`
- B) `np.random.choice(sample, 500, replace=False)`
- C) `sample.sample(sample.shape[0], replace=True)`
- D) `sample.sample(sample.shape[0], replace=False)`

## Answer

- If we sample without replacement, we're just shuffling.
- So we sample *with* replacement to get something new.
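
A small sketch of why replacement matters, using a made-up toy sample of five salaries and NumPy's `Generator.choice` (the numbers and names here are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy sample of five salaries (made-up numbers).
toy_sample = np.array([40_000, 55_000, 62_000, 71_000, 90_000])

# Without replacement, taking all 5 values just reorders them.
shuffled = rng.choice(toy_sample, size=len(toy_sample), replace=False)

# With replacement, some values can repeat and others can be left out.
resampled = rng.choice(toy_sample, size=len(toy_sample), replace=True)

# The shuffled copy has the same values, hence the same median.
print(sorted(shuffled.tolist()) == sorted(toy_sample.tolist()))
print(np.median(shuffled) == np.median(toy_sample))

# The resample only contains values from the sample, possibly with repeats.
print(resampled)
```

Because the shuffled copy always has the same median as the original sample, it tells us nothing new about how the estimate could vary.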

## Running the bootstrap

- Now we can simulate new samples by bootstrapping.
- I.e., we sample with replacement from our original sample.

In [None]:
n_resamples = 5000

boot_medians = np.array([])
for i in range(n_resamples):
    # perform bootstrap resampling
    resample = sample.sample(500, replace=True)
    
    # compute the median
    median = resample.get('Total Pay').median()
    
    # tack it on to our list of medians
    boot_medians = np.append(boot_medians, median)

## Bootstrap distribution of the sample median

- Most of the time, the sample median is in [60,000, 70,000].
- Similar to what we found before.
- The population median (red dot) is near the middle.

In [None]:
#: visualize
bpd.DataFrame().assign(BootstrapMedians=boot_medians).plot(kind='hist')
plt.scatter(population_median, 0, color='red', s=80).set_zorder(2)

## Bootstrap rules of thumb

- The bootstrap is an awesome tool:
    - We used just one sample to get the (approximate) distribution of the sample median.
- But it has limitations:
    - Not good for sensitive statistics, like maximum.
    - Requires sample to be good approximation of population.
    - Works best when population is roughly bell-shaped.
    - Can be slow (recommend 10,000+ bootstrap samples)
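
If speed becomes a concern, one possible workaround is to draw all resamples at once with NumPy instead of looping. This is a sketch of an alternative technique, not the approach used in this lecture; `fake_sample` is synthetic data and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sample of 500 salaries (synthetic, for illustration only).
fake_sample = rng.lognormal(mean=11, sigma=0.5, size=500)

n_resamples = 10_000

# Draw every resample in one call: one row per bootstrap resample.
resamples = rng.choice(fake_sample, size=(n_resamples, len(fake_sample)), replace=True)

# The median of each row is one bootstrap median.
fast_boot_medians = np.median(resamples, axis=1)

print(fast_boot_medians.shape)
```

The trade-off is memory: the `resamples` array holds `n_resamples * len(fake_sample)` numbers at once, so this only suits samples of modest size.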

## Example: bootstrapping in the German aircraft problem

- We observe a random sample of 30 planes.
- Our goal: estimate total # of planes from serial numbers of 30 planes.

In [None]:
#: we don't know this, but there are actually 400 planes in total
plane_population = bpd.DataFrame().assign(SerialNumber=np.arange(400))

In [None]:
#: sample 30 planes
np.random.seed(4242)
plane_sample = plane_population.sample(30, replace=False)

## Running the bootstrap

- We want to estimate the maximum number in the population
- Our estimator will be the max in the sample.
- We run the bootstrap:

In [None]:
n_resamples = 5000

boot_maxes = np.array([])
for i in range(n_resamples):
    # resample
    resample = plane_sample.sample(plane_sample.shape[0], replace=True)
    
    # compute max
    boot_max = resample.get('SerialNumber').max()
    
    boot_maxes = np.append(boot_maxes, boot_max)

## Visualize

- The bootstrap distribution doesn't surround the true maximum of 399: no resample can contain a value larger than the sample maximum.

In [None]:
bpd.DataFrame().assign(BootstrapMax=boot_maxes).plot(kind='hist', bins=20)
plt.scatter(399, 0, color='red')
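
The failure can be checked numerically. The sketch below mirrors the setup above in plain NumPy, but with its own seed and hypothetical variable names, so the exact numbers will differ from the cells in this notebook.

```python
import numpy as np

rng = np.random.default_rng(3)

# Serial numbers 0..399, mirroring the setup above (seed and names are illustrative).
serials = np.arange(400)
toy_plane_sample = rng.choice(serials, size=30, replace=False)

# Bootstrap the maximum: resample the 30 observed serial numbers with replacement.
toy_boot_maxes = np.array([
    rng.choice(toy_plane_sample, size=30, replace=True).max()
    for _ in range(5000)
])

# No bootstrap maximum can exceed the largest serial number we observed.
print(toy_plane_sample.max(), toy_boot_maxes.max())

# Fraction of resamples whose max equals the sample max.
frac_at_sample_max = np.mean(toy_boot_maxes == toy_plane_sample.max())
print(frac_at_sample_max)
```

A resample of size 30 misses the sample maximum with probability $(29/30)^{30} \approx 0.36$, so roughly 64% of the bootstrap maxes pile up exactly at the sample maximum and none ever go above it.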

# Confidence intervals

## Confidence intervals

- Bootstrapping approximates the distribution of an estimate.
- The true value typically lies within the bulk of the distribution.
- Rather than returning only a single number, we can give an interval that we are confident contains the correct value.

## A 95% confidence interval for median salary

- Recall our bootstrap distribution of the sample median
- Suppose by "bulk", we mean containing the middle 95% of the area.

In [None]:
#: visualize
bpd.DataFrame().assign(BootstrapMedians=boot_medians).plot(kind='hist')

## Finding the endpoints

- We want to find two points, $x$ and $y$, such that the area:
    - to the left of $x$ is about 2.5%
    - to the right of $y$ is about 2.5%
- Then the interval $[x,y]$ will have about 95% of the total area
- I.e., we want the 2.5th **percentile** and 97.5th **percentile**.

In [None]:
np.percentile

## Computing percentiles

- Use `np.percentile(array, percentile)` function:
    - First arg: array of values
    - Second arg: percentile to find as # in [0, 100]
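
As a quick sanity check of how `np.percentile` behaves, here is a sketch on a simple made-up array (the integers 1 through 100), where the answers are easy to eyeball:

```python
import numpy as np

values = np.arange(1, 101)   # the integers 1 through 100

# The 2.5th and 97.5th percentiles cut off a small tail on each side.
low = np.percentile(values, 2.5)
high = np.percentile(values, 97.5)

# Fraction of the values that land between the two percentiles.
frac_inside = np.mean((values >= low) & (values <= high))

print(low, high, frac_inside)
```

By default `np.percentile` interpolates between neighboring values, so the endpoints need not be values that appear in the array, and the fraction inside is only approximately 95% for a small array.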

In [None]:
# left
left = np.percentile(boot_medians, 2.5)
left

In [None]:
# right
right = np.percentile(boot_medians, 97.5)
right

In [None]:
#: our interval is
[left, right]

## Visualizing our 95% confidence interval

- Let's draw the interval on the histogram.
- 95% of the bootstrap medians fell into this interval.

In [None]:
#: visualize
bpd.DataFrame().assign(BootstrapMedians=boot_medians).plot(kind='hist')
plt.plot([left, right], [0, 0], color='lime', linewidth=10)

## Discussion question

Would an 80% confidence interval be bigger, smaller, or the same size?

- A) Bigger
- B) Smaller
- C) The same size

## Discussion question

Suppose you had the true distribution of the sample median and used it to compute a 100% confidence interval. And suppose you compute a 100% confidence interval using the bootstrap. Which is bigger?

- A) The first confidence interval (from the true distribution).
- B) The second confidence interval (from the bootstrap).
- C) They're the same size.

## Interpreting confidence intervals

- 95% of our bootstrap medians fell within this interval
- We're pretty confident that the true median does, too.
- How confident should we be about this?

## Capturing the correct value

- If we collect a new sample and run the bootstrap again, we get a different distribution.
- And so we get a different 95% confidence interval.
- (Roughly) 95% of the time, the interval will capture the correct median.
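
One way to check the 95% figure is to simulate it: draw many fresh samples from a known population, bootstrap an interval from each, and count how often the interval captures the population median. The sketch below does this on synthetic data; `fake_population`, `bootstrap_ci`, and the sizes used are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical population (synthetic lognormal values, NOT the real salaries).
fake_population = rng.lognormal(mean=11, sigma=0.5, size=11_000)
true_median = np.median(fake_population)

def bootstrap_ci(sample, n_resamples=500, level=95):
    """Percentile bootstrap interval for the median of `sample`."""
    resamples = rng.choice(sample, size=(n_resamples, len(sample)), replace=True)
    medians = np.median(resamples, axis=1)
    tail = (100 - level) / 2
    return np.percentile(medians, tail), np.percentile(medians, 100 - tail)

n_trials = 200
captured = 0
for _ in range(n_trials):
    # A brand new sample from the population, then one interval from it.
    new_sample = rng.choice(fake_population, size=200, replace=False)
    lo, hi = bootstrap_ci(new_sample)
    if lo <= true_median <= hi:
        captured += 1

coverage = captured / n_trials
print(coverage)
```

With only 200 trials the observed `coverage` is itself a noisy estimate, so expect it to land near 0.95 rather than exactly on it.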

## Interpreting confidence intervals

- Doesn't have to be for same experiment!
- Suppose you only ever make 95% confidence intervals.
- Then roughly 95% of the CIs you make in your life will contain the true value of the thing being estimated.

## Misinterpreting confidence intervals

- A 95% confidence interval has a 95% chance of containing the true value of the thing being estimated.
- The interval is random, not the thing being estimated!

## Misinterpreting confidence intervals

- Our 95% confidence interval for the median salary was:

In [None]:
#: remember...
[left, right]

- This does not mean that 95% of salaries are in this range!
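
To see the difference concretely, the sketch below builds a bootstrap interval for the median of a synthetic sample and then counts how many of the individual values fall inside it. The data and names are hypothetical stand-ins, not the salary sample from this lecture.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical sample of 500 salaries (synthetic, for illustration only).
fake_sample = rng.lognormal(mean=11, sigma=0.5, size=500)

# Bootstrap a 95% interval for the median.
resamples = rng.choice(fake_sample, size=(5000, len(fake_sample)), replace=True)
fake_boot_medians = np.median(resamples, axis=1)
ci_left = np.percentile(fake_boot_medians, 2.5)
ci_right = np.percentile(fake_boot_medians, 97.5)

# Fraction of individual salaries that fall inside the interval.
frac_salaries_inside = np.mean((fake_sample >= ci_left) & (fake_sample <= ci_right))

print(ci_left, ci_right, frac_salaries_inside)
```

The interval describes where the *median* plausibly lies, so it is narrow, and only a small share of individual salaries fall inside it.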

## Example: Estimating proportions

- Can use the bootstrap to get confidence intervals on other things.
- Such as: proportions.

In [None]:
import pandas as pd

In [None]:
this_section = bpd.read_csv('data/eldridge-2020.csv')
this_section

## Discussion question

What is the most popular college in this section?

- A) Sixth
- B) Warren
- C) Revelle
- D) Marshall

## Answer

In [None]:
this_section.groupby('College').count().sort_values('Major', ascending=False)

## Estimation

- The proportion of students in Warren is...
- This is an *estimate* of the proportion in the population.
- But what is the population?

In [None]:
proportion = this_section[this_section.get('College') == 'WA'].shape[0] / this_section.shape[0]
proportion

## Bootstrapped confidence interval

- Let's bootstrap a 95% CI for the proportion

In [None]:
#: run the bootstrap
n_resamples = 5000

boot_props = np.array([])
for i in range(n_resamples):
    resampled = this_section.sample(this_section.shape[0], replace=True)
    boot_prop = resampled[resampled.get('College') == 'WA'].shape[0] / resampled.shape[0]
    boot_props = np.append(boot_props, boot_prop)

## Visualizing the distribution

In [None]:
#: visualize
bpd.DataFrame().assign(BootstrapProportions=boot_props).plot(kind='hist', bins=20)
plt.scatter(proportion, 0, color='red', s=40).set_zorder(2)

## Computing the confidence interval

In [None]:
#: left endpoint
left = np.percentile(boot_props, 2.5)
left

In [None]:
#: right endpoint
right = np.percentile(boot_props, 97.5)
right

In [None]:
#: the interval
[left, right]

## Visualizing the confidence interval

In [None]:
#: visualize
bpd.DataFrame().assign(BootstrapProportions=boot_props).plot(kind='hist', bins=20)
plt.scatter(proportion, 0, color='red', s=40).set_zorder(10)
plt.plot([left, right], [0, 0], color='lime', zorder=3)

## My class in FA18

- If last year's class was drawn from same population, its proportion is likely to be in this interval.
- Why? The interval was made by simulating draws from the population.

In [None]:
#: compute proportion in Warren for other section
other_section = bpd.read_csv('data/eldridge-2018.csv')
other_proportion = other_section[other_section.get('College') == 'WA'].shape[0] / other_section.shape[0]
other_proportion

In [None]:
#: remember the interval
[left, right]

## Are they from the same distribution?

- A/B test!
- New columns:
    - "Warren": True/False, if in Warren.
    - "Section": 'This' or 'Other'

In [None]:
#: adding columns...
this_section_in_warren = this_section.assign(
    Warren=this_section.get('College') == 'WA',
    Section=['This']*this_section.shape[0]
)

other_section_in_warren = other_section.assign(
    Warren=other_section.get('College') == 'WA',
    Section=['Other']*other_section.shape[0]
)

## Combine the sections

In [None]:
#: combine the sections
combined = this_section_in_warren.append(other_section_in_warren)
combined = combined.get(['Warren', 'Section'])
combined

## Statistic

- The absolute difference between the group proportions

In [None]:
#: the absolute difference in proportion between groups
def statistic(combined):
    group_proportions = combined.groupby('Section').mean().get('Warren')
    return abs(group_proportions.loc['This'] - group_proportions.loc['Other'])

## Permutation test

In [None]:
#: permutation test
n_shuffles = 500

differences = np.array([])
for i in range(n_shuffles):
    shuffled_sections = np.random.permutation(combined.get('Section'))
    shuffled = combined.assign(Section=shuffled_sections)
    difference = statistic(shuffled)
    differences = np.append(differences, difference)

## Visualize

In [None]:
#: visualize
bpd.DataFrame().assign(Differences=differences).plot(kind='hist')
plt.scatter(statistic(combined), 0, color='red', s=40).set_zorder(2)

## Calculate a p-value

In [None]:
#: the p-value
np.count_nonzero(differences >= statistic(combined)) / len(differences)
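
The whole permutation test can also be sketched end to end in plain NumPy. This version uses made-up data in which both groups share the same underlying proportion; the group sizes, the 0.25 proportion, and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical data: True means "in Warren". Both groups are generated with the
# same underlying proportion (0.25), so the null hypothesis holds by construction.
group_a = rng.random(150) < 0.25
group_b = rng.random(120) < 0.25

def abs_diff_in_proportions(in_warren, labels):
    """Absolute difference between the two groups' proportions of True values."""
    return abs(in_warren[labels == 0].mean() - in_warren[labels == 1].mean())

in_warren = np.concatenate([group_a, group_b])
labels = np.concatenate([
    np.zeros(len(group_a), dtype=int),   # 0 marks the first group
    np.ones(len(group_b), dtype=int),    # 1 marks the second group
])

observed = abs_diff_in_proportions(in_warren, labels)

# Shuffle the group labels many times and recompute the statistic each time.
sim_diffs = np.array([
    abs_diff_in_proportions(in_warren, rng.permutation(labels))
    for _ in range(2000)
])

# p-value: share of shuffled statistics at least as large as the observed one.
p_value = np.count_nonzero(sim_diffs >= observed) / len(sim_diffs)
print(observed, p_value)
```

A small p-value would suggest the two groups were not drawn from the same distribution; a large one means the observed difference is consistent with chance.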