In [3]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 04

## Percentiles 

Suppose we wanted to manually compute the 55th percentile of the following array:

In [6]:
x = make_array(43, 20, 51, 7, 28)

array([43, 20, 51,  7, 28])

**Step 1.** To compute percentiles, we first sort the data.

In [5]:
sorted_x = ... # EXERCISE
sorted_x

Ellipsis

In [2]:
ptbl = Table().with_columns(
    "Percentile", 100*(np.arange(0, len(x))+1)/len(x),
    "Element", sorted_x)
ptbl


**Step 2.** Figure out where the $p^\text{th}$ percentile would be.

In [None]:
p = 55
ind = ... # EXERCISE
ind

In [None]:
sorted_x.item(ind)

The above calculation is confusing and brittle (try p=0).  Instead, we should use the `percentile` function.
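For reference, the manual steps above can be sketched in plain Python. This is a minimal sketch that assumes the nearest-rank rule implied by the percentile table (the $p^\text{th}$ percentile is the smallest element that is at least as large as $p\%$ of the elements). The name `nearest_rank_percentile` is hypothetical and is not part of the `datascience` library.

```python
import math

def nearest_rank_percentile(p, values):
    """Smallest element that is at least as large as p% of the elements."""
    assert 0 <= p <= 100, "p must be a percent between 0 and 100"
    ordered = sorted(values)
    if p == 0:
        # ceil(0) - 1 would be index -1 (the largest element), so handle 0 directly
        return ordered[0]
    index = math.ceil(p * len(ordered) / 100) - 1
    return ordered[index]

x = [43, 20, 51, 7, 28]
print(nearest_rank_percentile(55, x))   # 28
print(nearest_rank_percentile(0, x))    # 7
print(nearest_rank_percentile(100, x))  # 51
```

The special case for `p == 0` is exactly the brittleness mentioned above: without it, the index formula wraps around to the end of the sorted list.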

### Using the Percentile Function

In [None]:
percentile?

Recall the percentile table.

In [None]:
ptbl

Let's try a few values.

In [None]:
percentile(50, x)

In [None]:
percentile(55, x)

In [None]:
percentile(0, x)

In [None]:
percentile(100, x)

---
<center> Return to Slides </center>

---

## Discussion Question

In [None]:
s = make_array(1, 3, 5, 7, 9)

In [None]:
Table().with_columns(
    "Percentile", 100*(np.arange(0, len(s))+1)/len(s),
    "Element", sorted(s))

In [None]:
percentile(10, s) == 0

In [None]:
percentile(39, s) == percentile(40, s)

In [None]:
percentile(40, s) == percentile(41, s)

In [None]:
percentile(50, s) == 5

---
<center> Return to Slides </center>

---

## Inference: Estimation

To demonstrate the process of estimating a parameter, let's examine the 2019 San Francisco public records.  We obtained this data from the [SF Open Data Portal](https://datasf.org/opendata/).  For the purposes of this exercise, we will assume that this is a census of the compensation data: that it contains the compensation for all public workers.

In [None]:
sf = Table.read_table('data/san_francisco_2019.csv')
sf.show(3)

Suppose we are interested in studying `"Total Compensation"`.  Let's make a histogram of the total compensation.

Who is getting paid the most?

Who is getting paid the least?

There is a clear spike around **zero**!  Why?

We will focus on those who earned at least the equivalent of 20 hours per week at minimum wage for an entire year.

In [None]:
min_salary = 15 * 20 * 50 # $15/hr, 20 hr/wk, 50 weeks
print("Min Salary", min_salary)

sf = ... # EXERCISE filter total compensation above min_salary

In [None]:
salary_bins = np.arange(min_salary, 500000, 10000)
sf.hist("Total Compensation", bins=salary_bins)

### The Population Parameter

Here we have access to the population so we can compute parameters directly.  

For example, suppose we were interested in the median compensation.  Then we could compute it directly on our data:

In [None]:
pop_median = percentile(50, sf.column("Total Compensation"))
pop_median

In most real-world settings, you won't have access to the population.  Instead, you will take a random sample. 

Suppose we sample 400 people from our population.

In [None]:
# An Empirical Distribution
our_sample = ... # EXERCISE: sample 400 from population
our_sample.hist('Total Compensation', bins=salary_bins)

We can use the sample median (statistic) as an estimate of the parameter value.

In [None]:
# Estimate: Median of a Sample
percentile(50, our_sample.column('Total Compensation'))
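The parameter-versus-statistic idea can also be tried without the SF data. The sketch below uses only the standard library and a synthetic, made-up population (log-normal values chosen only to look roughly like skewed salaries). `statistics.median_low` is used because it returns an actual data value, as a nearest-rank 50th percentile does. Drawing without replacement is a choice of this sketch, not necessarily what the notebook's sampling call does.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "population" of compensations (NOT the SF data).
population = [rng.lognormvariate(11.8, 0.5) for _ in range(20_000)]
pop_median = statistics.median_low(population)    # the parameter

# One random sample of 400 people, drawn without replacement.
sample = rng.sample(population, 400)
sample_median = statistics.median_low(sample)     # the statistic (our estimate)

print("parameter:", round(pop_median))
print("estimate: ", round(sample_median))
```

The estimate will usually be close to the parameter but not equal to it, which is what motivates asking how variable the estimate is.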

But in the real world we won't be able to keep going back to the population. How do we generate a new random sample *without going back to the population?*

---
<center> Return to Slides </center>

---

## Variability of the Estimate

If we could get additional samples from the population, how much variability would there be in our estimate of the median?

In [None]:
def generate_sample_median(samp_size):
    ... # EXERCISE

In [None]:
generate_sample_median(400)

## Quantifying Uncertainty

Because we have access to the population, we can simulate many samples from the population:

In [None]:
sample_medians = make_array()

for i in np.arange(1000):
    ... # EXERCISE

In [None]:
med_bins = np.arange(120000, 160000, 1000)
Table().with_column('Sample Medians', sample_medians).hist(bins=med_bins)

plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');

What happens if we do the same thing again with larger samples (800 instead of 400)?

In [None]:
sample_medians2 = make_array()

for i in np.arange(1000):
    new_median = generate_sample_median(800)
    sample_medians2 = np.append(sample_medians2, new_median)

In [None]:
(Table()
     .with_columns("Sample Medians", sample_medians,
                   "Sample Size", 400)
     .append(Table().with_columns("Sample Medians", sample_medians2,
                                  "Sample Size", 800))
     .hist("Sample Medians", group="Sample Size", bins=med_bins)
)
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color='red');
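The effect of sample size on the spread of the sample medians can be checked numerically as well as visually. This is a standard-library sketch on a synthetic population (made-up log-normal values, not the SF data). The helper name `one_sample_median` is hypothetical.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Synthetic "population" of compensations (NOT the SF data).
population = [rng.lognormvariate(11.8, 0.5) for _ in range(20_000)]

def one_sample_median(sample_size):
    """Median of one fresh random sample drawn from the population."""
    return statistics.median_low(rng.sample(population, sample_size))

medians_400 = [one_sample_median(400) for _ in range(1000)]
medians_800 = [one_sample_median(800) for _ in range(1000)]

# Standard deviation of the simulated medians measures their variability.
spread_400 = statistics.stdev(medians_400)
spread_800 = statistics.stdev(medians_800)
print("SD of medians, n=400:", round(spread_400))
print("SD of medians, n=800:", round(spread_800))
```

Larger samples should give a narrower distribution of medians, matching the overlaid histograms.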

But in the real world we won't be able to keep going back to the population. How do we generate a new random sample *without going back to the population?*

---
<center> Return to Slides </center>

---

# Bootstrap

Sample randomly
 - from the original sample
 - with replacement
 - the same number of times as the original sample size

**Step 1:** Sample the original sample **With Replacement** the same number of times as the original sample size.

```python
table.sample() # All you need!
```

By default, `table.sample()` draws rows:
1. at random with replacement,
2. the same number of times as there are rows in `table`.
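To see what "with replacement, the same number of times as the original sample size" means on a tiny example, here is a plain-Python sketch. The six values are made up for illustration, and `random.choices` stands in for the table method.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# A tiny made-up "original sample" of compensations.
original_sample = [52_000, 61_500, 78_000, 93_250, 120_000, 145_000]

# Draw len(original_sample) values from the sample itself, with replacement.
resample = rng.choices(original_sample, k=len(original_sample))

print(resample)
```

Because the draws are made with replacement, some original values typically appear more than once in the resample and others not at all.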

In [None]:
bootstrap_sample = ... # EXERCISE
print("Number of Rows:", bootstrap_sample.num_rows)

In [None]:
bootstrap_sample.hist('Total Compensation', bins=salary_bins)

**Step 2:** Compute the statistic on the bootstrap sample.

In [None]:
percentile(50, bootstrap_sample.column('Total Compensation'))

**Repeat** the sampling process many times:

In [None]:
def one_bootstrap_median():
    # draw the bootstrap sample
    bootstrap_sample = ...  # Exercise
    # return the median total compensation in the bootstrap sample
    return percentile(50, bootstrap_sample.column('Total Compensation'))

In [None]:
one_bootstrap_median()

In [None]:
# Generate the medians of 1000 bootstrap samples
num_repetitions = 1000
bstrap_medians = make_array()
for i in np.arange(num_repetitions):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())

Examine the empirical distribution of the bootstrap medians.

In [None]:
resampled_medians = Table().with_column('Bootstrap Sample Median', bstrap_medians)
median_bins = np.arange(120000, 160000, 2000)
resampled_medians.hist(bins=median_bins)

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');

### A General Bootstrap Function

The following function implements the general bootstrap procedure.


In [None]:
def bootstrapper(sample, statistic, num_repetitions):
    """
    Returns an array of the statistic computed on each of
    num_repetitions bootstrap samples drawn from sample.
    """
    bstrap_stats = make_array()
    for i in np.arange(num_repetitions):
        # Step 1: Sample the Sample
        bootstrap_sample = ... # EXERCISE
        # Step 2: compute statistics on the sample of the sample
        bootstrap_stat = ... # EXERCISE
        # Accumulate the statistics
        bstrap_stats = np.append(bstrap_stats, bootstrap_stat)

    return bstrap_stats    

In [None]:
og_sample = sf.sample(400)

def compute_median(sample):
    return percentile(50, sample.column("Total Compensation"))

bootstrap_medians = bootstrapper(og_sample, compute_median, 1000)

In [None]:
(Table().with_column("bootstraps", bootstrap_medians)
        .hist(bins=median_bins))

# Plotting parameters; you can ignore this code
parameter_green = '#32CD32'
plots.ylim(-0.000005, 0.00014)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2)
plots.title('Bootstrap Medians and the Parameter (Green Dot)');

---
<center> Return to Slides </center>

---

## Percentile Method: Middle 95% of the Bootstrap Estimates 

Computing confidence intervals is as simple as computing percentiles of the bootstrap estimates.  No magic equations!

In [None]:
left = ... # EXERCISE 2.5 percentile
right = ... # EXERCISE 97.5 percentile 

make_array(left, right)

In [None]:
resampled_medians.hist(bins = median_bins)

# Plotting parameters; you can ignore this code
plots.ylim(-0.000005, 0.00014)
plots.plot(make_array(left, right), make_array(0, 0), color='yellow', lw=3, zorder=1)
plots.scatter(pop_median, 0, color=parameter_green, s=40, zorder=2);
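The whole percentile method fits in a few lines of plain Python. The sketch below is not the notebook's solution: it uses a synthetic sample (made-up log-normal values), a hypothetical helper `nearest_rank_percentile` that assumes the nearest-rank rule, and `statistics.median_low` as the median.

```python
import math
import random
import statistics

def nearest_rank_percentile(p, values):
    """Smallest element that is at least as large as p% of the elements."""
    ordered = sorted(values)
    if p == 0:
        return ordered[0]
    return ordered[math.ceil(p * len(ordered) / 100) - 1]

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Synthetic "original sample" of 400 compensations (NOT the SF data).
sample = [rng.lognormvariate(11.8, 0.5) for _ in range(400)]

# Bootstrap: resample the sample with replacement and record each median.
boot_medians = []
for _ in range(1000):
    resample = rng.choices(sample, k=len(sample))
    boot_medians.append(statistics.median_low(resample))

# Percentile method: keep the middle 95% of the bootstrap medians.
left = nearest_rank_percentile(2.5, boot_medians)
right = nearest_rank_percentile(97.5, boot_medians)
print("approximate 95% interval for the median:", round(left), "to", round(right))
```

Changing 2.5 and 97.5 to 5 and 95 would give the middle 90% instead, with a narrower interval and lower confidence.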