In [None]:
#:
import babypandas as bpd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

# Lecture 15

## Models and Statistics

## Models

* A model is a set of assumptions about the data
* We want to assess the quality of models (are they right or wrong?)

## Models
![image.png](attachment:image.png)

[Galileo's Leaning Tower of Pisa Experiment](https://en.wikipedia.org/wiki/Galileo%27s_Leaning_Tower_of_Pisa_experiment)

## Statistical Inference

* Making conclusions about models using data from random samples.

### Terminology

* **Parameter**: A number associated with the population
    - Example: the population mean.
* **Statistic**: A number calculated from the sample
    - Example: the sample mean.

A statistic can be used as an **estimate** of a parameter
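
A minimal sketch of the distinction, assuming a made-up population of 1,000 values (the population, its mean and spread, and the sample size of 40 are illustrative assumptions, not data from this lecture):

```python
import numpy as np

# Hypothetical population: 1,000 made-up measurements
population = np.random.normal(50, 10, 1000)

# Parameter: a number computed from the whole population
# (in practice we usually cannot compute this)
population_mean = population.mean()

# Statistic: a number computed from a random sample of 40,
# drawn without replacement
sample = np.random.choice(population, size=40, replace=False)
sample_mean = sample.mean()

print(population_mean, sample_mean)
```

The two numbers should be close but not equal, and the sample mean changes every time the cell is re-run while the population mean of a fixed population does not.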

### Estimating the number of German tanks in WWII

![tank.jpg](attachment:tank.jpg)

[German Tank Problem](https://en.wikipedia.org/wiki/German_tank_problem)

## Population and Sample

- Population: all enemy tanks (unknown)
- Sample: the tanks we've seen (captured or destroyed)

### Setup

* Tanks have serial numbers 1, 2, 3, …, N.
* We don’t know N.
* We would like to estimate N based on the serial numbers of the tanks that we see.

### Discussion Question

If you saw these serial numbers, what would be your estimate (guess) of N?
```
170	 271	285	 290	 48
235	 24	 90 	 291 	19
```


- A) 291
- B) 353
- C) 438
- D) 487

### Approach #1: The largest number observed

* Is it likely to be close to the total number of tanks, N?
    - How likely?
    - How close?

### Making some data

* We'll manufacture an unknown number of tanks (between 200 and 400).
* We'll see a random sample of the tanks (and their serial numbers).
* From the sample, we'll try to guess how many tanks were manufactured.
* Then we'll see if our guesses were any good.

### The main assumption
The serial numbers of the tanks that we see are a uniform random sample drawn without replacement from 1, 2, 3, …, N.
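
As a rough sketch of what this assumption means in code, here is one way to draw such a sample with plain numpy. `N_example` and `sample_size` are hypothetical values chosen for illustration; the real N stays hidden in the cells below.

```python
import numpy as np

N_example = 300    # hypothetical number of tanks, for illustration only
sample_size = 10

# All serial numbers 1, 2, ..., N_example
# (np.arange excludes its stop value, hence the + 1)
all_serials = np.arange(1, N_example + 1)

# Uniform random sample without replacement:
# every tank is equally likely, and no serial number can repeat
seen = np.random.choice(all_serials, size=sample_size, replace=False)
print(seen)
```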

In [None]:
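# manufacture tanks: N is random and hidden from us (randint excludes the upper bound)
N = np.random.randint(200, 400)
# serial numbers 1, 2, ..., N (np.arange excludes its stop value, hence N + 1)
serialnos = bpd.DataFrame().assign(SerialNumber=np.arange(1, N + 1))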

### Estimate: approach #1

- Our sample: 30 tanks.
- Our statistic: the biggest serial number seen.
- Sample is random, so biggest seen is random.

In [None]:
# the biggest serial number
serialnos.sample(30, replace=False).get('SerialNumber').max()

In [None]:
# what was N?

In [None]:
N

## Empirical Distribution of the Statistic

In [None]:
repetitions = 1000
sample_size = 30
maxes = np.array([])
for i in np.arange(repetitions):
    m = serialnos.sample(sample_size, replace=False).get('SerialNumber').max() 
    maxes = np.append(maxes, m)

In [None]:
# plot the distribution

bpd.DataFrame().assign(maxes=maxes).plot(kind='hist', bins=np.arange(N-100, N+100, 5), density=True)
plt.axvline(N, color='C2')

### Discussion Question

How often is our guess within 5 of the actual number of tanks, N?

(A) 50% of the time  
(B) 55% of the time  
(C) 60% of the time  
(D) 65% of the time

In [None]:
# same histogram, zoomed in
bpd.DataFrame().assign(maxes=maxes).plot(kind='hist', bins=np.arange(N-50, N+50, 5), density=True)
plt.axvline(N, color='C2')

### Verdict on the estimate

* The largest serial number observed is likely to be close to N.
* But it is also likely to underestimate N.
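
The size of the underestimate can be made concrete. Under the main assumption, with a sample of $k$ tanks out of $N$, a standard result (stated here without derivation) gives the expected value of the largest serial number seen:

$$E[\max] = \frac{k\,(N+1)}{k+1}$$

For example, with the hypothetical values $N = 300$ and $k = 30$, this is $30 \cdot 301 / 31 \approx 291.3$, roughly 9 below $N$. The max can never exceed $N$, so all of its errors fall on one side.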

### Estimate: approach #2
* Average of the serial numbers observed  ~  N/2
* Try to estimate the number of tanks using twice the average seen in the sample
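
The reasoning behind the factor of two can be written out. Under the main assumption, each observed serial number is equally likely to be any of $1, 2, \dots, N$, so the expected value of the sample average is the average of all the serial numbers:

$$E[\text{sample mean}] = \frac{1 + 2 + \cdots + N}{N} = \frac{N+1}{2}, \qquad E[2 \cdot \text{sample mean}] = N + 1$$

So twice the average overshoots $N$ by only 1 on average, which is negligible when $N$ is in the hundreds.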

In [None]:
# the average, times two
serialnos.sample(30, replace=False).get('SerialNumber').mean() * 2

In [None]:
# remember what the right answer was?
N

## Empirical Distribution of the Statistic

In [None]:
repetitions = 1000
sample_size = 30
twice_means = np.array([])
for i in np.arange(repetitions):
    m = serialnos.sample(sample_size, replace=False).get('SerialNumber').mean() * 2 
    twice_means = np.append(twice_means, m)

In [None]:
# plot the distribution

bpd.DataFrame().assign(twice_means=twice_means).plot(kind='hist', bins=np.arange(N-100, N+100, 5), density=True)
plt.axvline(N, color='C2')


## Probability Distribution of a Statistic

* Values of a statistic vary because random samples vary
* “Sampling distribution” or “probability distribution” of the statistic
    - All possible values of the statistic and all the corresponding probabilities.
* Can be hard to calculate:
    - we either have to do the math, or generate all possible samples and calculate the statistic for each one


## Empirical Distribution of a Statistic
* Empirical distribution of the statistic
    - Based on simulated values of the statistic
    - Consists of all the observed values of the statistic,
    - and the proportion of times each value appeared

* Good approximation to the probability distribution of the statistic 
    - if the number of repetitions in the simulation is large
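
A small sketch of what "observed values and their proportions" looks like in code. It assumes a hypothetical `N_example = 300` and uses the largest serial number as the statistic; `np.unique` with `return_counts=True` tallies how often each value appeared.

```python
import numpy as np

N_example = 300    # hypothetical population size, for illustration only
sample_size = 30
repetitions = 5000

serials = np.arange(1, N_example + 1)

# Simulate the statistic (largest serial number seen) many times
simulated = np.array([])
for i in np.arange(repetitions):
    sample = np.random.choice(serials, size=sample_size, replace=False)
    simulated = np.append(simulated, sample.max())

# Empirical distribution: each observed value of the statistic,
# and the proportion of repetitions in which it appeared
values, value_counts = np.unique(simulated, return_counts=True)
proportions = value_counts / repetitions

# The five largest observed values and their proportions
print(values[-5:], proportions[-5:])
```

A density histogram of `simulated` shows the same information, grouped into bins.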



## Estimating the Number of Tanks
* Statistic: `max`, `2 * mean`
* Probability distribution: 
    - e.g., the probability that the `max` of a sample of 30 tanks out of N = 300 is exactly N
* Empirical distribution: histograms from our simulations
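
For the "30 out of 300" example, the probability can be worked out by hand and compared against a simulation. The sketch below assumes N = 300 (a hypothetical value; the notebook's actual N is random). The max equals N exactly when tank number N is in the sample, and each tank has a 30/300 chance of being chosen.

```python
import numpy as np

N_example = 300    # hypothetical, matching the "30 out of 300" example
sample_size = 30
repetitions = 5000

# Probability distribution (exact): chance that tank N_example is in the sample
exact = sample_size / N_example

# Empirical distribution (simulated): proportion of samples whose max is N_example
serials = np.arange(1, N_example + 1)
hits = 0
for i in np.arange(repetitions):
    sample = np.random.choice(serials, size=sample_size, replace=False)
    if sample.max() == N_example:
        hits = hits + 1
empirical = hits / repetitions

print(exact, empirical)
```

The two numbers should agree closely, and the agreement improves as `repetitions` grows.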

## Bias and Variance
* Which statistic was a better estimate?

## Bias
* Biased estimate: On average across all possible samples, the estimate is either too high or too low.
* Bias creates a systematic error in one direction.
* Good estimates typically have low bias.

## Variability

* The degree to which the value of an estimate varies from one sample to another.
* High variability makes it hard to estimate accurately.
* Good estimates typically have low variability.

## Bias-variance trade-off
* The max has low variability, but it is biased.
* 2 * average has little bias, but it is highly variable.
* Life is tough.
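
One way to put numbers on this trade-off is to simulate both statistics on the same samples and summarize them. This is a sketch under the assumption that the true count is a hypothetical `N_example = 300`. Bias is measured as the average estimate minus the true value, and variability as the standard deviation of the estimates.

```python
import numpy as np

N_example = 300    # hypothetical true number of tanks
sample_size = 30
repetitions = 5000
serials = np.arange(1, N_example + 1)

max_estimates = np.array([])
twice_mean_estimates = np.array([])
for i in np.arange(repetitions):
    sample = np.random.choice(serials, size=sample_size, replace=False)
    max_estimates = np.append(max_estimates, sample.max())
    twice_mean_estimates = np.append(twice_mean_estimates, 2 * sample.mean())

# Bias: average estimate minus the true value (negative means underestimate)
max_bias = max_estimates.mean() - N_example
twice_mean_bias = twice_mean_estimates.mean() - N_example

# Variability: standard deviation of the estimates across samples
max_sd = np.std(max_estimates)
twice_mean_sd = np.std(twice_mean_estimates)

print('max:      bias', max_bias, 'SD', max_sd)
print('2 * mean: bias', twice_mean_bias, 'SD', twice_mean_sd)
```

The max should show a clearly negative bias with a small SD, while `2 * mean` should show a bias near zero with a much larger SD.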

In [None]:
# plot the distribution

bpd.DataFrame().assign(maxes=maxes, twice_means=twice_means).plot(kind='hist', bins=np.arange(N-100, N+100, 5), density=True)
plt.axvline(N, color='C2')

# Example: Jury Selection

### Swain vs. Alabama, 1965
* Talladega County, Alabama
* Robert Swain, a black man, was convicted of a crime
* His appeal cited the all-white jury as one factor
* Only men 21 years or older were allowed to serve
* 26% of this population were black
* Swain’s jury panel consisted of 100 men
* 8 men on the panel were black


### Supreme Court Ruling

* About disparities between the percentages in the eligible population and the jury panel, the Supreme Court wrote:

> "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of Negroes"

* The Supreme Court denied Robert Swain’s appeal
* Is this conclusion reasonable? Let's check.


### Our model for simulating Swain's jury panel

* *Assume* jury panel is 100 *randomly* chosen eligible prospective jurors
* 26% of population is black
* Our question: is this model (i.e., assumption) right or wrong?

## Our approach: simulation

- We will assume that this model is true.
- We'll generate a bunch of jury panels using the assumption.
- We'll see how likely it is for a random panel to contain $\leq$ 8 black men.

## Recall: simulation

1. Figure out how to run the experiment once
2. Run the experiment a bunch of times, store results in array with `np.append`.
3. Analyze the results.

## Simulation for Statistics

1. Figure out the code to generate one value of the statistic
2. Run the experiment a bunch of times, generating many values of the statistic, store in an array.
3. Visualize distribution
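
The three steps can be sketched as a reusable pattern. `simulate_statistic` and `heads_in_ten_flips` are hypothetical helper names introduced only for this illustration, and the coin-flip statistic is a stand-in for whatever statistic is being studied.

```python
import numpy as np

def simulate_statistic(one_value, repetitions):
    """Hypothetical helper: call one_value() repeatedly and collect the results."""
    results = np.array([])
    for i in np.arange(repetitions):
        results = np.append(results, one_value())
    return results

# Step 1: code that generates one value of the statistic
# (illustrative statistic: number of heads in 10 fair coin flips)
def heads_in_ten_flips():
    flips = np.random.choice(['H', 'T'], 10)
    return np.count_nonzero(flips == 'H')

# Step 2: generate many values of the statistic and store them in an array
head_counts = simulate_statistic(heads_in_ten_flips, 2000)

# Step 3: summarize; a histogram of head_counts would show the distribution
print(head_counts.mean())
```

Only step 1 changes from problem to problem; steps 2 and 3 stay the same.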

## 1. Running the experiment once

* How do we randomly sample a jury panel?
* Sample at random from a categorical distribution

```
np.random.multinomial(
    sample_size, pop_distribution
)
```

* Samples at random from the population
    - Returns a random array containing counts in each category


## Example

- In 2008, M&Ms were produced with the following probabilities:

> 24% blue, 20% orange, 16% green, 14% yellow, 13% red, and 13% brown

In [None]:
# sample a bag of 100 M&Ms
np.random.multinomial(100, [.24, .2, .16, .14, .13, .13])

In [None]:
demographics = [0.26, 0.74]

In [None]:
np.random.multinomial(100, demographics)

## 1. Running the experiment once
- Calculate the statistic: the number of black men in a random sample of 100 men from the eligible population

In [None]:
np.random.multinomial(100, demographics)[0]

## 2. Run the experiment a bunch of times

* Run 10,000 simulations.
* Keep the number of black men in each panel in the array `counts`.

In [None]:
counts = np.array([])

for i in np.arange(10000):
    new_count = np.random.multinomial(100, demographics)[0]
    counts = np.append(counts, new_count)

## 3. Visualize the distribution
* Was a jury panel with 8 black men suspiciously unusual?

In [None]:
# in 10000 random experiments, the panel with the fewest black men had:
counts.min()

In [None]:
bpd.DataFrame().assign(count=counts).plot(kind='hist', bins = np.arange(9.5, 45, 1), density=True)
observed_count = 8
plt.axvline(observed_count, color='red')
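
To put a number on how unusual 8 is under the model, one option is to compute the proportion of simulated panels with 8 or fewer black men and compare it with the exact binomial probability. This sketch re-simulates the panels so that it stands alone. It passes the `size` argument to `np.random.multinomial` to draw all panels in one call, which is a shortcut in place of the loop used above. The names `simulated_counts`, `empirical`, and `exact` are introduced here for illustration.

```python
import numpy as np
from math import comb

panel_size = 100
p_black = 0.26        # proportion of eligible men who were black, per the model
observed_count = 8

# Empirical: proportion of 10,000 simulated panels with at most 8 black men
panels = np.random.multinomial(panel_size, [p_black, 1 - p_black], size=10000)
simulated_counts = panels[:, 0]   # first column: number of black men per panel
empirical = np.count_nonzero(simulated_counts <= observed_count) / len(simulated_counts)

# Exact: binomial probability of 8 or fewer, summed term by term
exact = sum(
    comb(panel_size, k) * p_black**k * (1 - p_black)**(panel_size - k)
    for k in range(observed_count + 1)
)

print(empirical, exact)
```

Both numbers should be extremely small, which is the quantitative version of the red line sitting far to the left of the histogram.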