# Lecture 10: Models and Statistics

## Models

* A model is a set of assumptions about how data was generated
* We want to assess the quality of models

## Models
![image.png](attachment:image.png)

## Statistical Inference

* Making conclusions based on data in random samples.

### Use it to...
* Guess the value of an unknown (fixed) number.
* Create an estimate of the unknown quantity using a random sample.

### Terminology

* Parameter: A number associated with the population
* Statistic: A number calculated from the sample

A statistic can be used as an estimate of a parameter

## Approach

* Figure out the code to generate one value of the statistic
* Create an empty array in which you will collect all the simulated values
* For each repetition of the process:
    - Simulate one value of the statistic
    - Append this value to the collection array
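
The steps above can be sketched with NumPy alone. This is a minimal illustration, not part of the lecture's own code: the statistic (the mean of 10 fair die rolls), the helper name `one_simulated_value`, and the seed are all assumptions chosen to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

def one_simulated_value():
    """Simulate one value of the statistic: the mean of 10 fair die rolls."""
    rolls = rng.integers(1, 7, size=10)  # integers 1..6; the upper bound is exclusive
    return rolls.mean()

repetitions = 1000
simulated_values = np.array([])  # empty collection array

for _ in np.arange(repetitions):
    # simulate one value and append it to the collection
    simulated_values = np.append(simulated_values, one_simulated_value())

print(len(simulated_values), simulated_values.mean())
```

Only `one_simulated_value` changes from problem to problem; the collection array and the loop stay the same.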

### Estimating the number of enemy planes in WWII

![image.png](attachment:image.png)

### Setup

* Planes have serial numbers 1, 2, 3, …, N.
* We don’t know N.
* We would like to estimate N based on the serial numbers of the planes that we see.

### The model: how the data is generated
The serial numbers of the planes that we see are a uniform random sample drawn with replacement from 1, 2, 3, …, N.

### Discussion Question

If you saw these serial numbers, what would be your estimate of N?
```
170	 271	285	 290	 48
235	 24	 90 	 291 	19
```


- A) 291
- B) 350
- C) 470
- D) Not enough information
- E) Different guess

### The largest number observed

* Is it likely to be close to N?
    - How likely?
    - How close?

### Simulate Serial Numbers
* Fix number of planes at 300
* Sample uniformly with replacement

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import warnings; warnings.simplefilter('ignore')

plt.style.use('fivethirtyeight')

In [None]:
N = 300
# np.arange excludes its stop value, so use N + 1 to include serial number N
serialno = Table().with_column('Serial number', np.arange(1, N + 1))
serialno

### Estimate: attempt #1
* Guess: number of planes equal to max observed in sample

In [None]:
# statistic
serialno.sample(30).column(0).max()

In [None]:
repetitions = 1000
sample_size = 30
maxes = make_array()
for i in np.arange(repetitions):
    m = serialno.sample(sample_size).column(0).max()
    maxes = np.append(maxes, m)
maxes

In [None]:
estimates = Table().with_column("estimated_N", maxes)
estimates

In [None]:
estimates.hist(0, bins=np.arange(200, 400, 10))

### Verdict on the estimate

* The largest serial number observed is likely to be close to N.
* But it is also likely to underestimate N.


### Any other ideas for how to use the plane numbers you have seen to estimate N?

### Estimate: attempt #2
* Average of the serial numbers observed  ~  N/2
* Try to estimate the number of planes using twice the average seen in the sample

In [None]:
serialno.sample(30).column(0).mean() * 2

In [None]:
repetitions = 1000
sample_size = 30
avgs = make_array()
for i in np.arange(repetitions):
    m = serialno.sample(sample_size).column(0).mean() * 2
    avgs = np.append(avgs, m)
avgs

In [None]:
estimates = Table().with_column("estimated_N", avgs)
estimates

In [None]:
estimates.hist(0, bins=np.arange(150, 450, 10))

## Probability Distribution of a Statistic

* Values of a statistic vary because random samples vary
* “Sampling distribution” or “probability distribution” of the statistic
    - All possible values of the statistic and all the corresponding probabilities.
* Can be hard to calculate: 
    - either have to do the math or have to generate all possible samples and calculate the statistic based on each sample
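
For a very small case, both routes can be carried out and compared. The sketch below assumes a toy population 1, 2, …, 6 and samples of size 2 drawn with replacement (values chosen only so that all 36 samples can be listed), and uses the max as the statistic. The formula relies on the fact that the max is at most k exactly when every draw is at most k.

```python
from fractions import Fraction
from itertools import product

N = 6  # population is 1, 2, ..., N (small on purpose)
n = 2  # sample size, drawn with replacement

# Route 1: generate all N**n equally likely samples and tally the max of each.
by_enumeration = {k: Fraction(0) for k in range(1, N + 1)}
for sample in product(range(1, N + 1), repeat=n):
    by_enumeration[max(sample)] += Fraction(1, N ** n)

# Route 2: do the math. P(max <= k) = (k/N)**n, so
# P(max = k) = (k/N)**n - ((k-1)/N)**n.
by_formula = {
    k: Fraction(k, N) ** n - Fraction(k - 1, N) ** n
    for k in range(1, N + 1)
}

for k in range(1, N + 1):
    print(k, by_enumeration[k], by_formula[k])
```

With N = 300 and samples of size 30 there are 300^30 possible samples, so enumeration is no longer practical, which is why simulation is used instead.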


## Empirical Distribution of a Statistic
* Empirical distribution of the statistic
    - Based on simulated values of the statistic
    - Consists of all the observed values of the statistic,
    - and the proportion of times each value appeared

* Good approximation to the probability distribution of the statistic 
    - if the number of repetitions in the simulation is large
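
One way to see the approximation improve is to pick a case where the probability distribution is known and measure the gap. This sketch assumes a toy setting (max of 2 draws with replacement from 1, 2, …, 6), where the chance that the max equals k is (k/6)^2 - ((k-1)/6)^2. The repetition counts and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for repeatability
N, n = 6, 2                      # assumed toy population 1..N and sample size

# Exact probability distribution of the max: (k/N)**n - ((k-1)/N)**n
k = np.arange(1, N + 1)
exact = (k / N) ** n - ((k - 1) / N) ** n

def empirical_distribution(repetitions):
    """Proportion of simulated samples whose max equals 1, 2, ..., N."""
    samples = rng.integers(1, N + 1, size=(repetitions, n))
    maxes = samples.max(axis=1)
    counts = np.bincount(maxes, minlength=N + 1)[1:]  # drop the unused 0 slot
    return counts / repetitions

for repetitions in (100, 10_000, 1_000_000):
    gap = np.abs(empirical_distribution(repetitions) - exact).max()
    print(repetitions, gap)
```

The printed gap is the largest difference between an empirical proportion and the corresponding exact probability; it typically shrinks as the number of repetitions grows.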



## Discussion Question

Is this the histogram of a probability distribution or an empirical distribution? 

A. probability distribution  
B. empirical distribution

In [None]:
estimates.hist(0, bins=np.arange(150, 450, 10))

## Estimating the Number of Planes
* Statistic: `max`, `2 * mean`
* Probability distribution: 
    - e.g. the chance that the `max` of a sample of 30 drawn from 1, 2, …, 300 is equal to N
* Empirical distribution: histograms from our simulations
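
For the `max`, this particular probability can be worked out directly rather than simulated. The sketch below assumes the model stated earlier (uniform sampling with replacement from 1, 2, …, N) with N = 300 and a sample of 30; the helper name `chance_max_at_least` is made up for this example.

```python
N = 300          # number of planes fixed in the simulation
sample_size = 30

def chance_max_at_least(k, N, sample_size):
    """P(max >= k) for a uniform sample drawn with replacement from 1..N."""
    # The max falls below k only if every draw is one of 1, ..., k-1.
    return 1 - ((k - 1) / N) ** sample_size

p_equal_N = chance_max_at_least(N, N, sample_size)         # max is exactly N
p_within_10 = chance_max_at_least(N - 10, N, sample_size)  # max is 290 or more

print(round(p_equal_N, 3), round(p_within_10, 3))
```

The first number answers "how likely is the max to equal N?" and the second answers "how likely is it to be within 10 of N?".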

## Bias and Variance
* Which statistic was a better estimate?

## Bias
* Biased estimate: On average across all possible samples, the estimate is either too high or too low.
* Bias creates a systematic error in one direction.
* Good estimates typically have low bias.

## Variability

* The degree to which the value of an estimate varies from one sample to another.
* High variability makes it hard to estimate accurately.
* Good estimates typically have low variability.

## Bias-variance trade-off
* The max has low variability, but it is biased.
* 2 * average has little bias, but it is highly variable.
* Life is tough.
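
The trade-off can be put into numbers by simulating both estimates and comparing their average error (bias) and their standard deviation (variability). This is a sketch using NumPy directly rather than the `datascience` tables above; N = 300, the sample size of 30, the 10,000 repetitions, and the seed are assumptions carried over or chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # seed chosen only for repeatability
N, sample_size, repetitions = 300, 30, 10_000

# Each row is one sample drawn uniformly with replacement from 1..N.
samples = rng.integers(1, N + 1, size=(repetitions, sample_size))

max_estimates = samples.max(axis=1)
twice_mean_estimates = 2 * samples.mean(axis=1)

for name, estimates in [("max", max_estimates),
                        ("2 * mean", twice_mean_estimates)]:
    bias = estimates.mean() - N   # average error across simulated samples
    spread = estimates.std()      # variability from sample to sample
    print(f"{name:>8}: bias = {bias:7.2f}, SD = {spread:6.2f}")
```

A negative bias means the estimate tends to fall below N; a larger SD means the estimate moves around more from one sample to the next.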

# Example: Jury Selection

### Swain vs. Alabama, 1965
* Talladega County, Alabama
* Robert Swain, black man convicted of crime
* Appeal: one factor was all-white jury
* Only men 21 years or older were allowed to serve
* 26% of this population were black
* Swain’s jury panel consisted of 100 men
* 8 men on the panel were black


### Supreme Court Ruling

* About disparities between the percentages in the eligible population and the jury panel, the Supreme Court wrote:

> "... the overall percentage disparity has been small and reflects no studied attempt to include or exclude a specified number of Negroes"

* The Supreme Court denied Robert Swain’s appeal
* Is this conclusion reasonable? Let's check.


### The model
* Jury panel is 100 *randomly chosen* eligible prospective jurors
* 26% of this population were black

### Question
* Is the model good, or not?

## Sampling from a Distribution

* Sample at random from a categorical distribution

`sample_proportions(sample_size, pop_distribution)`

* Samples at random from the population
    - Returns an array containing the distribution of the categories in the sample
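
To make the behavior concrete, here is a rough NumPy-only stand-in. It assumes that `sample_proportions` amounts to drawing category counts from a multinomial distribution and dividing by the sample size; the name `sample_proportions_sketch` is hypothetical and is not part of the `datascience` library.

```python
import numpy as np

def sample_proportions_sketch(sample_size, probabilities):
    """Hypothetical stand-in for `sample_proportions`, written with NumPy only."""
    # Draw category counts for `sample_size` independent draws,
    # then convert the counts to proportions of the sample.
    counts = np.random.multinomial(sample_size, probabilities)
    return counts / sample_size

proportions = sample_proportions_sketch(100, [0.26, 0.74])
print(proportions, proportions.sum())
```

The returned array has one entry per category, in the same order as the input distribution, and its entries sum to 1.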


In [None]:
help(sample_proportions)

### Calculate the statistic: 
* number of black men among random sample of 100 men from eligible population

In [None]:
eligible_population = make_array(0.26, 0.74)

In [None]:
# draw one random sample of size 100 from `eligible_population`;
# the result is the proportion of the sample in each category
sample_proportions(100, eligible_population)

In [None]:
100 * sample_proportions(100, eligible_population).item(0)

### Simulate drawing the jury panel
* run 10,000 simulations
* keep number of black men in each panel in array `counts`

In [None]:
counts = make_array()

for i in np.arange(10000):
    new_count = 100 * sample_proportions(100, eligible_population).item(0)
    counts = np.append(counts, new_count)

### Visualize the simulation
* Was a jury panel with 8 black men suspiciously unusual?

In [None]:
# bins start below the observed count of 8 so the red dot falls within the plotted range
Table().with_column('Random Sample Count', counts).hist(bins = np.arange(5.5, 46.5, 1))

observed_count = 8
plt.scatter(observed_count, 0, color='red', s=30);
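
As a cross-check on the simulation, the chance of a count this low can also be computed exactly. Under the model, each of the 100 panel members is black with chance 0.26 independently of the others, so the count follows a binomial distribution. The helper `binomial_pmf` below is written for this sketch and uses only the standard library.

```python
from math import comb

panel_size = 100
p_black = 0.26        # proportion of the eligible population
observed_count = 8

def binomial_pmf(k, n, p):
    """Chance of exactly k successes in n independent draws with chance p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Chance that a randomly drawn panel has 8 or fewer black men.
chance_at_most_observed = sum(
    binomial_pmf(k, panel_size, p_black) for k in range(observed_count + 1)
)
print(chance_at_most_observed)
```

A very small value here says the same thing as a red dot sitting far to the left of the simulated histogram.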

# Example: the Genetics of Peas

## Gregor Mendel, 1822-1884
![Screen%20Shot%202018-11-05%20at%2010.33.48%20PM.png](attachment:Screen%20Shot%202018-11-05%20at%2010.33.48%20PM.png)

## Mendel's model

* Pea plants of a particular kind
* Each one has either purple flowers or white flowers

* Mendel’s model:
    - Each plant is purple-flowering with chance 75%,
    - regardless of the colors of the other plants

### Question
* Is the model good, or not?

### Choosing a Statistic

Given a sample of pea plants with colored flowers, what statistic would you use to measure whether the sample seems to be generated according to the model?

### The statistic
* Distance between sample percent (of purple plants) and 75  
```| 'sample percent of purple-flowering plants' - 75 |```
* If the statistic is large, that is evidence against the model

In [None]:
model = make_array(0.75, 0.25)

In [None]:
sample_proportions(929, model)

In [None]:
abs(100 * sample_proportions(929, model).item(0) - 75)

### Simulate Mendel's experiment
* Mendel observed results of growing 929 pea plants

In [None]:
distances = make_array()

for i in np.arange(10000):
    new_distance = abs(100 * sample_proportions(929, model).item(0) - 75)
    distances = np.append(distances, new_distance)

In [None]:
Table().with_column('Distance from 75%', distances).hist()

### Mendel's experiment
* Of the 929 pea plants, Mendel observed that 705 had purple flowers.
* Was his model a good model?

In [None]:
observed_distance =  abs(100*(705/929) - 75)
observed_distance

In [None]:
Table().with_column('Distance from 75%', distances).hist()
plt.scatter(observed_distance, 0, color='red', s=30);
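
The visual comparison can be backed by a number: the proportion of simulated distances that are at least as large as the observed one. The sketch below redoes the simulation with NumPy's binomial draws instead of `sample_proportions`, on the assumption that the two are equivalent for a two-category model; the helper `distance_from_75` and the seed are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(3)  # seed chosen only for repeatability
plants, p_purple, repetitions = 929, 0.75, 10_000

def distance_from_75(purple_count):
    """The statistic: |sample percent of purple-flowering plants - 75|."""
    return abs(100 * (purple_count / plants) - 75)

# Under the model, each plant is purple independently with chance 0.75,
# so the purple count in one sample is a binomial draw.
simulated_counts = rng.binomial(plants, p_purple, size=repetitions)
simulated_distances = distance_from_75(simulated_counts)

observed_distance = distance_from_75(705)

# Proportion of simulated distances at least as large as the observed one.
proportion_as_large = np.mean(simulated_distances >= observed_distance)
print(observed_distance, proportion_as_large)
```

If a sizable share of simulated distances are as large as the observed one, the observed data look like what the model predicts.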

## Summary

A model is an assumption about how data was generated. We want to figure out whether observed data was likely generated in the way suggested by the model. To do this:
 - Choose a statistic 
     - should measure how similar a data set is to what you would expect from the model
     - should help you decide between the model and alternative views of how data was generated
 - Predict the statistic under the model
     - simulate many data sets under the assumptions of the model, and calculate the statistic for each of them
 - Compare the data to the predictions
     - calculate the value of the statistic for the observed data set
     - determine whether the observed statistic is similar to the simulated statistics

# Testing Hypotheses

## Choosing One of Two Viewpoints 
* Based on data, one can try to determine:

    - “Chocolate has no effect on cardiac disease.”
    - “Yes, chocolate does have an effect on cardiac disease.”

* Based on data, one can try to determine:

    - “This jury panel was selected at random from eligible jurors.”
    - “No, it has too many people with college degrees.”

