In [None]:
# load the packages
library('ggplot2')
library('dplyr')

# ignore the commands below; these just make sure plots fit on the screen
library('repr')
options(repr.plot.width=4, repr.plot.height=4)

# Lesson 19: Comparing Distributions


Today:
1. A quick review of statistical models (big picture)
2. Hypothesis Testing: Comparing Distributions
    + Distribution as a statistical model
    + Hypothesis testing to assess "goodness" of the model
        + Hypothesis: data follows a particular distribution (model)
        + Test: does the data indeed follow this particular distribution?  Is there enough evidence from data to reject this hypothesis?
        + Statistic: Distance between data and model
    + Testing using simulations

### Example 1: Mendel's Model

**Scenario 1:**

Mendel's Hypothesis: 75% of pea flowers are purple

Actual Data: 100 plants, 72% of the plants are purple-flowering

Do we reject Mendel's hypothesis?

In [None]:
# Simulate Mendel's Hypothesis
# Procedure:
#   Repeat the following 1000 times:
#      1. Generate a sample of 100 plants: each is purple with probability 0.75 and white with probability 0.25
#      2. Compute the fraction of purple and white flowers actually generated
#      3. Sum the absolute differences of these fractions from 0.75 and 0.25, respectively.  This is the distance to the model
#      4. Store the fractions and the distance in a row of the data frame

simulated_data_df <- data.frame( matrix( nrow = 1000, ncol = 3))
names(simulated_data_df) <- c('fraction_purple', 'fraction_white', 'distance')






**Question**: Out of the 1000 simulated samples, how many are **closer to the model** than the actual data is?

That is, if the distance from our actual data to the model is 0.06, **what percentile** would this be among the simulated data?

+ Sort the distances of the simulated samples from smallest to largest.
+ Count how many of the simulated samples have a distance less than 0.06.
+ Divide by the total number of simulated samples.
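The simulation and the percentile computation can be sketched as follows. This is one possible way to fill in the cell above, not the only one: the seed, the use of `rbinom()`, and the names `purple_counts`, `observed_distance`, and `percentile` are assumptions made for illustration. Distances are computed from whole-number counts and then divided by 100, so that simulated samples that tie with the actual data compare exactly.

```r
set.seed(19)  # arbitrary seed, only to make the sketch reproducible

num_simulations <- 1000
num_plants <- 100

# number of purple plants in each simulated sample of 100, under Mendel's model
purple_counts <- rbinom(num_simulations, size = num_plants, prob = 0.75)
white_counts <- num_plants - purple_counts

simulated_data_df <- data.frame(
    fraction_purple = purple_counts / num_plants,
    fraction_white = white_counts / num_plants,
    # distance to the model: sum of absolute differences, expressed as a fraction
    distance = (abs(purple_counts - 75) + abs(white_counts - 25)) / num_plants
)

# actual data: 72 purple and 28 white plants out of 100
observed_distance <- (abs(72 - 75) + abs(28 - 25)) / num_plants

# fraction of simulated samples that are closer to the model than the actual data
percentile <- mean(simulated_data_df$distance < observed_distance)
percentile
```

A small percentile means the actual data is closer to the model than most simulated samples; a percentile near 1 means the actual data is unusually far from the model.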

**Scenario 2:**

Mendel's Hypothesis: 75% of pea flowers are purple

Actual Data: 63% of flowers are purple

Do we reject Mendel's hypothesis?

<table>
    <tr>
        <th></th>
        <th>Model</th>
        <th>Data</th>
        <th>Difference (absolute value) </th>
    </tr>
    <tr>
        <td>Purple</td>
        <td>0.75</td>
        <td>0.63</td>
        <td>|0.75 - 0.63| = 0.12</td>
    </tr>
    <tr>
        <td>White</td>
        <td>0.25</td>
        <td>0.37</td>        
        <td>|0.25 - 0.37| = 0.12</td>
    </tr>
    <tr>
        <td><b>Distance</b></td>
        <td> </td>
        <td> </td>        
        <td> 0.24</td>
    </tr>
</table>
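The arithmetic in the table can be reproduced in a few lines of R. This is a minimal sketch; the variable names `model`, `observed`, `differences`, and `distance` are chosen here for illustration.

```r
model <- c(purple = 0.75, white = 0.25)     # Mendel's hypothesis
observed <- c(purple = 0.63, white = 0.37)  # actual data in Scenario 2

# absolute difference for each flower color
differences <- abs(model - observed)

# distance between data and model: sum of the absolute differences
distance <- sum(differences)
round(distance, 2)  # 0.24
```

To judge whether 0.24 is unusually large, it would be compared against simulated distances as in Scenario 1, which requires knowing the number of plants in the sample.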

## Comparing Models and Data

### Example 2: Racial and Ethnic Disparity in Manhattan Jury Pools


<img src="images/lec15_jury.png" width="400">


Sources:
+ https://ppefny.org/2007/06/racial-and-ethnic-disparity-in-manhattan-jury-pools-results-of-a-survey-and-suggestions-for-reform/474
+ https://ppefny.org/wp-content/uploads/2007/06/200706RJJuryPoolStudy.pdf

Other reading/references: https://www.nytimes.com/2007/06/27/nyregion/27jurors.html

"Both federal and state laws seek to ensure equal representation by mandating trial by **jury selected at random from a “fair cross section of the community”**."

<img src="images/lec15_jury3.png" width="550">

In [None]:
# do not change this code cell
jurydata2 <- data.frame( proportion = c(54.4, 17.4, 9.5, 18.8, 77.7, 10.1, 6.5, 5.7), 
                    description = c('% in census', '% in census', '% in census', '% in census', '% in jury pool', '% in jury pool', '% in jury pool', '% in jury pool'), 
                    group = c('white', 'black', 'asian', 'other', 'white', 'black', 'asian', 'other'))

library(tidyverse)
jurydata1 <- spread(jurydata2, description, proportion)

In [None]:
jurydata1

In [None]:
jurydata2
# create a bar plot of the proportion of each group, with fill = description
ggplot(jurydata2, aes( x = group, y = proportion, fill = description )) + geom_col( position='dodge')

#### Question: Is there a racial and ethnic disparity in jury selection in Manhattan?

+ Model: Jury panels **are representative** of Manhattan's racial and ethnic diversity.
+ Alternative viewpoint: **No**, jury panels **are not representative** of Manhattan's racial and ethnic diversity.

In [None]:
jurydata1

In [None]:
# add a column to compute distance between model and data



#### Approach
1. Suppose that the model is true:
    + SIMULATE: generate data based on the model
    + compute the distance between each simulated data set and the model
2. Compare the distance from the actual data to the model with the distances from the simulated data to the model.
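The approach above can be sketched with `rmultinom()`, which draws group counts for a whole jury pool at once. The values of `num_simulations` and `num_jurors` below are placeholders and are not figures from the study; the seed and the names `census`, `jury_pool`, `simulated_pct`, and `observed_distance` are likewise assumptions made for illustration. Everything is kept in percentages to match `jurydata2`.

```r
set.seed(19)  # arbitrary seed, only to make the sketch reproducible

# model (census) and actual data (jury pool), in percent, as in jurydata2
census <- c(white = 54.4, black = 17.4, asian = 9.5, other = 18.8)
jury_pool <- c(white = 77.7, black = 10.1, asian = 6.5, other = 5.7)

num_simulations <- 1000  # ASSUMPTION: placeholder value
num_jurors <- 1000       # ASSUMPTION: placeholder value, not taken from the study

# each column of counts is one simulated jury pool drawn from the census model
# (rmultinom rescales prob internally, so the census summing to 100.1 is harmless)
counts <- rmultinom(num_simulations, size = num_jurors, prob = census)

# one row per simulation, expressed in percent
simulated_pct <- 100 * t(counts) / num_jurors
colnames(simulated_pct) <- names(census)

simulated_data_df <- as.data.frame(simulated_pct)
# distance to the model: sum of absolute differences across the four groups
simulated_data_df$distance <- rowSums(abs(sweep(simulated_pct, 2, census)))

# distance from the actual jury pool to the model
observed_distance <- sum(abs(jury_pool - census))

# fraction of simulated pools at least as far from the model as the actual pool
mean(simulated_data_df$distance >= observed_distance)
```

If almost none of the simulated pools are as far from the model as the actual pool, the data is hard to reconcile with the model that jury panels are representative.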

In [None]:
num_simulations <- 
num_jurors <- 


simulated_data_df <- data.frame( matrix(nrow = num_simulations, ncol = 5) )
colnames(simulated_data_df) <- c('white', 'black', 'asian', 'other', 'distance')





#### Conclusions from this study?

+ Assuming that the study used reliable data, the study seems to provide evidence against the viewpoint that the jury pool is representative.
+ However, whether this evidence is reliable depends on the quality of the data.
    + How was the data collected?
    + Was this a representative sample?  Was there a bias in the data or how it was collected?
    + etc.
