In [4]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import random

# Resampling from multiple populations.

Today's example is from my own work: trying to understand how well we are doing in educating students at UNC. This is also an example where we are going to simulate the data rather than work with real data. One reason is that I want to show you exactly what is happening; the other is that when doing research on human subjects, you need to be respectful of their data and privacy. We can understand what we need to about the problem using simulated data, whereas sharing the actual data could be a concern.

## Grades in Courses

The university is very concerned with passing rates in our courses, and occasionally that concern comes down to individual sections of a multi-section course. For example, we offer 5-8 sections of MATH 120 - *Mathematics for Liberal Arts* each semester. The question is sometimes: did one section have particularly low grades because of something the instructor did, the time of day it was offered, or some other factor?

These things absolutely have an effect. But sometimes the variations in distributions of grades that are observed across sections are just because of the randomness with which students divide themselves up into sections.

I also want to demonstrate how we can do these experiments without building up the whole population like we did last week. We will need to do this using random sampling with replacement. 

## With Replacement

Last week most of our examples used random sampling without replacement. The difference is whether, when we select an individual for our sample, we return them to the population (replace them) or not. If the sample we are taking is large compared to the population (Nixon's pardon petitions, for example), then sampling without replacement should be used: the sample is large enough that building it changes the distribution of what remains in the population. But when the sample is much smaller than the population, sampling with replacement is good enough, because building the sample barely changes the distribution.
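As a minimal sketch of the difference, here are the two kinds of draw using Python's standard `random` module. The five-member `tiny_population` is a hypothetical list made up for illustration, and the seed is fixed only so the sketch is repeatable.

```python
import random

random.seed(1)  # fixed seed only so the sketch is repeatable

# Hypothetical tiny population, made up for this illustration.
tiny_population = ['A', 'B', 'C', 'D', 'F']

# Without replacement: each member can be chosen at most once,
# so k can never exceed the size of the population.
without_replacement = random.sample(tiny_population, k=5)

# With replacement: every draw comes from the full population,
# so repeats are possible and k can be larger than the population.
with_replacement = random.choices(tiny_population, k=8)

print(without_replacement)
print(with_replacement)
```

Drawing all five members without replacement just shuffles the population, while eight draws with replacement from five members must repeat at least one of them.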

## The Population

Over the last six years MATH 120 has had the following distribution of grades: 11% A, 27% B, 30% C, 15% D, and 17% F (again, completely made-up numbers) for a large population of students (close to 1000). A section of 25 students is found to have 5 A, 6 B, 8 C, 2 D, and 4 F: a slightly better than average performance for the class.

### Our Null Hypothesis:  This distribution of grades is the result of randomly choosing 25 students from the total student population with the grade distribution above.

### Our Alternative Hypothesis: This distribution of grades is the result of something being different about this section. 

In particular, if we know that the only thing different about this section is that the instructor gave them all a piece of chocolate at the beginning of each class, then our conclusion under the alternative would be that the chocolate improved how the students did in the class.

----

### The Test Statistic

It is not so clear what to use as the **Test Statistic** here. The test statistic is the quantity we compute from our sample and compare against what the null hypothesis would produce.

Last week we used:

- For the Pitchers/NonPitchers, the median and mean of their heights (and note that we came to opposite conclusions!)
- For Nixon, the proportion of the petitions received that were granted pardons.
- For the PPP loans, the mean number of jobs saved by the largest loans.
- For the airlines, the median and mean arrival delays of the sample.

There are not necessarily rules, but some guidelines:

- The test statistic should be meaningful for the problem. In particular, its value should lead you to a conclusion under either the Null or the Alternative.
- For numerical data, median and mean are common. Other test statistics that might be useful are the standard deviation or other measures of spread, or a specific quartile. Even the minimum or maximum might be appropriate, depending on the problem.
- For categorical data, the proportion in a category is standard, particularly for a categorical variable with only two values.
- For a categorical variable with multiple values, the total deviation is useful (and is what we will use today).

**Finally a warning**

What you should not do is hunt for a test statistic that lets you reject the null hypothesis. This is a form of p-hacking, is unethical, and can lead you to a false conclusion and a false sense of confidence in your result. Why would this be? We interpret the p-value computed from our experiment as a probability, but it is a probability for the case where the experiment has been run only once. Using the same data to run the experiment multiple times, which is essentially what you are doing if you keep changing the test statistic, breaks that interpretation.

Note, for example, the baseball player problem: the mean height of the pitchers is significantly greater than that of the non-pitchers, but the median height is not. Are the pitchers taller than non-pitchers?
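To see why the warning above matters, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the three candidate statistics, the one-sided 5% cutoff, and the use of NumPy's multinomial draw (equivalent to sampling 25 grades with replacement) are my choices, not part of the analysis that follows. Every simulated section is drawn under the null hypothesis, so any rejection is a false alarm.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

population = np.array([0.11, 0.27, 0.30, 0.15, 0.17])  # A, B, C, D, F
size = 25

def candidate_statistics(counts):
    """Three hypothetical test statistics, one row per simulated section."""
    props = counts / size
    return np.column_stack([
        props[:, 0],                             # proportion of A grades
        1 - props[:, 4],                         # proportion not failing
        np.abs(props - population).sum(axis=1),  # total absolute deviation
    ])

# Reference (null) distribution of each statistic from 2000 simulated sections.
null_stats = candidate_statistics(rng.multinomial(size, population, size=2000))

# 1000 fresh sections, also drawn under the null hypothesis.
new_stats = candidate_statistics(rng.multinomial(size, population, size=1000))

# Empirical one-sided p-value of each statistic for each fresh section:
# the fraction of reference sections with a value at least as large.
p_values = (null_stats[None, :, :] >= new_stats[:, None, :]).mean(axis=1)

# False-alarm rate if we commit to one statistic in advance ...
single_rate = (p_values[:, 0] < 0.05).mean()
# ... versus if we report whichever statistic happens to look significant.
any_rate = (p_values < 0.05).any(axis=1).mean()

print(single_rate, any_rate)
```

Because the "pick whichever works" rule rejects every time the pre-chosen statistic does, plus whenever either of the others does, its false-alarm rate can only be equal or higher.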

### Test Statistics for Grades

Okay, so what test statistic should we use for the distribution of grades? There are choices:

- Since what we might care about is just whether students passed and not what specific grade they got, we could convert this to a two-value categorical variable by replacing each grade with passing / failing and then computing the proportion that passed.

- The other thing to do is ask how the proportions of the grades in the class deviated from the proportions of the grades in the population.
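As a quick sketch, both choices can be computed directly from the counts given earlier. Two assumptions are mine and are labeled in the comments: that every grade except F counts as passing, and that "total deviation" means the sum of the absolute differences between the two sets of proportions.

```python
# Counts for the section and proportions for the population, as given above.
section_counts = {'A': 5, 'B': 6, 'C': 8, 'D': 2, 'F': 4}
population_props = {'A': 0.11, 'B': 0.27, 'C': 0.30, 'D': 0.15, 'F': 0.17}

n = sum(section_counts.values())  # 25 students in the section
section_props = {grade: count / n for grade, count in section_counts.items()}

# Choice 1: proportion that passed.
# ASSUMPTION: every grade except F counts as passing.
pass_prop_section = 1 - section_props['F']
pass_prop_population = 1 - population_props['F']

# Choice 2: total deviation.
# ASSUMPTION: defined here as the sum of absolute differences in proportions.
total_deviation = sum(
    abs(section_props[grade] - population_props[grade])
    for grade in section_counts
)

print(pass_prop_section, pass_prop_population, total_deviation)
```

Under these assumptions the passing proportions are nearly identical (about 0.84 for the section against 0.83 for the population), while the total deviation comes to about 0.22, so the two statistics emphasize quite different features of the same section.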

In [19]:
distributions = pd.DataFrame([ [5/25, 6/25, 8/25, 2/25, 4/25], [0.11, 0.27, 0.30, 0.15, 0.17]], 
                             index = ['Section Distribution', 'Population Distribution'], 
                             columns = ['A', 'B', 'C', 'D', 'F']).transpose()
distributions

Unnamed: 0,Section Distribution,Population Distribution
A,0.2,0.11
B,0.24,0.27
C,0.32,0.3
D,0.08,0.15
F,0.16,0.17


In [21]:
# Now compute the deviation of the section distribution from the Population's distribution

distributions.loc[:, 'Deviation'] = distributions.loc[:, 'Population Distribution'] - distributions.loc[:, 'Section Distribution']
distributions


In [3]:
# Using the percentages for the distribution of grades in the whole population, we build
# a list of 100 grades with the same distribution. If we had fractions of a percent we could
# either round the values or, if we thought the rounding might matter, build
# a population of size 1000 or more.

sim_population = ['A']*11 + ['B']*27 + ['C']*30 + ['D']*15 + ['F'] * 17
# Note again how easy this is in Python!


We sample from sim_population with replacement, so that every draw comes from the same distribution of grades as the whole population of MATH 120 students.

In [14]:
size = 25
# Build our sample class; random.choices samples with replacement
sample = random.choices(sim_population, k=size)
sample

['F',
 'A',
 'F',
 'F',
 'B',
 'C',
 'F',
 'C',
 'F',
 'D',
 'B',
 'C',
 'F',
 'C',
 'F',
 'C',
 'C',
 'C',
 'F',
 'B',
 'D',
 'A',
 'A',
 'B',
 'F']

Let's compute the number of each grade that we got.