# Class 7

- Models

- Comparing two samples

- Causality


# Models

- A model is a set of assumptions about the data.  In many cases models include assumptions about random (stochastic) processes used to generate the data.

- Data scientists are often in a position of formulating and assessing models.

## Goals of Data Science

- Deeper understanding of the world.

- Make the world a better place to live.

- For example, help expose injustice.

- The skills you are learning help empower you to do this.

## Jury Selection

- The U.S. Constitution guarantees equal protection under the law

- All defendants have the right to due process 

- Robert Swain, a Black man, was convicted in Talladega County, AL

- He appealed to the U.S. Supreme Court

- Main reason: Unfair jury selection in the County’s trials

- At the time of the trial, only men aged 21 or more were eligible to serve on juries in Talladega County. In that population, 26% of the men were Black. 

- But only eight men among the panel of 100 men (that is, 8%) were Black.

- The U.S. Supreme Court reviewed the appeal and concluded, “the overall percentage disparity has been small.” But was this assertion reasonable? 

- If jury panelists were selected at random from the county’s eligible population, there would be some chance variation. We wouldn’t get exactly 26 Black panelists on every 100-person panel. But would we expect as few as eight?

# A model of random selection

- A model of the data is that the panel was selected at random and ended up with a small number of Black panelists just due to chance.

- Since the panel was supposed to resemble the population of all eligible jurors, the model of random selection is important to assess. Let’s see if it stands up to scrutiny.

- The `numpy.random` function `multinomial(n, pvals, size)` can be used to simulate sample proportions or counts with two or more categories.

## Example 1: rolling a six-sided die 20 times

In [None]:
import numpy as np

sample_size = 20

num_simulations = 1

true_probabilities = [1/6]*6

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

proportions = counts / sample_size

print('Sample counts: \n', counts) 
print('Sample proportions: \n', proportions)

## Example 2: rolling a loaded six-sided die 100 times (6 is twice as likely as any other face), repeated 3 times

In [None]:
sample_size = 100

num_simulations = 3

true_probabilities = [1/7]*5 + [2/7]

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

proportions = counts / sample_size

print('Sample counts:\n', counts) 
print('Sample proportions: \n', proportions)

- Let's use this to simulate the jury selection process.

- The size of the jury panel is 100, so `sample_size` is 100. 

- The distribution from which we will draw the sample is the distribution in the population of eligible jurors: 26% of them were Black, so the remaining 100% - 26% = 74% are grouped as White (a very simplistic assumption, but let's go with it for now). 

- This means `true_probabilities` is `[0.26, 0.74]`.

- One simulation is below.

In [None]:
import numpy as np
import pandas as pd

sample_size = 100

true_probabilities = [0.26, 0.74]

num_simulations = 1

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

proportions = counts / sample_size

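# this table holds proportions, so name it accordingly
sim_props = pd.DataFrame(proportions, columns = ['Black', 'White'])

print(sim_props)

# proportion of Black panelists in the simulated panel
sim_props.iloc[0,0]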


## Simulate one value

In [None]:
def simulate_one_count():
    sample_size = 100
    true_probabilities = [0.26, 0.74]
    num_simulations = 1
    counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)
    sim_counts = pd.DataFrame(counts, columns = ['Black', 'White'])
    return sim_counts.iloc[0,0]


In [None]:
simulate_one_count()

## Simulate multiple values

-  Our analysis is focused on the variability in the counts. 

- Let’s generate 10,000 simulated values of the count and see how they vary.

- We will do this by using a for loop and collecting all the simulated counts in a list.

In [None]:
sim_counts = []
for _ in np.arange(10000):
    sim_counts.append(simulate_one_count())

In [None]:
import matplotlib.pyplot as plt

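# bins of width 1 centred on each integer count; with density = True the bar heights are proportions
plt.hist(sim_counts, bins = np.arange(5.5, 50, 1), edgecolor = 'black', color = 'grey', density = True);
plt.xlabel('Count in random sample')
plt.ylabel('Proportion of simulations')
# the red dot marks the observed count of 8 Black panelists
plt.scatter(8, 0, color = 'red', s =50);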

- The 10,000 simulations could also have been done with a single call to `np.random.multinomial` by setting `size = 10000`, with no for loop.

- This is an example of a 'vectorized' computation; vectorized computations are usually faster than non-vectorized ones.

In [None]:
sample_size = 100

true_probabilities = [0.26, 0.74]

num_simulations = 10000

counts = np.random.multinomial(sample_size, true_probabilities, size = num_simulations)

counts
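As a sketch of how the vectorized output might be used: the first column of `counts` holds the number of Black panelists in each simulated panel, so it can be compared directly with the observed count of 8. The seeded generator `np.random.default_rng(0)` and the names `black_counts` and `prop_as_low_as_observed` are assumptions added here for illustration; the cells above call `np.random.multinomial` directly.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: seeded generator so the run is reproducible

sample_size = 100
true_probabilities = [0.26, 0.74]
num_simulations = 10000

# one row per simulated panel, columns are [Black, White]
counts = rng.multinomial(sample_size, true_probabilities, size = num_simulations)

# first column: number of Black panelists in each simulated panel
black_counts = counts[:, 0]

# proportion of simulated panels with 8 or fewer Black panelists
prop_as_low_as_observed = np.mean(black_counts <= 8)

print('Smallest simulated count:', black_counts.min())
print('Proportion of panels with 8 or fewer Black panelists:', prop_as_low_as_observed)
```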

## Conclusion of the data analysis

- The histogram shows that if we select a panel of size 100 at random from the eligible population, we are very unlikely to get counts of Black panelists that are as low as the eight that were observed on the panel in the trial.

- This is evidence that the model of random selection of the jurors in the panel is not consistent with the data from the panel. While it is possible that the panel could have been generated by chance, our simulation shows that it is extremely unlikely.

- Therefore the most *reasonable* conclusion is that the assumption of random selection is unjustified for this jury panel.
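The simulation can be cross-checked with an exact calculation. With only two categories, the number of Black panelists under the random selection model follows a binomial distribution with `n = 100` and `p = 0.26`, so the chance of 8 or fewer can be summed term by term. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from math import comb

n = 100      # panel size
p = 0.26     # proportion of Black men in the eligible population

# P(X <= 8) for X ~ Binomial(n, p): add up the probabilities of 0, 1, ..., 8
prob_8_or_fewer = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(9))

print('Chance of 8 or fewer Black panelists under random selection:', prob_8_or_fewer)
```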

# Comparing two samples

## Comparing plant fertilizers

- A gardener wanted to discover whether a change in fertilizer mixture applied to her tomato plants would result in improved yield.

- She had 11 plants set out in a single row:
   + 5 were given standard fertilizer mixture A
   + 6 were given a supposedly improved mixture B

- The A's and B's were randomly applied to positions along the row to give the following data:

|                |  |   |   |   |   |   |   |   |   |   |   |
|----------------|---|---|---|---|---|---|---|---|---|---|---|
|Position in row | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10| 11 |
|Fertilizer      | A | A | B | B | A | B |B  |B  | A | A | B  |
|Pounds of tomatoes | 29.2 | 11.4 | 26.6 | 23.7 | 25.3 | 28.5 | 14.2 | 17.9 | 16.5 | 21.1 | 24.3 |


- The random arrangement was arrived at by taking 11 playing cards, 5 marked A and 6 marked B.

- After thoroughly shuffling the cards once, the gardener arrived at the arrangement above. 


In [None]:
import pandas as pd

fertilizer = ['A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'A', 'B']

tomatoes = [29.2, 11.4, 26.6, 23.7, 25.3, 28.5, 14.2, 17.9, 16.5, 21.1, 24.3]

data = {'fert' : fertilizer,
        'tomatoes' : tomatoes}

plant_df = pd.DataFrame(data)

The (observed) mean difference is computed below.

In [None]:
# mean of plants assigned fert A
mean_A_obs = plant_df.loc[plant_df['fert'] == 'A', 'tomatoes'].mean() 

# mean of plants assigned fert B
mean_B_obs = plant_df.loc[plant_df['fert'] == 'B', 'tomatoes'].mean()

# mean difference
obs_diff =  mean_B_obs - mean_A_obs

obs_diff

- Since the assignment of fertilizers to plants is random it could have happened another way.

- We can use the `pandas` `sample` function with `frac` and `replace` parameters to simulate these other potential assignments of fertilizers to plants.

    + `frac = 1` - specifies the fraction of rows to return (1 means return all the rows)
    + `replace = False` - specifies to sample without replacement

In [None]:
plant_df['fert'].sample(frac=1, replace = False)

- Notice that the index is out of order.

- We are going to want to have an ordered index later on.  

- To do this we can use the `pandas` function `reset_index(drop = True)`. 

- `drop = True` indicates that we don't want to keep the previous index. 

In [None]:
plant_df['fert'].sample(frac=1, replace = False).reset_index(drop = True)
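A quick check that shuffling this way only rearranges the labels: the number of A's and B's stays the same, and the index is back in order. The `random_state = 1` argument is an addition for reproducibility and is not used in the cells above; the Series `fert` is rebuilt here so the sketch runs on its own.

```python
import pandas as pd

fert = pd.Series(['A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'A', 'B'], name = 'fert')

# random_state fixes the shuffle so the example is reproducible
shuffled = fert.sample(frac = 1, replace = False, random_state = 1).reset_index(drop = True)

print(shuffled.value_counts())   # still 5 A's and 6 B's
print(list(shuffled.index))      # index is 0, 1, ..., 10 again
```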

## The Logic of Hypothesis Testing

### 1. Hypotheses

Two claims:

1. There is no difference in the mean weight between fertilizers A and B.  This is called the null hypothesis.

2. There is a difference in the mean weight between fertilizers A and B.  This is called the alternative hypothesis.

### 2. Test statistic

The test statistic is a number, calculated from the data, that captures what we're interested in.

What would be a useful test statistic for this study?

### 3. Simulate what the null hypothesis predicts will happen

- If the null hypothesis is true then the weight of tomatoes for each plant will be the same regardless of how they are labeled. That means we can randomly shuffle the labels and the mean difference should be close to 0.

Assume that there is no difference in mean weight between A and B (i.e., the null hypothesis is true).  Now, consider the following **thought experiment (we don't actually do this, this is a model for the data)**:

- Imagine we have 5 playing cards labelled `A` and 6 cards labelled `B`.


- Shuffle the cards ...


- Assign the cards to the 11 plants then calculate the mean weight difference between `A` and `B`.  This is one simulated value of the test statistic. 


- Shuffle the cards again ...


- Assign the cards to the 11 plants then calculate the mean weight difference between `A` and `B`.  This is a second simulated value of the test statistic. 


- Continue shuffling, assigning, and computing the mean difference.

## Simulating what the null hypothesis predicts

Let's assume the null hypothesis is true and simulate what the null hypothesis predicts.

### Step 1

Randomly shuffle the assignment of fertilizers to plants. 

In [None]:
fert_sim = plant_df['fert'].sample(frac=1, replace = False).reset_index(drop = True)

### Step 2

Compute the mean difference between fertilizers A and B.

In [None]:
mean_A = plant_df.loc[fert_sim == 'A', 'tomatoes'].mean()

mean_B = plant_df.loc[fert_sim == 'B', 'tomatoes'].mean()

sim_diff = mean_B - mean_A

sim_diff

### Step 3

Repeat Steps 1 and 2 a large number of times (e.g., 5000) to get the distribution of mean differences. 

In [None]:
simulated_diffs = []

for _ in range(5000):
    fert_sim = plant_df['fert'].sample(frac=1, replace = False).reset_index(drop = True)
    mean_A = plant_df.loc[fert_sim == 'A', 'tomatoes'].mean()
    mean_B = plant_df.loc[fert_sim == 'B', 'tomatoes'].mean()
    sim_diff = mean_B - mean_A
    simulated_diffs.append(sim_diff)
    

In [None]:
import matplotlib.pyplot as plt

plt.hist(simulated_diffs, edgecolor = 'black', color = 'lightgrey')
# mark the observed difference with a black vertical line
plt.vlines(x = obs_diff, ymin = 0, ymax = 1000, color = 'black', linewidth = 5);

- The histogram above shows the randomization distribution with the observed difference as the black line.

- What proportion of the simulated differences are at least as large as the observed mean difference of 1.83?  This is known as the **p-value**.

In [None]:
simulated_diffs_df = pd.DataFrame({'sim_diffs' : simulated_diffs})

numgreater = (simulated_diffs_df['sim_diffs'] >= obs_diff).sum()

print('The number of simulated differences greater than or equal to the observed difference is:', numgreater)

# divide by the number of simulations
pvalue = numgreater / len(simulated_diffs)

print('The p-value is:', pvalue)

### Step 4

Interpret the results.  

Assuming that there is no difference in the mean tomato yield between fertilizers A and B, about 31% of simulations produced a difference as large as or larger than the observed mean difference of 1.83.  Therefore, there is little reason to doubt the null hypothesis that the two fertilizers give the same mean yield.
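With only 11 plants, the simulation can also be checked by brute force: there are only 462 ways to choose which 5 of the 11 positions receive fertilizer A, so every possible assignment can be enumerated. The sketch below uses `itertools.combinations`; the helper `mean_diff` and the other names are illustrative and not part of the cells above.

```python
from itertools import combinations

tomatoes = [29.2, 11.4, 26.6, 23.7, 25.3, 28.5, 14.2, 17.9, 16.5, 21.1, 24.3]
observed_A = (0, 1, 4, 8, 9)   # zero-based positions that actually received fertilizer A

def mean_diff(a_positions):
    """Mean yield of B plants minus mean yield of A plants for a given choice of A positions."""
    a = [tomatoes[i] for i in range(11) if i in a_positions]
    b = [tomatoes[i] for i in range(11) if i not in a_positions]
    return sum(b) / len(b) - sum(a) / len(a)

obs_diff = mean_diff(observed_A)

# every way of choosing which 5 of the 11 positions get fertilizer A
all_diffs = [mean_diff(a_positions) for a_positions in combinations(range(11), 5)]

# small tolerance guards against floating-point ties with the observed value
num_as_large = sum(d >= obs_diff - 1e-9 for d in all_diffs)
exact_pvalue = num_as_large / len(all_diffs)

print('Number of possible assignments:', len(all_diffs))
print('Exact randomization p-value:', exact_pvalue)
```

The simulated p-value from 5000 shuffles should land close to this exact value, since each shuffle picks one of these assignments at random.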

## Quiz

Suppose that in a similar study of two fertilizers' effect on the yield of tomatoes, a similar simulation of 5000 shuffles found that 10 simulated differences were greater than or equal to the observed difference.  Interpret the results.

## In-class exercise

Modify the simulation so that you compare the difference in medians.

In [None]:
# code goes here

# Are mammals larger or smaller than birds?

<mark> Caroline describe data </mark>

## Read in IUCN and Amniote Data files

In [None]:
import pandas as pd
animal_iucn = pd.read_csv('animal_iucn.csv')
Amniote_db = pd.read_csv('Amniote_Database_Aug_2015.csv')

In [None]:
print(animal_iucn.columns)
animal_iucn.shape

In [None]:
print(Amniote_db.columns)
Amniote_db.shape

## Merge IUCN and Amniote DataFrames

- We want to merge `Amniote_db` and `animal_iucn`
- What column can we merge on?
- The `'scientificName'` column in `animal_iucn` can be found in `Amniote_db` if we concatenate `'genus'`, `' '`, and `'species'`.
- `+` concatenates (links) strings together in python. 

In [None]:
# an example of concatenation

string1 = 'IUCN'

string3 = 'is an interesting'

string4 = 'dataset.'

space = ' '

string1 + space + string3 + space + string4

Let's create a column called `'sciname'` in `Amniote_db`.

In [None]:
sciname = Amniote_db['genus'] + ' ' + Amniote_db['species']
Amniote_db['sciname'] = sciname

In [None]:
Amniote_db['sciname']

### `merge`

Use the `pandas` `merge` function to join `Amniote_db` and `animal_iucn`.

In [None]:
Amniote_iucn = Amniote_db.merge(animal_iucn, left_on='sciname', right_on='scientificName')
Amniote_iucn[['scientificName', 'sciname']].head()
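By default `merge` performs an inner join, so species whose names appear in only one of the two tables are dropped. The sketch below shows this on two made-up toy tables (the rows are invented for illustration and are not taken from the real files); `how = 'outer'` with `indicator = True` is one way to see which rows failed to match.

```python
import pandas as pd

# made-up toy tables standing in for Amniote_db and animal_iucn
toy_db = pd.DataFrame({'genus': ['Passer', 'Canis', 'Felis'],
                       'species': ['domesticus', 'lupus', 'catus']})
toy_db['sciname'] = toy_db['genus'] + ' ' + toy_db['species']

toy_iucn = pd.DataFrame({'scientificName': ['Passer domesticus', 'Canis lupus', 'Ursus arctos']})

# default inner merge: keeps only names found in both tables
inner = toy_db.merge(toy_iucn, left_on = 'sciname', right_on = 'scientificName')

# outer merge with indicator = True adds a '_merge' column showing where each row came from
outer = toy_db.merge(toy_iucn, left_on = 'sciname', right_on = 'scientificName',
                     how = 'outer', indicator = True)

print(len(inner))
print(outer['_merge'].value_counts())
```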

- We want to compare body mass between `'Aves'` and `'Mammalia'`. 

- So, let's create a DataFrame with only these two classes.

In [None]:
aves = Amniote_iucn['class'] == "Aves" 

mam = Amniote_iucn['class']=="Mammalia"

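# select aves or mammals; reset the index so that shuffled labels
# (which get a fresh 0, 1, 2, ... index) line up with the rows
Amniote_iucn_aves_mam = Amniote_iucn[aves | mam].reset_index(drop = True)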

In [None]:
Amniote_iucn_aves_mam['class'].value_counts()

The observed weight distributions in the two groups are:

In [None]:
Amniote_iucn_aves_mam.groupby('class')['adult_body_mass_g'].describe()

Extract the group means.

In [None]:
mean_table = Amniote_iucn_aves_mam.groupby('class')['adult_body_mass_g'].mean()
mean_table

Compute the observed difference.

In [None]:
# select the group means by label rather than by position
observed_mean_difference = mean_table['Mammalia'] - mean_table['Aves']
observed_mean_difference 

- So, mammals are on average 132,658 grams heavier than Aves.

- Could this difference be due to the sample of mammals and Aves in our data?  In other words, is this due to chance? 

## The Logic of Hypothesis Testing

### 1. Hypotheses

Two claims:

1. There is no difference in the mean body weight between mammals and Aves.  This is called the null hypothesis.

2. There is a difference in the mean body weight between mammals and Aves.  This is called the alternative hypothesis.

### 2. Test statistic

The test statistic is a number, calculated from the data, that captures what we're interested in.

What would be a useful test statistic for this study?

### 3. Simulate what the null hypothesis predicts will happen

- If the null hypothesis is true then the mean weight in mammals should be the same as aves.  This implies that we can randomly shuffle the labels and the mean difference should be close to 0.


- Imagine we have 8644 playing cards labelled `Aves` and 4490 cards labelled `Mammalia`.


- Shuffle the cards ...


- Assign the cards to the 13,134 animals then calculate the mean difference between `Aves` and `Mammalia`.  This is one simulated value of the test statistic. 


- Shuffle the cards again ...


- Assign the cards to the 13,134 animals then calculate the mean difference between `Aves` and `Mammalia`.  This is a second simulated value of the test statistic. 


- Continue shuffling, assigning labels to animals, and computing the mean difference.

## Simulating what the null hypothesis predicts

### Step 1

Randomly shuffle the assignment of `Aves` and `Mammalia` to animals

In [None]:
# shuffle the labels: sample every row without replacement, then reset the index
avemam_sim = Amniote_iucn_aves_mam['class'].sample(frac = 1, replace = False).reset_index(drop = True)
avemam_sim

### Step 2

Calculate the mean difference for the shuffled labels.

In [None]:
Ave_mean_sim = Amniote_iucn_aves_mam.loc[avemam_sim == 'Aves', 'adult_body_mass_g'].mean()

Mam_mean_sim = Amniote_iucn_aves_mam.loc[avemam_sim == 'Mammalia', 'adult_body_mass_g'].mean()

sim_diff = Mam_mean_sim - Ave_mean_sim

sim_diff

### Step 3

Repeat Steps 1 and 2 a large number of times (e.g., 5000) to get the distribution of mean differences. 

In [None]:
simulated_diffs = []

for _ in range(5000):
    avemam_sim = Amniote_iucn_aves_mam['class'].sample(frac = 1, replace = False).reset_index(drop = True)
    Ave_mean_sim = Amniote_iucn_aves_mam.loc[avemam_sim == 'Aves', 'adult_body_mass_g'].mean()
    Mam_mean_sim = Amniote_iucn_aves_mam.loc[avemam_sim == 'Mammalia', 'adult_body_mass_g'].mean()
    sim_diff = Mam_mean_sim - Ave_mean_sim
    simulated_diffs.append(sim_diff)
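The same shuffle-and-compare loop can be written with NumPy arrays, which avoids building a new pandas Series on every pass. This is a sketch under stated assumptions: the function `permutation_diffs` and its parameters are hypothetical names, the toy body masses are made up, and the code assumes there are no missing values (NumPy's `mean` does not skip them the way pandas does).

```python
import numpy as np

def permutation_diffs(values, labels, group_1, group_0, num_simulations, rng):
    """Shuffle labels num_simulations times; return mean(group_1) - mean(group_0) for each shuffle."""
    values = np.asarray(values, dtype = float)
    labels = np.asarray(labels)
    diffs = np.empty(num_simulations)
    for i in range(num_simulations):
        shuffled = rng.permutation(labels)   # shuffling without replacement keeps the group sizes fixed
        diffs[i] = values[shuffled == group_1].mean() - values[shuffled == group_0].mean()
    return diffs

# made-up body masses (grams), only to show the call
rng = np.random.default_rng(0)
toy_mass = [20.0, 35.0, 15.0, 500.0, 4000.0, 60.0, 25.0, 900.0]
toy_class = ['Aves', 'Aves', 'Aves', 'Mammalia', 'Mammalia', 'Aves', 'Aves', 'Mammalia']

toy_diffs = permutation_diffs(toy_mass, toy_class, 'Mammalia', 'Aves', 1000, rng)
print(toy_diffs[:5])
```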
    

In [None]:
import matplotlib.pyplot as plt

plt.hist(simulated_diffs, edgecolor = 'black', color = 'lightgrey')
plt.vlines(x = observed_mean_difference , ymin = 0, ymax = 1000, color = 'black', linewidth = 5);

- The histogram above shows the randomization distribution with the observed difference as the black line.

- What proportion of the simulated differences are at least as large as the observed mean difference of 35829.4?  This is known as the **p-value**.

In [None]:
simulated_diffs_df = pd.DataFrame({'sim_diffs' : simulated_diffs})

numgreater = (simulated_diffs_df['sim_diffs'] >= observed_mean_difference).sum()

print('The number of simulated differences greater than or equal to the observed difference is:', numgreater)

# divide by the number of simulations
pvalue = numgreater / len(simulated_diffs)

print('The p-value is:', pvalue)

### Step 4

Interpret the results.  

Assuming that there is no difference in the mean weights between Aves and Mammalia, 0% of simulations had a value as large as or larger than the observed mean difference of 35829.4.  Therefore, there is strong evidence against the null hypothesis that Aves and Mammalia have the same mean body weight.

# Causality

Imagine the following hypothetical scenario:

- You have a headache.

- You take an Aspirin at 11:00 to relieve your pain. Your pain goes away after 30 minutes.

Now, you go back in time to 11:00 ...

- You don't take an Aspirin at 11:00 to relieve your pain. Your pain goes away after 48 minutes.

The **causal effect** of taking an Aspirin is that the pain goes away 18 minutes sooner (48 minutes - 30 minutes).


## Potential outcomes and randomized controlled trials

- Establishing causality involves comparing these **potential outcomes**.  

- The problem is that we can never observe both taking an Aspirin and not taking an Aspirin (in the same person at the same time under the same conditions).

- A close approximation to comparing potential outcomes is to randomly assign people to two groups, so that the groups are similar on average (age, sex, income, etc.), where one group takes Aspirin after a headache and the other group takes a fake Aspirin (sugar pill/placebo) after a headache.  This is an example of a randomized controlled trial.

- Then the mean difference between time to pain relief should be due to Aspirin and not other factors related to why people may or may not take an Aspirin.  
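The idea can be illustrated with a small simulation in which, unlike in real life, both potential outcomes are known for every person. All numbers below are hypothetical: relief times without Aspirin are drawn from a normal distribution centred at 48 minutes, and Aspirin is assumed to shorten every person's time by exactly 18 minutes. After random assignment only one outcome per person is observed, yet the difference in group means should land close to the assumed effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# hypothetical potential outcomes (minutes until pain relief) for every person
time_without_aspirin = rng.normal(48, 10, size = n)
time_with_aspirin = time_without_aspirin - 18     # assumed causal effect: relief 18 minutes sooner

# randomly assign half of the people to Aspirin and half to placebo
takes_aspirin = rng.permutation(n) < n // 2

# in practice only one potential outcome per person is observed
observed_time = np.where(takes_aspirin, time_with_aspirin, time_without_aspirin)

# placebo group mean minus Aspirin group mean
estimated_effect = observed_time[~takes_aspirin].mean() - observed_time[takes_aspirin].mean()
print('Estimated effect (minutes):', estimated_effect)
```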



## Review

- The tomato plant example is an example of a test where the conclusion is indeterminate. The observed difference between groups is plausible under the model that the fertilizer had no effect on the weight of tomatoes. 

- If the null hypothesis is true then each plant's yield would be the same under either fertilizer, so the labels A and B are exchangeable.  But, this hypothesis could be false if, say, some of the plants were diseased.

- Does it make sense that biological class (i.e., Aves or Mammalia) *causes* a higher weight in Mammalia?