In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

# Lecture 17: Hypothesis Testing Pt. III

## Warm up

Suppose that we find a coin on the sidewalk and want to decide if it's fair (i.e., a 50-50 chance of landing heads or tails). 

### Step 1: Choose the Null Hypothesis

**Null Hypothesis:**

In [None]:
# ...

What is the reasonable alternative hypothesis? I.e., what is the viewpoint opposing the null hypothesis?

In [None]:
# ...

### Step 2: Gather Data

To test the null hypothesis, suppose that we flip the coin 100 times and find that it lands on heads 58% of the time.

### Step 3: Choose a Test Statistic

We want a test statistic for which larger values make us lean more and more toward the alternative hypothesis (and away from the null). What would be a good test statistic for our null hypothesis?

In [None]:
# ...

What is the observed value of this test statistic, based on the data provided in Step 2?

In [None]:
test_stat_observed = ...

### Step 4: Find the Distribution of the Test Statistic Under the Null Hypothesis

To see if our null hypothesis is reasonable given the data from Step 2, we need to see where `test_stat_observed` is in comparison to the distribution predicted by the null hypothesis. In order to approximate this distribution, we need to repeatedly simulate the outcome of 100 coin flips as if the null hypothesis were true.

**Question:** write a function called `simulate_coin_flips` that takes no arguments. The function should simulate 100 fair coin flips (50% chance heads, 50% chance tails), and return what *fraction* of these coin flips landed on heads.

In [None]:
def simulate_coin_flips():
    ...

We will now use the `simulate_coin_flips` function to sample the distribution of the test statistic:

In [None]:
coin_test_stats = make_array()
for i in range(10000):
    
    # Simulate the fraction of heads out of 100 fair coin flips
    frac_heads = simulate_coin_flips()
    
    # Compute the test statistic and append it to the coin_test_stats array
    simulated_test_stat = abs(frac_heads - 0.5)
    coin_test_stats = np.append(coin_test_stats, simulated_test_stat)
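
The loop above grows `coin_test_stats` with `np.append`, which copies the array on every iteration. As a rough alternative sketch (not the approach used in this lecture), a sample of the same kind could be drawn in one step with NumPy's `Generator.binomial`. The names `rng`, `heads_counts`, and `vectorized_test_stats` are hypothetical, and the seed is an arbitrary choice made only for reproducibility.

```python
import numpy as np

# Assumption: a seeded Generator, used only so the sketch is reproducible
rng = np.random.default_rng(seed=0)

num_flips = 100
num_simulations = 10000

# Each entry counts the heads in one run of 100 fair coin flips
heads_counts = rng.binomial(n=num_flips, p=0.5, size=num_simulations)

# Convert counts to fractions, then to the test statistic |fraction - 0.5|
vectorized_test_stats = np.abs(heads_counts / num_flips - 0.5)
```

The result has one test statistic per simulated experiment, just like `coin_test_stats`, but it does not rely on `simulate_coin_flips`.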

In [None]:
coin_test_stats

### Step 5: Compare the Observed Test Statistic to the Distribution

We now use the sample `coin_test_stats` to approximate the distribution of the test statistic under the null hypothesis.

In [None]:
# Plot a histogram of the test statistic distribution
my_bins = np.arange(0, 0.22, 0.02)
Table().with_column('Simulated Test Stats', coin_test_stats).hist(bins=my_bins)

# Mark the observed value on the histogram 
plots.ylim([-0.5, 18])
plots.scatter(0.08, 0, color='red', s=30);

The observed value is larger than most of the simulated outcomes, indicating that observing this data would be unlikely if the null hypothesis were true. But is it *so* unlikely that we should reject the null hypothesis and conclude that the coin is biased?

**Question:** using the `coin_test_stats` array, calculate what fraction of the test statistics simulated under the null hypothesis are *at least as large* as the observed value of 0.08.

In [None]:
# ...

**Question:** if the null hypothesis is true, and we flip the coin 100 times and compute the observed test statistic, approximately what is the probability that its value is at least 0.08?

In [None]:
# ...
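
A fraction like the one asked for above can be computed by comparing an array against the observed value and averaging the resulting booleans. The sketch below uses a small made-up array (`toy_test_stats`) and a made-up observed value (`toy_observed`) rather than the simulated `coin_test_stats`, so the names and numbers are purely illustrative.

```python
import numpy as np

# Hypothetical stand-in for an array of simulated test statistics
toy_test_stats = np.array([0.01, 0.03, 0.05, 0.08, 0.02,
                           0.10, 0.04, 0.00, 0.06, 0.09])
toy_observed = 0.08

# The comparison yields a boolean array; its mean is the fraction of True entries
fraction_at_least = np.mean(toy_test_stats >= toy_observed)
print(fraction_at_least)
```

Here three of the ten toy values are at least 0.08. With statistics computed by floating-point subtraction, values that should tie with the observed value may land slightly above or below it, so it can be safer to compute the observed value with the same expression as the simulated ones, or to compare with a small tolerance.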

## Comparing Two Samples

We will load a dataset representing 1,174 newborn babies:

In [None]:
births = Table.read_table('data/baby.csv')
births

In particular, we will study whether there is a link between an infant's birth weight and whether or not their mother smoked:

In [None]:
smoking_and_birthweight = births.select('Maternal Smoker', 'Birth Weight')
smoking_and_birthweight

In [None]:
smoking_and_birthweight.group('Maternal Smoker')

In [None]:
smoking_and_birthweight.hist('Birth Weight', group='Maternal Smoker')

In [None]:
means_table = smoking_and_birthweight.group('Maternal Smoker', np.average)
means_table

It appears that the average birth weight in the smoker category is lower than that in the non-smoker category. *Is this difference too large to be explained by random variation*? We will have to do a hypothesis test to find out!

## Test Statistic

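Our test statistic is the difference between the two group means: the average birth weight in the smoker group minus the average birth weight in the non-smoker group.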

First, we will calculate the observed value of the test statistic from our actual dataset:

In [None]:
observed_difference = means_table.column(1).item(1) - means_table.column(1).item(0)
observed_difference

Since we will need to calculate differences in means quite often during A/B testing, it will be convenient to write a function to do it for us.

In [None]:
def difference_of_means(table, label, group_label):
    """
    Calculates the difference in means between two groups of rows in a table.
    Takes a table, the name of a column representing a numerical variable,
    and the name of a column indicating which group each row belongs to.
    """
    
    # table with the two relevant columns
    reduced = table.select(label, group_label)  
    
    # table containing group means
    means_table = reduced.group(group_label, np.average)
    
    # array of group means
    means = means_table.column(1)
    
    return means.item(1) - means.item(0)

In [None]:
# Test the function to see if it correctly computes the value of observed_difference
difference_of_means(births, 'Birth Weight', 'Maternal Smoker')

## Random Permutation (Shuffling)

We can randomly shuffle the rows of a table using the `sample` method. Let's look at the different ways that we can use this method, depending on the arguments that we supply.

In [None]:
letters = Table().with_column('Letter', make_array('a', 'b', 'c', 'd', 'e'))
letters

**Option 1**: randomly select the *original number of rows*, but with replacement (so that some rows can be repeated):

In [None]:
letters.sample()

**Option 2**: randomly select the *original number of rows*, but without replacement (so that no rows are repeated). In other words, randomly shuffle the order in which rows appear in the table:

In [None]:
letters.sample(with_replacement = False)
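
For comparison, here is a minimal sketch of the same two options on a plain NumPy array instead of a `Table`. It uses NumPy's `Generator.choice` and `Generator.permutation`, not the `sample` method; the names `rng`, `letters_array`, `resampled`, and `shuffled` are hypothetical.

```python
import numpy as np

# Assumption: a seeded Generator, used only so the sketch is reproducible
rng = np.random.default_rng(seed=1)
letters_array = np.array(['a', 'b', 'c', 'd', 'e'])

# Analogue of Option 1: draw as many items as the original, with replacement
resampled = rng.choice(letters_array, size=len(letters_array), replace=True)

# Analogue of Option 2: a random permutation (no item repeated or dropped)
shuffled = rng.permutation(letters_array)

print(resampled)
print(shuffled)
```

Only the second result is guaranteed to contain every letter exactly once, which is the property we need when shuffling group labels.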

## Simulation Under Null Hypothesis

Let's look at our original table of data once again:

In [None]:
smoking_and_birthweight

To simulate the null hypothesis, we want to randomly shuffle the group labels (i.e., the values in the `Maternal Smoker` column).

**Question:** create an array called `shuffled_labels`, containing a random permutation of the values in the `Maternal Smoker` column.

In [None]:
shuffled_labels = ...

In [None]:
# Add the shuffled labels back to the original table
original_and_shuffled = smoking_and_birthweight.with_column(
    'Shuffled Label', shuffled_labels
)

original_and_shuffled

Now we can use the `difference_of_means` function to calculate the difference in means between these randomly assigned groups:

In [None]:
# Difference of means between the randomly-assigned Group A and Group B
difference_of_means(original_and_shuffled, 'Birth Weight', 'Shuffled Label')

This is one sample of our test statistic (the difference in means) under the null hypothesis. To approximate the distribution of the test statistic, we will have to repeat this process many times: randomly re-assign the group labels to the rows, then calculate the difference in means between the two shuffled groups.

In [None]:
differences = make_array()

for i in np.arange(1000):
    
    # Shuffle the group labels
    shuffled_labels = smoking_and_birthweight.sample(with_replacement = False).column('Maternal Smoker')
    
    # Add the shuffled labels as a new column in the table
    shuffled_table = smoking_and_birthweight.with_column(
        'Shuffled Label', shuffled_labels)
    
    # Calculate the simulated test statistic (the difference in means)
    difference = difference_of_means(shuffled_table, 'Birth Weight', 'Shuffled Label')
    differences = np.append(differences, difference)
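
The same shuffling procedure can be sketched without a `Table`, which may make the logic easier to see. The example below is an illustration under stated assumptions: the data are synthetic (drawn from made-up normal distributions, not from `baby.csv`), and `mean_difference`, `values`, `labels`, and `simulated` are hypothetical names.

```python
import numpy as np

# Assumption: a seeded Generator, used only so the sketch is reproducible
rng = np.random.default_rng(seed=2)

# Synthetic, made-up data: 30 values labeled False and 20 labeled True
values = np.concatenate([rng.normal(120, 15, size=30),
                         rng.normal(110, 15, size=20)])
labels = np.array([False] * 30 + [True] * 20)

def mean_difference(values, labels):
    """Mean of the True group minus mean of the False group."""
    return values[labels].mean() - values[~labels].mean()

observed = mean_difference(values, labels)

# Preallocate the results instead of appending inside the loop
num_repetitions = 1000
simulated = np.empty(num_repetitions)
for i in range(num_repetitions):
    # Shuffling only the labels breaks any link between label and value
    simulated[i] = mean_difference(values, rng.permutation(labels))
```

Because the labels are shuffled while the values stay in place, the group sizes never change, and the simulated differences should cluster around zero.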

Now we evaluate the null hypothesis by plotting a histogram of the test statistic under the null hypothesis:

In [None]:
Table().with_column('Difference Between Group Means', differences).hist()
print('Observed Difference:', observed_difference)
plots.title('Prediction Under the Null Hypothesis');

**Question:** what is the conclusion of this test?

In [None]:
# ...

## Example: The TA's Defense

Here are some (fictional!) midterm scores of students in a class:

In [None]:
scores = Table.read_table('data/scores_by_section.csv')
scores

The students are divided into 12 sections:

In [None]:
scores.group('Section').show()

**Question:** calculate the average midterm score in each section.

In [None]:
# ...

Students in Section 3 notice that their average score is lower than other sections in the class. Is this due to random chance, or should we suspect that students in Section 3 are systematically getting lower midterm scores?

We can divide the students into two groups: Group A will consist of students in Section 3, and Group B will contain students in all of the remaining sections.

**Null hypothesis:** scores from Group A and Group B are samples from the same distribution.

**Alternative hypothesis:** distributions are not the same; in particular, Group A has a *lower* average score than Group B.

**Test statistic:** the difference in means, `np.mean(group_b) - np.mean(group_a)`. 

**Question:** do *larger* or *smaller* values of the test statistic favor the alternative hypothesis?

In [None]:
# ...

**Question:** write a function called `difference_of_means_section3` that calculates the value of the test statistic. The function should take a single argument, `scores_table`, which has a column `Section` and a column `Midterm` (just like the `scores` table). It should then calculate the mean midterm score in all sections *except* Section 3, minus the mean midterm score in Section 3.

In [None]:
def difference_of_means_section3(scores_table):
    """
    Calculate the statistic mean(group_b) - mean(group_a), where Group A are values of Midterm
    in Section 3, and Group B are values of Midterm in all other sections.
    """
    ...

In [None]:
# Calculate the observed value of the test statistic
observed_diff = difference_of_means_section3(scores)
observed_diff
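
One possible way to compute a statistic of this form is with a boolean mask. The sketch below works on small made-up NumPy arrays (`toy_sections` and `toy_midterms`, both hypothetical) rather than on the `scores` table, and it is only one of several reasonable approaches.

```python
import numpy as np

# Hypothetical toy data, not taken from scores_by_section.csv
toy_sections = np.array([1, 3, 2, 3, 1, 2])
toy_midterms = np.array([20, 10, 18, 14, 22, 16])

# True for the students in Section 3 (Group A), False for everyone else (Group B)
in_section_3 = toy_sections == 3

# mean(group_b) - mean(group_a): all other sections minus Section 3
toy_diff = toy_midterms[~in_section_3].mean() - toy_midterms[in_section_3].mean()
print(toy_diff)
```

In this toy data the other sections average 19 and Section 3 averages 12, so the statistic is positive when Section 3 scores lower.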

Now we will simulate the distribution of this statistic under the null hypothesis, again by repeatedly shuffling the section labels among the students.

In [None]:
# Use the sample method to randomly permute the section labels for each student
shuffled_sections = scores.sample(with_replacement = False).column('Section')
shuffled_sections

In [None]:
# Construct a table with the original midterm scores, in which the section labels are replaced
# with the randomly assigned section labels
shuffled_table = Table().with_columns(
    'Section', shuffled_sections,  # use the randomly assigned sections...
    'Midterm', scores.column('Midterm'))  # ...and the original midterm scores
shuffled_table

In [None]:
difference_of_means_section3(shuffled_table)

Now we will repeat this process many times, to sample the distribution of the test statistic under the null hypothesis.

In [None]:
mean_score_differences = make_array()
for i in range(1000):
    
    # Create a table with shuffled section labels
    shuffled_sections = scores.sample(with_replacement = False).column('Section')
    shuffled_table = Table().with_columns(
        'Section', shuffled_sections, 
        'Midterm', scores.column('Midterm'))
    
    # Compute the test statistic
    mean_score_diff = difference_of_means_section3(shuffled_table)
    mean_score_differences = np.append(mean_score_differences, mean_score_diff)
    

Finally, we plot the histogram, and compare it to the observed value of the statistic:

In [None]:
Table().with_column('All Other Mean - Section 3 Mean', mean_score_differences).hist()
plots.scatter(observed_diff, 0, color='red', s=40)
plots.ylim([-0.01, 0.31])

**Question:** calculate the $p$-value for this test. Should we reject the null hypothesis in favor of the alternative?

In [None]:
# ...