In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab07.ipynb")

# Lab 7: Testing Hypotheses

**Helpful Resources:**
- [Python Reference](http://www.cs.williams.edu/~cs104/python-library-ref.html): Cheat sheet of helpful library methods.

**Recommended Readings**: 
* [Ch 11. Testing Hypotheses](https://www.inferentialthinking.com/chapters/11/Testing_Hypotheses.html)
* [Ch 12.1. A/B Testing](https://inferentialthinking.com/chapters/12/1/AB_Testing.html)

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to set up the notebook by importing some helpful libraries. Each time you start your server, you will need to execute this cell again. For all problems that ask you to write explanations and sentences, you **must** provide your answer in the designated space. **Moreover, throughout this homework and all future ones, please be sure not to reassign variables throughout the notebook!** For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!

**Note: This homework has hidden tests on it. That means even though tests may say 100% passed, it doesn't mean your final grade will be 100%. We will be running more tests for correctness once everyone turns in the homework.**

In [None]:
# Run this cell to set up the notebook, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines make plots look nice and hide some messy Python warnings.
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
warnings.simplefilter('ignore', np.VisibleDeprecationWarning)

## 1. Vaccinations Across The Nation (45 pts)



A vaccination clinic has two types of vaccines against a disease. Each person who comes in to be vaccinated gets either Vaccine 1 or Vaccine 2. One week, everyone who came in on Monday, Wednesday, and Friday was given Vaccine 1. Everyone who came in on Tuesday and Thursday was given Vaccine 2. The clinic is closed on weekends.

Doctor Adhikari at the clinic said, "Oh wow, it's just like tossing a coin that lands heads with chance $\frac{3}{5}$. Heads you get Vaccine 1 and Tails you get Vaccine 2."

But Doctor Wagner said, "No, it's not. We're not doing anything like tossing a coin."

That week, the clinic gave Vaccine 1 to 211 people and Vaccine 2 to 107 people. Conduct a test of hypotheses to see which doctor's position is better supported by the data.

#### Part 1.1 (5 pts)


 Given the information above, what was the sample size, and what was the percentage of people who got **Vaccine 1?** 

*Note*: Your percent should be a number between 0 and 100.



In [None]:
sample_size = ...
percent_V1 = ...

print(f"Sample Size: {sample_size}")
print(f"Vaccine 1 Percent: {percent_V1}")

In [None]:
grader.check("q1.1")

<!-- BEGIN QUESTION -->

#### Part 1.2 (5 pts)


 State the null hypothesis. It should reflect the position of either Dr. Adhikari or Dr. Wagner. 



_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

#### Part 1.3 (5 pts)


 State the alternative hypothesis. It should reflect the position of the doctor you did not choose to represent in Question 1.2. 



_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Part 1.4 (5 pts)


 One of the test statistics below is appropriate for testing these hypotheses. Assign the variable `valid_test_stat` to the number corresponding to the correct test statistic. 

1. percent of heads - 60
2. percent of heads - 50
3. |percent of heads - 60|
4. |percent of heads - 50|



In [None]:
valid_test_stat = ...
valid_test_stat

In [None]:
grader.check("q1.4")

#### Part 1.5 (5 pts)


 Using your answer from Questions 1.1 and 1.4, find the observed value of the test statistic and assign it to the variable `observed_statistic`. 



In [None]:
observed_statistic = ...
observed_statistic

In [None]:
grader.check("q1.5")

#### Part 1.6 (5 pts)


 In order to perform this hypothesis test, you must simulate the test statistic. From the three options below, pick the assumption that is needed for this simulation. Assign `assumption_needed` to an integer corresponding to the assumption. 

1. The statistic must be simulated under the null hypothesis.
2. The statistic must be simulated under the alternative hypothesis.
3. No assumptions are needed. We can just simulate the statistic.



In [None]:
assumption_needed = ...
assumption_needed

In [None]:
grader.check("q1.6")

<!-- BEGIN QUESTION -->

#### Part 1.7 (5 pts)


 Simulate 20,000 values of the test statistic under the assumption you picked in Question 1.6.  

As usual, start by defining a function that simulates one value of the statistic. Your function should use `sample_proportions`. Then write a `for` loop to simulate multiple values, and collect them in the array `simulated_statistics`. 

Use as many lines of code as you need. We have included the code that visualizes the distribution of the simulated values. The red dot represents the observed statistic you found in Question 1.5.
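
The general pattern (one function for a single simulated value, then a loop that collects many values) can be sketched on a different problem. This is a minimal illustration for a fair coin tossed 100 times, not the solution to this question. It uses plain NumPy and assumes that `sample_proportions(n, p)` behaves like a multinomial draw divided by `n`; the names `one_fair_coin_statistic` and `toy_statistics` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

def one_fair_coin_statistic(num_tosses=100):
    # Draw counts of [heads, tails] under the model, then convert to proportions.
    proportions = rng.multinomial(num_tosses, [0.5, 0.5]) / num_tosses
    percent_heads = 100 * proportions[0]
    # Distance between the simulated percent of heads and the model's 50%.
    return abs(percent_heads - 50)

# Collect many simulated values in an array, one per loop iteration.
toy_statistics = np.array([])
for _ in range(1000):
    toy_statistics = np.append(toy_statistics, one_fair_coin_statistic())

print(len(toy_statistics))
```

In the lab itself, the model proportions, sample size, and statistic should come from your own answers to the earlier parts.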



In [None]:
def one_simulated_statistic():
    ...

print(one_simulated_statistic())

num_simulations = 20000

simulated_statistics = ...  
for ... in ...:  
    ...  
simulated_statistics

In [None]:
# Run this cell to produce a histogram of the simulated statistics

Table().with_columns('Simulated Statistic', simulated_statistics).hist()
plt.scatter(observed_statistic, -0.002, color='red', s=40);

<!-- END QUESTION -->

#### Part 1.8 (5 pts)


 Using `simulated_statistics`, `observed_statistic`, and `num_simulations`, find the empirical p-value based on the simulation. 
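
As a reminder of what an empirical p-value measures, here is a small sketch on made-up numbers. It assumes that large values of the statistic favor the alternative, so the p-value is the proportion of simulated values at least as large as the observed one. The arrays and names (`toy_simulated`, `toy_observed`, `toy_p_value`) are hypothetical and unrelated to the clinic data.

```python
import numpy as np

# Made-up simulated statistics and a made-up observed value.
toy_simulated = np.array([0.5, 1.0, 2.0, 3.5, 4.0, 0.0, 1.5, 2.5, 3.0, 6.0])
toy_observed = 3.0

# Proportion of simulated values that are at least as extreme as the observed one.
toy_p_value = np.count_nonzero(toy_simulated >= toy_observed) / len(toy_simulated)
print(toy_p_value)  # 4 of the 10 values are >= 3.0, so this prints 0.4
```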



In [None]:
p_value = ...
p_value

In [None]:
grader.check("q1.8")

#### Part 1.9 (5 pts)


 Assign `correct_doctor` to the number corresponding to the correct statement below. Use the 5% cutoff for the p-value. 

1. The data support Dr. Adhikari's position more than they support Dr. Wagner's.
2. The data support Dr. Wagner's position more than they support Dr. Adhikari's.

As a reminder, here are the two claims made by Dr. Adhikari and Dr. Wagner:
> **Doctor Adhikari:** "Oh wow, it's just like tossing a coin that lands heads with chance $\frac{3}{5}$. Heads you get Vaccine 1 and Tails you get Vaccine 2."

>**Doctor Wagner:** "No, it's not. We're not doing anything like tossing a coin."



In [None]:
correct_doctor = ...
correct_doctor

In [None]:
grader.check("q1.9")

## 2. Using TVD as a Test Statistic (25 pts)


Before beginning this section, please read [this section](https://inferentialthinking.com/chapters/11/2/Multiple_Categories.html#a-new-statistic-the-distance-between-two-distributions) of the textbook on TVD!

Total variation distance (TVD) is a special type of test statistic that we use when we want to compare two distributions of categorical data. It is often used when we observe that a set of observed proportions/probabilities is different than what we expect under the null model. 

Consider a six-sided die that we roll 6,000 times. If the die is fair, we would expect that each face comes up $\frac{1}{6}$ of the time. By random chance, a fair die won't always result in equal proportions (that is, we won't get exactly 1000 of each face). However, if we suspect that the die might be unfair based on the data, we can conduct a hypothesis test using TVD to compare the expected [$\frac{1}{6}$, $\frac{1}{6}$, $\frac{1}{6}$, $\frac{1}{6}$, $\frac{1}{6}$, $\frac{1}{6}$] distribution to what is actually observed.
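
To make the die example concrete, here is a minimal sketch of the TVD calculation using plain NumPy. The counts are invented for illustration, and the variable names are hypothetical. The TVD is half the sum of the absolute differences between the two distributions.

```python
import numpy as np

# Hypothetical counts for 6,000 rolls (made up for illustration only).
observed_counts = np.array([1000, 980, 1020, 1010, 990, 1000])
observed_props = observed_counts / observed_counts.sum()

# Expected distribution for a fair die: 1/6 for each face.
fair_props = np.full(6, 1 / 6)

# TVD: half the sum of the absolute differences between the two distributions.
tvd = np.abs(observed_props - fair_props).sum() / 2
print(tvd)
```

For these made-up counts the TVD is about 0.005. Whether a value like that is surprising for a fair die is exactly what the simulation under the null hypothesis is meant to answer.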

**Run the cell below to load in the `covid` table.**  This table is based on data downloaded from the [CDC Data Catalog](https://data.cdc.gov/browse) in May, 2022, when COVID-related deaths in the U.S. reached one million.  The table shows the Race and Hispanic Origin of a random sample of 100 people who have passed away from COVID.  It also shows the distribution of the U.S. population across those categories.

In [None]:
# Load the covid table from the CSV file and display all of its rows.
covid = Table.read_table("covid-deaths.csv")
covid.show()

Example Table

<!-- BEGIN QUESTION -->

#### Part 2.1 (5 pts)


We wish to test whether some groups listed in the table have disproportionate death rates.  Define the null hypothesis, alternative hypothesis, and test statistic in the cell below.

*Note:* Please format your answer as follows:
- Null Hypothesis: ...  
- Alternative Hypothesis: ...  
- Test Statistic: ...  



_Type your answer here, replacing this text._

<!-- END QUESTION -->

#### Part 2.2 (5 pts)


 Write a function `calculate_tvd` that takes in the observed distribution (`obs_dist`) and expected distribution under the null hypothesis (`null_dist`) and calculates the total variation distance. Use this function to set `observed_tvd` to be equal to the observed test statistic. 



In [None]:
def calculate_tvd(obs_dist, null_dist):
    ...
    
observed_tvd = ...
observed_tvd

#### Part 2.3 (5 pts)


Create an array called `simulated_tvds` that contains 5,000 simulated values under the null hypothesis, matching `num_simulations` in the cell below.

*Hint:* The `sample_proportions` function may be helpful to you. Refer to the [Python Reference sheet](http://data8.org/fa21/python-reference.html#:~:text=sample_proportions(sample_size%2C%20model_proportions)) to read up on it!

In [None]:
num_simulations = 5000

simulated_tvds = ...
for ... in ...:  
    ...  

Run the cell below to plot a histogram of your simulated test statistics, as well as the observed value of the test statistic.

In [None]:
Table().with_column("Simulated TVDs", simulated_tvds).hist()
plt.scatter(observed_tvd, -0.02, color='red', s=70, zorder=2);
plt.show();

#### Part 2.4 (5 pts)


 Use your simulated statistics to calculate the p-value of your test. Make sure that this number is consistent with what you observed in the histogram above. 



In [None]:
p_value_tvd = ...
p_value_tvd

<!-- BEGIN QUESTION -->

#### Part 2.5 (5 pts)


 What can you conclude about COVID deaths from this question? Explain your answer using the results of your hypothesis test. Assume a p-value cutoff of 5%. 



_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 3. Who is Older? (55 pts)


Data scientists have drawn a simple random sample of size 500 from a large population of adults. Each member of the population happened to identify as either "male" or "female". Data was collected on several attributes of the sampled people, including age. The table `sampled_ages` contains one row for each person in the sample, with columns containing the individual's gender identity and age.

In [None]:
sampled_ages = Table.read_table('age.csv')
sampled_ages.show(5)

#### Part 3.1 (5 pts)


 How many females were there in our sample? Keep in mind that `group` sorts categories in alphabetical order. 



In [None]:
num_females = ...
num_females

In [None]:
grader.check("q3.1")

#### Part 3.2 (5 pts)


 Complete the cell below so that `avg_male_vs_female` evaluates to `True` if the sampled males are older than the sampled females on average, and `False` otherwise. Use Python code to achieve this. 



In [None]:
group_mean_tbl = ...
group_means = ...
avg_male_vs_female = ...
avg_male_vs_female


In [None]:
grader.check("q3.2")

#### Part 3.3 (5 pts)


 The data scientists want to use the data to test whether males are older than females or, in other words, whether the ages of the two groups have the same distribution. One of the following statements is their null hypothesis and another is their alternative hypothesis. Assign `null_statement_number` and `alternative_statement_number` to the numbers of the correct statements in the code cell below. 

1. In the sample, the males and females have the same distribution of ages; the sample averages of the two groups are different due to chance.
2. In the population, the males and females have the same distribution of ages; the sample averages of the two groups are different due to chance.
3. The age distributions of males and females in the population are different due to chance.
4. The males in the sample are older than the females, on average.
5. The males in the population are older than the females, on average.
6. The average ages of the males and females in the population are different.



In [None]:
null_statement_number = ...
alternative_statement_number = ...


In [None]:
grader.check("q3.3")

#### Part 3.4 (5 pts)


 The data scientists have decided to use a permutation test. Assign `permutation_test_reason` to the number corresponding to the reason they made this choice. 

1. Since a person's age can't be related to their gender under the null hypothesis, it doesn't matter who is labeled "male" and who is labeled "female", so you can use permutations.
2. Under the null hypothesis, permuting the labels in the `sampled_ages` table is equivalent to drawing a new random sample with the same number of males and females as in the original sample.
3. Under the null hypothesis, permuting the rows of the `sampled_ages` table is equivalent to drawing a new random sample with the same number of males and females as in the original sample.



In [None]:
permutation_test_reason = ...
permutation_test_reason

In [None]:
grader.check("q3.4")

#### Part 3.5 (5 pts)


 To test their hypotheses, the data scientists have followed our textbook's advice and chosen a test statistic where the following statement is true: Large values of the test statistic favor the alternative hypothesis.

The data scientists' test statistic is one of the two options below. Which one is it? Assign the appropriate number to the variable `correct_test_stat`. 

1. "male age average - female age average" in a sample created by randomly shuffling the male/female labels
2. "|male age average - female age average|" in a sample created by randomly shuffling the male/female labels



In [None]:
correct_test_stat = ...
correct_test_stat

In [None]:
grader.check("q3.5")

#### Part 3.6 (5 pts)


 Complete the cell below so that `observed_statistic_ab` evaluates to the observed value of the data scientists' test statistic. Use as many lines of code as you need, and remember that you can use any quantity, table, or array that you created earlier. 



In [None]:
observed_statistic_ab = ...
observed_statistic_ab

#### Part 3.7 (5 pts)


 Assign `shuffled_labels` to an array of shuffled male/female labels. The rest of the code puts the array in a table along with the data in `sampled_ages`. 



In [None]:
shuffled_labels = ...
original_with_shuffled_labels = sampled_ages.with_columns('Shuffled Label', shuffled_labels)
original_with_shuffled_labels

#### Part 3.8 (5 pts)


 [Pretend this is a midterm problem and solve it without doing the calculation in a code cell.] The comparison below uses the array `shuffled_labels` from Question 3.7 and the count `num_females` from Question 3.1. 

For this comparison, assign the correct number from one of the following options to the variable `correct_q8`.

`comp = np.count_nonzero(shuffled_labels == 'female') == num_females`

1. `comp` is set to `True`.
2. `comp` is set to `False`.
3. `comp` is set to `True`, or `False`, depending on how the shuffle came out.



In [None]:
correct_q8 = ...
correct_q8

#### Part 3.9 (5 pts)


 Define a function `simulate_one_statistic` that takes no arguments and returns one simulated value of the test statistic. We've given you a skeleton, but feel free to approach this question in a way that makes sense to you. Use as many lines of code as you need. Refer to the code you have previously written in this problem, as you might be able to re-use some of it. 
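
The idea of one permutation step can be sketched on a tiny made-up data set with plain NumPy arrays instead of a `Table`. This is an illustration under stated assumptions, not the skeleton's intended solution: the labels `"a"` and `"b"`, the values, and the function name `toy_shuffled_difference` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)  # seeded only so the sketch is repeatable

# Made-up group labels and measurements, one entry per individual.
toy_labels = np.array(["a", "a", "a", "b", "b"])
toy_values = np.array([10.0, 12.0, 14.0, 20.0, 22.0])

def toy_shuffled_difference():
    # Shuffle the labels while leaving the values in place.
    shuffled = rng.permutation(toy_labels)
    # Recompute each group's mean using the shuffled labels.
    mean_a = toy_values[shuffled == "a"].mean()
    mean_b = toy_values[shuffled == "b"].mean()
    return mean_b - mean_a

print(toy_shuffled_difference())
```

Calling the function repeatedly gives different values because each call reshuffles the labels. In the lab, the same steps are carried out with table methods on `sampled_ages`.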

In [None]:
def simulate_one_statistic():
    "Returns one value of our simulated test statistic"
    shuffled_labels = ...
    shuffled_tbl = ...
    group_mean_tbl = ...
    group_means = ...
    return group_means.item(1) - group_means.item(0)

After you have defined your function, run the following cell a few times to see how the statistic varies.

In [None]:
simulate_one_statistic()

#### Part 3.10 (5 pts)


Complete the cell to simulate 3,000 values of the statistic. We have included the code that draws the empirical distribution of the statistic and shows the value of `observed_statistic_ab` from Question 3.6. 

*Note:* This cell will take around a minute to run.

In [None]:
repetitions = 3000

simulated_statistics_ab = ...

Table().with_columns('Simulated Statistic', simulated_statistics_ab).hist()
plt.scatter(observed_statistic_ab, -0.002, color='red', s=70);

#### Part 3.11 (5 pts)


 Use the simulation to find an empirical approximation to the p-value. Assign `p_val` to the appropriate p-value from this simulation. Then, assign `conclusion` to either `null_hyp` or `alt_hyp`.  

*Note:* Assume that we use the 5% cutoff for the p-value.



In [None]:
# These are variables provided for you to use.
null_hyp = 'The data are consistent with the null hypothesis.'
alt_hyp = 'The data support the alternative more than the null.'

p_val = ...
conclusion = ...

p_val, conclusion # Do not change this line

## 4. Generalizing Our Code for TVD Tests (25 pts)



In the examples above, you wrote specialized code to answer each question we asked.  As you may have noticed, the code you write to answer similar questions is often quite similar.  A key aspect of developing programming skills is writing code that is reusable, so that you never have to start from scratch or repeat code you've already written.  We'll develop some general code for working with TVDs here, in the context of M&Ms...

M&Ms come in six colors -- red, orange, yellow, green, blue, and brown.  However, the distribution of colors in a bag is not uniform.  Some colors are present more frequently.  Further, the color distribution has changed over time, and the factories now produce M&M bags with different distributions.  **Run the cell below to load the `colors` table.**

In [None]:
colors = Table.read_table("m-and-ms.csv")
colors.show()

This table shows the distribution of M&M colors published by the Mars company in 2008, as well as the distributions recently used by two M&M factories located in Cleveland, TN (CLV) and Hackettstown, NJ (HKP).  The last column (Observed) is the observed distribution of colors for 712 M&Ms from a bag.  We would like to determine which of the following statements is supported by the data.

1.  The bag is an old bag from 2008.
2.  The bag was produced at the CLV factory.
3.  The bag was produced at the HKP factory.

Those statements lead to three null hypotheses that we can test with our TVD method.  We don't want to repeat similar code three times to do this, so instead we will develop a set of general functions that work for any of these questions, and, indeed, a great many more.

#### Part 4.1 (5 pts)


The first function we'll write is one we've already seen -- `calculate_tvd`.  So that we have all the code in one place, please create a new copy of your solution here.  We include two sample calls.

In [None]:
def calculate_tvd(obs_dist, null_dist):
    ...

print(calculate_tvd(make_array(0.5, 0.5), make_array(1/2, 1/2)))    # should be 0
print(calculate_tvd(make_array(0.45, 0.55), make_array(1/2, 1/2)))  # should be ~0.1

In [None]:
grader.check("q4.1")

#### Part 4.2 (5 pts)


Now, write a function `generate_simulated_tvds` that creates `n` samples of size `sample_size` drawn from the given `distribution`.  The function returns the TVDs for each sample in an array.  You have written something very similar above, but now we're creating a standalone function for this operation.  We include some sample code that uses your function to build a table of TVDs for rolls of a fair die with different sample sizes.

_Type your answer here, replacing this text._

In [None]:
def generate_simulated_tvds(n, distribution, sample_size):
    ...

In [None]:
# Dice have six sides and a uniform distribution of values when rolled.
dice_distribution = make_array(1/6,1/6,1/6,1/6,1/6,1/6)   
tvd_table = Table()
for sample_size in [ 1, 100, 10000, 1000000, 100000000 ]:
    tvd_table = tvd_table.with_column(str(sample_size), generate_simulated_tvds(10, dice_distribution, sample_size))
tvd_table.show()

#### Part 4.3 (5 pts)


Now write a general function that will compute the p-value for any `distribution` and `observed` value.  The `distribution` will be an array, and you can compute its length with `len(distribution)`.

In [None]:
def pvalue(distribution, observed):
    ...

# Small tests
print(pvalue(make_array(1,2,3,4,5), 1))
print(pvalue(make_array(1,2,3,4,5), 4))
print(pvalue(make_array(1,2,3,4,5), 10))

#### Part 4.4 (5 pts)


We'll now combine everything into a single function `pvalue_for_tvd_test` that computes the p-value for obtaining the `observed_distribution` of size `observed_sample_size` from the `null_distribution` according to our TVD test statistic.  You should use the helper functions you just wrote.  You may generate a TVD distribution of any reasonable size.  We suggest at least a couple thousand values.  We also give you one additional function to plot the data you will compute.

_Type your answer here, replacing this text._

In [None]:
def plot_results(title, simulated_distribution, observed):
    Table().with_column(title, simulated_distribution).hist()
    plt.scatter(observed, -0.02, color='red', s=70, zorder=2);
    plt.show();            

def pvalue_for_tvd_test(null_distribution, observed_distribution, observed_sample_size):
    observed_tvd = ...
    simulated_tvds = ...
    plot_results("Simulated TVDs", simulated_tvds, observed_tvd) # show plot to help debugging...
    ...

In [None]:
# 1. The bag is not an old bag from 2008:
pvalue_not_from_2008 = pvalue_for_tvd_test(colors.column("2008"), colors.column("Observed"), 712)
print(f"1.  The bag is not an old bag from 2008: {pvalue_not_from_2008}")

# 2. The bag was not produced at the CLV factory:
pvalue_not_from_CLV = ...
print(f"2.  The bag was not produced at the CLV factory: {pvalue_not_from_CLV}")

# 3. The bag was not produced at the HKP factory:
pvalue_not_from_HKP = ...
print(f"3.  The bag was not produced at the HKP factory: {pvalue_not_from_HKP}")

In [None]:
grader.check("q4.4")

#### Part 4.5 (5 pts)


Using those p-values and a 5% cutoff, assign `True` to each of the following variables if the corresponding statement is consistent with the data, and `False` otherwise.

In [None]:
bag_from_2008 = ...
bag_from_CLV = ...
bag_from_HKP = ...

In [None]:
grader.check("q4.5")

Now that you have written these functions, testing similar hypotheses takes just a single line of code.  For example, we can revisit Question 2 and compute the p-value for that test as follows:

In [None]:
pvalue_for_tvd_test(covid.column("Population (%)"), covid.column("COVID-19 deaths (%)"), 100)

You're done with Lab 7!  

**Important submission steps:** 
1. Run the tests and verify that they all pass.
2. Choose **Save** from the **File** menu, then **run the final cell**. 
3. Click the link to download the zip file.
4. Go to [Gradescope](https://www.gradescope.com/) and submit the zip file to the corresponding assignment. The name of this assignment is "Lab 7 Autograder". 

**Be sure your work is saved before running the last cell.**

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export()