# Notebook 9: Testing COVID-19 explanations

In the previous notebook, we explored the *correlation* of various explanatory variables with a region's positivity rate. In this notebook, we perform a formal $\chi^2$ analysis (our old friend) to determine whether the relationships behind those correlations are, in fact, statistically significant!


<br>

<table><tr>
    <td> <img src="https://raw.githubusercontent.com/uchicago-dsi/2023-data4all/main/imgs/covid.jpeg?raw=true" alt="Drawing" width="600"> </td>
</tr></table>

<br>


Remember from your John Snow exploration that $\chi^2$ tests are conducted as follows:
1. Create a contingency table containing your outcome variable *and* one or more explanatory variables
2. Plug the contingency table into a $\chi^2$ function
3. Determine whether the p-value is low enough to show that the relationship between your outcome and explanatory variables is statistically significant

... but before we do anything else, let's load our data and libraries!

In [None]:
import pandas as pd

# Next we load our data into a usable format
frame = pd.read_csv("https://raw.githubusercontent.com/uchicago-dsi/2023-data4all/main/Datasets/cov_chi_with_positivity_lite.csv?raw=true")

# Why do we drop NaN (missing) values? A missing value cannot be compared against a median,
# so regions with missing data would silently fall out of our counts.
frame = frame.dropna()
print(f"How many locations are in our data?: {len(frame)}")

## Task 1. Create a contingency table

In this section, we will create a contingency table of our outcome variable (positivity rate) and the explanatory variable that your group selected as part of your proposed explanation. **For a refresher on filtering and contingency tables, feel free to refer back to your completed `Notebook_4.ipynb`!**

In [None]:
# This is a *function* that allows us to visualize our table.

def visualize_contingency_table(contingency_table, top_labels, left_labels):
    # Print the header row: a blank corner cell, then the two column labels
    print('{:<20s} {:<20s} {:<10s}'.format(top_labels[0], top_labels[1], top_labels[2]))

    # Print each row label followed by that row's two counts
    for left_label, line in zip(left_labels, contingency_table):
        print('{:<20s} {:<20s} {:<10s}'.format(left_label, str(line[0]), str(line[1])))
    print("\n")

In [None]:
# Enter the outcome and explanatory variables you chose as part of your proposed explanation.
# Each one should be the name (as a string, in quotes) of a variable listed in Notebook 8.

outcome_var = ???
explan_var = ???

Now let's put it all together into a contingency table with the following shape!

| | above_med_pos | below_med_pos |
| :--- | :---: | :---: |
| above_med_explan | A | B |
| below_med_explan | C | D |
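
To make the shape of this table concrete, here is a small sketch on made-up data. The data frame `toy`, its column names `pos_rate` and `explan`, and all of its values are hypothetical and are not taken from the course dataset. It only illustrates how a median split plus `.shape[0]` can produce the four counts; here, regions that sit exactly on a median are grouped with "below".

```python
import pandas as pd

# Hypothetical toy data: six regions with an outcome and one explanatory variable
toy = pd.DataFrame({
    "pos_rate": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    "explan":   [5, 1, 6, 2, 7, 3],
})

med_pos = toy["pos_rate"].median()     # median of the outcome
med_explan = toy["explan"].median()    # median of the explanatory variable

# Split the regions on the explanatory variable
toy_above = toy[toy["explan"] > med_explan]
toy_below = toy[toy["explan"] <= med_explan]

# Within each split, count regions above / not above the median outcome
A = toy_above[toy_above["pos_rate"] > med_pos].shape[0]
B = toy_above[toy_above["pos_rate"] <= med_pos].shape[0]
C = toy_below[toy_below["pos_rate"] > med_pos].shape[0]
D = toy_below[toy_below["pos_rate"] <= med_pos].shape[0]

toy_table = [[A, B], [C, D]]
print(toy_table)
```

A quick sanity check for any table built this way: the four counts should add up to the total number of regions.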

In [None]:
# First we can create separate dataframes for 'above' and 'below' median explanatory variable

median_explan = frame[???].median() # median explanatory value to measure against
above_med_explan = ???[???[???] ??? ???] # Data frame of all regions whose explanatory variable is above the median
below_med_explan = ???  # Data frame of the remaining regions (at or below the median)

In [None]:
# Now we want to count the number of locations with above- and below-median positivity rates
# within each of the two data frames we just created.
# For simplicity, we label our values according to the table above.

# Hint: you can use .shape[0] to determine how many regions belong to a data frame!

med_positivity = ???  # Calculate median positivity (across all of Chicago) here!

# For example, "A" should be the **count** of all regions in `above_med_explan` that ALSO have above-median positivity!
# Hint: Use a conditional
A_val = above_med_explan[???[???] ??? ???].shape[0]
B_val = ???
C_val = ???
D_val = ???

In [None]:
# Place the variables for each of the counts in the contingency table and run the function to create it.

left_labels = [f'above_med_{explan_var}', f'below_med_{explan_var}']
top_labels = [" ", f'above_med_{outcome_var}', f'below_med_{outcome_var}']

contingency_table = [
    [???, ???],
    [???, ???]
]

visualize_contingency_table(contingency_table, top_labels, left_labels)

## Task 2. Plug the contingency table into a $\chi^2$ function
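
SciPy's `chi2_contingency` takes a table of observed counts and returns four values: the test statistic, the p-value, the degrees of freedom, and the counts we would expect if the two variables were unrelated. The sketch below shows all four on made-up counts; `example_table` is hypothetical and is not course data. Note that for a 2x2 table, SciPy applies Yates' continuity correction by default, which makes the statistic slightly smaller than a hand calculation without the correction.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (not from the course dataset)
example_table = [
    [30, 10],
    [10, 30],
]

stat, p, dof, expected = chi2_contingency(example_table)

print(f"Chi-squared statistic: {stat:.2f}")
print(f"p-value: {p:.6f}")
print(f"Degrees of freedom: {dof}")
print("Expected counts if there were no relationship:")
print(expected)
```

Comparing the observed table to `expected` is a useful habit: the further apart they are, the larger the statistic and the smaller the p-value.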

In [None]:
from scipy.stats import chi2_contingency

# Now let us get our p-value!
# chi2_contingency returns four values: the test statistic, the p-value,
# the degrees of freedom, and the expected counts.
# When doing data science in Python, it is a common convention to use
# "_" for values we don't need.
chi_square, p_value, _, _ = chi2_contingency(contingency_table)


## Task 3. Find, print(), and interpret the p-value.
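
A common convention, assumed here rather than set by this notebook, is to call a result statistically significant when the p-value is below 0.05. The helper `describe_p_value` below is a hypothetical name used only for illustration; it turns a p-value into a plain-English reading at a chosen threshold `alpha`.

```python
def describe_p_value(p, alpha=0.05):
    """Return a plain-English reading of a p-value at significance level alpha."""
    if p < alpha:
        return f"p = {p:.4f} is below {alpha}: the relationship is statistically significant."
    return f"p = {p:.4f} is not below {alpha}: we cannot rule out chance."

# Two made-up p-values, one on each side of the threshold
print(describe_p_value(0.003))
print(describe_p_value(0.40))
```

Keep in mind that a significant result tells us the two variables are associated in this data; on its own, it does not tell us that one causes the other.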

In [None]:
print(f"Our p-value: {???}")
print(f"Our Chi-squared value: {???}")

<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=20px/> <font size="4">**Journal 9a:** Interpreting the test </font>

**Based on this $\chi^2$ test, what claims can you make about the relationship between a region's positivity rate and your explanatory variable?**
> Write your answer here!

-------------------------

<br>

# Now it is up to you...

You explored the correlation of *many* explanatory variables in Notebook 8. Thus far, you have only explored the statistical significance of *one*. As your proposed explanation evolves, you will want to test other explanatory variables while keeping a record of the ones you have already investigated.

In the following cells, determine the statistical significance of the relationship between a region's positivity rate and *other* explanatory variables as your investigation proceeds!
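
One possible way to keep such a record is to wrap the median split in a function and collect each result in a table. The sketch below is only an illustration under stated assumptions: the helper `median_split_table`, the data frame `toy`, and the column names `pos_rate`, `var_a`, and `var_b` are all hypothetical and are not part of the course dataset. The toy data is also far too small for a trustworthy $\chi^2$ test; it is only there to show the bookkeeping.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def median_split_table(df, outcome, explan):
    """Build the 2x2 table of counts for a median split on both columns."""
    above_out = df[outcome] > df[outcome].median()
    above_exp = df[explan] > df[explan].median()
    return [
        [int((above_exp & above_out).sum()), int((above_exp & ~above_out).sum())],
        [int((~above_exp & above_out).sum()), int((~above_exp & ~above_out).sum())],
    ]

# Hypothetical toy data (column names and values are made up)
toy = pd.DataFrame({
    "pos_rate": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "var_a":    [1, 2, 3, 4, 5, 6, 7, 8],
    "var_b":    [1, 8, 2, 7, 3, 6, 4, 5],
})

# Test each explanatory variable and keep a record of every result
records = []
for var in ["var_a", "var_b"]:
    table = median_split_table(toy, "pos_rate", var)
    stat, p, _, _ = chi2_contingency(table)
    records.append({"explanatory_var": var, "chi_squared": stat, "p_value": p})

results = pd.DataFrame(records)
print(results)
```

Keeping every variable you tested in one table, including the ones that turned out not to be significant, makes it easier to report your investigation honestly.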


<br>

<table><tr>
    <td> <img src="https://raw.githubusercontent.com/uchicago-dsi/2023-data4all/main/imgs/covid_soapbox.png?raw=true" alt="Drawing" width="600"> </td>
</tr></table>

<br>

## In other words, it is time for you to show the world what exactly you think causes people to contract COVID-19!

Use as many cells below this as you need!