# Notebook 4: Testing Explanations 


## 4.1: Experimenting with our Data

At this point, we have developed several hypotheses, or educated guesses, about the factors that caused people to contract cholera. We noted that some explanatory variables correlate with our death outcome variable. What we want to determine now is which of these relationships, if any, are ***statistically significant***. 

__In this notebook__, we deep-dive into our hypotheses about how cholera spread in London. We seek to separate signal from the noise. That is, we will show that some hypotheses are likely a better fit for the data and are harder to reject, in a statistically significant way, than others. 

<br>

<table><tr>
    <td> <img src="imgs/santa_p.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

<br>

**By the end of this notebook, you should be able to**: 
- Understand the idea of experiments
- Create and interpret contingency tables and the $Chi^2$ statistic
- Apply this to two theories about how cholera was transmitted
- Create data visualizations
<br><br>

Let's begin by loading our data...


In [None]:
import pandas as pd
from matplotlib import pyplot as plt

house_data = pd.read_csv('https://raw.githubusercontent.com/uchicago-dsi/2023-data4all/main/Datasets/deaths_by_house.csv')
house_data

This data looks different than our prior data. This is because people in charge of the city’s sewers went door-to-door in a neighborhood hard hit by cholera deaths to assess the claim that toxic fumes from its sewers were causing the deaths. They collected data from 1,852 households in total, described as follows: 
- **deaths_r:** the number of deaths of **r**esidents of the house. 
- **deaths_nr:** the number of deaths of **n**on-**r**esidents (visitors) of the house. 
- **deaths:** the total deaths of both residents and non-residents. 
- **pestfield:** houses near the pest field, which some believed was emitting toxic air from people buried there after dying of the plague.
- **dis_pestf:** distance (in meters) from the nearest pest field (1m ~ 3.3 ft). 
- **dis_sewers:** distance (in meters) from the nearest sewer. 
- **dis_bspump:** distance (in meters) from the Broad St pump.

<br>

<table><tr>
    <td> <img src="imgs/doors.jpeg" alt="Drawing" style="width: 500px;"/> </td>
</tr></table>
<br>

<br>

Here, we should pause to discuss an important aspect of data science: ***Data problems like errors, bias, or omissions.***

<br>

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4a:** Identifying Problems in our Data </font>

**What data problems (e.g., errors, bias, or omissions) might be present in John Snow's manually-collected household data set?** 

> Write your answer here! 

## 4.2: Introducing the Idea of Experiments

You used correlations before to explore whether there is a positive, negative (or no!) relationship between two variables. You also assessed how strong this relationship is and whether it is statistically significant. In this case, the variables had values that ranged across a whole continuum of numbers, i.e. they were continuously distributed. 

What you’ll do in this notebook is convert continuously distributed data into categories (categorical data). Why? Because grouping your data lets you compare the outcome in one group to that of another group, contingent on a condition. The condition is the potential explanation you want to explore across groups to see whether it has a differential impact. 

For instance, you could create groups based on whether people died of a disease or not, and whether they lived in high-density areas or not. That gives you a 2x2 crosstab (also called a ***contingency table***) with four groups. After you add the count of the number of people in each of the four crosstab cells, you are ready to make a two-way comparison: Did more people die in the high-density than in the low-density areas, and were there more survivors in the low-density than in the high-density areas? The idea here is that one group is exposed to a condition (density) more than another group. When you run an actual experiment, the exposed group is often called the treatment group, while the unexposed group is the control group.


<img src="imgs/image 4.2a.png" style="width: 800px;"/>
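As a minimal sketch of how such a 2x2 table can be built, the example below counts people in each of the four cells with the `crosstab` function from `pandas`. The column names `density` and `died` and all of the values are made up for illustration; they are not taken from the cholera data.

```python
import pandas as pd

# Made-up records, one row per person (illustrative only).
people = pd.DataFrame({
    "density": ["high", "high", "high", "high", "low", "low", "low", "low"],
    "died":    ["yes",  "yes",  "yes",  "no",   "yes", "no",  "no",  "no"],
})

# Count how many people fall into each combination of the two categories.
toy_table = pd.crosstab(people["density"], people["died"])
print(toy_table)
```

Each cell of `toy_table` holds the number of people who share one density category and one outcome, which is exactly the information a contingency table is meant to summarize.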
    

In the case of correlations, `pandas` gave you the correlation coefficient $r$ that indicated the direction and strength of the linear relation between two variables. A $p$-value indicated whether or not this relation is **statistically significant**, i.e. different from what you would get for random data patterns.

---------------------------------------------
### A quick dive into p-values... 

A $p$-value measures how probable it is that you would see a result at least as extreme as the one you observed if nothing but random chance were at work. The smaller the $p$-value, the harder it is to explain your result by chance alone. Hence, a lower $p$-value indicates stronger statistical significance than a higher one.

Statisticians often select a ‘threshold’ value that indicates whether a $p$-value is significant enough. While this threshold can vary across scientific disciplines or use cases, a common threshold is 0.05. This means you call a result significant only if random chance would produce a result at least that extreme less than 5% of the time, so you accept a 5% risk of reporting a relationship that is not really there.

As a specific example, say you run a $Chi^2$ test to determine whether people who eat more gummy bears are also more likely to die of cholera than people who eat fewer gummy bears. If your threshold is 0.05 and the observed $p$-value is 0.09, then we state that the difference between the two groups is not statistically significant, because 0.09 > 0.05.
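The comparison in the gummy bear example can be sketched in a few lines of Python. The helper name `is_significant` and the variable names are hypothetical and chosen only for this illustration; the two numbers are the ones from the example.

```python
# Values from the gummy bear example (not real data).
gummy_threshold = 0.05
gummy_p_value = 0.09

def is_significant(p, alpha):
    """Return True when the p-value falls below the chosen threshold."""
    return p < alpha

print(is_significant(gummy_p_value, gummy_threshold))
```

Because 0.09 is not below 0.05, the check reports that the result does not count as statistically significant at this threshold.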

---------------------------------------------


To assess whether the relation between an outcome and explanatory variable differs between the groups in the contingency table, we need a new statistic and significance test for categorical data. The test statistic we will use is called >> $Chi^2$ <<. While we won't go into too much detail here, those who are interested can find more information about $Chi^2$ at the end of this notebook! 
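For readers who want a peek under the hood now, here is a rough sketch of how a $Chi^2$ statistic can be computed by hand for a 2x2 table. The counts are made up for illustration. The sketch assumes the plain (uncorrected) statistic; note that `chi2_contingency` from `scipy` applies a continuity correction to 2x2 tables by default, so its numbers will differ slightly unless `correction=False` is passed.

```python
import math

# Made-up counts (illustrative only): rows are groups, columns are outcomes.
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count in each cell if group and outcome were unrelated.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared: sum of (observed - expected)^2 / expected over all cells.
chi2_by_hand = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# For a 2x2 table (1 degree of freedom) the p-value has this closed form.
p_by_hand = math.erfc(math.sqrt(chi2_by_hand / 2))

print(f"Chi-squared: {chi2_by_hand:.3f}")
print(f"p-value: {p_by_hand:.6f}")
```

The further the observed counts are from the expected counts, the larger the statistic becomes and the smaller the resulting $p$-value.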


## 4.3: Investigating the Sewers

The first theory we will explore assumes that cholera is airborne and that people get infected by inhaling toxic fumes from localized sources. In this case, the source is fumes emitted from sewage lines through gully holes. 

If this theory were true, then living closer to a sewer would make a person more likely to inhale the toxic air and contract cholera. For simplicity, let us assume someone is 'close' to a sewer if they are at most 40 feet (12.2 meters) from it; otherwise they are 'far'.

<table><tr>
    <td> <img src="imgs/channel-sewer.jpeg" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>
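To make the 'close' versus 'far' rule concrete, the sketch below converts the 40-foot cutoff into meters and labels a handful of made-up distances. The data frame `toy` and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

FEET_TO_METERS = 0.3048  # exact conversion factor

# Made-up distances in meters (illustrative only).
toy = pd.DataFrame({"distance_m": [3.0, 12.2, 12.3, 40.0]})

cutoff_m = round(40 * FEET_TO_METERS, 1)  # 40 feet is about 12.2 meters

# Label each row 'close' when at or below the cutoff, otherwise 'far'.
toy["proximity"] = toy["distance_m"].apply(
    lambda d: "close" if d <= cutoff_m else "far"
)
print(toy)
```

This turns a continuous measurement into a two-valued category, which is the step that makes a contingency table possible.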

Now we will learn how to conduct a $Chi^2$ analysis using a contingency table in order to determine if there are statistically significant differences in "death" outcomes between those who are close-to versus far-from a sewer!


##### Let us first filter our data to create our contingency table!

In [None]:
# This is a *function* that allows us to visualize our table. We do not yet discuss functions in detail. 

def visualize_contingency_table(contingency_table, top_labels, left_labels):
    # Print the header row: an empty corner cell followed by the column labels.
    print('{:<20s} {:<20s} {:<10s}'.format(top_labels[0], top_labels[1], top_labels[2]))

    # Print one line per table row, starting with that row's label.
    for label, line in zip(left_labels, contingency_table):
        print('{:<20s} {:<20s} {:<10s}'.format(label, str(line[0]), str(line[1])))
    print("\n")

In [None]:
# We first want to 'filter' our data frame to see only people who are close/far from sewer. 
sewer_close_deaths_df = house_data.loc[house_data['dis_sewers'] <= 12.2] 
sewer_far_deaths_df = house_data.loc[house_data['???'] > ???] 

# We next want to calculate the number of deaths in each class. 
sewer_close_deaths = sewer_close_deaths_df['deaths'].???() # what function is this? 
sewer_far_deaths = sewer_far_deaths_df['deaths'].???() # what function is this?


# We next want to 'filter' for non-deaths by determining if no residents OR non-residents died at a house. 
# ... we'll provide this one as it's a bit trickier!
sewer_close_nondeaths = sum((sewer_close_deaths_df.deaths == 0))
sewer_far_nondeaths = sum((sewer_far_deaths_df.deaths == 0))


print(f"Number of deaths close to sewers: {sewer_close_deaths}")
print(f"Number of deaths far from sewers: {sewer_far_deaths}\n")

print(f"Number of nondeaths close to sewers: {sewer_close_nondeaths}")
print(f"Number of nondeaths far from sewers: {sewer_far_nondeaths}\n\n")

In [None]:
# Now let's put it all together into a contingency table with the following shape! 
#            | Deaths | Non Deaths |
# Close      |    A   |     B      |
# Far        |    C   |     D      |

# Each row is one proximity group (Close, Far); each column is one outcome.
contingency_table = [
    [sewer_close_deaths, sewer_close_nondeaths],
    [sewer_far_deaths, sewer_far_nondeaths]
]

left_labels = ["Close", "Far"]
top_labels = [" ", "Deaths", "Non Deaths"]

print("Our contingency table...")
visualize_contingency_table(contingency_table, top_labels, left_labels)

In [None]:
from scipy.stats import chi2_contingency
# Now let us get our p-value! 
# ... when doing data science in Python, it is common convention to use
#     "_" characters to mark variables whose values we don't need. 
chi_square, p_value, _, _ = chi2_contingency(contingency_table)
print(f"Our p-value: {p_value}")
print(f"Our Chi-squared value: {chi_square}")

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4b:** Interpreting p-value for $Chi^2$ test: Sewers</font>

**Based on the $p$-value of your $Chi^2$ test, is the relationship you observe between deaths and closeness to sewers significantly different from what you would expect if deaths were unrelated to closeness to sewers? (Use a 95% confidence level.)** 

> Write your answer here! 


An important part of data science is not only determining statistical significance of hypotheses, but also communicating your findings to people without a statistics background. 

Imagine reading a newspaper headline (like below) that says ’The $p$-value was below $0.05$’... the average person does not know what this means! Visualizing your results is an important step in convincing others that your evidence is compelling! In the following, we create (and interpret) data visualizations that make it easier to understand your statistical results.

<table><tr>
    <td> <img src="imgs/funny_paper.jpeg" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>



We first explore a **histogram** -- a type of bar graph used to show differences in the frequency (or count) of various events. (In this case, the events are deaths and non-deaths of people close and far from the sewer). 

In [None]:
# Histogram

# Let's calculate the share of deaths that are 'close' versus 'far'. 
#    Close Deaths + Far Deaths should sum to 1! 
# (then we can do the same for non-deaths)
sewer_close_deaths_pct = sewer_close_deaths / (sewer_close_deaths+sewer_far_deaths)
sewer_far_deaths_pct = 1 - sewer_close_deaths_pct

sewer_close_nondeaths_pct = sewer_close_nondeaths / (sewer_close_nondeaths+sewer_far_nondeaths)
sewer_far_nondeaths_pct = 1 - sewer_close_nondeaths_pct


# 1. Let's first view the CLOSE deaths vs nondeaths. 
plt.bar(x=['close_deaths', 'close_nondeaths'], 
        height=[sewer_close_deaths_pct, sewer_close_nondeaths_pct], color='purple', label='close')

# 2. Let's next view the FAR deaths vs nondeaths.
plt.bar(x=['far_deaths', 'far_nondeaths'], 
        height=[sewer_far_deaths_pct, sewer_far_nondeaths_pct], color='gold', label='far')
plt.ylim((0,1))
plt.title("Deaths and Nondeaths (Close and Far): HISTOGRAM")
plt.legend()

As an alternative to histograms, line graphs can be used to display the same data. See below...

In [None]:
# Line graphs

plt.plot(['Close', 'Far'], [sewer_close_deaths_pct, sewer_far_deaths_pct], label='deaths')
plt.plot(['Close', 'Far'], [sewer_close_nondeaths_pct, sewer_far_nondeaths_pct], label='nondeaths')
plt.legend()
plt.title("Deaths and Nondeaths (Close versus Far): LINE GRAPH")

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4c:** Interpreting the Visualizations</font>

**Do the histogram and line graph reach the same conclusion as your $Chi^2$ analysis? Why or why not?** 

> Write your answer here! 

<br>

<br>

--------------

### The 3 second rule
The 3 Second Rule (https://stephanieevergreen.com/the-3-second-rule/) states that one gets 3 seconds to grab someone’s attention and flag the take-home point of a data visualization. 
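One possible way to apply this idea in `matplotlib` is sketched below: state the take-home point in the title and use color to draw the eye to the bar that matters. The group names and numbers are made up for illustration and are not results from the cholera data; the styling choices are only one option among many.

```python
from matplotlib import pyplot as plt

# Made-up shares (illustrative only, not results from the cholera data).
groups = ["Group A", "Group B"]
shares = [0.70, 0.30]

fig, ax = plt.subplots()
# Highlight the bar that carries the message; mute the other one.
ax.bar(groups, shares, color=["crimson", "lightgray"])
ax.set_ylim(0, 1)
ax.set_ylabel("Share of cases")
# State the take-home point in the title instead of describing the axes.
ax.set_title("Group A accounts for most cases")
```

A reader who glances at this chart for only a moment still sees the highlighted bar and the conclusion stated in the title.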

--------------

<br>

<br>


<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4d:** The BETTER visualization</font>

**Which visualization (histogram or line graph) better follows the 3-second rule?** 

> Write your answer here! 

## 4.4: Investigating the Broad Street Pump

Next, we want to explore the theory that cholera was transmitted through contaminated water. At the time, John Snow guessed that the water of a particular pump, the Broad Street Pump, might have carried pieces of poisonous sewage. Was this true? 

<table><tr>
    <td> <img src="imgs/pump3.jpeg" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

Now that you have experience performing a $Chi^2$ analysis, this one will be largely independent! In the following cells, you will need to: 
- filter your data
- create a contingency table
- generate (and comment on) the statistical significance
- provide one visualization (two, if time!) to convince your audience that your conclusion is plausible! 

**Pro-tip:** the most important tool in any data scientist's toolbox (even more important than Python) is copy-and-paste! 

In [None]:
# Filter data here. 
# For pump closeness, use a distance of 160 meters! 


In [None]:
# Create contingency table here. 


In [None]:
# Find p-value here. 


<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4e:** Interpreting p-value: Broad Street Pump</font>

**Based on your $p$-value, is the relationship you observe between deaths and closeness to the Broad Street pump significantly different from what you would expect if deaths were unrelated to closeness to the pump? (Use a 95% confidence level.)** 

> Write your answer here! 

In [None]:
# Create histogram here. 


In [None]:
# Create line graph here. 


<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 4f:** Interpreting the Visualizations</font>

**Do the histogram and line graph reach the same conclusion as your $Chi^2$ analysis? Why or why not?** 

> Write your answer here! 

<br>

<br>

--------------

## 4.5: Reflection
<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 4g:** Reflection </font>

At the end of each notebook in Data4All, we will take time to reflect on what we learned! You can write as much or as little as you like, but please respond to the following... 

**What do you understand better after this notebook than before?**
> Write your answer here! 

**Please fill out the Notebook survey here!**
> https://forms.gle/54KHEbPGsRxQU3Bh9

<br>

--------------------------------

<br>

<img src="imgs/save-icon.jpeg" alt="Drawing" align=left style="width: 20px;"/> <font size="4">     **&ensp;&ensp;&ensp;Last step: save your work!** </font>