# Notebook 2: Exploring the Data I


## 2.1: Seeing the Problem in Data

Finding the origins of cholera's spread was a controversial issue in the 1800s. Before even seeing the problem in the data, the people of that time could see cholera **all around them**. As friends and relatives grew gravely ill, it became urgent to discover **why** this was happening in hopes of putting a stop to it.

People, including the local media, had different ideas about what could be causing cholera. For instance, take a look at the following political cartoon of the time:
<br>

<table><tr>
    <td> <img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/king_cholera.png?raw=true" alt="Drawing" width="600px"/> </td>
</tr></table>

<br>

<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=10px/> <font size=4> **Journal 2a:** Interpret the Cartoon </font>

**What do you think is the underlying message of this cartoon?**

> Write your answer here!

<br>

-------------------------------------------------------------------------------------------------

<br>

**By the end of this notebook, you should be able to:**
- Understand what the problem is through the data
- Normalize data and understand why it’s important to do so
- Create an outcome variable with real-world data

<br>

### Data Science 101: Finding the Problem

Data scientists have curious minds; when confronted with a problem in the 'real world', they first try to understand that problem better with data (before, of course, looking for answers and solutions).

After all, it's a lot easier for superheroes to solve a mystery when there's a signal illuminating what is driving the problem.

<img src = "https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/bat-signal.jpeg?raw=true" width="700"/>


<br><br>

Using the Pandas toolkit, let's see if we can dig into our data to gain any insight as to ***when*** cholera outbreaks occurred. The data below show, for each year, how many people lived in London and how many people died there (of any cause).

In [1]:
# Load our Pandas data science library
import pandas as pd

In [2]:
# Load data about London
London = pd.read_csv("https://github.com/uchicago-dsi/2023-data4all/blob/main/Datasets/London.csv?raw=true")
London

Unnamed: 0,year,population,deaths
0,1840,1842458,46281
1,1841,1877963,45284
2,1842,1916860,45272
3,1843,1953787,48574
4,1844,2033816,50423
5,1845,2073298,48332
6,1846,2113535,49089
7,1847,2195401,60442
8,1848,2238703,57628
9,1849,2282858,68432


<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width= 20px/> <font size=4> **Journal 2b:** Thinking about the Data </font>

**In 1-3 sentences, comment on what you see in these data. Is there enough here to determine when cholera outbreaks occurred? Why or why not?**

> Write your answer here!

### The importance of normalization

`deaths` are higher in 1854 than in 1840. Is this because of cholera? *Maybe*. Is this because of population growth? *Possibly*. Simply put, if you have more people in a city, then more people die.

For instance, if `40,000` people die in Chicago each year (pop. 3,000,000), and only `50` people die each year in the small town of Salem, NJ (pop. 5,000) ... then are you `40,000 / 50 = 800` times less likely to die in Salem?!


<table><tr>
    <td> <img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/chicago.jpeg?raw=true" alt="Drawing" width="500"> </td>
    <td> <img src="https://raw.githubusercontent.com/uchicago-dsi/2023-data4all/published/imgs/salemNJ.png?raw=true" alt="Drawing" width="500"> </td>
</tr></table>

This example highlights the importance of ***normalization***: adjusting the values of data so that they are on the **same scale**. In this case, that lets you compare the chance of dying in the much larger city of Chicago vs. the much smaller town of Salem.
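
To make the comparison concrete, here is a minimal sketch that puts both places on the same scale by computing deaths per 1,000 residents. The numbers are the illustrative ones from the example above (not real statistics), and the helper name `rate_per_1000` is made up for this sketch.

```python
# Illustrative figures from the example above (not real statistics).
chicago_deaths, chicago_population = 40_000, 3_000_000
salem_deaths, salem_population = 50, 5_000

def rate_per_1000(deaths, population):
    """Return the number of deaths per 1,000 residents."""
    return deaths / population * 1000

chicago_rate = rate_per_1000(chicago_deaths, chicago_population)
salem_rate = rate_per_1000(salem_deaths, salem_population)

print(f"Chicago: {chicago_rate:.1f} deaths per 1,000 people")
print(f"Salem:   {salem_rate:.1f} deaths per 1,000 people")
print(f"Chicago's rate is {chicago_rate / salem_rate:.2f} times Salem's")
```

Once both places are on the same scale, Chicago's rate comes out only about 1.3 times Salem's, nowhere near 800 times.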


<br><br>
<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=20px/> <font size=4> **Journal 2c**: How to Normalize
    
**For the London data, how might you normalize to find years when cholera is particularly fatal?**</font>

> Write your answer here!

<br><br>

## 2.2: Creating an Outcome Variable
An **outcome variable** is the variable that we want to explain using other variables! You can **normalize an outcome variable** to avoid the influence of population discussed earlier.

Let's return to the London example...

Since we are interested in reasons why people die of cholera, `deaths` seems like a logical choice for our outcome variable!

BUT different years have different populations, which leads us to the "Chicago-vs-Salem" dilemma from before...

We can easily normalize `deaths` by using the `population` variable to create a `death rate`, also known as "Mortality Rate".

This is done with the following calculation:
$$death \ rate = {deaths \over population} \times 1000$$

which is the same as

$$death \ rate = {deaths \div population} \times 1000$$
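
As a quick check of the formula, the sketch below works through a single year by hand, using the 1840 row from the London table shown earlier. The variable names are just for illustration.

```python
# Values for 1840, copied from the London table shown earlier.
deaths_1840 = 46281
population_1840 = 1842458

# deaths divided by population, then scaled to "per 1,000 people"
death_rate_1840 = deaths_1840 / population_1840 * 1000

print(round(death_rate_1840, 2))  # about 25.12 deaths per 1,000 people
```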

**Task:** complete the following cell. The code is going to repeat the death rate calculation for each row (15 times, total). If done correctly, you should have a new column called "deaths_per_1000", with a value for each year.

In [None]:
# Calculate death (mortality) rate per 1,000 people.
# The "/" divides each value in the "deaths" column by the "population" value in the same row.
# The "*" then multiplies each of those results (our new outcome variable) by 1000.

London['deaths_per_1000'] = London['deaths'] / London['population'] * 1000

London

Unnamed: 0,year,population,deaths,deaths_per_1000
0,1840,1842458,46281,25.119161
1,1841,1877963,45284,24.113361
2,1842,1916860,45272,23.617792
3,1843,1953787,48574,24.861461
4,1844,2033816,50423,24.792312
5,1845,2073298,48332,23.311651
6,1846,2113535,49089,23.226017
7,1847,2195401,60442,27.531189
8,1848,2238703,57628,25.741691
9,1849,2282858,68432,29.976459
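
One way to scan for unusual years is to sort the table by the new column. The sketch below rebuilds a few rows from the table above so that it runs on its own (the name `sample` is made up for this sketch); in the notebook, you could call `sort_values` directly on `London` instead.

```python
import pandas as pd

# A few rows copied from the London table above, so this cell is self-contained.
sample = pd.DataFrame({
    "year": [1846, 1847, 1848, 1849],
    "population": [2113535, 2195401, 2238703, 2282858],
    "deaths": [49089, 60442, 57628, 68432],
})
sample["deaths_per_1000"] = sample["deaths"] / sample["population"] * 1000

# Sort so that the years with the highest death rate come first.
ranked = sample.sort_values("deaths_per_1000", ascending=False)
print(ranked[["year", "deaths_per_1000"]])
```

Keep in mind that these are deaths from any cause, so a high death rate on its own doesn't prove that cholera was responsible.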


<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=10px/> <font size=4> **Journal 2d**: The magic of 'per'
    
**Explain in your own words why `deaths_per_1000` is a better outcome variable than `deaths`.**</font>

> Write your answer here!


<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=10px/> <font size=4> **Journal 2e**: Identifying the Outbreak(s)
    
**Based on your new outcome variable, in which years do you think there were cholera outbreaks?**</font>

> Write your answer here!

----------------------------------------------

<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/pencil.png?raw=true" alt="Drawing" align=left width=10px/> <font size="4">**Journal 2f:** Reflection </font>

**As you look at the data, briefly describe the "problem" facing the people of London.**
> Write your answer here!

**Please fill out the Notebook survey here!**
> https://forms.gle/54KHEbPGsRxQU3Bh9

<br>

--------------------------------

<br>

<img src="https://github.com/uchicago-dsi/2023-data4all/blob/main/imgs/save-icon.jpeg?raw=true" alt="Drawing" align=left width=20px/> <font size="4">     **&ensp;&ensp;&ensp;Last step: save your work!** </font>