
# Notebook 1: Seeing the Problem as Data


## 1.1: What is Data?

Like the word **information**, **data** is a general word that can refer to many different things.

In this workshop, we’ll focus on data as something **collected** or **recorded** about the **real world**, to better observe and understand the world. 


### Data in the real world
Data can be found all around us in daily life. Weather stations collect data about the weather, which meteorologists use to forecast upcoming conditions. Netflix's algorithms can use data about your viewing history to recommend new movies and shows that better match your interests. 

<br>

<table><tr>
    <td> <img src="imgs/netflix-image.jpeg" alt="Drawing" style="width: 400px;"/> </td>
    <td> <img src="imgs/weather-image.jpeg" alt="Drawing" style="width: 400px;"/> </td>
</tr></table>

<br>

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4> **Journal 1a:** Data in YOUR world </font>

**Give an example of data in your daily life, e.g. data that you collect or that’s collected about you. Explain where it comes from, how it's collected/recorded, and what it's used for.** 

> Write your answer here! 


### Data from the 1854 London cholera outbreak

Cholera is a disease that we now know how to treat. But before this treatment was discovered, around the time of your great-great-grandparents, only 1 in 2 infected people survived. At that time, in mid-19th-century London, doctors, health inspectors, pastors, and many others desperately tried to find the cause of cholera to stop people from dying. One of them was **John Snow**, a physician who started recording the following information about London households:

![](imgs/img-ledger.png)

<!-- As you can see, there are data about the following (in order):
- House number
- Neighborhood
- Occupation
- Age
- Symptoms
- Water Supplier -->

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size=4>**Journal 1b:** Taking inventory: what do we have?</font>

**Describe the data you see in John Snow's notes (the photo). What types of data are present? Are there other types of data you would want to collect if you were trying to find out how cholera is transmitted?** 

> Write your answer here! 

**Now we will embark on our own journey to trace the origins of the cholera epidemic, this time using a 21st-century data science toolkit!**
<br><br>
In the next cell, please add your information to get started! 
<br><br>

In [1]:
# Change this to be your name! 
first_name = "John"
last_name = "Snow"

# print is the simplest Python function -- it allows us to view our data! 
print(f">>> Hello world, my name is {first_name} {last_name}!")

>>> Hello world, my name is John Snow!


## 1.2: Representing Data on the Computer

In this **Jupyter notebook**, we will begin our data science journey via the **Python programming language**. 

<img src="imgs/python-image.png" alt="Drawing" style="width: 400px;"/>


**By the end of this notebook, you should be able to**: 
- Represent data from the real world in Python
- See how data are stored, grouped, and organized for data science
- Understand and manipulate variables and lists
<br><br>

**Data representation in Python**
The following cell represents some of the data from above in Python code. Setting the *{}* and *:* symbols aside for now, each entry should map to a column in John Snow's notes shown above. 

In [2]:

person_0 = {"house_number": 7, 
            "neighborhood": "Layton's Buildings",
            "date": "July 29", 
            "occupation": "tailor",
            "age": 20, 
            "symptoms": "cholera 17 hours",
            "water_supplier": "Southwark & Vauxhall"}

person_1 = {"house_number": 2, 
            "neighborhood": "Dobb's Cross",
            "date": "July 30", 
            "occupation": "son of a shop-keeper",
            "age": 10, 
            "symptoms": "cholera Asiatic 24 hours",
            "water_supplier": "Southwark & Vauxhall"}

person_2 = {"house_number": 81, 
            "neighborhood": "Ann Street",
            "date": "July 29", 
            "occupation": "son of a labourer",
            "age": 12, 
            "symptoms": "cholera 8 hours",
            "water_supplier": "Southwark & Vauxhall"}

# We can 'group' all of these data together by using a 'list'!
people_list = [person_0, person_1, person_2]
print(f"Our list of people: {people_list}")


Our list of people: [{'house_number': 7, 'neighborhood': "Layton's Buildings", 'date': 'July 29', 'occupation': 'tailor', 'age': 20, 'symptoms': 'cholera 17 hours', 'water_supplier': 'Southwark & Vauxhall'}, {'house_number': 2, 'neighborhood': "Dobb's Cross", 'date': 'July 30', 'occupation': 'son of a shop-keeper', 'age': 10, 'symptoms': 'cholera Asiatic 24 hours', 'water_supplier': 'Southwark & Vauxhall'}, {'house_number': 81, 'neighborhood': 'Ann Street', 'date': 'July 29', 'occupation': 'son of a labourer', 'age': 12, 'symptoms': 'cholera 8 hours', 'water_supplier': 'Southwark & Vauxhall'}]
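
The `{}` structure above is a Python dictionary: each label (a *key*) before the `:` points to the value after it. As a minimal sketch of how to read values back out, the cell below builds a shortened record and looks up values by key and by list position. The names `example_person` and `example_list` are made up here for illustration and are not part of the notebook.

```python
# A shortened copy of one record, so this cell runs on its own.
# (example_person / example_list are illustrative names only.)
example_person = {"house_number": 7,
                  "neighborhood": "Layton's Buildings",
                  "age": 20}

# Look up a single value by its key
print(example_person["age"])            # 20

# Lists are indexed by position, starting from 0
example_list = [example_person]
first_person = example_list[0]
print(first_person["neighborhood"])     # Layton's Buildings
```

The same pattern should work on the full records: pick a person out of the list by position, then pick a value out of that person by key.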


<img src="imgs/break.jpeg" alt="Drawing" align=left style="width: 100px;"/> <font size="4">**WE ARE GOING TO TAKE A SHORT BREAK HERE!**</font> 

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 1c:** The data whisperer...</font>

## 1.3: Let's *slow down* and smell the... data types?

So now we ask the question... ***what was actually going on in the previous code?***

In this section, we are going to see if we can understand the basic units of Python data storage: 
- types
- variables
- lists
- lists of lists


--------------
There are several **types** in Python that can be used to represent data values. A handful of them are outlined in the following table:


| Type | Description | Examples |
| :-- | :-- | :-- |
| String ('str') | A sequence of characters; written within quotes ("") | "cat", "London", "27" |
| Integer ('int')| Whole numbers (positive, negative, or zero) *without* a decimal point | -50, 0, 27 |
| Float ('float') | Real numbers that *can* have digits after the decimal point | -50.0, 0.75, 3.14159 |
| List ('list') | A sequence of values of any Python type; stored within \[ \] | \["a", "b", "c"\] |

<br><br>
We can see the type of data in Python by using the `type()` function as shown in the following cell: 

In [3]:
# Types
print(type("London"))
print(type(37))
print(type("37"))
print(type(37.0))
print(type(['a', 'b', 'c']))

<class 'str'>
<class 'int'>
<class 'str'>
<class 'float'>
<class 'list'>
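
Why does the difference between `37` and `"37"` matter? As a small sketch (the variable names here are illustrative), the `+` operator adds numbers but joins strings end to end, and the built-in `int()` function converts a string of digits into an integer.

```python
number_sum = 37 + 37        # adds two integers
text_sum = "37" + "37"      # joins two strings end to end

print(number_sum)   # 74
print(text_sum)     # 3737

# int() turns a string of digits into an integer
converted = int("37")
print(type(converted))  # <class 'int'>
```

So choosing a type is really a choice about what you plan to do with the value later.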


<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 1d:** Learning about types!</font>

**In the following cell, write some example lines that answer the following questions:**

c1. **What is the type of a string that uses '' instead of ""?**
> Write your answer here! 

c2. **Can we have different types in the same list?**
> Write your answer here!

c3. **What happens if we turn "London" into London (no quotes) and try to run the cell?**
> Write your answer here!

In [4]:
# Example for c1 here! 

# Example for c2 here! 

# Example for c3 here!

A **variable** is a named Python storage container that holds a value of a **single type** at a time. For instance, in algebra class, you might be familiar with math equations like the following:  

$2x=50$

... where $x$ is an **integer variable** that represents some value ***(which is???)***. 

In Python, we can set variables to be almost anything that we want. Earlier in this notebook, for instance, we used two string variables to represent your first and last names: 
```
first_name = "John"
last_name = "Snow"
```
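
As a small sketch of how variables behave (the names `city` and `year` are made up for this example), a variable can be used anywhere its value could be used, and it can be given a new value later on.

```python
city = "London"
year = 1854

# A variable can be used wherever its value could be used
print(f"The outbreak happened in {city} in {year}.")

# The type of a variable is the type of the value it currently holds
print(type(city))   # <class 'str'>
print(type(year))   # <class 'int'>

# Assigning again replaces the old value
year = year + 1
print(year)         # 1855
```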

### Coding Exercise: 
Create variables (of appropriate types!) that suit the following prompts. Print their values at the end of the cell!

In [5]:
# Your ZIP code. 

# Your dream college.

# A list of your top 5 favorite musical artists.


# ... Now use the print() command on all 3 variables to see their values! 

<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/>  <font size="4">**Journal 1e:** Variables</font>

**Why did you choose the 'type' you did for each variable?**

> Write your answer for ZIP code here!

> Write your answer for dream college here!

> Write your answer for musical artists here!

## 1.4: Expanding our Data Science Toolbox with 'lists of lists'

As mentioned before, a **list** in Python is a sequence of other Python types, such as `[1, 2, 3, 4, 5]` or `['a', 'b', 'c']`. Interestingly, you can also create a list of lists in Python to create **tabular** data, or data in row-column format. 

We can easily rewrite our data in a 'list of lists' format, as follows:


In [6]:

headers = ["house_number", "neighborhood", "date", "occupation", "age", "symptoms", "water_supplier"]
person_list_0 =  [7, "Layton's Buildings", "July 29", "tailor", 
                  20, "cholera 17 hours", "Southwark & Vauxhall"]

person_list_1 = [2, "Dobb's Cross", "July 30", "son of a shop-keeper", 
                 10, "cholera Asiatic 24 hours", "Southwark & Vauxhall"]

person_list_2 = [81, "Ann Street", "July 29", "son of a labourer",
                 12, "cholera 8 hours", "Southwark & Vauxhall"]

# person_list_3 = ... 

person_2d_list = [person_list_0, person_list_1, person_list_2]
print(f"This is our list of lists: {person_2d_list}")

This is our list of lists: [[7, "Layton's Buildings", 'July 29', 'tailor', 20, 'cholera 17 hours', 'Southwark & Vauxhall'], [2, "Dobb's Cross", 'July 30', 'son of a shop-keeper', 10, 'cholera Asiatic 24 hours', 'Southwark & Vauxhall'], [81, 'Ann Street', 'July 29', 'son of a labourer', 12, 'cholera 8 hours', 'Southwark & Vauxhall']]
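
To pull a single value out of a list of lists, index twice: first pick the row, then the position within that row. The sketch below uses a shortened two-row table (the names `short_headers` and `short_table` are illustrative, not part of the notebook) and also shows how the built-in `zip` can pair each header with its value.

```python
short_headers = ["house_number", "neighborhood", "age"]
short_table = [[7, "Layton's Buildings", 20],
               [2, "Dobb's Cross", 10]]

# First index picks the row, second index picks the column
print(short_table[0])       # [7, "Layton's Buildings", 20]
print(short_table[1][2])    # 10

# Find a column's position from its header name,
# then collect that column's value from every row
age_column = short_headers.index("age")
ages = [row[age_column] for row in short_table]
print(ages)                 # [20, 10]

# Pair each header with the matching value in a row
labelled_row = dict(zip(short_headers, short_table[0]))
print(labelled_row["neighborhood"])   # Layton's Buildings
```

Keeping the headers in a separate list works, but it means every lookup has to count positions by hand, which is one reason a table-aware tool becomes attractive.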


### Coding Exercise: 
Add `person_list_3` to our list of lists above! 


In order to store, explore, and manipulate these **tabular** data, we use **Pandas**, one of the most well-known data science toolkits for Python.

<img src="imgs/pandas.svg.png" alt="Drawing" style="width: 400px;"/>

**We will use lots of Pandas later in this workshop!** For now, let's load the Pandas library into Python and put our list of lists into a Pandas **DataFrame**. 


In [7]:
import pandas as pd

df = pd.DataFrame.from_records(person_2d_list, columns=headers)
df

| | house_number | neighborhood | date | occupation | age | symptoms | water_supplier |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 7 | Layton's Buildings | July 29 | tailor | 20 | cholera 17 hours | Southwark & Vauxhall |
| 1 | 2 | Dobb's Cross | July 30 | son of a shop-keeper | 10 | cholera Asiatic 24 hours | Southwark & Vauxhall |
| 2 | 81 | Ann Street | July 29 | son of a labourer | 12 | cholera 8 hours | Southwark & Vauxhall |
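
Once the data are in a DataFrame, a few standard Pandas operations already help with exploring them. The following is a minimal sketch on a shortened copy of the table; the names `small_headers`, `small_rows`, `small_df`, and `under_18` are illustrative, and the cell assumes Pandas is installed.

```python
import pandas as pd

# A shortened copy of the table, so this cell runs on its own.
small_headers = ["house_number", "neighborhood", "age"]
small_rows = [[7, "Layton's Buildings", 20],
              [2, "Dobb's Cross", 10],
              [81, "Ann Street", 12]]
small_df = pd.DataFrame.from_records(small_rows, columns=small_headers)

# (number of rows, number of columns)
print(small_df.shape)           # (3, 3)

# Select one column by its header and summarize it
ages = small_df["age"]
print(ages.mean())              # 14.0

# Keep only the rows where a condition is true
under_18 = small_df[small_df["age"] < 18]
print(len(under_18))            # 2
```

Notice that columns are selected by header name rather than by counting positions, which is the main convenience over a plain list of lists.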



<img src="imgs/pencil.png" alt="Drawing" align=left style="width: 20px;"/> <font size="4">**Journal 1f:** Reflection </font>

**What did you learn in this notebook?**
> Write your answer here! 

**What would you like to learn in the upcoming Data4All lessons?**
> Write your answer here! 