<a href="https://colab.research.google.com/github/unfamiliarplace/acse-integration/blob/main/data_science/data_science_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Funfamiliarplace%2Facse-integration&branch=main&subPath=data_science%2Fdata_science_2.ipynb&depth=2"  target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Data Science Part 2


### Three data science notebooks

Here's what we've covered and will cover in this series:

Part 1: We examined Python's core data structures and saw some simple visualizations.

**Part 2: We will explore and learn to use Python's dedicated data science tools in more depth.**

Part 3: We will apply our knowledge to a project with multiple steps and visualizations.

### Outline of this notebook

The goal of this notebook is to learn how to use Python's dedicated data science tools, in a way that you can take and reuse with students.

**Introduction**

**1. Representing data:** Choosing the right structures and labels for a dataset.

**2. Analyzing data:** Transforming, selecting, and calculating statistics on data.

**Conclusion**


## Introduction

**Data science** may seem like a complex topic, but it boils down to this:

> There's a ton of data out there! How do I find the information I'm looking for? How do I understand it? And how do I make it relevant?

In other words, data science is the science of handling data in order to *answer a specific question*.

Python allows us to do this by providing us with tools for representing data, manipulating or selecting from it, analyzing it, and creating charts, graphs, maps, and other visualizations that display it. In this notebook, we are learning how to use these tools. (In the next notebook, we will take a closer look at gathering and cleaning data from various real-world sources so that we can put our skills together for a specific purpose.)

### Learning goals

* A1. demonstrate the ability to use different data types, including one-dimensional arrays, in computer programs.

* A2. demonstrate the ability to use control structures and simple algorithms in computer programs.

### Success criteria

* I can choose and implement a structure to represent a dataset in code.

* I can manipulate a data structure in code to select, organize, and analyze data.

> [Source: Ontario Curriculum (2008)](https://www.edu.gov.on.ca/eng/curriculum/secondary/computer10to12_2008.pdf#page=41)


## 1. Representing data

As we saw in the last notebook, often the greatest power in Python comes from the use of external libraries. Two core libraries for data science are `numpy` (Numerical Python) and `pandas` (Python for Data Analysis).

We will focus on `pandas` for this lesson to keep things simple. (However, note that `pandas` is actually built on `numpy`, and uses `numpy` datatypes like an `ndarray` instead of Python's built-in `list`.)

Run this code block to ensure you have `pandas` installed:

In [None]:
# Install numpy and pandas to the current environment
# N.B. pandas should install numpy automatically since it's a prerequisite

%pip install pandas

Next, run this block to import it. Remember that in Jupyter Notebooks, all cells share the same memory, so if we import `pandas` at the start, it will be available in all later blocks.

In [None]:
# Import pandas
import pandas as pd

### Series

There are two core data structures in `pandas`: `Series` and `DataFrame`.

A `Series` is essentially a one-dimensional array, like a `list`. Run this code block to see an example.

In [None]:
# A Series representing some mysterious data
s = pd.Series([200, 210, 215, 230, 291])
print(s)

#### What makes a Series unique?

That output is a little surprising for a one-dimensional array! We notice that it's presented like a table, with one column for the index and another for the data.

That gives us a hint about the first unique quality of a `Series`: You can use any kind of index, not just a numerical count from `0`.

Let's try making a custom index. Run this code block. P.S. Can you guess what this data represents?

In [None]:
# Using strings for the index
s = pd.Series([200, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])
print(s)

Now the "rows" of our table each have a name. You might be thinking now that this is kind of like a dictionary, with key-value pairs, and that's a valid observation.
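
To make the dictionary comparison concrete, here is a small sketch showing that `pandas` can build a `Series` straight from a dictionary. The variable names `data` and `s_from_dict` are just placeholders for this illustration.

```python
import pandas as pd

# The same mysterious data, written as a dictionary first
data = {'Skor': 200, 'Heath': 210, 'Snickers': 215, 'Mars': 230, 'Twix': 291}

# The dictionary keys become the index; the values become the data
s_from_dict = pd.Series(data)
print(s_from_dict)

# Like a dictionary, we can look up a value by its label
print(s_from_dict['Mars'])
```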

**Understanding Check:** Notice that we have the same number of values and labels. What if those two lists were different lengths? Make a guess and then try it out.

Also, there's a weird thing there at the end of our `Series` output: a `dtype`! This datatype is given because a `Series` can only contain *one* type of data for *all* its elements. This is one of the key principles of having clean datasets: a restriction like this helps us avoid comparing apples and oranges, or rather 5s and oranges.

In this case, `pandas` guessed the datatype we wanted, a 64-bit integer. This is a very generous guess because an `int64`'s max value is `2^63 - 1`, or about 9 quintillion. We can optimize our `Series` by specifying the datatype if we know the bounds of our dataset. Let's use `int16`, which gives us a maximum value of `2^15 - 1`, or `32,767`.

In [None]:
# Specifying a datatype
s = pd.Series([200, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'],
              dtype='int16')
print(s)
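
If you'd like to check the bounds of an integer datatype yourself, `numpy` has a helper called `iinfo`. This is only a quick sketch; it assumes `numpy` is installed, which it should be since `pandas` depends on it.

```python
import numpy as np

# Print the smallest and largest value each integer datatype can hold
for int_type in [np.int16, np.int32, np.int64]:
    info = np.iinfo(int_type)
    print(f'{int_type.__name__}: {info.min} to {info.max}')
```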

**Understanding Check:** What would happen if you used an even slimmer datatype of `int8`? Make a guess and then try it out.

#### What can I do with a Series?

Here's a [full reference](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), but let's take a look at some common tasks.

You can run each of these code blocks to see how they work:

In [None]:
# Get all the values
s.values

In [None]:
# Get a specific item
s.loc['Skor']

In [None]:
# Get the length of the Series
s.size

In [None]:
# Get the highest value
s.max()

In [None]:
# Get the lowest value
s.min()

This next one will show you something a Python list can't easily do. We found the maximum and minimum values; what if we want to know which chocolate bars those were? Simple:

In [None]:
# Get the name of the chocolate bar with the highest value
s.idxmax()

In [None]:
# Get the name of the chocolate bar with the lowest value
s.idxmin()

We can also sort a `Series` [by its values](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html) or [by its indices](https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_index.html). (Note that unlike a list, this is *not* done in place unless you specify the keyword argument `inplace=True`.)

In [None]:
# Sorting a Series by value
print(s.sort_values())

In [None]:
# Sorting a Series by index
print(s.sort_index())
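
Here is a small sketch of the "not in place" behaviour, using a made-up three-item `Series` called `s_demo`. It also uses the `ascending=False` keyword argument to sort from highest to lowest.

```python
import pandas as pd

s_demo = pd.Series([215, 200, 291], index=['Snickers', 'Skor', 'Twix'])

# sort_values returns a new, sorted Series...
s_sorted = s_demo.sort_values(ascending=False)
print(s_sorted)

# ...while the original keeps its order
print(s_demo)
```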

We can also do some very cool operations on the whole `Series` at once. Can your grandma's `list` do this?

In [None]:
# Add to all items
s = s + 100
print(s)

In [None]:
# Multiply all items
s = s * 5
print(s)
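
For comparison, here is a sketch of the same kind of addition done with a plain `list`. The names `values`, `list_result`, and `series_result` are placeholders for this illustration.

```python
import pandas as pd

values = [200, 210, 215]

# With a plain list, + means "join two lists", so we need a loop or comprehension
list_result = [v + 100 for v in values]

# With a Series, + is applied to every element
series_result = pd.Series(values) + 100

print(list_result)
print(series_result.tolist())
```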

In fact, we can carry out any function on all elements of a `Series`. Let's get the square root of each value:

In [None]:
# Map a function
from math import sqrt
s = s.map(sqrt)
print(s)

Hey, those decimals aren't very pretty. Let's round them to 2 decimal places:

In [None]:
# Round a Series
s = s.round(2)
print(s)

**Understanding Check:** Try a few more operations on this `Series`: subtraction, division, exponentiation, negation.

Feel free to run this block to reset your `Series` if you need to:

In [None]:
s = pd.Series([200, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])

There is much more we can do with a `Series`, including:

* Filtering data
* Correlating data
* Analyzing data (e.g. averages)

We'll save these for the next section, though.

### DataFrame

This is the other major structure in `pandas`. Once we understand a `Series`, a `DataFrame` is not too complicated: it's just a bunch of `Series` glued together into a table. In other words, it's two-dimensional.

Again, [here's the full reference](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), but we'll just look at a few aspects of it.

Let's start by simply throwing our existing `Series` into a `DataFrame`. When we do so, we'll give it a name, so I'll now reveal what those values represent.

In [None]:
# Creating a DataFrame from a Series
s = pd.Series([200, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'],
              name='Calories per package')
df = pd.DataFrame(s)
print(df)

So far it looks the same. But we can now add another `Series` to it and grow our table.

Note that we don't actually need to name our new `Series` because the `DataFrame` needs a column name anyway. We do need to make sure the indices match, though.

In [None]:
# Add another Series to a DataFrame
s2 = pd.Series([39, 39, 44, 51, 58],
               index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])
df['Grams per package'] = s2
print(df)
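
To see why matching indices matter, here is a sketch with a deliberately mismatched `Series`. The names `df_demo` and `grams` and the shortened data are made up for this illustration. `pandas` lines the values up by label: a row with no matching label gets `NaN` (a missing value), and a label that isn't in the `DataFrame` is left out.

```python
import pandas as pd

df_demo = pd.DataFrame({'Calories': [200, 210]}, index=['Skor', 'Heath'])

# This Series has one matching label ('Skor') and one that doesn't match ('Twix')
grams = pd.Series([39, 58], index=['Skor', 'Twix'])
df_demo['Grams'] = grams

print(df_demo)
```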

There! Now it looks more like a table.

**Understanding Check:** Can you add a column that lists the deliciousness level, on a scale from 1 to 10, for each bar? (You can make up the values, but Mars is definitely more delicious than Snickers.)

In [None]:
# Add a third column for 'Deliciousness'
# TODO ADD YOUR CODE HERE

#### Other ways of making DataFrames

We made a `DataFrame` this way because we looked at `Series` first, but we could also have created one from scratch like this.

In [None]:
# Making a DataFrame without first making a Series

df = pd.DataFrame({
    'Calories per package': [200, 210, 215, 230, 291],
    'Grams per package': [39, 39, 44, 51, 58],
    'Deliciousness': [5, 4, 8, 6, 2]
}, index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])
print(df)

By the way, an alternative — and equally good — way to represent this data would be to take the name of the chocolate bar as an additional column.

If we do so, the indices could be `0, 1, 2, 3, 4`, or we could even use the chocolate bar name in both places. It's a matter of preference.

In [None]:
# Default 0-based numerical index, data as column

df = pd.DataFrame({
    'Chocolate bar': ['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'],
    'Calories per package': [200, 210, 215, 230, 291],
    'Grams per package': [39, 39, 44, 51, 58],
    'Deliciousness': [5, 4, 8, 6, 2]
})
print(df)

In [None]:
# Same data as both index and column

bars = ['Skor', 'Heath', 'Snickers', 'Mars', 'Twix']
df = pd.DataFrame({
    'Chocolate bar': bars,
    'Calories per package': [200, 210, 215, 230, 291],
    'Grams per package': [39, 39, 44, 51, 58],
    'Deliciousness': [5, 4, 8, 6, 2]
}, index=bars)
print(df)

## 2. Analyzing data

All right, let's try to juggle this thing a bit and see what we can make come out of it.

First, in order to do more interesting things, we'll want a bigger dataset. We'll use `pandas`' `read_csv` function to open a comma-separated values file from GitHub that contains some data on the weather as measured in January 2023 at Pearson Airport in Toronto ([source](https://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=51459)):

In [None]:
# Reading from a remote CSV
import pandas as pd
url = "https://raw.githubusercontent.com/unfamiliarplace/acse-integration/main/data_science/data/2023-01_51459.csv"
df = pd.read_csv(url)

*P.S. You can also use your own files for data. If you're using notebooks in an IDE like VSCode, you can just use local files. If you're using Google Colab or Callysto, you can upload files to the instance. Then you would read them by passing the file path to `read_csv`, like this:*

In [None]:
# Reading from a local CSV
import pandas as pd
path = "data/2023-01_51459.csv"
df = pd.read_csv(path)

### Viewing



Before we go ahead and just `print(df)`, we should realize that this could be a long file. If we want to see what sort of thing it contains, we don't need to look at all of it — the first couple of rows should be enough.

To do that, we use `DataFrame.head()`:

In [None]:
# Show the head or first few rows of a DataFrame
print(df.head())
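
A couple of related tools are handy for a first look. This sketch uses a small stand-in `DataFrame` called `df_demo` with made-up values, not the real weather data.

```python
import pandas as pd

# A small stand-in DataFrame (made-up values)
df_demo = pd.DataFrame({'A': range(10), 'B': range(10, 20)})

print(df_demo.head(3))        # only the first 3 rows
print(df_demo.shape)          # (number of rows, number of columns)
print(list(df_demo.columns))  # the column labels
```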

We can see that this DataFrame is numerically indexed, and it has 5 columns: a date, a maximum and minimum temperature in degrees Celsius, rain in millimetres, and snow in centimetres.

First of all, let's set the date as the index. It's not really weather data; it's the label that each row of data should be known by.

In [None]:
# Changing date from column to index
df = df.set_index('Date', drop=True)

In [None]:
print(df.head())
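
As an aside, `read_csv` can also set the index while loading, through its `index_col` argument. Here is a sketch that reads from made-up CSV text (`csv_text`) instead of a real file, so that it runs on its own.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file
csv_text = """Date,Max Temp (C),Min Temp (C)
2023-01-01,4.0,-1.0
2023-01-02,3.5,0.5
"""

# index_col tells read_csv which column to use as the row labels
df_demo = pd.read_csv(io.StringIO(csv_text), index_col='Date')
print(df_demo)
```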

Suppose we want to navigate to some particular data. You might remember `loc` from when we used `Series` above. We can use that here too. I remember that I moved to a new apartment on January 8, but I forget if it was raining that day. Let's find out:

In [None]:
# Getting one row of the frame
print(df.loc['2023-01-08'])

It was not. But it was pretty cold!

**Understanding Check:** Take a closer look at that output. What type do you think it has?

<details><summary>Click to reveal</summary>

It's a `Series`! It has a `dtype` and a `name`, which is the row's label (the date). This might seem strange if you remember that earlier, a `Series` was a column of a `DataFrame`. In other words, the data has been transposed here: the `DataFrame`'s column headers have become the indices of this `Series`!

</details>

Maybe I only care about the minimum temperature and not all that other stuff. Can I get that? Yup — via a pair of indices in a `loc`, much like the coordinates of a point.

In [None]:
# Getting one cell of the frame
print(df.loc['2023-01-08', 'Min Temp (C)'])

Can I get just one *column*? Yes, by directly indexing instead of using `loc`:

In [None]:
# Getting one column of the frame
print(df['Min Temp (C)'])

Notice that this, too, comes back as a `Series` with the row indices intact.
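
Here is a side-by-side sketch of a row and a column pulled from the same table. The `df_demo` table and its values are made up for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'Max Temp (C)': [4.0, 3.5], 'Min Temp (C)': [-1.0, 0.5]},
    index=['2023-01-01', '2023-01-02'])

row = df_demo.loc['2023-01-02']   # one row
col = df_demo['Min Temp (C)']     # one column

# Both are Series; only the name and the index labels differ
print(type(row).__name__, row.name, list(row.index))
print(type(col).__name__, col.name, list(col.index))
```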

By the way, we can sort `DataFrames` just as we can `Series` [by values](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html) or [by indices](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html), with the additional step that we need to specify which column to sort by.

In [None]:
# Sorting DataFrame by column
print(df.sort_values(by='Min Temp (C)'))

In [None]:
# Sorting DataFrame by index
print(df.sort_index())

### Modifying

Let's add a column of our own that might be handy to work with. We have a maximum and a minimum temperature; let's add a range column.

Here's where some of that `Series` magic we saw earlier comes into play! No looping is needed; we can just do an operation on a whole column:

In [None]:
# Insert a new column based on two old ones
range_series = df['Max Temp (C)'] - df['Min Temp (C)']
df['Temp Range (C)'] = range_series
print(df.head())

I might want to place that new column after the other temperature ones. I can reassign the columns like so:

In [None]:
# Reassign column order
df = df[['Max Temp (C)', 'Min Temp (C)', 'Temp Range (C)', 'Rain (mm)', 'Snow (cm)']]
print(df.head())

Next, how about the fact that snow is in cm but rain is in mm? Let's convert one to the other. I think we'll change snow to mm so that we don't have to worry about losing precision.

Hmmm... first, though, I notice that it didn't snow at all in the first few days of January. All the values are `0.0`. That means we won't be able to check our calculation. But we can also use `tail` instead of `head` to preview the last few days, which did have snow.

In [None]:
print(df.tail())

In [None]:
# Updating a column's data
df['Snow (cm)'] = df['Snow (cm)'] * 10

In [None]:
print(df.tail())

Nice — it worked. We'd better update the column label too. How do we do that?

In [None]:
# Rename a column
df = df.rename(columns={'Snow (cm)': 'Snow (mm)'})
print(df.head())

**Understanding Check:** Can you add a final column for total precipitation? You won't need to reorder the columns since it'll already come after rain and snow.

In [None]:
# Add a column for total precipitation
# TODO ADD YOUR CODE HERE

<details>
<summary>Click to reveal solution</summary>

```
df['Total precipitation (mm)'] = df['Rain (mm)'] + df['Snow (mm)']
```
</details>

### Analyzing

Let's start to look around this data a little. While you could print the whole `DataFrame` and inspect it by hand, it might be hard to get a sense of the trends by doing that.

At this stage, we are trying to do an important thing: **identify interesting things we might want to investigate further**. In other words, we want to know some questions we can ask of this data.

We can begin with some stabs in the dark based on intuitions such as:

* It snows a lot in January.

* January is very cold.

#### Does it snow more than it rains?

This one should be pretty straightforward; let's total the rainfall and snowfall amounts and see which is greater.

To do that, we can sum each `Series` and compare them.

In [None]:
# Summing Series
total_rain = df['Rain (mm)'].sum()
total_snow = df['Snow (mm)'].sum()

print(f'Total rain: {total_rain:>5} mm')
print(f'Total snow: {total_snow:>5} mm')

Sure enough, there's almost 9 times as much snow as rain! I wonder how many more *days* are snowy than rainy?

To find that out, we can filter the `DataFrame` by whether the value in the relevant column is above `0`. This introduces us to a powerful feature of `Series`: applying a boolean operation to each element in one pass.

In [None]:
# Filtering a DataFrame
rainy_days = df[df['Rain (mm)'] > 0]
snowy_days = df[df['Snow (mm)'] > 0]

# In DataFrames, size counts every cell (rows times columns).
# To get just the number of rows, take the first element of "shape".
n_rainy_days = rainy_days.shape[0]
n_snowy_days = snowy_days.shape[0]

print(f'Rainy days: {n_rainy_days:>2}')
print(f'Snowy days: {n_snowy_days:>2}')
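
To see what the filter is doing behind the scenes, here is a sketch that looks at the `True`/`False` values directly. The `rain` values are made up. Since `True` counts as `1`, summing the mask is another way to count the matching days.

```python
import pandas as pd

rain = pd.Series([0.0, 2.4, 0.0, 1.2],
                 index=['Mon', 'Tue', 'Wed', 'Thu'],
                 name='Rain (mm)')

# The comparison produces a Series of True/False values, one per element
mask = rain > 0
print(mask)

# Counting the True values
print(mask.sum())

# Using the mask as a filter keeps only the rows where it is True
print(rain[mask])
```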

That's surprising! Even though there was much more snow than rain, there were almost as many rainy days as snowy ones. That raises interesting questions that we could pursue for a deeper understanding.

Before we ask a different question, though, let's check one more factor. I noticed when we looked at the `tail` of the `DataFrame` that some days had both snow *and* rain. How many of those were there?

We can filter like before, and we can use the usual logic to combine conditions (both snow and rain are above `0`). *However*, in `pandas`, we will use slightly different operators when we are trying to combine multiple conditions on individual elements:

| Operation | Normal Python | pandas |
| --- | --- | --- |
| both must be true | `and` | `&` |
| at least one must be true | `or` | `\|` |
| must be false | `not` | `~` |

In [None]:
# Filtering for days when it rained and snowed
both_days = df[(df['Rain (mm)'] > 0) & (df['Snow (mm)'] > 0)]
n_both_days = both_days.shape[0]

print(f'Days when it both snowed and rained: {n_both_days:>2}')
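
Here is a sketch of what goes wrong if we reach for Python's `and` instead, using made-up `rain` and `snow` values.

```python
import pandas as pd

rain = pd.Series([0.0, 2.4, 1.0])
snow = pd.Series([5.0, 0.0, 3.0])

# & compares element by element; the parentheses are required because
# & binds more tightly than >
both = (rain > 0) & (snow > 0)
print(both.tolist())

# Python's `and` asks for a single True/False answer for the whole Series,
# which pandas refuses to guess
try:
    (rain > 0) and (snow > 0)
except ValueError as err:
    print('ValueError:', err)
```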

Hmm... can you make anything of that fact? Or do you need more info?

#### How cold is January?

A simple way to start might be to try averaging the high and low temperatures across the whole month. We can do this by isolating the respective `Series` and using `mean`.

In [None]:
# Averaging the low
print(df['Min Temp (C)'].mean())

In [None]:
# Averaging the high
print(df['Max Temp (C)'].mean())

To get a fuller picture, let's also get the median, range, and mode. We'll define a function to simplify doing this for both low and high.

The mode is a little more complex since (1) we have to round first, and (2) there can be more than one mode, so we should anticipate a list.

In [None]:
# Function for showing all four averages
def show_averages(s: pd.Series, word: str) -> None:
    s_mean = s.mean()
    s_median = s.median()
    s_range = s.max() - s.min()
    s_modes = s.round(0).mode().values

    print(f'{word} mean:\t{s_mean}')
    print(f'{word} median:\t{s_median}')
    print(f'{word} range:\t{s_range}')
    print(f'{word} mode(s):\t{s_modes}')

In [None]:
# More complete averages of the low temperatures
show_averages(df['Min Temp (C)'], 'Low')

In [None]:
# More complete averages of the high temperatures
show_averages(df['Max Temp (C)'], 'High')
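
As an aside, `pandas` also has a built-in `describe` method that reports several of these statistics at once (though not the mode or the range). This sketch uses a made-up `temps` series rather than the real weather data.

```python
import pandas as pd

temps = pd.Series([-3.0, -1.0, 0.0, 2.0, 7.0])

# describe returns a Series of summary statistics:
# count, mean, std, min, 25%, 50% (the median), 75%, max
summary = temps.describe()
print(summary)

# Individual statistics can be looked up by label
print(summary['50%'])
```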

Remember that each average tells a slightly different story. One thing we might notice is that for the lows, the mean and median are much closer to the mode, suggesting coherence: a random day in January probably went down to about -3 degrees Celsius. But the high had a lot of days that were 2-3 degrees warmer than the average.

Also, we notice that the ranges are quite large for both, but especially the low! That suggests that there was at least one very warm or very cold day for January. Can we remove the outliers so that our picture of the average is more accurate?

We'll look at two ways in which we could remove outliers. But first, in order to make this data more interesting, we're going to switch over to a dataset with the full 2022 data to ensure our sample size is sufficient.

Run this next code block to load the bigger dataset and apply all the changes we've made so far.

In [None]:
# Load larger dataset and catch up the changes
import pandas as pd
url = "https://raw.githubusercontent.com/unfamiliarplace/acse-integration/main/data_science/data/2022_51459.csv"
df_big = pd.read_csv(url)
df_big = df_big[['Max Temp (C)', 'Min Temp (C)', 'Rain (mm)', 'Snow (cm)']]
df_big['Snow (cm)'] = df_big['Snow (cm)'] * 10
df_big = df_big.rename(columns={'Snow (cm)': 'Snow (mm)'})
df_big['Total precipitation (mm)'] = df_big['Rain (mm)'] + df_big['Snow (mm)']

#### Outliers

First, we could take advantage of the `Series.quantile` method, which takes a decimal representing a percentage and returns the cutoff value for that quantile:

In [None]:
# Function for removing the top and bottom 10 percent of values
def remove_outliers_quantile(s: pd.Series) -> pd.Series:
    # Determine the values at the 90th and 10th percentiles
    q_upper = s.quantile(0.9)
    q_lower = s.quantile(0.1)

    # Create another Series by filtering for elements that are between the upper and lower bounds
    return s[(s < q_upper) & (s > q_lower)]
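
To get a feel for what the cutoffs look like, here is a sketch of the same steps on a tiny made-up `Series` (`s_demo`) holding the whole numbers 1 to 10. The cutoffs fall between the data points, so the lowest and highest values get trimmed.

```python
import pandas as pd

s_demo = pd.Series(range(1, 11))   # the whole numbers 1 to 10

# Cutoff values at the 10th and 90th percentiles
q_lower = s_demo.quantile(0.1)
q_upper = s_demo.quantile(0.9)
print(q_lower, q_upper)

# Keep only the values strictly between the two cutoffs
trimmed = s_demo[(s_demo > q_lower) & (s_demo < q_upper)]
print(trimmed.tolist())
```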

Second, we could take advantage of the `Series.std` method, which returns the standard deviation (roughly, the typical distance between a data point and the mean). We could then cut off datapoints further than `n` standard deviations away from the mean:

In [None]:
# Function for removing values outside n standard deviations
def remove_outliers_std(s: pd.Series, max_deviations: float) -> pd.Series:
    # Determine the cutoffs as a multiple of standard deviations
    max_distance = s.std() * max_deviations

    # Create another Series by filtering for elements that are between the upper and lower bounds
    mean = s.mean()
    return s[(s < (mean + max_distance)) & (s > (mean - max_distance))]
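
Here is a sketch of the same idea on a tiny made-up `Series` (`s_demo`) with one obvious outlier. Note that `Series.std` calculates the *sample* standard deviation by default.

```python
import pandas as pd

s_demo = pd.Series([10, 11, 12, 13, 14, 100])   # 100 is the odd one out

mean = s_demo.mean()
max_distance = s_demo.std() * 1   # allow 1 standard deviation either side
print(mean, max_distance)

# Keep only the values within that distance of the mean
kept = s_demo[(s_demo < (mean + max_distance)) & (s_demo > (mean - max_distance))]
print(kept.tolist())
```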

Let's use the standard deviation method and see how that changes the averages:

In [None]:
# Low series averages without outliers
low_without_outliers = remove_outliers_std(df_big['Min Temp (C)'], 1)
show_averages(low_without_outliers, 'Low')

In [None]:
# High series averages without outliers
high_without_outliers = remove_outliers_std(df_big['Max Temp (C)'], 1)
show_averages(high_without_outliers, 'High')

We notice that the low mean is not as cold, the low modes no longer include positive 2 degrees, and the high mode has sunk by 3 degrees and is now much closer to its mean and median. That suggests that January had a significant number of warm-ish days where the temperature hovered between 2 and 4 degrees or so. Of course, this is not yet a conclusion; it's just a direction to investigate more closely.

#### Correlations

What are some other intuitive ideas you might have about January?

Maybe you believe it generally snows if the temperature stays at 0 or below, but rains if the temperature stays above 0. Let's just see!

One way to do this is by checking **correlation**, which is a feature of `Series`. A correlation ranges from `-1` (the two events happen at opposite times) to `1` (the two events always co-occur).
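
To get a feel for the two ends of that range, here is a sketch with made-up numbers: one `Series` that rises in step with `x` and one that falls in step with it.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
rises_with_x = pd.Series([2, 4, 6, 8, 10])
falls_with_x = pd.Series([10, 8, 6, 4, 2])

print(x.corr(rises_with_x))   # very close to 1
print(x.corr(falls_with_x))   # very close to -1
```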

First, we'll binarize one of the columns: we'll take our snowfall amount and turn it into `True/False` to represent whether it snowed. That gives us a continuous variable (daily low temperature) and a binary variable (snowfall). This setup is related to a [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), where the temperature is treated as a predictor of whether it snowed; here, though, we'll simply measure how strongly the two variables move together.

In [None]:
# Adding new columns
df_big['Did it snow?'] = df_big['Snow (mm)'] > 0
print(df_big.head())

In [None]:
# Check correlation
corr = df_big['Did it snow?'].corr(df_big['Min Temp (C)'])
print(corr)

We get a decent negative correlation. That means it tends to snow when the temperature is lower, which is what we'd expect.

However, this correlation could actually be stronger. Consider that there must be many days when it neither snows nor rains, and we don't want to count those: we know that being cold doesn't itself cause precipitation! So let's start again, but first we'll filter for days that had any precipitation. We can filter our `DataFrame` using the `Total precipitation (mm)` column.

In [None]:
# Creating more logical subset
df_big_with_precip = df_big[df_big['Total precipitation (mm)'] > 0].copy()
df_big_with_precip['Did it snow?'] = df_big_with_precip['Snow (mm)'] > 0

print(df_big_with_precip)

In [None]:
# Redoing the correlation
corr = df_big_with_precip['Did it snow?'].corr(df_big_with_precip['Min Temp (C)'])
print(corr)

Now there is a negative correlation that's another 50% greater. This is a good sign and matches our intuitions.

**Understanding Check:** Can you test the inverse correlation: whether rain can be predicted by a high temperature? (You should find a positive correlation between temperature and whether it rained.)

In [None]:
# Check the "rainy on warm days" correlation
# TODO ADD YOUR CODE HERE

## Conclusion

We've built up a lot of tools throughout this notebook. Specifically, we've learned:

* What a `Series` is: linear data that can have labels
* What a `DataFrame` is: a table where each column is a `Series`
* How to view, sort, filter, and modify these data structures
* How to calculate basic statistics on data
* How to manipulate the data into a form where we can start to use statistics to answer questions

In the last notebook in this series, we'll put these skills together along with some data finding and cleaning techniques in order to answer a more complex real-world question.

See you then!

![Panda waving goodbye](https://raw.githubusercontent.com/unfamiliarplace/acse-integration/main/data_science/assets/panda_wave.jpg)
<sub>*Panda image from [VCG Photo](https://news.cgtn.com/news/3d3d414e32557a4e30457a6333566d54/share_p.html)*</sub>