<a href="https://colab.research.google.com/github/unfamiliarplace/acse-integration/blob/main/data_science/data_science_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Funfamiliarplace%2Facse-integration&branch=main&subPath=data_science%2Fdata_science_2.ipynb&depth=2"  target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Data Science Part 2


### Three data science notebooks

Here's what we've covered and will cover in this series:

Part 1: We examined Python's core data structures and saw some simple visualizations.

**Part 2: We will explore and learn to use Python's dedicated data science tools in more depth.**

Part 3: We will apply our knowledge to a coherent project with multiple steps.

### Outline of this notebook

The goal of this notebook is to learn how to use Python's dedicated data science tools, in a way that you can take and reuse with students.

**Introduction**

**1. Representing data:** Choosing the right structures and labels for a dataset.

**2. Working with data:** Transforming, selecting, and extracting data.

**3. Analyzing data:** Performing calculations and gathering statistics on data.

**4. Visualizing data:** Selecting and creating appropriate visuals in code.

**Conclusion**


## Introduction

**Data science** may seem like a complex topic, but it boils down to this:

> There's a ton of data out there! How do I find the information I'm looking for? How do I understand it? And how do I make it relevant?

In other words, data science is the science of handling data in order to *answer a specific question*.

Python allows us to do this by providing us with tools for representing data, manipulating or selecting from it, analyzing it, and creating charts, graphs, maps, and other visualizations that display it. In this notebook, we are learning how to use these tools. (In the next notebook, we will take a closer look at gathering and cleaning data from various real-world sources so that we can put our skills together for a specific purpose.)

### Learning goals

* A1. demonstrate the ability to use different data types, including one-dimensional arrays, in computer programs.

* A2. demonstrate the ability to use control structures and simple algorithms in computer programs.

* D2.2 demonstrate an understanding of an area of collaborative research between computer science and another field.

### Success criteria

* I can choose and implement a structure to represent a dataset in code.

* I can manipulate a data structure in code to select, organize, and analyze data.

* I can create a suitable visualization of a dataset in code in order to better understand a question.

> [Source: Ontario Curriculum (2008)](https://www.edu.gov.on.ca/eng/curriculum/secondary/computer10to12_2008.pdf#page=41)


## 1. Representing data

As we saw in the last notebook, often the greatest power in Python comes from the use of external libraries. Two core libraries for data science are `numpy` (Numerical Python) and `pandas` (Python for Data Analysis).

We will focus on `pandas` for this lesson to keep things simple. (However, note that `pandas` is actually built on `numpy`, and uses `numpy` datatypes like an `ndarray` instead of Python's built-in `list`.)

Run this code block to ensure you have `pandas` installed:

In [None]:
# Install numpy and pandas to the current environment
# N.B. pandas should install numpy automatically since it's a prerequisite

%pip install pandas

Next, run this block to import it. Remember that in a Jupyter notebook, all cells share the same memory, so if we import `pandas` at the start, it will be available in every later block.

In [None]:
# Import pandas
import pandas as pd

### Series

There are two core data structures in `pandas`: `Series` and `DataFrame`.

A `Series` is essentially a one-dimensional array, like a `list`. Run this code block to see an example.

In [None]:
# A Series representing some mysterious data
s = pd.Series([209, 210, 215, 230, 291])
print(s)
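To make the connection to `numpy` concrete, here is a small sketch showing that a `Series` built from a list keeps its values in a `numpy` array and gives each one a default index. The names `plain_list` and `demo` are our own, chosen so that we don't overwrite `s`.

```python
import numpy as np
import pandas as pd

plain_list = [209, 210, 215, 230, 291]
demo = pd.Series(plain_list)

# The values are stored in a numpy ndarray, not a Python list
values = demo.to_numpy()
print(type(values))

# With no index given, pandas counts from 0, just like a list
print(list(demo.index))
```

Both views hold the same five numbers; the `Series` simply wraps them with labels and extra tools.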

#### What makes a Series unique?

That output is a little surprising for a one-dimensional array! We notice that it's presented like a table, with one column for the index and another for the data.

That gives us a hint about the first unique quality of a `Series`: You can use any kind of index, not just a numerical count from `0`.

Let's try making a custom index. Run this code block. P.S. Can you guess what this data represents?

In [None]:
# Using strings for the index
s = pd.Series([209, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])
print(s)

Now the "rows" of our table each have a name. You might be thinking now that this is kind of like a dictionary, with key-value pairs, and that's a valid observation.
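The dictionary comparison can be taken quite literally. As a rough sketch (the names `bar_data` and `from_dict` are ours, for illustration only), `pandas` will accept a dictionary directly and use its keys as the index:

```python
import pandas as pd

# A plain dictionary with the same labels and values as above
bar_data = {'Skor': 209, 'Heath': 210, 'Snickers': 215, 'Mars': 230, 'Twix': 291}

# The keys become the index and the values become the data
from_dict = pd.Series(bar_data)
print(from_dict)

# Lookups and membership tests work much like they do on a dictionary
print(from_dict['Mars'])
print('Twix' in from_dict)
```

Unlike a dictionary, though, a `Series` keeps its items in order and supports the whole-column operations we'll see shortly.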

**Understanding Check:** Notice that we have the same number of values and labels. What if those two lists were different lengths? Make a guess and then try it out.

Also, there's a weird thing there at the end of our `Series` output: a `dtype`! This datatype is given because a `Series` can only contain *one* type of data for *all* its elements. This is one of the key principles of having clean datasets: a restriction like this helps us avoid comparing apples and oranges, or rather 5s and oranges.

In this case, `pandas` guessed the datatype we wanted, a 64-bit integer. This is a very generous guess, because the maximum value of an `int64` is `2^63 - 1`, or about 9 quintillion. We can optimize our `Series` by specifying the datatype if we know the bounds of our dataset. Let's use `int16`, which has a maximum value of `2^15 - 1`, or `32,767`. That still leaves plenty of headroom for our data.

In [None]:
# Specifying a datatype
s = pd.Series([209, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'],
              dtype='int16')
print(s)
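If you'd like to check the bounds of a datatype yourself, `numpy` can report them. The sketch below also compares how many bytes the same five values take up under each datatype; the names `wide` and `slim` are our own.

```python
import numpy as np
import pandas as pd

# Ask numpy for the smallest and largest value each integer type can hold
for int_type in [np.int16, np.int64]:
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

values = [209, 210, 215, 230, 291]
wide = pd.Series(values, dtype='int64')
slim = pd.Series(values, dtype='int16')

# nbytes is the memory used by the values alone (not counting the index)
print(wide.nbytes, slim.nbytes)
```

With only five values the saving is tiny, but on a dataset with millions of rows, a slimmer datatype can make a real difference.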

**Understanding Check:** What would happen if you used an even slimmer datatype of `int8`? Make a guess and then try it out.

#### What can I do with a Series?

Here's a [full reference](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), but let's take a look at some common tasks.

You can run each of these code blocks to see how they work:

In [None]:
# Get all the values
s.values

In [None]:
# Get a specific item
s.loc['Skor']
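`.loc` looks items up by label. As a quick sketch of two related tools (using our own variable `bars` so that `s` is left alone), `.iloc` looks items up by position instead, and `.loc` can also take a slice of labels:

```python
import pandas as pd

bars = pd.Series([209, 210, 215, 230, 291],
                 index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])

# By label
print(bars.loc['Skor'])

# By position, counting from 0 like a list
print(bars.iloc[0])

# A label slice includes BOTH endpoints, unlike a list slice
print(bars.loc['Heath':'Mars'])
```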

In [None]:
# Get the length of the Series
s.size

In [None]:
# Get the highest value
s.max()

In [None]:
# Get the lowest value
s.min()

This next one will show you something a Python list can't easily do. We found the maximum and minimum values; what if we want to know which chocolate bars those were? Simple:

In [None]:
# Get the name of the chocolate bar with the highest value
s.idxmax()

In [None]:
# Get the name of the chocolate bar with the lowest value
s.idxmin()
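For comparison, here is a sketch of what the same lookup might look like with two plain lists. The names `names`, `values`, and `bars` are ours; the point is only that we would have to keep the two lists lined up ourselves.

```python
import pandas as pd

names = ['Skor', 'Heath', 'Snickers', 'Mars', 'Twix']
values = [209, 210, 215, 230, 291]

# With plain lists: find the value, find its position, then look up the name
highest = max(values)
position = values.index(highest)
print(names[position])

# With a Series: the labels travel with the data
bars = pd.Series(values, index=names)
print(bars.idxmax())
```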

We can also do some very cool operations on the whole `Series` at once. Can your grandma's `list` do this?

In [None]:
# Add to all items
s = s + 100
print(s)

In [None]:
# Multiply all items
s = s * 5
print(s)
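To see why this is special, here is a short sketch of how a plain `list` behaves with the same operators. The variable names are our own.

```python
import pandas as pd

values = [209, 210, 215, 230, 291]

# A list needs a loop or comprehension to add to every item
# (values + 100 would raise a TypeError)
plain = [v + 100 for v in values]
print(plain)

# Multiplying a list repeats it instead of scaling its items
print(len(values * 5))

# A Series applies the operation to every element at once
bars = pd.Series(values)
shifted = bars + 100
print(shifted.tolist())
```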

In fact, we can apply a function to every element of a `Series`. Let's get the square root of each value:

In [42]:
# Map a function
from math import sqrt
s = s.map(sqrt)
print(s)

Skor        3.802214
Heath       3.806754
Snickers    3.829214
Mars        3.894323
Twix        4.130221
dtype: float64
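`map` is not limited to functions from Python's libraries. As a sketch, here it is with a small function we wrote ourselves (`last_digit` is a made-up example), alongside `numpy`'s own `sqrt`, which can take a whole `Series` directly. We use a fresh variable `bars` so that `s` is left alone.

```python
import numpy as np
import pandas as pd

bars = pd.Series([209, 210, 215, 230, 291],
                 index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])

# A hypothetical function of our own, applied to each element by map
def last_digit(n):
    return n % 10

digits = bars.map(last_digit)
print(digits)

# numpy functions accept the whole Series and return a new Series
roots = np.sqrt(bars)
print(roots)
```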


Hey, those decimals aren't very pretty. Let's round them to 2 decimal places:

In [43]:
# Round a Series
s = s.round(2)
print(s)

Skor        3.80
Heath       3.81
Snickers    3.83
Mars        3.89
Twix        4.13
dtype: float64


**Understanding Check:** Try a few more operations on this `Series`: subtraction, division, exponentiation, negation. Feel free to run this block to reset your `Series` to what it was:

In [None]:
s = pd.Series([209, 210, 215, 230, 291],
              index=['Skor', 'Heath', 'Snickers', 'Mars', 'Twix'])

There is much more we can do with a `Series`, including:

* Filtering data
* Correlating data
* Analyzing data (e.g. averages)

We'll save these for the next section, though.

### DataFrame



## 2. Working with data

## 3. Analyzing data

## 4. Visualizing data

## Conclusion