# The Python Ecosystem

Here are some extra resources for learning Python:

**Getting Started with Python**:

* https://www.codecademy.com/learn/python
* http://docs.python-guide.org/en/latest/intro/learning/
* https://learnpythonthehardway.org/book/
* https://www.codementor.io/learn-python-online

**Learning Python in Notebooks**:

* http://mbakker7.github.io/exploratory_computing_with_python/

This is handy to always have available for reference:

**Python Reference**:

* https://docs.python.org/3/reference/


There are also many Python courses available via DataCamp. You can access all their courses with the invite link on our [resources page](https://www.mdst.club/resources)!

## 0. Jupyter Notebook

Welcome to Jupyter Notebook! Jupyter lets you develop documents that combine code, visualizations, and explanatory text.

At MDST, we use Jupyter Notebooks for: 
- data cleaning and transformation
- statistical modeling
- data visualization
- machine learning
- ...

Cells are the basic units of organization in Jupyter Notebooks. You can start editing each cell by pressing ENTER or double clicking.

All our cells so far are _Markdown_ cells, meaning they just contain text! 

What is [Markdown](https://en.wikipedia.org/wiki/Markdown), you ask? It is a plain-text format in which the formatting instructions are stored in the text itself.

That means (enter edit mode to see the actual Markdown text):

- To make something bold, put two asterisks on each side, **like so**.
- To italicize something, put an asterisk on each side, *like so*.
- To cross something out, put two tildes on each side, ~~like so~~.
- To embed a link in words, put the words in square brackets and put the link immediately after that in parentheses, [like so](https://www.yout-ube.com/watch?v=dQw4w9WgXcQ).

Most crucially, pressing `ENTER` once does NOT start a new paragraph in Markdown. You have to leave an empty line before every new paragraph.

Whereas these operations are done by clicking a button in MS Word or Google Docs, they are a part of the text in Markdown. 

Here is a Markdown [cheatsheet](https://www.markdownguide.org/cheat-sheet/). 

In [44]:
# Jupyter also has code cells for writing and running Python code.
# What's in this cell is not Python code; these lines are comments. You can start a comment by putting a hash sign (#) at the beginning of a line.
# Comments are for other humans only. The computer will ignore them when executing your programs.
# Pro Tip: You can comment and uncomment many lines at once by highlighting them and pressing CTRL + / or CMD + /

You can run a cell by pressing `CTRL + ENTER` or `SHIFT + ENTER`. Running a cell will either render the contained Markdown as nice-looking text or execute the contained code.

If you are new to Python, you should run every cell in this notebook. If you already have some familiarity, you can use the Table of Contents to skip ahead.

# Table of Contents

- [1. Data Types](#1.-Data-Types)
    - [1.0 Your First Python Program](#1.0-Your-First-Python-Program)
    - [1.1 Data Type](#1.1-Data-Type)
    - [1.2 Container (list, tuple, dictionary, set)](#1.2-Container)
- [2. Control Flow](#2.-Control-Flow)
- [3. Iterating](#3.-Iterating)
    - [3.0 For Loops](#3.0-For-Loops)
    - [3.1 List Comprehension](#3.1-List-Comprehension-(Optional))
- [4. Functions](#4.-Functions)
    - [4.0 Import & Library](#4.0-Import-&-Library)
    - [4.1 Built-in Function](#4.1-Built-in-Function)
    - [4.2 Custom Function](#4.2-Custom-Function)
    - [4.3 Lambda Function](#4.3-Lambda-Function-(Optional))
    - [4.4 Type Hinting](#4.4-Type-Hinting-(Optional))
- [5. Numpy](#5.-Numpy)
    - [5.0 Array](#5.0-Array)
    - [5.1 Math](#5.1-Math)
- [6. Pandas](#6.-Pandas)
    - [6.0 Dataframes & Series](#6.0-Dataframes-&-Series)
    - [6.1 Indexing](#6.1-Indexing)
    - [6.2 Data Transformation](#6.2-Data-Transformation)
    - [6.3 Grouping & Aggregating](#6.3-Grouping-&-Aggregating)
    - [6.4 Lambda Functions in Pandas](#6.4-Leveraging-Lambda-Functions-In-Pandas-(Optional))

The checkpoint proceeds in the same order so you can follow along.

## 1. Data Types

### 1.0 Your First Python Program

In [45]:
# Tradition demands that we do this
# Try running this cell

print("Hello World")

Hello World


The `print()` function is how you output things for people to see in Python.

In [46]:
# Notebooks will automatically print the output of the last line of each cell when it is run

413 * 5791

2391683

### 1.1 Data Type

#### 1.1.0 Ints and Floats

Python distinguishes between integers and decimal numbers (floats).

In [47]:
type(0)

int

In [48]:
type(0.0)

float

Basic arithmetic is straightforward in Python.

In [49]:
3 + 2

5

In [50]:
1.1 - 9.0

-7.9

In [51]:
3 * 5

15

In [52]:
# When two numbers, regardless of whether they are int or float, are divided, Python returns the result as if the operation is done on a calculator
# This is known as float division
print(1/2)
print(1.5/2.4)

0.5
0.625


In [53]:
# There is also integer (floor) division, written with //
# In Python, the behavior is always to round the float division result down to the nearest integer
14 // 5

2

In [54]:
# You can also find the remainders of divisions
# Also known as taking the modulus
13 % 5

3

In [55]:
# exponent
10 ** 3

1000

ints and floats are mostly interchangeable and can also be cast (i.e. converted) to the other type.

In [56]:
float(3)

3.0

In [57]:
int(2.9)

2
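Note that `int()` drops the fractional part rather than rounding, which is not quite the same as the "round down" behavior of `//`. The difference only shows up with negative numbers. A minimal sketch (the variable names are just for illustration):

```python
# Floor division rounds toward negative infinity
floor_positive = 14 // 5     # 2
floor_negative = -14 // 5    # -3, because -2.8 rounds DOWN to -3

# int() truncates toward zero instead
truncated = int(-14 / 5)     # -2, the .8 is simply dropped

print(floor_positive, floor_negative, truncated)
```

For positive numbers the two approaches agree, so this distinction rarely matters in practice.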

#### 1.1.1 Strings

Strings are Python's internal representation of text.

In [58]:
# They can either be surrounded by double quotes...
type("apple")

str

In [59]:
# or single quotes
type('apple')

str

In [60]:
# You can piece two strings together (aka concatenate) using the plus sign
"Hello" + " World"

'Hello World'

Python provides many functions for manipulating strings. 

In [61]:
# Uppercase
"like so".upper()

'LIKE SO'

In [62]:
# Lowercase
"LIKE SO".lower()

'like so'

In [63]:
# Title case
"like so".title()

'Like So'

In [64]:
# Count the number of characters, including whitespace
len("like so")

7

In [65]:
# Remove spaces on either side of a string
"    like so  ".strip()

'like so'

In [66]:
# Split a string into a list of words 
"like so".split()

['like', 'so']

You can find a comprehensive list of these functions [here](https://www.w3schools.com/python/python_ref_string.asp).

#### 1.1.2 Boolean Values

There are two boolean values in Python: `True` and `False`. They are case sensitive and must be typed exactly as such.

Now time for some basic [boolean algebra](https://en.wikipedia.org/wiki/Boolean_algebra).

You can flip a boolean value to its opposite with `not`.

In [67]:
print(not True)
print(not False)

False
True


In [68]:
# and, or conjunction, only evaluates to True when every boolean value involved is True
print(True and True)
print(True and False)
print(False and False)

True
False
False


In [69]:
# or, or disjunction, evaluates to True whenever at least one involved boolean value is True
print(True or True)
print(True or False)
print(False or False)

True
True
False


In [70]:
# All the non-zero numbers are treated as True
print(bool(1 and True))
print(bool(0 and True))

True
False


In [71]:
# All non-empty strings, even if the string is all whitespaces, are treated as True
print(bool(True and ""))
print(bool(True and "    "))
print(bool(True and "False"))

False
True
True
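The `bool()` wrapper in the two cells above matters because `and` and `or` do not always return `True` or `False`; they return one of their operands. A small sketch of that behavior (the variable names are illustrative):

```python
# `and` returns the first falsy operand, or the last operand if all are truthy
result_and = True and "    "     # the whitespace string itself
result_empty = True and ""       # the empty string itself

# `or` returns the first truthy operand, or the last operand if none are truthy
result_or = 0 or "fallback"

print(repr(result_and), repr(result_empty), repr(result_or))

# Wrapping the result in bool() converts it to a real boolean
print(bool(result_and), bool(result_empty))
```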


We will use boolean values much more extensively when we encounter control flow and `if` statements.

#### 1.1.3 Variables

You can store data inside named variables, and refer back to the data with its name.

Variable names cannot begin with a digit.

In [72]:
# Python automatically figures out what type your variables are
# Once the cell is run, the variables are made available everywhere else in the notebook

x = 4
y = 5

In [73]:
# We can do arithmetic with those variables in another cell
4*x + 5*y

41

In [74]:
# There are some shorthands for updating variables
# Instead of x = x + 2
# We can simply do:

x += 2
x

# You can do the same for -, *, and /

6
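The other shorthand operators work the same way. A quick sketch, using a separate variable named `count` (an illustrative name) so that `x` above is left untouched:

```python
count = 10

count -= 4    # same as count = count - 4, so count is now 6
count *= 3    # same as count = count * 3, so count is now 18
count /= 2    # same as count = count / 2, so count is now 9.0

# Float division always produces a float, even when the result is whole
print(count, type(count))
```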

In [75]:
# In Python, snake case is the norm for multi-word variable names

michigan_data_science_club_abbreviation = "MDST"

In [76]:
# The values stored inside the variables can be overwritten later by referring back to the variable name
# Python allows changing the data type of the variable when it is overwritten

x = "like"
y = " so"

x + y

'like so'

### 1.2 Container

#### 1.2.0 List

A list is a collection of data. A list can contain different types of data.

In [77]:
# You can create (aka initialize) an empty list with the square brackets
empty_list = []

# or with the list() command
another_empty_list = list()

In [78]:
# Or you can create lists by listing the elements it should contain
nonempty_list = [32, 'MDST', True]

Once a list is created, you can retrieve elements inside with its index.

Python uses 0-indexing, meaning the first element is on index 0. 

In [79]:
# Retrieve an element by putting its index in a square bracket after the list's name
nonempty_list[1]

'MDST'

In [80]:
# This works similarly for strings
mdst = "MDST"
mdst[2]

'S'

In [81]:
# You can chain indices as well
nonempty_list[1][2]

'S'

Negative numbers index from the end. Think of it as -1 wrapping around to the last element in the list. -2 is then the second last element in the list etc.

In [82]:
nonempty_list[-2]

'MDST'

Be careful to not use an index that doesn't exist in a list. Python won't know what to do and will throw an error.

In [83]:
# Getting the first element in an empty list doesn't make sense.

print(empty_list[0])


IndexError: list index out of range

In [None]:
# Neither does finding the fifth element in a three-element list

print(nonempty_list[4])

You can use slicing to get sublists/substrings.

Syntax: `[start:end:step]`

The slice will include the start index (inclusive) but not the end index (exclusive).

In [None]:
sample_list = [0, 1, 2, 3 , 4, 5, 6, 7, 8, 9, 10]

In [None]:
# Getting the fourth to eighth element
# If you don't specify the step, Python assumes you want every element in the range

sample_list[3:8]

In [None]:
# When end is not specified, Python includes everything including and after the start index
sample_list[5:]

In [None]:
# Similarly, when start is not specified, Python includes everything before the end index but excludes the end index itself
sample_list[:-5]

In [None]:
# When neither start nor end is specified, Python applies the step argument to the entire list
# step = 2 means to take 2 steps forward each time an element is selected. In other words, it selects every other element

sample_list[::2]

In [None]:
# A neat trick for reversing a list, try to understand what it's doing
sample_list[::-1]

You can add elements to an existing list ...

In [None]:
# at the end ...
sample_list.append(11)

# or somewhere in the middle
# syntax: insert(index, new_value)
sample_list.insert(1, 0.5)

print(sample_list)

or remove an element ...

In [None]:
# remove the first instance of a given value in the list
sample_list.remove(0.5)

# or remove the element at a specified index
sample_list.pop(0)

sample_list

or change an element using its index ...

In [None]:
sample_list[-1] = 12
sample_list

or many other things ...

See the full range of possibility [here](https://www.w3schools.com/python/python_ref_list.asp).

If you thought typing out every number from 0 to 10 was an inefficient way of creating a list, you will be glad to learn about the `range()` function. 

Syntax: `range(start (inclusive), end (exclusive), step)`

Pro tip: if you only specify `end`, Python will give you every integer from 0 up to the one before `end`.

In [None]:
# let's recreate the list of numbers from 0 to 10 using range()
# The output of range()'s type is range, not list. We need to convert it with list()
sample_list = list(range(11))
sample_list
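The cell above only uses the `end` argument. As a brief sketch of the full `range(start, end, step)` form described earlier (the list names are illustrative):

```python
# Start at 0, stop before 11, and step by 2
evens = list(range(0, 11, 2))

# A negative step counts down; the end value (0) is still excluded
countdown = list(range(5, 0, -1))

print(evens)       # [0, 2, 4, 6, 8, 10]
print(countdown)   # [5, 4, 3, 2, 1]
```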

#### 1.2.1 Tuple

Python tuples are list-like data structures with one important difference.

In [None]:
# You can create them with parentheses or with the tuple() command

empty_tuple = tuple()

sample_tuple = (1, 2, 3, 4)

print(empty_tuple, sample_tuple)

Indexing tuples is just like indexing lists

In [None]:
print(sample_tuple[1], sample_tuple[-3])

Crucially, tuples can NOT be modified once created. 

Tuples are *immutable*. While this property makes them less versatile than lists, it sometimes comes in handy. For example, tuples can be used as keys in dictionaries (next section).

In [None]:
# try to overwrite an item in a tuple

try:
    sample_tuple[-1] = 10
except TypeError as e:
    print(e)
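To make the point about dictionary keys concrete, here is a minimal sketch. It uses a dictionary, which is covered in the next section, and the `grid_labels` data is made up for illustration:

```python
# Hypothetical example: map (row, column) coordinates to labels
grid_labels = {(0, 0): "top left", (0, 1): "top right"}
print(grid_labels[(0, 1)])

# A list cannot be used as a key because it is mutable
try:
    bad_labels = {[0, 0]: "top left"}
except TypeError as e:
    print(e)
```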

#### 1.2.2 Dictionary

A dictionary stores pairs of associated values, known as keys and values. Each key maps to exactly one value.

In [None]:
# You can create an empty dictionary in two ways
empty_dict1 = dict()
empty_dict2 = {}

print(empty_dict1, empty_dict2)

In [None]:
# You can also create dictionaries by specifying the key:value pairs
panda_express_pricing = {"Bowl":5.80, "Plate":6.80, "Bigger Plate":8.30}

You index a dictionary with a key and get its associated value.

In [None]:
bowl_price = panda_express_pricing["Bowl"]
bowl_price

Be careful to not index a key that doesn't exist in the dictionary because that will cause an error.

If you are not sure whether a key is in the dictionary or not, use the [get](https://www.w3schools.com/python/ref_dictionary_get.asp) method to be safe.

In [None]:
# try to eat buffet at Panda Express
buffet_price = panda_express_pricing["Buffet"]
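The cell above raises a `KeyError`. A short sketch of the safer `get` approach mentioned earlier, using an illustrative copy of the pricing dictionary named `pricing`:

```python
# Illustrative copy of the pricing dictionary from above
pricing = {"Bowl": 5.80, "Plate": 6.80, "Bigger Plate": 8.30}

# get() returns None instead of raising an error when the key is missing
missing = pricing.get("Buffet")

# You can also supply your own fallback value as the second argument
with_default = pricing.get("Buffet", 0.0)

# For keys that exist, get() behaves like normal indexing
existing = pricing.get("Bowl")

print(missing, with_default, existing)
```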

It follows that you can change the value associated with a key.


In [None]:
# let's say Panda Express has a sale on the bowls
panda_express_pricing["Bowl"] = 5.00
bowl_price = panda_express_pricing["Bowl"]
bowl_price

There is, however, no easy way to modify the key associated with a value.

In [None]:
# You can see a list of all the keys in a dictionary
panda_express_pricing.keys()

In [None]:
# Or a list of all values
panda_express_pricing.values()

In [None]:
# Or a list of key value pairs, represented as tuples
panda_express_pricing.items()

See everything you can do with dictionaries [here](https://www.w3schools.com/python/python_ref_dictionary.asp).

#### 1.2.3 Set

Sets store unique elements.

In [None]:
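# An empty set can only be created with set(); (), [], {} are all taken
# A non-empty set can be built from a list with set([...]) or written directly as {1, 2, 3}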

s = set([1,2,3,1,2,3])
s

In [None]:
# Add new elements to a set
s.add(3)
s.add(4)
s

In [None]:
# Remove elements in the set 
s.discard(1)
s.discard(2)

There are many set operations that can be performed between two sets. We will not go into them here. You can see a list on this [page](https://www.w3schools.com/python/python_ref_set.asp).
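For a quick taste only, here is a sketch of the three most common set operations. The sets `a` and `b` are made up for illustration:

```python
a = set([1, 2, 3, 4])
b = set([3, 4, 5])

union = a | b          # elements in either set
intersection = a & b   # elements in both sets
difference = a - b     # elements in a but not in b

print(union, intersection, difference)
```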

#### 1.2.4 Container Utilities

You can use `len()` to find the number of items in each of the above four containers.

In [None]:
l = [1,2,3]
t = (1,2,3)
d = {1:'a', 2:'b', 3:'c'}
s = set([1, 2, 3])

print(len(l), len(t), len(d), len(s))

And use the `in` keyword to check if an element is in the container or not.

For dictionaries, you can only use this to check whether a key is in the dictionary or not.

In [None]:
print(1 in l)
print(4 in t)
print(2 in d)
print(0 in s)
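If you do need to check whether something appears among a dictionary's values, one option is to test against `.values()`. A minimal sketch, redefining a dictionary like `d` from above under the illustrative name `lookup`:

```python
lookup = {1: 'a', 2: 'b', 3: 'c'}

# `in` on the dictionary itself only looks at keys
key_check = 'a' in lookup

# `in` on .values() looks at the values instead
value_check = 'a' in lookup.values()

print(key_check, value_check)   # False True
```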

## 2. Control Flow

You can use `if` statements to execute different actions in different scenarios.

Before we dive in, a quick aside on comparing numbers:
- Use `==` to check equality
- Use `!=` to check inequality
- Use `<`, `>`, `>=`, and `<=` to compare two numbers

In [None]:
# Here is the general idea of if statements
# if (condition evaluates to true):
#   execute code here

to_print_or_not_to_print = True

if to_print_or_not_to_print:
    # Most code editors will automatically indent the lines inside an if statement for you 
    # It doesn't matter whether you use tabs or spaces to indent or how much you indent (two or four spaces are common)
    # Just be consistent! Your code will not work without consistent indentation!
    
    print("The first block of code is executed")

to_print_or_not_to_print = False

if to_print_or_not_to_print:
    print("The second block of code is executed")


We can use more complex conditions for `if` statements.

In [None]:
if 4 < 5 and 6 >= 6 and len(list(range(3))) == 3:
    print("The first block of code is executed")

if 4 != 4 or 6 > 7 or -1 < 0:
    print("The second block of code is executed")

An `if ... else` scheme can handle both when the condition is true and false.

In [None]:
to_print_or_not_to_print = True

if to_print_or_not_to_print:
    # indented
    print("printing")
# unindented
else:
    # indented
    print("not printing")

to_print_or_not_to_print = False

if to_print_or_not_to_print:
    print("printing")
else:
    print("not printing")

`if ... elif ... else` schemes can handle many different scenarios.

You can have `elif` without `else` but all `elif` must appear before `else`.

In [None]:
uniqname = "ENTER YOUR UNIQNAME HERE"

if len(uniqname) <= 4:
    print("Short")
elif len(uniqname) < 8:
    print("Medium")
else: 
    print("Long")

## 3. Iterating

### 3.0 For Loops

Lists, tuples, sets, dictionaries, strings, and ranges are all *iterables*. That just means we can move through them in a certain order.

This property is useful for simplifying repeated actions. 

Say we have a list of numbers and we want to print each of them, doubled. 

We can use the index to access, multiply, and print each of them but that's inefficient.

For loops to the rescue.

In [None]:
nums = list(range(5))

for num in nums: 
    print(num*2)


In [None]:
# What is actually going on here?
#
# in nums specifies the iterable to go through, nums in this case
# num is what is called the loop variable. i, j, and k are common loop variable names but num makes more sense here
#
# for num in nums: 
#     indent!
#     num is set to an element in the nums list and the action is executed
#     print(num*2)
#     num is set to the next element in the nums list
#
# in this case, we iterated through the elements of the list
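The same `for` syntax works on the other iterables mentioned at the start of this section. A brief sketch with a string and a dictionary (the `prices` data is illustrative):

```python
# Iterating through a string goes character by character
for letter in "MDST":
    print(letter)

prices = {"Bowl": 5.80, "Plate": 6.80}

# Iterating through a dictionary goes through its keys
for item in prices:
    print(item)

# Use .items() to get each key together with its value
for item, price in prices.items():
    print(item, price)
```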

In [None]:
# Another common pattern is to iterate through the indices 
# let's print out the indices that have an even number on them

for i in range(len(nums)):
    # range(len(nums)) gives all the indices in the nums list
    # nums has 5 elements so range(len(nums)) looks like 0, 1, 2, 3, 4
    # you will see this all the time in for loops

    if nums[i] % 2 == 0:
        print(i)

One more example: 

Make a new list containing the items in nums squared.


In [None]:
squared_nums = []

for num in nums:
    squared_nums.append(num ** 2)

squared_nums

Sometimes it is useful to iterate through both the element and index at the same time. 

Look into [`enumerate`](https://realpython.com/python-enumerate/).
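As a small sketch of how `enumerate` could replace the `range(len(nums))` pattern from the earlier cell (the `even_positions` list is an illustrative addition):

```python
nums = list(range(5))
even_positions = []

# enumerate yields (index, element) pairs, which we unpack into i and num
for i, num in enumerate(nums):
    if num % 2 == 0:
        even_positions.append(i)

print(even_positions)   # [0, 2, 4]
```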

### 3.1 List Comprehension (Optional)

Here we present a nice feature of Python that allows creating lists using a shorthand of for loops.

In [None]:
# every letter in MDST
letters = [letter for letter in "MDST"]
letters

In [None]:
# Transform each element
# Let's redo the squared_nums example from the previous section

squared_nums = [num**2 for num in nums]
squared_nums

In [None]:
# Transform each element differently based on some condition
# Square the number if it is even, else cube it 

squares_and_cubes = [num**2 if num % 2 == 0 else num**3 for num in nums]
squares_and_cubes

In [None]:
# Filter the elements
# Triple the number only if it is odd 

triples = [num*3 for num in nums if num % 2 == 1]
triples

In [None]:
# nested comprehension
# numbers from 0 to 20, in 3-number segments

segments = [[i for i in range(start, start+3)] for start in range(0, 20, 3)]
segments

## 4. Functions

### 4.0 Import & Library

Libraries (aka packages) are code that other people have developed for you to use. Python has tons of cool and interesting libraries.

You can start using them in your notebooks with the `import` keyword.

In [None]:
# There is always a relevant xkcd 

import antigravity

Most libraries are more elaborate and contain many functionalities.

In [None]:
# Once a library is imported, you can start using the functions and methods they have
import random
random.randint(1, 10)

In [None]:
# If you know what function you need, you can also import it specifically
from random import randint

# If you do it this way, you can use randint directly instead of typing out random.randint()

randint(1, 10)

In [None]:
# Sometimes function or library names are very long 
# You can use the as keyword to rename imports

from random import randrange as r 

r(1, 10)

### 4.1 Built-in Function

We present some more built-in functions that may be useful for completing the checkpoints.

You can find documentation for all of them [here](https://docs.python.org/3/library/functions.html).

In [None]:
max([3,4,5])

In [None]:
min([-3,3,9])

In [None]:
sum([1,3,5])

In [None]:
round(3.8)

In [None]:
round(3.36394, 3)

In [None]:
abs(-3)

### 4.2 Custom Function

Functions are great ways to reduce code duplication and repetition.

Functions can be used to carry out specific actions. We will slowly build up to a function that outputs custom greeting messages.

Let's start by having the function just print "Hi".

In [None]:
# The first line in a function is the function header. It starts with the def keyword, followed by the function name
def greet():
    # Indent!
    print("Hi")

greet()

Not exactly a custom message. It would be nice if we could greet people by their names.

In [None]:
# We can shape a function's behavior by adding arguments. These appear in the parentheses after the function name.
# Note: the name on this line names an argument to the greet function
def greet(name):
    print("Hi " + name)

name = "ENTER YOUR NAME HERE"

# Note: the name on this line refers to the name variable
greet(name)

Maybe you are excited to see the person, in which case some exclamation marks are in order. 

Usually, 1 is good. 

In [None]:
# You can set default values for arguments. The function will use those defaults if the argument is not provided.
# In contrast, arguments without default values have to be specified
def greet(name, num_exclamation=1):
    print("Hi " + name + '!'*num_exclamation)

greet(name)

In [None]:
# You can of course use different values for all your default arguments.
# Python will try to match arguments using the order listed in the header
greet(name, 3)

# or you can mix up the order by referring to the arguments by their names
greet(num_exclamation=2, name=name)

Functions don't have to interface with users directly. They can also be used to perform computations and return the results.

In [None]:
def round_to_hundreds(num):
    rounded = round(num / 100) * 100

print(round_to_hundreds(168))

Weird, we expected 200 but received `None`.

This is because we forgot to get the function to make its output available for other parts of the program to use.

In its current state, the output of the function (`rounded`) is inaccessible.

This is where `return` comes into play.

In [None]:
def round_to_hundreds(num):
    rounded = round(num / 100) * 100

    # returning makes the output available to other code
    return rounded

print(round_to_hundreds(168))

### 4.3 Lambda Function (Optional)

A lambda function is a shorthand way to write simple functions. It is useful in many contexts, but you will find it a great help when you are performing data transformation.

In [None]:
# Let's define a function that applies or to two boolean values and returns the opposite of that result
def or_reverse(bool1, bool2):
    return not (bool1 or bool2)

or_reverse(True, False)

In [None]:
# This is how the same function will look in lambda notation 
lambda_or_reverse = lambda bool1, bool2: not (bool1 or bool2)

lambda_or_reverse(True, False)
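Lambdas are most often passed directly to another function rather than stored in a variable. One common case is the `key` argument of the built-in `sorted()`. A minimal sketch, where the `words` list is made up for illustration:

```python
words = ["pandas", "numpy", "jupyter", "python"]

# Sort by word length; ties keep their original order
by_length = sorted(words, key=lambda word: len(word))

# Sort by the last letter of each word
by_last_letter = sorted(words, key=lambda word: word[-1])

print(by_length)
print(by_last_letter)
```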

### 4.4 Type Hinting (Optional)

Type hinting allows you to indicate the expected data types of variables, function arguments, and return values.

Type hints are optional annotations that make Python easier to read and debug; type hinting is especially nice for keeping track of data types as your codebase gets larger.

In [None]:
# These two variable assignments are equivalent:
name = "deckard"
name: str = "deckard"  # type hinted declaration (var_name: var_type)

# Python is a dynamically typed language; it does not require the type of a variable to be declared, and a variable's type can change during runtime
name = 26354
print(type(name))  # <class 'int'>

# Type hinting for functions follows a similar format. "-> None" indicates that there isn't anything to return
def greet(name: str, num_exclamation: int = 1) -> None:
    print("Hi " + str(name) + '!'*num_exclamation)

# For flexibility, Python doesn't actually require passed argument types to match their respective annotated types
greet(name)

# This line, however, causes a TypeError inside of the function due to an unexpected type mismatch
# greet(name, 1.00) 

In [None]:
# the typing library can be used for more specific type hinting
from typing import Any, Union, Optional, Callable


# Container types can be annotated using square brackets
list_of_ints: list[int] = [1, 2, 3]
tuple_of_stuff: tuple[Any, Any] = (4.0, [5])


# Union[] or | specifies that a variable can be one of multiple specific types (the | syntax requires Python 3.10 or newer)
X_1: Union[int, float, bool] = 1.982
X_2: Union[int, float, bool] = 61021
X_3: int|float|bool = False

# Optional[] specifies that a variable can also be None
delivery_charge: Optional[int|float] = 3.99

In [None]:
# This function returns a (int, float) tuple
def cottage_inn_order(n_students: int, delivery_charge: Optional[int|float] = None) -> tuple[int, float]:
    n_pizzas = (n_students // 32) + 1
    subtotal = n_pizzas * 34.98

    if delivery_charge is not None:
        subtotal += delivery_charge

    return n_pizzas, subtotal * 1.06

# We can tell at a glance that this function expects a (int, float) tuple as input
def print_pizza(info: tuple[int, float]) -> None: 
    print(info[0], " pizzas, costing $", round(info[1], 2), sep="")

print_pizza(cottage_inn_order(n_students = 150))

# Callable specifies arguments that are functions. For precise hinting you can use Callable[[arg_types], return_type]
def print_pizza(
    n_students: int, 
    order_func: Callable = cottage_inn_order,
    delivery_charge: Optional[int|float] = None
) -> None: 
    info = order_func(n_students, delivery_charge)
    print(info[0], " pizzas, costing $", round(info[1], 2), sep="")

print_pizza(150)

## 5. Numpy

Numpy is short for *numerical python*, a library built for optimized operations on large arrays and matrices. 

In [None]:
import numpy as np

### 5.0 Array

Numpy arrays can be created from a Python list.

In [None]:
a = [1,2,3,4,5,6]
b = np.array(a)
b

Right now, it looks an awful lot like a Python list, but there are some key differences you should be aware of.

Numpy arrays are:
- homogeneous (all elements in an array have the same type)
- multidimensional

In [None]:
# Homogeneous: all numpy arrays have an associated data type
# numbers are usually ints or floats
b.dtype

In [None]:
# Multidimensional: numpy arrays can have arbitrarily many dimensions
# We can reshape b into a 3x2 matrix. This means 3 rows and 2 columns
# Note: this doesn't change b. That's why we assign it to a new variable: m
m = b.reshape(3, 2)
m

In [None]:
# Each dimension is called an axis
# The size across each axis is called the shape
# These are two very important concepts!
m.shape

In [None]:
# One numpy function worth highlighting is transpose 
# Essentially, the first row becomes the first column, the second row becomes the second column etc.

m = m.transpose()
m

### 5.1 Math

Numpy gives us a lot of math functions to work with. You can find them all in the [documentation](https://numpy.org/doc/stable/reference/routines.math.html).

In [None]:
np.sum(b)

In [None]:
np.mean(b)

In [None]:
# for convenience, you can also call
b.mean()

You can also apply these functions by axis.

In general, `axis=0` means to operate down the rows (collapsing them, so you get one result per column) and `axis=1` means to operate across the columns (so you get one result per row).

In [None]:
# axis=0 sums down the rows, giving one total per column
print(np.sum(m, axis=0))

# axis=1 sums across the columns, giving one total per row
print(np.sum(m, axis=1))
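The axis argument is easy to mix up, so here is a self-contained sketch on a small made-up array named `grid`. The axis you pass is the one that disappears from the shape of the result:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

# axis=0 collapses the rows, leaving one total per column
column_totals = np.sum(grid, axis=0)

# axis=1 collapses the columns, leaving one total per row
row_totals = np.sum(grid, axis=1)

print(column_totals, column_totals.shape)   # [5 7 9] (3,)
print(row_totals, row_totals.shape)         # [ 6 15] (2,)
```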


In [None]:
# Unlike a regular list, you can do arithmetic on numpy arrays directly
# In most cases, numpy will apply the arithmetic operations to each element

print(m*3)
print(m+3)
print(np.power(m,2))
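To see why this differs from a regular list, compare what `*` does in each case. A minimal sketch with illustrative names:

```python
import numpy as np

plain_list = [1, 2, 3]
array = np.array(plain_list)

# Multiplying a list repeats it
repeated = plain_list * 3

# Multiplying an array scales each element
scaled = array * 3

print(repeated)   # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(scaled)     # [3 6 9]
```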

## 6. Pandas

Pandas is another Python library which we will be using _a lot!_ It lets us handle data in tabular format and is well integrated with other libraries for plotting, machine learning, etc.

In [86]:
import pandas as pd

### 6.0 Dataframes & Series

Pandas puts data into dataframes, which are made up of series.

In [98]:
# here, we're reading in data from a 'csv', or comma-separated values, file
df = pd.read_csv("../data/cereal.csv")
type(df)

pandas.core.frame.DataFrame

A dataframe is like a table:

In [None]:
df

We can use `head()`, `tail()`, or `sample()` to take a look at the data.

In [None]:
# head returns the first n rows of the dataframe, where n is the number you pass in. By default it returns the first 5
# tail returns the last rows of the dataframe
df.head(10)

In [None]:
df.sample()

You can also use `describe()` to get a feel for the distribution of numerical columns.

In [None]:
df.describe()

And use `.shape` and `.dtypes` to understand the properties of the dataframe.

In [None]:
df.shape

In [None]:
# In Pandas, object can be a lot of things, such as strings in this case
df.dtypes

Each column is a pandas Series (pd.Series).

In [None]:
df["name"]

In [None]:
type(df["name"])

Series are similar to numpy arrays in many ways. They are both homogeneous and share many operations.

In [None]:
df["carbo"].mean()

### 6.1 Indexing

#### 6.1.0 `loc` & `iloc`

The index in a pandas series/dataframe can be any list of **unique** values (row number, ID, time, etc.)

`iloc` is used to index by row number in a dataframe.

In [99]:
# The first row of the dataframe
df.iloc[0]

name        100% Bran
mfr                 N
type                C
calories           70
protein             4
fat                 1
sodium            130
fiber            10.0
carbo             5.0
sugars              6
potass            280
vitamins           25
shelf               3
weight            1.0
cups             0.33
rating      68.402973
Name: 0, dtype: object

In [100]:
# Select multiple rows at once
df.iloc[[1, 2, 3]]

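|  | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.0 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.5 | 93.704912 |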


In [101]:
# This syntax means all the list slicing methods can also be applied
# Let's get every tenth row in the dataset
df.iloc[::10]

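|  | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 10 | Cap'n'Crunch | Q | C | 120 | 1 | 2 | 220 | 0.0 | 12.0 | 12 | 35 | 25 | 2 | 1.0 | 0.75 | 18.042851 |
| 20 | Cream of Wheat (Quick) | N | H | 100 | 3 | 0 | 80 | 1.0 | 21.0 | 0 | -1 | 0 | 2 | 1.0 | 1.0 | 64.533816 |
| 30 | Golden Crisp | P | C | 100 | 2 | 0 | 45 | 0.0 | 11.0 | 15 | 40 | 25 | 1 | 1.0 | 0.88 | 35.252444 |
| 40 | Kix | G | C | 110 | 2 | 1 | 260 | 0.0 | 21.0 | 3 | 40 | 25 | 2 | 1.0 | 1.5 | 39.241114 |
| 50 | Nutri-grain Wheat | K | C | 90 | 3 | 0 | 170 | 3.0 | 18.0 | 2 | 90 | 25 | 3 | 1.0 | 1.0 | 59.642837 |
| 60 | Raisin Squares | K | C | 90 | 2 | 0 | 0 | 2.0 | 15.0 | 6 | 110 | 25 | 3 | 1.0 | 0.5 | 55.333142 |
| 70 | Total Raisin Bran | G | C | 140 | 3 | 1 | 190 | 4.0 | 15.0 | 14 | 230 | 100 | 3 | 1.5 | 1.0 | 28.592785 |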


We can specify a column to use as the index column. In this case, `name` makes the most sense.

Remember, the index column should contain unique values only!

In [102]:
# see how the leftmost column is now replaced with the cereal names
df_ = df.set_index('name')
df_.head()

Unnamed: 0_level_0,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
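
The reminder above about unique index values can be checked before calling `set_index`. The sketch below uses a tiny made-up table (`toy`, with a deliberately repeated name) rather than the real cereal data, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical three-row table with a deliberately repeated name
toy = pd.DataFrame(
    {"name": ["100% Bran", "All-Bran", "All-Bran"], "calories": [70, 70, 50]}
)

# is_unique tells us whether the column is safe to use as an index
names_unique = toy["name"].is_unique
print(names_unique)

# duplicated(keep=False) marks every row whose name appears more than once
repeated = toy[toy["name"].duplicated(keep=False)]
print(repeated)

# verify_integrity=True makes set_index raise instead of silently
# building an index with repeated labels
try:
    toy.set_index("name", verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print("set_index refused:", err)
```

Pandas does allow repeated index labels, but `loc` then returns several rows for one label, which is rarely what you want in a lookup table.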


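`loc` selects rows by their index labels rather than by position.

In our case, the labels are the cereal names. If no index has been set, the labels are the default integers 0, 1, 2, ..., so `loc` behaves similarly to `iloc`. One difference remains: a slice such as `loc[0:2]` includes its end label, whereas `iloc[0:2]` excludes the end position.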

In [103]:
df_.loc['All-Bran']

mfr                 K
type                C
calories           70
protein             4
fat                 1
sodium            260
fiber             9.0
carbo             7.0
sugars              5
potass            320
vitamins           25
shelf               3
weight            1.0
cups             0.33
rating      59.425505
Name: All-Bran, dtype: object
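
To make the difference between label-based and position-based lookup concrete, here is a small sketch. It builds a three-row stand-in (`toy`) from values shown in the outputs above instead of loading the full dataset.

```python
import pandas as pd

# Tiny stand-in for the cereal table (values copied from the rows shown above)
toy = pd.DataFrame(
    {
        "name": ["100% Bran", "100% Natural Bran", "All-Bran"],
        "calories": [70, 120, 70],
    }
)

# With the default integer index, loc and iloc accept the same numbers,
# but they slice differently
by_label = toy.loc[0:1]      # label slice: the end label is included
by_position = toy.iloc[0:1]  # position slice: the end position is excluded
print(len(by_label), len(by_position))

# With names as the index, loc takes names while iloc still takes positions
toy_named = toy.set_index("name")
print(toy_named.loc["All-Bran", "calories"])
print(toy_named.iloc[2]["calories"])
```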

By default, Pandas selects all the columns. You can specify which columns to select with a list of column names.

In [None]:
df_.loc["All-Bran"][["fat", "sodium", "sugars"]]

An alternate syntax for specifying which columns to select:

In [None]:
df_.loc['All-Bran', ["fat", "sodium", "sugars"]]

The first syntax also works with `iloc`, but the second does not, because `iloc` only accepts integer positions. To do something similar to the second syntax with `iloc`, pass the column positions instead of the column names.

In [104]:
# fat is the 6th column in the table, so its index is 5 etc
df.iloc[2, [5, 6, 9]]

fat         1
sodium    260
sugars      5
Name: 2, dtype: object
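
Counting columns by hand is error-prone. One possible alternative, sketched below on a small stand-in table (`toy`, not the real dataframe), is to look the positions up with `columns.get_loc` and pass them to `iloc`.

```python
import pandas as pd

# Small stand-in table; only a few of the cereal columns are reproduced here
toy = pd.DataFrame(
    {
        "name": ["100% Bran", "100% Natural Bran", "All-Bran"],
        "fat": [1, 5, 1],
        "sodium": [130, 15, 260],
        "sugars": [6, 8, 5],
    }
)

wanted = ["fat", "sodium", "sugars"]

# get_loc returns the integer position of a column label
positions = [toy.columns.get_loc(col) for col in wanted]
print(positions)

# Row at position 2, restricted to the looked-up column positions
row = toy.iloc[2, positions]
print(row)
```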

### 6.1.1 Conditional Indexing

Comparison operators (`==`, `!=`, `<`, `>`, `<=`, `>=`) work on Pandas series. 

The result is a series of the same size holding the result of the comparison for each element.

In [93]:
df["protein"] > 3

0      True
1     False
2      True
3      True
4     False
      ...  
72    False
73    False
74    False
75    False
76    False
Name: protein, Length: 77, dtype: bool
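
Because `True` counts as 1 and `False` as 0, a boolean series like the one above can also be summed or averaged. A minimal sketch on a four-value stand-in series:

```python
import pandas as pd

# Protein values of the first four cereals shown earlier
protein = pd.Series([4, 3, 4, 4], name="protein")

mask = protein > 3

# sum() counts the True values, mean() gives the fraction that are True
n_high = int(mask.sum())
share_high = float(mask.mean())
print(n_high, share_high)
```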

Pandas also allows selecting rows with a series of boolean values. Only the rows that correspond to `True` will be selected.

Combining the two features:

In [94]:
# This gives us all the rows in which the protein is greater than 3.
df[df["protein"] > 3]

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,Delicious 100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
2,Delicious All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,Delicious All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
11,Delicious Cheerios,G,C,110,6,2,290,2.0,17.0,1,105,25,1,1.0,1.25,50.764999
41,Delicious Life,Q,C,100,4,2,150,2.0,12.0,6,95,25,2,1.0,0.67,45.328074
43,Delicious Maypo,A,H,100,4,1,0,0.0,16.0,3,95,25,2,1.0,1.0,54.850917
44,Delicious Muesli Raisins; Dates; & Almonds,R,C,150,4,3,95,3.0,16.0,11,170,25,3,1.0,1.0,37.136863
45,Delicious Muesli Raisins; Peaches; & Pecans,R,C,150,4,3,150,3.0,16.0,11,170,25,3,1.0,1.0,34.139765
56,Delicious Quaker Oat Squares,Q,C,100,4,1,135,2.0,14.0,6,110,25,3,1.0,0.5,49.511874
57,Delicious Quaker Oatmeal,Q,H,100,5,2,0,2.7,-1.0,-1,110,0,1,1.0,0.67,50.828392


In [None]:
# You can compare two columns
# Get all the cereals with more protein than sugar
df[df["protein"] > df["sugars"]]

You can also combine multiple conditions, as you would with `and`, `or`, and `not`.

However, Python's `and`, `or`, and `not` expect a single `True`/`False` value, and a whole series does not have one. For conditional indexing on a dataframe, replace `and` with `&`, `or` with `|`, and `not` with `~`, which work element by element.

Read more about why on this [post](https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas).

In [None]:
# Let's find all the cereals with more than 3g of protein and less than 5g of sugar
# You have to put each condition in parentheses
df[(df["protein"] > 3) & (df["sugars"] < 5)]
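
The sketch below shows what goes wrong with `and` and how `&`, `|`, and `~` behave. It uses a four-row stand-in table (`toy`) with values taken from the outputs above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "name": ["100% Bran", "100% Natural Bran", "All-Bran", "Cheerios"],
        "protein": [4, 3, 4, 6],
        "sugars": [6, 8, 5, 1],
    }
)

high_protein = toy["protein"] > 3
low_sugar = toy["sugars"] < 5

# `and` needs one True/False for the whole series, which pandas refuses to guess
try:
    toy[high_protein and low_sugar]
except ValueError as err:
    print("and failed:", err)

# The element-wise operators work as intended
both = toy[high_protein & low_sugar]
either = toy[high_protein | low_sugar]
not_high = toy[~high_protein]

print(both["name"].tolist())
print(either["name"].tolist())
print(not_high["name"].tolist())
```

Storing each condition in a named variable, as done here, also avoids the need for extra parentheses.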

## 6.2 Data Transformation

When we are processing data, it is common to add new columns based on existing columns. 

In the cereal dataframe, most measurements are standardized to 1 weight unit, but 1 weight unit is one cup for one cereal and half a cup for another.

We should probably add a column that documents how many weight units are in a cup for each cereal.

In [None]:
# Arithmetic between series in the same dataframe is quite simple
# We will create a new column called "weight_per_cup"

df["weight_per_cup"] = df["weight"] / df["cups"]
df.head()

We may also want to make changes to a specific column. We can do this with the `apply()` method.

In [90]:
# Let's add "Delicious " to the beginning of every name

# The pattern is we define a function for a single entry
def make_delicious(name):
    return "Delicious " + name

# and then call apply on the series to apply the function to each element in the series
df["name"].apply(make_delicious)

0                     Delicious 100% Bran
1             Delicious 100% Natural Bran
2                      Delicious All-Bran
3     Delicious All-Bran with Extra Fiber
4                Delicious Almond Delight
                     ...                 
72                      Delicious Triples
73                         Delicious Trix
74                   Delicious Wheat Chex
75                     Delicious Wheaties
76          Delicious Wheaties Honey Gold
Name: name, Length: 77, dtype: object

In [91]:
# this returns the changes, but doesn't apply them in place
# that means on our original dataframe, the cereals are still bland
df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843


In [92]:
# we can fix this by assigning the new names to the column
df["name"] = df["name"].apply(make_delicious)
df.head()

Unnamed: 0,name,mfr,type,calories,protein,fat,sodium,fiber,carbo,sugars,potass,vitamins,shelf,weight,cups,rating
0,Delicious 100% Bran,N,C,70,4,1,130,10.0,5.0,6,280,25,3,1.0,0.33,68.402973
1,Delicious 100% Natural Bran,Q,C,120,3,5,15,2.0,8.0,8,135,0,3,1.0,1.0,33.983679
2,Delicious All-Bran,K,C,70,4,1,260,9.0,7.0,5,320,25,3,1.0,0.33,59.425505
3,Delicious All-Bran with Extra Fiber,K,C,50,4,0,140,14.0,8.0,0,330,25,3,1.0,0.5,93.704912
4,Delicious Almond Delight,R,C,110,2,2,200,1.0,14.0,8,-1,25,3,1.0,0.75,34.384843
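
For a simple transformation like this one, `apply()` is not the only option: arithmetic operators also work element by element on a series. The sketch below compares both approaches on a three-name stand-in series.

```python
import pandas as pd

names = pd.Series(["100% Bran", "All-Bran", "Almond Delight"], name="name")

def make_delicious(name):
    return "Delicious " + name

# apply() calls the function once per element
with_apply = names.apply(make_delicious)

# Adding a string to a series of strings prepends it to every element
vectorized = "Delicious " + names

print(with_apply.tolist() == vectorized.tolist())
```

`apply()` remains the more general tool, since it accepts any function of a single element.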


## 6.3 Grouping & Aggregating

When we have lots and lots of data, it's more useful to look at aggregate statistics like the mean or median, but we may lose too much detail by aggregating across the whole dataset.

The solution is to aggregate across groups. For example, maybe we're less interested in the mean calorie count of all cereals and more interested in the mean for each manufacturer.

In [None]:
# First, we can see how many (and which) unique manufacturers there are
print(df["mfr"].unique())
print(df["mfr"].nunique())

In [None]:
# Now let's group by the manufacturers
# This gives us a groupby object across the dataframe
mfrs = df.groupby("mfr")
mfrs

In [None]:
# now let's find the mean calories of each manufacturer
mfrs["calories"].mean()

You can also group by multiple columns. 

Let's get the median calorie count for each combination of manufacturer and type.

In [88]:
# The groupby functions always precede any aggregate function
df.groupby(["mfr", "type"])["calories"].median()

mfr  type
A    H       100.0
G    C       110.0
K    C       110.0
N    C        90.0
     H       100.0
P    C       110.0
Q    C       100.0
     H       100.0
R    C       110.0
Name: calories, dtype: float64
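
If you want several statistics per group at once, one option is the `agg()` method, which accepts a list of aggregation names. A minimal sketch on a five-row stand-in table (`toy`), not the full dataset:

```python
import pandas as pd

# Five rows with manufacturer and calories taken from the outputs above
toy = pd.DataFrame(
    {
        "mfr": ["N", "Q", "K", "K", "Q"],
        "calories": [70, 120, 70, 50, 100],
    }
)

# One row per manufacturer, one column per statistic
summary = toy.groupby("mfr")["calories"].agg(["mean", "median", "count"])
print(summary)
```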

## 6.4 Leveraging Lambda Functions in Pandas (Optional)

We will quickly demonstrate two places where lambda functions are very powerful: conditional indexing and data transformation.

In [None]:
# We want to filter for rows satisfying col1 > bound1 and col2 > bound2
filter_func = lambda row, col1, bound1, col2, bound2: row.loc[col1] > bound1 and row.loc[col2] > bound2 

# On a dataframe, axis=1 makes apply() pass one row at a time to the function
df[df.apply(lambda row: filter_func(row, "sugars", 5, "calories", 150), axis=1)]

In [None]:
# We want to calculate some sort of health metric for each cereal by finding a weighted sum of its nutrients

def random_metric(row):
    return 2*row.loc["protein"] - 0.5*row.loc["fat"] + 1.5*row.loc["fiber"] - 1*row.loc["sugars"]

df["random_metric"] = df.apply(lambda row: random_metric(row), axis=1)
df.head()
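
When the row-wise function is plain arithmetic, the same result can usually be computed directly on whole columns, which tends to be faster on large tables. The sketch below checks that both approaches agree on a three-row stand-in table (`toy`).

```python
import pandas as pd

# Nutrient values of the first three cereals shown earlier
toy = pd.DataFrame(
    {
        "protein": [4, 3, 4],
        "fat": [1, 5, 1],
        "fiber": [10.0, 2.0, 9.0],
        "sugars": [6, 8, 5],
    }
)

def random_metric(row):
    return 2 * row["protein"] - 0.5 * row["fat"] + 1.5 * row["fiber"] - row["sugars"]

# Row by row: the function is called once per row
row_wise = toy.apply(random_metric, axis=1)

# Column by column: the same formula applied to whole series at once
column_wise = 2 * toy["protein"] - 0.5 * toy["fat"] + 1.5 * toy["fiber"] - toy["sugars"]

print(row_wise.tolist())
print(column_wise.tolist())
```

Row-wise `apply()` is still the better fit when the logic per row cannot be written as column arithmetic, for example when it involves branching.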