# SI 330: Data Manipulation 
## 02 - Introduction to Pandas

### Dr. Chris Teplovs, School of Information, University of Michigan
<small><a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a>This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.</small>

# Reminders
* Slack (see link via Canvas announcements)

## Learning Objectives

* Pandas introduction
* explain how pandas operations differ from "traditional" Python
* be able to load a CSV file into a Pandas DataFrame
* explain how to extract columns from a DataFrame
* sort a DataFrame
* assign a column as the index of a DataFrame
* filter a DataFrame according to some criteria
* explain how boolean masks work in filtering DataFrames

### IMPORTANT: Replace ```?``` in the following code with your uniqname.

In [None]:
MY_UNIQNAME = '?'

## <font color="magenta"> Exercise 1 (1 point): Based on the readings ([Chapter 4](https://learning.oreilly.com/library/view/python-for-data/9781491957653/ch04.html#numpy) of Python for Data Analysis), what is the key feature of numpy that underpins much of the functionality of pandas? </font>

Insert your answer here.

## Pandas
* high-level library to support data manipulation and analysis
* DataFrame is the primary object we’ll be dealing with
* similar to R’s dataframe
* maps onto tabular structure
* good for time series and econometric data

## Shift from "pythonic" to "pandorable"

* less looping over elements
* lots of built-in functionality
* a "paradigm shift"

# Data structures

We're all familiar with lists:

In [None]:
names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

Now let's say that we want to divide each of those scores by two and assign the results to another variable. There are lots of ways to do this, so pick one (without importing any additional Python packages) and assign the results to a variable called ```half```:

## <font color="magenta"> Exercise 2 (1 point): Write some Python code to divide all the scores by 2. The results should be saved to a variable called ```half```. </font>

In [None]:
# insert your code here


If you followed the above instructions, the following cell should print a list of floats that looks like ```[40.0, 47.5, 42.5, 35.0]```


In [None]:
half

We can put data into an array structure that allows us to apply more powerful
functions.  The data structure that we're interested in is called an ```ndarray``` and is from the ```numpy``` package:

In [None]:
import numpy as np
ascores = np.array(scores)

In [None]:
ascores 

In [None]:
ahalf = ascores / 2

In [None]:
ahalf
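The division above is applied to every element of the array at once, with no explicit loop. As a minimal sketch of the same idea on different data (the temperature values and the variable names below are made up for illustration), arithmetic and comparisons both work element by element:

```python
import numpy as np

# Hypothetical data: temperatures in degrees Celsius
temps_c = np.array([12, 18, 25, 31])

# Arithmetic is applied to every element, no loop needed
temps_f = temps_c * 9 / 5 + 32

# Comparisons are element-wise too and produce an array of booleans
is_warm = temps_c > 20

print(temps_f)
print(is_warm)
print(temps_c.dtype, temps_f.dtype)  # every element shares one dtype
```

Note that the result of the arithmetic is a new array; ```temps_c``` itself is not modified.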

NumPy arrays are powerful, but they have some limitations. For example, all of the elements in an array must share a single data type (e.g. int). pandas provides two additional
data structures that are built on NumPy ndarrays.

The first is the Series.  Let's create a simple pandas Series and examine it:

In [None]:
import pandas as pd

In [None]:
from pandas import Series

In [None]:
sscores = Series(scores,name='scores')

In [None]:
sscores

So you see a couple of useful things: an index (0 to 3) and a data type (dtype), which in this case is an int64.

**A Series is a one-dimensional ndarray with axis labels**

In [None]:
data = dict(zip(names,scores))

In [None]:
import pandas as pd

In [None]:
data

In [None]:
sData = Series(data=data,name='score')

In [None]:
sData
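Because this Series was built from a dictionary, its index holds the names rather than the integers 0 to 3. A small sketch of what that makes possible, rebuilding the same data under the hypothetical name ```s_by_name``` so the block runs on its own:

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

# Same construction as above; s_by_name is just an illustrative name
s_by_name = pd.Series(dict(zip(names, scores)), name="score")

print(s_by_name["Ingrid"])    # look up a value by its label
print(s_by_name.iloc[1])      # look up the same value by its position
print(list(s_by_name.index))  # the labels themselves
print(s_by_name[s_by_name > 80])  # labels stay attached when selecting
```

Both lookups return 95; the labels simply give us a second, more readable way to address each value.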

So Series are a bit friendlier than numpy arrays, but they're still only one-dimensional.  Keep in mind that our basic data abstraction is a table, which can
be thought of as a two-dimensional array.  Let's go ahead and create a simple DataFrame with just one column:

In [None]:
from pandas import DataFrame


In [None]:
sData.to_frame()
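A DataFrame becomes more interesting with more than one column. One common way to build one, sketched here with the lists from earlier (the column names ```name``` and ```score``` and the variable ```table``` are choices made for this example), is to pass a dictionary that maps column names to lists of equal length:

```python
import pandas as pd

names = ["Charlotte", "Ingrid", "Ian", "Eric"]
scores = [80, 95, 85, 70]

# Each dictionary key becomes a column; each list supplies that column's values
table = pd.DataFrame({"name": names, "score": scores})

print(table)
print(table.dtypes)  # unlike an ndarray, each column has its own dtype
```

Each column of the DataFrame is itself a Series, which is why one table can hold text in one column and integers in another.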

## Today's focus: filtering, slicing and dicing

We're going to use some data from the [Nutrition Facts for McDonald's Menu](https://www.kaggle.com/mcdonalds/nutrition-facts) dataset on [Kaggle](https://www.kaggle.com).

Go ahead and browse the file using JupyterLab.

Now let's load the file using ```read_csv```.

In [None]:
import pandas as pd

In [None]:
menu = pd.read_csv('data/menu.csv')
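To see what ```read_csv``` does without depending on a file on disk, here is a minimal sketch that reads a tiny, made-up CSV from a string. The item names, the values and the variable ```toy``` are invented for illustration and are not taken from the menu data:

```python
import io
import pandas as pd

# A made-up CSV: the first line is the header, the rest are rows
csv_text = """Item,Calories
Item A,95
Item B,105
Item C,160
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head(2))  # first two rows
print(toy.dtypes)   # column types are inferred from the text
```

The header line supplies the column names, and pandas infers that ```Calories``` holds integers.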

## <font color="magenta"> Exercise 3 (1 point): How many rows and columns are in this dataset?  Include one cell block to determine the number and one markdown block that presents the answer as a complete sentence (i.e. "The McDonald's nutrition data set contains X rows and Y columns"). </font>

In [None]:
# insert your code here

## Extracting columns 

Getting column names is easy:

In [None]:
menu.columns

Similarly, extracting a specific column is also easy:

In [None]:
menu['Category']  # bracket notation
menu.Category     # attribute notation, same column (only this last expression is displayed)

And multiple columns can also be extracted by passing a list of column names:

In [None]:
menu[['Item','Calories']]
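The number of brackets matters. As a sketch on a small, made-up table (the data and the name ```toy``` are invented for this example), a single column name returns a Series, while a list of names returns a DataFrame, even when the list holds only one name:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Item": ["Wrap", "Salad", "Shake"],
    "Calories": [300, 150, 500],
    "Sugars": [4, 6, 60],
})

one = toy["Calories"]    # a single name gives a Series
two = toy[["Calories"]]  # a list of names gives a DataFrame

print(type(one).__name__)
print(type(two).__name__)
print(toy[["Item", "Calories"]])  # columns come back in the order listed
```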

## Extracting rows

In [None]:
menu.iloc[0]

In [None]:
menu

You'll notice that the index column is just a series of integers starting with 0.  Sometimes that's fine.  
Other times we want to assign a more useful column as the index.  Note that the values in the index do not need to be unique.

In [None]:
menu_i = menu.set_index('Item')

In [None]:
menu_i

In [None]:
menu_i.loc[0]  # raises KeyError: the index now holds item names, not integers

In [None]:
menu_i.iloc[0]
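The two cells above behave differently because ```loc``` selects by index label while ```iloc``` selects by integer position. A minimal sketch on made-up data (```toy``` and ```toy_i``` are hypothetical names, and the values are invented):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Item": ["Wrap", "Salad", "Shake"],
    "Calories": [300, 150, 500],
})
toy_i = toy.set_index("Item")

print(toy_i.iloc[0])      # first row, by position
print(toy_i.loc["Wrap"])  # the same row, by label

# After set_index there is no label 0, so a label lookup for 0 fails
try:
    toy_i.loc[0]
except KeyError as err:
    print("KeyError:", err)
```

Positions always run from 0 upward, so ```iloc[0]``` keeps working no matter which column is used as the index.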

We can also extract a row and a slice of its columns:

In [None]:
menu_i.iloc[0,0:2]

Or we can extract a slice of rows along with all of their columns:


In [None]:
menu_i.iloc[1:3,:]
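One detail worth checking for yourself: position slices with ```iloc``` exclude the end position, just like Python list slices, while label slices with ```loc``` include the end label. A sketch on made-up data (the names and values are invented for illustration):

```python
import pandas as pd

# Made-up data with item names as the index
toy_i = pd.DataFrame(
    {"Calories": [300, 150, 500, 250]},
    index=["Wrap", "Salad", "Shake", "Fries"],
)

by_position = toy_i.iloc[1:3, :]          # positions 1 and 2; position 3 is excluded
by_label = toy_i.loc["Salad":"Shake", :]  # "Shake" is included

print(by_position)
print(by_label)
```

Both selections return the same two rows, Salad and Shake.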

## Sorting
Sorting is supported using ```sort_index``` and ```sort_values```:


In [None]:
menu_sorted_by_cals = menu.sort_values('Calories',ascending=True)
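Note that ```sort_values``` returns a new DataFrame and leaves the original untouched, which is why the result above is assigned to a new variable. A minimal sketch on made-up data (```toy``` and ```by_cals``` are hypothetical names) that also shows how ```sort_index``` restores the original row order:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Item": ["Wrap", "Salad", "Shake"],
    "Calories": [300, 150, 500],
})

by_cals = toy.sort_values("Calories")  # ascending by default

print(by_cals)               # rows reordered; each row keeps its original index label
print(toy)                   # the original DataFrame is unchanged
print(by_cals.sort_index())  # sorting on the index puts the rows back in order
```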

## <font color="magenta"> Exercise 4 (2 points): Display the four menu items that have the most Saturated Fat (the absolute amount, not the % Daily Value).</font>

In [None]:
# insert your code here

## Filtering

More often than extracting a particular row, we want to extract one or more rows that match
some criteria.  For example, to find all the menu items that contain Trans Fats, we could use:


In [None]:
menu['Trans Fat'] > 0

In [None]:
menu_trans_fats = menu[menu['Trans Fat'] > 0.0]

We're going to spend time in class explaining what just happened there.

In [None]:
menu['Trans Fat']

In [None]:
menu['Trans Fat'] > 0.0

In [None]:
menu[menu['Trans Fat'] > 0.0]
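The three cells above can be read as three steps: pick a column, compare it to a value to get a Series of True/False values (the boolean mask), then use that mask to keep only the rows marked True. Here is a sketch of the same steps on a small, made-up table (the data and the names ```toy```, ```mask``` and ```both``` are invented for illustration):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Item": ["Wrap", "Salad", "Shake"],
    "Calories": [300, 150, 500],
    "Sugars": [4, 6, 60],
})

# Step 1 and 2: the comparison yields one boolean per row
mask = toy["Calories"] > 200
print(mask)
print(mask.sum())  # True counts as 1, so this is the number of matching rows

# Step 3: indexing with the mask keeps only the rows where it is True
print(toy[mask])

# Masks can be combined with & (and) or | (or); each condition needs parentheses
both = toy[(toy["Calories"] > 200) & (toy["Sugars"] < 10)]
print(both)
```

The parentheses around each condition are required because ```&``` binds more tightly than the comparison operators in Python.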

In [None]:
menu.columns

## <font color="magenta">Exercise 5 (2 points): List the top 3 breakfast items that have the most Dietary Fiber.</font>

In [None]:
# insert your code here

## <font color="magenta">Exercise 6 (3 points): Show up to three of the best choices for someone who is following the "Atkins Diet" (Google it).  Justify your choices in a markdown block that follows your code.</font>

In [None]:
# insert your code here

List and justify your choices.

# <font color="magenta">END OF NOTEBOOK</font>
## Remember to submit this file in HTML and IPYNB formats via Canvas.