# Idiomatic Pandas - NYC PyData Conference - Nov 30, 2017
[Tutorial Session from 9 a.m. to 12:30 p.m.](https://pydata.org/nyc2017/schedule/presentation/10/)

# Background Info
My name is [Ted Petrou](https://twitter.com/TedPetrou) and I am the author of [Pandas Cookbook](https://www.amazon.com/Pandas-Cookbook-Ted-Petrou/dp/1784393878), which provides nearly 100 recipes with step-by-step instructions for developing powerful and efficient routines for exploring, analyzing, and visualizing messy real-world datasets. Buy my book and get a [30-minute one-on-one tutorial with me](http://tedpetrou.com/pandas-cookbook.html).

![](../images/idiomatic_pandas/pc_amazon.png)
____
I am founder of [Dunder Data](http://dunderdata.com/), a company dedicated to teaching the fundamentals of data science.

![](../images/idiomatic_pandas/dd_logo.png?1)

___
I earned a master's degree in statistics from Rice University and used these analytical skills to play poker professionally. I then taught math before becoming a data analyst and eventually a data scientist for Schlumberger in Houston, Texas.
___
I founded the Houston Data Science Meetup group:

![](../images/idiomatic_pandas/hds2.png)


I now live in Toronto.

I really enjoy answering questions on Stack Overflow. It sharpens my ability to write idiomatic pandas.

![](../images/idiomatic_pandas/my_so.png?534)


# Before Getting Started
* Use the latest version of pandas - 0.21. Update with command **`conda update pandas`** or **`pip install pandas -U`**

# Target Audience
This is not an introduction to pandas. If you want a beginner's guide to pandas, please see:
* Tom Augspurger's pandas [.head to .tail tutorial](https://github.com/tomaugspurger/pydata-nyc-ph2t) scheduled Wednesday, November 29, 2017 from 1:30 - 5 p.m.
* My article [How to Learn Pandas](https://medium.com/dunder-data/how-to-learn-pandas-108905ab4955) for strategies on mastering pandas.

This tutorial assumes you have prior exposure to doing data analysis with pandas but...
* Would not feel comfortable answering questions tagged as pandas on Stack Overflow
* Don't know the difference between **`[]`**, **`.iloc`**, **`.loc`**, **`.ix`**, **`.at`**, **`.iat`**
* Use **`reset_index`** frequently because you have no idea how to deal with MultiIndexes
* Use for-loops frequently
* Use **`apply`** frequently
* Struggle with pandas and find yourself wishing it were as easy as R
* Absolutely hate pandas and wish to see it obliterated

## Pandas Overview
* Pandas is one of the most popular tools to do data analysis with. Approximately 1% of all new Stack Overflow questions are tagged as pandas.

![](../images/idiomatic_pandas/so_trends.png)
* The library has evolved substantially since it started becoming mainstream in 2012
* Many answers on Stack Overflow use older syntax that has not been updated
* There are multiple ways to accomplish the same task
* For beginners, the best way to accomplish a task is not always obvious
* The documentation is over 2,000 pages long
* It is easy to write inefficient pandas

# Attendance

In [1]:
from IPython.display import IFrame

In [2]:
IFrame('http://etc.ch/32Rr', 400, 300)

In [3]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8lsM9v4IBajClR6cnLre6kxM2M3', 300, 250)

In [5]:
IFrame('http://etc.ch/AanT', 400, 400)

In [8]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8OK6KgIliK6457pdCle7Ir3aFf7e7lJ', 600, 400)

In [8]:
IFrame('http://etc.ch/kg2C', 500, 400)

In [9]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg83aLxg811VZnaTvl0O3T2i8JxQzX', 400, 300)

# How well do you know pandas?

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('../data/sample_data.csv', index_col=0)
df

### Exercise 1
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`state`**.</span>

In [None]:
# your code here

### Exercise 2
<span  style="color:green; font-size:16px">Select the columns **`height`** and **`state`** along with the rows **`Niko`** and **`Penelope`**.</span>

In [None]:
# your code here

### Exercise 3
<span  style="color:green; font-size:16px">Select rows 3 and 5 and the last three columns using 0-based indexing.</span>

In [None]:
# your code here

### Exercise 4
<span  style="color:green; font-size:16px">Select all the people with **`color`** equal to red or green or with height less than 90. Only return the **`score`** column.</span>

In [None]:
# your code here

### Exercise 5
<span  style="color:green; font-size:16px">Two DataFrames are defined below. What will **`df1`** look like when displayed below?</span>

In [None]:
df1 = pd.DataFrame({'state':['Texas', 'California', 'Florida'], 
                    'oranges':[10, 5, 12]})
df2 = pd.DataFrame({'apples':[3, 4, 5]}, 
                   index=[1, 2, 3])
df1

In [None]:
df2

In [None]:
df1['apples'] = df2['apples']

In [None]:
#  Answer question before executing
df1

### Exercise 6
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [None]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])

s1

In [None]:
s2

In [None]:
# Answer question before executing
s1 + s2

### Exercise 7
<span  style="color:green; font-size:16px">What will be the output when the following two Series are added together?</span>

In [None]:
s1 = pd.Series(index=['a', 'a', 'b', 'b'], data=[1, 2, 3, 4])
s2 = pd.Series(index=['a', 'a', 'b', 'b', 'c'], data=[1, 2, 3, 4, 5])

s1

In [None]:
s2

In [None]:
# Answer question before executing
s1 + s2

# Enter your results of the quiz below

In [4]:
IFrame('http://etc.ch/pbjf', 300, 450)

In [5]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8YX2glNsS7zNMFhXmApFoUu5jClJ', 300, 250)

# What does Idiomatic Pandas mean?
Let's come up with a definition for **idiomatic**. Idiomatic code, in general, refers to the most efficient and common convention for completing a specific task. Every language and library has its own idioms. We usually use this term in pandas to refer to short expressions where there exists one good or 'better' version versus other alternatives. 

In general, idiomatic pandas will be:
* Explicit and easy to read
* Performant 
* Commonly used by pandas experts

### The College Scorecard dataset
We will use the College Scorecard dataset for the following examples. This is U.S. Department of Education data on 7,535 colleges. Only a sample of the available columns is included in this dataset. Visit [the website](https://collegescorecard.ed.gov/data/) for more info. The data was pulled in January 2017.

In [None]:
college = pd.read_csv('../data/college.csv', index_col='INSTNM')
college.head()

### College Scorecard data dictionary
Several of the columns are difficult to decipher. Use the following data dictionary to help you understand the columns.

In [None]:
pd.read_csv('../data/college_data_dictionary.csv')

# Comparisons of non-idiomatic vs idiomatic pandas (Basic)
Let's see some examples of terrible pandas code vs. its more idiomatic counterparts.

## Reading in data: `read_csv` vs `read_table`
Both the **`read_csv`** and **`read_table`** functions call the exact same underlying code. There is only a single minor difference. **`read_csv`** uses a **comma** as its default delimiter, while **`read_table`** uses a **tab**. That's it. In my opinion **`read_table`** should be deprecated as it adds no additional functionality.

In [None]:
c1 = pd.read_csv('../data/college.csv')
c2 = pd.read_table('../data/college.csv', delimiter=',')
c1.equals(c2)
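
The default-delimiter difference can also be checked without any file on disk. The following is a minimal sketch that assumes a tiny made-up dataset (the `name` and `score` columns are hypothetical, not part of the college data) held in memory with `io.StringIO`:

```python
import io
import pandas as pd

# Hypothetical two-row dataset, written once with commas and once with tabs
csv_text = "name,score\nAda,90\nLin,85\n"
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))      # default delimiter is a comma
from_table = pd.read_table(io.StringIO(tsv_text))  # default delimiter is a tab

print(from_csv.equals(from_table))  # True
```

Each function parses its own default format into the same DataFrame, which is consistent with the claim that only the default delimiter differs.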

## Find the total count of historically black colleges

#### Non-idiomatic: using a loop

In [None]:
total = 0
for i in college['HBCU']:
    total += i
total

So bad it didn't work: the missing values in the column turn the running total into `nan`. Let's drop the missing values and try again:

In [None]:
total = 0
for i in college['HBCU'].dropna():
    total += i
total

#### Idiomatic

In [None]:
college['HBCU'].sum()
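
To see why the first loop above fails while the Series method does not, here is a minimal sketch on a made-up Series. The name `hbcu` and its values are illustrative assumptions, not taken from the college data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a 0/1 indicator column with one missing value
hbcu = pd.Series([1.0, 0.0, np.nan, 1.0, 0.0])

# A plain Python loop propagates NaN: once NaN is added, the total stays NaN
loop_total = 0
for value in hbcu:
    loop_total += value

# The Series method skips missing values by default (skipna=True)
method_total = hbcu.sum()

print(loop_total)    # nan
print(method_total)  # 2.0
```

The method handles missing data for you, so no explicit `dropna` call is needed before summing.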

## Find the percentage of historically black colleges

#### Non-idiomatic: summing and then dividing

In [None]:
college['HBCU'].sum() / college['HBCU'].count()

#### Idiomatic

In [None]:
college['HBCU'].mean()

## Find the percentage of schools with math SAT scores greater than 700

#### Non-idiomatic

In [None]:
s_greater_700 = college['SATMTMID'].dropna() > 700
s_greater_700.head()

In [None]:
s_greater_700 = s_greater_700.astype(int)
s_greater_700.head()

In [None]:
s_greater_700.sum() / s_greater_700.count()

#### Idiomatic

In [None]:
college['SATMTMID'].dropna().gt(700).mean()

In [None]:
# or
(college['SATMTMID'].dropna() > 700).mean()
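
The `dropna` call in these expressions matters because a comparison against a missing value evaluates to `False`. The sketch below uses a made-up Series (`sat` and its values are assumptions for illustration) to show how the denominator changes:

```python
import numpy as np
import pandas as pd

# Hypothetical SAT math scores; two schools did not report a score
sat = pd.Series([650.0, 720.0, np.nan, 710.0, np.nan, 500.0])

# NaN > 700 is False, so missing rows still count in the denominator
share_of_all = (sat > 700).mean()                # 2 of 6 rows

# Dropping missing values first restricts the denominator to reported scores
share_of_reported = sat.dropna().gt(700).mean()  # 2 of 4 rows

print(share_of_all, share_of_reported)
```

Which denominator is appropriate depends on the question being asked: the share of all schools, or the share of schools that reported a score.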

## Testing multiple 'or' clauses on same column

In [None]:
states = ['AL', 'LA', 'TX', 'FL', 'GA']

#### Non-idiomatic

In [None]:
college[[sa in states for sa in college['STABBR']]].shape

In [None]:
criteria = ((college['STABBR'] == 'AL') | (college['STABBR'] == 'LA') | 
            (college['STABBR'] == 'TX') | (college['STABBR'] == 'FL') | 
            (college['STABBR'] == 'GA'))
college[criteria].shape

#### Idiomatic

In [None]:
college[college['STABBR'].isin(states)].shape
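
As a quick sanity check that all three approaches select the same rows, here is a small sketch on a made-up Series of state abbreviations (`stabbr` is a hypothetical stand-in for the `STABBR` column):

```python
import pandas as pd

# Hypothetical state abbreviations standing in for the STABBR column
stabbr = pd.Series(['AL', 'NY', 'TX', 'CA', 'GA', 'TX'])
states = ['AL', 'LA', 'TX', 'FL', 'GA']

# List comprehension: a Python-level loop over every value
mask_loop = pd.Series([s in states for s in stabbr], index=stabbr.index)

# Chained comparisons joined with |, one term per state
mask_or = ((stabbr == 'AL') | (stabbr == 'LA') | (stabbr == 'TX') |
           (stabbr == 'FL') | (stabbr == 'GA'))

# isin: a single method call that takes the whole list
mask_isin = stabbr.isin(states)

print(mask_isin.tolist())  # [True, False, True, False, True, True]
print(mask_loop.equals(mask_isin), mask_or.equals(mask_isin))  # True True
```

The `isin` version also stays the same length no matter how many states are added to the list.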

## `sum(s)` vs `s.sum()` 
Using the built-in **`sum`** function returns the same result as the **`sum`** Series method. Why should you care if you write it one way or the other?

Let's find the total undergraduate population.

In [None]:
pop = college['UGDS'].dropna()
pop.shape

In [None]:
sum(pop)

In [None]:
pop.sum()

Let's time the difference between the two:

In [None]:
%timeit sum(pop)

In [None]:
%timeit pop.sum()

#### Lots of overhead with pandas

In [None]:
%timeit pop.values.sum()

#### Larger performance difference with more data

In [None]:
pop_alot = pop.sample(n=1000000, replace=True)

In [None]:
%timeit sum(pop_alot)

In [None]:
%timeit pop_alot.sum()

In [None]:
%timeit pop_alot.values.sum()

#### What about taking the absolute value? 

In [None]:
s = pd.Series(np.random.randn(1000000))

In [None]:
%timeit abs(s)

In [None]:
%timeit s.abs()

Both ways of finding the absolute value have identical performance. Why is **`sum`** an order of magnitude less performant?

#### Special method `__abs__`
This massive discrepancy comes from how much control Python gives developers over each built-in. The built-in **`sum`** function follows a fixed protocol: it iterates over its argument and adds the items one by one. In contrast, developers can implement **`abs`** however they choose by defining the special method **`__abs__`** for their object.

* **`sum`** - you have no control
* **`abs`** - you have complete control

The built-in Python **`sum`** function only accepts objects that are iterable. It steps through the Series one value at a time, paying Python-level overhead on every element, to sum them up.

The Series **`sum`** method takes advantage of NumPy's pre-compiled C code to sum.

When the built-in Python **`abs`** function is passed a DataFrame or Series, the underlying **`__abs__`** method is invoked, which also uses NumPy. So **`abs(s)`** and **`s.abs()`** are equivalent.
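
The difference between the two protocols can be seen with a toy class. This is only a sketch: `Tracker` is a hypothetical class invented for illustration, not part of pandas, and it simply records which special method each built-in calls:

```python
class Tracker:
    """Hypothetical container that records which protocol a built-in uses."""

    def __init__(self, values):
        self.values = list(values)
        self.calls = []

    def __abs__(self):
        # abs(obj) delegates entirely to this method
        self.calls.append('__abs__')
        return [abs(v) for v in self.values]

    def __iter__(self):
        # sum(obj) only asks for an iterator and does the adding itself
        self.calls.append('__iter__')
        return iter(self.values)


t = Tracker([-1, 2, -3])
abs_result = abs(t)
sum_result = sum(t)

print(abs_result)  # [1, 2, 3]
print(sum_result)  # -2
print(t.calls)     # ['__abs__', '__iter__']
```

Python has no `__sum__` special method, so a class cannot hand the built-in `sum` a faster implementation the way it can for `abs`.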

#### More to the story when converting data to a list
The built-in Python **`sum`** function performs well once the data has been converted from a NumPy array to a list. Summing a list in Python happens in C and not in interpreted Python bytecode. [See this SO answer for more](https://stackoverflow.com/a/24578976/3707607).

In [None]:
v = pop_alot.tolist()

This gets closer to NumPy performance, but a Python list stores pointers to Python objects that each wrap a C primitive. NumPy stores the C primitives directly in the array and only uses homogeneous data.

In [None]:
%timeit sum(v)

In [None]:
%timeit pop_alot.sum()

NumPy is now much slower when data is in a list! 

In [None]:
%timeit np.sum(v)

Most of this time is spent converting the list to a NumPy array.

In [None]:
%timeit np.array(v)

# Use pandas DataFrame/Series methods for consistency
Although the built-in **`abs`** function is identical to the DataFrame/Series **`abs`** methods, it is preferable to use pandas operations when available. This will get you in the habit of using Series methods, which have better performance.

### Exercise 1
<span  style="color:green; font-size:16px">Take a look at the following table of all the built-in Python functions. Can you find all the functions that accept a Series and return a useful result? From these functions, can you determine if a pandas special method is being invoked?</span>

In [6]:
IFrame('https://docs.python.org/3/library/functions.html#built-in-functions', 1000, 500)

In [None]:
# define a Series
s = college['UGDS']

In [None]:
# your code here

# Summary

* Idiomatic Pandas is the most efficient, readable and effective way to write pandas
* Use **`read_csv`** and not **`read_table`** - they are the same except for the default delimiter
* Use **`s.mean()`** on a boolean Series to find percentage of values that meet a condition 
* Use **`isin`** to test multiple 'or' conditions
* Use DataFrame/Series methods and not their Python function equivalents
