# Minimally Sufficient Pandas with Ted Petrou

* Author of Pandas Cookbook

* Founder of Dunder Data

# Collect Data

* Who knows that Pandas refers to a Python library as well as an east-Asian bear?

* Have you used Pandas before?

* Have you used Pandas in production before?

# Do these apply to you?
* Don't know the difference between `[], .iloc, .loc, .ix, .at, .iat`
* Use `reset_index` frequently because you have no idea how to deal with MultiIndexes
* Use for-loops frequently
* Use `apply` frequently
* Struggle with Pandas, and find yourself wishing it were as easy as R

# Pandas Quiz #1
### How do you select the food column?

In [None]:
import pandas as pd
df = pd.read_csv('data/sample_data.csv', index_col=0)
df

# Pandas Quiz #2
### How do you select the row just for Penelope?

In [None]:
df

# Pandas Quiz #3
### How would you select the food and age columns for everyone over the age of 30?

In [None]:
df

# Minimally Sufficient Pandas

* There are multiple ways to accomplish most tasks

* Often, there is not an obvious way to do things

* A small subset of the library covers nearly all of the possible tasks

* Knowing many obscure Pandas tricks is not helpful

* Developing a standard Pandas usage guide can be helpful

* Pandas can be written in a very implicit way. Be as explicit as possible.

* Ask yourself whether method B gives you more functionality than method A

* Pandas is difficult to use in production - striving for consistency and simplicity can make a big difference

* There is an incredible number of open issues/bugs, and using a minimally sufficient subset of Pandas can help you avoid landing on one

# Simple Guidelines

* Use only bracket notation and never dot notation to select single columns
    * Dot notation does not work for column names that contain spaces
    * Dot notation does not work for column names that collide with DataFrame methods

* Only use string names for columns

* Avoid chained indexing, especially when assigning new values to subsets of data
    * Do not do this: `df[df['col1'] > 10]['col2'] = 10`

* Never use `.ix` for subset selection. It is deprecated.
* There is very little reason to use `.at` and `.iat`

* Use bracket notation instead of the `query` method to do boolean selection

* Use the arithmetic and comparison operators instead of their counterpart methods (`add`, `gt`, etc...)

* Use DataFrame/Series methods when they exist
    * Avoid built-in `Python` functions
    * Avoid the `apply` method when possible

* Do not store complex data types in DataFrame/Series values - i.e. no lists, Series, or DataFrames within DataFrames/Series

* Decide on a syntax for grouping (especially when aggregating)
    * `df.groupby(['grouping', 'columns']).agg({'aggregating column': 'aggregating func'})`
    * `df.groupby(['grouping', 'columns'])['aggregating column'].aggregating_func()`

* Have a standard way of handling a multi-level Index
    * Should you reset to single level? 
    * Should you reset and rename multi-level column indexes?

* Be very careful when calling `apply` on a `groupby` - this is one of the slowest operations in Pandas
    * Pre-calculate anything that is independent of the group

* `melt/pivot` vs `stack/unstack` - Both pairs do essentially the same thing, so pick one
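
The first guideline can be illustrated with a short sketch. The DataFrame below is hypothetical (its column names are made up for illustration and are not part of the sample data). It shows the two cases where dot notation fails while bracket notation keeps working.

```python
import pandas as pd

# Hypothetical data: one column name contains a space,
# the other collides with the DataFrame.count method
people = pd.DataFrame({'fave food': ['Steak', 'Lamb'],
                       'count': [3, 7]})

# Bracket notation always returns the column as a Series
fave_food = people['fave food']
count_col = people['count']

# Dot notation finds the *method* named count, not the column.
# (people.fave food is not even valid Python syntax.)
dot_count = people.count

print(type(count_col).__name__)  # Series
print(callable(dot_count))       # True
```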

# Chained Indexing
Chained indexing occurs when you make consecutive subset selections. If you see back-to-back brackets (`][`), you have done chained indexing.

In [None]:
df[['color', 'food', 'state']][['color', 'food']]

In [None]:
# using a single indexer
df.loc[df['age'] > 30, ['color', 'food']]

### Helpful to break apart row and column selection

In [None]:
rs = df['age'] > 30
cs = ['color', 'food']
df.loc[rs, cs]

# Two common scenarios when assigning subsets of data
1. You want to make an assignment to a particular subset of your DataFrame but want to keep doing analysis on the entire DataFrame
1. You want to select a subset of data and store it as its own variable and modify that subset without modifying your original data.

In [None]:
df1 = pd.read_csv('data/sample_data.csv', index_col=0)
df1

### No assignment!

In [None]:
df1.loc[['Aaron', 'Dean']]['color'] = 'PURPLE'
df1

### Idiomatic

In [None]:
rs = ['Aaron', 'Dean']
cs = 'color'
df1.loc[rs, cs] = 'PURPLE'
df1

# Summary of Scenario 1:
* Use exactly one set of brackets to make the assignment
* You know you've made a mistake when you see back to back brackets like this `][`
* Separate row and column selection by a comma within the same set of brackets
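
The points above can be checked on a small, self-contained example. The DataFrame here is a hypothetical stand-in for the sample data, and the warning filter is only there to keep the output quiet. This is a sketch of the behavior, not a statement about every pandas version's warning text.

```python
import warnings
import pandas as pd

# Hypothetical stand-in for the sample data
people = pd.DataFrame({'color': ['blue', 'green', 'red'],
                       'age': [30, 2, 12]},
                      index=['Jane', 'Aaron', 'Dean'])
rs = ['Aaron', 'Dean']

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the chained-assignment warning
    # Chained indexing: .loc[rs] returns a temporary copy,
    # so the assignment never reaches `people`
    people.loc[rs]['color'] = 'PURPLE'
after_chained = people['color'].tolist()

# One set of brackets, rows and columns separated by a comma
people.loc[rs, 'color'] = 'PURPLE'
after_loc = people['color'].tolist()

print(after_chained)  # ['blue', 'green', 'red']
print(after_loc)      # ['blue', 'PURPLE', 'PURPLE']
```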

# Scenario 2
Scenario 2 exists when you take a subset of data and want to keep working with just that subset. You may not care at all about the original DataFrame, but you probably won't want to change its data.

In this scenario, you will use the `copy` method to create a fresh independent copy of your subset and then make changes to that.

In [None]:
df2 = pd.read_csv('data/sample_data.csv', index_col=0)
food_score = df2[['food', 'score']]
food_score

In [None]:
criteria = food_score['food'].isin(['Steak', 'Lamb'])
food_score.loc[criteria, 'score'] = 99
food_score

In [None]:
df2

### Idiomatic
Use the `copy` method:

In [None]:
food_score = df2[['food', 'score']].copy()

criteria = food_score['food'].isin(['Steak', 'Lamb'])
food_score.loc[criteria, 'score'] = 99
food_score

# `.ix` is deprecated
Remove every trace of it from your code. It is ambiguous, and it was removed entirely in pandas 1.0. `.loc` and `.iloc` are explicit. Use them.

### Very little reason to use `.at` and `.iat`
These two indexers select a single cell from a DataFrame/Series. There is almost never going to be a case when they are necessary. They provide a small speed-up over `.loc` and `.iloc`, but if you really wanted to select data faster then you should drop down into NumPy.
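
As a rough illustration of why `.at` and `.iat` are rarely needed, the sketch below selects the same cell five ways. The DataFrame is hypothetical. The NumPy route assumes you are willing to give up labels and work with positions only.

```python
import pandas as pd

# Hypothetical data
people = pd.DataFrame({'age': [30, 2, 12],
                       'score': [4.6, 8.3, 9.0]},
                      index=['Jane', 'Aaron', 'Dean'])

# Label-based selection of a single cell
by_loc = people.loc['Aaron', 'score']
by_at = people.at['Aaron', 'score']

# Position-based selection of the same cell
by_iloc = people.iloc[1, 1]
by_iat = people.iat[1, 1]

# Dropping down to NumPy: pull out the array once, then index by position
scores = people['score'].to_numpy()
by_numpy = scores[1]

print(by_loc, by_at, by_iloc, by_iat, by_numpy)
```

All five expressions return the same value, so `.loc` and `.iloc` cover the functionality; the other indexers differ only in speed.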

# `query` method
It is more readable, but it does not work with column names that contain spaces unless you quote them with backticks (pandas 0.25 or later). It also adds no functionality over normal boolean indexing, so why use it?

In [None]:
df.query('age > 30')

In [None]:
df[df['age'] > 30]

In [None]:
df3 = df.copy()

In [None]:
df3 = df3.rename(columns={'food': 'fave food'})
df3

In [None]:
# fails: the unquoted column name contains a space
df3.query('fave food == "Steak"')
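
For completeness: pandas 0.25 and later accept backtick-quoted column names inside `query`, so the failing call above does have a working spelling. A minimal sketch on a hypothetical three-row DataFrame (not the sample data), assuming such a pandas version is installed:

```python
import pandas as pd

# Hypothetical data with a space in a column name
meals = pd.DataFrame({'fave food': ['Steak', 'Lamb', 'Steak'],
                      'age': [30, 2, 54]},
                     index=['Jane', 'Aaron', 'Penelope'])

# Backticks let query refer to a name that is not a valid identifier
via_query = meals.query('`fave food` == "Steak"')

# Plain boolean indexing needs no special quoting rules
via_brackets = meals[meals['fave food'] == 'Steak']

print(via_query.index.tolist())
```

Both selections return the same rows, which is the point of the guideline: bracket notation already handles every column name.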

# Arithmetic and Comparison Operators
Use the arithmetic and comparison operators `+, -, *, /, <, >, <=, >=, ==, !=` over their counterpart methods `add, sub, mul, div, lt, gt, le, ge, eq, ne` unless you need to change the direction of an operation.

In [None]:
college = pd.read_csv('data/college.csv', index_col='instnm')
pd.options.display.max_columns = 100
college.head()

In [None]:
college_ugds = college.loc[:, 'ugds_white':'ugds_unkn']
college_ugds.head()

In [None]:
race_ugds_mean = college_ugds.mean()
race_ugds_mean

### Default is to align Series index with columns

In [None]:
college_ugds_mean_diff = college_ugds - race_ugds_mean
college_ugds_mean_diff.head(10)

In [None]:
race_school_min = college_ugds.min(axis='columns')
race_school_min.head(10)

In [None]:
# blows up due to outer join of index
college_ugds - race_school_min

Arithmetic and comparison **methods** default to `axis='columns'`. Almost all other methods default to `axis='index'`. We must use the `sub` method to change the direction of the operation.

In [None]:
college_ugds.sub(race_school_min, axis='index').head(10)
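
The difference in alignment is easier to see on a tiny example. The data below is hypothetical and uses unique row labels; it is a sketch of the alignment rule, not a reproduction of the college result.

```python
import pandas as pd

# Hypothetical shares for two schools
shares = pd.DataFrame({'a': [0.5, 0.25], 'b': [0.25, 0.75]},
                      index=['School X', 'School Y'])
row_min = shares.min(axis='columns')  # one value per school

# Operator: the Series index (school names) is aligned with the
# column labels ('a', 'b'). Nothing matches, so every value is NaN.
by_operator = shares - row_min

# Method with axis='index': the Series index is aligned with the rows
by_method = shares.sub(row_min, axis='index')

print(by_operator.shape)
print(by_method)
```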

# Use DataFrame/Series methods
A common mistake is to use a built-in core Python function instead of a DataFrame/Series method.

In [None]:
ugds = college['ugds'].dropna()
ugds.head(10)

In [None]:
sum(ugds)

In [None]:
ugds.sum()

## No difference except when there are missing values

In [None]:
sum(college['ugds'])

In [None]:
college['ugds'].sum()
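
The same contrast on a hypothetical three-value Series, small enough to verify by hand:

```python
import math
import numpy as np
import pandas as pd

# Hypothetical enrollment counts with one missing value
enrollment = pd.Series([100.0, np.nan, 250.0])

# The built-in adds element by element, so NaN propagates
builtin_total = sum(enrollment)

# The Series method skips missing values by default (skipna=True)
method_total = enrollment.sum()

print(math.isnan(builtin_total))  # True
print(method_total)               # 350.0
```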

## Large performance difference

In [None]:
ugds1 = ugds.sample(n=10**6, replace=True)

In [None]:
%timeit -n 5 sum(ugds1)

In [None]:
%timeit -n 5 ugds1.sum()

# `apply` - the method that does nothing but is used the most often
The `apply` method does basically nothing on its own. It simply saves you from writing a for-loop by hand.

In [None]:
college_ugds.head()

In [None]:
college_ugds.apply(lambda x: x.max())

In [None]:
%timeit -n 5 college_ugds.apply(lambda x: x.max())

In [None]:
college_ugds.max()

In [None]:
%timeit -n 5 college_ugds.max()

In [None]:
college_ugds.apply(lambda x: x.max(), axis='columns').head()

In [None]:
college_ugds.max(axis='columns').head()

### Huge time difference when doing `axis='columns'`
A for-loop over the rows is a very slow operation. Avoid it at all costs.

In [None]:
%timeit -n 1 -r 1 college_ugds.apply(lambda x: x.max(), axis='columns')

In [None]:
%timeit -n 5 college_ugds.max(axis='columns').head()

# Acceptable usages of `apply`
Only use `apply` when a built-in pandas method does not exist.

In [None]:
earnings_debt = college[['md_earn_wne_p10', 'grad_debt_mdn_supp']]
earnings_debt.head()

In [None]:
earnings_debt.dtypes

In [None]:
earnings_debt.astype('float')

In [None]:
# raises TypeError: pd.to_numeric accepts 1-D input only, not a DataFrame
pd.to_numeric(earnings_debt)

In [None]:
earnings_debt.apply(pd.to_numeric, errors='coerce').head()
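
A self-contained sketch of the same pattern follows. The column names and the `'PrivacySuppressed'` placeholder string are assumptions made for illustration; the point is only that `apply` hands each column to `pd.to_numeric`, and that keyword arguments such as `errors='coerce'` are forwarded to it.

```python
import pandas as pd

# Hypothetical string columns containing a non-numeric placeholder
raw = pd.DataFrame({'earnings': ['30300', 'PrivacySuppressed', '42100'],
                    'debt': ['33888', '21941.5', 'PrivacySuppressed']})

# pd.to_numeric works on one column at a time, so apply calls it per column.
# errors='coerce' turns unparseable strings into missing values.
numeric = raw.apply(pd.to_numeric, errors='coerce')

print(numeric.isna().sum().tolist())  # [1, 1]
```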

# Storing complex objects inside DataFrames/Series
Just because Pandas allows you to do something does not mean it is a good idea. Pandas has poor support for non-scalar values stored within the cells of DataFrames/Series. Store multiple values in separate columns instead.

In [None]:
# never do this
college_ugds.head(20).apply(lambda x: pd.Series({'max and min': [x.min(), x.max()]}), axis=1).head()
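
One possible alternative, sketched on hypothetical data: compute each value with its own vectorized method and keep the results in separate scalar columns.

```python
import pandas as pd

# Hypothetical shares, one row per school
shares = pd.DataFrame({'a': [0.5, 0.25],
                       'b': [0.25, 0.75],
                       'c': [0.25, 0.0]})

# One scalar per cell: a 'min' column and a 'max' column
min_max = pd.DataFrame({'min': shares.min(axis='columns'),
                        'max': shares.max(axis='columns')})

print(min_max)
```

Each column stays numeric, so later filtering, sorting, and arithmetic work without unpacking lists.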

# Know the three components of a groupby aggregation
All groupby aggregations contain 3 components:
* Grouping Columns - Unique combinations of the values in these columns form independent groups
* Aggregating Columns - The values in these columns will be aggregated to a single value
* Aggregating functions - The type of aggregation to be used. Must output a single value
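
The three components can be named explicitly in code. The data and variable names below are hypothetical and exist only to label each part of the call.

```python
import pandas as pd

# Hypothetical data
schools = pd.DataFrame({'state': ['TX', 'TX', 'CA', 'CA'],
                        'sat_math': [500, 620, 580, 700]})

grouping_cols = ['state']   # 1. grouping columns
agg_col = 'sat_math'        # 2. aggregating column
agg_func = 'max'            # 3. aggregating function

result = schools.groupby(grouping_cols).agg({agg_col: agg_func})
print(result)
```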

# `groupby` syntax - standardize for readability
There are a number of syntaxes that get used for the `groupby` method. 

In [None]:
# syntax that I use
state_math_sat_max = college.groupby('stabbr') \
                            .agg({'satmtmid': 'max'})
state_math_sat_max.head()

In [None]:
college.groupby('stabbr')['satmtmid'].agg('max').head()

In [None]:
# no reason to use the full word aggregate. Always use agg
college.groupby('stabbr')['satmtmid'].aggregate('max').head()

In [None]:
college.groupby('stabbr')['satmtmid'].max().head()

In [None]:
college[['stabbr', 'satmtmid']].groupby('stabbr').max().head()

# Handling a MultiIndex - Usually after grouping

In [None]:
col_stats = college.groupby(['stabbr', 'relaffil']) \
                   .agg({'ugds': ['min', 'max'], 
                        'satmtmid': ['median', 'max']})
col_stats.head(10)

### I don't like MultiIndexes
Personally, I find that MultiIndexes add no value to pandas. Selecting subsets of data from them is not obvious. Instead, renaming the columns by hand is not a bad strategy. We can also reset the index.

In [None]:
col_stats.columns = ['min ugds', 'max ugds', 'median satmtmid', 'max satmtmid']
col_stats = col_stats.reset_index()
col_stats.head()
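
When there are too many columns to rename by hand, the names can also be built from the two levels. This is a sketch on hypothetical data, not the approach used above; it assumes the column MultiIndex has exactly two levels (column name, function name), which is what `agg` produces when given a dictionary of lists.

```python
import pandas as pd

# Hypothetical data
schools = pd.DataFrame({'state': ['TX', 'TX', 'CA'],
                        'relaffil': [0, 1, 0],
                        'ugds': [100, 300, 200],
                        'sat': [500, 600, 550]})

stats = schools.groupby(['state', 'relaffil']) \
               .agg({'ugds': ['min', 'max'],
                     'sat': ['median', 'max']})

# Each column label is a (column, function) tuple; join them into one string
stats.columns = [f'{func} {col}' for col, func in stats.columns]
stats = stats.reset_index()

print(stats.columns.tolist())
```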

# Calling `apply` on a `groupby` object - be careful
Using `apply` within a `groupby` can lead to disastrous performance. It is one of the slowest operations in all of pandas. 

### Finding the percentage of each state's undergraduates enrolled in its 5 most populous colleges
To accomplish this, we write a custom function that sorts the values of each group from greatest to least, selects the first 5 values with `.iloc`, sums them, and divides that sum by the group total.

In [None]:
def top5_perc(s):
    s = s.sort_values(ascending=False)
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

In [None]:
college.groupby('stabbr').agg({'ugds': top5_perc}).head(10)

# Run operations that are independent of the group outside of the custom function
The best way to avoid large performance losses with groupby-apply is to run every operation that is independent of the group outside of the custom aggregation function. Here, we sort the entire DataFrame once, up front, so the function no longer has to sort each group.

In [None]:
def top5_perc_simple(s):
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

In [None]:
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)

In [None]:
%timeit -n 5 college.groupby('stabbr').agg({'ugds': top5_perc})

In [None]:
%%timeit -n 5 
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)

# Pandas Power User Optimization

In [None]:
college_top5 = college.sort_values('ugds', ascending=False) \
                      .groupby('stabbr').head()

In [None]:
top5_total = college_top5.groupby('stabbr').agg({'ugds': 'sum'})
top5_total.head()

In [None]:
total = college.groupby('stabbr').agg({'ugds': 'sum'})
total.head()

In [None]:
(top5_total / total).head()

In [None]:
%%timeit -n 5
college_top5 = college.sort_values('ugds', ascending=False) \
                      .groupby('stabbr').head()
top5_total = college_top5.groupby('stabbr').agg({'ugds': 'sum'})
total = college.groupby('stabbr').agg({'ugds': 'sum'})
top5_total / total
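
To see that the all-built-in version computes the same thing as the custom function, here is a check on hypothetical data small enough to verify by hand. The column names are placeholders.

```python
import pandas as pd

# Hypothetical enrollments: six schools in TX, two in CA
schools = pd.DataFrame({'state': ['TX'] * 6 + ['CA'] * 2,
                        'ugds': [10, 20, 30, 40, 50, 150, 5, 15]})

def top5_perc(s):
    # Sort within the group, then divide the top 5 by the group total
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

slow = schools.groupby('state').agg({'ugds': top5_perc})

# Same result using only built-in methods:
# sort once, keep the first 5 rows of each group, then sum
top5 = schools.sort_values('ugds', ascending=False) \
              .groupby('state').head(5)
top5_total = top5.groupby('state').agg({'ugds': 'sum'})
total = schools.groupby('state').agg({'ugds': 'sum'})
fast = top5_total / total

print(fast)
```

For TX the top five sum to 290 out of 300; CA has only two schools, so its ratio is 1.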

# `melt` vs `stack`
These methods are virtually identical. I prefer `melt` as it avoids a multi-level index.

In [None]:
movie = pd.read_csv('data/movie.csv')
movie.head()

In [None]:
act1 = movie.melt(id_vars=['title'], 
                  value_vars=['actor1', 'actor2', 'actor3'], 
                  var_name='actor number',
                  value_name='actor name')

In [None]:
stacked = movie.set_index('title')[['actor1', 'actor2', 'actor3']].stack()
stacked.head()

In [None]:
stacked.reset_index(name='actor name').head(10)

In [None]:
act1.pivot(index='title', columns='actor number', values='actor name').head()

In [None]:
stacked.unstack().head()
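
The equivalence can be confirmed on a hypothetical two-movie table. The only differences are row order and the extra steps `stack` needs to name the new level and flatten the index.

```python
import pandas as pd

# Hypothetical data
movies = pd.DataFrame({'title': ['A', 'B'],
                       'actor1': ['x1', 'y1'],
                       'actor2': ['x2', 'y2']})

melted = movies.melt(id_vars='title',
                     value_vars=['actor1', 'actor2'],
                     var_name='actor number',
                     value_name='actor name')

stacked = movies.set_index('title')[['actor1', 'actor2']].stack()
stacked_flat = stacked.rename_axis(['title', 'actor number']) \
                      .reset_index(name='actor name')

# Same rows once both are sorted the same way
key = ['title', 'actor number']
melted_rows = melted.sort_values(key).values.tolist()
stacked_rows = stacked_flat.sort_values(key).values.tolist()
print(melted_rows == stacked_rows)  # True
```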

# `pivot_table` vs `groupby` then `unstack`
`pivot_table` can directly create a pivot table. You can achieve the exact same result by grouping by multiple columns and then unstacking. I prefer the pivot table as it is clearer.

In [None]:
emp = pd.read_csv('data/employee.csv')
emp.head()

In [None]:
emp.pivot_table(index='race', columns='gender', values='salary')

In [None]:
race_gen_sal = emp.groupby(['race', 'gender']).agg({'salary': 'mean'})
race_gen_sal

In [None]:
race_gen_sal.unstack('gender')
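
A hand-checkable version of the same comparison, on hypothetical data. Unlike the cell above, this sketch selects the `salary` column as a Series before aggregating, so the unstacked result has single-level columns and lines up directly with the pivot table.

```python
import pandas as pd

# Hypothetical data
staff = pd.DataFrame({'race': ['A', 'A', 'B', 'B', 'B'],
                      'gender': ['F', 'M', 'F', 'M', 'M'],
                      'salary': [50.0, 60.0, 70.0, 80.0, 100.0]})

# pivot_table aggregates with the mean by default
pivoted = staff.pivot_table(index='race', columns='gender', values='salary')

# Group by both columns, take the mean, then move gender into the columns
unstacked = staff.groupby(['race', 'gender'])['salary'].mean() \
                 .unstack('gender')

print(pivoted.values.tolist())  # [[50.0, 60.0], [70.0, 90.0]]
```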