# Download material at [github.com/tdpetrou/minimally-sufficient-pandas][repo]

[repo]: http://github.com/tdpetrou/minimally-sufficient-pandas

# Minimally Sufficient Pandas by Ted Petrou

## 3rd Annual Global Big Data Conference, Santa Clara, CA, January 23, 2019

## About Me

* Founder of [Dunder Data][0] - Professional data science training company
![][1]
* Author of Pandas Cookbook
* Author of Master Data Analysis with Pandas
    * 3 Volume Set, 1,000 pages, 500 exercises, 10+ projects
* Author of Exercise Python
    * Introduction to Python with over 100 exercises and several projects
* Author of Dexplo and Dexplot
    * Data exploration and visualization libraries
* Twitter - [TedPetrou][2]
    
[0]: https://dunderdata.com/
[1]: images/pc.png?a=4
[2]: https://twitter.com/tedPetrou

## Target Audience

This tutorial targets those who have used Pandas before and want a more streamlined and efficient approach to using the library.

## What does minimally sufficient Pandas mean?
Pandas is the most popular Python library for doing data analysis. Unfortunately, it is also one of the most difficult libraries to use properly. There are several reasons for this:

* There is often more than one way to get the same result
* Many tutorials will show different ways to do the same thing
* There are over 300 total attributes and methods
* Some of these methods do the exact same thing (are aliases of each other)
* Some of these methods are very similar and could be condensed into one
* There are many tutorials (as well as the documentation) that show highly inefficient and non-idiomatic approaches to Pandas
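As a small illustration of the alias problem in the list above, the sketch below compares `isna` with `isnull` and `notna` with `notnull`. The DataFrame is made up for this example and is not part of the tutorial data.

```python
import pandas as pd

# Tiny made-up frame with missing values; not the tutorial's college data.
df = pd.DataFrame(
    {"city": ["Austin", "Boston", None], "population": [1.0, None, 3.0]}
)

# isna and isnull are aliases: both return the same boolean mask.
same_missing_mask = df.isna().equals(df.isnull())

# notna and notnull are aliases of each other as well.
same_present_mask = df.notna().equals(df.notnull())

print(same_missing_mask, same_present_mask)
```

Since each pair returns identical results, picking one name from each pair and using it everywhere costs nothing in functionality.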

### My definition for minimally sufficient Pandas
* A small subset of the Pandas library is sufficient to accomplish nearly everything that it has to offer. 
* Focus on doing data analysis and don't get bogged down with syntax
* The whole point of being an analyst is to analyze data, not to learn every single possible method and trick that the library has to offer.

### Minimize Complex Code

With a minimally sufficient subset ...

* Your code will be simple, explicit, straightforward, and boring
* You will choose one obvious way to accomplish a task
* Use this obvious way every single time
* You won't have to retain as many commands in working memory
* Your code will be easier to read by others and by you after a break

### Do any of these apply to you?

* Get anxiety because of the enormous number of methods and feel you might be missing out on some special part of the library
* Wonder why there are so many methods that do the exact same thing
* Work with team members that use Pandas code that is difficult to understand
* Don't know the difference between `[], iloc, loc, ix, at, iat` and why there are so many different ways to select subsets of data
* Have trouble dealing with the index
* Have even more trouble dealing with multi-level indexes (MultiIndex)
* Have no idea what to do with the `SettingWithCopyWarning`
* Use the `apply` method frequently
* Use any for-loops at all
* Get confused by all the different `groupby` syntaxes
* Write custom `groupby` functions that are extremely slow
* Find yourself wishing it was more like R
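To make the subset-selection item in the list above concrete, here is a minimal sketch of how `[]`, `loc`, `iloc`, `at`, and `iat` can all reach the same value. The data and index labels are invented for illustration. `ix` is left out because it is deprecated and was removed in pandas 1.0.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame(
    {"state": ["TX", "MA", "CA"], "ugds": [100, 200, 300]},
    index=["a", "b", "c"],
)

col = df["ugds"]                   # []   : select a column by name
by_label = df.loc["b", "ugds"]     # loc  : row and column by label
by_position = df.iloc[1, 1]        # iloc : row and column by integer position
scalar_label = df.at["b", "ugds"]  # at   : single value by label
scalar_pos = df.iat[1, 1]          # iat  : single value by integer position

print(by_label, by_position, scalar_label, scalar_pos)
```

All four scalar lookups return the same value here, which is the kind of overlap a minimally sufficient subset tries to remove.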



### No Tricks
Eliminating much of the library will come with some (good) limitations. 

* Knowing many obscure Pandas tricks might impress your friends, but it doesn't usually lead to good code. 
* Knowing more tricks can lead to very long lines of code
* Longer lines of code may be harder to debug
* Ask yourself whether method B gives you more functionality than method A
* Pandas is difficult to use in production - striving for consistency and simplicity can make a big difference
* Pandas has an incredible number of issues/bugs, and using a minimally sufficient subset can help you avoid landing on one

## Tutorial Objectives

### Core objectives

* Know why having a single method for doing a particular task is good practice
* Have guidance on how to approach very common data analysis tasks, with a single suggested way to accomplish each
* Have my complete list of attributes and methods that allow you to accomplish nearly all tasks

### More specific objectives
* Idiomatic subset selection
* Handling the `SettingWithCopyWarning`
* Avoiding aliases
* Know how to avoid `apply` and for-loops
* Know how to handle the MultiIndex after a `groupby`
* Know the equivalence of `groupby`, `pivot_table`, and `pd.crosstab`
* Know why having Pandas guidelines for production is a very good thing
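The equivalence of `groupby`, `pivot_table`, and `pd.crosstab` named in the objectives can be sketched on a toy dataset. The column names `region` and `kind` and all values below are assumptions made up for this example. Only `ugds` echoes a column name from the tutorial data.

```python
import pandas as pd

# Made-up data: every region/kind combination appears at least once.
df = pd.DataFrame({
    "region": ["west", "west", "east", "east", "east"],
    "kind": ["public", "private", "public", "public", "private"],
    "ugds": [10, 20, 30, 40, 50],
})

# 1. groupby, then unstack the inner level to get a wide table
by_groupby = df.groupby(["region", "kind"])["ugds"].sum().unstack("kind")

# 2. pivot_table produces the wide table directly
by_pivot = df.pivot_table(
    index="region", columns="kind", values="ugds", aggfunc="sum"
)

# 3. crosstab takes Series rather than column names
by_crosstab = pd.crosstab(
    index=df["region"], columns=df["kind"], values=df["ugds"], aggfunc="sum"
)

# Compare the underlying values of the three results.
same = (
    by_groupby.values.tolist()
    == by_pivot.values.tolist()
    == by_crosstab.values.tolist()
)
print(same)
```

The three calls differ only in syntax for this aggregation, so a team can standardize on one of them.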

In [5]:
import pandas as pd
c = pd.read_csv('data/college.csv')

In [6]:
# Count the public (non-underscore) attributes and methods of the DataFrame
len([m for m in dir(c) if not m.startswith('_')])

243

## Attendance and Skill Level

In [None]:
from IPython.display import IFrame
IFrame('https://directpoll.com/v?XDVhEtE5iKBCVu81bhI3a9iCToKziCmYz', 600, 400)

In [None]:
from IPython.display import IFrame
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8gcNzkffHI6YeLC2OAj8TtSsViOxH3', 600, 400)

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 100)
college = pd.read_csv('data/college.csv')
college.head()

## Pandas Skills Test

### Exercise 1
<span  style="color:green; font-size:16px">What is the median SAT Math score (satmtmid) for University of Arkansas?</span>

### Exercise 2
<span  style="color:green; font-size:16px">What state (stabbr) has the 10th highest total undergraduate population (ugds) and what is that population?</span>

## Pandas Skills Test Results

In [None]:
IFrame('https://directpoll.com/v?XDVhEtjm0FjufI7yoKzYsetJkwpN8VAh', 600, 400)

In [None]:
IFrame('https://directpoll.com/r?XDbzPBd3ixYqg8BJfuINpfH3lH2hKtLqPbOqEe5nDf', 600, 400)