#Pandas

##What is Pandas?
A Python library providing data structures and data analysis tools.

##Why
- Alternative to Excel or R
- Built around DataFrames (think of a table) and Series (a single-column table or a time series)

##Learning Pandas
* Almost anything you want to do is already a built-in function in Pandas.
* Before you write your own function to operate on a Pandas object, search the Pandas docs and StackOverflow.
* http://pandas.pydata.org/pandas-docs/stable/index.html

#Objectives

- Create/Understand Series objects
- Create/Understand DataFrame objects
- Create and destroy new columns, apply functions to rows and columns
- Join/Merge DataFrames
- Use DataFrame grouping and aggregation
- Perform high-level EDA using Pandas

### Standard Imports

In [None]:
# By convention import pandas like:
import pandas as pd
import numpy as np

# For fake data.
from numpy.random import randn

#Series

Think of a Pandas Series as a _labeled_ one-dimensional vector. It need not be a numeric vector; it can contain arbitrary Python objects.

Integer valued series:
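
A minimal sketch of an integer-valued Series; the values and the name `int_series` are illustrative only.

```python
import pandas as pd

# Integer-valued Series; pandas supplies the default index 0, 1, 2, ...
int_series = pd.Series([1, 2, 3, 4, 5])
print(int_series)
```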

Real valued series:
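
One way to build a real-valued Series is from the `randn` fake data imported earlier; the name `real_series` is an assumption.

```python
import pandas as pd
from numpy.random import randn

# Real-valued Series holding five draws from a standard normal distribution.
real_series = pd.Series(randn(5))
print(real_series)
```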

String valued series:
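
A Series can hold strings just as well; the words below are made up for illustration.

```python
import pandas as pd

# String-valued Series with the default integer index.
str_series = pd.Series(["alpha", "beta", "gamma"])
print(str_series)
```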

#Indexes

Notice how each series has an index (in this case a relatively meaningless default index). Pandas can make great use of informative indexes. An index works much like the keys of a dictionary: it allows fast lookups of the data associated with each label, which helps optimize many operations.

In [None]:
# Sample index - each data point is labelled with a state.
index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

Labelled numeric series:
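
A sketch of two labelled Series built on the sample indexes, filled with `randn` fake data; the names `series1` and `series2` are assumptions.

```python
import pandas as pd
from numpy.random import randn

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

# Each value is labelled with a state instead of a position.
series1 = pd.Series(randn(5), index=index1)
series2 = pd.Series(randn(5), index=index2)

# Lookup by label, much like a dictionary key.
print(series1['Alabama'])
```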

The index is used to line up arithmetic operations.
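
To make the alignment visible, this sketch uses fixed values rather than random ones. Labels present in only one Series produce `NaN` in the sum.

```python
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index1)
s2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index2)

# Values are matched by label, not by position.
total = s1 + s2
print(total)
```

Only Alabama, Indiana, and Montana appear in both indexes, so only those three labels get a numeric result.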

Create a series indexed by month
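
A minimal sketch using `pd.date_range` with the month-start frequency `"MS"`; the start date and values are arbitrary choices for illustration.

```python
import pandas as pd

# Six month-start timestamps beginning January 2015.
months = pd.date_range("2015-01-01", periods=6, freq="MS")
monthly = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=months)
print(monthly)
```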

Resample by week
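
Going from monthly to weekly data is an upsampling step, so the new weekly slots need a fill rule. This sketch assumes forward-filling is acceptable; other choices (interpolation, leaving `NaN`) are equally possible.

```python
import pandas as pd

months = pd.date_range("2015-01-01", periods=6, freq="MS")
monthly = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=months)

# Weekly bins (labelled by their Sunday), each carrying the latest monthly value.
weekly = monthly.resample("W").ffill()
print(weekly.head())
```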

#DataFrames
Data frames extend the concept of Series to table-like data.

From a dictionary of series
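
A sketch with two Series on the sample state indexes; the column names `first` and `second` are assumptions. The resulting frame is indexed by the union of the labels, with `NaN` where a Series has no value.

```python
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

s1 = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=index1)
s2 = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], index=index2)

# Dictionary keys become column names; the Series are aligned on their indexes.
frame = pd.DataFrame({"first": s1, "second": s2})
print(frame)
```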

From a numpy array
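
A sketch that wraps a 5x3 array of `randn` fake data; the column names `a`, `b`, and `c` are illustrative.

```python
import pandas as pd
from numpy.random import randn

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']

# Rows are labelled by state, columns by the names passed in.
frame = pd.DataFrame(randn(5, 3), index=index1, columns=["a", "b", "c"])
print(frame)
```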

DataFrames can be indexed (selected) by label, by integer position (avoid if possible), or by boolean mask.
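
The three selection styles can be sketched as follows, on a small frame with predictable values (the frame and its column names are assumptions).

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

by_label = frame.loc["Alabama", "b"]   # row label and column label
by_position = frame.iloc[0, 2]         # integer row and column positions
by_mask = frame[frame["a"] > 6]        # rows where the boolean mask is True
print(by_mask)
```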

Each column is a series
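
A quick check on an illustrative frame (names are assumptions):

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# Selecting a single column returns a Series named after that column.
column = frame["a"]
print(type(column), column.name)
```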


So are the rows
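
The same check for a row, again on an illustrative frame:

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# Selecting a single row by label also returns a Series, named after the row label.
row = frame.loc["Alabama"]
print(type(row), row.name)
```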


The columns all have the same index:
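
One way to confirm this is `Index.equals`, shown here on an illustrative frame:

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# Every column Series carries the frame's row index.
same_index = frame["a"].index.equals(frame["b"].index)
matches_frame = frame["a"].index.equals(frame.index)
print(same_index, matches_frame)
```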


What's the index for the rows?
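
A short check suggests the answer: a row Series is indexed by the frame's column names (illustrative frame, assumed names).

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# The index of a row is the set of column labels.
row_index = frame.loc["Alabama"].index
print(row_index)
```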


#DataFrame basics

In [None]:
df

New column
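
Assigning to a new column name creates the column. The frame and the column name `total` below are assumptions.

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# New column computed element-wise from two existing columns.
frame["total"] = frame["a"] + frame["b"]
print(frame)
```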


Delete a column
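
Two common options, sketched on an illustrative frame: `del` removes a column in place, while `drop` returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

del frame["c"]                        # in place
smaller = frame.drop(columns=["b"])   # returns a copy without column 'b'
print(frame.columns.tolist(), smaller.columns.tolist())
```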


#Applying functions

Mean of each column
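
A sketch on an illustrative frame; `mean()` reduces down the rows by default (`axis=0`).

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# One mean per column.
column_means = frame.mean()
print(column_means)
```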


Mean of each row
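
Passing `axis=1` reduces across the columns instead (illustrative frame, assumed names).

```python
import numpy as np
import pandas as pd

index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
frame = pd.DataFrame(np.arange(15).reshape(5, 3), index=index1, columns=["a", "b", "c"])

# One mean per row.
row_means = frame.mean(axis=1)
print(row_means)
```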


#Merging/Joining

In [None]:
state_df = pd.DataFrame({'governor':['Robert Bentley',
                                    'Bill Walker',
                                    'Doug Ducey',
                                    'Asa Hutchinson'],
                         'state':['Alabama', 
                                  'Alaska', 
                                  'Arizona', 
                                  'Arkansas']}
                        )
# Note: merge is most useful when you want to merge on something other than the index.
# By default, it merges on the column names the two frames have in common.
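
A sketch of a merge against a second frame. `capital_df` is a hypothetical companion table introduced only for this example; it deliberately omits Alaska to show how the join type matters.

```python
import pandas as pd

state_df = pd.DataFrame({'governor': ['Robert Bentley', 'Bill Walker',
                                      'Doug Ducey', 'Asa Hutchinson'],
                         'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas']})

# Hypothetical second table sharing the 'state' column.
capital_df = pd.DataFrame({'state': ['Alabama', 'Arizona', 'Arkansas'],
                           'capital': ['Montgomery', 'Phoenix', 'Little Rock']})

inner = pd.merge(state_df, capital_df)                         # inner join on 'state'
left = pd.merge(state_df, capital_df, on='state', how='left')  # keep every state
print(left)
```

The inner merge drops Alaska because it has no match, while the left merge keeps it with a missing capital.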

Concat allows joining along the axes.
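
A minimal sketch of both directions, using pieces of the governors table; the variable names are assumptions.

```python
import pandas as pd

state_df = pd.DataFrame({'governor': ['Robert Bentley', 'Bill Walker',
                                      'Doug Ducey', 'Asa Hutchinson'],
                         'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas']})

top, bottom = state_df.iloc[:2], state_df.iloc[2:]

stacked = pd.concat([top, bottom])                                           # axis=0: stack rows
side_by_side = pd.concat([state_df['state'], state_df['governor']], axis=1)  # axis=1: add columns
print(side_by_side)
```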

Join also works. It behaves like merge, but by default it joins on the indexes.
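
A sketch of an index-based join. `capital_df` is a hypothetical table introduced for this example, and both frames are re-indexed by state first.

```python
import pandas as pd

state_df = pd.DataFrame({'governor': ['Robert Bentley', 'Bill Walker',
                                      'Doug Ducey', 'Asa Hutchinson'],
                         'state': ['Alabama', 'Alaska', 'Arizona', 'Arkansas']})
capital_df = pd.DataFrame({'state': ['Alabama', 'Arizona', 'Arkansas'],
                           'capital': ['Montgomery', 'Phoenix', 'Little Rock']})

governors = state_df.set_index('state')
capitals = capital_df.set_index('state')

# join() matches on the index and performs a left join by default.
joined = governors.join(capitals)
print(joined)
```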

#Load some data from disk

In [None]:
df = pd.read_csv('data/playgolf.csv', sep='|')
df

In [None]:
# Describe
# Note: This treats the Boolean Windy variable as a series of 0's and 1's.
#       With this representation, the mean is the proportion of True values.
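
The contents of `data/playgolf.csv` are not shown here, so this sketch uses a made-up stand-in frame; the `Temperature` column and all values are assumptions. It casts `Windy` to integers explicitly so the result does not depend on how a given pandas version handles Boolean columns in `describe()`.

```python
import pandas as pd

# Made-up stand-in for the golf data.
golf = pd.DataFrame({"Temperature": [85, 80, 83, 70],
                     "Windy": [False, True, False, True]})

# Summary statistics; the mean of the 0/1 column is the share of windy days.
summary = golf.astype({"Windy": int}).describe()
print(summary)
```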


Date as the index
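
A sketch with `set_index` on a made-up stand-in frame; the column names `Date`, `Outlook`, and `Temperature` are assumptions about the file's layout.

```python
import pandas as pd

# Made-up stand-in for the golf data.
golf = pd.DataFrame({"Date": ["2014-07-01", "2014-07-02", "2014-07-03"],
                     "Outlook": ["sunny", "sunny", "overcast"],
                     "Temperature": [85, 80, 83]})

# Parse the strings into timestamps, then move the column into the index.
golf["Date"] = pd.to_datetime(golf["Date"])
golf = golf.set_index("Date")
print(golf)
```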


Some subsets
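
A few ways to take subsets, sketched on a made-up, date-indexed stand-in frame (column names and values are assumptions).

```python
import pandas as pd

dates = pd.date_range("2014-07-01", periods=6, freq="D")
golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64],
                     "Windy": [False, True, False, False, True, True]},
                    index=dates)

windy_days = golf[golf["Windy"]]                                      # boolean mask
hot = golf.loc[golf["Temperature"] > 80, ["Outlook", "Temperature"]]  # rows and columns
early = golf.loc["2014-07-01":"2014-07-03"]                           # date slice, inclusive
print(hot)
```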


Averages of the numeric variables
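
A sketch using `numeric_only=True` so that text columns are skipped; the stand-in frame and its column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for the golf data.
golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# Column means, ignoring the non-numeric 'Outlook' column.
averages = golf.mean(numeric_only=True)
print(averages)
```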


Apply an arbitrary function to each column
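
As an illustration, this sketch applies a hypothetical `value_range` helper to each numeric column of a made-up stand-in frame.

```python
import pandas as pd

golf = pd.DataFrame({"Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

def value_range(column):
    """Hypothetical helper: spread between largest and smallest value."""
    return column.max() - column.min()

# apply() calls the function once per column (axis=0 is the default).
ranges = golf.apply(value_range)
print(ranges)
```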


Or each row
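
With `axis=1` the function receives one row at a time. The stand-in frame and the lambda below are illustrative assumptions.

```python
import pandas as pd

golf = pd.DataFrame({"Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# Each row arrives as a Series indexed by column name.
gap = golf.apply(lambda row: row["Temperature"] - row["Humidity"], axis=1)
print(gap)
```

For simple arithmetic like this, the vectorized form `golf["Temperature"] - golf["Humidity"]` gives the same result and is usually faster.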


#split-apply-combine

Get averages for each outlook
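
A sketch of the split-apply-combine pattern with `groupby`, on a made-up stand-in frame (column names and values are assumptions).

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# Split by outlook, average each group, combine into one frame.
by_outlook = golf.groupby("Outlook")[["Temperature", "Humidity"]].mean()
print(by_outlook)
```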


Or using the index
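
If the grouping variable lives in the index, `groupby(level=0)` groups on it directly. The stand-in frame is an assumption.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64]})

# Move 'Outlook' into the index, then group on that index level.
indexed = golf.set_index("Outlook")
by_index = indexed.groupby(level=0).mean()
print(by_index)
```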


Initialize a groupby object and iterate through the groups
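
Iterating over a groupby object yields `(key, sub-frame)` pairs. The stand-in frame and the `sizes` dictionary are illustrative assumptions.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64]})

grouped = golf.groupby("Outlook")

sizes = {}
for outlook, group in grouped:
    # 'group' is a DataFrame holding only the rows for this outlook.
    sizes[outlook] = len(group)
    print(outlook)
    print(group)
```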


Result: one row for each group.
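
Aggregation is the usual way to get this shape. A sketch with `agg` on a made-up stand-in frame:

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# Each group collapses to a single row; a different reduction per column.
per_group = golf.groupby("Outlook").agg({"Temperature": "mean", "Humidity": "max"})
print(per_group)
```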


Result: same shape as the original
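
`transform` broadcasts each group's result back onto the original rows. A sketch on a made-up stand-in frame:

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64]})

# Group mean repeated for every row of the group, aligned to the original index.
group_mean = golf.groupby("Outlook")["Temperature"].transform("mean")
deviation = golf["Temperature"] - group_mean
print(deviation)
```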


Result: a different index than the original.
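
A general `apply` can return whatever shape the function produces. In this sketch (made-up stand-in frame) each group returns its hottest row, and the result is indexed by group key plus original row label.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "overcast"],
                     "Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# The function returns a one-row frame per group.
hottest = (golf.groupby("Outlook")[["Temperature", "Humidity"]]
               .apply(lambda g: g.nlargest(1, "Temperature")))
print(hottest)
```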


#Exploratory Data Analysis with Pandas

We can use df.plot() to produce simple graphs; it calls on the more extensive Matplotlib library.

Side-by-side histograms:
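
A sketch using `DataFrame.hist` with a one-row layout. The stand-in frame is made up, and the `Agg` backend is an assumption so the code also runs as a plain script; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for script use
import pandas as pd

golf = pd.DataFrame({"Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# One histogram per numeric column, arranged in a single row.
axes = golf.hist(layout=(1, 2), figsize=(8, 3))
```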


Box plot
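
A sketch with `plot(kind="box")` on a made-up stand-in frame; the `Agg` backend is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for script use
import pandas as pd

golf = pd.DataFrame({"Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# One box per numeric column on shared axes.
ax = golf.plot(kind="box")
```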


Scatterplots for examining bivariate relationships (kind='scatter')
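
A sketch on a made-up stand-in frame; the column names are assumptions and the `Agg` backend is only there so the code runs as a script.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for script use
import pandas as pd

golf = pd.DataFrame({"Temperature": [85, 80, 83, 70, 68, 64],
                     "Humidity": [85, 90, 78, 96, 80, 63]})

# Scatter plots need explicit x and y columns.
ax = golf.plot(kind="scatter", x="Temperature", y="Humidity")
```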

##Categorical variables: Frequency tables and relative frequency tables

Frequencies:
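
A sketch with `value_counts` on a made-up categorical column; passing `normalize=True` gives the relative frequencies.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "rain"]})

counts = golf["Outlook"].value_counts()                 # frequency table
shares = golf["Outlook"].value_counts(normalize=True)   # relative frequency table
print(counts)
print(shares)
```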

Apply will get you the value counts for multiple columns at once:
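
A sketch on a made-up stand-in frame; `Play` is a hypothetical second categorical column. Categories that do not occur in a column show up as `NaN`.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "rain"],
                     "Play": ["no", "no", "yes", "yes", "yes", "no"]})

# value_counts runs once per column; results are aligned on the category labels.
counts = golf[["Outlook", "Play"]].apply(pd.Series.value_counts)
print(counts)
```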

Contingency Tables: bivariate relationships between two categorical variables
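
A sketch with `pd.crosstab` on a made-up stand-in frame; `Play` is a hypothetical second categorical column.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "rain"],
                     "Play": ["no", "no", "yes", "yes", "yes", "no"]})

# Rows are outlook categories, columns are play categories, cells are counts.
table = pd.crosstab(golf["Outlook"], golf["Play"])
print(table)
```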

Percentage by row:
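
One option is the `normalize="index"` argument of `pd.crosstab`, sketched on made-up data with a hypothetical `Play` column. Each row then sums to 1.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "rain"],
                     "Play": ["no", "no", "yes", "yes", "yes", "no"]})

# Share of each play outcome within every outlook.
by_row = pd.crosstab(golf["Outlook"], golf["Play"], normalize="index")
print(by_row)
```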

Or by column:
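
The column-wise version uses `normalize="columns"`, again sketched on made-up data with a hypothetical `Play` column. Each column then sums to 1.

```python
import pandas as pd

golf = pd.DataFrame({"Outlook": ["sunny", "sunny", "overcast", "rain", "rain", "rain"],
                     "Play": ["no", "no", "yes", "yes", "yes", "no"]})

# Share of each outlook within every play outcome.
by_column = pd.crosstab(golf["Outlook"], golf["Play"], normalize="columns")
print(by_column)
```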