Author: Ming Huang

- Last updated: 02/10/2016
- By: Ming Huang

Pandas
====

## What is Pandas?

A Python library providing data structures and data analysis tools.

# Objectives

- Create and Use Series objects
- Create and Use DataFrame objects
- Create, Reset, and Use Indices
- Join/Merge Dataframes
- Use DataFrame grouping and aggregation
- Read and write data
- Perform high-level EDA using Pandas

## Benefits

- Alternative to Excel or R
- Includes many built-in functions for data transformation, aggregation, and plotting
- Based on Data Frames (think of it like a table) and Series (single column table / time series)
- Great for exploratory work
- Is essentially a wrapper around NumPy arrays

## Not so greats

- Iterating through rows one at a time is generally much slower than using vectorized operations
- Does not scale terribly well

## Documentation:

- http://pandas.pydata.org/pandas-docs/stable/index.html

#### Before we get started, let's import some essential modules

In [None]:
import pandas as pd
import numpy as np
from numpy.random import randn, randint
from random import choice

# Series and Dataframes 

Series and DataFrames are the fundamental data structures in pandas.

## Series

Think of a Pandas Series as a _labeled_ one-dimensional vector. It need not be a numeric vector: it can contain arbitrary Python objects.

#### You can create a series from lists, tuples, or ranges:

In [None]:
pd.Series(range(10))

#### You can also create a series from dictionaries:

In [None]:
pd.Series({'01_setup': 'This', '02_pause': 'is', '03_epicness': 'Sparta!'})
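
The "labeled vector" idea can be sketched as follows. This is a minimal illustration with made-up values and labels (`a`, `b`, `c` are assumptions, not data from this notebook): the same element can be reached by its label or by its position, and a Series may also hold mixed Python objects.

```python
import pandas as pd

# Hypothetical data: three values with explicit string labels.
s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

by_label = s['b']          # label-based lookup
by_position = s.iloc[1]    # position-based lookup of the same element
labels = list(s.index)     # the labels themselves

# A Series can hold arbitrary Python objects; its dtype is then `object`.
mixed = pd.Series(['text', 42, None])
```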

## DataFrames

Data frames extend the concept of a Series to a table-like data structure.  Each column in a data frame is a series.

#### You can create a dataframe using a list:

In [None]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

#### Or a dictionary:

In [None]:
some_series = pd.Series(randn(5))

df = pd.DataFrame({'Col1': some_series, 'Col2': randn(5)})

df

#### You can extract a series from a dataframe by indexing the column name

In [None]:
df['Col1']

#### You can also extract series from each row

In [None]:
df.iloc[0]
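
As a small sketch of how rows and columns both come back as Series, the example below uses a hypothetical two-row frame (`df_small` is an illustrative name). A column Series is indexed by the row labels, while a row Series is indexed by the column names.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration.
df_small = pd.DataFrame({'Col1': [1.0, 2.0], 'Col2': [3.0, 4.0]})

col = df_small['Col1']    # a column: indexed by the row labels 0 and 1
row = df_small.iloc[0]    # a row: indexed by the column names

col_name = col.name       # the column's name travels with the Series
row_labels = list(row.index)
```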

# Indices

Dataframes and Series are assigned indices from 0 to n - 1 by default. These indices allow for quick lookup of data, which simplifies several operations we may need when exploring the data.

Some key things to remember about indices:

- They do not have to be unique
- They can be manually assigned
- They can be reset to the default of 0 to n - 1
- The column headers of a DataFrame are themselves an index: they label the columns the same way the row index labels the rows
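
The points above can be sketched on a tiny frame. The data and labels here (`df_dup`, `x`, `y`, `r1`...) are hypothetical and chosen only to show a non-unique index, manual assignment, and the fact that column headers are stored as an index object too.

```python
import pandas as pd

# Hypothetical frame whose row index repeats the label 'x'.
df_dup = pd.DataFrame({'value': [10, 20, 30]}, index=['x', 'x', 'y'])

is_unique = df_dup.index.is_unique   # False: 'x' appears twice
both_x = df_dup.loc['x']             # a lookup on 'x' returns both rows

# Indices can be assigned manually after creation.
df_dup.index = ['r1', 'r2', 'r3']

# The column headers are stored in an Index object as well.
columns_are_index = isinstance(df_dup.columns, pd.Index)
```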

In [None]:
def create_fake_data(n_samples=10):
    job_lst = ['Cop', 'Fireman', 'Doctor', 'Soldier', 'Accountant', 'Bird']
    
    data = {'Name': ['User{0}'.format(i) for i in range(1, n_samples + 1)],
            'Age': randint(18, 70, n_samples),
            'Donation': randint(1, 20000, n_samples),
            'Job': [choice(job_lst) for i in range(n_samples)]}

    return data

## Ways to Set Indices

In [None]:
df = pd.DataFrame(create_fake_data())

#### You can set indices while creating the dataframe

In [None]:
dateindex = pd.date_range('2016-01-01', '2016-01-31')
df = pd.DataFrame(create_fake_data(), index=dateindex[:10])
df

#### Or after you created the dataframe

In [None]:
df = pd.DataFrame(create_fake_data())
df.index = dateindex[:10]
df

#### You can also make an existing column the index

In [None]:
df = df.set_index('Name')
df

#### You can also reset the index to the default of 0 to n - 1

In [None]:
df = df.reset_index()
df
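
One detail worth illustrating: `reset_index` moves the old index back into a regular column by default, while `drop=True` discards it. The sketch below uses a hypothetical two-person frame (`people` is an illustrative name).

```python
import pandas as pd

# Hypothetical frame for illustration.
people = pd.DataFrame({'Name': ['User1', 'User2'], 'Age': [30, 41]})

indexed = people.set_index('Name')         # 'Name' becomes the index
restored = indexed.reset_index()           # 'Name' returns as a column
dropped = indexed.reset_index(drop=True)   # 'Name' is thrown away
```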

## Using Indices

In [None]:
df = pd.DataFrame(create_fake_data(), index=[1, 1, 2, 3, 4, 5, 6, 7, 8, 9])

df

#### Grab specific columns

In [None]:
df[['Age', 'Donation']]

#### loc returns records where the label matches the given index

In [None]:
df.loc[1]

#### iloc returns records where the location index matches the given index

In [None]:
df.iloc[1]

#### ix was a general indexer that tried label lookup first and fell back to positional lookup; it was deprecated in pandas 0.20 and removed in 1.0, so use loc or iloc instead

In [None]:
df.loc[1]

#### at is a label indexer that returns a single scalar, and is much faster than loc for that purpose

In [None]:
# Label 2 is unique in this index, so the result is a single scalar
df.at[2, 'Age']

#### iat is a positional indexer that returns a single scalar, and is much faster than iloc for that purpose

In [None]:
df.iat[1, 0]
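
The difference between label-based and position-based access is easiest to see when the two disagree. The following sketch assumes a hypothetical frame (`df_idx`) whose index repeats the label 1, similar in spirit to the frame above.

```python
import pandas as pd

# Hypothetical frame: the label 1 appears twice in the index.
df_idx = pd.DataFrame({'Age': [25, 32, 47]}, index=[1, 1, 2])

by_label = df_idx.loc[1]            # every row labeled 1: a two-row DataFrame
by_position = df_idx.iloc[1]        # the second row only: a Series
scalar_label = df_idx.at[2, 'Age']  # scalar lookup by row label and column name
scalar_pos = df_idx.iat[1, 0]       # scalar lookup by row and column position
```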

## Index alignment

#### Combining series

In [None]:
index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']
series1 = pd.Series(randn(5), index=index1)
series2 = pd.Series(randn(5), index=index2)

In [None]:
series1 + series2
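
Arithmetic between two Series matches values by label, not by position, and labels present in only one operand produce `NaN`. A minimal sketch with fixed, made-up numbers (so the result is easy to check by hand) is shown below; `Series.add` with `fill_value` is one way to treat missing labels as zero instead.

```python
import pandas as pd

# Hypothetical values; the labels only partly overlap.
s1 = pd.Series([1.0, 2.0, 3.0], index=['Alabama', 'Indiana', 'Montana'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['Montana', 'Alabama', 'New York'])

total = s1 + s2                      # aligned by label; unmatched labels become NaN
filled = s1.add(s2, fill_value=0)    # missing side treated as 0 instead
```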

## Concatenating dataframes

`pd.concat` lets you combine dataframes by stacking their records (rows) or by placing their columns side by side.

In [None]:
df1 = pd.DataFrame({'Col1': randn(5), 'Col2': randn(5), 'Col3': randn(5)}, index=index1)
df2 = pd.DataFrame({'Col1': randn(5), 'Col2': randn(5), 'Col4': randn(5)}, index=index2)

#### Vertically

In [None]:
pd.concat([df1, df2], join='outer', axis=0)

#### Horizontally

In [None]:
pd.concat([df1, df2], join='outer', axis=1)
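
The `join` argument controls what happens to labels on the other axis that are not shared by every frame. Here is a small sketch with hypothetical frames (`left` and `right` are illustrative names) comparing `'outer'` and `'inner'`.

```python
import pandas as pd

# Hypothetical frames that share one column and one row label.
left = pd.DataFrame({'Col1': [1, 2], 'Col2': [3, 4]}, index=['Alabama', 'Indiana'])
right = pd.DataFrame({'Col1': [5, 6], 'Col4': [7, 8]}, index=['Alabama', 'Montana'])

# Stacking rows: 'outer' keeps the union of columns, 'inner' only shared ones.
stacked_outer = pd.concat([left, right], join='outer', axis=0)
stacked_inner = pd.concat([left, right], join='inner', axis=0)

# Placing columns side by side: 'inner' keeps only the shared row labels.
side_inner = pd.concat([left, right], join='inner', axis=1)
```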

## Joins

You can use joins similar to SQL joins.

In [None]:
df1.merge(df2, how='left', left_index=True, right_index=True)
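
To make the SQL analogy concrete, the sketch below joins two hypothetical frames on their indices with each `how` option and records the number of rows in the result. The helper `join_on_index` is an illustrative name, not part of pandas.

```python
import pandas as pd

# Hypothetical frames: only 'Alabama' appears in both indices.
left = pd.DataFrame({'a': [1, 2]}, index=['Alabama', 'Indiana'])
right = pd.DataFrame({'b': [3, 4]}, index=['Alabama', 'Montana'])

def join_on_index(how):
    """Merge the two frames on their indices with the given join type."""
    return left.merge(right, how=how, left_index=True, right_index=True)

# Number of rows produced by each join type.
sizes = {how: len(join_on_index(how)) for how in ['left', 'right', 'inner', 'outer']}
```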

# Filtering

Select subsets of the dataframe by placing conditional (boolean) requirements on columns.

In [None]:
df[(df['Age'] > 30) & (df['Donation'] > 10000)]
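
A filter like this is built from boolean Series: each comparison yields one `True`/`False` per row, and `&` / `|` combine them element-wise. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` in Python. A minimal sketch on hypothetical data (`donors` is an illustrative name):

```python
import pandas as pd

# Hypothetical data for illustration.
donors = pd.DataFrame({'Age': [25, 35, 45, 55],
                       'Donation': [500, 15000, 8000, 12000],
                       'Job': ['Cop', 'Doctor', 'Bird', 'Doctor']})

# Each comparison produces a boolean Series; & combines them row by row.
mask = (donors['Age'] > 30) & (donors['Donation'] > 10000)
selected = donors[mask]

# | means "or"; isin tests membership in a list of values.
either = donors[(donors['Age'] < 30) | donors['Job'].isin(['Bird'])]
```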

# Applying Functions

In [None]:
# Floor division (//) rounds each age down to the nearest multiple of five
bracket_of_five = lambda x: x // 5 * 5

df['Age'].apply(bracket_of_five)
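
Rounding down to a multiple of five depends on floor division (`//`); under Python 3, a plain `/` would return the value unchanged as a float. For simple arithmetic like this, the same result can usually be obtained without `apply` by operating on the whole Series at once. A sketch on made-up ages:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([18, 23, 37, 69])

# Element by element, through apply.
via_apply = ages.apply(lambda x: x // 5 * 5)

# The same computation applied to the whole Series at once.
vectorized = ages // 5 * 5
```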

# Split, Apply, Combine

In [None]:
df = pd.DataFrame(create_fake_data())

#### Grouping

Use groupby to split the data into chunks.

In [None]:
groups = df.groupby('Job')

In [None]:
for name, group in groups:
    print(name)
    print(group)
    print('\n')

#### Aggregate

Use aggregate to apply an aggregation function to each group and receive a compressed dataframe with one row per group.

In [None]:
groups.aggregate(sum)

#### Transform

Use transform to apply an aggregation function to each group and receive a non-compressed dataframe with one row per original record.

In [None]:
groups.transform(sum)
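
The contrast between the two is mostly about the shape of the result. The sketch below, on hypothetical data (`gifts` is an illustrative name), shows that `agg` returns one value per group while `transform` returns one value per original row, which makes it convenient for computing each row's share of its group total.

```python
import pandas as pd

# Hypothetical donations by job.
gifts = pd.DataFrame({'Job': ['Cop', 'Cop', 'Doctor'],
                      'Donation': [100, 200, 50]})

grouped = gifts.groupby('Job')['Donation']

per_group = grouped.agg('sum')        # one value per group
per_row = grouped.transform('sum')    # group total repeated for every row

# Because per_row lines up with the original rows, it can be used directly.
gifts['Share'] = gifts['Donation'] / per_row
```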

#### Apply

You can also use apply on a groupby object to run a dataframe function on each group.

In [None]:
groups.apply(sum)

In [None]:
groups['Donation'].apply(lambda x: x.rolling(2).sum())

In [None]:
groups.apply(lambda x: x.head(1))

# Resampling with Date Indices

Date indices allow for date range aggregation through the use of resample.

In [None]:
dateindex = pd.date_range('2016-01-01', '2016-01-31')
df = pd.DataFrame(create_fake_data(), index=dateindex[:10])
df

In [None]:
# Average only the numeric columns for each week
df[['Age', 'Donation']].resample('W').mean()
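
To see what weekly resampling does to the rows, here is a small sketch with deterministic, made-up values, written in the method-chaining form `resample(...).sum()`. It assumes the default meaning of `'W'`, which is weeks ending on Sunday; 2016-01-01 was a Friday, so the first bin holds only three days.

```python
import pandas as pd

# Hypothetical daily data: ten days starting Friday 2016-01-01.
days = pd.date_range('2016-01-01', periods=10)
daily = pd.DataFrame({'Donation': list(range(1, 11))}, index=days)

# 'W' groups rows into weeks ending on Sunday and labels each bin by that Sunday.
weekly = daily.resample('W').sum()
```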

# Reading/Loading Data

#### Read from csv

In [None]:
pd.read_csv('playgolf.csv', delimiter='|')

#### Read from postgres

In [None]:
import psycopg2 as pg2

conn = pg2.connect(dbname='readychef', user='minghuang', host='localhost')

pd.read_sql('select * from events limit 10;', conn)

In [None]:
conn.close()

# Writing Data

In [None]:
df = pd.read_csv('playgolf.csv', delimiter='|')

#### Write to csv

In [None]:
df.to_csv('new_playgolf.csv')

#### Write to sql

In [None]:
from sqlalchemy import create_engine

engine = create_engine('postgresql://minghuang@localhost:5432/readychef')

In [None]:
df.to_sql('playgolf', engine)

# EDA

Pandas is excellent for doing Exploratory Data Analysis (EDA).  Here's a brief example.

#### Read in data

In [None]:
%matplotlib inline

df = pd.read_csv('playgolf.csv', delimiter='|')

#### Let's just look at some of the data

In [None]:
df.head(5)

#### Let's get a sense of the distribution

In [None]:
df.describe().T

#### What about the column types?

In [None]:
df.info()

#### Use crosstab to quickly check how often each combination of weather outlook and golf result occurs

In [None]:
pd.crosstab(df['Outlook'], df['Result'])
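
For readers without `playgolf.csv` at hand, the sketch below shows what `crosstab` computes on a tiny stand-in frame. The rows and the category values (`'Play'`, `'Stay home'`) are assumptions for illustration, not the contents of the real file.

```python
import pandas as pd

# Hypothetical stand-in for the golf data.
golf = pd.DataFrame({'Outlook': ['sunny', 'sunny', 'rain', 'overcast'],
                     'Result': ['Play', 'Stay home', 'Play', 'Play']})

# Rows are Outlook values, columns are Result values, cells are counts.
counts = pd.crosstab(golf['Outlook'], golf['Result'])
```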

#### Let's roughly see if any two variables may be correlated

In [None]:
_ = pd.plotting.scatter_matrix(df, diagonal='kde')

#### What is the general distribution of each of our features?

In [None]:
_ = df.hist()

#### I would like to use my date feature like a date, so let's change it from a string to a datetime object

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

#### I also don't like the string representation of Result, so let's change it to a 0 or 1

In [None]:
df['Play'] = df['Result'].apply(lambda x: 1 if x == 'Play' else 0)

#### Let's look at weekly summaries, so let's make the date the index in preparation

In [None]:
df.set_index('Date', inplace=True)

#### To get the count of sunny, rainy, or overcast days, I need to dummify my data

In [None]:
outlook_dummies = pd.get_dummies(df['Outlook'])

outlook_dummies
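
Dummifying turns one categorical column into one indicator column per category, so that summing a column counts how often that category occurred. A minimal sketch on a hypothetical `Outlook` column (the values are made up):

```python
import pandas as pd

# Hypothetical outlook values for four days.
outlook = pd.Series(['sunny', 'rain', 'sunny', 'overcast'], name='Outlook')

# One indicator column per distinct value.
dummies = pd.get_dummies(outlook)

# Summing each indicator column counts the days in that category.
day_counts = dummies.sum()
```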

#### Let's get our transformed data by combining the dummies with the usable data.

In [None]:
new_df = pd.concat([df[['Temperature', 'Humidity', 'Windy', 'Play']], outlook_dummies], axis=1)

#### Now we can look at the weekly summary.

In [None]:
new_df.resample('W').agg({'Temperature': 'mean', 'Humidity': 'mean', 'Windy': 'sum', 'Play': 'sum',
                          'overcast': 'sum', 'rain': 'sum', 'sunny': 'sum'})