Author: Ming Huang

- Last updated: 05/25/2016
- By: Ming Huang

Pandas
====

## Objectives

* How to load data into Pandas Series and DataFrames
* How to extract and use information from Pandas
* How to aggregate data in Pandas
* How to join multiple DataFrames

## What is Pandas?

A Python library providing data structures and data analysis tools.  Think of it as R for Python.

## Benefits

* Includes many built-in functions for data transformation, aggregation, and plotting
* Based on DataFrames (think of them as tables) and Series (a single-column table or a time series)
* Great for exploratory work
* Is essentially a wrapper around NumPy arrays

## Not so greats

* Iterating through rows one at a time is generally much slower than operating on whole columns
* Does not scale well to data that does not fit in memory
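
To make the iteration point concrete, here is a minimal sketch that computes the same total twice: once with a Python-level loop over `iterrows`, and once with a whole-column expression. The `toy` frame and its column names are made up for illustration. Both approaches give the same answer; the column-wise form is usually the faster one, though actual timings depend on your machine and data.

```python
import pandas as pd

# Hypothetical toy data, not part of the lesson's dataset
toy = pd.DataFrame({'a': range(1000), 'b': range(1000, 2000)})

# Row-by-row iteration: every row is handed to Python as a Series
loop_total = 0
for _, row in toy.iterrows():
    loop_total += row['a'] + row['b']

# Column-wise: the addition and the sum run over whole columns at once
vector_total = (toy['a'] + toy['b']).sum()
```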

## Documentation:

* http://pandas.pydata.org/pandas-docs/stable/index.html

#### Before we get started, let's import some essential modules into our namespace

In [None]:
import pandas as pd
import numpy as np
# this allows plots to show inside the IPython Notebook
%matplotlib inline

## Loading Data into Pandas

### DataFrames

You can think of DataFrames as labeled and indexed matrices, so we can create DataFrames from NumPy arrays or lists of lists by providing the column labels and row indices.

In [None]:
pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index=['foo', 'bar'])
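
The cell above uses a list of lists. As a small sketch of the NumPy route, the same values can be passed in as an array; the variable names here are only for illustration.

```python
import numpy as np
import pandas as pd

# A 2x3 NumPy array holding the same values as the list of lists
arr = np.array([[1, 2, 3], [4, 5, 6]])

# Column labels and row indices are attached the same way
arr_df = pd.DataFrame(arr, columns=['a', 'b', 'c'], index=['foo', 'bar'])

# .values gives the underlying data back as a NumPy array
round_trip = arr_df.values
```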

Alternatively, you can think of DataFrames as a combination of column vectors, thus we can create DataFrames from a dictionary of column vectors.  The keys are the labels, and the values are the vectors.

In [None]:
# range replaces Python 2's xrange, which does not exist in Python 3
pd.DataFrame({'Col1': range(5), 'Col2': range(10, 15)})

### Series

If DataFrames are labeled and indexed matrices, then Series are labeled and indexed column vectors.

In [None]:
# range replaces Python 2's xrange, which does not exist in Python 3
pd.Series(range(10), index=range(10, 20), name='Numbers')

If you create a Series using a dictionary, the keys are treated as indices instead.

In [None]:
pd.Series({'Star': 'Wars', 'Is': 'Boring', 'Please': 'Stop'})

You can take out a Series from a DataFrame.

In [None]:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'], index=['foo', 'bar'])
df['a']

... or put a Series into a DataFrame, as long as its index matches the DataFrame's index.

In [None]:
df['d'] = pd.Series([4, 5], index=['foo', 'bar'])
df
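
To see why the matching index matters, here is a sketch of what happens when one label does not line up. The label `'baz'` is hypothetical and deliberately absent from the DataFrame's index.

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                    columns=['a', 'b', 'c'], index=['foo', 'bar'])

# 'baz' is a made-up label that does not exist in demo's index
demo['d'] = pd.Series([4, 5], index=['foo', 'baz'])

# 'foo' matched, so it gets 4; 'bar' had no match, so it becomes NaN
missing = demo['d'].isnull()
```

Values are aligned by label rather than by position, and rows without a matching label are filled with NaN instead of raising an error.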

It might be easier to just put a list/vector into a DataFrame though if it's a completely new column.

In [None]:
df['e'] = [1, 2]
df

### Load data from SQL

Since we spent so much time on SQL, we might as well learn how to use Pandas with SQL.

In [None]:
import psycopg2 as pg2

conn = pg2.connect(dbname='socialmedia', user='minghuang')

df = pd.read_sql('select * from registrations', conn)

#conn.close()
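
If you do not have the Postgres database at hand, the same `read_sql` call can be tried against an in-memory SQLite database. This is only a sketch: the table layout (`userid`, `tmstmp`) and the rows are assumptions made up for the example, not the real `registrations` schema.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the Postgres connection
sqlite_conn = sqlite3.connect(':memory:')
sqlite_conn.execute('CREATE TABLE registrations (userid INTEGER, tmstmp TEXT)')
sqlite_conn.executemany('INSERT INTO registrations VALUES (?, ?)',
                        [(1, '2014-07-15'), (2, '2014-07-16'), (3, '2014-07-16')])
sqlite_conn.commit()

# params fills in the ? placeholder instead of formatting the string by hand
reg_df = pd.read_sql('SELECT * FROM registrations WHERE tmstmp = ?',
                     sqlite_conn, params=('2014-07-16',))

# The DataFrame is fully loaded, so the connection can be closed
sqlite_conn.close()
```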

### Load data from csv

But we will usually be working with data dumps like CSV.

In [None]:
df = pd.read_csv('playgolf.csv', delimiter='|')
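
To experiment without the file, `read_csv` also accepts a file-like object. The sketch below uses a few made-up rows in the same pipe-delimited layout; they are not the contents of `playgolf.csv`.

```python
import io
import pandas as pd

# Made-up rows for illustration only
raw = io.StringIO(
    "Date|Outlook|Temperature|Humidity|Result\n"
    "07-01-2014|sunny|85|85|Don't Play\n"
    "07-02-2014|overcast|83|78|Play\n"
    "07-03-2014|rain|70|96|Play\n"
)

# sep sets the delimiter; parse_dates converts the Date column while loading
sample_df = pd.read_csv(raw, sep='|', parse_dates=['Date'])
```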

## Extracting information from Pandas DataFrames

We can extract individual values by taking the series out of the matrix, then treating it like a list.

In [None]:
df['Temperature'][0]

Fortunately, Pandas has easier ways to look up values by row and column.

In [None]:
# .ix was removed from pandas; .loc does the same label-based lookup
df.loc[0, 'Temperature']
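
Pandas separates label-based lookups (`.loc`) from position-based lookups (`.iloc`). The difference is easiest to see when the index is not just 0, 1, 2, so this sketch uses a made-up frame with string labels.

```python
import pandas as pd

labeled = pd.DataFrame({'Temperature': [85, 80, 83], 'Humidity': [85, 90, 78]},
                       index=['mon', 'tue', 'wed'])

by_label = labeled.loc['tue', 'Temperature']   # row label, column label
by_position = labeled.iloc[1, 0]               # row position, column position

# Label slices include the end label; position slices exclude the end
label_slice = labeled.loc['mon':'tue']
position_slice = labeled.iloc[0:1]
```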

If you slice a DataFrame like a list, the slice only operates on the rows.

In [None]:
short_df = df[:5]
short_df

Interestingly, we can also select rows by passing in a list of True and False values, one per row.

In [None]:
short_df[[True, False, True, False, True]]

We can also create a Boolean Series by applying a comparison to a Series.

In [None]:
df['Temperature'] > 70

So why not combine them!?

In [None]:
df[df['Temperature'] > 70]
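
Several conditions can be combined in the same way. The sketch below uses a small made-up frame; the thresholds are arbitrary.

```python
import pandas as pd

weather = pd.DataFrame({'Temperature': [85, 64, 72, 75],
                        'Humidity': [85, 65, 95, 70]})

# Wrap each comparison in parentheses; & is "and", | is "or", ~ is "not"
warm_and_dry = weather[(weather['Temperature'] > 70) & (weather['Humidity'] < 80)]
cold_or_humid = weather[(weather['Temperature'] <= 70) | (weather['Humidity'] >= 80)]
not_warm = weather[~(weather['Temperature'] > 70)]
```

The parentheses are needed because `&` and `|` bind more tightly than the comparison operators in Python.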

### Transforming data

We can easily create a new Series by combining two existing Series.

In [None]:
df['Temperature'] + df['Humidity']

We can create a new Series by applying a function to an existing Series.

In [None]:
df['Result'].apply(lambda x: 1 if x == 'Play' else 0)

We can also apply a function to each row of the DataFrame by setting axis=1. Inside the function, each row is a Series that we index by column name.

In [None]:
df.apply(lambda x: x['Temperature'] + x['Humidity'], axis=1)

We can also apply a function to each column of the DataFrame by setting axis=0. Inside the function, each column is a Series that we index by row label.

In [None]:
df.apply(lambda x: x[0] + x[0], axis=0)
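
For simple cases, the same results can often be had without `apply`. This is a sketch on a made-up two-row frame, comparing `apply` with plain column operations.

```python
import pandas as pd

games = pd.DataFrame({'Temperature': [85, 64], 'Humidity': [85, 65],
                      'Result': ['Play', "Don't Play"]})

# Row-wise apply and plain column arithmetic give the same totals
applied = games.apply(lambda row: row['Temperature'] + row['Humidity'], axis=1)
vectorized = games['Temperature'] + games['Humidity']

# A comparison plus astype(int) replaces the lambda-based 1/0 encoding
encoded = (games['Result'] == 'Play').astype(int)

# axis=0 hands each column to the function as a Series
col_max = games[['Temperature', 'Humidity']].apply(lambda col: col.max(), axis=0)
```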

### Aggregating data

We can do something like the group by statement in SQL.

In [None]:
groups = df.groupby('Outlook')

Iterating over the groups shows that each one is a tuple of the Outlook value and a DataFrame of the matching rows.

In [None]:
# Each group is a (key, DataFrame) tuple; print is a function in Python 3
for group in list(groups):
    print(group)
    print('\n')

We can then apply an aggregation function to complete the process.

In [None]:
# The string name 'sum' selects pandas' own group-wise sum
groups.aggregate('sum')
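
Different columns can also get different aggregations, much like listing several aggregate expressions in a SQL `SELECT`. The data below is made up for the sketch.

```python
import pandas as pd

golf = pd.DataFrame({'Outlook': ['sunny', 'sunny', 'rain', 'rain', 'overcast'],
                     'Temperature': [85, 80, 70, 68, 83],
                     'Humidity': [85, 90, 96, 80, 78]})

# A dict maps each column to the aggregation(s) to run on it
summary = golf.groupby('Outlook').agg({'Temperature': 'mean',
                                       'Humidity': ['min', 'max']})

# size() counts the rows in each group, like COUNT(*) in SQL
counts = golf.groupby('Outlook').size()
```

The resulting `summary` has two-level column labels such as `('Humidity', 'max')`.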

### Joining DataFrames

We can join DataFrames in much the same way that we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

Let's create a fake DataFrame to join with first.

In [None]:
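# Copy the slice so that editing df2 does not touch (or warn about) df
df2 = df[['Date', 'Temperature']].copy()
# .ix was removed from pandas; .loc slices by label and includes row 3
df2.loc[:3, 'Date'] = ['07-15-2014', '07-16-2014', '07-17-2014', '07-18-2014']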
df2

We can do joins using the merge command.

In [None]:
df.merge(df2, how='inner', on='Date')

We can also join on indices instead of on columns.

In [None]:
df.merge(df2, how='inner', left_index=True, right_index=True)
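
The effect of the `how` argument is easiest to see on two small frames that only partly overlap. The dates and values below are made up for the sketch.

```python
import pandas as pd

left = pd.DataFrame({'Date': ['07-01-2014', '07-02-2014', '07-03-2014'],
                     'Temperature': [85, 80, 83]})
right = pd.DataFrame({'Date': ['07-02-2014', '07-03-2014', '07-04-2014'],
                      'Humidity': [90, 78, 96]})

# inner keeps only the dates present in both frames
inner = left.merge(right, how='inner', on='Date')

# left keeps every date from the left frame; missing Humidity becomes NaN
left_join = left.merge(right, how='left', on='Date')

# outer keeps all dates; indicator=True adds a _merge column with the row's origin
outer = left.merge(right, how='outer', on='Date', indicator=True)
```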

## Concatenating DataFrames

This is the equivalent of Unions in SQL, but a little more flexible.

In [None]:
df1 = pd.DataFrame({'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
df2 = pd.DataFrame({'Col1': range(5), 'Col2': range(5), 'Col4': range(5)}, index=range(5, 10))

#### Vertically

This is like a Union All.

In [None]:
pd.concat([df1, df2], axis=0)

#### Horizontally

This is pretty much a simple join on indices.  While concat is capable of doing joins, it is far less flexible.

In [None]:
pd.concat([df1, df2], join='outer', axis=1)
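
As a sketch of the extra flexibility, the `join` and `ignore_index` arguments control which columns survive and how the rows are labeled. The frames mirror the two defined above but use different names so nothing is overwritten.

```python
import pandas as pd

top = pd.DataFrame({'Col1': range(5), 'Col2': range(5), 'Col3': range(5)})
bottom = pd.DataFrame({'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
                      index=range(5, 10))

# The default join='outer' keeps every column and fills the gaps with NaN
stacked_outer = pd.concat([top, bottom], axis=0)

# join='inner' keeps only the columns both frames share
stacked_inner = pd.concat([top, bottom], axis=0, join='inner')

# ignore_index=True discards the old row labels and renumbers from 0
renumbered = pd.concat([top, top], axis=0, ignore_index=True)
```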

# Some extra useful stuff

The info method is useful for checking column types and quickly seeing whether any column has missing values (NaN).

In [None]:
df.info()

The describe method will give you a quick sense of the quartiles and distribution.  I often transpose these to make it easier to read.

In [None]:
df.describe().T

Sometimes we want to loop through columns or indices. We can retrieve them through attributes of the DataFrame.

In [None]:
df.columns

In [None]:
df.index

Pandas' crosstab function will allow us to quickly take a look at the frequency count between two columns.

In [None]:
pd.crosstab(df['Outlook'], df['Result'])
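
Counts can also be turned into proportions. This sketch uses made-up rows and the `normalize` argument of `crosstab`.

```python
import pandas as pd

golf = pd.DataFrame({'Outlook': ['sunny', 'sunny', 'rain', 'overcast'],
                     'Result': ["Don't Play", 'Play', 'Play', 'Play']})

# Raw frequency counts for each Outlook/Result pair
counts = pd.crosstab(golf['Outlook'], golf['Result'])

# normalize='index' turns each row into proportions that sum to 1
shares = pd.crosstab(golf['Outlook'], golf['Result'], normalize='index')
```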

We can turn strings of dates into datetime types by using Pandas' to_datetime function.

In [None]:
df['Date'] = df['Date'].apply(lambda x: pd.to_datetime(x))

We can also set the index to be an existing column.

In [None]:
new_df = df.set_index('Date')

If we have an index of datetime types, we can use resample to quickly look at time-based aggregations.

In [None]:
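# The how= argument was removed; call the aggregation on the resampler instead.
# numeric_only=True skips text columns such as Outlook.
new_df.resample('W').mean(numeric_only=True)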
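
To see what weekly resampling does, here is a sketch on made-up daily readings spanning exactly two Monday-to-Sunday weeks. In recent pandas versions the aggregation is a method call on the resampler.

```python
import pandas as pd

# Hypothetical daily readings; 2014-07-07 is a Monday
days = pd.date_range('2014-07-07', periods=14, freq='D')
daily = pd.DataFrame({'Temperature': range(70, 84)}, index=days)

# 'W' bins end on Sunday by default and are labeled with that Sunday
weekly = daily.resample('W').mean()
```

Each of the two output rows is the mean of seven daily values, labeled with the Sunday that closes the week.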

We can quickly draw scatter matrices.

In [None]:
# scatter_matrix now lives in the pandas.plotting module
pd.plotting.scatter_matrix(new_df)

Or even histograms.

In [None]:
df.hist()

# Writing Data

We can write data into csv (useful for Kaggle).

In [None]:
df.to_csv('new_playgolf.csv', index=False)

We can also pickle our data.

In [None]:
pd.to_pickle(df, 'playgolf.pkl')
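
A quick way to check what actually gets written is to round-trip through an in-memory buffer instead of a file. The frame below is made up for the sketch.

```python
import io
import pandas as pd

scores = pd.DataFrame({'Outlook': ['sunny', 'rain'], 'Temperature': [85, 70]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
scores.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the CSV back recovers the same columns and values
buffer.seek(0)
restored = pd.read_csv(buffer)
```

With `index=False` the row labels are left out of the file, so the first line of `csv_text` contains only the column names.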

# EDA Practice

Find 3 insights about the data!