# Pandas

## What is Pandas?
A Python library providing data structures and data analysis tools.

## Why
- Alternative to Excel or R
- Built around DataFrames (think of a table) and Series (a single-column table or a time series)

## Learning Pandas
* Almost anything you want to do is already a built-in function in Pandas.
* Before you write your own function to operate on a Pandas object, scour the Pandas docs and Stack Overflow.
* http://pandas.pydata.org/pandas-docs/stable/index.html

# Lecture Objectives

- Create/Understand Series objects
- Create/Understand DataFrame objects
- Create and destroy new columns, apply functions to rows and columns
- Join/Merge DataFrames
- Use DataFrame grouping and aggregation

### Standard Imports

In [None]:
# By convention import pandas like:
import pandas as pd
import numpy as np

# For fake data.
from numpy.random import randn

# Series

Think of a Pandas Series as a _labeled_ one-dimensional vector. It need not be numeric; it can contain arbitrary Python objects.

Integer valued series:

In [None]:
pd.Series(range(10))

Real valued series:

In [None]:
pd.Series(randn(10))

String valued series:

In [None]:
pd.Series(list('Hello'*5))

# Indexes

Notice how each series has an index (in this case a relatively meaningless default index). Pandas can make great use of informative indexes. An index works much like a dictionary key: it allows fast lookups of the data associated with each label, which helps optimize many operations.
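
The dictionary analogy can be sketched with a small labelled Series. The values below are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical scores, labelled by state (illustrative values only).
scores = pd.Series([1.5, 2.5, 3.5], index=['California', 'Alabama', 'Indiana'])

# Look up a single value by its label, much like a dictionary key.
ca = scores['California']

# Select several labels at once; the result is a new Series in the requested order.
subset = scores.loc[['Indiana', 'Alabama']]

# Membership tests check the index labels, not the values.
has_montana = 'Montana' in scores
```

Note that `in` tests the labels, which is the same behaviour as a dictionary.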

In [None]:
# Sample index - each data point is labelled with a state.
index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

Labelled numeric series:

In [None]:
series1 = pd.Series(randn(5), index=index1)
series2 = pd.Series(randn(5), index=index2)
print(series1, series2, sep='\n\n')

The index is used to line up arithmetic operations.

In [None]:
series1 * series2
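
With random data the alignment is hard to see, so here is a minimal sketch with fixed, made-up values. Labels present in only one Series produce `NaN`; the `fill_value` argument of `Series.mul` is one way to supply a default instead.

```python
import numpy as np
import pandas as pd

# Two small Series with partially overlapping labels (illustrative values).
left = pd.Series([1.0, 2.0, 3.0], index=['Alabama', 'Indiana', 'Montana'])
right = pd.Series([10.0, 20.0, 30.0], index=['Montana', 'Alabama', 'New York'])

# Values are matched by label, not by position.
product = left * right

# Labels missing from one side are treated as 1.0 before multiplying.
filled = left.mul(right, fill_value=1.0)
```

Here `product['Alabama']` is `1.0 * 20.0`, while `product['Indiana']` is `NaN` because `right` has no such label.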

Aggregation by index labels is easy (and optimized)

In [None]:
long_index = index1*3
long_series = pd.Series(randn(15), index=long_index)
print(long_series)

In [None]:
long_series.groupby(level=0).mean()
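
The same pattern on fixed, made-up values makes the result easy to verify by hand. `level=0` groups by the (only) index level, so repeated labels are pooled together.

```python
import pandas as pd

# Repeated labels in the index (illustrative values).
scores = pd.Series(
    [1.0, 3.0, 10.0, 20.0, 5.0],
    index=['Alabama', 'Alabama', 'Montana', 'Montana', 'Indiana'],
)

# Average all values that share an index label.
means = scores.groupby(level=0).mean()

# Count how many values each label contributed.
counts = scores.groupby(level=0).size()
```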

Create a series indexed by dates

In [None]:
dt_index = pd.date_range(start='2015-01-01', end='2015-02-01', freq='D')
dt_index

In [None]:
dt_series = pd.Series(randn(len(dt_index)), index=dt_index)
dt_series

Resample by week

In [None]:
dt_series.resample('W').mean()
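
A small sketch with known values shows how the weekly bins are formed. It assumes a recent pandas version, where the aggregation is chained as a method call after `resample`. By default `'W'` bins end on Sunday and are labelled with that Sunday's date.

```python
import pandas as pd

# Fourteen daily values 1..14, starting Thursday 2015-01-01.
days = pd.date_range(start='2015-01-01', end='2015-01-14', freq='D')
daily = pd.Series(range(1, 15), index=days, dtype='float64')

# Weekly mean; each bin is labelled with the Sunday that closes it.
weekly = daily.resample('W').mean()
```

The first bin covers only Jan 1 through Jan 4, so partial weeks at either end are averaged over fewer days.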

Subsetting the data

In [None]:
dt_series[dt_series.index > '2015-01-15']

# DataFrames
DataFrames extend the concept of a Series to table-like data.

From a dictionary of Series or arrays

In [None]:
s1 = pd.Series(randn(10), index=dt_index[:10])
s2 = randn(10)
pd.DataFrame({'Col1': s1, 'Col2': s2}, index=dt_index[:10])

In [None]:
df = pd.DataFrame(randn(10, 5), index=dt_index[:10], columns=list('abcde'))
df

DataFrames can be indexed (selected) by label, by integer position (avoid if possible), or by boolean mask.

Using labels

In [None]:
df[['a', 'b']]

Using the index (slicing rows by label)

In [None]:
df[:'2015-01-05']

Using booleans

In [None]:
cond1 = (df.a > 0) & (df.b < 0)
cond2 = (df.c > 1)
df[cond1 | cond2]
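
The three selection styles can also be written with the explicit `.loc` (label) and `.iloc` (position) accessors, which avoid ambiguity about whether rows or columns are meant. A minimal sketch on a small made-up frame:

```python
import pandas as pd

# Small frame with a daily DatetimeIndex (illustrative values).
frame = pd.DataFrame(
    {'a': [1, -2, 3], 'b': [-1, 5, -6]},
    index=pd.date_range('2015-01-01', periods=3, freq='D'),
)

# By label: row label first, then column label.
by_label = frame.loc[pd.Timestamp('2015-01-02'), 'a']

# By integer position: first row, second column.
by_position = frame.iloc[0, 1]

# By boolean mask, selecting rows and columns in one step.
mask = (frame['a'] > 0) & (frame['b'] < 0)
masked = frame.loc[mask, ['a']]
```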

Each column is a series

In [None]:
print(df.a)
print(type(df.a))

Each row is a series

In [None]:
print(df.loc['2015-01-01'])
print(type(df.loc['2015-01-01']))

Every column shares the same index

In [None]:
print(df.a.index)
print(df.b.index)

The index for each row is the column headers

In [None]:
df.loc['2015-01-01'].index

# DataFrame basic operations

Load data from a delimited file.

In [None]:
file_path = 'playgolf.csv'
df = pd.read_csv(file_path, delimiter=',')
df.head(5)

In [None]:
df.columns

Create a new column.

In [None]:
df['new'] = df.Temperature + df.Humidity
df.head(5)

Drop a column

In [None]:
df = df.drop('new', axis=1)
df

Delete a row

In [None]:
df = df.drop(13, axis=0)
df

Add a row

In [None]:
new_row = pd.DataFrame([['07-14-2014', 'foggy', 62, 90, False, 'Play']], index=[13], columns=df.columns)

In [None]:
pd.concat([df, new_row])

# Applying functions

Using existing functions

In [None]:
# Restrict to numeric columns; recent pandas raises on string columns otherwise.
df.mean(numeric_only=True)

Creating and using an arbitrary function

In [None]:
fun = lambda x: x**2
df['Temp2'] = df.Temperature.apply(fun)
df.head()

Using multiple values on each row

In [None]:
fun2 = lambda x, y: x + y
df.apply(lambda x: fun2(x.Temperature, x.Humidity), axis=1)
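
Row-wise `apply` calls a Python function once per row, which can be slow on large frames. When the operation is plain arithmetic, the same result can usually be obtained by operating on whole columns. A sketch on a made-up frame that borrows the lecture's column names:

```python
import pandas as pd

# Illustrative values only; not the contents of playgolf.csv.
weather = pd.DataFrame({'Temperature': [85, 80, 83], 'Humidity': [85, 90, 78]})

# Row-wise: the lambda receives each row as a Series.
row_wise = weather.apply(lambda row: row['Temperature'] + row['Humidity'], axis=1)

# Column-wise: one vectorized addition over entire columns.
vectorized = weather['Temperature'] + weather['Humidity']

# Both approaches give the same values.
same = bool((row_wise == vectorized).all())
```

Reserve `apply(..., axis=1)` for logic that cannot be expressed with column operations.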

# Summarizing a DataFrame

In [None]:
df.describe()

In [None]:
df.info()

# Index manipulation

Set index

In [None]:
df = df.set_index(['Date'])

In [None]:
df

Reset index

In [None]:
df = df.reset_index()
df
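
A minimal sketch of the round trip on a made-up frame: `set_index` moves a column into the index so rows can be looked up by its values, and `reset_index` moves it back to an ordinary column.

```python
import pandas as pd

# Illustrative values only.
frame = pd.DataFrame({'Date': ['07-01-2014', '07-02-2014'], 'Temperature': [85, 80]})

# 'Date' becomes the row labels and is no longer a column.
indexed = frame.set_index('Date')
temp = indexed.loc['07-02-2014', 'Temperature']

# 'Date' returns as a regular column with a default integer index.
restored = indexed.reset_index()
```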

# Grouping (split-apply-combine)

Get averages for each outlook

In [None]:
# Average only the numeric columns within each outlook.
df.groupby('Outlook').mean(numeric_only=True).reset_index()

Initialize a groupby object and iterate through the groupings

In [None]:
grouped = df.groupby(['Windy', 'Result'])
for name, group in grouped:
    print(name)
    print(group, '\n')

Get the sum for each group

In [None]:
grouped[['Temperature', 'Humidity']].aggregate('sum')
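
Different columns can receive different aggregations in a single call. The sketch below uses named aggregation on a made-up frame; the output column names `mean_temp` and `max_humidity` are hypothetical choices for this example.

```python
import pandas as pd

# Illustrative values only.
data = pd.DataFrame({
    'Windy': [False, False, True],
    'Temperature': [60, 80, 70],
    'Humidity': [90, 70, 80],
})

# Each keyword names an output column and maps to (input column, function).
summary = data.groupby('Windy').agg(
    mean_temp=('Temperature', 'mean'),
    max_humidity=('Humidity', 'max'),
)
```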

Normalize by the group each date belonged in

In [None]:
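# Select the numeric columns first; the arithmetic is undefined for strings.
grouped[['Temperature', 'Humidity']].transform(lambda x: (x - x.mean()) / x.std())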
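
Unlike `aggregate`, `transform` returns one value per original row, so the result lines up with the input. A sketch with made-up values that are easy to check by hand (`std` uses the sample standard deviation):

```python
import pandas as pd

# Illustrative values only.
data = pd.DataFrame({
    'Windy': [False, False, True, True],
    'Temperature': [60.0, 80.0, 70.0, 74.0],
})

# Standardize each temperature against its own group's mean and std.
zscores = data.groupby('Windy')['Temperature'].transform(
    lambda x: (x - x.mean()) / x.std()
)
```

Each group has two values here, so every z-score is plus or minus `1 / sqrt(2)`.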

Summarize by each group

In [None]:
grouped.describe().T

# Cross Comparison

In [None]:
pd.crosstab(df.Outlook, df.Result)
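
`pd.crosstab` counts how often each pair of values occurs together. A sketch on made-up data; passing `margins=True` adds row and column totals under the label `All`.

```python
import pandas as pd

# Illustrative values only; not the contents of playgolf.csv.
outlook = pd.Series(['sunny', 'sunny', 'rainy', 'rainy', 'rainy'], name='Outlook')
result = pd.Series(['Play', "Don't Play", 'Play', 'Play', "Don't Play"], name='Result')

# Frequency table: rows are outlooks, columns are results.
table = pd.crosstab(outlook, result)

# Same table with totals appended.
with_totals = pd.crosstab(outlook, result, margins=True)
```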

# Using pandas with SQL

In [None]:
import psycopg2 as pg2

In [None]:
conn = pg2.connect(dbname='golf', user='clayton.schupp', host='localhost')

In [None]:
query = '''
        SELECT *
        FROM play_golf
        ;
        '''

In [None]:
pd.read_sql(query, conn).head(10)

In [None]:
conn.close()
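
The cells above need a running PostgreSQL server. To try the same `pd.read_sql` pattern without one, the sketch below swaps in Python's built-in `sqlite3` module and an in-memory database. The table contents and the lowercase column names are made up for this example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the PostgreSQL 'golf' database.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE play_golf (outlook TEXT, temperature INTEGER)')
conn.executemany(
    'INSERT INTO play_golf VALUES (?, ?)',
    [('sunny', 85), ('rainy', 70)],
)
conn.commit()

try:
    # Pass values through params rather than formatting them into the query string.
    warm = pd.read_sql(
        'SELECT * FROM play_golf WHERE temperature > ?', conn, params=(75,)
    )
finally:
    # Always release the connection, even if the query fails.
    conn.close()
```

The placeholder syntax depends on the driver: `sqlite3` uses `?`, while `psycopg2` uses `%s`.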