#Objectives

- Explain the relationship of DataFrame and Series
- Set and reset indexes, and use index alignment
- Use iloc, loc, at, and iat appropriately
- Create and destroy columns
- Use groupby
- Read and write data to pandas
- Plot with pandas

#Pandas

##What is Pandas?
A Python library providing data structures and data analysis tools.

##Why
- Alternative to Excel or R
- Based on DataFrames (think of them as tables) and Series (single-column tables)

##Learning Pandas
* Almost anything you want to do is already a built-in function in Pandas
* Before you decide to write a function to do some kind of operation on a Pandas object, scour the Pandas docs and StackOverflow
* Many methods are shorthand for more complex syntax
* http://pandas.pydata.org/pandas-docs/stable/index.html

### Standard Imports

In [None]:
# By convention import pandas like:
import pandas as pd
import numpy as np

# For fake data.
from numpy.random import randn

# Numpy arrays


In [None]:
a1 = np.array([[1,2,3],[4,5,6]])
a2 = np.array([[7,8,9],[10,11,12]])
print(a1)
print(a2)

Arrays can be combined with each other or scalars or lists:

In [None]:
print(a1 * a2)

In [None]:
print(a1 + 100)

In [None]:
print(a1 * [1, 10, 100])

And accessed and manipulated in various ways.

In [None]:
print(a1[0,1]) # row, column, zero-based

In [None]:
print(a1[::-1, 0:2]) # using slices

In [None]:
print(a1.mean())
print(a2.sum(axis=0))

#Series

Think of a Pandas Series as a _labeled_ one-dimensional vector. It need not be a numeric vector; it can contain arbitrary Python objects.

Integer valued series:

In [None]:
pd.Series(range(10))

Real valued series:

In [None]:
pd.Series(randn(10))

String valued series:

In [None]:
pd.Series(list('Hello world!'))

#Indexes

Notice how each series has an index (in this case a relatively meaningless default index). Pandas can make great use of informative indexes. An index works much like a dictionary key: it allows fast lookups of the data associated with each label, which helps optimize many operations.
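As a minimal sketch of that dictionary-like lookup, the following uses a small hypothetical Series (the labels and values are made up for illustration and are not part of the notebook's data):

```python
import pandas as pd

# Hypothetical data: three values labelled with state names.
population = pd.Series([39.5, 5.0, 6.8],
                       index=['California', 'Alabama', 'Indiana'])

# Look up a single value by its label, much like a dictionary key.
ca = population['California']
print(ca)

# Membership tests check the index labels, not the values.
has_montana = 'Montana' in population
print(has_montana)
```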

In [None]:
# Sample index - each data point is labelled with a state.
index1 = ['California', 'Alabama', 'Indiana', 'Montana', 'Kentucky']
index2 = ['Washington', 'Alabama', 'Montana', 'Indiana', 'New York']

Labelled numeric series:

In [None]:
series1 = pd.Series(randn(5), index=index1)
series2 = pd.Series(randn(5), index=index2)
print(series1, '\n'*2, series2)

The index is used to line up arithmetic operations.

In [None]:
series1 * series2
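Because the random values above make the alignment hard to see, here is a small sketch with fixed, hypothetical values. Labels that appear in only one Series produce `NaN`; the `fill_value` argument of `Series.mul` is one way to supply a default for the missing side.

```python
import pandas as pd

# Hypothetical values; only Alabama and Montana appear in both Series.
left = pd.Series([1.0, 2.0, 3.0], index=['Alabama', 'Indiana', 'Montana'])
right = pd.Series([10.0, 20.0, 30.0], index=['Montana', 'Alabama', 'New York'])

# Values are matched by label, not by position.
product = left * right
print(product)

# Treat a label missing from one side as 1.0 instead of producing NaN.
filled = left.mul(right, fill_value=1.0)
print(filled)
```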

Aggregation by index labels is easy (and optimized)

In [None]:
long_index = index1*3
long_series = pd.Series(randn(15), index=long_index)
print(long_series)

In [None]:
long_series.groupby(level=0).mean()

Create a series indexed by dates

In [None]:
dt_index = pd.date_range(start='2016-09-01', end='2016-10-1', freq='D')
dt_series = pd.Series(randn(len(dt_index)), index=dt_index)
dt_series

Resample by week

In [None]:
dt_series.resample('W').mean()
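To make the weekly bins easier to check by hand, the sketch below uses fixed values over two full weeks. The dates and values are hypothetical; it assumes the default behaviour of the `'W'` frequency, where each bin ends on a Sunday and is labelled with that date.

```python
import pandas as pd

# 14 consecutive days starting on Monday 2016-09-05, with values 0..13.
days = pd.date_range(start='2016-09-05', periods=14, freq='D')
values = pd.Series(range(14), index=days)

# Each weekly bin covers Monday through Sunday and is labelled by the Sunday.
weekly = values.resample('W').mean()
print(weekly)
```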

In [None]:
dt_series['2016-09-15']

#DataFrames
Data frames extend the concept of Series to table-like data.

Create a DataFrame from a dictionary of Series or arrays

In [None]:
s1 = pd.Series(randn(10), index=dt_index[:10])
s2 = randn(10)
pd.DataFrame({'Col1': s1, 'Col2': s2}, index=dt_index[:10])

In [None]:
df = pd.DataFrame(randn(10, 5), index=dt_index[:10], columns=list('abcde'))
df

## Indexing

There are many ways to index data

* `.iat[]` (single value, by position)
* `.at[]` (single value, by label)
* `.iloc[]` (single value/list/slice, by position, or boolean array)
* `.loc[]` (single value/list/slice, by label, or boolean array)
* `.ix[]` (like .loc with .iloc as fallback; deprecated in pandas 0.20 and removed in 1.0)
* `[]` (value/list for column, slice/boolean array for row)
* `.` (column labels as attributes)

Documentation at http://pandas.pydata.org/pandas-docs/stable/indexing.html

In [None]:
df.iat[0,0]

In [None]:
df.at[pd.Timestamp('2016-09-01'), 'a']

Using a slice and a list:

In [None]:
df.iloc[0:4,[0,2,4]]

Using a boolean list:

In [None]:
df.loc[[True, True, True, True, False, False, False, False, False, False] ,['a', 'b', 'c']]

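`.ix` could use either label or position, but it was deprecated in pandas 0.20 and removed in 1.0. Chain `.iloc` and `.loc` instead.

Will this include row 5 and column d?

In [None]:
df.iloc[1:5].loc[:, 'a':'d']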
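The answer depends on whether the slice is positional or label-based. As a sketch on a small hypothetical frame with a default integer index: positional slices exclude their end point, while label slices include it.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x5 frame with row labels 0..5 and column labels a..e.
frame = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list('abcde'))

# Positional slicing: the end point is excluded (rows 1-4, columns a-c).
by_position = frame.iloc[1:5, 0:3]

# Label slicing: the end point is included (rows 1-5, columns a-d).
by_label = frame.loc[1:5, 'a':'d']

print(by_position.shape)
print(by_label.shape)
```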

Using `[]` alone to select rows...

In [None]:
df[0:2]

...or columns.

In [None]:
df['a']

Using attributes to get columns. (WARNING: this doesn't work if the labels aren't valid attribute names.)


In [None]:
df.a

Back to boolean lists (or arrays). These are very useful, but watch your parentheses.

In [None]:
df[(df.a > 0) & (df.b < 0) | (df.c > 1)]
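The parentheses matter for two reasons: each comparison must be wrapped because `&` and `|` bind more tightly than `>` and `<`, and `&` binds more tightly than `|`. The sketch below uses a small hypothetical frame to show how regrouping changes the selected rows.

```python
import pandas as pd

# Hypothetical values chosen so the two groupings select different rows.
frame = pd.DataFrame({'a': [1, -1, 1, -1],
                      'b': [-1, -1, 1, 1],
                      'c': [0, 2, 2, 0]})

# & is applied before |, so this reads as (a > 0 and b < 0) or (c > 1).
default_grouping = frame[(frame.a > 0) & (frame.b < 0) | (frame.c > 1)]

# Explicit parentheses give (a > 0) and (b < 0 or c > 1) instead.
regrouped = frame[(frame.a > 0) & ((frame.b < 0) | (frame.c > 1))]

print(default_grouping.index.tolist())
print(regrouped.index.tolist())
```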

Each column is a series

In [None]:
print(df.a)
print(type(df.a))

Each row is a series

In [None]:
print(df.iloc[0])
print(type(df.iloc[0]))

Every column shares the same index

In [None]:
print(df.a.index)
print(df.b.index)

The index for each row is the column headers

In [None]:
df.iloc[0].index

#DataFrame basic operations

Load data from a delimited file.

In [None]:
file_path = 'playgolf.csv'
df = pd.read_csv(file_path, delimiter='|')
df.head(5)
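Writing works the same way in reverse. The sketch below round-trips a small pipe-delimited table through `read_csv` and `to_csv`, using in-memory buffers in place of files so that it is self-contained; the rows are hypothetical and are not the contents of `playgolf.csv`.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text standing in for a file on disk.
raw = "Date|Temperature|Humidity\n07-01-2014|85|85\n07-02-2014|80|90\n"
frame = pd.read_csv(io.StringIO(raw), delimiter='|')

# Write the frame back out; index=False omits the row labels.
buffer = io.StringIO()
frame.to_csv(buffer, sep='|', index=False)
written = buffer.getvalue()
print(written)
```

Passing a file path instead of a buffer to either function reads from or writes to disk.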

Create a new column.

In [None]:
df['new'] = df.Temperature + df.Humidity
df.head(5)

Drop a column

In [None]:
df = df.drop('new', axis=1)
df

Delete a row

In [None]:
df = df.drop(13, axis=0)
df

Add a row

In [None]:
new_row = pd.DataFrame([['07-14-2014', 'foggy', 62, 90, False, 'Play']], index=[13], columns=df.columns)

In [None]:
pd.concat([df, new_row])

#Applying functions

Using existing functions

In [None]:
df.mean(numeric_only=True)

Creating and using an arbitrary function

In [None]:
fun = lambda x: x**2
df.Temperature.apply(fun)

Using multiple values on each row

In [None]:
fun2 = lambda x, y: x + y
df.apply(lambda x: fun2(x.Temperature, x.Humidity), axis=1)
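Row-wise `apply` calls the function once per row, which can be slow on large frames. When the operation can be written directly on whole columns, that form is usually shorter and faster. A sketch on a small hypothetical frame (the values are made up) comparing the two:

```python
import pandas as pd

# Hypothetical readings; the column names mirror those used above.
frame = pd.DataFrame({'Temperature': [85, 80, 83],
                      'Humidity': [85, 90, 78]})

# Row-wise: the lambda receives each row as a Series.
row_wise = frame.apply(lambda row: row.Temperature + row.Humidity, axis=1)

# Column-wise: the addition is applied to entire columns at once.
vectorized = frame.Temperature + frame.Humidity

print(row_wise.tolist())
print(vectorized.tolist())
```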

#Summarizing dataframe

In [None]:
df.describe()

In [None]:
df.info()

#Index manipulation

Set index

In [None]:
df = df.set_index(['Date'])

In [None]:
df

Reset index

In [None]:
df = df.reset_index()
df
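As a sketch of what these two calls do, the example below uses a small hypothetical frame: `set_index` moves a column into the index so rows can be looked up by label, and `reset_index` moves it back. Passing `drop=True` discards the index instead of restoring it as a column.

```python
import pandas as pd

# Hypothetical two-row frame.
frame = pd.DataFrame({'Date': ['07-01-2014', '07-02-2014'],
                      'Temperature': [85, 80]})

# Move 'Date' into the index, then look up a row by its label.
indexed = frame.set_index('Date')
temp = indexed.loc['07-02-2014', 'Temperature']
print(temp)

# Restore 'Date' as an ordinary column.
restored = indexed.reset_index()
print(restored.columns.tolist())

# Discard the index entirely instead of turning it back into a column.
dropped = indexed.reset_index(drop=True)
print(dropped.columns.tolist())
```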

#Grouping (split-apply-combine)

Groupby is a big topic; more documentation at http://pandas.pydata.org/pandas-docs/stable/groupby.html

Get averages for each outlook

In [None]:
df.groupby('Outlook').mean(numeric_only=True).reset_index()

Initialize a groupby object and iterate through the groupings

In [None]:
grouped = df.groupby(['Windy', 'Result'])
for name, group in grouped:
    print(name)
    print(group, '\n')

Get the sum for each group

In [None]:
grouped.aggregate(sum)

Note there are shorthand functions for many things, so this will give the same result.

In [None]:
grouped.sum()

Normalize by the group each date belonged in

In [None]:
grouped[['Temperature', 'Humidity']].transform(lambda x: (x - x.mean()) / x.std())
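Unlike an aggregation, `transform` returns one value per original row, aligned to the original index. A sketch with hypothetical numbers that are easy to verify by hand (two groups of two rows each):

```python
import pandas as pd

# Hypothetical data: two calm days and two windy days.
frame = pd.DataFrame({'Windy': [False, False, True, True],
                      'Temperature': [60.0, 80.0, 70.0, 74.0]})

# Standardize each temperature against the mean and std of its own group.
z = frame.groupby('Windy')['Temperature'].transform(
    lambda x: (x - x.mean()) / x.std())
print(z)
```

With only two rows per group, each standardized pair is symmetric around zero.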

Summarize by each group

In [None]:
grouped.describe().T

#Cross Comparison

In [None]:
pd.crosstab(df.Outlook, df.Result)
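`crosstab` counts how often each pair of values occurs together. The sketch below uses a small hypothetical frame (the rows are made up) and also shows the `margins=True` option, which adds row and column totals under the label `All`.

```python
import pandas as pd

# Hypothetical observations.
frame = pd.DataFrame({
    'Outlook': ['sunny', 'sunny', 'rain', 'rain', 'rain'],
    'Result': ['Play', "Don't Play", 'Play', 'Play', "Don't Play"],
})

# Rows are Outlook values, columns are Result values, cells are counts.
counts = pd.crosstab(frame.Outlook, frame.Result)
print(counts)

# Add totals for each row and column.
with_totals = pd.crosstab(frame.Outlook, frame.Result, margins=True)
print(with_totals)
```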

#Charting

In [None]:
# This enables charts to display inline in the IPython Notebook
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
df[['Humidity', 'Temperature']].hist(bins=10)

In [None]:
df[['Humidity', 'Temperature']].plot(kind='box')

In [None]:
df.plot('Humidity', 'Temperature', kind='scatter')