# Pandas

pandas builds on the NumPy ndarray to provide a data structure with labeled rows and columns, called a DataFrame.

In this respect its main competition is R: the DataFrame provides the data-analysis functionality that R offers natively.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## series

A Series is a labeled one-dimensional array.  It looks superficially like a dictionary, but it has a fixed size and can handle missing values.  It can also be operated on with any NumPy operation or the standard operators (a dictionary cannot).

Some examples from: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -0.703337
b   -0.325814
c   -0.566975
d   -1.108292
e    0.459929
dtype: float64

In [3]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
pd.Series(np.random.randn(5))

0    1.529094
1    0.192635
2   -0.989706
3    0.335290
4   -0.321017
dtype: float64

you can initialize from a dictionary

In [5]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0
b    1
c    2
dtype: float64

In [6]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b     1
c     2
d   NaN
a     0
dtype: float64

Note that NaN indicates a missing value
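
As a small sketch of how missing values behave (the data here is made up for illustration), `isna` flags them, reductions such as `sum` skip them by default, and `fillna` replaces them:

```python
import numpy as np
import pandas as pd

# Hypothetical data: asking for a label ('d') that is not in the dict yields NaN.
d = {'a': 0., 'b': 1., 'c': 2.}
t = pd.Series(d, index=['b', 'c', 'd', 'a'])

missing = t.isna()          # boolean Series, True where the value is NaN
total = t.sum()             # reductions skip NaN by default (skipna=True)
filled = t.fillna(-1.0)     # new Series with NaN replaced by a placeholder
```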

you can operate on a series as you would any ndarray

In [8]:
s

a   -0.703337
b   -0.325814
c   -0.566975
d   -1.108292
e    0.459929
dtype: float64

In [9]:
s.iloc[0]  # access by position; plain s[0] on a labeled Series is deprecated in recent pandas

-0.70333685096756748

In [10]:
s[:3]

a   -0.703337
b   -0.325814
c   -0.566975
dtype: float64

In [11]:
s

a   -0.703337
b   -0.325814
c   -0.566975
d   -1.108292
e    0.459929
dtype: float64

In [12]:
s[s > s.median()]

b   -0.325814
e    0.459929
dtype: float64

In [13]:
np.exp(s)

a    0.494931
b    0.721940
c    0.567239
d    0.330122
e    1.583962
dtype: float64

you can also index by label

In [14]:
s['a']

-0.70333685096756748

In [15]:
s['e']

0.45992941214662875

In [16]:
'e' in s

True

In [17]:
s.get('f', np.nan)

nan

In [18]:
s + s

a   -1.406674
b   -0.651627
c   -1.133950
d   -2.216583
e    0.919859
dtype: float64

In [19]:
s * 2

a   -1.406674
b   -0.651627
c   -1.133950
d   -2.216583
e    0.919859
dtype: float64

note that operations always align on labels, so the following does not behave exactly like NumPy arrays: the result is indexed by the union of the two indices, with NaN wherever a label is missing from either operand

In [20]:
s[1:] + s[:-1]

a         NaN
b   -0.651627
c   -1.133950
d   -2.216583
e         NaN
dtype: float64
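
The same alignment can be seen with two Series whose labels only partly overlap. This is a minimal sketch with made-up values; `Series.add` with `fill_value` is one way to avoid the NaN entries, and converting to NumPy arrays shows the purely positional result for comparison:

```python
import pandas as pd

# Hypothetical Series sharing only the labels 'b' and 'c'.
x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
y = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

aligned = x + y                            # 'a' and 'd' exist in one operand only -> NaN
filled = x.add(y, fill_value=0.0)          # treat the missing operand as 0 instead
positional = x.to_numpy() + y.to_numpy()   # NumPy ignores labels and adds by position
```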

a series can have a name

In [None]:
s = pd.Series(np.random.randn(5), name='something')
s

## DataFrame

The DataFrame is like a spreadsheet: both the columns and the rows have labels.  It is two-dimensional.

you can initialize from Series

In [None]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [None]:
df = pd.DataFrame(d)
df

In IPython, tab completion for column names is enabled

In [None]:
df.one

In [None]:
pd.DataFrame(d, index=['d', 'b', 'a'])

or from lists / ndarrays

In [None]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

In [None]:
pd.DataFrame(d)

In [None]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

there are lots of other initialization methods, e.g., a list of dicts

In [None]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data2, index=['first', 'second'])

Working with the dataframe

you can index it as if it were a dict of Series objects

In [None]:
df['one']

In [None]:
df

In [None]:
type(df['one'])

In [None]:
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
df

you can delete or pop columns

In [None]:
del df['two']

In [None]:
three = df.pop('three')

In [None]:
df

In [None]:
three

initialize with a scalar

In [None]:
df['foo'] = 'bar'

In [None]:
df

## CSV

you can also read from CSV

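Note: if there is stray whitespace in the strings in your CSV, pandas will keep it.  The `skipinitialspace=True` option used here drops whitespace that follows a delimiter; for anything else you might need to look into the `converters` argument of `read_csv` to get things properly formatted.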
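
A minimal sketch of the whitespace issue, using an in-memory CSV so that it runs without a file (the text and column names below are made up, not the contents of `sample.csv`):

```python
import io
import pandas as pd

# Hypothetical CSV text with spaces after the commas and after the names.
raw = "student, hw 1, exam\nA , 10, 95\nB , 8, 72\n"

# skipinitialspace=True drops whitespace that follows each delimiter,
# so the headers become 'hw 1' and 'exam' rather than ' hw 1' and ' exam'.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Whitespace at the end of a string field is untouched, so strip it explicitly.
df['student'] = df['student'].str.strip()
df = df.set_index('student')
```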

There are similar methods for HDF5 and Excel

In [None]:
grades = pd.read_csv('sample.csv', index_col="student", skipinitialspace=True)

In [None]:
grades

In [None]:
grades.index

In [None]:
grades.columns

In [None]:
grades.loc["A"]  # .ix was removed in pandas 1.0; .loc selects a row by label

In [None]:
grades['hw 1']

In [None]:
grades['hw average'] = (grades['hw 1'] + grades['hw 2'] + grades['hw 3'] + grades['hw 4'])/4.0

In [None]:
grades

this didn't handle the missing data properly: a missing homework makes the whole average NaN.  Here we fill the missing entries with 0 first

In [None]:
g2 = grades.fillna(0)

In [None]:
g2['hw average'] = (g2['hw 1'] + g2['hw 2'] + g2['hw 3'] + g2['hw 4'])/4.0

In [None]:
g2
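
Filling with 0 is one choice among several. As a sketch on a small made-up frame (not the data in `sample.csv`), the three behaviors compare as follows; which one is appropriate depends on whether missing work should count as a zero:

```python
import numpy as np
import pandas as pd

# Hypothetical grades; student B is missing 'hw 2'.
g = pd.DataFrame({'hw 1': [10.0, 8.0], 'hw 2': [9.0, np.nan]},
                 index=['A', 'B'])

propagated = (g['hw 1'] + g['hw 2']) / 2.0   # NaN for B: arithmetic propagates NaN
as_zero = g.fillna(0).mean(axis=1)           # missing work counts as a zero
skipped = g.mean(axis=1)                     # missing work is left out of the average
```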

For big dataframes, we can view just pieces

In [None]:
g2.head()

In [None]:
g2.tail(2)

### statistics

we get lots of statistics

In [None]:
g2.describe()

want to sort by values?

In [None]:
g2.sort_values(by="exam")

In [None]:
g2.mean()

In [None]:
g2.median()

In [None]:
g2.max()

In [None]:
g2

In [None]:
g2.apply(lambda x: x.max() - x.min())

### access

pandas provides optimized methods for accessing data: .at, .iat, .loc, and .iloc (the older .ix indexer was removed in pandas 1.0)

The standard slice notation works for rows, but note *when using labels, both endpoints are included*

In [None]:
g2["E":"I"]
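
The inclusive-endpoint rule for label slices, as opposed to position slices, can be checked on a small made-up frame:

```python
import pandas as pd

# Hypothetical frame with four labeled rows.
frame = pd.DataFrame({'exam': [90, 80, 70, 60]}, index=['A', 'B', 'C', 'D'])

by_label = frame.loc['B':'C']     # rows B and C: the end label is included
by_position = frame.iloc[1:2]     # row B only: the end position is excluded
```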

In [None]:
g2.loc[:,["hw 1", "exam"]]

`at` is a faster access method for a single scalar value

In [None]:
g2.at["A","exam"]

The `i` routines work on integer positions, similar to how NumPy indexes arrays

In [None]:
g2.iloc[3:5,0:2]

In [None]:
g2.iloc[[1,3,5], [1,2,3,4]]

In [None]:
g2.iat[2,2]

### boolean indexing

In [None]:
g2[g2.exam > 90]

### np arrays

In [None]:
g2.loc[:, "new"] = np.random.random(len(g2))

In [None]:
g2

resetting values

In [None]:
a = g2[g2.exam < 80].index

In [None]:
g2.loc[a, "exam"] = 80
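
As an alternative sketch on made-up data, the boolean mask can be passed to `.loc` directly, without extracting the index first; `Series.clip` expresses the same floor and returns a new Series:

```python
import pandas as pd

# Hypothetical exam scores.
scores = pd.DataFrame({'exam': [95, 72, 85]}, index=['A', 'B', 'C'])

# Same idea written as a floor: values below 80 become 80 in a new Series.
floored = scores['exam'].clip(lower=80)

# Assign in place to the rows where the mask is True.
scores.loc[scores['exam'] < 80, 'exam'] = 80
```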

In [None]:
g2

## histogramming

In [None]:
g2["exam"].value_counts()
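
`value_counts` counts each distinct value. For a histogram over ranges, one possible approach is to bin the values with `pd.cut` first; the scores and bin edges below are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical exam scores.
exam = pd.Series([55, 62, 71, 78, 84, 88, 93, 97])

# Intervals are closed on the right by default, e.g. (50, 60].
binned = pd.cut(exam, bins=[50, 60, 70, 80, 90, 100])

# Count the entries per bin and order the result by bin.
counts = binned.value_counts().sort_index()
```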

## plotting

In [None]:
%matplotlib inline

In [None]:
g2.plot()

In [None]:
g2.plot.scatter(x="hw average", y="exam", marker="o")

A lot more examples at: http://pandas.pydata.org/pandas-docs/stable/visualization.html