# Pandas

pandas builds on the numpy ndarray to provide a data structure with labeled columns, called a data frame.

In this respect, its main competition is R: the data frame provides the data-analysis functionality that R offers natively.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Series

A series is a labeled array.  It looks superficially like a dictionary, but it is fixed size and can handle missing values.  It can also be operated on with any numpy operation or the standard operators (a dictionary cannot).

Some examples from: http://pandas.pydata.org/pandas-docs/stable/dsintro.html

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.339273
b   -0.648103
c   -0.061944
d    1.416584
e   -0.874530
dtype: float64

In [3]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
pd.Series(np.random.randn(5))

0    0.411537
1   -0.791442
2   -1.757569
3    0.234559
4   -2.398888
dtype: float64

you can initialize from a dictionary

In [5]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0
b    1
c    2
dtype: float64

In [6]:
pd.Series(d, index=['b', 'c', 'd', 'a'])

b     1
c     2
d   NaN
a     0
dtype: float64

Note that NaN indicates a missing value
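A minimal sketch of how the missing entry can be detected and handled, reusing the dictionary from the cell above. The name `s_missing` and the fill value of 0.0 are only illustrative choices.

```python
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}
s_missing = pd.Series(d, index=['b', 'c', 'd', 'a'])

# boolean mask marking the missing entries
mask = s_missing.isna()

# drop the missing entries, or replace them with a chosen value
dropped = s_missing.dropna()
filled = s_missing.fillna(0.0)   # 0.0 is an arbitrary illustrative choice

# reductions such as sum() skip NaN by default
total = s_missing.sum()
```

Neither `dropna()` nor `fillna()` modifies `s_missing`; both return a new series.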

you can operate on a series as you would any ndarray

In [7]:
s.iloc[0]  # positional access; plain s[0] is deprecated for a label-indexed series in recent pandas

0.33927320734018018

In [8]:
s[:3]

a    0.339273
b   -0.648103
c   -0.061944
dtype: float64

In [9]:
s

a    0.339273
b   -0.648103
c   -0.061944
d    1.416584
e   -0.874530
dtype: float64

In [10]:
s[s > s.median()]

a    0.339273
d    1.416584
dtype: float64

In [11]:
np.exp(s)

a    1.403927
b    0.523037
c    0.939935
d    4.123011
e    0.417058
dtype: float64

you can also index by label

In [12]:
s['a']

0.33927320734018018

In [13]:
s['e']

-0.87453029066187893

In [15]:
'e' in s

True

In [16]:
s.get('f', np.nan)

nan
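The label lookups above can be summarized in a short sketch. It assumes a small hypothetical series `s_demo` with fixed values, so the results do not depend on the random numbers used earlier.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# indexing with a missing label raises KeyError, as with a dictionary
try:
    s_demo['f']
    found = True
except KeyError:
    found = False

# get() returns None for a missing label unless a default is supplied
default_none = s_demo.get('f')
default_nan = s_demo.get('f', np.nan)

# .loc makes label-based access explicit; .iloc is position-based
by_label = s_demo.loc['b']
by_position = s_demo.iloc[1]
```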

In [17]:
s + s

a    0.678546
b   -1.296207
c   -0.123888
d    2.833168
e   -1.749061
dtype: float64

In [18]:
s * 2

a    0.678546
b   -1.296207
c   -0.123888
d    2.833168
e   -1.749061
dtype: float64

note that operations are always aligned on labels, so the following does not behave the same as it would for numpy arrays.  The result is indexed by the union of the two indices, and any label missing from either operand gives NaN

In [19]:
s[1:] + s[:-1]

a         NaN
b   -1.296207
c   -0.123888
d    2.833168
e         NaN
dtype: float64
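A sketch of the alignment behavior with fixed values, under the assumption that a hypothetical series `s_demo` stands in for the random one above. It also shows two ways to avoid the NaN entries: `add()` with `fill_value`, or dropping the labels and adding by position.

```python
import pandas as pd

s_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=['a', 'b', 'c', 'd', 'e'])

left = s_demo.iloc[1:]    # labels b, c, d, e
right = s_demo.iloc[:-1]  # labels a, b, c, d

# the + operator aligns on labels; a label present on only one side gives NaN
aligned = left + right

# add() with fill_value treats a label missing from one side as 0
filled = left.add(right, fill_value=0)

# to add by position instead, as numpy would, drop the labels first
positional = left.to_numpy() + right.to_numpy()
```

Here `aligned` has NaN at `a` and `e`, while `positional` is a plain ndarray of length 4 with no labels at all.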

a series can have a name

In [20]:
s = pd.Series(np.random.randn(5), name='something')
s

0    1.373504
1    0.715016
2   -0.236134
3   -0.746722
4   -0.995908
Name: something, dtype: float64
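As a brief illustration of what the name is used for, the sketch below renames a series and converts it to a data frame. The names `s_named`, `s_other`, and `'different'` are made up for this example.

```python
import numpy as np
import pandas as pd

s_named = pd.Series(np.arange(3.0), name='something')

# rename() with a scalar returns a new series; the original keeps its name
s_other = s_named.rename('different')

# the name becomes the column label when converting to a data frame
frame = s_named.to_frame()
```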

## DataFrame

The dataframe is like a spreadsheet: the rows and columns have labels.  It is 2-d.

you can initialize from Series

In [21]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [24]:
df = pd.DataFrame(d)
df

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


In [25]:
pd.DataFrame(d, index=['d', 'b', 'a'])

   one  two
d  NaN    4
b  2.0    2
a  1.0    1
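Columns can be selected at construction time in the same way as rows. The sketch below assumes the same dictionary of series as above; the column label `'three'` is hypothetical and deliberately not a key in the dictionary.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# keep only the listed rows and columns; 'three' is not a key in d
df_sub = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

# a requested column that is absent from the dictionary is filled with NaN
all_missing = df_sub['three'].isna().all()
```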


or from lists / ndarrays

In [26]:
d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

In [27]:
pd.DataFrame(d)

   one  two
0    1    4
1    2    3
2    3    2
3    4    1


In [28]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

   one  two
a    1    4
b    2    3
c    3    2
d    4    1
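Once a data frame exists, its row and column labels can be read back and used for selection. This is a minimal sketch built on the same dictionary of lists; the name `df_demo` is an illustrative choice.

```python
import pandas as pd

d = {'one': [1., 2., 3., 4.],
     'two': [4., 3., 2., 1.]}
df_demo = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

# row labels and column labels
row_labels = list(df_demo.index)
col_labels = list(df_demo.columns)

# selecting one column returns a series that shares the row labels
col = df_demo['one']

# a single row, selected by label, is also a series (indexed by column name)
row = df_demo.loc['b']

# shape is (rows, columns)
n_rows, n_cols = df_demo.shape
```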
