# Chapter 5: Intro to pandas Data Structures
## Series
A `Series` is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of labels, called its *index*. Let's import NumPy and pandas and create some data objects.

In [21]:
# import pandas and NumPy
import pandas as pd
import numpy as np
import pprint as pp

In [23]:
# create list as input for series
lst = [num ** 2 for num in xrange(1, 64) if num % 2 == 0 and num % 3 == 0]

# print this list
pp.pprint(lst)

[36, 144, 324, 576, 900, 1296, 1764, 2304, 2916, 3600]


In [25]:
# define series
series_1 = pd.Series(lst)

# print out series
series_1

0      36
1     144
2     324
3     576
4     900
5    1296
6    1764
7    2304
8    2916
9    3600
dtype: int64

Now I'd like to get each component out of the `Series`, the `index` to the left and the `array` values on the right. I'll do this using the `index` and `values` attributes.

In [26]:
# get the index
series_1.index

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

In [27]:
# get the values
series_1.values

array([  36,  144,  324,  576,  900, 1296, 1764, 2304, 2916, 3600])
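
The outputs above come from an older pandas release running on Python 2. As a hedged aside, here is a minimal sketch of the same inspection on a current install (assuming Python 3 and a recent pandas, where a default index is normally reported as a `RangeIndex` rather than an `Int64Index`). The names `squares` and `s` are made up for this example.

```python
import pandas as pd

# squares of the multiples of 6 below 64 (the same data as lst above)
squares = [num ** 2 for num in range(1, 64) if num % 6 == 0]
s = pd.Series(squares)

print(s.index)        # the labels; a RangeIndex when no index is passed
print(s.to_numpy())   # the underlying data as a NumPy array
```

`to_numpy()` is used here in place of `.values`; both return the data as an array for a plain numeric `Series`.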

In [28]:
# create a series and specify the index
series_2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# print
series_2

d    4
b    7
a   -5
c    3
dtype: int64

In [29]:
# Use the index to subset the series
series_2['a']

-5

In [30]:
# use the index to subset and reassign the value
series_2['d'] = 6

In [31]:
# one more subsetting example with a list of values
series_2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

### NumPy array operations on Series
Filtering with a boolean array, scalar multiplication, and applying a math function all preserve the `index`-`value` link.

In [32]:
# Boolean array
series_2[series_2 > 0]

d    6
b    7
c    3
dtype: int64

In [33]:
# scalar multiplication
series_2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [34]:
# math function
np.exp(series_2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
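
To convince myself that the `index`-`value` link really survives these operations, a small self-contained check (a sketch assuming Python 3 and a recent pandas; `positive`, `doubled`, and `exponent` are illustrative names):

```python
import numpy as np
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]      # boolean filter keeps the labels of the surviving rows
doubled = s * 2          # scalar arithmetic keeps every label
exponent = np.exp(s)     # a NumPy ufunc returns a Series, not a bare array

print(list(positive.index))       # ['d', 'b', 'c']
print(doubled['a'])               # -10
print(type(exponent).__name__)    # Series
```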

Another way to think about a `Series` is as a fixed-length, ordered dict. It is a mapping of index values to data values. It can be substituted into many functions that expect a dict.

In [35]:
# testing membership
'b' in series_2

True

In [36]:
# membership example 2
'e' in series_2

False
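
One detail worth spelling out: like a dict, `in` tests the labels (keys), not the data. A short sketch of that behavior, assuming Python 3 and a recent pandas:

```python
import pandas as pd

s = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print('b' in s)             # True: 'b' is an index label
print(7 in s)               # False: 7 is a value, not a label
print(7 in s.to_numpy())    # True: test the values explicitly
print(s.to_dict())          # {'d': 6, 'b': 7, 'a': -5, 'c': 3}
```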

When you have a Python dict, you can create a Series from it by passing the dict.

In [37]:
# create the dictionary to be passed to Series
dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [38]:
# now pass the dictionary to Series
series_3 = pd.Series(dict_data)

# print it out
series_3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [41]:
# pass a new index to series_3
states = ['California', 'Ohio', 'Oregon', 'Utah', 'Texas']

# Series 4 using the dictionary and the new index (states)
series_4 = pd.Series(dict_data, index=states)

# print it out
series_4

California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
dtype: float64
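
The sorted key order shown for `series_3` reflects the older pandas used in this notebook. To my understanding, recent releases on Python 3.6+ keep the dict's insertion order instead. A sketch under that assumption (`from_dict` and `reordered` are names invented here):

```python
import pandas as pd

dict_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

from_dict = pd.Series(dict_data)   # index follows the dict's key order
states = ['California', 'Ohio', 'Oregon', 'Utah', 'Texas']
reordered = pd.Series(dict_data, index=states)   # explicit index wins

print(list(from_dict.index))
print(reordered.dtype)   # float64, because NaN forces a float dtype
```

Passing `index=` is the reliable way to get a specific order regardless of version.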

Notice that California is inserted into the Series but with no data, so its value is `NaN`. Now, let's look at a few ways to detect missing values: `isnull` and `notnull`, both top-level functions in the `pandas` library.

In [42]:
# is the value Null?
pd.isnull(series_4)

California     True
Ohio          False
Oregon        False
Utah          False
Texas         False
dtype: bool

In [43]:
# is the value valid?
pd.notnull(series_4)

California    False
Ohio           True
Oregon         True
Utah           True
Texas          True
dtype: bool

In [44]:
# the Series also has these two library functions as instance methods
series_4.isnull()

California     True
Ohio          False
Oregon        False
Utah          False
Texas         False
dtype: bool

A nice feature of `Series` is that it automatically aligns differently indexed data in arithmetic operations.

In [52]:
pp.pprint(series_3)
print '\n'
pp.pprint(series_4)

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64


California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
dtype: float64


In [53]:
# notice above the different indexing, let's add them
series_3 + series_4

California       NaN
Ohio           70000
Oregon         32000
Texas         142000
Utah           10000
dtype: float64

Both the `Series` object and its index have a `name` attribute.

In [54]:
# give series name
series_4.name = 'population'

# give index name
series_4.index.name = 'state'

In [55]:
series_4

state
California      NaN
Ohio          35000
Oregon        16000
Utah           5000
Texas         71000
Name: population, dtype: float64

You can also alter the `index` in place by assignment.

In [56]:
# reassign the index
series_1.index = ['Bob', 'Jack', 'Ryan', 'Laura', 'Dave'] * 2

In [57]:
# print it out
series_1

Bob        36
Jack      144
Ryan      324
Laura     576
Dave      900
Bob      1296
Jack     1764
Ryan     2304
Laura    2916
Dave     3600
dtype: int64
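
The reassigned index above contains each name twice, which pandas allows. A hedged sketch of two consequences, assuming Python 3 and a recent pandas: selecting a duplicated label returns several rows, and the new index must have the same length as the data.

```python
import pandas as pd

names = ['Bob', 'Jack', 'Ryan', 'Laura', 'Dave'] * 2
s = pd.Series(range(10), index=names)

print(s.index.is_unique)   # False: every name appears twice
print(s['Bob'])            # a two-element Series, not a scalar

try:
    s.index = ['only', 'two']   # wrong length for ten values
    length_error = None
except ValueError as exc:
    length_error = exc
print(length_error)
```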

## DataFrame
A `DataFrame` represents a tabular (spreadsheet-like) data structure containing an ordered collection of columns, each of which can be of a different data type. `DataFrame`s have both row and column indexes. One thing I'd like to understand in more detail is the difference between R's `data.frame` and pandas' `DataFrame`.

There are many ways to create a `DataFrame`, one of the most common being from a dict of equal-length lists or NumPy arrays.

In [58]:
# this dictionary will be passed to the DataFrame constructor (dict of equal-length lists)
df_data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
           'year': [2000, 2001, 2002, 2001, 2002],
           'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [59]:
# create the dataframe
frame = pd.DataFrame(df_data)

# print the dataframe
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [60]:
# specify the order of columns
pd.DataFrame(df_data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [64]:
# pass a column that doesn't exist in the underlying data set (get NaN's)
frame_2 = pd.DataFrame(df_data, columns=['year', 'state', 'pop', 'debt'],
                       index=['one', 'two', 'three', 'four', 'five'])

# print
frame_2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


Get the columns from the `DataFrame` object

In [66]:
# use column attribute
frame_2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

A column in a `DataFrame` can be retrieved as a `Series` either by dict-like notation or by attribute:

In [67]:
# dict-like notation
frame_2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [68]:
# attribute
frame_2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

Two things to note here: **1)** the resulting `Series` has the same `index`. **2)** the `name` attribute has been set appropriately.

In [70]:
# use ix by name: gets the 3rd row
frame_2.ix['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Columns can be modified by reassignment. Take the empty `debt` column for example.

In [71]:
# assign a scalar value to debt
frame_2['debt'] = 16.5

# print
frame_2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [73]:
# or use an array
frame_2['debt'] = np.arange(5.)

# print
frame_2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4


When assigning an array or list to a column, its length must match the length of the `DataFrame`. Assigning a `Series` is a little more flexible: the `Series` is conformed to the `DataFrame`, matching on index and inserting missing values where an index label is not present.

In [74]:
# create a series to be used for the debt column
debt_series = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

# assignment
frame_2['debt'] = debt_series

# print
frame_2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
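
A compact sketch contrasting the two assignment rules just described (assuming Python 3 and a recent pandas; the `bad` column and `length_error` are illustrative names):

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# a Series is aligned on the index; 'one' and 'three' have no match
frame['debt'] = pd.Series([-1.2], index=['two'])

# a list carries no labels, so its length must equal the number of rows
try:
    frame['bad'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(length_error)
```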


Assigning to a column that does not exist creates a new column. To delete a column, use the **del** keyword.

In [75]:
# assign a column that doesn't exist
frame_2['eastern'] = frame_2.state == 'Ohio'

# print
frame_2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [76]:
# now delete
del frame_2['eastern']

# get columns to make sure it worked
frame_2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

**Note:** <br>
The column returned when indexing a `DataFrame` is a *view* on the underlying data, not a copy. Thus, any in-place modifications to the `Series` will be reflected in the `DataFrame`. The column can be explicitly copied using the `Series`'s `copy` method.
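
Whether plain indexing hands back a view can depend on the pandas version (newer releases move toward copy-on-write, as far as I understand), so the sketch below only demonstrates the part that holds either way: an explicit `copy` is independent of the frame. Assumes Python 3 and a recent pandas; `pop_copy` is a made-up name.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

pop_copy = frame['pop'].copy()   # independent copy of the column
pop_copy['one'] = 99.0           # modify only the copy

print(frame.loc['one', 'pop'])   # 1.5, the frame is unchanged
print(pop_copy['one'])           # 99.0
```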

Another common form of input data is a nested dictionary: a dictionary of dictionaries.

In [77]:
# create nested dict to be passed to DataFrame
population = {'Nevada': {2001: 2.4, 2002: 2.9},
              'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# pass to Dataframe
frame_3 = pd.DataFrame(population)

# print
frame_3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [78]:
# let's transpose the result
frame_3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


The keys in the inner dicts are unioned and sorted to form the index in the final result. However, if you explicitly pass the index, this is not true:

In [79]:
pd.DataFrame(population, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [80]:
# let's try a dict of Series
# first input: take a look first
frame_3['Ohio'][:-1] # [:-1] means go up to last one (but not including)

2000    1.5
2001    1.7
Name: Ohio, dtype: float64

In [82]:
# second input: take a look first
frame_3['Nevada'][:2] # same thing as above (but only if you know you have three elements)

2000    NaN
2001    2.4
Name: Nevada, dtype: float64
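
The two cells above only inspect the inputs. A hedged sketch of the step they lead up to, passing a dict of `Series` to `DataFrame` (assuming Python 3 and a recent pandas; `pdata` and `frame_4` are hypothetical names, and `.iloc` is used so the slices are unambiguously positional):

```python
import pandas as pd

population = {'Nevada': {2001: 2.4, 2002: 2.9},
              'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
# sort the index explicitly so the years run 2000, 2001, 2002
frame_3 = pd.DataFrame(population).sort_index()

# dict mapping column name -> Series; the Series are aligned on their index
pdata = {'Ohio': frame_3['Ohio'].iloc[:-1],
         'Nevada': frame_3['Nevada'].iloc[:2]}
frame_4 = pd.DataFrame(pdata)

print(frame_4)
```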

Similar to `Series`, you can get the values from the `DataFrame` by using the `values` attribute.

In [83]:
# frame 3 values
frame_3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [84]:
frame_2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

`Index` objects are immutable, which becomes important when they are shared among other data structures.

In [96]:
# create an index
index_input = pd.Index(np.arange(3))

# create series
series_index = pd.Series([-1.5, -2.5, 0], index=index_input)

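# test whether the Series index is an `Index`
# note: `is` compares identity with the class object itself, so this is always False;
# `series_index.index is index_input` or `isinstance(series_index.index, pd.Index)` is the likely intent
series_index.index is pd.Int64Index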

False

In [97]:
# test for membership in columns
'Ohio' in frame_3.columns

True

In [98]:
# test for membership in index
2003 in frame_3.index

False

Each index has a number of methods and properties for set logic and answering other common questions about the data it contains.
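
A few of those set-logic methods, plus a check of the immutability claim, as a minimal sketch (assuming Python 3 and a recent pandas; `left`, `right`, and `mutation_error` are names made up here):

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))          # labels found in either index
print(left.intersection(right))   # labels found in both
print(left.difference(right))     # labels found only in left
print(left.isin(['a', 'z']))      # element-wise membership mask

try:
    left[0] = 'z'                 # item assignment is not allowed
    mutation_error = None
except TypeError as exc:
    mutation_error = exc
print(mutation_error)
```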

# Essential Functionality

## Reindexing

In [103]:
# create a series with an index specified
series_1 = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# print
series_1

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling `reindex` will rearrange the data according to the new index and introduce missing values for any new index values (i.e., values that were not present in the original data object).

In [104]:
# reindex series
series_2 = series_1.reindex(['a', 'b', 'c', 'd', 'e'])

# print
series_2 # 'e' is new and should have a NaN in its place

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [105]:
# reindex, but this time specify a fill value for missing values
series_1.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

For ordered data like time series, it can be useful to interpolate or fill values when reindexing. The `method` option allows us to do this; passing `ffill` forward-fills the values.

In [106]:
# create series
series_3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# print
series_3

0      blue
2    purple
4    yellow
dtype: object

In [107]:
# reindex with forward fill
series_3.reindex(xrange(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
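
For comparison, a sketch of forward fill next to its mirror image, backward fill (`bfill`), assuming Python 3 and a recent pandas. The names `colors`, `forward`, and `backward` are illustrative.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

forward = colors.reindex(range(6), method='ffill')    # carry the last value forward
backward = colors.reindex(range(6), method='bfill')   # pull the next value backward

print(forward.tolist())
print(backward.tolist())   # the last slot has no later value to pull from
```

Both methods rely on the original index being sorted, which is why they suit ordered data.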

With `DataFrame`, `reindex` can alter either the (row) index, columns, or both. When passed just a sequence, the rows are reindexed in the result:

In [108]:
# create a dataframe
frame = pd.DataFrame(np.arange(9).reshape((3,3)), 
                     index=['a', 'c', 'd'], 
                     columns=['Ohio', 'Texas', 'California'])

# print
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [109]:
# reindex just passing a sequence: rows will be reindexed by default
frame_2 = frame.reindex(['a', 'b', 'c', 'd'])

# print 
frame_2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [110]:
# reindex the columns by using the columns keyword
# use this as the argument
states = ['Texas', 'Utah', 'California']

# pass to reindex
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


**Note:** <br>
This should be self-evident, but I want to document that this rearranges the columns (in layman's terms) and adds missing values for columns not present. This is a theme in related operations, like reindexing rows with new index values.

Both rows and columns can be reindexed at one time, but interpolation will only apply row-wise (axis 0).

In [114]:
# reindex both row and column and do some interpolating via forward fill
# previous frame didn't have a 'b' index, so the value of 'a' should feed forward
frame.reindex(index=['a', 'b', 'c', 'd'],
            method='ffill',
            columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
b,1,,2
c,4,,5
d,7,,8
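
I'm not certain the single call above is accepted by every pandas version, since the requested column order is not sorted and `method` may be applied to both axes. A version-tolerant sketch does it in two steps: fill along the rows first, then reorder the columns (assuming Python 3 and a recent pandas; `rows_filled` and `result` are made-up names).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# step 1: forward-fill along the (sorted) row index only
rows_filled = frame.reindex(['a', 'b', 'c', 'd'], method='ffill')
# step 2: reorder the columns; 'Utah' is new, so it becomes NaN
result = rows_filled.reindex(columns=states)

print(result)
```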


According to the author (I just haven't proved it to myself yet), reindexing can be done more succinctly by label-indexing with `ix`:

In [115]:
frame.ix[['a', 'b', 'c', 'd'], states] # does ix support interpolation?

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0
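
`ix` is no longer available in current pandas, and its label-based replacement `.loc` behaves differently here: to my knowledge (pandas 1.0 and later), asking `.loc` for a label that does not exist raises `KeyError` instead of inserting a `NaN` row. A sketch under that assumption, with `reindex` as the explicit alternative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# labels that all exist can be selected directly
subset = frame.loc[['a', 'd'], ['Texas', 'California']]

# a missing label ('b') is expected to raise KeyError with .loc
try:
    frame.loc[['a', 'b', 'c', 'd']]
    missing_error = None
except KeyError as exc:
    missing_error = exc

# reindex is the explicit way to ask for NaN rows instead
with_gaps = frame.reindex(['a', 'b', 'c', 'd'])
print(with_gaps)
```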


## Dropping entries from an axis

The `drop` method will return a new object with the indicated value(s) deleted from an axis:

In [122]:
# create a series
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# print
obj

a    0
b    1
c    2
d    3
e    4
dtype: float64

In [123]:
# now, let's drop 'c'
new_obj = obj.drop('c')

# print new object
new_obj

a    0
b    1
d    3
e    4
dtype: float64

In [125]:
# one more for practice
obj.drop(['d', 'c'])

a    0
b    1
e    4
dtype: float64

With a `DataFrame`, index values can be deleted from either axis.

In [128]:
# create data frame
data_frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['Ohio', 'Colorado', 'Utah', 'New York'],
                          columns=['one', 'two', 'three', 'four'])

# print
data_frame

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [129]:
# drop rows
data_frame.drop(['Ohio', 'Colorado'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [130]:
# drop columns
data_frame.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
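
Two things I wanted to confirm about `drop`: the original object is left alone, and the axis can also be named with keywords. A minimal sketch, assuming Python 3 and a pandas recent enough to accept the `index=` and `columns=` keywords:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

fewer_rows = df.drop(index=['Ohio', 'Colorado'])
fewer_cols = df.drop(columns=['two', 'four'])   # same as drop([...], axis=1)

print(df.shape)           # (4, 4): the original is untouched
print(fewer_rows.shape)   # (2, 4)
print(fewer_cols.shape)   # (4, 2)
```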


## Indexing, selection, and filtering

`Series` indexing works analogously to NumPy array indexing, except you can use the `Series`'s index values instead of only integers.

In [132]:
# create an object
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# print
obj

a    0
b    1
c    2
d    3
dtype: float64

In [135]:
# 'b' or the value in the 1 index
print obj['b']

# print new line between two objects
print "\n"

# or 
print obj[1]

1.0


1.0


In [143]:
# do the same with multiple values
pp.pprint(obj[2:])

# print new line between two objects
print "\n"

# index labels rather than integer
pp.pprint(obj['c':])

c    2
d    3
dtype: float64


c    2
d    3
dtype: float64


In [144]:
# another example: multiple index labels
obj[['b', 'a', 'd']]

b    1
a    0
d    3
dtype: float64

In [145]:
# multiple integer index values
obj[[1, 3]]

b    1
d    3
dtype: float64

In [146]:
# filtering example
obj[obj < 2]

a    0
b    1
dtype: float64

**Note:** <br>
Label slicing includes the endpoint, which deviates from the usual Python slicing convention.
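
A side-by-side sketch of the two slicing conventions, written with the explicit `.loc` (labels) and `.iloc` (positions) indexers (assuming Python 3 and a recent pandas):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # the endpoint 'c' is included
by_position = obj.iloc[1:2]    # the endpoint position 2 is excluded

print(by_label.tolist())      # [1.0, 2.0]
print(by_position.tolist())   # [1.0]
```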

In [147]:
# create a dataframe
data = pd.DataFrame(np.arange(16.).reshape((4,4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# print
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [148]:
# get the 'two' column
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: float64

In [149]:
# what is returned (type)
type(data['two'])

pandas.core.series.Series

In [150]:
# get multiple columns
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [151]:
type(data[['three', 'one']])

pandas.core.frame.DataFrame

Row indexing via slicing in this form is *weird* and *inconsistent* in relation to both Python and pandas indexing conventions. For example, to get rows in the first two slots (indexes 0 and 1), you'd think since `obj['val']` gets a column, then `obj['val', row index]` or `obj[row index, 'val']` would get both a row and column combination but this is not true. Instead, to get the first two rows, you'd enter `obj[:2]`.

In [154]:
# get first two rows
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [155]:
# use a boolean array
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Indexing with a boolean `DataFrame`.

In [156]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


Now we'll use this boolean data frame to reassign some values:

In [157]:
# reassign values less than 5 to 0
data[data < 5] = 0

# print
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Using the `ix` indexer to select a subset of rows and columns

In [158]:
# grab the colorado row and the columns two and three
data.ix['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: float64

In [159]:
type(data.ix['Colorado', ['two', 'three']])

pandas.core.series.Series

In [160]:
# do the same, except grab multiple rows and use integers to grab columns
data.ix[['Colorado', 'Utah'], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [162]:
type(data.ix[['Colorado', 'Utah'], [3, 0, 1]])

pandas.core.frame.DataFrame

In [163]:
# grab the row at integer position 2 (Utah)
data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: float64

In [165]:
# grab multiple rows (up to but not including Utah [2])
data.ix[:2]

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7


If label slicing behaved like integer slicing, the following command would return the same two rows. Because label slices include the endpoint, it returns three.

In [167]:
# again, label indexing is inclusive
data.ix[:'Utah']

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11


In [168]:
# grab both rows and columns
data.ix[:2, 'three'] # returns a series

Ohio        0
Colorado    6
Name: three, dtype: float64

In [169]:
data.ix[[1, 3], ['three', 'one']] # returns dataframe

Unnamed: 0,three,one
Colorado,6,0
New York,14,12


In [170]:
data.ix[:, ['three', 'one']] # get all rows and columns three and one

Unnamed: 0,three,one
Ohio,0,0
Colorado,6,0
Utah,10,8
New York,14,12


**Note:**<br>
From what I've seen thus far, the most R-like row and column operations are accomplished using `ix`.
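
Since `ix` has been removed from current pandas, here is a hedged sketch of how I'd expect the selections above to translate: `.loc` for labels and `.iloc` for integer positions (assuming Python 3 and a recent pandas; the variable names are made up, and `data` is rebuilt without the earlier zeroing step).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16.).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]   # labels only
by_position = data.iloc[2]                          # row at position 2
label_slice = data.loc[:'Utah']                     # inclusive endpoint
position_slice = data.iloc[:2]                      # exclusive endpoint
# mixing labels and positions: translate the positions to labels first
mixed = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

print(mixed)
```

Splitting the two indexers removes the ambiguity `ix` had when an index itself contained integers.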

## Arithmetic and data alignment
One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding objects together, if any index pairs are not the same, the index in the result will be the union of the index pairs. Let's look at a simple example:

In [173]:
'''
create two different series with different indexes and add them
together to get a feel for the behavior described above:
    union of indexes, and add common ones
'''

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# print them out, with blank line in between
print s1
print "\n"
print s2

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64


a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [172]:
# add them together (note the data alignment)
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
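
The `NaN` results come from labels that exist on only one side. A sketch of the alternative for `Series`, using the `add` method with `fill_value` so that the missing side counts as 0 (assuming Python 3 and a recent pandas; `plain` and `filled` are illustrative names):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                     # NaN wherever a label is one-sided
filled = s1.add(s2, fill_value=0)   # the missing side is treated as 0

print(plain.isna().sum())   # 3: labels d, f, and g have no partner
print(filled)
```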

In [188]:
# do the same for a dataframe
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# print with a line in between
print df1
print "\n"
print df2

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8


        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11


Adding these together returns a `DataFrame` whose index and columns are the unions of the ones in each `DataFrame`.

In [189]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


In [190]:
# do the same but use the add method and use a fill value
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6,7.0,8,
Ohio,3,1.0,6,5.0
Oregon,9,,10,11.0
Texas,9,4.0,12,8.0
Utah,0,,1,2.0


In [182]:
# one more example using the data frame
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                   columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
                   columns=list('abcde'))

print df1
print "\n"
print df2

   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11


    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19


In [183]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [184]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,2,4,6,4
1,9,11,13,15,9
2,18,20,22,24,14
3,15,16,17,18,19


## Operations between DataFrame and Series
As with NumPy arrays, arithmetic between `DataFrame` and `Series` is well-defined. First, as a motivating example, consider the difference between a 2D array and one of its rows:

In [191]:
# create array
arr = np.arange(12.).reshape((3, 4))

# print
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [192]:
# perform the difference of the array and one of its rows
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

This is referred to as *broadcasting* (covered in Ch. 12). Operations between a `DataFrame` and a `Series` are similar.

In [193]:
# create a dataframe
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# create a series from the first row of the dataframe
series = frame.ix[0]

In [194]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [195]:
series

b    0
d    1
e    2
Name: Utah, dtype: float64

By default, arithmetic between `DataFrame` and `Series` matches the index of the `Series` on the `DataFrame`'s columns, broadcasting down the rows:

In [196]:
frame - series

Unnamed: 0,b,d,e
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


In [197]:
# create another series with a new column and not including 'd' from the original frame
series_2 = pd.Series(range(3), index=['b', 'e', 'f'])

frame + series_2

Unnamed: 0,b,d,e,f
Utah,0,,3,
Ohio,3,,6,
Texas,6,,9,
Oregon,9,,12,


If you want to instead broadcast over the columns, matching on the rows, you have to use an arithmetic method.

In [198]:
series_3 = frame['d']

# print 
series_3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: float64

In [199]:
# use sub method
frame.sub(series_3, axis=0)

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1


The axis that you pass is the *axis to match on*. In this case, we mean to match on the `DataFrame`'s row index and broadcast across.
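
To keep the two directions straight, a sketch that runs both on the same frame and spells the axis out by name (assuming Python 3 and a recent pandas; `first_row`, `col_d`, `by_columns`, and `by_rows` are names invented here):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]   # labels b, d, e
col_d = frame['d']          # labels Utah, Ohio, Texas, Oregon

# match the Series labels on the columns (the default for frame - first_row)
by_columns = frame.sub(first_row, axis='columns')
# match the Series labels on the row index (same as axis=0)
by_rows = frame.sub(col_d, axis='index')

print(by_columns)
print(by_rows)
```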

## Function application and mapping
NumPy ufuncs (element-wise array methods) work fine with pandas objects.

In [200]:
frame = pd.DataFrame(np.random.randn(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])


# print
frame

Unnamed: 0,b,d,e
Utah,-0.934661,0.385713,-0.269228
Ohio,0.614219,0.688997,0.929702
Texas,-0.332592,0.056721,-0.49837
Oregon,-1.37135,0.56988,0.704676


In [201]:
# absolute value, element-wise
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.934661,0.385713,0.269228
Ohio,0.614219,0.688997,0.929702
Texas,0.332592,0.056721,0.49837
Oregon,1.37135,0.56988,0.704676


Another frequent operation is applying a function on 1D arrays to each column or row. `DataFrame`'s `apply` method does exactly this:

In [202]:
# create function to be passed to the apply method
func = lambda x: x.max() - x.min()

In [203]:
# by default, will apply to each column (looping through the rows)
frame.apply(func)

b    1.985569
d    0.632276
e    1.428072
dtype: float64

In [204]:
# you can specify which axis: this time 'do' for each row, looping across columns
frame.apply(func, axis=1)

Utah      1.320374
Ohio      0.315483
Texas     0.555090
Oregon    2.076026
dtype: float64

Let's do a more complicated example that returns a `Series` with multiple values

In [205]:
# this function will return the min and max for the axis in question
def func(x):
    return pd.Series([x.min(), x.max()],
                     index=['min', 'max'])

In [206]:
frame.apply(func)

Unnamed: 0,b,d,e
min,-1.37135,0.056721,-0.49837
max,0.614219,0.688997,0.929702


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in `frame`. You can do this with `applymap`.

In [207]:
format_func = lambda x: '%.2f' % x

In [210]:
frame.applymap(format_func) # need to look into distinction between apply/applymap 

Unnamed: 0,b,d,e
Utah,-0.93,0.39,-0.27
Ohio,0.61,0.69,0.93
Texas,-0.33,0.06,-0.5
Oregon,-1.37,0.57,0.7


`Series` has a `map` method that applies an element-wise function:

In [212]:
frame['e'].map(format_func)

Utah      -0.27
Ohio       0.93
Texas     -0.50
Oregon     0.70
Name: e, dtype: object
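
On the `apply` versus `applymap` question: as I understand it, `apply` hands the function a whole row or column (a `Series`), while `applymap` and `Series.map` call it once per scalar. Newer pandas releases deprecate `applymap` in favor of `DataFrame.map`, so the sketch below avoids both and builds the element-wise case from `Series.map`, which should work across versions (assuming Python 3; `fmt`, `spread`, and `formatted` are made-up names, and the small frame reuses two rows of values from above).

```python
import pandas as pd

frame = pd.DataFrame([[-0.934661, 0.385713], [0.614219, 0.688997]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

def fmt(x):
    """Format one scalar with two decimals."""
    return '%.2f' % x

# apply: the function receives a whole column (a Series) at a time
spread = frame.apply(lambda col: col.max() - col.min())

# element-wise: Series.map calls fmt once per value, column by column
formatted = frame.apply(lambda col: col.map(fmt))

print(spread)
print(formatted)
```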