## Pandas

http://pandas.pydata.org/

A data analysis library that introduces the concepts of data frames and series to Python. It is a powerful tool for time-series analysis and for quick visualization of data.

Excellent introduction by the author: https://vimeo.com/59324550


### Software development rigor applied to spreadsheets

http://www.eusprig.org/horror-stories.htm


See pages 131-132 of the JP Morgan Task Force Report: "...further errors were discovered in the Basel II.5 model, including, most significantly, an operational error in the calculation of the relative changes in hazard rates and correlation estimates. Specifically, after subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended. This error likely had the effect of muting volatility by a factor of two and of lowering the VaR".
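The "factor of two" follows directly from the arithmetic, since the sum of two rates is always twice their average. The sketch below checks this with two made-up hazard rates; the function names and the values of `old_rate` and `new_rate` are illustrative assumptions, not figures or code from the report.

```python
def relative_change_intended(old, new):
    """Difference divided by the average of the two rates (what was intended)."""
    return (new - old) / ((new + old) / 2.0)


def relative_change_as_built(old, new):
    """Difference divided by the sum of the two rates (the reported error)."""
    return (new - old) / (new + old)


# Illustrative values only; not taken from the report.
old_rate, new_rate = 0.02, 0.03

intended = relative_change_intended(old_rate, new_rate)
as_built = relative_change_as_built(old_rate, new_rate)

# Because sum == 2 * average, the erroneous figure is half the intended one.
print(intended, as_built, intended / as_built)
```

Whatever pair of rates is used, the erroneous calculation reports half the intended relative change, which is consistent with the muted volatility the report describes.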

As reported in "A tempest in a spreadsheet" http://ftalphaville.ft.com/2013/01/17/1342082/a-tempest-in-a-spreadsheet/? Lisa Pollack comments: "On a number of occasions, he asked the trader to whom he reported for additional resources to support his work on the VaR model, but he did not receive any. Also it appears that he (had to?) cut a number of corners, which resulted in increased operational risk and artificially low volatility numbers ... pressure was put on the reviewers to get on with approving the model"

- - - 

In [18]:
import pandas as pd
import numpy as np

In [19]:
data = np.random.rand(5, 5)
data

array([[ 0.56915116,  0.35195334,  0.65817698,  0.10906893,  0.6758529 ],
       [ 0.87859716,  0.42080817,  0.68735679,  0.95687725,  0.40429561],
       [ 0.82651561,  0.21193236,  0.74035219,  0.72569012,  0.93913254],
       [ 0.13527403,  0.67155329,  0.5845691 ,  0.52337725,  0.24351929],
       [ 0.71046173,  0.17962049,  0.07790674,  0.09488657,  0.48951089]])

The matrix representation of data provided by numpy often isn't enough. This is where Pandas can help, by introducing the "data-frame", a tabular representation with labelled rows and columns.

In [20]:
pd.DataFrame(data)

          0         1         2         3         4
0  0.569151  0.351953  0.658177  0.109069  0.675853
1  0.878597  0.420808  0.687357  0.956877  0.404296
2  0.826516  0.211932  0.740352  0.725690  0.939133
3  0.135274  0.671553  0.584569  0.523377  0.243519
4  0.710462  0.179620  0.077907  0.094887  0.489511


In [21]:
table = pd.DataFrame(data, columns=["a","b","c","d","e"])
table

          a         b         c         d         e
0  0.569151  0.351953  0.658177  0.109069  0.675853
1  0.878597  0.420808  0.687357  0.956877  0.404296
2  0.826516  0.211932  0.740352  0.725690  0.939133
3  0.135274  0.671553  0.584569  0.523377  0.243519
4  0.710462  0.179620  0.077907  0.094887  0.489511


Note the automatic selection of an index (the left-hand column). The index labels can be used to pull out rows of interest.

In [22]:
table.loc[0]

a    0.569151
b    0.351953
c    0.658177
d    0.109069
e    0.675853
Name: 0, dtype: float64
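Row lookups can be made by label or by position, and the two only coincide while the index is the default 0, 1, 2, ... The sketch below assumes a small hypothetical table with string labels (`rows` and the labels `"x"`, `"y"`, `"z"` are made up for illustration) to show the difference between `.loc` and `.iloc`.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a non-default, string-valued index.
rows = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    columns=["a", "b"],
    index=["x", "y", "z"],
)

by_label = rows.loc["y"]      # select the row whose index label is "y"
by_position = rows.iloc[1]    # select the second row, whatever its label

# Both lookups return the same row as a Series (a=2, b=3).
print(by_label.equals(by_position))
```

Using `.loc` for labels and `.iloc` for positions keeps the intent explicit once the index stops being a simple integer range.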

Similarly we can select data from columns.

In [23]:
table["a"]

0    0.569151
1    0.878597
2    0.826516
3    0.135274
4    0.710462
Name: a, dtype: float64

In [24]:
table[["a","e"]]

          a         e
0  0.569151  0.675853
1  0.878597  0.404296
2  0.826516  0.939133
3  0.135274  0.243519
4  0.710462  0.489511
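The two column selections above return different types: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has only one element. A minimal sketch, using a small stand-in table (`small` is a hypothetical name with arbitrary values) rather than the random data above:

```python
import pandas as pd

# Small stand-in table; the values are arbitrary.
small = pd.DataFrame({"a": [1.0, 2.0], "e": [3.0, 4.0]})

as_series = small["a"]     # single label -> one-dimensional Series
as_frame = small[["a"]]    # list of labels -> two-dimensional DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The distinction matters when later code expects a particular shape, for example when passing the result to a function that requires a two-dimensional input.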


A table behaves much like a numpy array (underneath, its values are stored in numpy ndarrays), meaning we can use boolean (fancy) indexing:

In [25]:
missing_data = table[table > 0.6]
missing_data

          a         b         c         d         e
0       NaN       NaN  0.658177       NaN  0.675853
1  0.878597       NaN  0.687357  0.956877       NaN
2  0.826516       NaN  0.740352  0.725690  0.939133
3       NaN  0.671553       NaN       NaN       NaN
4  0.710462       NaN       NaN       NaN       NaN
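It may help to split this step in two. The comparison `table > 0.6` produces a DataFrame of booleans, and indexing with that mask keeps the values where it is True and marks the rest as missing (NaN), which is the same result `DataFrame.where` gives. The sketch below uses a small hypothetical table (`demo`) so that the result is deterministic; `count` and `fillna` are shown as two common follow-ups, not as part of the original walkthrough.

```python
import pandas as pd

# Hypothetical table with fixed values, standing in for the random data.
demo = pd.DataFrame({"a": [0.2, 0.9], "b": [0.7, 0.1]})

mask = demo > 0.6            # element-wise comparison -> boolean DataFrame
masked = demo[mask]          # values where mask is False become NaN

# Indexing with a boolean DataFrame matches the result of .where(mask).
print(masked.equals(demo.where(mask)))

print(masked.count())        # number of non-missing values per column
print(masked.fillna(0.0))    # replace the missing entries with 0.0
```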
