In [27]:
import pandas as pd
import numpy as np
from numpy.random import randn
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

I've always felt like I never got the hang of data manipulation in pandas. It's not something I can do confidently: I always turn back to my pandas textbook or the documentation. I need a better grasp of it if I am to carry out what I want in my project, otherwise I'm going to struggle often. Time to learn.

## Things I think I need to revisit:
* Hierarchical Indexing (for dealing with pandas data reader results)
* Converting a 2D dataset to Hierarchical Indexing (for now, doing it efficiently is not a goal; simply getting it done is)
* Handling hierarchically indexed data
* Combining and Merging Datasets

The material I'm revisiting can be found [here](https://github.com/wesm/pydata-book).

## Intro to Hierarchical Indexing:

In [5]:
# notice that there are 2 things passed to the index attribute
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data
data.index

a  1    1.308710
   2    0.117625
   3    0.564542
b  1   -1.156377
   3    0.851508
c  1    2.481569
   2    0.081389
d  2   -0.742291
   3    0.479961
dtype: float64

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

When I first saw what was returned for `data.index` I was confused. But after taking a closer look, I think I've got the gist of it.

Each part describes a characteristic of the series' index. `levels` holds the unique labels as a list of lists. The first inner list in `levels` holds the unique labels of the outer level, that is, `a`, `b`, `c` and `d`. The second inner list holds the unique labels of the next (inner) level: `1`, `2` and `3`.

The next part, `codes`, identifies each data point numerically, as positions into `levels`. The first list in `codes` describes the outer index, where `a` maps to `0`, `b` maps to `1`, etc. This is why the first 3 values in the 1st inner list of `codes` are `0`, `0`, `0`: the first 3 data points in the series have an outer index of `a`. By the same logic, the 2nd inner list of `codes` gives, for each data point, the position of its inner label in `[1, 2, 3]`.
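To check that I read `levels` and `codes` correctly, here is a minimal sketch that rebuilds every label by hand. The names `idx`, `outer`, `inner` and `rebuilt` are my own, and the series values are fresh random numbers, so only the index matches the example above.

```python
import numpy as np
import pandas as pd

# Same index layout as the series above; the values are new random numbers.
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
idx = data.index

# Each code is a position into the matching list of unique labels in `levels`.
outer = [idx.levels[0][code] for code in idx.codes[0]]
inner = [idx.levels[1][code] for code in idx.codes[1]]

# Pairing the two lists gives back the (outer, inner) label of every data point.
rebuilt = list(zip(outer, inner))
print(rebuilt == list(idx))

# pandas offers the same lookup per level via get_level_values.
print(list(idx.get_level_values(0)) == outer)
```

Both comparisons should print `True` if the reading of `levels` and `codes` above is right.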

In [7]:
# partial indexing 
data['a'] # notice the return denotes only the inner index.
data['a':'c'] # notice the return indexes everything from index 'a' to index 'c'
data.loc[['a','d']]  # notice the return is ONLY indexes a and d

1    1.308710
2    0.117625
3    0.564542
dtype: float64

a  1    1.308710
   2    0.117625
   3    0.564542
b  1   -1.156377
   3    0.851508
c  1    2.481569
   2    0.081389
dtype: float64

a  1    1.308710
   2    0.117625
   3    0.564542
d  2   -0.742291
   3    0.479961
dtype: float64

In [9]:
data.loc[:,2]

a    0.117625
c    0.081389
d   -0.742291
dtype: float64

So far the series behaves like a flat list whose labels happen to have two parts. The `unstack` method turns it into a DataFrame, with the inner index level becoming the columns.

In [11]:
df = data.unstack()
df

          1         2         3
a  1.308710  0.117625  0.564542
b -1.156377       NaN  0.851508
c  2.481569  0.081389       NaN
d       NaN -0.742291  0.479961


Notice the `NaN` values appearing. That's because some combinations never existed in the series, for example there was no data point with indexes `b` & `2`. Let's change that.

In [21]:
# fill the missing ('b', 2) cell; .loc sets it directly instead of the chained df[2][1]
df.loc['b', 2] = 1.234567
df

          1         2         3
a  1.308710  0.117625  0.564542
b -1.156377  1.234567  0.851508
c  2.481569  0.081389       NaN
d       NaN -0.742291  0.479961


Now turn the DataFrame back into a hierarchically indexed series using `.stack()`.

In [22]:
data = df.stack()
data

a  1    1.308710
   2    0.117625
   3    0.564542
b  1   -1.156377
   2    1.234567
   3    0.851508
c  1    2.481569
   2    0.081389
d  2   -0.742291
   3    0.479961
dtype: float64

Notice the outer index `b` now has 3 inner indexes. Pretty cool, right? Let's go further.
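One more thing on `unstack` and `stack` first. Below is a minimal sketch on a toy series (the name `s` and its values are made up for illustration). By default `unstack` moves the innermost level into the columns; passing `level=0` moves the outer level instead. As long as no combination is missing, `stack` puts everything back the way it was.

```python
import pandas as pd

# Toy series with a complete two-level index (no missing combinations).
s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = s.unstack()               # inner level (1, 2) becomes the columns
wide_outer = s.unstack(level=0)  # outer level ('a', 'b') becomes the columns

# With no gaps in the data, stacking undoes unstacking.
round_trip = wide.stack()
print(list(round_trip.index) == list(s.index))
```

I kept the toy index complete on purpose, so the round trip does not depend on how `stack` treats `NaN` cells.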

In [30]:
# here I am manually setting up a hierarchical index
outside = ['G1','G1','G1','G1','G1','G2','G2','G2']
inside = [1,2,3,4,5,1,2,4]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside,inside)))
df = pd.DataFrame(randn(8,3), hier_index, ['Tom', 'Bob', 'John'])
df

           Tom       Bob      John
G1 1 -1.728367 -0.129310  1.284283
   2  0.994260 -0.464117  2.306496
   3  0.146066 -0.385647  0.490269
   4 -2.137106  2.002461 -0.025219
   5  0.103139 -0.479070 -1.143673
G2 1  0.042306  0.907752 -0.485530
   2 -0.964033  0.964472 -0.728183
   4 -0.624048 -0.447560 -0.759978
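Zipping two lists into tuples is only one way to build the index. As a rough sketch of the alternatives (the variable names `by_tuples`, `by_arrays` and `full_grid` are mine), `from_arrays` takes the parallel lists directly, while `from_product` builds every combination of its inputs and so only fits a complete grid.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 4, 5, 1, 2, 4]

# Route 1: zip the lists into (outer, inner) tuples.
by_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# Route 2: hand over the two parallel lists as they are.
by_arrays = pd.MultiIndex.from_arrays([outside, inside])
print(by_tuples.equals(by_arrays))

# from_product builds all 2 x 5 = 10 combinations, so it cannot express
# the 8 irregular pairs above (G2 has no 3 or 5).
full_grid = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3, 4, 5]])
print(len(full_grid))
```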


In [34]:
# grab from a column
df['Tom']
df[['Tom', 'Bob']]

G1  1   -1.728367
    2    0.994260
    3    0.146066
    4   -2.137106
    5    0.103139
G2  1    0.042306
    2   -0.964033
    4   -0.624048
Name: Tom, dtype: float64

           Tom       Bob
G1 1 -1.728367 -0.129310
   2  0.994260 -0.464117
   3  0.146066 -0.385647
   4 -2.137106  2.002461
   5  0.103139 -0.479070
G2 1  0.042306  0.907752
   2 -0.964033  0.964472
   4 -0.624048 -0.447560


In [40]:
# set some index names
df.index.names = ['Groups', 'Num']
df

Person           Tom       Bob      John
Groups Num
G1     1   -1.728367 -0.129310  1.284283
       2    0.994260 -0.464117  2.306496
       3    0.146066 -0.385647  0.490269
       4   -2.137106  2.002461 -0.025219
       5    0.103139 -0.479070 -1.143673
G2     1    0.042306  0.907752 -0.485530
       2   -0.964033  0.964472 -0.728183
       4   -0.624048 -0.447560 -0.759978


In [41]:
# set the column's name
df.columns.names = ['Person']
df

Person           Tom       Bob      John
Groups Num
G1     1   -1.728367 -0.129310  1.284283
       2    0.994260 -0.464117  2.306496
       3    0.146066 -0.385647  0.490269
       4   -2.137106  2.002461 -0.025219
       5    0.103139 -0.479070 -1.143673
G2     1    0.042306  0.907752 -0.485530
       2   -0.964033  0.964472 -0.728183
       4   -0.624048 -0.447560 -0.759978


In [53]:
# let's grab Group 1, Number 5 for Bob. That's about in the middle. Go step by step.
df.loc['G1'] 
df.loc['G1'].loc[5]
df.loc['G1'].loc[5]['Bob']

Person       Tom       Bob      John
Num
1      -1.728367 -0.129310  1.284283
2       0.994260 -0.464117  2.306496
3       0.146066 -0.385647  0.490269
4      -2.137106  2.002461 -0.025219
5       0.103139 -0.479070 -1.143673


Person
Tom     0.103139
Bob    -0.479070
John   -1.143673
Name: 5, dtype: float64

-0.4790698644663122
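The step-by-step route works, but as far as I can tell the same value can be reached in a single `.loc` call by passing the row label as a tuple. A minimal sketch, rebuilding a frame with the same index and columns (the values are fresh random numbers, and the names `step_by_step`, `direct` and `row` are mine):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
     [1, 2, 3, 4, 5, 1, 2, 4]],
    names=['Groups', 'Num'])
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=['Tom', 'Bob', 'John'])

# Three hops: outer level, then inner level, then column.
step_by_step = df.loc['G1'].loc[5]['Bob']

# One hop: a (Groups, Num) tuple for the row, plus the column label.
direct = df.loc[('G1', 5), 'Bob']
print(step_by_step == direct)

# Dropping the column label returns the whole row as a Series.
row = df.loc[('G1', 5)]
```

The single call also avoids building the intermediate objects that each extra `.loc` creates.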

One thing I like about pandas is its ability to return cross sections of multi-dimensional data. For instance:

In [55]:
# grab Num 2 for everyone, across all groups.
df.xs(2,level='Num')



Person       Tom       Bob      John
Groups
G1      0.994260 -0.464117  2.306496
G2     -0.964033  0.964472 -0.728183
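Notice that `xs` dropped the `Num` level from the result. Here is a small sketch of two ways to keep it, assuming a frame built like the one above (fresh random values; the names `dropped`, `kept` and `sliced` are mine): `drop_level=False` on `xs`, or a `.loc` call with `pd.IndexSlice`.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G1', 'G1', 'G1', 'G2', 'G2', 'G2'],
     [1, 2, 3, 4, 5, 1, 2, 4]],
    names=['Groups', 'Num'])
df = pd.DataFrame(np.random.randn(8, 3), index=index,
                  columns=['Tom', 'Bob', 'John'])

# Default cross section: the selected level disappears from the index.
dropped = df.xs(2, level='Num')

# Same rows, but the Num level stays in the index.
kept = df.xs(2, level='Num', drop_level=False)

# Equivalent selection with .loc: every group (:), Num equal to 2.
sliced = df.loc[pd.IndexSlice[:, 2], :]

print(list(dropped.index))
print(list(kept.index))
```

The `IndexSlice` form relies on the index being sorted, which it is here because the groups and numbers were entered in order.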
