In [1]:
import numpy as np
import pandas as pd

# Data Manipulation with Pandas

## Hierarchical Indexing

Often it is useful to go beyond one- and two-dimensional data, that is, to work with data indexed by more than one or two keys.

We will use **hierarchical indexing** or **multi-indexing** to place multiple index *levels* in a single index.

This allows higher-dimensional data to be represented within the familiar Pandas **Series** or **DataFrame**.

Without a multi-index, we might be tempted to use Python tuples as keys, but this ends up being incredibly inefficient for anything beyond looking up one exact key.

In [5]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
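To see why plain tuple keys are awkward, here is a minimal sketch of selecting every 2010 value from a tuple-keyed Series. The names `pop_tuples` and `is_2010` are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Same data as above, keyed by plain Python tuples
keys = [('California', 2000), ('California', 2010),
        ('New York', 2000), ('New York', 2010),
        ('Texas', 2000), ('Texas', 2010)]
values = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
pop_tuples = pd.Series(values, index=keys)

# Selecting every 2010 entry means inspecting each key in a Python-level loop
is_2010 = [key[1] == 2010 for key in pop_tuples.index]
pop_2010 = pop_tuples[is_2010]
print(pop_2010)
```

The loop works, but it is neither as concise nor as fast as slicing on an index level.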

We can create a multi-index from the tuples:

In [7]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

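You can see that the labels encode each data point as a pair of positions into the levels, so [0,0] would be ['California', 2000] and [2,1] would be ['Texas', 2010]. (pandas 0.24 and later call these integer positions `codes` rather than `labels`.)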
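As a rough illustration of that encoding, the sketch below decodes each entry by hand from the `levels` and `codes` attributes. It assumes pandas 0.24 or later, where the attribute is named `codes`; the variable `decoded` is purely illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)])

# One array of integer positions per level
state_codes, year_codes = index.codes

# Look each position up in the corresponding level to recover the original keys
decoded = [(index.levels[0][s], index.levels[1][y])
           for s, y in zip(state_codes, year_codes)]
print(decoded[0])
print(decoded[-1])
```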

If we re-index our series with this MultiIndex, we see the **hierarchical representation** of the data:

In [8]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Now we can use familiar slicing and indexing techniques:

In [9]:
pop[:, 2010] #all data with 2010 as second index

California    37253956
New York      19378102
Texas         25145561
dtype: int64

We could also have stored this data in a simple DataFrame with index and column labels. Pandas has this equivalence built in: the **unstack()** method converts a multi-indexed Series into a conventionally indexed DataFrame.

In [10]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


Naturally, the stack() method provides the opposite operation:

In [11]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
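By default, unstack() moves the innermost index level into the columns. The sketch below, which assumes the levels are named `state` and `year`, shows that either level can be chosen explicitly and that stack() recovers the original values.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

by_year = pop.unstack(level='year')    # states as rows, years as columns
by_state = pop.unstack(level='state')  # years as rows, states as columns

# Stacking the year columns back gives the original values in the original order
round_trip = by_year.stack()
print(by_state)
```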

Why would we need this multi-index, when we could represent it using a DataFrame? With a DataFrame and a multi-index we can start representing 3 or more dimensions of data.

Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.

In [12]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [13]:
f_u18 = pop_df['under18'] / pop_df['total'] #compute fraction of people under 18
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. 

In [16]:
df = pd.DataFrame(np.random.rand(4, 2),
                  # two parallel arrays, paired element-wise: outer level, inner level
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.959199  0.783978
  2  0.163921  0.769899
b 1  0.210819  0.100440
  2  0.020079  0.442412


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [17]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [19]:
#multi-index from a simple list of arrays giving the index values within each level
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [22]:
#list of tuples giving the multiple index values of each point
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [23]:
#from a Cartesian product of single indices

pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

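Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing the available index values for each level) and labels (a list of lists of integer positions that reference these levels). In pandas 0.24 and later this argument is named `codes`, and `labels` was removed in pandas 1.0.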

In [24]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]], 
             labels = [[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
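On pandas 1.0 and later the `labels` keyword is no longer accepted, so the same construction would be spelled with `codes`. A minimal sketch, assuming a recent pandas version (the variable name `mi` is illustrative):

```python
import pandas as pd

# Same internal encoding as above, using the current keyword name
mi = pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
                   codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

# The result matches what from_product builds from the same levels
same = mi.equals(pd.MultiIndex.from_product([['a', 'b'], [1, 2]]))
print(list(mi), same)
```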

Sometimes it is convenient to name the levels of the MultiIndex.

In [25]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

You can see how this would be useful for more complicated sets of data.

So far, we have only looked at using the multiple index for rows, but we can use it with columns as well. 

In [27]:
#hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  38.7  55.0  36.8  19.0  38.2
     2      37.0  35.7  35.0  37.7  45.0  36.9
2014 1      10.0  36.3  53.0  38.0  45.0  35.6
     2      44.0  38.2  22.0  36.7  32.0  38.1


This is four-dimensional data, with dimensions:

- subject
- measurement type
- year
- visit number

In [28]:
health_data['Guido'] #a df containing just Guido's info

type          HR  Temp
year visit
2013 1      55.0  36.8
     2      35.0  37.7
2014 1      53.0  38.0
     2      22.0  36.7


In [30]:
pop #multi-index Series

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [31]:
pop['California', 2000] #access single element

33871648

In [32]:
pop['California'] #partial indexing - returns a series

year
2000    33871648
2010    37253956
dtype: int64

In [33]:
pop.loc['California':'New York'] #only works if MultiIndex is sorted

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

**Aside about sorted and unsorted indices**

In [35]:
#not lexicographically sorted:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.068283
      2      0.674902
c     1      0.496770
      2      0.937884
b     1      0.317819
      2      0.232144
dtype: float64

In [37]:
#try to take a partial slice;
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)
#doesn't work

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


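Partial slices and similar operations require the levels of the MultiIndex to be in sorted (lexicographical) order. Pandas provides convenience routines to perform this type of sorting, such as the **sort_index()** method of a Series or DataFrame. (Older versions also offered **sortlevel()**, which was deprecated in favor of sort_index() and later removed.)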

In [38]:
data = data.sort_index()
data

char  int
a     1      0.068283
      2      0.674902
b     1      0.317819
      2      0.232144
c     1      0.496770
      2      0.937884
dtype: float64

In [41]:
data['a':'b'] #now it works

char  int
a     1      0.068283
      2      0.674902
b     1      0.317819
      2      0.232144
dtype: float64
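Rather than waiting for a slice to fail, one can check whether an index is sorted before slicing. The sketch below uses `is_monotonic_increasing` for that check and stand-in values instead of random data; treat it as one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)  # stand-in values 0.0 .. 5.0

# False here, because 'c' comes before 'b' in the first level
was_sorted = data.index.is_monotonic_increasing

# Sort only when needed, then take the partial slice
sorted_data = data if was_sorted else data.sort_index()
sliced = sorted_data.loc['a':'b']
print(was_sorted, sorted_data.index.is_monotonic_increasing)
print(sliced)
```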

**Back to indexing hierarchical indices**

In [42]:
pop[:, 2000] #index on lower levels by passing an empty slice for the first level

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
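An alternative to the empty-slice form is the `xs()` cross-section method, which selects by level name. A small sketch, assuming the levels are named `state` and `year` as above:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

# Select year 2000 across all states; the 'year' level is dropped from the result
pop_2000 = pop.xs(2000, level='year')
print(pop_2000)
```

Naming the level explicitly can be easier to read when an index has more than two levels.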

In [43]:
pop[pop > 22000000] #masking

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [44]:
pop[['California', 'Texas']] #fancy indexing

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

In [46]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  38.7  55.0  36.8  19.0  38.2
     2      37.0  35.7  35.0  37.7  45.0  36.9
2014 1      10.0  36.3  53.0  38.0  45.0  35.6
     2      44.0  38.2  22.0  36.7  32.0  38.1


In [47]:
health_data['Guido', 'HR'] #grab Guido's heart rate data

year  visit
2013  1        55.0
      2        35.0
2014  1        53.0
      2        22.0
Name: (Guido, HR), dtype: float64

In [54]:
health_data.iloc[:2, :2] #positional indexing: first two rows and first two columns
#positions count individual rows and columns, regardless of how they group under the higher levels

subject      Bob
type          HR  Temp
year visit
2013 1      30.0  38.7
     2      37.0  35.7
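For label-based selection on both the row and column levels at once, `pd.IndexSlice` can be combined with `loc`. The sketch below rebuilds the frame from the values displayed above (the notebook's data is random, so these numbers are copied from the output, and the name `health` is illustrative) and picks the heart-rate columns for the first visit of each year.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = [[30.0, 38.7, 55.0, 36.8, 19.0, 38.2],
          [37.0, 35.7, 35.0, 37.7, 45.0, 36.9],
          [10.0, 36.3, 53.0, 38.0, 45.0, 35.6],
          [44.0, 38.2, 22.0, 36.7, 32.0, 38.1]]
health = pd.DataFrame(values, index=index, columns=columns)

idx = pd.IndexSlice
# Rows: any year, visit 1. Columns: any subject, type 'HR'.
first_visit_hr = health.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```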


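The reset_index() method turns the index levels into ordinary columns. Its name argument specifies the name of the column that will hold the data; the first two columns will contain the former index levels, labeled with the index names.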

In [59]:
pop_flat = pop.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


The following allows us to build a MultiIndex from columns in the dataframe.

In [60]:
pop_flat.set_index(['state', 'year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


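This type of reindexing is one of the more useful patterns when working with real-world datasets.

The Pandas data aggregation methods (such as mean(), sum(), and max()) can be passed a level parameter to control which subset of the data the aggregate is computed on. (This parameter was deprecated in pandas 1.3 and removed in 2.0; on newer versions, group by the level with groupby(level=...) instead.)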

In [61]:
data_mean = health_data.mean(level = 'year')
data_mean

subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     33.5  37.20  45.0  37.25  32.0  37.55
2014     27.0  37.25  37.5  37.35  38.5  36.85


In [62]:
#using the axis parameter, we now take the mean across columns of the same type,
#for each year
data_mean.mean(axis=1, level='type')

type         HR       Temp
year
2013  36.833333  37.333333
2014  34.333333  37.150000


This shows the average heart rate and temperature measured among all subjects in all visits each year. 
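On pandas 2.0 and later, where aggregation methods no longer accept `level`, the same two-step aggregation can be sketched with `groupby(level=...)`. The frame below is rebuilt from the values displayed earlier, and the names `health`, `by_year`, and `by_year_type` are illustrative. Transposing before grouping the columns is one way to avoid `groupby(axis=1)`, which is deprecated in recent versions.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = [[30.0, 38.7, 55.0, 36.8, 19.0, 38.2],
          [37.0, 35.7, 35.0, 37.7, 45.0, 36.9],
          [10.0, 36.3, 53.0, 38.0, 45.0, 35.6],
          [44.0, 38.2, 22.0, 36.7, 32.0, 38.1]]
health = pd.DataFrame(values, index=index, columns=columns)

# Average over visits within each year (rows grouped by the 'year' level)
by_year = health.groupby(level='year').mean()

# Average over subjects within each measurement type:
# transpose so the column levels become row levels, group, then transpose back
by_year_type = by_year.T.groupby(level='type').mean().T
print(by_year_type)
```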