# Hierarchical Indexing

As discussed before, `Series` and `DataFrame` objects are used to work with one-dimensional and two-dimensional data, respectively. Older versions of Pandas also provided the `Panel` and `Panel4D` objects to work with three-dimensional and four-dimensional data (both have since been removed from the library), but it is often more convenient to make use of _hierarchical indexing_ (also known as _multi-indexing_) to incorporate multiple index levels within a single index. This allows higher-dimensional data to be represented using the more familiar `Series` and `DataFrame` objects.

This section explores `MultiIndex` objects for dealing with high-dimensional data.

In [1]:
import numpy as np
import pandas as pd

The following is one (far from optimal) way of representing two-dimensional data within a one-dimensional `Series` object, using plain Python tuples as keys. The book explores in more detail why this is a bad approach.

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

A `MultiIndex` object can be created from tuples such as the ones used for indexing `pop`:

In [3]:
index = pd.MultiIndex.from_tuples(index)
print(index.levels)
print(index.codes)
index

[['California', 'New York', 'Texas'], [2000, 2010]]
[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]


MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

A `MultiIndex` contains multiple levels of indexing (`index.levels`), together with integer codes for each data point (`index.codes`) that record which value of each level that point carries.
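As a rough illustration of how the two attributes fit together, the sketch below rebuilds the row labels by hand, looking up each integer code in its level. The names `small_index` and `rebuilt` are illustrative only and are not used elsewhere in this section.

```python
import pandas as pd

# A small index built the same way as the one above (illustrative data)
small_index = pd.MultiIndex.from_tuples([('California', 2000),
                                         ('California', 2010),
                                         ('Texas', 2000)])

# For every position, take one code per level and look it up in that level
rebuilt = [
    tuple(level[code] for level, code in zip(small_index.levels, codes))
    for codes in zip(*small_index.codes)
]
```

Under these assumptions, `rebuilt` holds the same tuples that were passed to `from_tuples`, which is why the pair (`levels`, `codes`) is enough to describe the whole index.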

It is possible to re-index the original `Series` object `pop` so we can see the hierarchical representation of the data:

In [4]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

The first two columns show the index values at each level, while the last column contains the actual data. A blank entry in the first column means that the value is the same as in the line above it.

To access data for which the second index is 2010, Pandas slicing notation is valid:

In [5]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

This representation of data is equivalent to a simple `DataFrame` object, and with this equivalence in mind Pandas provides us with the method `unstack()` to convert between the two:

In [6]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


The `stack()` method performs the opposite operation:

In [7]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

This equivalence might suggest that a multiply indexed `Series` adds little for two-dimensional data, but the same concept applies to `DataFrame` objects and three-dimensional (or higher!) data. Each level in a multi-index represents an extra dimension of data. If we want to add another column of data for each state at each year, with `MultiIndex` this is as easy as adding another column to the `DataFrame`:

In [8]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


As expected, all the ufuncs and other functionality work with hierarchical indices. This allows for easy and quick manipulation of high-dimensional data:

In [9]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


## Creation of MultiIndex

The most straightforward way to create a multiply indexed `Series` or `DataFrame` object is to pass a list of two or more index arrays to the constructor:

In [10]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.424760  0.293457
  2  0.978867  0.793047
b 1  0.393464  0.640488
  2  0.335885  0.581526


Similarly, a dictionary with appropriate tuples as keys will be automatically recognized by Pandas and a `MultiIndex` will be used:

In [11]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

Alternatively, `MultiIndex` objects can be created explicitly:

In [12]:
# From a list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [13]:
# From a list of tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [14]:
# From a Cartesian product of single indices
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [15]:
# Directly using its internal encoding
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Any of these can be passed as the `index` argument when creating a `Series` or `DataFrame` object, or passed to the `reindex()` method of an existing object.
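A short sketch of that interchangeability, assuming the same toy labels as in the cells above: the three helper constructors produce equal indices, and the result can be handed straight to a `Series`. The variable names are illustrative.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# MultiIndex.equals compares the labels element by element
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)

# Any of them can serve as the index of a new object
s = pd.Series([10, 20, 30, 40], index=from_product)
last_value = s.loc[('b', 2)]
```

Which constructor to use is mostly a matter of what shape the labels already have: parallel arrays, a list of tuples, or separate level values whose full product is wanted.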

### MultiIndex level names

Sometimes it is useful to name the levels of the `MultiIndex`. This can be done either by passing the `names` argument to any of the above constructor methods, or by setting the `names` attribute after construction:

In [16]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
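The other route mentioned above, passing `names` at construction time, might look like the following sketch. The reduced two-state data set and the variable names are assumptions made to keep the example short.

```python
import pandas as pd

# Name the levels while building the index
named_index = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010]],
    names=['state', 'year'],
)
named_pop = pd.Series([33871648, 37253956, 20851820, 25145561],
                      index=named_index)

# Level names can then be used wherever a level number is accepted
by_year = named_pop.unstack(level='year')
```

Referring to levels by name tends to be more readable than positional numbers, especially once an index has more than two levels.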

### MultiIndex for columns

In a `DataFrame`, rows and columns are symmetric, so it shouldn't be much of a surprise that columns can also have multiple index levels (just as rows can):

In [17]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  37.5  29.0  36.2  49.0  35.6
     2      53.0  36.8  42.0  38.7  47.0  35.5
2014 1      21.0  35.9  30.0  36.1  42.0  37.1
     2      45.0  36.5  24.0  38.6  41.0  34.8


The example above uses multi-indexing for both rows and columns. This is essentially four-dimensional data, where the dimensions are:
    
1. The subject
2. The measurement type
3. The year
4. The visit number

To retrieve a full `DataFrame` with just one person's information, we can do:

In [18]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      29.0  36.2
     2      42.0  38.7
2014 1      30.0  36.1
     2      24.0  38.6


## Indexing and Slicing

The next examples will use the multiply indexed `Series` of state populations used earlier:

In [19]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

To access single elements, we can use indexing with multiple terms:

In [20]:
pop['California', 2000]

33871648

Partial indexing is also supported to allow for indexing of just one of the levels in the index. The result is another `Series` with the lower-level indices maintained:

In [21]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

Partial slicing works in a similar way, but it is available only if the `MultiIndex` is sorted:

In [22]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

With sorted indices, partial indexing can be used on lower levels using an empty slice in the first index:

In [23]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
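An alternative spelling of the same lower-level selection is the `xs()` (cross-section) method, which takes the level explicitly. The sketch below uses a reduced copy of the population data; `small_pop` and the other names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
small_pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Empty slice on the first level, label on the second
via_slice = small_pop.loc[:, 2000]

# Same selection, naming the level instead of relying on its position
via_xs = small_pop.xs(2000, level='year')
```

Both results are indexed by `state` only, since the selected `year` level is dropped.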

Other types of indexing and selection will work as well:

In [24]:
# Boolean masks
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [25]:
# Fancy indexing
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

The behavior for `DataFrame` objects is similar. The next examples will use the medical `DataFrame` from before:

In [26]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  37.5  29.0  36.2  49.0  35.6
     2      53.0  36.8  42.0  38.7  47.0  35.5
2014 1      21.0  35.9  30.0  36.1  42.0  37.1
     2      45.0  36.5  24.0  38.6  41.0  34.8


Columns are primary in a `DataFrame`, and the syntax used for multiply indexed `Series` applies to columns. To recover Guido's heart rate, we can do:

In [27]:
health_data['Guido', 'HR']

year  visit
2013  1        29.0
      2        42.0
2014  1        30.0
      2        24.0
Name: (Guido, HR), dtype: float64

The `loc` and `iloc` indexers will work just as well:

In [28]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      42.0  37.5
     2      53.0  36.8


For each individual index in `loc` or `iloc`, we can pass a tuple of multiple indices:

In [29]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        42.0
      2        53.0
2014  1        21.0
      2        45.0
Name: (Bob, HR), dtype: float64

Trying to create a slice within a tuple will lead to a syntax error. To overcome this issue, Pandas provides the `IndexSlice` object:

In [30]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      42.0  29.0  49.0
2014 1      21.0  30.0  42.0
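To make clearer what `IndexSlice` does, the sketch below compares it with explicit `slice(None)` objects, which is what the `:` notation stands for inside a tuple. The small integer-valued frame `df` is a stand-in for `health_data`, chosen so the result is deterministic.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
df = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

idx = pd.IndexSlice

# First visit of every year, heart rate of every subject
with_idx = df.loc[idx[:, 1], idx[:, 'HR']]

# The same selection written with explicit slice objects
with_slice = df.loc[(slice(None), 1), (slice(None), 'HR')]
```

`idx[:, 1]` evaluates to the plain tuple `(slice(None), 1)`, so the two selections are the same; `IndexSlice` only exists to make the slice syntax available inside the brackets.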


## Rearranging Multi-Indices

There are various operations that preserve all the information on the object, but rearrange it for purposes of computations. The `stack()` and `unstack()` methods are examples of that, but there are many more ways to have a finer control of the arrangement of data. 

### Sorted and unsorted indices

To see why many of the `MultiIndex` operations fail if the index is not sorted, we'll create an object whose indices are not lexicographically sorted:

In [31]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.623309
      2      0.642579
c     1      0.566141
      2      0.132618
b     1      0.362324
      2      0.477503
dtype: float64

Trying to take a partial slice of this index results in an error:

In [32]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


This error is a result of the `MultiIndex` not being sorted. Other similar operations also require the levels in the `MultiIndex` to be sorted. Pandas provides the `sort_index()` method to sort the index; the older `sortlevel()` method of `Series` and `DataFrame` has been removed from recent versions in favor of `sort_index(level=...)`.

In [33]:
data = data.sort_index()
data

char  int
a     1      0.623309
      2      0.642579
b     1      0.362324
      2      0.477503
c     1      0.566141
      2      0.132618
dtype: float64

With the index sorted, partial slicing will work properly:

In [34]:
data['a':'b']

char  int
a     1      0.623309
      2      0.642579
b     1      0.362324
      2      0.477503
dtype: float64
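One way to check in advance whether an index is sorted is the `is_monotonic_increasing` attribute. The sketch below repeats the example with fixed integer values instead of random numbers so the outcome is reproducible; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
unsorted = pd.Series(range(6), index=index)

# False: 'c' appears before 'b' in the first level
sorted_before = unsorted.index.is_monotonic_increasing

ordered = unsorted.sort_index()

# True once the index has been sorted
sorted_after = ordered.index.is_monotonic_increasing

# Partial slicing is now safe
sliced = ordered.loc['a':'b']
```

Checking the attribute first avoids relying on catching the exception, which can be convenient when the data comes from an external source with unknown ordering.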

### Stacking and unstacking indices

As demonstrated previously, we can convert a dataset from a stacked multi-index to a simpler two-dimensional representation. The argument `level` allows us to specify the level to use:

In [35]:
pop.unstack(level=0)

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [36]:
pop.unstack(level=1)

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


The `stack()` method can be used to recover the original data:

In [37]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
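Because the levels of `pop` are named, the `level` argument also accepts a name instead of a position. The following sketch, on a reduced copy of the data with illustrative variable names, shows both spellings and the round trip through `stack()`.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
small_pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Level referenced by name and by position
by_name = small_pop.unstack(level='state')
by_position = small_pop.unstack(level=0)

# Unstacking the innermost level and stacking it back restores the values
round_trip = small_pop.unstack(level='year').stack()
```

Here `by_name` and `by_position` are the same table with years as rows and states as columns.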

### Index setting and resetting

The index labels can be turned into columns to rearrange the data in another way. The `reset_index()` method takes care of this rearrangement:

In [38]:
pop_flat = pop.reset_index(name='population')
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


The result is a `DataFrame` with _state_ and _year_ columns holding the information that was formerly in the index. For clarity, the `name` argument can be specified to set the name of the data column.

This is a common representation of raw input data from the real world. It can be useful to create a `MultiIndex` from the column values. This can be achieved using the `set_index()` method of the `DataFrame`, which returns a multiply indexed `DataFrame`:

In [39]:
pop_flat.set_index(['state', 'year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561
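The two methods are inverses of each other, which the following sketch checks on a reduced copy of the data. The names `small_pop`, `flat` and `restored` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
small_pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Index levels become ordinary columns
flat = small_pop.reset_index(name='population')

# Columns become index levels again; selecting the data column gives a Series
restored = flat.set_index(['state', 'year'])['population']
```

The only visible difference after the round trip is that `restored` carries the name `population`, which was given to the data column by `reset_index()`.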


## Data Aggregations

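To work with the built-in data aggregation methods (such as `mean()`, `sum()` and `max()`) on hierarchically indexed data, the parameter `level` can be used to control which subset of the data the aggregate is computed on. Note that this parameter was deprecated in Pandas 1.3 and removed in Pandas 2.0; in recent versions, `groupby(level=...)` followed by the aggregation gives the same result.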

In [40]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  37.5  29.0  36.2  49.0  35.6
     2      53.0  36.8  42.0  38.7  47.0  35.5
2014 1      21.0  35.9  30.0  36.1  42.0  37.1
     2      45.0  36.5  24.0  38.6  41.0  34.8


The following computes the average of the measurements in the two visits each year.

In [41]:
data_mean = health_data.mean(level='year')
data_mean

subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     47.5  37.15  35.5  37.45  48.0  35.55
2014     33.0  36.20  27.0  37.35  41.5  35.95


Using the `axis` keyword allows us to take the mean among levels on the columns as well:

In [42]:
data_mean.mean(axis=1, level='type')

type         HR       Temp
year
2013  43.666667  36.716667
2014  33.833333  36.500000


This results in the average heart rate and temperature measured across all subjects in each year.
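The `level` argument of `mean()` used in the two cells above is not available in Pandas 2.0 and later. A version-independent way to get the same aggregates is to group by the index level first, as in the sketch below. It uses a two-subject subset of the values shown earlier; the frame `health` and the other names are illustrative, and the transpose trick for column levels is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.array([[42.0, 37.5, 49.0, 35.6],
                   [53.0, 36.8, 47.0, 35.5],
                   [21.0, 35.9, 42.0, 37.1],
                   [45.0, 36.5, 41.0, 34.8]])
health = pd.DataFrame(values, index=index, columns=columns)

# Mean over the two visits of each year
yearly = health.groupby(level='year').mean()

# Mean over subjects for each measurement type: transpose so that the
# column levels become row levels, group, then transpose back
by_type = yearly.T.groupby(level='type').mean().T
```

With only Bob and Sue included, the numbers in `by_type` differ from the three-subject table above, but the structure (years as rows, measurement types as columns) is the same.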