![rmotr](https://user-images.githubusercontent.com/7065401/39119486-4718e386-46ec-11e8-9fc3-5250a49ef570.png)
<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/39119455-2f3eb97a-46ec-11e8-811d-04a491229648.jpg"
    style="width:300px; float: right; margin: 0 40px 40px 40px;"></img>

# Hierarchical Indexing

[Examples of MultiIndexes](https://docs.google.com/spreadsheets/d/15cjM5XY3jQ4IHEUZA41IFLYCJoupMi_NYhBRKCjJroU/edit?usp=sharing)

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)

## Hands on! 

In [1]:
import numpy as np
import pandas as pd

In [2]:
temps = pd.Series([
    30.7,
    29.6,
    29.8,
    30.8,
    31.3,
    29.2,
    30.9,
    31.4,
    29.2,
    29.9,
    30.3,
    30.7,
    21.3,
    21.7,
    22.4,
    21.2,
    20.3,
    21.8
], index=[
    ['NYC', 'NYC', 'NYC', 'NYC', 'NYC', 'NYC',
     'TKY', 'TKY', 'TKY', 'TKY', 'TKY', 'TKY',
     'SF', 'SF', 'SF', 'SF', 'SF', 'SF'],
    [2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017]
], name='Max Temperatures')

In [3]:
temps

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
     2016    31.3
     2017    29.2
TKY  2012    30.9
     2013    31.4
     2014    29.2
     2015    29.9
     2016    30.3
     2017    30.7
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
     2016    20.3
     2017    21.8
Name: Max Temperatures, dtype: float64

In [None]:
# A Series can hold multidimensional data because its index can have multiple levels
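
As a quick check of that idea, here is a minimal sketch on a four-row subset of the data above (the variable name `sample` is only illustrative). The values stay one-dimensional; it is the index that carries two levels:

```python
import pandas as pd

# Illustrative four-row subset of the temperatures above
sample = pd.Series(
    [30.7, 29.6, 21.3, 21.7],
    index=[['NYC', 'NYC', 'SF', 'SF'], [2012, 2013, 2012, 2013]],
)

print(sample.ndim)            # the values are still 1-dimensional
print(sample.index.nlevels)   # but the index has two levels
print(sample.index.get_level_values(0).tolist())  # outer label of every row
```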

![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### Simple Selection and Slicing

In [4]:
temps['SF']

2012    21.3
2013    21.7
2014    22.4
2015    21.2
2016    20.3
2017    21.8
Name: Max Temperatures, dtype: float64

In [5]:
# Multiple keys (one per index level), separated by commas
temps['SF', 2016]

20.3

In [6]:
temps['SF'].loc[2012:2015]

2012    21.3
2013    21.7
2014    22.4
2015    21.2
Name: Max Temperatures, dtype: float64
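
The same selections can also be written with `.loc`, which some readers may find more explicit. The sketch below uses a small illustrative Series (`sample` is not part of the notebook) and adds one case not shown above: picking a single year across every city.

```python
import pandas as pd

sample = pd.Series(
    [30.7, 29.6, 21.3, 21.7],
    index=[['NYC', 'NYC', 'SF', 'SF'], [2012, 2013, 2012, 2013]],
)

sf_all_years = sample['SF']             # partial key: outer level only
sf_2013 = sample.loc[('SF', 2013)]      # full key, passed as a tuple
all_cities_2013 = sample.loc[:, 2013]   # every city, one year

print(sf_all_years)
print(sf_2013)
print(all_cities_2013)
```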

Traditional slicing within the multi-indexed series is possible only if **the index is sorted:**

In [6]:
temps.sort_index(inplace=True)
temps

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
     2016    31.3
     2017    29.2
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
     2016    20.3
     2017    21.8
TKY  2012    30.9
     2013    31.4
     2014    29.2
     2015    29.9
     2016    30.3
     2017    30.7
Name: Max Temperatures, dtype: float64

In [8]:
temps.loc['NYC':'SF', 2012:2015]

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
Name: Max Temperatures, dtype: float64
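
To see why sorting matters, the following sketch builds a small Series whose outer level is deliberately out of order (`unsorted` and `slice_failed` are illustrative names). In the pandas versions I am aware of, slicing it raises `UnsortedIndexError`, which is a subclass of `KeyError`; the same slice works after `sort_index()`.

```python
import pandas as pd

# Outer level order is NYC, TKY, SF, so the index is not sorted
unsorted = pd.Series(
    [30.7, 29.6, 30.9, 31.4, 21.3, 21.7],
    index=[['NYC', 'NYC', 'TKY', 'TKY', 'SF', 'SF'],
           [2012, 2013, 2012, 2013, 2012, 2013]],
)
print(unsorted.index.is_monotonic_increasing)

try:
    unsorted.loc['NYC':'SF']
    slice_failed = False
except KeyError as exc:  # UnsortedIndexError derives from KeyError
    slice_failed = True
    print(type(exc).__name__)

# sort_index() returns a sorted copy; slicing now succeeds
subset = unsorted.sort_index().loc['NYC':'SF']
print(subset)
```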

The index of our series is of type `MultiIndex`:

In [8]:
# This shows that the index does not repeat data (e.g. storing 2012 three times),
# which would waste memory: each level keeps its unique values once
temps.index

MultiIndex(levels=[['NYC', 'SF', 'TKY'], [2012, 2013, 2014, 2015, 2016, 2017]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]])
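
The output above comes from an older pandas release. Since pandas 0.24 the `labels` attribute is called `codes`, so on a recent installation the two parts can be inspected as in this small sketch (`idx` is an illustrative name):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['NYC', 'SF', 'TKY'], [2012, 2013]])

# Each level stores its unique values once...
print(idx.levels[0].tolist())
print(idx.levels[1].tolist())

# ...and the codes are integer positions into those levels, one per row
print(idx.codes[0].tolist())
print(idx.codes[1].tolist())
```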

We can create these indexes with several methods, for example:

In [9]:
index = pd.MultiIndex.from_product([['NYC', 'TKY', 'SF'], [2012, 2013, 2014, 2015, 2016, 2017]])

In [10]:
temps2 = pd.Series([
    30.7,
    29.6,
    29.8,
    30.8,
    31.3,
    29.2,
    30.9,
    31.4,
    29.2,
    29.9,
    30.3,
    30.7,
    21.3,
    21.7,
    22.4,
    21.2,
    20.3,
    21.8
], index=index, name='Max Temperatures')

In [11]:
temps2

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
     2016    31.3
     2017    29.2
TKY  2012    30.9
     2013    31.4
     2014    29.2
     2015    29.9
     2016    30.3
     2017    30.7
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
     2016    20.3
     2017    21.8
Name: Max Temperatures, dtype: float64

In [12]:
temps2.sort_index(inplace=True)

In [13]:
temps.equals(temps2)

True
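
`from_product` is only one of the available constructors. As a sketch, the same index can also be built from parallel arrays or from a list of tuples; the variable names below are illustrative.

```python
import pandas as pd

from_product = pd.MultiIndex.from_product([['NYC', 'SF'], [2016, 2017]])

# One array per level, aligned row by row
from_arrays = pd.MultiIndex.from_arrays(
    [['NYC', 'NYC', 'SF', 'SF'], [2016, 2017, 2016, 2017]]
)

# One tuple per row
from_tuples = pd.MultiIndex.from_tuples(
    [('NYC', 2016), ('NYC', 2017), ('SF', 2016), ('SF', 2017)]
)

print(from_product.equals(from_arrays))
print(from_arrays.equals(from_tuples))
```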

Each "index" in a MultiIndex is called a "level", and each level can have a name:

In [14]:
temps.index.names = ['City', 'Year']

In [15]:
temps

City  Year
NYC   2012    30.7
      2013    29.6
      2014    29.8
      2015    30.8
      2016    31.3
      2017    29.2
SF    2012    21.3
      2013    21.7
      2014    22.4
      2015    21.2
      2016    20.3
      2017    21.8
TKY   2012    30.9
      2013    31.4
      2014    29.2
      2015    29.9
      2016    30.3
      2017    30.7
Name: Max Temperatures, dtype: float64
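
Named levels are useful beyond display: several pandas methods accept a `level` argument that takes the name instead of a position. A minimal sketch on illustrative data (`sample`, `in_2013` and `city_max` are not part of the notebook):

```python
import pandas as pd

sample = pd.Series(
    [30.7, 29.6, 21.3, 21.7],
    index=pd.MultiIndex.from_product(
        [['NYC', 'SF'], [2012, 2013]], names=['City', 'Year']
    ),
)

# Cross-section: every city for one year, selected by level name
in_2013 = sample.xs(2013, level='Year')

# Aggregate over one level by name
city_max = sample.groupby(level='City').max()

print(in_2013)
print(city_max)
```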

Notice that the same data could have been structured as a DataFrame. The two structures are related, and you can go from one to the other with the `unstack` and `stack` methods:

In [16]:
temps.unstack()

Year  2012  2013  2014  2015  2016  2017
City
NYC   30.7  29.6  29.8  30.8  31.3  29.2
SF    21.3  21.7  22.4  21.2  20.3  21.8
TKY   30.9  31.4  29.2  29.9  30.3  30.7


In [17]:
temps.unstack(level=0)

City   NYC    SF   TKY
Year
2012  30.7  21.3  30.9
2013  29.6  21.7  31.4
2014  29.8  22.4  29.2
2015  30.8  21.2  29.9
2016  31.3  20.3  30.3
2017  29.2  21.8  30.7


The `stack()` method reconstructs the MultiIndexed Series:

In [18]:
temps.unstack().stack()

City  Year
NYC   2012    30.7
      2013    29.6
      2014    29.8
      2015    30.8
      2016    31.3
      2017    29.2
SF    2012    21.3
      2013    21.7
      2014    22.4
      2015    21.2
      2016    20.3
      2017    21.8
TKY   2012    30.9
      2013    31.4
      2014    29.2
      2015    29.9
      2016    30.3
      2017    30.7
dtype: float64

In [19]:
temps.unstack(level=0).stack()

Year  City
2012  NYC     30.7
      SF      21.3
      TKY     30.9
2013  NYC     29.6
      SF      21.7
      TKY     31.4
2014  NYC     29.8
      SF      22.4
      TKY     29.2
2015  NYC     30.8
      SF      21.2
      TKY     29.9
2016  NYC     31.3
      SF      20.3
      TKY     30.3
2017  NYC     29.2
      SF      21.8
      TKY     30.7
dtype: float64
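
Once the levels have names, `unstack` can also take the level name rather than its position, which may read more clearly. The sketch below uses illustrative data and variable names and checks that the round trip preserves the values:

```python
import pandas as pd

sample = pd.Series(
    [30.7, 29.6, 21.3, 21.7],
    index=pd.MultiIndex.from_product(
        [['NYC', 'SF'], [2012, 2013]], names=['City', 'Year']
    ),
)

wide = sample.unstack(level='Year')   # cities as rows, years as columns
restored = wide.stack()               # back to a MultiIndexed Series

print(wide)
print(restored)
```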

![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### MultiIndexed DataFrames

It's also perfectly valid to construct a MultiIndexed DataFrame:

In [21]:
s = """
-3.9
-3.6
-4.5
-5
-5.8
-4.2
0.7
1.1
0.2
-0.5
-1.2
0.8
5.8
6.6
4.9
6.4
5.5
7.2"""

In [22]:
[float(v) for v in s.split('\n') if v]

[-3.9,
 -3.6,
 -4.5,
 -5.0,
 -5.8,
 -4.2,
 0.7,
 1.1,
 0.2,
 -0.5,
 -1.2,
 0.8,
 5.8,
 6.6,
 4.9,
 6.4,
 5.5,
 7.2]

In [23]:
df = pd.DataFrame({
    'Max Temperatures': [
        30.7,
        29.6,
        29.8,
        30.8,
        31.3,
        29.2,
        30.9,
        31.4,
        29.2,
        29.9,
        30.3,
        30.7,
        21.3,
        21.7,
        22.4,
        21.2,
        20.3,
        21.8
    ],
    'Min Temperatures': [
        -3.9,
        -3.6,
        -4.5,
        -5.0,
        -5.8,
        -4.2,
        0.7,
        1.1,
        0.2,
        -0.5,
        -1.2,
        0.8,
        5.8,
        6.6,
        4.9,
        6.4,
        5.5,
        7.2
    ]
}, index=[
    ['NYC', 'NYC', 'NYC', 'NYC', 'NYC', 'NYC',
     'TKY', 'TKY', 'TKY', 'TKY', 'TKY', 'TKY',
     'SF', 'SF', 'SF', 'SF', 'SF', 'SF'],
    [2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017,
     2012, 2013, 2014, 2015, 2016, 2017]
])

In [24]:
df

          Max Temperatures  Min Temperatures
NYC 2012              30.7              -3.9
    2013              29.6              -3.6
    2014              29.8              -4.5
    2015              30.8              -5.0
    2016              31.3              -5.8
    2017              29.2              -4.2
TKY 2012              30.9               0.7
    2013              31.4               1.1
    2014              29.2               0.2
    2015              29.9              -0.5


In [25]:
df.sort_index(inplace=True)

In [26]:
df.loc['NYC']

      Max Temperatures  Min Temperatures
2012              30.7              -3.9
2013              29.6              -3.6
2014              29.8              -4.5
2015              30.8              -5.0
2016              31.3              -5.8
2017              29.2              -4.2


In [27]:
df.loc['NYC', 'Max Temperatures']

2012    30.7
2013    29.6
2014    29.8
2015    30.8
2016    31.3
2017    29.2
Name: Max Temperatures, dtype: float64

In [28]:
# A list inside the tuple selects several labels of the inner level
df.loc[('NYC', [2012, 2015]), 'Max Temperatures']

NYC  2012    30.7
     2015    30.8
Name: Max Temperatures, dtype: float64

More complex slicing needs to be performed with the [`IndexSlice`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.IndexSlice.html) object from pandas or the [`slice`](https://docs.python.org/3/library/functions.html#slice) builtin function. Both describe one slice per index level; `IndexSlice` simply lets you write them with the familiar `start:stop` syntax.

In [29]:
df.loc[('NYC', slice(2012, 2015)), 'Max Temperatures']

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
Name: Max Temperatures, dtype: float64

In [30]:
df.loc[pd.IndexSlice['NYC', 2012: 2015], 'Max Temperatures']

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
Name: Max Temperatures, dtype: float64

In [31]:
df.loc[(slice('NYC', 'SF'), slice(2012, 2015)), 'Max Temperatures']

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
Name: Max Temperatures, dtype: float64

In [32]:
df.loc[pd.IndexSlice['NYC': 'SF', 2012: 2015], 'Max Temperatures']

NYC  2012    30.7
     2013    29.6
     2014    29.8
     2015    30.8
SF   2012    21.3
     2013    21.7
     2014    22.4
     2015    21.2
Name: Max Temperatures, dtype: float64
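
A common case the cells above do not cover is slicing the inner level while keeping every value of the outer one. As a sketch on illustrative data (`frame`, the `Max` column and its values are made up), an empty slice `:` or `slice(None)` stands for "everything" at that level:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {'Max': np.arange(9.0)},  # placeholder values 0.0 .. 8.0
    index=pd.MultiIndex.from_product(
        [['NYC', 'SF', 'TKY'], [2012, 2013, 2014]], names=['City', 'Year']
    ),
)

# All cities, years 2013 through 2014 (label slices include both ends)
with_index_slice = frame.loc[pd.IndexSlice[:, 2013:2014], :]
with_builtin = frame.loc[(slice(None), slice(2013, 2014)), :]

print(with_index_slice)
print(with_index_slice.equals(with_builtin))
```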

![separator1](https://user-images.githubusercontent.com/7065401/39119545-6d73d9aa-46ec-11e8-98d3-40204614f000.png)

### MultiIndex and MultiColumns DataFrames

DataFrames can also have hierarchical (MultiIndexed) columns:

In [20]:
df = pd.DataFrame(
    np.random.randn(4, 4),
    index=[['NYC', 'NYC', 'SF', 'SF'], [2016, 2017, 2016, 2017]],
    columns=[
        ['Temperatures', 'Temperatures', 'Humidity', 'Humidity'],
        ['Max Temperature', 'Min Temperature', 'Max Humidity', 'Min Humidity']
    ]
)
df

            Temperatures                     Humidity
         Max Temperature Min Temperature Max Humidity Min Humidity
NYC 2016         0.58163         -0.3821    -0.120277    -0.182476
    2017        0.619626        1.026001     1.195437     -0.72204
SF  2016        0.815296        0.162034       0.2566       0.3902
    2017       -1.174726         0.75658    -0.845335      -1.7371


In [25]:
df.columns

MultiIndex(levels=[['Humidity', 'Temperatures'], ['Max Humidity', 'Max Temperature', 'Min Humidity', 'Min Temperature']],
           labels=[[1, 1, 0, 0], [1, 3, 0, 2]])

In [21]:
df.loc['NYC']

        Temperatures                     Humidity
     Max Temperature Min Temperature Max Humidity Min Humidity
2016         0.58163         -0.3821    -0.120277    -0.182476
2017        0.619626        1.026001     1.195437     -0.72204


In [22]:
df.loc['NYC', 2016]

Temperatures  Max Temperature    0.581630
              Min Temperature   -0.382100
Humidity      Max Humidity      -0.120277
              Min Humidity      -0.182476
Name: (NYC, 2016), dtype: float64

In [23]:
df.loc[('NYC', 2016), ('Temperatures',)]

Max Temperature    0.58163
Min Temperature   -0.38210
Name: (NYC, 2016), dtype: float64

In [37]:
df.loc[('NYC', 2016), ('Temperatures', 'Max Temperature')]

0.020022795979633346
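
The column levels can be selected in much the same way as the row levels. The sketch below rebuilds a frame of the same shape with deterministic placeholder values instead of random ones (so results are reproducible) and sorts the columns first, which, as far as I know, avoids the lexsort-depth performance warning pandas can emit on unsorted MultiIndexes. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16.0).reshape(4, 4),  # placeholder values 0.0 .. 15.0
    index=[['NYC', 'NYC', 'SF', 'SF'], [2016, 2017, 2016, 2017]],
    columns=[
        ['Temperatures', 'Temperatures', 'Humidity', 'Humidity'],
        ['Max Temperature', 'Min Temperature', 'Max Humidity', 'Min Humidity'],
    ],
).sort_index(axis=1)

# Outer column level: returns the two temperature columns
temps_only = frame['Temperatures']

# Inner column level: cross-section along the columns
max_humidity = frame.xs('Max Humidity', axis=1, level=1)

# Full row key and full column key: a single value
value = frame.loc[('SF', 2016), ('Humidity', 'Min Humidity')]

print(temps_only)
print(max_humidity)
print(value)
```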

**It's not recommended to have so many nested Indexes and Columns**
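
One possible way to act on that advice is to flatten the structure: join the column levels into plain names and move the row levels into ordinary columns. This is only a sketch with made-up data and naming (`nested`, `flat`, and the `'Max Temperatures'`-style labels are assumptions for illustration), not the only reasonable layout.

```python
import numpy as np
import pandas as pd

nested = pd.DataFrame(
    np.arange(8.0).reshape(2, 4),  # placeholder values
    index=pd.MultiIndex.from_tuples(
        [('NYC', 2016), ('SF', 2016)], names=['City', 'Year']
    ),
    columns=[
        ['Temperatures', 'Temperatures', 'Humidity', 'Humidity'],
        ['Max', 'Min', 'Max', 'Min'],
    ],
)

flat = nested.copy()
# Each column label is an (outer, inner) tuple; join it into one string
flat.columns = [f'{inner} {outer}' for outer, inner in nested.columns]
# Turn the City and Year index levels into regular columns
flat = flat.reset_index()

print(flat)
```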

![separator2](https://user-images.githubusercontent.com/7065401/39119518-59fa51ce-46ec-11e8-8503-5f8136558f2b.png)