In [1]:
import pandas as pd
import numpy as np

# MultiIndex / advanced indexing

## 1. Hierarchical indexing (MultiIndex)

Hierarchical (multi-level) indexing opens the door to some quite sophisticated data analysis and manipulation, especially when working with higher-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as `Series` (1-D) and `DataFrame` (2-D).


In this section, we show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described in prior sections. Later, when discussing `groupby` and pivoting and reshaping data, we'll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

### Creating a MultiIndex (hierarchical index) object

The `MultiIndex` object is the hierarchical analogue of the standard `Index` object which typically stores the axis labels in pandas objects. You can think of `MultiIndex` as an array of tuples where each tuple is unique. A `MultiIndex` can be created from a list of arrays (using `MultiIndex.from_arrays()`), an array of tuples (using `MultiIndex.from_tuples()`), a crossed set of iterables (using `MultiIndex.from_product()`), or a `DataFrame` (using `MultiIndex.from_frame()`). The `Index` constructor will attempt to return a `MultiIndex` when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
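The paragraph above mentions two routes that the cells in this section do not exercise directly: `MultiIndex.from_arrays()` and the plain `Index` constructor applied to a list of tuples. The following is a minimal sketch of both, using a shortened version of the labels from this notebook; the variable names `mi_arrays` and `mi_tuples` are only illustrative.

```python
import pandas as pd

# One list per level; positions line up across the two lists.
arrays = [["국고", "국고", "산금", "산금"],
          ["단기", "장기", "단기", "장기"]]

# from_arrays: each inner list becomes one level of the index.
mi_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

# A list of tuples passed to the plain Index constructor also gives a MultiIndex.
tuples = list(zip(*arrays))
mi_tuples = pd.Index(tuples)

print(type(mi_tuples).__name__)
print(list(mi_arrays) == list(mi_tuples))
```

Both objects hold the same tuples; only `mi_arrays` carries level names here, because none were set on `mi_tuples`.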

In [2]:
#FROM AN ARRAY OF TUPLES
arrays = [['국고', '국고', '산금', '산금', '중금', '중금', '시은', '시은'],
          ['단기', '장기', '단기', '장기', '단기', '장기', '단기', '장기']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names = ['first', 'second'])
index

MultiIndex([('국고', '단기'),
            ('국고', '장기'),
            ('산금', '단기'),
            ('산금', '장기'),
            ('중금', '단기'),
            ('중금', '장기'),
            ('시은', '단기'),
            ('시은', '장기')],
           names=['first', 'second'])

In [3]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
국고     단기       -1.037331
       장기        1.534210
산금     단기       -1.478862
       장기       -1.099519
중금     단기       -0.872887
       장기       -1.895303
시은     단기        0.396818
       장기        0.194559
dtype: float64

When you want _every pairing_ of the elements in two iterables, it can be easier to use the `MultiIndex.from_product()` method:

In [4]:
#FROM A PRODUCT OF ITERABLES
iters = [  ['국고', '산금', '시은', '공사'], ['단기', '장기']  ]

In [5]:
pd.MultiIndex.from_product(iters, names=['first', 'second'])

MultiIndex([('국고', '단기'),
            ('국고', '장기'),
            ('산금', '단기'),
            ('산금', '장기'),
            ('시은', '단기'),
            ('시은', '장기'),
            ('공사', '단기'),
            ('공사', '장기')],
           names=['first', 'second'])

You can also construct a `MultiIndex` from a `DataFrame` directly, using the method `MultiIndex.from_frame()`. This is a complementary method to `MultiIndex.to_frame()`.

In [6]:
#FROM DF
df = pd.DataFrame([['국고','단기'], ['국고', '장기'],
                   ['산금', '단기'],['산금', '장기']],
                  columns = ['first', 'second'])
pd.MultiIndex.from_frame(df)

MultiIndex([('국고', '단기'),
            ('국고', '장기'),
            ('산금', '단기'),
            ('산금', '장기')],
           names=['first', 'second'])

As a convenience, you can pass a list of arrays directly into `Series` or `DataFrame` to construct a `MultiIndex` **automatically**:


In [7]:
arrays = [np.array(['국고', '국고', '산금', '산금', '중금', '중금', '시은', '시은']),
          np.array(['단기', '장기', '단기', '장기', '단기', '장기', '단기', '장기'])]


In [8]:
s = pd.Series(np.random.randn(8), index=arrays)
s

국고  단기   -1.700901
    장기   -0.185789
산금  단기    0.286530
    장기    1.896629
중금  단기   -0.735264
    장기    1.718560
시은  단기    1.067562
    장기   -0.682318
dtype: float64

In [9]:
df = pd.DataFrame(np.random.randn(8,4), index=arrays)

In [10]:
df

              0         1         2         3
국고 단기  0.611997 -0.757712  0.723357  1.368304
   장기 -1.374384 -1.746542  0.752819 -2.142924
산금 단기  1.240711 -1.699241  0.646897 -0.876645
   장기 -2.119474  0.560222  0.534343  0.694694
중금 단기  0.776006 -0.373566  0.562785 -0.150195
   장기 -0.581905  1.326178  0.653509  0.797515
시은 단기  1.588489 -1.734855  0.544505 -0.182831
   장기  1.764950  0.736577  1.144645 -0.050560


All of the `MultiIndex` constructors accept a `names` argument, which stores string names for the levels themselves. If no names are provided, `None` will be assigned:

In [11]:
df.index.names

FrozenList([None, None])

In [12]:
df.index.names = ['섹터','만기']

In [13]:
df

              0         1         2         3
섹터 만기
국고 단기  0.611997 -0.757712  0.723357  1.368304
   장기 -1.374384 -1.746542  0.752819 -2.142924
산금 단기  1.240711 -1.699241  0.646897 -0.876645
   장기 -2.119474  0.560222  0.534343  0.694694
중금 단기  0.776006 -0.373566  0.562785 -0.150195
   장기 -0.581905  1.326178  0.653509  0.797515
시은 단기  1.588489 -1.734855  0.544505 -0.182831
   장기  1.764950  0.736577  1.144645 -0.050560
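Level names do not have to be assigned after the fact. As a small sketch (the variable names `named`, `unnamed` and `renamed` are illustrative, not from the original notebook), the same names can be passed at construction time through `names`, or attached later with `set_names()`, which returns a new index instead of modifying the existing one:

```python
import pandas as pd

levels = [["국고", "산금"], ["단기", "장기"]]

# Names supplied up front through the `names` argument.
named = pd.MultiIndex.from_product(levels, names=["섹터", "만기"])

# Without `names`, every level name is None.
unnamed = pd.MultiIndex.from_product(levels)

# set_names() returns a renamed copy; `unnamed` itself is left untouched.
renamed = unnamed.set_names(["섹터", "만기"])

print(list(named.names))
print(list(unnamed.names))
print(list(renamed.names))
```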


This index can back any axis of a pandas object, and the number of **levels** of the index is up to you:

In [14]:
df = pd.DataFrame(np.random.randn(3,8), index=['A','B','C'], columns=index)
df

first         국고                  산금                  중금                  시은
second        단기        장기        단기        장기        단기        장기        단기        장기
A      -0.876667  0.966092  1.120699 -0.445573 -0.725877 -0.829777  1.980688 -2.445532
B      -0.503500 -0.880040 -0.206314 -1.258661 -1.345779 -0.528776 -1.555450 -0.741435
C       1.718268  2.243557  0.545362  2.008662  0.121040 -0.198562  0.253164  0.085287


In [15]:
pd.DataFrame(np.random.randn(6,6), index=index[:6], columns=index[:6])

first               국고                  산금                  중금
second              단기        장기        단기        장기        단기        장기
first second
국고    단기      0.709946 -1.092728 -0.531883 -0.558539  0.189125  1.582518
      장기      0.462457 -0.563833  0.498153  0.402449 -0.098995 -0.203256
산금    단기     -2.034993 -0.592813 -0.756421  0.047621 -1.031217 -0.849764
      장기      1.385972  1.163836  0.066959  0.649704 -0.517988 -0.844008
중금    단기      1.077053 -1.271880 -1.694384 -1.360082 -0.035053  2.135580
      장기      2.060234  0.134447  0.526642  1.720946  0.182323  1.501125


We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. How the index is displayed can be controlled with the `display.multi_sparse` option, set through `pandas.set_option()` or temporarily through `pandas.option_context()`:

In [16]:
with pd.option_context('display.multi_sparse', False):
    print(df)  # a bare `df` inside a `with` block is not displayed, so print it explicitly

In [17]:
df

first         국고                  산금                  중금                  시은
second        단기        장기        단기        장기        단기        장기        단기        장기
A      -0.876667  0.966092  1.120699 -0.445573 -0.725877 -0.829777  1.980688 -2.445532
B      -0.503500 -0.880040 -0.206314 -1.258661 -1.345779 -0.528776 -1.555450 -0.741435
C       1.718268  2.243557  0.545362  2.008662  0.121040 -0.198562  0.253164  0.085287


It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [18]:
pd.Series(np.random.randn(8), index=tuples)

(국고, 단기)   -0.418815
(국고, 장기)    0.441065
(산금, 단기)    0.768435
(산금, 장기)   -1.066626
(중금, 단기)   -1.350763
(중금, 장기)   -0.853563
(시은, 단기)   -0.844003
(시은, 장기)    0.579016
dtype: float64

In [19]:
pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples, names=['first','second']))

first  second
국고     단기       -0.442064
       장기       -0.244117
산금     단기        0.042039
       장기        0.739813
중금     단기        0.750014
       장기       -1.202049
시은     단기        2.035856
       장기       -1.749841
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a `MultiIndex` explicitly yourself. However, when loading data from a file, you may wish to generate your own `MultiIndex` when preparing the data set.

### Reconstructing the level labels

The method `get_level_values()` will return a vector of the labels for each location at a particular level.

In [20]:
index.get_level_values(0)

Index(['국고', '국고', '산금', '산금', '중금', '중금', '시은', '시은'], dtype='object', name='first')

In [21]:
index.get_level_values(1)

Index(['단기', '장기', '단기', '장기', '단기', '장기', '단기', '장기'], dtype='object', name='second')

In [22]:
index.get_level_values('second')

Index(['단기', '장기', '단기', '장기', '단기', '장기', '단기', '장기'], dtype='object', name='second')

### Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. **Partial** selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [24]:
df['국고']

second        단기        장기
A      -0.876667  0.966092
B      -0.503500 -0.880040
C       1.718268  2.243557


In [25]:
df.loc[:, '국고']

second        단기        장기
A      -0.876667  0.966092
B      -0.503500 -0.880040
C       1.718268  2.243557


In [26]:
df['국고','단기']

A   -0.876667
B   -0.503500
C    1.718268
Name: (국고, 단기), dtype: float64

In [27]:
df[('국고','단기')]

A   -0.876667
B   -0.503500
C    1.718268
Name: (국고, 단기), dtype: float64

In [28]:
s

국고  단기   -1.700901
    장기   -0.185789
산금  단기    0.286530
    장기    1.896629
중금  단기   -0.735264
    장기    1.718560
시은  단기    1.067562
    장기   -0.682318
dtype: float64

In [29]:
s['중금']

단기   -0.735264
장기    1.718560
dtype: float64

See `Cross-section` with hierarchical index for how to select on a deeper level.
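To make "selecting on a deeper level" concrete, here is a minimal sketch using `Series.xs()` on a small stand-in for the series above. The data values and the name `long_end` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["국고", "산금"], ["단기", "장기"]], names=["first", "second"]
)
s = pd.Series(np.arange(4.0), index=index)

# Pick every entry whose 'second' level equals '장기'.
# The selected level is dropped from the result, leaving only 'first'.
long_end = s.xs("장기", level="second")
print(long_end)
```

Partial indexing with `s['국고']` only reaches the outermost level; `xs` with `level=` is one way to select on an inner level without reordering the index.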

### Defined levels

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [43]:
df.columns

MultiIndex([('국고', '단기'),
            ('국고', '장기'),
            ('산금', '단기'),
            ('산금', '장기'),
            ('중금', '단기'),
            ('중금', '장기'),
            ('시은', '단기'),
            ('시은', '장기')],
           names=['first', 'second'])

In [30]:
df.columns.levels  # what are the levels of the original MultiIndex?

FrozenList([['국고', '산금', '시은', '중금'], ['단기', '장기']])

In [34]:
df[['산금','국고']].columns.levels  # even after slicing, the MultiIndex keeps all of its original levels

FrozenList([['국고', '산금', '시은', '중금'], ['단기', '장기']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the `get_level_values()` method.

In [38]:
df[['국고','산금']].columns.levels

FrozenList([['국고', '산금', '시은', '중금'], ['단기', '장기']])

In [39]:
# to_numpy() returns only the labels that are actually in use
df[['국고','산금']].columns.to_numpy()

array([('국고', '단기'), ('국고', '장기'), ('산금', '단기'), ('산금', '장기')],
      dtype=object)

In [37]:
# for a specific level
df[['국고','산금']].columns.get_level_values(0)

Index(['국고', '국고', '산금', '산금'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the `remove_unused_levels()` method may be used.

In [40]:
new_mi = df[['국고','산금']].columns.remove_unused_levels()

In [41]:
new_mi.levels

FrozenList([['국고', '산금'], ['단기', '장기']])
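As a quick check of what `remove_unused_levels()` changes and what it leaves alone, the sketch below compares a sliced index with its trimmed copy. The frame is a zero-filled stand-in built for this illustration, and the names `sliced` and `trimmed` are hypothetical.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["국고", "산금", "시은"], ["단기", "장기"]], names=["first", "second"]
)
df = pd.DataFrame(np.zeros((2, 6)), columns=columns)

# Selecting two of the three top-level groups keeps the full level definition.
sliced = df[["국고", "산금"]].columns
trimmed = sliced.remove_unused_levels()

print(list(sliced.levels[0]))    # still includes the unused '시은'
print(list(trimmed.levels[0]))   # only the labels that are in use
print(sliced.equals(trimmed))    # the labels at each position are unchanged
```

Only the stored level definitions differ; the tuples that make up the index are the same before and after.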

### Data Alignment and using `reindex`

Operations between differently-indexed objects having `MultiIndex` on the axes will work as you expect; data alignment works the same way as for an `Index` of tuples:

In [49]:
s = pd.Series(range(8), index=index)

In [50]:
s

first  second
국고     단기        0
       장기        1
산금     단기        2
       장기        3
중금     단기        4
       장기        5
시은     단기        6
       장기        7
dtype: int64

In [53]:
s[:-2]

first  second
국고     단기        0
       장기        1
산금     단기        2
       장기        3
중금     단기        4
       장기        5
dtype: int64

In [54]:
s + s[:-2]

first  second
국고     단기         0.0
       장기         2.0
산금     단기         4.0
       장기         6.0
시은     단기         NaN
       장기         NaN
중금     단기         8.0
       장기        10.0
dtype: float64

In [55]:
s + s[::2]

first  second
국고     단기         0.0
       장기         NaN
산금     단기         4.0
       장기         NaN
시은     단기        12.0
       장기         NaN
중금     단기         8.0
       장기         NaN
dtype: float64

The `reindex()` method of `Series`/`DataFrame` can be called with another `MultiIndex`, or even a list or array of tuples:

In [57]:
s.reindex(index[:3])

first  second
국고     단기        0
       장기        1
산금     단기        2
dtype: int64

In [58]:
s.reindex( [ ('국고', '장기'), ('산금','장기'), ('시은','장기'), ('중금','단기') ] )

first  second
국고     장기        1
산금     장기        3
시은     장기        7
중금     단기        4
dtype: int64
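The `reindex()` calls above only ask for labels that already exist. A short sketch of the other case, on a smaller stand-in series (the names `target`, `out` and `filled` are illustrative): a tuple that is missing from the original index gets `NaN`, unless a `fill_value` is supplied.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["국고", "산금"], ["단기", "장기"]], names=["first", "second"]
)
s = pd.Series(range(4), index=index)

# ('시은', '단기') does not exist in s.
target = [("산금", "장기"), ("시은", "단기")]

out = s.reindex(target)                    # missing label becomes NaN (dtype turns float)
filled = s.reindex(target, fill_value=0)   # missing label becomes 0 instead

print(out)
print(filled)
```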

## 2. Advanced indexing with hierarchical index

Syntactically integrating `MultiIndex` in advanced indexing with `.loc` is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [59]:
df = df.T

In [60]:
df

                     A         B         C
first second
국고    단기     -0.876667 -0.503500  1.718268
      장기      0.966092 -0.880040  2.243557
산금    단기      1.120699 -0.206314  0.545362
      장기     -0.445573 -1.258661  2.008662
중금    단기     -0.725877 -1.345779  0.121040
      장기     -0.829777 -0.528776 -0.198562
시은    단기      1.980688 -1.555450  0.253164
      장기     -2.445532 -0.741435  0.085287


In [62]:
df.loc[('국고', '장기'), ['A', 'B']]

A    0.966092
B   -0.880040
Name: (국고, 장기), dtype: float64

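Note that `df.loc['국고', '장기']` would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with `.loc`, you must use a tuple for the row key, for example `df.loc[('국고', '장기'), 'A']`. To see why, rename the last column to `'장기'` so that it collides with a label in the second index level. The shorthand `df.loc['국고', '장기']` could then mean either the row `('국고', '장기')` or the column `'장기'` of the `'국고'` rows; in the output below, pandas resolves it to the row: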

In [63]:
df.columns = ['A','B','장기']

In [67]:
df.loc['국고','장기'] # ambiguous

A     0.966092
B    -0.880040
장기    2.243557
Name: (국고, 장기), dtype: float64
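One way to sidestep the ambiguity is to always write the row key as a tuple and give the column as a separate second argument. The sketch below rebuilds a small frame with the same clashing column name; the integer data and the names `value_a` and `value_clash` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["국고", "산금"], ["단기", "장기"]], names=["first", "second"]
)
# The third column deliberately shares its name with a second-level row label.
df = pd.DataFrame(np.arange(12).reshape(4, 3), index=index, columns=["A", "B", "장기"])

# Row key as a tuple, column key as the second argument: nothing to guess.
value_a = df.loc[("국고", "장기"), "A"]
value_clash = df.loc[("국고", "장기"), "장기"]

print(value_a, value_clash)
```

Here the first `'장기'` can only be a row label and the second only a column label, so the result is a single scalar in both calls.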

You don’t have to specify all levels of the `MultiIndex`; passing only the first elements of the tuple is enough. For example, you can use “partial” indexing to get all elements with `국고` in the first level as follows:

`df.loc['국고']`

This is a shortcut for the slightly more verbose notation `df.loc[('국고',),]` (equivalent to `df.loc['국고',]` in this example).

“Partial” slicing also works quite nicely.

In [73]:
iters = [['KRW','USD','CNY','JPY', 'EUR'], ['sell','buy']]

In [74]:
mi = pd.MultiIndex.from_product(iters, names=['ccy','position'])

In [76]:
df = pd.DataFrame(np.random.randn(10,3), index=mi, columns=list('ABC'))

In [94]:
df.loc['KRW']

                 A         B         C
position
sell      0.099013  0.048897  1.472745
buy       1.026882 -0.559019  0.152572


In [95]:
df.loc[('KRW',),]

                 A         B         C
position
sell      0.099013  0.048897  1.472745
buy       1.026882 -0.559019  0.152572


In [96]:
df.loc['KRW',]

                 A         B         C
position
sell      0.099013  0.048897  1.472745
buy       1.026882 -0.559019  0.152572
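The cells above select a single first-level label. For label *slices*, a minimal sketch follows. It assumes the index is sorted first with `sort_index()`, since the currency labels were created in a non-sorted order, and it uses integer data in place of the random values; the names `df_sorted`, `first_level` and `between` are illustrative.

```python
import numpy as np
import pandas as pd

iters = [["KRW", "USD", "CNY", "JPY", "EUR"], ["sell", "buy"]]
mi = pd.MultiIndex.from_product(iters, names=["ccy", "position"])
df = pd.DataFrame(np.arange(30).reshape(10, 3), index=mi, columns=list("ABC"))

# Sort the index so that label-based slicing is well defined.
df_sorted = df.sort_index()

# Slice on the first level only; both endpoints are included.
first_level = df_sorted.loc["CNY":"JPY"]

# Slice between two complete tuples.
between = df_sorted.loc[("CNY", "sell"):("EUR", "buy")]

print(first_level)
print(between)
```

After sorting, the currencies run CNY, EUR, JPY, KRW, USD and the positions run buy, sell, so the first slice covers three currencies and the second covers exactly two rows.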
