In [None]:
import pandas as pd 
import numpy as np

# Multi Indexing or hierarchical indexing

<p Style="color:#c4c95f;font-family:'monospace';background-color:#000;font-size:20px;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
</p>
<p Style="color:#c4c95f;font-family:'monospace';background-color:#000;font-size:20px;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.</p>

# Creating a MultiIndex (hierarchical index) object
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using `MultiIndex.from_arrays()`), an array of tuples (using `MultiIndex.from_tuples()`), a crossed set of iterables (using `MultiIndex.from_product()`), or a DataFrame (using `MultiIndex.from_frame()`). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
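Two of the constructors named above, `MultiIndex.from_arrays()` and the plain `Index` constructor given a list of tuples, are not exercised in the cells that follow. A minimal sketch of both, using illustrative variable names (`level_arrays`, `mi_arrays`, `mi_from_index`) that are not part of the original notebook:

```python
import pandas as pd

# One array per level; position i across the arrays forms the i-th key.
level_arrays = [["bar", "bar", "baz"], ["one", "two", "one"]]
mi_arrays = pd.MultiIndex.from_arrays(level_arrays, names=["first", "second"])

# The generic Index constructor returns a MultiIndex for a list of tuples.
mi_from_index = pd.Index([("bar", "one"), ("bar", "two"), ("baz", "one")])

print(type(mi_from_index).__name__)
print(mi_arrays.nlevels, mi_from_index.nlevels)
```

Both objects hold the same keys; only `mi_arrays` carries level names, because names were passed explicitly.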

In [None]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
print(arrays)

[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]


In [None]:
tuples = list(zip(*arrays))
print(tuples)

[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]


In [None]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index


MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [None]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one      -0.474963
       two       0.322479
baz    one      -2.615033
       two      -0.096663
foo    one      -0.023613
       two      -1.603999
qux    one      -0.870078
       two      -1.039578
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [None]:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
pd.MultiIndex.from_product(iterables, names=["first", "second"])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
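The same method extends to more than two iterables. As a small sketch (the level names `year`, `visit` and `type` are chosen here for illustration), the number of keys is the product of the iterable lengths:

```python
import pandas as pd

years = [2013, 2014]
visits = [1, 2]
kinds = ["HG", "Temp"]

# Cartesian product of the three iterables: 2 * 2 * 2 keys, three levels.
mi = pd.MultiIndex.from_product(
    [years, visits, kinds], names=["year", "visit", "type"]
)

print(len(mi), mi.nlevels)
print(mi[0], mi[-1])
```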

You can also construct a MultiIndex from a DataFrame directly, using the method `MultiIndex.from_frame()`. This is a complementary method to `MultiIndex.to_frame()`.

In [None]:
df = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)
df

Unnamed: 0,first,second
0,bar,one
1,bar,two
2,foo,one
3,foo,two


In [None]:
pd.MultiIndex.from_frame(df)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
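To illustrate the complementary relationship with `MultiIndex.to_frame()`, the sketch below converts a frame to a MultiIndex and back. It assumes `index=False` is passed so that the resulting frame gets a default integer index rather than the MultiIndex itself; the variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"first": ["bar", "bar", "foo", "foo"], "second": ["one", "two", "one", "two"]}
)

# Frame -> MultiIndex: column names become level names.
mi = pd.MultiIndex.from_frame(frame)

# MultiIndex -> frame: level names become column names again.
round_trip = mi.to_frame(index=False)

print(round_trip)
```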

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [None]:
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
arrays

[array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
       dtype='<U3'),
 array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
       dtype='<U3')]

In [None]:
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one    1.021361
     two   -0.492605
baz  one    0.445899
     two    0.163972
foo  one    0.111829
     two    0.965775
qux  one   -1.244975
     two   -0.448707
dtype: float64

In [None]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

Unnamed: 0,Unnamed: 1,0,1,2,3
bar,one,-0.517331,-1.479932,0.583075,-0.550293
bar,two,0.145875,1.839952,0.75808,0.491971
baz,one,-0.366161,0.546345,-0.891069,0.326291
baz,two,1.170699,-0.758957,0.573854,-0.016143
foo,one,-0.477831,1.250441,2.193823,0.414363
foo,two,-1.402932,0.191409,0.84495,-0.223622
qux,one,-0.750531,-0.326422,-1.019849,-0.793894
qux,two,-0.793493,1.659182,-0.55404,0.881396


All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [None]:
df.index.names

FrozenList([None, None])
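Names can also be attached after construction. A minimal sketch, assuming the index was built without names as in the cell above (the data here is a zero-filled placeholder, not the notebook's random values):

```python
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
df = pd.DataFrame(np.zeros((4, 2)), index=arrays)
print(df.index.names)  # no names were given, so both levels are unnamed

# set_names returns a new index; assign it back to apply the names.
df.index = df.index.set_names(["first", "second"])
print(df.index.names)
```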

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [None]:
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)
df

first,bar,bar,baz,baz,foo,foo,qux,qux
second,one,two,one,two,one,two,one,two
A,-0.728761,-0.084107,-1.182437,-1.0433,-0.773936,1.605305,1.010562,-1.354723
B,-1.922347,1.085032,-1.524169,1.360346,1.090313,0.478047,-1.620744,-0.732665
C,0.973348,-1.167055,0.372091,-0.517805,-0.423159,0.498708,0.454401,-0.376528


In [None]:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

Unnamed: 0_level_0,first,bar,bar,baz,baz,foo,foo
Unnamed: 0_level_1,second,one,two,one,two,one,two
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
bar,one,0.048562,-1.2528,0.134575,1.968382,-0.072534,0.786405
bar,two,0.107968,0.618966,-1.085837,-1.315955,0.434587,0.425862
baz,one,2.25133,0.269664,0.235421,-1.245102,0.877984,1.414413
baz,two,-1.432895,2.177356,-0.33737,0.094059,0.441283,-0.521541
foo,one,1.04919,1.29081,-0.823526,-0.49396,2.753459,-0.559673
foo,two,-0.458928,-0.084823,0.294149,-1.174327,-0.516529,-0.944884


We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the `display.multi_sparse` option, set either with `pandas.set_option()` or temporarily with `pandas.option_context()`:

In [None]:
with pd.option_context("display.multi_sparse", False):
    print(df)  # a bare `df` inside a with block displays nothing, so print it

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [None]:
pd.Series(np.random.randn(8), index=tuples)

(bar, one)    0.111952
(bar, two)    0.142852
(baz, one)    0.985998
(baz, two)    1.813835
(foo, one)   -0.039185
(foo, two)    0.356397
(qux, one)   -1.556560
(qux, two)   -1.097631
dtype: float64
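The difference between tuples as atomic labels and a real MultiIndex shows up in the number of levels. A short sketch with placeholder values (the names `flat` and `hier` are illustrative):

```python
import numpy as np
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]
values = np.arange(3.0)

# Passing the tuples directly keeps each tuple as a single, atomic label.
flat = pd.Series(values, index=tuples)

# Building a MultiIndex first gives a two-level hierarchical index.
hier = pd.Series(values, index=pd.MultiIndex.from_tuples(tuples))

print(flat.index.nlevels, hier.index.nlevels)
print(hier["bar"])  # partial selection needs the hierarchical version
```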

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

# Reconstructing the level labels

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [None]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [None]:
index.get_level_values("second")

In [None]:
index.get_level_values(1)

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
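Because `get_level_values()` returns one label per row, its result can be compared against a value to build a boolean mask. A minimal sketch on a smaller index with placeholder data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series(np.arange(4.0), index=index)

# One boolean per row: True where the "second" level equals "one".
mask = s.index.get_level_values("second") == "one"
ones = s[mask]

print(ones)
```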

In [None]:
# index.get_level_values(2)  # uncommenting this raises the IndexError shown below

<a style="color:#ff0000;font-family:'monospaced'"> IndexError:</a>Too many levels: Index has only 2 levels, not 3


In [None]:
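# Row index: every (year, visit) pair.
index_ = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
# Column index: every (name, measurement type) pair.
columns = pd.MultiIndex.from_product(
    [["Rakesh", "Sai", "Hari"], ["HG", "Temp"]], names=["Names", "Type"]
)
# Mock some data: scale every other column (the HG columns) by 10, then shift by 37.
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
hd = pd.DataFrame(data, index=index_, columns=columns)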

In [None]:
hd

Unnamed: 0_level_0,Names,Rakesh,Rakesh,Sai,Sai,Hari,Hari
Unnamed: 0_level_1,Type,HG,Temp,HG,Temp,HG,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,26.0,36.8,35.0,36.7,53.0,37.2
2013,2,33.0,37.7,44.0,37.6,39.0,38.2
2014,1,51.0,37.5,29.0,38.5,43.0,37.3
2014,2,39.0,37.2,46.0,39.1,40.0,36.2


# Basic indexing on axis with MultiIndex
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [None]:
df

first,bar,bar,baz,baz,foo,foo,qux,qux
second,one,two,one,two,one,two,one,two
A,-0.728761,-0.084107,-1.182437,-1.0433,-0.773936,1.605305,1.010562,-1.354723
B,-1.922347,1.085032,-1.524169,1.360346,1.090313,0.478047,-1.620744,-0.732665
C,0.973348,-1.167055,0.372091,-0.517805,-0.423159,0.498708,0.454401,-0.376528


In [None]:
df['bar']

second,one,two
A,-0.728761,-0.084107
B,-1.922347,1.085032
C,0.973348,-1.167055


In [None]:
df['foo']

second,one,two
A,-0.773936,1.605305
B,1.090313,0.478047
C,-0.423159,0.498708


In [None]:
df[('bar','one')]

A   -0.728761
B   -1.922347
C    0.973348
Name: (bar, one), dtype: float64

In [None]:
df['bar']['one']  # chained lookup: same values as df[('bar', 'one')], but the result is named 'one'

A   -0.728761
B   -1.922347
C    0.973348
Name: one, dtype: float64

In [None]:
s

bar  one    1.021361
     two   -0.492605
baz  one    0.445899
     two    0.163972
foo  one    0.111829
     two    0.965775
qux  one   -1.244975
     two   -0.448707
dtype: float64

In [None]:
s['baz']

one    0.445899
two    0.163972
dtype: float64

In [None]:
s['bar','one']

1.0213606790116987

In [None]:
s['bar']['two']

-0.49260540913596945
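The same partial selection applies when both axes carry a MultiIndex, as in the `hd` frame built earlier. The sketch below rebuilds a frame of the same shape with deterministic placeholder numbers (not the notebook's random data) so the results are reproducible:

```python
import numpy as np
import pandas as pd

index_ = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=["year", "visit"])
columns = pd.MultiIndex.from_product(
    [["Rakesh", "Sai", "Hari"], ["HG", "Temp"]], names=["Names", "Type"]
)
data = np.arange(24, dtype=float).reshape(4, 6)  # placeholder values
hd = pd.DataFrame(data, index=index_, columns=columns)

sai = hd["Sai"]             # partial column label: the "Names" level is dropped
sai_hg = hd[("Sai", "HG")]  # full column key: a Series indexed by (year, visit)
y2013 = hd.loc[2013]        # partial row label: the "year" level is dropped

print(sai.columns.tolist())
print(y2013.index.name)
```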

# Defined levels
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [None]:
df.columns.levels  # original MultiIndex

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [None]:
df[["foo", "qux"]].columns.levels  # sliced MultiIndex

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [None]:
df[["foo", "qux"]].columns.to_numpy()

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [None]:
df[["foo", "qux"]].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [None]:
df.loc["A", ("bar", "two")]  # row label first, then the (first, second) column key

-0.08410678957317046

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [None]:
new_mi = df[["foo", "qux"]].columns.remove_unused_levels()


In [None]:
new_mi

MultiIndex([('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [None]:
new_mi.levels

FrozenList([['foo', 'qux'], ['one', 'two']])

# Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [None]:
s

bar  one    1.021361
     two   -0.492605
baz  one    0.445899
     two    0.163972
foo  one    0.111829
     two    0.965775
qux  one   -1.244975
     two   -0.448707
dtype: float64

In [None]:
s.iloc[-2]  # second-to-last value, selected by position

-1.2449745511736046

In [None]:
s[:-2]  # everything except the last two values

bar  one    1.021361
     two   -0.492605
baz  one    0.445899
     two    0.163972
foo  one    0.111829
     two    0.965775
dtype: float64

In [None]:
s + s[:-2]  # labels missing from s[:-2] (both qux rows) come out as NaN

bar  one    2.042721
     two   -0.985211
baz  one    0.891799
     two    0.327944
foo  one    0.223658
     two    1.931549
qux  one         NaN
     two         NaN
dtype: float64

In [None]:
s + s[::2]  # s[::2] keeps every other row, so the "two" rows come out as NaN

bar  one    2.042721
     two         NaN
baz  one    0.891799
     two         NaN
foo  one    0.223658
     two         NaN
qux  one   -2.489949
     two         NaN
dtype: float64
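If NaN is not wanted for keys present on only one side, the method form `Series.add()` accepts a `fill_value`. A minimal sketch with small placeholder data:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Operator form: keys absent from s[::2] produce NaN.
plain = s + s[::2]

# Method form: absent keys are treated as 0 before adding.
filled = s.add(s[::2], fill_value=0)

print(plain)
print(filled)
```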

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [None]:
index[:3]

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one')],
           names=['first', 'second'])

In [None]:
s.reindex(index[:3])  # keep only the first three keys of index; the result takes its level names from index

first  second
bar    one       1.021361
       two      -0.492605
baz    one       0.445899
dtype: float64

In [None]:
s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])

foo  two    0.965775
bar  one    1.021361
qux  one   -1.244975
baz  one    0.445899
dtype: float64
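Keys passed to `reindex()` do not have to exist in the original index. As a sketch with placeholder data, a missing key yields NaN unless a `fill_value` is supplied:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# ("qux", "one") is not in s, so its slot is filled rather than raising.
wanted = [("baz", "two"), ("bar", "one"), ("qux", "one")]
with_nan = s.reindex(wanted)
with_zero = s.reindex(wanted, fill_value=0.0)

print(with_nan)
print(with_zero)
```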

# Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with `.loc` is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [None]:
df = df.T
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,-0.728761,-1.922347,0.973348
bar,two,-0.084107,1.085032,-1.167055
baz,one,-1.182437,-1.524169,0.372091
baz,two,-1.0433,1.360346,-0.517805
foo,one,-0.773936,1.090313,-0.423159
foo,two,1.605305,0.478047,0.498708
qux,one,1.010562,-1.620744,0.454401
qux,two,-1.354723,-0.732665,-0.376528


In [None]:
df.loc[("bar", "two")]

A   -0.084107
B    1.085032
C   -1.167055
Name: (bar, two), dtype: float64

In [None]:
df.loc[("bar", "two"), "A"]  # row key (a tuple) first, then the column label

-0.08410678957317046

In [None]:
df.loc["bar"]

Unnamed: 0_level_0,A,B,C
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,-0.728761,-1.922347,0.973348
two,-0.084107,1.085032,-1.167055


This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
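A small sketch comparing the shortcut with the verbose notation, on a placeholder frame of the same layout (rows indexed by `first`/`second`, columns `A`, `B`, `C`). It only checks that both forms select the same values:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(12.0).reshape(4, 3), index=index, columns=["A", "B", "C"])

short = df.loc["bar"]        # shortcut form
verbose = df.loc[("bar",),]  # verbose form from the text above

print(short)
print(verbose)
```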
“Partial” slicing also works quite nicely.

In [None]:
df.loc["baz":"foo"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,one,-1.182437,-1.524169,0.372091
baz,two,-1.0433,1.360346,-0.517805
foo,one,-0.773936,1.090313,-0.423159
foo,two,1.605305,0.478047,0.498708


In [None]:
df.loc[("baz",'two'):("foo",'one')]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,two,-1.0433,1.360346,-0.517805
foo,one,-0.773936,1.090313,-0.423159


In [None]:
df.loc[("baz", "two"):("qux", "one")]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,two,-1.0433,1.360346,-0.517805
foo,one,-0.773936,1.090313,-0.423159
foo,two,1.605305,0.478047,0.498708
qux,one,1.010562,-1.620744,0.454401


In [None]:
df.loc[("baz", "two"):"foo"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,two,-1.0433,1.360346,-0.517805
foo,one,-0.773936,1.090313,-0.423159
foo,two,1.605305,0.478047,0.498708


<p style="color:#ffffff;background-color:#000;font-family:'monospaced';font-size:20px;">Note:</p>

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).  
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:
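The distinction can be sketched on a small Series; the labels `A`, `B`, `c`, `d`, `e` are placeholders chosen for illustration:

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)

# List of tuples: each tuple is one complete key, so two rows are selected.
by_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: one list per level, so every combination is selected.
by_levels = s.loc[(["A", "B"], ["c", "d"])]

print(by_keys)
print(by_levels)
```

The first selection returns two rows, the second returns four, because each list in the tuple is applied to its own level.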

In [None]:
df1 = pd.DataFrame(
    [["HI", "Temp"], ["HI", "precip"], ["NJ", "Temp"], ["NJ", "precip"]],
    columns=["a", "b"],
)

In [None]:
df1

Unnamed: 0,a,b
0,HI,Temp
1,HI,precip
2,NJ,Temp
3,NJ,precip


In [None]:
mi=pd.MultiIndex.from_frame(df1)

In [None]:
mi

MultiIndex([('HI',   'Temp'),
            ('HI', 'precip'),
            ('NJ',   'Temp'),
            ('NJ', 'precip')],
           names=['a', 'b'])

In [None]:
mi.levels

FrozenList([['HI', 'NJ'], ['Temp', 'precip']])

In [None]:
mi.nlevels

2

In [None]:
mi.set_levels(['a','b'],level=0)

MultiIndex([('a',   'Temp'),
            ('a', 'precip'),
            ('b',   'Temp'),
            ('b', 'precip')],
           names=['a', 'b'])
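Note that `set_levels()` returns a new MultiIndex and leaves the original untouched. A minimal sketch that rebuilds `mi` from tuples and uses placeholder replacement labels `x` and `y`:

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("HI", "Temp"), ("HI", "precip"), ("NJ", "Temp"), ("NJ", "precip")],
    names=["a", "b"],
)

# Replace the labels of level 0; the result is a new index object.
renamed = mi.set_levels(["x", "y"], level=0)

print(mi.levels[0].tolist())       # original labels are unchanged
print(renamed.levels[0].tolist())  # new labels
print(renamed[0])
```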
