# 2.5 – Hierarchical Indexing

Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas ``Series`` and ``DataFrame`` objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data–that is, data indexed by more than one or two keys.
This can be achieved with the help of *hierarchical indexing* (also known as *multi-indexing*), that is, by incorporating multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.

In this section, we'll explore the direct creation of ``MultiIndex`` objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.

We begin with the standard imports:

In [1]:
import pandas as pd
import numpy as np

## A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data within a one-dimensional ``Series``.
For concreteness, we will consider a series of data where each point has a character and numerical key.

### The bad way

Suppose you would like to track data about states from two different years.
Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:

In [3]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [4]:
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we've grown to love in Pandas.

### The Better Way: Pandas MultiIndex
Fortunately, Pandas provides a better way.
Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas ``MultiIndex`` type gives us the type of operations we wish to have.
We can create a multi-index from the tuples as follows:

In [5]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

Notice that the ``MultiIndex`` contains multiple *levels* of indexing, in this case the state names and the years. Internally, each data point also carries one integer *code* per level (called *labels* in older Pandas versions) that records which value of that level it holds.
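
The two ingredients can be inspected directly through the ``levels`` and ``codes`` attributes of a ``MultiIndex`` (``codes`` were called ``labels`` in older Pandas releases). The following is only a minimal sketch; it rebuilds the same index under the illustrative name ``mi`` so that it runs on its own:

```python
import pandas as pd

# Rebuild the index from the tuples used above (the name `mi` is illustrative)
tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
mi = pd.MultiIndex.from_tuples(tuples)

# Unique values, stored once per level
print(mi.levels[0].tolist())   # state names
print(mi.levels[1].tolist())   # years

# Integer codes: for each data point, the position of its value within the level
print(mi.codes[0].tolist())
print(mi.codes[1].tolist())
```

Storing each distinct value once and referring to it by position is what lets Pandas treat the levels as separate dimensions.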

If we re-index our series with this ``MultiIndex``, we see the hierarchical representation of the data:

In [6]:
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [7]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the ``Series`` representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [8]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

The result is a singly indexed ``Series`` with just the keys we're interested in.
This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with.
We'll now further discuss this sort of indexing operation on hierarchically indexed data.

**Your turn.** Access the following data in the ``pop`` Series:
- from year 2000
- from New York
- from New York in year 2010

In [9]:
# write your code here - from year 2000



In [10]:
# write your code here - from New York



In [11]:
# write your code here - from New York in year 2010



### MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [12]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


Naturally, the ``stack()`` method provides the opposite operation:

In [13]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Seeing this, you might wonder why we would bother with hierarchical indexing at all.
The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional ``Series``, we can also use it to represent data of three or more dimensions in a ``Series`` or ``DataFrame``.
Each extra level in a multi-index represents an extra dimension of data. Taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year, say, population under 18. With a ``MultiIndex`` this is as easy as adding another column to the ``DataFrame``:

In [14]:
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [15]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


You can think of this ``DataFrame`` as a three-dimensional ndarray. To access the data from year 2010, you can use explicit indexing:

In [16]:
pop_df.loc[:,2010,:]  # explicit indexing

               total  under18
California  37253956  9284094
New York    19378102  4318033
Texas       25145561  6879014


In addition, all the ufuncs and other functionality discussed in [Operating on Data in Pandas](L23_Operations_in_Pandas.ipynb) work with hierarchical indices as well.
Here we compute the fraction of people under 18 by year, given the above data:

In [17]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


This allows us to easily and quickly manipulate and explore even high-dimensional data.
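
To make the "one level per dimension" idea concrete, here is a small sketch with a three-level index. The level names (``state``, ``year``, ``group``) and the values (just the integers 0 to 7) are made up for illustration and are not real census figures:

```python
import numpy as np
import pandas as pd

# Three index levels -> three dimensions packed into a one-dimensional Series
idx3 = pd.MultiIndex.from_product(
    [['California', 'Texas'], [2000, 2010], ['adult', 'under18']],
    names=['state', 'year', 'group'])      # hypothetical level names
s3 = pd.Series(np.arange(8), index=idx3)   # placeholder values 0..7

# Fixing one level removes one dimension
by_year_group = s3['Texas']                # remaining levels: year, group

# Unstacking the last level turns it into columns
table = s3.unstack()                       # rows: (state, year); columns: group
print(table)
```

The same data could be unstacked further, but the stacked form keeps everything in a single familiar object.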

## Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to simply pass a list of two or more index arrays to the constructor. For example:

In [18]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.638168  0.323976
  2  0.909834  0.799652
b 1  0.435188  0.152668
  2  0.731202  0.314979


The work of creating the ``MultiIndex`` is done in the background.

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a ``MultiIndex`` by default:

In [19]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
ds = pd.Series(data)
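
The cell above builds ``ds`` without displaying it. A quick check, sketched here on a shortened copy of the dictionary (the names ``data_small`` and ``ds_small`` are illustrative), suggests how to confirm that the tuple keys were turned into a two-level ``MultiIndex``:

```python
import pandas as pd

# Shortened copy of the dictionary above (two states only)
data_small = {('California', 2000): 33871648,
              ('California', 2010): 37253956,
              ('Texas', 2000): 20851820,
              ('Texas', 2010): 25145561}
ds_small = pd.Series(data_small)

print(type(ds_small.index))     # a MultiIndex rather than a plain Index of tuples
print(ds_small.index.nlevels)   # number of levels
print(ds_small['Texas'])        # partial indexing works right away
```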

Nevertheless, it is sometimes useful to explicitly create a ``MultiIndex``; we'll see a couple of these methods here.

### Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available on ``pd.MultiIndex``.
For example, as we did before, you can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level:

In [20]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point:

In [21]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a Cartesian product of single indices:

In [22]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can also construct a ``MultiIndex`` directly from a ``DataFrame`` using ``MultiIndex.from_frame()``. Each row of the ``DataFrame`` becomes one index entry, and the column names become the level names:

In [23]:
df = pd.DataFrame([['a', 'b'], [1, 2]], columns=["first", "second"])
df

  first second
0     a      b
1     1      2


In [24]:
pd.MultiIndex.from_frame(df)

MultiIndex([('a', 'b'),
            (  1,   2)],
           names=['first', 'second'])
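
Because ``from_frame()`` reads the ``DataFrame`` row by row, a frame laid out with one column per level reproduces the index built by the other constructors. A small sketch (the names ``pairs``, ``from_frame_index`` and ``from_arrays_index`` are illustrative):

```python
import pandas as pd

# One column per level, one row per index entry
pairs = pd.DataFrame({'first': ['a', 'a', 'b', 'b'],
                      'second': [1, 2, 1, 2]})
from_frame_index = pd.MultiIndex.from_frame(pairs)

# Same entries as from_arrays, with level names taken from the column names
from_arrays_index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['first', 'second'])

print(from_frame_index)
print(from_frame_index.equals(from_arrays_index))
```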

Any of these objects can be passed as the ``index`` argument when creating a ``Series`` or ``DataFrame``, or be passed to the ``reindex`` method of an existing ``Series`` or ``DataFrame``. For instance:

In [25]:
ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
df = pd.DataFrame(np.random.rand(4, 2), index=ind, columns=["A","B"])
df

            A         B
a 1  0.010030  0.229009
  2  0.585979  0.117646
b 1  0.618129  0.712314
  2  0.089452  0.509659


**Your turn.** Create the following MultiIndex DataFrame, building its index with each of the ``from_arrays``, ``from_tuples``, and ``from_product`` methods explained above.
```
            2010  2020
England  a   835   317
         b   908   875
         c   455   259
Wales    a   517   687
         b   882   217
         c   362   738
Scotland a   598   620
         b   334   195
         c   678   759
```
For DataFrame values choose random integers between 100 and 999.

In [26]:
# write your code here - create index from arrays



In [27]:
# write your code here - create index from tuples



In [28]:
# write your code here - create index from product



In [29]:
# write your code here - create dataframe



### MultiIndex level names

Sometimes it is convenient to name the levels of the ``MultiIndex``.
This can be accomplished by passing the ``names`` argument to any of the above ``MultiIndex`` constructors, or by setting the ``names`` attribute of the index after the fact:

In [30]:
pop  # index has no names

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [31]:
pop.index.names = ['state', 'year']  # set index names
pop                                  # inspect the result

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

With more involved datasets, this can be a useful way to keep track of the meaning of various index values.
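
The cells above set the names after the fact. The other route mentioned, passing ``names`` to a constructor, might look like the following sketch (the values are placeholders, and ``named`` and ``s`` are illustrative names):

```python
import pandas as pd

# Name the levels at construction time via the `names` argument
named = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])

s = pd.Series(range(6), index=named)   # placeholder values 0..5
print(s.index.names)

# Level names can then stand in for level positions
print(s.unstack(level='year'))
```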

**Your turn.** Use the MultiIndex from your previous exercise to create a MultiIndex Series `sr` with values 100, 200, ..., 900. Then set index names to "country" and "division". The result should be as follows:

```
country   division
England   a           100
          b           200
          c           300
Wales     a           400
          b           500
          c           600
Scotland  a           700
          b           800
          c           900
dtype: int64
```

In [32]:
# write your code here - create series



In [33]:
# write your code here - set index names



### Swapping and dropping MultiIndex levels

To swap two ``MultiIndex`` levels ``i`` and ``j``, use ``DataFrame.swaplevel(i=-2, j=-1, axis=0)``; by default the two innermost levels, ``-2`` and ``-1``, are swapped.

In [34]:
pop_df = pop_df.swaplevel(0,1)
pop_df.sort_index(inplace = True) # comment this out to see the difference
pop_df

                    total  under18
year state
2000 California  33871648  9267089
     New York    18976457  4687374
     Texas       20851820  5906301
2010 California  37253956  9284094
     New York    19378102  4318033
     Texas       25145561  6879014


To drop a ``MultiIndex`` level, use ``DataFrame.droplevel(level, axis=0)``. The ``level`` argument is required and can be given as a level position or a level name.

In [35]:
pop_df.droplevel(level=0)

               total  under18
state
California  33871648  9267089
New York    18976457  4687374
Texas       20851820  5906301
California  37253956  9284094
New York    19378102  4318033
Texas       25145561  6879014
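
Once the levels are named, both methods also accept level names instead of positions. The sketch below uses a small stand-in frame with placeholder values (``toy``, ``swapped`` and ``no_year`` are illustrative names); note that dropping a level can leave duplicate labels behind, as in the output above:

```python
import numpy as np
import pandas as pd

# Small frame with named row levels (values are placeholders)
rows = pd.MultiIndex.from_product([[2000, 2010], ['California', 'Texas']],
                                  names=['year', 'state'])
toy = pd.DataFrame({'total': np.arange(4)}, index=rows)

# Levels addressed by name rather than by position
swapped = toy.swaplevel('year', 'state').sort_index()
no_year = toy.droplevel('year')

print(swapped.index.names)
print(no_year.index.tolist())   # state labels now repeat
```

Both calls return new objects; ``toy`` itself keeps its original two-level index.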


### MultiIndex for columns

In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [36]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  36.7  42.0  35.8  40.0  37.0
     2      27.0  36.7  55.0  34.4  44.0  35.6
2014 1      43.0  37.7  39.0  36.5  19.0  37.5
     2      27.0  36.9  35.0  35.6  33.0  37.6


Here we see where the multi-indexing for both rows and columns can come in *very* handy.
This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:

In [37]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      42.0  35.8
     2      55.0  34.4
2014 1      39.0  36.5
     2      35.0  35.6


For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.) use of hierarchical rows and columns can be extremely convenient!

Note that a ``MultiIndex`` can have a different number of nested indices within each index. For example, each year does not need to have the same number of visits. A new record can be added using the ``pd.concat()`` function, which will be explained in the [next](L26_Concat_and_Append.ipynb) section.

In [38]:
# create a dataframe for a new patient record
new_record = pd.DataFrame(np.round(np.random.randn(1, 6), 1)*10+37,
                          index = [[2014],[3]], 
                          columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']]))



new_record

         Bob       Guido         Sue
          HR  Temp    HR  Temp    HR  Temp
2014 3  39.0  45.0  43.0  60.0  43.0  27.0


In [39]:
# concatenate the frames; pd.concat returns a new DataFrame and leaves health_data unchanged
pd.concat([health_data,new_record])

             Bob       Guido         Sue
              HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  36.7  42.0  35.8  40.0  37.0
     2      27.0  36.7  55.0  34.4  44.0  35.6
2014 1      43.0  37.7  39.0  36.5  19.0  37.5
     2      27.0  36.9  35.0  35.6  33.0  37.6
     3      39.0  45.0  43.0  60.0  43.0  27.0


**Your turn.** Create a mock-up sales table for a small chain of local grocery stores called "Nick's", "Macy's" and "Khan's". Each shop should have "turnover" and "profit" columns populated quarterly (q1, q2, q3, q4) for years 2020 and 2021. Year and quarter are your index names. Populate this table with realistic random data (profit should be smaller than turnover!). Here's an example of such a DataFrame:

In [40]:
df = pd.read_pickle("data/sales_data.pkl")  # pickle is byte stream representation of pandas DataFrame
df

store           Nick's            Macy's            Khan's
finances      turnover  profit  turnover  profit  turnover  profit
year quarter
2020 q1        3159354   83816   1514733   83758   3228291   16421
     q2        1209120   66851   1791570   97572   3050553   93129
     q3         790977   14158   1337292   19350   1666698   55603
     q4         913011   39999    782100   14187   2877864   80979
2021 q1         993036   76085   3294027   87841   3066393   88910
     q2        1485066   51642   1630992   57591   3063852   38328
     q3         558129   66832   2489487   81197   3125529   97096
     q4        1348050   18679   2349732   49373   1672671   66005


In [41]:
# write your code here



## Indexing and Slicing a MultiIndex

Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed ``Series``, and then multiply-indexed ``DataFrame``s.

### Multiply indexed Series

Consider the multiply indexed ``Series`` of state populations we saw earlier:

In [42]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

We can access single elements by indexing with multiple terms:

In [43]:
pop['California', 2000]

33871648

The ``MultiIndex`` also supports *partial indexing*, or indexing just one of the levels in the index.
The result is another ``Series``, with the lower-level indices maintained:

In [44]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

Partial slicing is available as well, as long as the ``MultiIndex`` is sorted (see discussion in [Sorted and Unsorted Indices](#Sorted-and-unsorted-indices) below):

In [45]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [46]:
pop[:, 2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64

Other types of indexing and selection (discussed in [Data Indexing and Selection](L22_Data_Indexing_and_Selection.ipynb) previously) work as well; for example, selection based on Boolean masks:

In [47]:
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

Selection based on fancy indexing also works:

In [48]:
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

**Your turn.** Select California and Texas data in 2010. Hint: use ``loc``.

In [49]:
# write your code here



### Multiply indexed DataFrames

A multiply indexed ``DataFrame`` behaves in a similar manner.
Consider our toy medical ``DataFrame`` from before:

In [50]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      42.0  36.7  42.0  35.8  40.0  37.0
     2      27.0  36.7  55.0  34.4  44.0  35.6
2014 1      43.0  37.7  39.0  36.5  19.0  37.5
     2      27.0  36.9  35.0  35.6  33.0  37.6


Remember that columns are primary in a ``DataFrame``, and the syntax used for multiply indexed ``Series`` applies to the columns.
For example, we can recover Guido's heart rate data with a simple operation:

In [51]:
health_data['Guido', 'HR']

year  visit
2013  1        42.0
      2        55.0
2014  1        39.0
      2        35.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the ``loc`` and ``iloc`` indexers introduced in [Data Indexing and Selection](L22_Data_Indexing_and_Selection.ipynb). For example:

In [52]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      42.0  36.7
     2      27.0  36.7


**Your turn.** Select Sue's records from 2014 using `iloc`.

In [53]:
# write your code here



These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in ``loc`` or ``iloc`` can be passed a tuple of multiple indices. For example:

In [54]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        42.0
      2        27.0
2014  1        43.0
      2        27.0
Name: (Bob, HR), dtype: float64

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:

In [55]:
health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (3311942670.py, line 1)

You could get around this by building the desired slice explicitly using Python's built-in ``slice()`` function, but a better way in this context is to use an ``IndexSlice`` object, which Pandas provides for precisely this situation.
For example:

In [56]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      42.0  42.0  40.0
2014 1      43.0  39.0  19.0
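
For comparison, the built-in ``slice()`` route mentioned above can be sketched as follows. The frame ``toy`` is a small stand-in for the health data with placeholder values, so that the example runs on its own:

```python
import numpy as np
import pandas as pd

# Small stand-in for the health data (values are placeholders)
rows = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
cols = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                  names=['subject', 'type'])
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=rows, columns=cols)

# slice(None) plays the role of a bare ":" inside a tuple
with_slice = toy.loc[(slice(None), 1), (slice(None), 'HR')]

# The same selection written with IndexSlice
idx = pd.IndexSlice
with_idx = toy.loc[idx[:, 1], idx[:, 'HR']]

print(with_slice.equals(with_idx))
```

Both forms select the same rows and columns; ``IndexSlice`` simply lets you keep the familiar ``:`` notation.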


**Your turn.** Consider the ``sales_data`` dataframe you created above. Use IndexSlice to:
- select q2 and q4 profits of each shop
- select all q2 and q3 data from Nick's and Khan's stores for year 2021

In [57]:
# write your code here - select q2 and q4 profits of each shop



In [58]:
# write your code here - select all q2 and q3 data from Nick's and Khan's stores for year 2021



There are so many ways to interact with data in multiply indexed ``Series`` and ``DataFrame``s, and the best way to become familiar with them is to try them out!

## Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively transform the data.
There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations.
We saw a brief example of this in the ``stack()`` and ``unstack()`` methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.

### Sorted and unsorted indices

Earlier, we briefly mentioned a caveat, but we should emphasize it more here.
*Many of the ``MultiIndex`` slicing operations will fail if the index is not sorted.*
Let's take a look at this here.

We'll start by creating some simple multiply indexed data where the indices are *not lexicographically sorted*:

In [59]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data  = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.627222
      2      0.246700
c     1      0.785131
      2      0.414200
b     1      0.167664
      2      0.930912
dtype: float64

If we try to take a partial slice of this index, it will result in an error:

In [60]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted.
For various reasons, partial slices and other similar operations require the levels in the ``MultiIndex`` to be in sorted (i.e., lexicographical) order.
Pandas provides a number of convenience routines to perform this type of sorting; we'll use the simplest, ``sort_index()``, here:

In [61]:
data = data.sort_index() # aside: try data.sort_values()
data

char  int
a     1      0.627222
      2      0.246700
b     1      0.167664
      2      0.930912
c     1      0.785131
      2      0.414200
dtype: float64

With the index sorted in this way, partial slicing will work as expected:

In [62]:
data['a':'b']

char  int
a     1      0.627222
      2      0.246700
b     1      0.167664
      2      0.930912
dtype: float64
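
One way to check up front whether an index is sorted, rather than waiting for the error, is the ``is_monotonic_increasing`` property of the index. This is only a sketch on placeholder data (``s`` and ``s_sorted`` are illustrative names):

```python
import numpy as np
import pandas as pd

# Same construction as above: 'c' comes before 'b', so the index is unsorted
unsorted_index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
s = pd.Series(np.arange(6), index=unsorted_index)   # placeholder values

print(s.index.is_monotonic_increasing)               # before sorting

s_sorted = s.sort_index()
print(s_sorted.index.is_monotonic_increasing)        # after sorting

# Partial slicing on the sorted copy
print(s_sorted.loc['a':'b'])
```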

**Your turn.** Create a Series with an unsorted index, say ``list("asdfghjk")``, and subindex, say `[1,2]`, and random data. Then:
- try to slice it from a to j to produce an error
- sort the index and slice from a to j again

In [63]:
# write your code here - create the required Series



In [64]:
# write your code here - slice before sorting



In [65]:
# write your code here - sort and slice



**Important!** Unlike a Series with an unsorted MultiIndex, a Series with an unsorted ordinary index can be sliced by label, provided the slice bounds are unique labels in the index:

In [66]:
sr = pd.Series(np.random.randint(1,10,size=8), index=list("asdfghjk"))
sr["a":"j"]

a    8
s    5
d    8
f    8
g    1
h    5
j    2
dtype: int64

### Stacking and unstacking indices

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [67]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [68]:
pop.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [69]:
pop.unstack(level=0)

state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561


In [70]:
pop.unstack(level=1) # same as pop.unstack()

year            2000      2010
state
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


The opposite of ``unstack()`` is ``stack()``, which here can be used to recover the original series:

In [71]:
pop.unstack().stack()

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

**Your turn.** Try out ``stack()`` and ``unstack()`` at various levels on the ``health_data`` DataFrame. Observe how the indices are moved.

In [72]:
# write your code here



In [73]:
# write your code here



In [74]:
# write your code here



In [75]:
# write your code here



### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population ``Series`` will result in a ``DataFrame`` with a *state* and *year* column holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

In [76]:
pop_flat = pop.reset_index(name='population')  # without name='population' the last column has name 0
pop_flat

        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [77]:
pop_flat.set_index(['state', 'year'])

                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561


In practice, this type of reindexing is one of the more useful patterns when encountering real-world datasets. For instance, data preparation for analysis of variance or covariance requires such manipulations.
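
As a sanity check, ``reset_index`` and ``set_index`` undo each other when the same columns are used. A minimal sketch on a shortened copy of the population data (the names ``small``, ``flat`` and ``restored`` are illustrative):

```python
import pandas as pd

# Shortened copy of the population Series with named levels
small = pd.Series(
    [33871648, 37253956, 20851820, 25145561],
    index=pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                     names=['state', 'year']))

# Index levels -> ordinary columns
flat = small.reset_index(name='population')

# Columns -> index levels again; selecting the column gives back a Series
restored = flat.set_index(['state', 'year'])['population']

print(flat.columns.tolist())
print(restored.index.names)
```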

**Your turn.** Reset index of the ``health_data`` DataFrame. Inspect the result. Then restore the original indexing with the help of the ``set_index()`` method.

In [78]:
# write your code here - reset index



In [79]:
# write your code here - restore index



---

## Exercises

**Exercise 2.5.1** Create the following MultiIndex Series with entries being random numbers:

```
L0  L1
A   a     32.39
    b     95.90
B   a     11.31
    b     72.08
C   a      6.67
    b     45.17
D   a     64.91
    b     40.99
```

In [80]:
# write your solution here



- Interchange ``L0`` and ``L1`` indexes <br> Hint: you could stack-unstack, or use the ``swaplevel()`` method, see [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.swaplevel.html).

In [81]:
# write your solution here



- Slice sub-series from ``A`` to ``B``

In [82]:
# write your solution here



- Slice sub-series having ``L1=a`` only

In [83]:
# write your solution here



- Reset the index, that is, obtain a DataFrame with columns ``L0``, ``L1``, and ``L2``, where ``L2`` holds the series values

In [84]:
# write your solution here



---

**Exercise 2.5.2** Create the following MultiIndex DataFrame with entries being random numbers of the given precision:

```
C0     rate        temp            
C1     slow  fast   low   med  high
L0 L1                              
A  a   2.92  9.23  69.2  69.1  77.1
   b   6.46  0.91   0.5  36.9  41.7
B  a   2.70  7.39  62.8  47.8  52.6
   b   2.91  2.48  61.0  20.1  16.0
C  a   1.86  3.74  22.4  40.0  60.4
   b   5.47  0.26  11.3  38.5  17.3
D  a   1.75  0.21  95.7  89.4  88.8
   b   5.22  1.12  90.9  75.7  99.4
```

In [85]:
# write your solution here



- Slice the ``temp`` sub-DataFrame

In [86]:
# write your solution here



- Slice the ``C``-indexed ``temp`` sub-DataFrame

In [87]:
# write your solution here



- Slice the ``slow`` rate and ``med`` to ``high`` temperatures sub-DataFrame

In [88]:
# write your solution here



- Use ``pd.IndexSlice`` to slice the ``b`` indexed and ``slow`` rate and ``low`` temperature sub-DataFrame 

In [89]:
# write your solution here
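
As a hint, here is a sketch of two selection styles on a made-up frame (``toy`` holds the numbers 0 to 19 so the result is easy to check): an explicit list of column tuples, and ``pd.IndexSlice`` with a selector for every level.

```python
import numpy as np
import pandas as pd

rows = pd.MultiIndex.from_product([['A', 'B'], ['a', 'b']], names=['L0', 'L1'])
cols = pd.MultiIndex.from_tuples(
    [('rate', 'slow'), ('rate', 'fast'),
     ('temp', 'low'), ('temp', 'med'), ('temp', 'high')],
    names=['C0', 'C1'])
toy = pd.DataFrame(np.arange(20.0).reshape(4, 5), index=rows, columns=cols)

# Explicit list of (C0, C1) pairs: works even when the columns are not sorted
picked = toy.loc[:, [('rate', 'slow'), ('temp', 'med'), ('temp', 'high')]]

idx = pd.IndexSlice
# Rows: any L0 with L1 == 'b'.  Columns: any C0 with C1 in {'slow', 'low'}.
sub = toy.loc[idx[:, 'b'], idx[:, ['slow', 'low']]]
```

A label range such as ``'med':'high'`` on the inner column level would require lexsorted columns, which is why the sketch lists the wanted pairs instead.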



- Flatten column names. Use ``df.columns = df.columns.map('_'.join)``.

In [90]:
# write your solution here



- Restore the original column structure

In [91]:
# write your solution here
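
Flattening and restoring are inverse string operations on the column labels. The following sketch uses a made-up frame and assumes that no label contains an underscore, since otherwise the split would be ambiguous.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [('rate', 'slow'), ('rate', 'fast'), ('temp', 'low')],
    names=['C0', 'C1'])
toy = pd.DataFrame(np.zeros((2, 3)), columns=cols)

flat = toy.copy()
# Each column label is a tuple of strings; join its parts with an underscore
flat.columns = flat.columns.map('_'.join)

restored = flat.copy()
# Split every flat name back into a tuple and rebuild the MultiIndex
restored.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split('_')) for name in flat.columns],
    names=['C0', 'C1'])
```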



---

**Exercise 2.5.3** Create the following Pandas objects:

- A MultiIndex DataFrame from the given symmetric nested dictionary. Hint: you need to loop through ``nested_dict.items()``.

In [92]:
nested_dict = {'India': {'State': ['Maharashtra', 'West Bengal', 
                                   'Uttar Pradesh', 'Bihar', 'Karnataka'], 
                         'Capital': ['Mumbai', 'Kolkata', 'Lucknow', 
                                     'Patna', 'Bengaluru']}, 
  
               'America': {'State': ['California', 'Florida', 'Georgia', 
                                     'Massachusetts', 'New York'], 
                           'Capital': ['Sacramento', 'Tallahassee', 'Atlanta', 
                                       'Boston', 'Albany']},
               
               'British Isles': {'State': ['England', 'Scotland', 'Wales',
                                           'Northern Ireland', 'Ireland'],
                                 'Capital': ['London', 'Edinburgh', 'Cardiff',
                                             'Belfast', 'Dublin']}
              }

In [93]:
# the basic method does not work in this case

pd.DataFrame.from_dict(nested_dict, orient='columns')

Unnamed: 0,India,America,British Isles
State,"[Maharashtra, West Bengal, Uttar Pradesh, Biha...","[California, Florida, Georgia, Massachusetts, ...","[England, Scotland, Wales, Northern Ireland, I..."
Capital,"[Mumbai, Kolkata, Lucknow, Patna, Bengaluru]","[Sacramento, Tallahassee, Atlanta, Boston, Alb...","[London, Edinburgh, Cardiff, Belfast, Dublin]"


In [94]:
# inspect this object

nested_dict.items()

dict_items([('India', {'State': ['Maharashtra', 'West Bengal', 'Uttar Pradesh', 'Bihar', 'Karnataka'], 'Capital': ['Mumbai', 'Kolkata', 'Lucknow', 'Patna', 'Bengaluru']}), ('America', {'State': ['California', 'Florida', 'Georgia', 'Massachusetts', 'New York'], 'Capital': ['Sacramento', 'Tallahassee', 'Atlanta', 'Boston', 'Albany']}), ('British Isles', {'State': ['England', 'Scotland', 'Wales', 'Northern Ireland', 'Ireland'], 'Capital': ['London', 'Edinburgh', 'Cardiff', 'Belfast', 'Dublin']})])

In [95]:
# write your solution here
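
Following the hint, one way to sketch the idea is to turn each inner dictionary into its own DataFrame and let ``pd.concat`` use the outer keys as an extra level. The dictionary ``toy_dict`` below is a made-up miniature of ``nested_dict``.

```python
import pandas as pd

toy_dict = {'X': {'State': ['s1', 's2'], 'Capital': ['c1', 'c2']},
            'Y': {'State': ['s3', 's4'], 'Capital': ['c3', 'c4']}}

# One small DataFrame per outer key, built while looping over .items()
frames = {country: pd.DataFrame(inner) for country, inner in toy_dict.items()}

# Passing a dict to concat() uses its keys as the new outer level
wide = pd.concat(frames, axis=1)   # outer keys become the top column level
long = pd.concat(frames, axis=0)   # outer keys become the top row level
```

Whether the wide or the long layout is the better answer depends on how the data will be queried; both are MultiIndex DataFrames.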



- Create a MultiIndex DataFrame from the given non-symmetric nested dictionary

In [96]:
nested_dict = {'India': {'State': ['Maharashtra', 'West Bengal', 
                                   'Uttar Pradesh', 'Bihar', 'Karnataka'], 
                         'Capital': ['Mumbai', 'Kolkata', 'Lucknow', 
                                     'Patna', 'Bengaluru']}, 
  
               'America': {'State': ['California', 'Florida', 'Georgia', 
                                     'Massachusetts', 'New York'], 
                           'Capital': ['Sacramento', 'Tallahassee', 'Atlanta', 
                                       'Boston', 'Albany']},
               
               'UK': {'State': ['England', 'Scotland', 'Wales',
                                'Northern Ireland'],
                      'Capital': ['London', 'Edinburgh', 'Cardiff',
                                  'Belfast']},
              }

In [97]:
# write your solution here
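
When the inner lists differ in length between outer keys, the two concatenation directions behave differently. This sketch uses a made-up ``toy_dict`` and assumes that the lists within each outer key still have equal length.

```python
import pandas as pd

toy_dict = {'X': {'State': ['s1', 's2', 's3'], 'Capital': ['c1', 'c2', 'c3']},
            'Y': {'State': ['s4', 's5'], 'Capital': ['c4', 'c5']}}

frames = {country: pd.DataFrame(inner) for country, inner in toy_dict.items()}

# Side by side: rows are aligned on the integer index, shorter blocks get NaN
wide = pd.concat(frames, axis=1)

# Stacked: each block keeps its own length, so no padding is needed
long = pd.concat(frames, axis=0)
```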



---

**Exercise 2.5.4** Consider the following DataFrame:

In [98]:
from datetime import datetime
basic_index = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b', 'c']])

orders = pd.DataFrame(
    data={
        'customer': [1, 2, 3, 1, 2, 3, 3, 1, 3, 1, 2],
        'order_date': [
            datetime(2022, 1, 3),
            datetime(2022, 1, 5),
            datetime(2022, 1, 7),
            datetime(2022, 1, 8),
            datetime(2022, 1, 8),
            datetime(2022, 1, 8),
            datetime(2022, 1, 9),
            datetime(2022, 1, 9),
            datetime(2022, 1, 10),
            datetime(2022, 1, 10),
            datetime(2022, 1, 11)
        ],
        'amount': [25, 42, 116, 11, 10, 21, 23, 13, 4, 67, 87]
    }
)

orders

Unnamed: 0,customer,order_date,amount
0,1,2022-01-03,25
1,2,2022-01-05,42
2,3,2022-01-07,116
3,1,2022-01-08,11
4,2,2022-01-08,10
5,3,2022-01-08,21
6,3,2022-01-09,23
7,1,2022-01-09,13
8,3,2022-01-10,4
9,1,2022-01-10,67
10,2,2022-01-11,87


- Create a sorted MultiIndex DataFrame with ``customer`` set to level 0 index and ``order_date`` set to level 1 index

In [99]:
# write your solution here



- Update the records dated 2022-01-08 by adding 10 to ``amount``. Hint: use ``pd.IndexSlice``

In [100]:
# write your solution here



- Slice the DataFrame from Jan 8 to Jan 10

In [101]:
# write your solution here
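
The three steps of this exercise can be sketched on a four-row stand-in. The frame ``toy`` and its values are made up, and explicit ``pd.Timestamp`` objects are used here to sidestep any string-parsing subtleties.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer': [2, 1, 1, 2],
    'order_date': pd.to_datetime(['2022-01-08', '2022-01-09',
                                  '2022-01-08', '2022-01-11']),
    'amount': [10, 13, 11, 87],
})

# Move both columns into the index, then sort so that label slicing is allowed
by_cust = toy.set_index(['customer', 'order_date']).sort_index()

idx = pd.IndexSlice
day = pd.Timestamp('2022-01-08')

# Add 10 to every order placed on that day, whatever the customer
by_cust.loc[idx[:, day], 'amount'] += 10

# Inclusive date range on the inner level, across all customers
window = by_cust.loc[idx[:, day:pd.Timestamp('2022-01-10')], :]
```

Slicing a range on the inner level only works on a sorted index, which is why ``sort_index()`` comes first.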



---

**Exercise 2.5.5** Consider the following DataFrame:

In [102]:
stocks = pd.DataFrame([['2016-10-03', 31.50, 14070500, 'CSCO'],
                       ['2016-10-03', 112.52, 21701800, 'AAPL'],
                       ['2016-10-03', 57.42, 19189500, 'MSFT'],
                       ['2016-10-04', 113.00, 29736800, 'AAPL'],
                       ['2016-10-04', 57.24, 20085900, 'MSFT'],
                       ['2016-10-04', 31.35, 18460400, 'CSCO'],
                       ['2016-10-05', 57.64, 16726400, 'MSFT'],
                       ['2016-10-05', 31.59, 11808600, 'CSCO'],
                       ['2016-10-05', 113.05, 21453100, 'AAPL']],
                      columns=['Date', 'Close', 'Volume', 'Symbol'])
stocks["Date"]=pd.to_datetime(stocks["Date"])
stocks

Unnamed: 0,Date,Close,Volume,Symbol
0,2016-10-03,31.5,14070500,CSCO
1,2016-10-03,112.52,21701800,AAPL
2,2016-10-03,57.42,19189500,MSFT
3,2016-10-04,113.0,29736800,AAPL
4,2016-10-04,57.24,20085900,MSFT
5,2016-10-04,31.35,18460400,CSCO
6,2016-10-05,57.64,16726400,MSFT
7,2016-10-05,31.59,11808600,CSCO
8,2016-10-05,113.05,21453100,AAPL


- Combine ``Symbol`` and ``Date`` into a multiindex with ``Symbol`` at level 0 and ``Date`` at level 1. Then sort the index.

In [103]:
# write your solution here



- Slice ``AAPL`` and ``MSFT`` stocks on ``2016-10-05``

In [104]:
# write your solution here



- Slice ``CSCO`` on ``2016-10-05`` and ``2016-10-03``

In [105]:
# write your solution here



- Slice all stocks from ``2016-10-03`` to ``2016-10-04``

In [106]:
# write your solution here
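
The selections in this exercise mix scalars, lists, and ranges across the two levels. A sketch on made-up data (the symbols ``AAA`` and ``BBB`` and all prices are hypothetical) might look like this:

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': pd.to_datetime(['2016-10-03', '2016-10-03', '2016-10-04',
                            '2016-10-04', '2016-10-05', '2016-10-05']),
    'Symbol': ['BBB', 'AAA', 'AAA', 'BBB', 'BBB', 'AAA'],
    'Close': [10.0, 20.0, 21.0, 11.0, 12.0, 22.0],
})

# Symbol becomes level 0, Date level 1; sorting enables range slicing
by_sym = toy.set_index(['Symbol', 'Date']).sort_index()

idx = pd.IndexSlice
d3, d4, d5 = (pd.Timestamp(f'2016-10-0{d}') for d in (3, 4, 5))

# A list on the outer level and a single date on the inner level
two_on_day = by_sym.loc[idx[['AAA', 'BBB'], d5], :]

# A single symbol and a list of non-adjacent dates
one_two_days = by_sym.loc[idx['BBB', [d5, d3]], :]

# Every symbol, inclusive date range on the inner level
all_range = by_sym.loc[idx[:, d3:d4], :]
```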



---

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; also available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*