<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# **multi indexing** in  *pandas*

   - in a *pandas.DataFrame*, the axis labels are the **row** and **column** labels
   - they are stored in *Index* **objects**
   - with a **row** *Index* and a **column** *Index* you get a **two-dimensional labelled** structure

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame(
  {'row': [0, 1, 2],
   'one_X': [1, 2, 3],
   'one_Y': [4, 5, 6],
   'two_X': [10, 20, 30],
   'two_Y': [40, 50, 60]})
df

In [None]:
# we set the index to the 'row' column
df = df.set_index('row')   
df

   - the column names suggest that this data is more **structured** than a flat table
   - we can see **two labelled pairs** of values $(X_{one}, Y_{one})$ and $(X_{two}, Y_{two})$
   - the **first** pair is labelled **one** and the **second** is labelled **two**

   - there is a **hierarchy** in the labels

the **two pairs** of values $(X_{one}, Y_{one})$ and $(X_{two}, Y_{two})$ can be seen as:
   - **two labels**: $one$ and $two$
   - each holding **two sub-labels**: **X** and **Y**
   - each sub-label holding **three values**, indexed by **row**

something like this:

|$\  $ |one   |$\ $ |$\ $ |two | $\ $|
|-     |-     |-    |    -|  - |-    |
|$\ $  |**X**     |**Y**    |$\ $ |**X**   |**Y**    |
|**row**   | $\ $ |$\ $ |$\ $ |$\ $|$\ $ | 
|**0**  |1     |4    |$\ $ |10  |40   |
|**1**  |2     |5    |$\ $ |20  |50   |
|**2**  |3     |6    |$\ $ |30  |60   |


this is **multi-indexing**

   - you want to express **multi-dimensionality** in a data structure of **lower dimension**
      - a *pandas.Series* with **more than one** dimension
      - a *pandas.DataFrame* with **more than two** dimensions
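
As a minimal sketch of the first case, the Series below carries two dimensions (pair and coordinate) in its index. The values are taken from row 0 of the example above; the names `s`, `value` and `table` are only illustrative.

```python
import pandas as pd

# a one-dimensional Series whose index has two levels
index = pd.MultiIndex.from_tuples(
    [('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')])
s = pd.Series([1, 4, 10, 40], index=index)

# a full tuple of labels designates a single value
value = s.loc[('two', 'X')]

# unstack() moves the inner level to the columns,
# which makes the second dimension visible
table = s.unstack()
print(table)
```
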
      
      

   - the **columns** *Index* is **replaced by** a columns *MultiIndex*

   - the **multi-indexing** is expressed with **tuples** of **related labels**

In [None]:
tuples_from_pairs = [
    ('one', 'X'), ('one', 'Y'),
    ('two', 'X'), ('two', 'Y')]

In [None]:
# we create a multi-index object from the tuples
pd.MultiIndex.from_tuples(tuples_from_pairs)

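a multi-index is **composed** of:
   - the **levels**: one list of unique labels per level, from the outermost to the innermost, here $[[one, two], [X, Y]]$
   - the **codes**: for each level, the integer position in that level of the label used by each entry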
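
A short sketch of how these two components can be inspected, using the `levels` and `codes` attributes of the multi-index. The variable names are illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')])

# one list of unique labels per level
levels = [level.tolist() for level in mi.levels]

# one list of integer positions per level; each position
# points into the matching list of labels
codes = [code.tolist() for code in mi.codes]

print(levels)
print(codes)
```

Reading the codes column by column gives back the tuples: the third entry has codes 1 and 0, i.e. the labels *two* and *X*.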

   - you **replace** the columns index by a columns multi-index

In [None]:
df.columns = pd.MultiIndex.from_tuples(tuples_from_pairs)

df

   - you now have an indexing with **hierarchical columns**

   - to **access** a multi-index, use *pandas.DataFrame.loc* and *pandas.DataFrame.iloc*
   - the first index is the row and the second is the column

In [None]:
# first row, all columns
df.loc[0]

In [None]:
# first row, columns under the label 'one'
df.loc[0, 'one'] 

In [None]:
df.loc[0, ['one', 'two']] # first row
                          # list of columns 'one' and 'two'

   - the index of the columns is **hierarchical**
   - i.e. it can be described using **tuples** of **labels**
   - the same **tuples** you use to construct the **multi-index**
   - $(one, X), (one, Y), (two, X), (two, Y)$

   - you can use **tuples** of **labels** with **.loc**

In [None]:
df.loc[0, ('two', 'X')] # first row
                        # column label ('two', 'X')

In [None]:
df.loc[[0, 2], ('two', 'X')] # column label ('two', 'X')
                             # of first and third rows
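
The shape of the result depends on how much of the hierarchy the key covers. The sketch below rebuilds the same dataframe in one step so that it runs on its own; the names `partial`, `full` and `several` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 4, 10, 40], [2, 5, 20, 50], [3, 6, 30, 60]],
    index=pd.Index([0, 1, 2], name='row'),
    columns=pd.MultiIndex.from_tuples(
        [('one', 'X'), ('one', 'Y'), ('two', 'X'), ('two', 'Y')]))

# outer label only: a Series indexed by the inner labels X and Y
partial = df.loc[0, 'one']

# full tuple of labels: a single value
full = df.loc[0, ('two', 'X')]

# list of rows and full tuple: a Series indexed by the row labels
several = df.loc[[0, 2], ('two', 'X')]
```
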

   - you can also use **.iloc** to access rows and columns by position

In [None]:
df.iloc[0] # first row

   - multi-index on **rows** and **columns**

In [None]:
# index for years and visits
index = pd.MultiIndex.from_product(
    [[2013, 2014],
     [1, 2, 3]],
    names=['year', 'visit'])

In [None]:
# columns for patients and heart-rate measurements
columns = pd.MultiIndex.from_product(
    [['Alice', 'Bob'],
     ['before test', 'after test']],
    names=['Patient', 'HeartRate'])
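
*MultiIndex.from_product* builds every combination of the labels it receives, in order. As a sketch, the comparison below checks it against an explicit list of tuples produced with `itertools.product`; the names `mi_product` and `mi_tuples` are illustrative.

```python
from itertools import product

import pandas as pd

labels = [['Alice', 'Bob'], ['before test', 'after test']]

# the cartesian product computed by pandas
mi_product = pd.MultiIndex.from_product(labels)

# the same product written out as explicit tuples
mi_tuples = pd.MultiIndex.from_tuples(list(product(*labels)))

same = mi_product.equals(mi_tuples)
print(list(mi_product))
```
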

In [None]:
# random heart rates from 60 to 99 beats per minute (the upper bound 100 is excluded)
data = np.random.randint(60, 100, 24).reshape(6, 4)

medical_data = pd.DataFrame(data, index=index, columns=columns)
medical_data

In [None]:
medical_data.columns

In [None]:
medical_data.loc[:, 'Alice'] # all medical data on Alice

In [None]:
medical_data.loc[(2013, 2), ('Alice', 'before test')] # Alice's heart rate 'before test'
                                                      # at the second visit of 2013

In [None]:
medical_data.loc[(2013, 2), ('Alice', 'before test')] = 82

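   - always modify an element with a single **.loc** or **.iloc** call
   - never assign through chained indexing such as *df[col][row] = value*: it may modify a temporary copy instead of the dataframe
   - http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a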
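
A minimal sketch of the recommended form, on a made-up dataframe where every value starts at 70 so that the change is easy to check. The name `demo` is illustrative, and the chained form is left as a comment because its effect depends on the pandas version.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2013, 2014], [1, 2, 3]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Alice', 'Bob'], ['before test', 'after test']])

# made-up data: every heart rate starts at 70
demo = pd.DataFrame(np.full((6, 4), 70), index=index, columns=columns)

# a single .loc call with (row key, column key) writes into demo itself
demo.loc[(2013, 2), ('Alice', 'before test')] = 82

# chained form to avoid, it may write into a temporary copy:
# demo['Alice']['before test'][(2013, 2)] = 82

changed = int((demo.values != 70).sum())
print(changed)
```
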

In [None]:
medical_data < 80 # element-wise comparison, returns a dataframe of booleans
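
The boolean dataframe keeps the same row and column multi-indexes, so it can be used for counting or masking. The sketch below uses deterministic made-up values (60 to 83) instead of random ones; the names `rates`, `mask` and the others are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [[2013, 2014], [1, 2, 3]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product(
    [['Alice', 'Bob'], ['before test', 'after test']])

# made-up, deterministic heart rates from 60 to 83
rates = pd.DataFrame(
    np.arange(60, 84).reshape(6, 4), index=index, columns=columns)

mask = rates < 80

# number of measurements below 80 in each column
count_per_column = mask.sum()

# same shape as rates, with NaN wherever the test is False
below_80 = rates[mask]

# visits where every measurement is below 80
all_below = mask.all(axis=1)
```
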