# The index

**Prerequisites**

- [Introduction to pandas](03_pandas_intro.ipynb)  
- [Core pandas](04_core_pandas.ipynb)


**Outcomes**

- Understand how the index is used to align data  
- Know how to set and reset the index  
- Understand how to select subsets of data by slicing on index and columns  
- Understand that for DataFrames, the column names also align data  

In [1]:
import pandas as pd
import numpy as np

## So what is this index?

Every Series or DataFrame has an index

We told you that the index was the “row labels” for the data

This is true, but an index in pandas does much more than label the rows

The purpose of this lecture is to understand the importance of the index

The [pandas
documentation](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
says

> Data alignment is intrinsic. The link between labels and data will
not be broken unless done so explicitly by you.


In practice what this means is that when operating on multiple
DataFrames, the index and column names are used to make sure the data is
properly aligned

This is a somewhat abstract concept that is best understood by
example…

Let’s begin by making an example DataFrame

In [2]:
d1 = {'one' : [1., 2., 3., 4.],
      'two' : [4., 3., 2., 1.]}
df1 = pd.DataFrame(d1, index=list("abcd"))
df1_copy = df1.copy()  # Creates a new DataFrame with exactly the same data
df1

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0


Observe what happens when we evaluate `df1 + df1_copy`

In [3]:
df1 + df1_copy

   one  two
a  2.0  8.0
b  4.0  6.0
c  6.0  4.0
d  8.0  2.0


Notice that this operated *elementwise*, meaning that the `+`
operation was applied to each element of `df1` and the corresponding
element of `df1_copy`

Let’s make another DataFrame, with a slightly different index…

In [4]:
d2 = {'one' : [100, 400, 200],
      'two' : [40., 10., 30],
      'three' : [-1, -2, -4],}
df2 = pd.DataFrame(d2, index=["a", "d", "b"])
df2

   one   two  three
a  100  40.0     -1
d  400  10.0     -2
b  200  30.0     -4


Notice a few things about the index and column names of our two
DataFrames:

- `df2` does not have a row labeled `c`  
- Rows `b` and `d` are not ordered the same in `df1` and `df2`  
- `df2` contains a column named `three` that `df1` does not have  


Let’s see what happens when we try to add `df1` and `df2`

In [5]:
df12 = df1 + df2
df12

     one  three   two
a  101.0    NaN  44.0
b  202.0    NaN  33.0
c    NaN    NaN   NaN
d  404.0    NaN  11.0


Whoa, a lot happened! Let’s break it down.

### Automatic alignment

For all (row, column) combinations that appear in both DataFrames (e.g.
rows [a, b, d] and columns [one, two]) the value of `df12` is equal to
`df1.loc[row, col] + df2.loc[row, col]`

This happened even though the rows and columns were not in the same
order

We refer to this as pandas *aligning* the data for us

To see how awesome this is, think about how to do something similar in
Excel:

- `df1` and `df2` would be in different sheets  
- The index and column names would be the first column and row in each
  sheet  
- We would have a third sheet to hold the sum  
- For each label in the first row and column of *either* the `df1`
  sheet or the `df2` sheet we would have to use an `IF` to check
  if the label exists in the other sheet and then a `VLOOKUP` to
  extract the value  


In pandas this happens automatically, behind the scenes, and *very
quickly*
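
The alignment rule described in this section can be checked label by label. The sketch below rebuilds `df1` and `df2` from the cells above; the names `common_rows` and `common_cols` are introduced here only for illustration.

```python
import pandas as pd

# Same data as df1 and df2 above
df1 = pd.DataFrame({"one": [1., 2., 3., 4.], "two": [4., 3., 2., 1.]},
                   index=list("abcd"))
df2 = pd.DataFrame({"one": [100, 400, 200], "two": [40., 10., 30],
                    "three": [-1, -2, -4]}, index=["a", "d", "b"])
df12 = df1 + df2

# Labels that appear in both DataFrames
common_rows = df1.index.intersection(df2.index)
common_cols = df1.columns.intersection(df2.columns)

# For every shared (row, column) pair the sum is computed by label,
# regardless of where the row sits in each DataFrame
for row in common_rows:
    for col in common_cols:
        assert df12.loc[row, col] == df1.loc[row, col] + df2.loc[row, col]

# The result carries every label found in either DataFrame
print(sorted(df12.index))
print(sorted(df12.columns))
```

Nothing in the loop refers to a row position, only to labels, which is the point of alignment.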

### Handling missing data

For all elements in row `c` or column `three`, the value in `df12`
is `NaN`

This is how pandas represents *missing data*

The reason pandas treats this row and column as missing is that each
appears in only one of `df1` and `df2`

So, when trying to look up the values in `df1` and `df2`, it could
only find a value in one DataFrame: the other value was missing

When pandas tries to add a number to something that is missing, it says
that the result is missing (or `NaN`)
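
A small sketch of this propagation, again rebuilding `df1` and `df2` from above. The `add` method with `fill_value` is not used elsewhere in this lecture; it is shown here as one possible way to treat a value that is missing on one side as a number instead.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"one": [1., 2., 3., 4.], "two": [4., 3., 2., 1.]},
                   index=list("abcd"))
df2 = pd.DataFrame({"one": [100, 400, 200], "two": [40., 10., 30],
                    "three": [-1, -2, -4]}, index=["a", "d", "b"])

# A number plus a missing value is a missing value
print(np.nan + 1.0)

# Row "c" exists only in df1, so every entry of the sum in that row is NaN
print((df1 + df2).loc["c"])

# df1.add(df2) is the method form of df1 + df2; fill_value replaces a value
# that is missing on one side only before the addition happens
filled = df1.add(df2, fill_value=0)
print(filled)
```

Entries that are missing from *both* DataFrames, such as row `c` of column `three`, remain `NaN` even with `fill_value`.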

<blockquote>

**Check for understanding**

What happens when you apply the `mean` method to `df12`?

In particular, what happens to columns that have missing data? (HINT:
also looking at the output of the `sum` method might help)


</blockquote>

## Setting the index

In the examples above, we set the index when the DataFrame was created

This was ok, because we were creating the DataFrame by hand

More often, we will import our data from some resource (e.g. a file on
our computer or from the web), in which case we can’t set the index by
hand

For a DataFrame `df`, the `df.set_index` method allows us to use one
(or more) of the DataFrame’s columns as the index

Here’s an example

In [6]:
# first, create the DataFrame
d3 = {
    "X1": range(6),
    "X2": range(6, 12),
    "X3": ["A", "B", "C", "D", "E", "F"],
    "X4": "one one one three two two".split()
}
df3 = pd.DataFrame(d3)
df3

   X1  X2 X3     X4
0   0   6  A    one
1   1   7  B    one
2   2   8  C    one
3   3   9  D  three
4   4  10  E    two
5   5  11  F    two


Notice that we did not pass an index and pandas set the index to count
from 0 to the number of rows in `df3` (i.e. `range(df3.shape[0])`)

This is the default index that pandas creates for us when we don’t
supply one
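
As a quick check of the default, here is a minimal sketch on a throwaway DataFrame (the name `df` and its single column are made up for this example):

```python
import pandas as pd

# No index argument is passed, so pandas builds the default one
df = pd.DataFrame({"X1": range(6)})

print(type(df.index))
print(list(df.index) == list(range(df.shape[0])))
```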

Suppose now that we would like the `X3` column to be the index

We can make this happen using `df3.set_index("X3")`

<blockquote>

**Check for understanding**

What type of object is returned by the `set_index` method? How can you
be sure?

Does calling `df3.set_index("X3")` change `df3`? How could you
*save* the results of the call to `set_index`? Discuss with your
neighbor


</blockquote>

### Setting a Hierarchical index

We can even set more than one column to be the index

To do this we pass a list of multiple column names to `set_index`

In [7]:
df4 = df3.set_index(["X4", "X3"])
df4

          X1  X2
X4    X3
one   A    0   6
      B    1   7
      C    2   8
three D    3   9
two   E    4  10
      F    5  11


Notice that in the display above, the row labels seem to have two
*levels* now

The *outer* (or left-most) level is named `X4` and the *inner* (or
right-most) level is named `X3`
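
The levels can also be inspected programmatically. The sketch below rebuilds `df3` and `df4` from the cells above and uses a few standard attributes of the index object.

```python
import pandas as pd

df3 = pd.DataFrame({
    "X1": range(6),
    "X2": range(6, 12),
    "X3": ["A", "B", "C", "D", "E", "F"],
    "X4": "one one one three two two".split(),
})
df4 = df3.set_index(["X4", "X3"])

print(type(df4.index))                     # the index is a MultiIndex
print(df4.index.nlevels)                   # number of levels
print(list(df4.index.names))               # level names, outer-most first
print(df4.index.get_level_values("X4"))    # labels of the outer level, row by row
print(df4.index[0])                        # each row label is a tuple
```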

### Slicing a Hierarchical index

As with our normally indexed DataFrames, we can get data from our
DataFrame using `df.loc`, though the rules are a bit more complicated

We will summarize the main rules, and then work through an exercise that
demonstrates each of them:

**Slicing rules**

Pandas slicing reacts differently to `list`s and `tuple`s

It does this in order to provide more flexibility for you to select the
data you want

`list` in row slicing will be an “or” operation where it chooses rows
based on whether the index value corresponds to any element of the list

`tuple` in row slicing will be used to denote a single hierarchical
index and must include a value for each level

**Row slicing examples**

1. `df.loc[id11]`: all rows where the *outer-most* index value is
  equal to `id11`  
1. `df.loc[(id11, id21)]`: all rows where the *outer-most* index value
  is equal to `id11` and the second level is equal to `id21`  
1. `df.loc[[id11, id12]]`: all rows where the *outer-most* index is
  either `id11` or `id12`  
1. `df.loc[([id11, id12], [id21, id22]), :]`: all rows where the
  *outer-most* index is either `id11` or `id12` AND where the
  second level index is either `id21` or `id22`  
1. `df.loc[[(id11, id21), (id12, id22)], :]`: all rows where the
  two hierarchical indices are either `(id11, id21)` or
  `(id12, id22)`  


We can also restrict `.loc` to extract certain columns by doing

1. `df.loc[rows, col1]`: return the rows specified by rows (see rules
  above) and only column named `col1` (returned object will be a
  Series)  
1. `df.loc[rows, [col1, col2]]`: return the rows specified by rows
  (see rules above) and only columns `col1` and `col2`  
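
To make the rules concrete, here is a sketch on a small made-up DataFrame. The data, the level names `region` and `product`, and the variable names are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical data: outer level "region", inner level "product"
df = pd.DataFrame(
    {"units": [10, 20, 30, 40], "price": [1.5, 2.5, 3.5, 4.5]},
    index=pd.MultiIndex.from_tuples(
        [("east", "x"), ("east", "y"), ("west", "x"), ("west", "y")],
        names=["region", "product"],
    ),
)

# Row rule 1: a single label selects on the outer-most level
r1 = df.loc["east"]

# Row rule 2: a tuple with one value per level identifies a single row
r2 = df.loc[("east", "y")]

# Row rule 3: a list is an "or" over the outer-most level
r3 = df.loc[["east", "west"]]

# Row rule 4: a tuple of lists is an "or" within each level, "and" across levels
r4 = df.loc[(["east", "west"], ["x"]), :]

# Row rule 5: a list of tuples picks exactly those rows
r5 = df.loc[[("east", "x"), ("west", "y")], :]

# Column rules: a single column name gives a Series, a list gives a DataFrame
c1 = df.loc["east", "units"]
c2 = df.loc[["east"], ["units", "price"]]

for name, result in [("r1", r1), ("r2", r2), ("r3", r3), ("r4", r4),
                     ("r5", r5), ("c1", c1), ("c2", c2)]:
    print(name, type(result).__name__, result.shape)
```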


<blockquote>

**Check for understanding**

For each of the examples below do the following (*hint*: this is more fun with a partner!):

- Determine which of the rules above applies  
- Identify the `type` of the returned value  
- Explain why the slicing operation returned the data it did  



</blockquote>

In [8]:
df4.loc[["one", "three"]]

          X1  X2
X4    X3
one   A    0   6
      B    1   7
      C    2   8
three D    3   9



In [9]:
df4.loc[(["one", "three"], ["A", "B", "C"]), :]

        X1  X2
X4  X3
one A    0   6
    B    1   7
    C    2   8



In [10]:
df4.loc["one"]

    X1  X2
X3
A    0   6
B    1   7
C    2   8



In [11]:
df4.loc[("one", "C"), ["X1", "X2"]]

X1    2
X2    8
Name: (one, C), dtype: int64


In [12]:
df4.loc[("one", "C")]

X1    2
X2    8
Name: (one, C), dtype: int64


In [13]:
df4.loc[[("one", "C"), ("three", "D")]]

          X1  X2
X4    X3
one   C    2   8
three D    3   9



In [14]:
df4.loc[["one", "three"], "X1"]

X4     X3
one    A     0
       B     1
       C     2
three  D     3
Name: X1, dtype: int64


In [15]:
df4.loc["one", "X1"]

X3
A    0
B    1
C    2
Name: X1, dtype: int64


### Alignment with `MultiIndex`

The data alignment features we talked about above also apply to a
`MultiIndex` DataFrame

The exercise below gives you a chance to experiment with this

<blockquote>

**Check for understanding**

Try setting `df5` to some subset of the rows in `df4` (use one of the
`.loc` variations above)

Then see what happens when you do `df4 / df5` or `df5 ** df4`

Try changing the subset of rows in `df5` and repeat until you
understand what is happening


</blockquote>

### `pd.IndexSlice`

When we want to extract rows for a few values of the outer index and all
values for an inner index level, we can use the convenient
`df.loc[[id11, id12]]` shorthand

However, if we want to extract only a few values from an inner index
level and *all* values from the outer-most level, this will not work

To get around this limitation, we can use the `pd.IndexSlice` helper

Here’s an example

In [16]:
df4.loc[pd.IndexSlice[:, ["A", "D"]], :]

          X1  X2
X4    X3
one   A    0   6
three D    3   9


Notice that the `:` in the first part of `[:, ["A", "D"]]`
instructed pandas to give us rows for all values of the outer-most index
level
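
It may help to see what `pd.IndexSlice` actually produces. In the sketch below (which rebuilds `df4` and introduces the alias `idx` for illustration), the helper simply builds the tuple we could have written by hand with `slice(None)` in place of `:`.

```python
import pandas as pd

df3 = pd.DataFrame({
    "X1": range(6),
    "X2": range(6, 12),
    "X3": ["A", "B", "C", "D", "E", "F"],
    "X4": "one one one three two two".split(),
})
df4 = df3.set_index(["X4", "X3"])

idx = pd.IndexSlice  # short alias, an assumption of this example

# The helper turns the slicing syntax into an ordinary tuple
key = idx[:, ["A", "D"]]
print(key)

# Both forms select the same rows
with_helper = df4.loc[key, :]
by_hand = df4.loc[(slice(None), ["A", "D"]), :]
print(with_helper.equals(by_hand))
```

The helper exists because `:` is only valid syntax inside square brackets, so it cannot be written directly inside a tuple.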

<blockquote>

**Check for understanding**

Below we create `df6`, which is the same as `df4` except that the
levels of the index are swapped

In the cells that follow the definition of `df6` we have commented out
a few of the slicing examples from the previous exercise

For each of these examples, use `pd.IndexSlice` to extract the same
data from `df6`

(HINT: you will need to *swap* the order of the row slicing arguments
within the `pd.IndexSlice`)


</blockquote>

In [17]:
df6 = df3.set_index(["X3", "X4"])


In [18]:
# df4.loc["one"]


In [19]:
# df4.loc[(["one", "three"], ["A", "B", "C"]), :]


In [20]:
# df4.loc[["one", "three"], "X1"]


### Multi-index columns

The functionality of `MultiIndex` also applies to the column names

Let’s see how it works

In [21]:
df4T = df4.T  # .T means "transpose" or "swap rows and columns"
df4T

X4 one       three two
X3   A  B  C     D   E   F
X1   0  1  2     3   4   5
X2   6  7  8     9  10  11


Notice that `df4T` seems to have two levels of names for the columns

The same logic laid out in the row slicing rules above applies when we
have a hierarchical index for column names

In [22]:
df4T.loc[:, "one"]

X3  A  B  C
X1  0  1  2
X2  6  7  8


In [23]:
df4T.loc[:, ["one", "three"]]

X4 one       three
X3   A  B  C     D
X1   0  1  2     3
X2   6  7  8     9


In [24]:
df4T.loc[:, (["one", "three"], "A")]

X4 one
X3   A
X1   0
X2   6


<blockquote>

**Check for understanding**

Use `pd.IndexSlice` to extract all data from `df4T` where the `X3`
level of the column names is one of `A`, `C` and `E`


</blockquote>

## Re-setting the index

The `df.reset_index` method will move one or more levels of the index
back into the DataFrame as normal columns

With no additional arguments, it moves all levels out of the index and
sets the index of the returned DataFrame to the default of
`range(df.shape[0])`

In [25]:
df4.reset_index()

      X4 X3  X1  X2
0    one  A   0   6
1    one  B   1   7
2    one  C   2   8
3  three  D   3   9
4    two  E   4  10
5    two  F   5  11
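
One way to see that `reset_index` and `set_index` undo each other is a round trip. This is a sketch that rebuilds `df4` from above; the names `flat` and `round_trip` are introduced only for this example.

```python
import pandas as pd

df3 = pd.DataFrame({
    "X1": range(6),
    "X2": range(6, 12),
    "X3": ["A", "B", "C", "D", "E", "F"],
    "X4": "one one one three two two".split(),
})
df4 = df3.set_index(["X4", "X3"])

# Move both index levels back into ordinary columns
flat = df4.reset_index()
print(list(flat.columns))

# Setting the same columns as the index again recovers df4
round_trip = flat.set_index(["X4", "X3"])
print(round_trip.equals(df4))
```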


<blockquote>

**Check for understanding**

Look up the documentation for the `reset_index` method and study it to
learn how to do the following:

- Move just the `X3` level of the index back as a column  
- Completely throw away all levels of the index  
- Remove the `X4` level of the index and *do not* keep it as a column  



</blockquote>

In [26]:
# remove just X3 level and add as column


In [27]:
# throw away all levels of index


In [28]:
# Remove X4 from the index -- don't keep it as a column


## Choose the index carefully

So, now that we know that the index and column names are used for
aligning data, a natural question is how should we pick the index?

To guide us to the right answer, we will list the first two components
of [Hadley Wickham’s](http://hadley.nz/) description of [tidy
data](http://vita.had.co.nz/papers/tidy-data.html):

> 1. Each column should have one variable  
> 1. Each row should have one observation  



If we strive to have our data in a tidy form (we should), then when
choosing the index we should set

- the row labels (index) to be a unique identifier for an observation
  of data  
- the column names to identify one variable  
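
Whether a candidate index uniquely identifies observations can be checked before committing to it. The sketch below reuses the `df3` data from earlier; `verify_integrity` is an optional argument of `set_index` that is not otherwise used in this lecture, and the `raised` flag is just a helper for the example.

```python
import pandas as pd

df3 = pd.DataFrame({
    "X1": range(6),
    "X2": range(6, 12),
    "X3": ["A", "B", "C", "D", "E", "F"],
    "X4": "one one one three two two".split(),
})

# X3 has no repeated values, X4 does
print(df3.set_index("X3").index.is_unique)
print(df3.set_index("X4").index.is_unique)

# Together the two columns identify each row uniquely
print(df3.set_index(["X4", "X3"]).index.is_unique)

# Asking set_index to verify uniqueness raises an error for X4
raised = False
try:
    df3.set_index("X4", verify_integrity=True)
except ValueError:
    raised = True
print(raised)
```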


For example, suppose we are looking at data on interest rates

Each column might represent one bond or asset and each row might
represent the date

Using hierarchical row and column indices allows us to store higher
dimensional data in our (inherently) two dimensional DataFrame
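
As a sketch of that last point, the DataFrame below stores three dimensions (date, country, maturity) by using dates as the row index and a two-level column index. All numbers, country codes, and maturities are made up for illustration and are not real interest rate data.

```python
import pandas as pd

# Hypothetical labels: rows are dates, columns are (country, maturity) pairs
dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
cols = pd.MultiIndex.from_product(
    [["DE", "US"], ["10y", "2y"]], names=["country", "maturity"]
)

# Made-up values, one row per date, one column per (country, maturity)
rates = pd.DataFrame(
    [[-0.25, -0.50, 1.75, 1.50],
     [-0.50, -0.75, 1.25, 1.00],
     [-0.50, -0.75, 0.75, 0.25]],
    index=dates,
    columns=cols,
)

# All maturities for one country
us = rates.loc[:, "US"]

# One maturity across all countries
ten_year = rates.loc[:, pd.IndexSlice[:, "10y"]]

# Columns are still aligned by label in arithmetic
spread = rates[("US", "10y")] - rates[("DE", "10y")]
print(spread)
```

Each row is still one observation (a date) and each column one variable (a country and maturity pair), so the layout stays consistent with the tidy data guidelines above.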