# Advanced indexing

In [3]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

This dataset is borrowed from the [PyCon tutorial of Brandon Rhodes](https://github.com/brandon-rhodes/pycon-pandas-tutorial/) (so all credit to him!). Download [`titles.csv`](https://drive.google.com/file/d/0B3G70MlBnCgKa0U4WFdWdGdVOFU/view?usp=sharing) and [`cast.csv`](https://drive.google.com/file/d/0B3G70MlBnCgKRzRmTWdQTUdjNnM/view?usp=sharing) and put them in the `data/` folder next to this notebook, which is where the cells below read them from.

In [4]:
cast = pd.read_csv('data/cast.csv')
cast.head()

Unnamed: 0,title,year,name,type,character,n
0,Suuri illusioni,1985,Homo $,actor,Guests,22.0
1,Gangsta Rap: The Glockumentary,2007,Too $hort,actor,Himself,
2,Menace II Society,1993,Too $hort,actor,Lew-Loc,27.0
3,Porndogs: The Adventures of Sadie,2009,Too $hort,actor,Bosco,3.0
4,Stop Pepper Palmer,2014,Too $hort,actor,Himself,


In [5]:
titles = pd.read_csv('data/titles.csv')
titles.head()

Unnamed: 0,title,year
0,The Rising Son,1990
1,Ashes of Kukulcan,2016
2,The Thousand Plane Raid,1969
3,Crucea de piatra,1993
4,The 86,2015


## Setting columns as the index

Why is it useful to have an index?

- Meaningful labels for your data, so it is easier to remember which data are where
- Access to powerful methods, e.g. with a `DatetimeIndex` for time series
- Easier and faster selection of data

It is this last one we are going to explore here!
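
As a brief aside on the second point, here is a minimal sketch of what a `DatetimeIndex` makes possible. The series `ts` and its values are made up for illustration and are not part of the cast data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: one value per day for the first 90 days of 2020.
days = pd.date_range("2020-01-01", periods=90, freq="D")
ts = pd.Series(np.arange(90), index=days)

# Partial string indexing: select all of February with a single label.
february = ts.loc["2020-02"]

# Resampling: aggregate the daily values to one mean per calendar month.
monthly_mean = ts.resample("MS").mean()

print(len(february))
print(monthly_mean)
```

Neither operation needs an explicit date comparison; both rely on the index knowing that its labels are timestamps.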

Setting the `title` column as the index:

In [6]:
c = cast.set_index('title')

In [7]:
c.head()

title,year,name,type,character,n
Suuri illusioni,1985,Homo $,actor,Guests,22.0
Gangsta Rap: The Glockumentary,2007,Too $hort,actor,Himself,
Menace II Society,1993,Too $hort,actor,Lew-Loc,27.0
Porndogs: The Adventures of Sadie,2009,Too $hort,actor,Bosco,3.0
Stop Pepper Palmer,2014,Too $hort,actor,Himself,


Instead of doing:

In [8]:
%%time
cast[cast['title'] == 'Hamlet']

Wall time: 345 ms


Unnamed: 0,title,year,name,type,character,n
1456,Hamlet,1996,Riz Abbasi,actor,Attendant to Claudius,1.0
6710,Hamlet,1921,Fritz Achterberg,actor,"Fortinbras,",9.0
10274,Hamlet,2009,Hayden Adams,actor,Laertes,7.0
10275,Hamlet,2009,Hayden Adams,actor,Player,7.0
12743,Hamlet,1913,Eric Adeney,actor,Reynaldo,14.0
...,...,...,...,...,...,...
3212647,Hamlet,1964,Carol Teitel,actress,Lady,
3236803,Hamlet,1969,Jennifer Tudor,actress,Court lady,23.0
3257437,Hamlet,2000,Diane Venora,actress,Gertrude,3.0
3284728,Hamlet,1996,Perdita Weeks,actress,Second Player,44.0


we can now do:

In [8]:
%%time
c.loc['Hamlet']

Wall time: 2 s


title,year,name,type,character,n
Hamlet,1996,Riz Abbasi,actor,Attendant to Claudius,1.0
Hamlet,1921,Fritz Achterberg,actor,"Fortinbras,",9.0
Hamlet,2009,Hayden Adams,actor,Laertes,7.0
Hamlet,2009,Hayden Adams,actor,Player,7.0
Hamlet,1913,Eric Adeney,actor,Reynaldo,14.0
...,...,...,...,...,...
Hamlet,1964,Carol Teitel,actress,Lady,
Hamlet,1969,Jennifer Tudor,actress,Court lady,23.0
Hamlet,2000,Diane Venora,actress,Gertrude,3.0
Hamlet,1996,Perdita Weeks,actress,Second Player,44.0
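
To see that the two selections are equivalent without downloading the full dataset, the comparison can be sketched on a tiny stand-in table. The frame `toy` and its rows are invented for illustration; only the column names follow the cast data.

```python
import pandas as pd

# Hypothetical miniature of the cast table (made-up rows, same column names).
toy = pd.DataFrame({
    "title": ["Hamlet", "Macbeth", "Hamlet", "Othello"],
    "year": [1996, 1948, 2000, 1951],
    "name": ["Actor A", "Actor B", "Actor C", "Actor D"],
})

# Boolean mask: compares every row's title against the target string.
by_mask = toy[toy["title"] == "Hamlet"]

# Label lookup: move `title` into the index, then select by label.
# Passing a list keeps the result a DataFrame even if only one row matches.
by_label = toy.set_index("title").loc[["Hamlet"]]

# Same rows either way; only the position of `title` (column vs. index) differs.
same_rows = by_mask.set_index("title").equals(by_label)
print(same_rows)
```

Whether the label lookup is actually faster depends on the state of the index, so it is worth timing both forms on your own data.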


You can also set multiple columns as the index, which gives a **multi-index** (also called a **hierarchical index**):

In [9]:
c = cast.set_index(['title', 'year'])

In [10]:
c.head()

title,year,name,type,character,n
Suuri illusioni,1985,Homo $,actor,Guests,22.0
Gangsta Rap: The Glockumentary,2007,Too $hort,actor,Himself,
Menace II Society,1993,Too $hort,actor,Lew-Loc,27.0
Porndogs: The Adventures of Sadie,2009,Too $hort,actor,Bosco,3.0
Stop Pepper Palmer,2014,Too $hort,actor,Himself,
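
A multi-index can be queried at different depths. The sketch below uses an invented stand-in table (`toy`, with made-up rows) to show the level names and the difference between selecting on the outer level only and selecting on a full `(title, year)` tuple. The index is sorted up front to keep the lookups straightforward.

```python
import pandas as pd

# Hypothetical miniature of the cast table (made-up rows, same column names).
toy = pd.DataFrame({
    "title": ["Macbeth", "Hamlet", "Othello", "Hamlet", "Hamlet"],
    "year": [1948, 2000, 1951, 1996, 2000],
    "name": ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E"],
})

toy_mi = toy.set_index(["title", "year"]).sort_index()

# The index now has two named levels.
print(list(toy_mi.index.names), toy_mi.index.nlevels)

# Outer level only: all Hamlet rows, with `year` left as the index.
all_hamlet = toy_mi.loc["Hamlet"]

# Full tuple: only the rows for the 2000 film.
hamlet_2000 = toy_mi.loc[("Hamlet", 2000), :]

print(all_hamlet)
print(hamlet_2000)
```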


In [14]:
%%time
c.loc[('Hamlet', 2000),:]

Wall time: 2.18 s


Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_lowerdim(tup)


title,year,name,type,character,n
Hamlet,1996.0,Riz Abbasi,actor,Attendant to Claudius,1.0
Hamlet,1921.0,Fritz Achterberg,actor,"Fortinbras,",9.0
Hamlet,2009.0,Hayden Adams,actor,Laertes,7.0
Hamlet,2009.0,Hayden Adams,actor,Player,7.0
Hamlet,1913.0,Eric Adeney,actor,Reynaldo,14.0
...,...,...,...,...,...
Hamlet,1969.0,Jennifer Tudor,actress,Court lady,23.0
Hamlet,2000.0,Diane Venora,actress,Gertrude,3.0
Hamlet,1996.0,Perdita Weeks,actress,Second Player,44.0
Hamlet,1996.0,Kate Winslet,actress,Ophelia,46.0


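Lookups on a multi-index are generally faster when the index is sorted, so sort it first with `sort_index()`:

In [None]: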
c2 = c.sort_index()

In [None]:
%%time
c2.loc[('Hamlet', 2000),:]
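
Whether an index is sorted can be checked before and after the call. The following is a small sketch on an invented stand-in table (`toy`, with made-up rows), using `is_monotonic_increasing` as the sortedness check. It does not reproduce the timings above.

```python
import pandas as pd

# Hypothetical miniature of the cast table (made-up rows, same column names).
toy = pd.DataFrame({
    "title": ["Macbeth", "Hamlet", "Othello", "Hamlet", "Hamlet"],
    "year": [1948, 2000, 1951, 1996, 2000],
    "name": ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E"],
})

# set_index keeps the original row order, so the index starts out unsorted.
unsorted = toy.set_index(["title", "year"])
was_sorted = unsorted.index.is_monotonic_increasing

# sort_index orders the rows by title, then by year within each title.
sorted_toy = unsorted.sort_index()
is_sorted = sorted_toy.index.is_monotonic_increasing

# With a sorted index, the rows for one (title, year) pair sit next to each other.
hamlet_2000 = sorted_toy.loc[("Hamlet", 2000), :]

print(was_sorted, is_sorted)
print(hamlet_2000)
```

On the full cast data, comparing the `%%time` output of the same lookup on `c` and `c2` shows how much the sort matters in practice.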