# Handling a MultiIndex

### Objectives
After this lesson you should be able to...
+ Identify a MultiIndex DataFrame
+ Identify all the components of a MultiIndex - the number of **levels**, the **name** of each and the **values** (or **labels**) of each level.
+ Know that all index levels have names. If a name isn't displayed, it is **None**
+ Know that index levels have integer locations beginning at 0 from the left/top
+ Know that values in the outer levels are displayed only once when they repeat
+ Use the **`reset_index`** method to move all levels or specific levels from the index to columns
+ Use the **`rename_axis`** method to change all level names
+ Rename columns with a hand-created list
+ Remove a level from a MultiIndex with the **`droplevel`** method
+ Retrieve level values with the **`get_level_values`** method and use it to concatenate levels together

### Prepare for this lesson by

+ Reading the pandas [split apply combine documentation](http://pandas.pydata.org/pandas-docs/stable/groupby.html), stopping at 'transformation'.

### Introduction

We looked at many different ways of aggregating in the last notebook. When you group by more than one column, pandas puts the grouping columns into the index, which creates an index with multiple levels called a MultiIndex. pandas also creates a MultiIndex in the columns when an aggregation applies more than one function to a column. This section will teach you a few basic things about MultiIndex DataFrames and how to turn them back into normal indexes to make your life easier.

In [None]:
import pandas as pd
import numpy as np

### MultiIndex DataFrames
Many of the groupby statements in the last notebook created DataFrames with multiple **levels**. These are called MultiIndexes and they are a bit annoying to deal with. A **`MultiIndex`** can be part of the index or the columns. The DataFrame pictured below has an index with 3 levels and columns with a single level.

All index levels have a **name**. The name is important and is one way to access the level. Row index level names are displayed directly above the level values. They look like column names but they are not. Column level names are displayed directly to the left of the column labels.

The name of the column level below is None and so nothing is displayed for it. Index levels also have integers that represent them from 0 to n-1, where n is the number of levels. The numbering begins from the 'outside' - that is, from the left for the rows and from the top for the columns.

The level **values** (or **labels**) only repeat when necessary. Each row of data still has a value for each of the levels but it isn't displayed for the outer MultiIndex levels.

![](images/DataFrame Explained.png)
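The components described above can also be read off programmatically. The following is a minimal sketch on a small made-up DataFrame (the column names `state`, `affiliated` and `enrollment` are hypothetical and not part of the lesson data); it only illustrates where the level count, names and values live.

```python
import pandas as pd

# Hypothetical toy data, not the DataFrame shown in the image
df = pd.DataFrame({'state': ['AK', 'AK', 'AL', 'AL'],
                   'affiliated': [0, 1, 0, 1],
                   'enrollment': [100, 200, 300, 400]})
summary = df.groupby(['state', 'affiliated']).agg({'enrollment': ['min', 'max']})

# Number of levels in the row index and in the columns
print(summary.index.nlevels)    # 2
print(summary.columns.nlevels)  # 2

# Row levels take the names of the grouping columns;
# the column levels were never named, so each name is None
print(list(summary.index.names))    # ['state', 'affiliated']
print(list(summary.columns.names))  # [None, None]

# Every row still has a value for each level, even if the display hides repeats
print(summary.index.tolist())  # [('AK', 0), ('AK', 1), ('AL', 0), ('AL', 1)]
```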

# Alleviating the MultiIndex
For beginners, MultiIndexes are a pain to deal with and make selection difficult. This next section covers how to turn them back into normal indexes without losing information.

In [None]:
college = pd.read_csv('../../data/college.csv')

In [None]:
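# select multiple columns with a list (double brackets); a bare tuple of names fails in recent pandas
college_summary = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)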
college_summary

Verify that both the index and the columns are MultiIndexes.
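One way to do this check in code is with `isinstance`. The sketch below uses a small made-up stand-in (`toy` and `toy_summary` are hypothetical names with invented numbers) so that it runs without the college data; the same two checks should apply to `college_summary`.

```python
import pandas as pd

# Hypothetical stand-in for the college data, with made-up values
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                    'RELAFFIL': [0, 1, 0],
                    'UGDS': [100.0, 250.0, 400.0],
                    'SATMTMID': [500.0, 520.0, 480.0]})
toy_summary = (toy.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']]
                  .agg(['size', 'min', 'max']))

# isinstance distinguishes a MultiIndex from a single-level Index
print(isinstance(toy_summary.index, pd.MultiIndex))    # True
print(isinstance(toy_summary.columns, pd.MultiIndex))  # True

# nlevels reports how many levels each one has
print(toy_summary.index.nlevels, toy_summary.columns.nlevels)  # 2 2
```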

## Reset the index
The easiest way to return to a normal index is to use **`reset_index`**. This moves all index levels back into the DataFrame as columns.

In [None]:
college_summary.reset_index()

You can reset specific levels by using their name or their level integer.

In [None]:
college_summary.reset_index(level='RELAFFIL')

In [None]:
college_summary.reset_index(level=1)
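The two cells above should give the same result, since `RELAFFIL` is the level at integer location 1. A small sketch on made-up data (the `toy` DataFrame is hypothetical) makes that equivalence explicit and shows what is left in the index afterwards.

```python
import pandas as pd

# Hypothetical toy data with the same grouping column names
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                    'RELAFFIL': [0, 1, 0],
                    'UGDS': [100.0, 250.0, 400.0]})
toy_mean = toy.groupby(['STABBR', 'RELAFFIL']).mean()

# Reset the same level once by name and once by integer location
by_name = toy_mean.reset_index(level='RELAFFIL')
by_position = toy_mean.reset_index(level=1)
print(by_name.equals(by_position))  # True

# The remaining level stays in the index; the reset level becomes the first column
print(by_name.index.name)     # STABBR
print(list(by_name.columns))  # ['RELAFFIL', 'UGDS']
```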

## Changing index level names
The row index levels have names. The column levels do not. We can change either one with the **`rename_axis`** method. Pass in a list of the new level names (or a scalar if there is only one level).

In [None]:
cs2 = college_summary.rename_axis(['STATE ABBREVIATION', 'RELIGIOUS AFFILIATION'])
cs2

In [None]:
# give the column levels a name
cs2.rename_axis(['Aggregating Columns', 'Aggregating Functions'], axis='columns')
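When only one level name needs to change, `rename_axis` also accepts a dictionary mapping old names to new names through its `index` (or `columns`) keyword, so the other names do not have to be retyped. The following is a sketch on made-up data (the `toy` DataFrame is hypothetical), not a replacement for the list approach shown above.

```python
import pandas as pd

# Hypothetical toy data with the same grouping column names
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                    'RELAFFIL': [0, 1, 0],
                    'UGDS': [100.0, 250.0, 400.0]})
toy_mean = toy.groupby(['STABBR', 'RELAFFIL']).mean()

# Map only the level name that should change; STABBR is left alone
renamed = toy_mean.rename_axis(index={'RELAFFIL': 'RELIGIOUS AFFILIATION'})
print(list(renamed.index.names))   # ['STABBR', 'RELIGIOUS AFFILIATION']

# rename_axis returns a new DataFrame; the original keeps its names
print(list(toy_mean.index.names))  # ['STABBR', 'RELAFFIL']
```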

## Remove the MultiIndex from the columns
There are a few ways to make the columns only one level again. Here are three strategies:
+ Reassign the columns a list of names by hand
+ Drop one of the levels with the **`droplevel`** index method
+ Concatenate the top and bottom levels using the **`get_level_values`** index method.

In [None]:
college_summary.head()

### Reassign columns by hand
Create a list and assign it to the columns attribute.

In [None]:
columns = ['ugds_size', 'ugds_min', 'ugds_max', 'satmath_size', 'satmath_min', 'satmath_max']
college_summary.columns = columns
college_summary

### Drop a level with `droplevel`
Pass the level name or the level integer location to the **`droplevel`** index method and assign the result back to the columns.

In [None]:
# must recreate the original
college_summary = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)

# Drop the top level
college_summary.columns = college_summary.columns.droplevel(level=0)

# in this case it doesn't really make sense to drop a level: the remaining column names are duplicated
college_summary
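To see why dropping a level is a poor fit here, it helps to look at the column labels that remain. The sketch below uses a made-up stand-in (`toy` is hypothetical, and only two aggregating functions are used to keep it short); the same duplication should occur with `college_summary`.

```python
import pandas as pd

# Hypothetical stand-in for the college data, with made-up values
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                    'RELAFFIL': [0, 1, 0],
                    'UGDS': [100.0, 250.0, 400.0],
                    'SATMTMID': [500.0, 520.0, 480.0]})
toy_summary = (toy.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']]
                  .agg(['min', 'max']))

# Dropping the top level keeps only the function names
flat = toy_summary.columns.droplevel(0)
print(list(flat))      # ['min', 'max', 'min', 'max']

# The labels no longer say which column each statistic belongs to
print(flat.is_unique)  # False
```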

### Concatenate levels with `get_level_values`
It's possible to extract just the values of each level of a MultiIndex with the **`get_level_values`** method.

In [None]:
# recreate the original
college_summary = college.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']].agg(['size', 'min', 'max']).head(12)

college_summary.columns.get_level_values(0)

In [None]:
college_summary.columns.get_level_values(1)

In [None]:
# concatenate
columns = college_summary.columns.get_level_values(0) + '_' + college_summary.columns.get_level_values(1)
college_summary.columns = columns
college_summary
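As an alternative sketch, the same flat names can be built by iterating over the MultiIndex, which yields one tuple of labels per column, and joining each tuple with an underscore. This assumes every level label is a string. The `toy` data below is made up so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the college data, with made-up values
toy = pd.DataFrame({'STABBR': ['AK', 'AK', 'AL'],
                    'RELAFFIL': [0, 1, 0],
                    'UGDS': [100.0, 250.0, 400.0],
                    'SATMTMID': [500.0, 520.0, 480.0]})
toy_summary = (toy.groupby(['STABBR', 'RELAFFIL'])[['UGDS', 'SATMTMID']]
                  .agg(['min', 'max']))

# Each element of a MultiIndex is a tuple such as ('UGDS', 'min')
joined = ['_'.join(col) for col in toy_summary.columns]
print(joined)  # ['UGDS_min', 'UGDS_max', 'SATMTMID_min', 'SATMTMID_max']

# Same result as concatenating the level values
via_levels = (toy_summary.columns.get_level_values(0) + '_'
              + toy_summary.columns.get_level_values(1))
print(list(via_levels) == joined)  # True

toy_summary.columns = joined
```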

# Exercises
Solutions are below.

Use the flights dataset. It has a random selection of 3% of the flights from 2015 from the 10 busiest airports in the US.

In [None]:
flights = pd.read_csv('../../data/flights.csv')
pd.options.display.max_columns = 40
flights.head()

## Problem 1
<span  style="color:green; font-size:16px">Count the number of flights from each origin airport and then turn the result into a DataFrame in one line.</span>

In [None]:
# your code here

## Problem 2
<span  style="color:green; font-size:16px">Find the average departure delay and the total number of flights from each origin airport. What is the name of the row index and column index?</span>

In [None]:
# your code here

## Problem 3
<span  style="color:green; font-size:16px">Produce the same result as with problem 2 but rename the column index level to **Departure Delay Stats** </span>

In [None]:
# your code here

## Problem 4
<span  style="color:green; font-size:16px">Find the average scheduled and elapsed time for each origin airport for each month of the year. Save the first 10 rows to a variable. Then change the month index level name to **MONTH_NUM** and then turn that level into a column.</span>

In [None]:
# your code here

## Problem 5
<span  style="color:green; font-size:16px">For every origin airport, airline and day of week, calculate the median and maximum departure delay, and the percentage diverted and canceled. Save this to a variable</span>

In [None]:
# your code here

## Problem 6
<span  style="color:green; font-size:16px">Take your answer from problem 5. Reassign the columns to be a single dimensional index by concatenating the inner level to the outer level. Then reset the airline and day of week row index levels. After rearranging the index, it is much easier to handle. Find the row of data that has the highest percentage of cancellations.</span>

In [None]:
# your code here

# Solutions

In [None]:
flights = pd.read_csv('../../data/flights.csv')
pd.options.display.max_columns = 40
flights.head()

## Problem 1
<span  style="color:green; font-size:16px">Count the number of flights from each origin airport and then turn the result into a DataFrame in one line.</span>

In [None]:
flights.ORIGIN_AIRPORT.value_counts().reset_index()

## Problem 2
<span  style="color:green; font-size:16px">Find the average departure delay and the total number of flights from each origin airport. What is the name of the row index and column index?</span>

In [None]:
# row index name is ORIGIN_AIRPORT. Column index name is None
a = flights.groupby('ORIGIN_AIRPORT')['DEPARTURE_DELAY'].agg(['mean', 'size'])
a

In [None]:
a.index.name

In [None]:
a.columns.name is None

## Problem 3
<span  style="color:green; font-size:16px">Produce the same result as with problem 2 but rename the column index level to **Departure Delay Stats** </span>

In [None]:
a = flights.groupby('ORIGIN_AIRPORT')['DEPARTURE_DELAY'].agg(['mean', 'size'])
a.rename_axis('Departure Delay Stats', axis='columns')

## Problem 4
<span  style="color:green; font-size:16px">Find the average scheduled and elapsed time for each origin airport for each month of the year. Save the first 10 rows to a variable. Then change the month index level name to **MONTH_NUM** and then turn that level into a column.</span>

In [None]:
a = flights.groupby(['ORIGIN_AIRPORT', 'MONTH'])[['SCHEDULED_TIME', 'ELAPSED_TIME']].mean().head(10)
a

In [None]:
b = a.rename_axis(['ORIGIN_AIRPORT', 'MONTH_NUM'])
b

In [None]:
b.reset_index('MONTH_NUM')

## Problem 5
<span  style="color:green; font-size:16px">For every origin airport, airline and day of week, calculate the median and maximum departure delay, and the percentage diverted and canceled. Save this to a variable</span>

In [None]:
a = flights.groupby(['ORIGIN_AIRPORT', 'AIRLINE', 'DAY_OF_WEEK']).agg({'DEPARTURE_DELAY':['median', 'max'],
                                                                       'DIVERTED':'mean',
                                                                       'CANCELLED':'mean'})
a.head(10)

## Problem 6
<span  style="color:green; font-size:16px">Take your answer from problem 5. Reassign the columns to be a single dimensional index by concatenating the inner level to the outer level. Then reset the airline and day of week row index levels. After rearranging the index, it is much easier to handle. Find the row of data that has the highest percentage of cancellations.</span>

In [None]:
a = flights.groupby(['ORIGIN_AIRPORT', 'AIRLINE', 'DAY_OF_WEEK']).agg({'DEPARTURE_DELAY':['median', 'max'],
                                                                       'DIVERTED':'mean',
                                                                       'CANCELLED':'mean'})

columns = a.columns.get_level_values(1) + '_' + a.columns.get_level_values(0)
a.columns = columns
b = a.reset_index(['AIRLINE', 'DAY_OF_WEEK'])
b.head()

In [None]:
b.sort_values('mean_CANCELLED', ascending=False).head()