In [None]:
import pandas as pd
import numpy as np

# Hierarchical Indexing

### Multiindex

If you set the index to more than one column, you create a multi-index, also called a hierarchical index. This makes it much easier to ask questions based on index values, and it also opens up the possibility of working with multidimensional data. 

We'll use the example sourced from [here](https://chrisalbon.com/python/pandas_hierarchical_data.html). 

In [None]:
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

## Setting an index of an existing `DataFrame`

In [None]:
df_1_ind = df.set_index('regiment')
df_1_ind

In [None]:
# Group by the index level; numeric_only=True skips the text columns
df_1_ind.groupby(level='regiment').mean(numeric_only=True)

In [None]:
# Set the hierarchical index to be by regiment, and then by company
df_2_ind = df.set_index(['regiment', 'company'])
df_2_ind

<div class="alert alert-block alert-info">
<p>
Having multiple index levels gives you an easy way to model data with more than two dimensions using DataFrames. Remember that a DataFrame is, by default, a two-dimensional data structure. 
</p>
<p>
For the above example, you can imagine each regiment is a two-dimensional array giving details about the company, names and the scores, and they are stacked one below the other. 
</p>
</div>
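
To make the "stacked arrays" picture concrete, here is a minimal sketch of selecting from a two-level index. The `small` frame is a made-up miniature of the regiment data, not the full example, and the variable names are my own.

```python
import pandas as pd

# Hypothetical miniature of the regiment data, for illustration only
small = pd.DataFrame({
    'regiment': ['Dragoons', 'Dragoons', 'Nighthawks', 'Nighthawks'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'preTestScore': [3, 24, 4, 31],
}).set_index(['regiment', 'company'])

# Outer level only: every company in one regiment
nighthawks = small.loc['Nighthawks']

# Both levels: a single (regiment, company) row
nighthawks_2nd = small.loc[('Nighthawks', '2nd')]

# Inner level only: the '1st' company of every regiment
first_companies = small.xs('1st', level='company')
```

Selecting on the outer level returns one of the stacked two-dimensional blocks, while `xs()` cuts across all of them at once.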

* What if you want the mean scores by company, regardless of regiment? 

In [None]:
df_2_ind.groupby(level='company').mean(numeric_only=True)

In [None]:
df_2_ind.groupby(level='regiment').mean(numeric_only=True)

In [None]:
df_2_ind.groupby(level=['regiment', 'company']).mean(numeric_only=True)

# Pandas Aggregation

We have already seen some simple aggregations on Pandas **`Series`** and **`DataFrame`** objects.

Let us review a few aggregation functions that will help us understand **grouping**. 

In [None]:
# We'll be using our college scorecard dataset in this tutorial.
college_scorecard = pd.read_csv('./data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

<div class="alert alert-block alert-info">
<p>
Remember that a series actually holds its values in a nested NumPy array (ndarray) object. Pandas simply has to apply these aggregation functions to that nested array.
</p>
</div>

Here is the list of available `Series` and `DataFrame` aggregation methods from your textbook.

| Aggregation Function      | Description    |      
|---------------|---------------------|
|count()        |Total number of items (not including NaN)|
|first(), last()|First and last item  |
|mean(), median()  |Mean and median   |
|min(), max()   |Minimum and Maximum  |
|std(), var()   |Standard deviation & variance |
|mad()          |Mean absolute deviation (removed in pandas 2.0) |
|prod()         |Product of all items         |
|sum()          |Sum of all items           |
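
As a quick check on several of the methods in this table, here is a minimal sketch on a made-up `Series` (the values are hypothetical and chosen so the results are easy to verify by hand). Note how the missing value is skipped rather than counted.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
scores = pd.Series([2.0, 4.0, np.nan, 6.0])

n_items = scores.count()    # NaN is not counted
total = scores.sum()        # NaN is skipped
average = scores.mean()     # mean of the non-missing values
middle = scores.median()
lowest, highest = scores.min(), scores.max()
product = scores.prod()
```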

### The `describe()` method
The `describe()` method is available on both **`Series`** and **`DataFrame`** objects and outputs a variety of aggregations that are very useful in getting the general "sense" of a dataset.

Take a look at the output for our **`sat_average`** series and **`college_scorecard`** dataframe.


In [None]:
sat_averages = college_scorecard['sat_average']


In [None]:
sat_averages.describe()

In [None]:
college_scorecard.describe()

#### Tweaking `describe()` behavior with `include` and `exclude` parameters.
When used on a **`DataFrame`** object, the default behavior of the **`describe()`** method is to provide statistics on numeric columns only.

Let's take a look at the **`dtypes`** attribute on our college_scorecard dataframe to see what columns this does/doesn't include.

In [None]:
college_scorecard.dtypes

<div class="alert alert-block alert-info">
<p>
The `dtypes` attribute of `DataFrame` objects returns information on the datatype of each nested series/column.
</p>
</div>

See all the places where it lists the datatype of a column as 'object'? These columns won't be reported on with **`describe()`** when using the default parameters.

We can change this using either the **`include`** or the **`exclude`** parameters:

In [None]:
# Include the object datatype columns
college_scorecard.describe(include=[np.object_])

In [None]:
# Exclude the numeric datatypes
college_scorecard.describe(exclude=[np.number])

There are two things here to notice:
1. The type of statistics returned changed when operating on **`object`** column types.
2. I used NumPy datatypes in the specification of what to include and exclude.

**The Statistics**  
Object columns (especially string-based ones) cannot be summarized reasonably with many of the numeric aggregations, so Pandas provides an alternative set of aggregations that make more sense for this type of data.

**NumPy Datatypes**  
Remember that the values of each `Series` inside of a `DataFrame` are stored in a NumPy array. Therefore the elements in that NumPy array are described by NumPy datatypes.

That is why we specify NumPy datatypes here to include or exclude columns in the Pandas `describe()` method.

This is just another example of the tight integration between the two libraries.
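
Here is a minimal sketch of the two kinds of summary side by side. The `schools` frame is made up for illustration, and I store its text column explicitly as `object` dtype so that the example behaves the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the text column is stored explicitly as object dtype
schools = pd.DataFrame({
    'state': pd.Series(['OH', 'OH', 'KY'], dtype=object),
    'enrollment': [1200, 800, 450],
    'retention': [0.81, 0.64, 0.72],
})

# Numeric columns get count, mean, std, min, quartiles and max
numeric_summary = schools.describe(include=[np.number])

# Object columns get count, unique, top and freq instead
object_summary = schools.describe(include=[np.object_])
```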

In [None]:
# Finally, you can specify include='all' to force Pandas
# to evaluate all columns. It will inject NaN where
# a calculation cannot be done.
college_scorecard.describe(include='all')

# Pandas Grouping

For this section we will look at a sample of the flight delays dataset that is available on Kaggle [here](https://www.kaggle.com/usdot/flight-delays).

This is only a sample of the original data. You will use the original data in your Group Project!

In [None]:
flights = pd.read_csv('./data/flight_sample.csv')
flights.head()

## The `groupby()` Method

So far, all the calculations that we've done on **`DataFrame`** objects have looked at the values of columns as a whole.

The `groupby()` method allows you to move into deeper forms of analysis by splitting up the rows of a dataset into groups according to the values in the specified column(s). You can think of this in some ways as putting rows into buckets for evaluation.

### Specifying how to Split your Dataset into Groups
Of course, before we can perform evaluations on groups, we have to create them from an existing dataframe. 

Let's explore how **`groupby()`** provides a variety of ways to split up your datasets. We'll explore some of these here, starting with the simplest.

#### Single Column Grouping

In [None]:
# NOTE THIS IS ONLY SHOWING GROUPS, LOOK BELOW ON HOW TO USE THE GROUPS
flights_by_airline = flights.groupby(['AIRLINE'])
flights_by_airline.groups

The **`groupby()`** method returns an object of type **`DataFrameGroupBy`**. We will explore it in more depth shortly, but for now just know that it has an attribute called **`groups`** which provides a *`dict`* object with the **labels** of each group and the **corresponding index values** in the original dataframe that belong to that group.

If you look above, you can see there is a group labelled 'AA' with index values [2,   19,   43,   55,   59,   64,   71,   74,   82,   92, ...].

You can think of this as a record of all the groups that we will perform calculations on later.
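
A minimal sketch of that record on a made-up miniature of the flights data (the `mini` frame and its values are hypothetical, and I group by the column name as a plain string):

```python
import pandas as pd

# Hypothetical miniature of the flights data
mini = pd.DataFrame({
    'AIRLINE': ['AA', 'WN', 'AA', 'DL', 'WN'],
    'DISTANCE': [1000, 300, 1400, 800, 500],
})

by_airline = mini.groupby('AIRLINE')

# .groups maps each group label to the index labels of its rows
group_rows = {label: list(rows) for label, rows in by_airline.groups.items()}

# get_group() returns the rows of one bucket as an ordinary DataFrame
aa_flights = by_airline.get_group('AA')
```

Here `group_rows` pairs 'AA' with rows 0 and 2, 'DL' with row 3, and 'WN' with rows 1 and 4, which is exactly the bucket picture described above.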

#### Multi Column Grouping

You can specify multiple columns if you wish to split your data up in multiple levels:

In [None]:
# NOTE THIS IS ONLY SHOWING GROUPS, LOOK BELOW ON HOW TO USE THE GROUPS
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
flights_by_airline_month.groups

### Aggregations after GroupBy

For example, let us say you want to find the average distance traveled by each airline. You can do that using the following aggregate function:

In [None]:
flights.head()

In [None]:
flights_by_airline = flights.groupby(['AIRLINE'])

In [None]:
avg_by_airline = flights_by_airline[['DISTANCE', 'TAXI_IN']].mean()

**NOTE**: Notice the double brackets `[[ ]]` used when computing the summary statistics. The outer pair `[ ]` indexes into the `DataFrameGroupBy` object; the inner pair `[ ]` is a list of the columns for which you want summary statistics. 

In [None]:
avg_by_airline

## Activity


### Generalizing using GroupBy

1\. Use `groupby` to group the records by AIRLINE into a `DataFrameGroupBy` object.

2\. Compute the median distance travelled per airline. 

3\. Extract the median DISTANCE for Southwest Airlines (WN) and assign it to a variable `median_distance_WN`. 

4\. What are the median DISTANCE, TAXI_IN and TAXI_OUT times per airline per month? 

5\. Extract the median TAXI_OUT for Southwest Airlines (WN) in December (12) and assign it to a variable `median_distance_WN_12`. 

In [None]:
# Question 1


In [None]:
# Question 2


In [None]:
# Question 3
median_distance_WN = None  # TODO: replace None with your answer

In [None]:
# Question 4


In [None]:
# Question 5: Select WN airline in month 12, median TAXI_OUT
median_distance_WN_12 = None  # TODO: replace None with your answer

### Understanding the Aggregation After GroupBy: Method Dispatching

Let us now look at how aggregations on `DataFrameGroupBy` objects work. In a **`DataFrameGroupBy`** object, any method not found on the object itself is forwarded ("**dispatched**") to all the groups that it contains.

That is why we were able to ask for the *`median`* of a **`flights_by_airline`** object above and get something back: it is (1) "dispatching" the *`median`* method call to each group (that is each airline), (2) collecting the results and (3) presenting them to us.
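
The three steps can be sketched by hand. This is only an illustration of the idea on a made-up `mini` frame, not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical miniature of the flights data
mini = pd.DataFrame({
    'AIRLINE': ['AA', 'WN', 'AA', 'WN', 'AA'],
    'DISTANCE': [1000, 300, 1400, 500, 1200],
})
by_airline = mini.groupby('AIRLINE')

# (1) call median() on each group and (2) collect the results ...
per_group = {label: group['DISTANCE'].median() for label, group in by_airline}

# ... then (3) present them as a single Series
by_hand = pd.Series(per_group, name='DISTANCE')

# The one-line equivalent that relies on dispatching
dispatched = by_airline['DISTANCE'].median()
```

Both `by_hand` and `dispatched` hold the same per-airline medians; the dispatched version simply saves us from writing the loop.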

In [None]:
flights_by_airline = flights.groupby(['AIRLINE'])

In [None]:
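# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
flights_by_airline.median(numeric_only=True)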

In [None]:
# Compute the median for the entire DataFrameGroupBy object and then select 'DISTANCE' column 
flights_by_airline.median(numeric_only=True)[['DISTANCE']]

In [None]:
# Select the 'DISTANCE' Column and then compute the median
flights_by_airline[['DISTANCE']].median()

**Question**: Which of the above two methods should be preferred? 

In [None]:
# Select the 'DISTANCE' column and then compute the median. This gives you a Series object.
flights_by_airline['DISTANCE'].median()

**NOTE**: Note the difference between using double square brackets `[[ ]]` and a single pair `[ ]`. For example, ``flights_by_airline[['DISTANCE']].median()`` above returns a DataFrame with one column, whereas ``flights_by_airline['DISTANCE'].median()`` returns a Series. 
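
A minimal sketch that checks the two return types on a made-up `mini` frame (hypothetical data, for illustration only):

```python
import pandas as pd

# Hypothetical miniature of the flights data
mini = pd.DataFrame({
    'AIRLINE': ['AA', 'WN', 'AA', 'WN'],
    'DISTANCE': [1000, 300, 1400, 500],
})
by_airline = mini.groupby('AIRLINE')

as_frame = by_airline[['DISTANCE']].median()   # list of columns -> DataFrame
as_series = by_airline['DISTANCE'].median()    # single column name -> Series

# Looking up one airline gives a scalar from the Series,
# but a one-element Series (a row) from the DataFrame
wn_scalar = as_series.loc['WN']
wn_row = as_frame.loc['WN']
```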

### Methods of `DataFrameGroupBy` Objects
Now we will understand the various operations built into the `DataFrameGroupBy` object type.

#### The `aggregate()` Method
At first, the `aggregate()` method appears to be quite similar to what we just covered when we talked about method dispatching. It performs aggregations on the groups in a **`DataFrameGroupBy`** object.

In [None]:
# Extra keyword arguments are passed on to the named method
flights_by_airline.aggregate('mean', numeric_only=True)

The difference is that the **`aggregate()`** method gives you some additional options that are not available if you rely on method dispatching as shown above.

In [None]:
# You can pass multiple aggregates as a list.
# Here we will get various aggregates for two
# numeric columns of our flights_by_airline object.
flights_by_airline[['DISTANCE', 'TAXI_IN']].aggregate([np.mean, 'min', 'max'])

<div class="alert alert-block alert-warning">
<p>
It is important to notice that you are able to pass both strings and functions to the `aggregate()` method. It is probably best to choose one approach and stick with it rather than mixing and matching like I've done here.
</p>
</div>

In [None]:
# One consistent style: strings only
flights_by_airline[['DISTANCE', 'TAXI_IN']].aggregate(['mean', 'min', 'max'])

Your textbook also talks about using a dict to apply labels to the aggregation columns so that they can have user friendly names like 'Longest Distance' rather than just 'max'.

This sort of functionality was, however, deprecated and then removed in pandas 1.0, so it no longer works in current versions.

To accomplish the same thing, we should instead append a `rename()` method after our `aggregate()` method like so:

In [None]:
# Using `rename()` to apply friendly labels to output columns.
# String aggregates give predictable column names ('mean', 'min', 'max').
flights_by_airline['DISTANCE'].aggregate(
    ['mean', 'min', 'max']).rename(
        columns={'mean': 'Avg. Distance', 
                 'min': 'Shortest Distance', 
                 'max': 'Longest Distance'})

<div class="alert alert-block alert-danger">
<p>
Note, there are three main things happening in the above statement.
</p>
<ul>
<li> flights_by_airline['DISTANCE'] selects the distance column for analysis.</li>
<li> flights_by_airline['DISTANCE'].aggregate(['mean', 'min', 'max']) computes the average, min and max of the selected distance column.</li>
<li> Finally, the .rename() method renames the columns according to the dictionary we have given.</li>
</ul>
</div>
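
As an alternative to chaining `rename()`, pandas 0.25 and later support named aggregation, where each output label is given as a keyword inside `aggregate()` itself. The sketch below uses a made-up `mini` frame (hypothetical data); because our labels contain spaces and periods, I pass them by unpacking a dict.

```python
import pandas as pd

# Hypothetical miniature of the flights data
mini = pd.DataFrame({
    'AIRLINE': ['AA', 'WN', 'AA', 'WN'],
    'DISTANCE': [1000, 300, 1400, 500],
})

# Named aggregation: output label -> aggregation to apply
summary = mini.groupby('AIRLINE')['DISTANCE'].aggregate(
    **{'Avg. Distance': 'mean',
       'Shortest Distance': 'min',
       'Longest Distance': 'max'})
```

When the labels are valid Python identifiers you can write them directly as keywords, for example `aggregate(avg_distance='mean')`.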

The recommended way of using a **`dict`** with the **`aggregate()`** method is actually to specify which aggregation(s) to perform on what columns. You can use it to specify different aggregation(s) on a per-column basis.

Here I'll use it to get the high/low values for DISTANCE and the mean for TAXI_IN on our *`flights_by_airline_month`* object.

In [None]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])

# Notice how using this style automatically filters
# out all columns you don't specify.
flights_by_airline_month.aggregate(
        {'DISTANCE': ['min', 'max'], 
         'TAXI_IN': 'mean'}).tail(20)

## Activity: 

We will work again on the `college-loan-default-rates.csv` and `college-scorecard-data-scrubbed.csv` datasets. 

Use the `aggregate()` method to produce:

1. The average, minimum and maximum `full_time_retention_rate_4_year` per state using the `college-scorecard-data-scrubbed.csv` dataset. 
    * After producing the above summary statistics, make sure you rename your columns for average, minimum and maximum as `Avg. Retention`, `Low Retention`, and `High Retention` respectively. 

2. Which state has the highest average four-year retention rate (`full_time_retention_rate_4_year`)? Which has the lowest average? 

3. The minimum and maximum of the `sat_average` column and the average of the `full_time_retention_rate_4_year` column, per state and city. 


In [None]:
# For this tutorial, we will need both of our datasets.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv')

college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

In [None]:
# Question 1

In [None]:
# Question 1 (contd...)


In [None]:
# Question 2


In [None]:
# Question 3


<div class="alert alert-block alert-warning">
<h3> Important Notes</h3>
<p>
When producing summary statistics with group by, you can assign intermediate results to variables. Throughout the section above, I have mostly displayed the results directly to show them to you. However, you can assign any result to a variable and reuse it later. <strong>See the example below.</strong>
</p>
</div>

In [None]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
summary_distanc_taxi_in = flights_by_airline_month.aggregate(
        {'DISTANCE': ['min', 'max'], 
         'TAXI_IN': 'mean'})

In [None]:
summary_distanc_taxi_in.head()

In [None]:
# Remember from the last class that we can do aggregations at multiple levels using Hierarchical index. 
summary_distanc_taxi_in.groupby(level='AIRLINE').mean()