In [1]:
import pandas as pd
import numpy as np

# Hierarchical Indexing

### Multiindex

If you set the index to more than one column, you are creating a multi-index, or hierarchical index. This makes asking questions based on the index much easier, and also opens up the possibility of working with multidimensional data. 

We'll use the example sourced from [here](https://chrisalbon.com/python/pandas_hierarchical_data.html). 

In [2]:
# Create dataframe
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df

Unnamed: 0,regiment,company,name,preTestScore,postTestScore
0,Nighthawks,1st,Miller,4,25
1,Nighthawks,1st,Jacobson,24,94
2,Nighthawks,2nd,Ali,31,57
3,Nighthawks,2nd,Milner,2,62
4,Dragoons,1st,Cooze,3,70
5,Dragoons,1st,Jacon,4,25
6,Dragoons,2nd,Ryaner,24,94
7,Dragoons,2nd,Sone,31,57
8,Scouts,1st,Sloan,2,62
9,Scouts,1st,Piger,3,70


## Setting an index of an existing `DataFrame`

In [3]:
df_1_ind = df.set_index('regiment')
df_1_ind

Unnamed: 0_level_0,company,name,preTestScore,postTestScore
regiment,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nighthawks,1st,Miller,4,25
Nighthawks,1st,Jacobson,24,94
Nighthawks,2nd,Ali,31,57
Nighthawks,2nd,Milner,2,62
Dragoons,1st,Cooze,3,70
Dragoons,1st,Jacon,4,25
Dragoons,2nd,Ryaner,24,94
Dragoons,2nd,Sone,31,57
Scouts,1st,Sloan,2,62
Scouts,1st,Piger,3,70


In [4]:
df_1_ind.mean(level = 'regiment')

Unnamed: 0_level_0,preTestScore,postTestScore
regiment,Unnamed: 1_level_1,Unnamed: 2_level_1
Nighthawks,15.25,59.5
Dragoons,15.5,61.5
Scouts,2.5,66.0


In [5]:
# Set the hierarchical index to be by regiment, and then by company
df_2_ind = df.set_index(['regiment', 'company'])
df_2_ind

Unnamed: 0_level_0,Unnamed: 1_level_0,name,preTestScore,postTestScore
regiment,company,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Nighthawks,1st,Miller,4,25
Nighthawks,1st,Jacobson,24,94
Nighthawks,2nd,Ali,31,57
Nighthawks,2nd,Milner,2,62
Dragoons,1st,Cooze,3,70
Dragoons,1st,Jacon,4,25
Dragoons,2nd,Ryaner,24,94
Dragoons,2nd,Sone,31,57
Scouts,1st,Sloan,2,62
Scouts,1st,Piger,3,70


<div class="alert alert-block alert-info">
<p>
Having multiple index levels gives you an easy way to model data with more than two dimensions in a DataFrame. Remember that DataFrames are, by default, two-dimensional data structures. 
</p>
<p>
For the above example, you can imagine each regiment is a two-dimensional array giving details about the company, names and the scores, and they are stacked one below the other. 
</p>
</div>
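
A minimal sketch of what that stacking buys you, using a shortened copy of the regiment data (four of the rows above, kept short for brevity). Selecting on the outer level returns one regiment's block of rows, while `xs()` cuts across every regiment on the inner level. The variable names `nighthawks` and `first_companies` are illustrative only.

```python
import pandas as pd

# Four rows taken from the regiment example above
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'name': ['Miller', 'Ali', 'Cooze', 'Ryaner'],
    'preTestScore': [4, 31, 3, 24],
})

# Sorting the index keeps label-based lookups fast and predictable
df_2_ind = df.set_index(['regiment', 'company']).sort_index()

# Outer level: every row for one regiment (the 'regiment' level is dropped)
nighthawks = df_2_ind.loc['Nighthawks']

# Inner level: a cross-section of the '1st' company across all regiments
first_companies = df_2_ind.xs('1st', level='company')

print(nighthawks)
print(first_companies)
```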

* What if you want the mean scores by company, regardless of regiment? 

In [6]:
df_2_ind.mean(level='company')

Unnamed: 0_level_0,preTestScore,postTestScore
company,Unnamed: 1_level_1,Unnamed: 2_level_1
1st,6.666667,57.666667
2nd,15.5,67.0


In [7]:
df_2_ind.mean(level='regiment')

Unnamed: 0_level_0,preTestScore,postTestScore
regiment,Unnamed: 1_level_1,Unnamed: 2_level_1
Nighthawks,15.25,59.5
Dragoons,15.5,61.5
Scouts,2.5,66.0


In [8]:
df_2_ind.mean(level=['regiment','company'])

Unnamed: 0_level_0,Unnamed: 1_level_0,preTestScore,postTestScore
regiment,company,Unnamed: 2_level_1,Unnamed: 3_level_1
Nighthawks,1st,14.0,59.5
Nighthawks,2nd,16.5,59.5
Dragoons,1st,3.5,47.5
Dragoons,2nd,27.5,75.5
Scouts,1st,2.5,66.0
Scouts,2nd,2.5,66.0
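
Note that the `level` keyword of `mean()` was deprecated in pandas 1.3 and removed in pandas 2.0, so the cells above only run on older versions. Here is a sketch of the equivalent `groupby(level=...)` spelling on a shortened copy of the regiment data. The `sort=False` and `numeric_only=True` arguments are choices made here to mimic the row order and column selection shown in the outputs above.

```python
import pandas as pd

# Four rows taken from the regiment example above
df = pd.DataFrame({
    'regiment': ['Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons'],
    'company': ['1st', '2nd', '1st', '2nd'],
    'name': ['Miller', 'Ali', 'Cooze', 'Ryaner'],
    'preTestScore': [4, 31, 3, 24],
    'postTestScore': [25, 57, 70, 94],
})
df_2_ind = df.set_index(['regiment', 'company'])

# Group on one index level; sort=False keeps groups in order of first appearance,
# numeric_only=True skips the string column 'name'
by_regiment = df_2_ind.groupby(level='regiment', sort=False).mean(numeric_only=True)
by_company = df_2_ind.groupby(level='company', sort=False).mean(numeric_only=True)

print(by_regiment)
print(by_company)
```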


# Pandas Aggregation

We have already seen some simple aggregations on Pandas **`Series`** and **`DataFrame`** objects.

Let us review a few aggregation functions that will help us understand **grouping**. 

In [9]:
# We'll be using our college scorecard dataset in this tutorial.
college_scorecard = pd.read_csv('./data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

<div class="alert alert-block alert-info">
<p>
Remember that a Series holds its values in an underlying NumPy array (ndarray) object. Pandas simply has to apply these aggregation functions to that array.
</p>
</div>

Here is the list of available `Series` and `DataFrame` aggregation methods from your textbook.

| Aggregation Function      | Description    |      
|---------------|---------------------|
|count()        |Total number of items (not including NaN)|
|first(), last()|First and last item  |
|mean(), median()  |Mean and median   |
|min(), max()   |Minimum and Maximum  |
|std(), var()   |Standard deviation & variance |
|mad()          |Mean absolute deviation (deprecated in pandas 1.5, removed in 2.0) |
|prod()         |Product of all items         |
|sum()          |Sum of all items           |

### The `describe()` method
The `describe()` method is available on both **`Series`** and **`DataFrame`** objects and outputs a variety of aggregations that are very useful in getting the general "sense" of a dataset.

Take a look at the output for our **`sat_average`** series and **`college_scorecard`** dataframe.


In [10]:
sat_averages = college_scorecard['sat_average']


In [11]:
sat_averages.describe()

count    1304.000000
mean     1059.072086
std       133.356979
min       720.000000
25%       973.000000
50%      1039.500000
75%      1120.250000
max      1545.000000
Name: sat_average, dtype: float64

In [12]:
college_scorecard.describe()

Unnamed: 0,UNITID,OPEID,OPEID6,predominant_degree_code,institutional_owner_code,locale,men_only,women_only,religious_affiliation_code,sat_reading_25,...,part_time_students_percentage,open_or_closed,average_net_price_public,average_net_price_private,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans
count,7282.0,7282.0,7282.0,7282.0,7282.0,7282.0,7282.0,7282.0,7109.0,1195.0,...,6969.0,7282.0,1911.0,4688.0,6966.0,2293.0,3843.0,1412.0,2208.0,6966.0
mean,283704.0883,1911246.0,16393.400439,1.903735,2.196924,19.620434,0.009063,0.005356,5.256576,468.421757,...,0.225924,0.901126,9624.656201,18230.176621,0.532093,0.707081,0.686155,0.455639,0.564679,0.523092
std,133558.728309,3459461.0,13945.231754,0.954501,0.838866,9.366024,0.094776,0.072991,20.379158,69.283492,...,0.246391,0.298513,4669.671522,7272.125743,0.225941,0.195645,0.180121,0.293325,0.26354,0.284088
min,100654.0,100200.0,1002.0,0.0,1.0,-3.0,0.0,0.0,-2.0,265.0,...,0.0,0.0,-2434.0,-581.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,170749.5,345950.0,3459.5,1.0,1.0,12.0,0.0,0.0,-2.0,420.0,...,0.0,1.0,6297.0,13314.5,0.35885,0.6182,0.5679,0.25,0.382925,0.3333
50%,222372.5,1063250.0,10490.0,2.0,2.0,21.0,0.0,0.0,-2.0,458.0,...,0.1489,1.0,8751.0,18254.5,0.5233,0.7414,0.6906,0.45,0.50325,0.5849
75%,442070.75,3010606.0,26089.75,3.0,3.0,22.0,0.0,0.0,-2.0,500.0,...,0.3766,1.0,12704.0,22719.5,0.7143,0.8333,0.81575,0.6364,0.7895,0.747325
max,485458.0,82098820.0,42371.0,4.0,3.0,43.0,1.0,1.0,105.0,730.0,...,1.0,1.0,28201.0,89406.0,1.0,1.0,1.0,1.0,1.0,1.0


#### Tweaking `describe()` behavior with `include` and `exclude` parameters.
When used on a **`DataFrame`** object, the default behavior of the **`describe()`** method is to provide statistics on numeric columns only.

Let's take a look at the **`dtypes`** attribute on our college_scorecard dataframe to see what columns this does/doesn't include.

In [13]:
college_scorecard.dtypes

UNITID                                       int64
OPEID                                        int64
OPEID6                                       int64
institution_name                            object
city                                        object
                                            ...   
students_with_federal_loans                float64
median_student_earnings                     object
median_student_debt                         object
less_than_4_year_school_completion_rate     object
4_year_school_completion_rate               object
Length: 63, dtype: object

<div class="alert alert-block alert-info">
<p>
The `dtypes` attribute of `DataFrame` objects returns the datatype of each column (each column is itself a `Series`).
</p>
</div>

See all the places where it lists the datatype of a column as 'object'? These columns won't be reported on with **`describe()`** when using the default parameters.

We can change this using either the **`include`** or the **`exclude`** parameters:

In [14]:
# Include the object datatype columns.
# np.object_ replaces the np.object alias, which was deprecated in NumPy 1.20.
college_scorecard.describe(include=[np.object_])


Unnamed: 0,institution_name,city,state,url,predominant_degree_desc,institutional_owner_desc,religious_affiliation_desc,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
count,7282,7282,7282,7225,7282,7282,7282,6201,7251,3972,2497
unique,7164,2493,59,5992,5,3,61,598,2059,3742,2377
top,Stevens-Henager College,New York,CA,www.itt-tech.edu,Certificate,PrivateForProfit,Not applicable,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed
freq,7,87,708,143,3343,3420,6199,816,1519,166,116


In [15]:
# Exclude the numeric datatypes
college_scorecard.describe(exclude=[np.number])

Unnamed: 0,institution_name,city,state,url,predominant_degree_desc,institutional_owner_desc,religious_affiliation_desc,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
count,7282,7282,7282,7225,7282,7282,7282,6201,7251,3972,2497
unique,7164,2493,59,5992,5,3,61,598,2059,3742,2377
top,Stevens-Henager College,New York,CA,www.itt-tech.edu,Certificate,PrivateForProfit,Not applicable,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed
freq,7,87,708,143,3343,3420,6199,816,1519,166,116


There are two things here to notice:
1. The type of statistics returned changed when operating on **`object`** column types.
2. I used NumPy datatypes in the specification of what to include and exclude.

**The Statistics**  
Object columns (especially string-based ones) cannot be summarized sensibly with most numeric aggregations, so Pandas provides an alternative set of aggregations that make more sense for this type of data.

**NumPy Datatypes**  
Remember that the values of each `Series` inside of a `DataFrame` are stored in a NumPy array. Therefore the elements in that NumPy array are described by NumPy datatypes.

That is why we specify NumPy datatypes here when telling the Pandas `describe()` method which columns to include or exclude.

This is just another example of the tight integration between the two libraries.

In [16]:
# Finally, you can specify include='all' to force Pandas
# to evaluate all columns. It will inject NaN where
# a calculation cannot be done.
college_scorecard.describe(include='all')

Unnamed: 0,UNITID,OPEID,OPEID6,institution_name,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
count,7282.0,7282.0,7282.0,7282,7282,7282,7225,7282.0,7282,7282.0,...,6966.0,2293.0,3843.0,1412.0,2208.0,6966.0,6201,7251,3972,2497
unique,,,,7164,2493,59,5992,,5,,...,,,,,,,598,2059,3742,2377
top,,,,Stevens-Henager College,New York,CA,www.itt-tech.edu,,Certificate,,...,,,,,,,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed,PrivacySuppressed
freq,,,,7,87,708,143,,3343,,...,,,,,,,816,1519,166,116
mean,283704.0883,1911246.0,16393.400439,,,,,1.903735,,2.196924,...,0.532093,0.707081,0.686155,0.455639,0.564679,0.523092,,,,
std,133558.728309,3459461.0,13945.231754,,,,,0.954501,,0.838866,...,0.225941,0.195645,0.180121,0.293325,0.26354,0.284088,,,,
min,100654.0,100200.0,1002.0,,,,,0.0,,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
25%,170749.5,345950.0,3459.5,,,,,1.0,,1.0,...,0.35885,0.6182,0.5679,0.25,0.382925,0.3333,,,,
50%,222372.5,1063250.0,10490.0,,,,,2.0,,2.0,...,0.5233,0.7414,0.6906,0.45,0.50325,0.5849,,,,
75%,442070.75,3010606.0,26089.75,,,,,3.0,,3.0,...,0.7143,0.8333,0.81575,0.6364,0.7895,0.747325,,,,
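
As a compact recap of `include` and `exclude`, here is a sketch on a made-up three-row frame. The `schools` data is invented for this example and is not taken from the scorecard file.

```python
import numpy as np
import pandas as pd

# Invented data: one text column and one numeric column with a missing value
schools = pd.DataFrame({
    'state': ['CA', 'NY', 'CA'],
    'sat_average': [1100.0, 1250.0, np.nan],
})

# Default: numeric columns only
numeric_summary = schools.describe()

# Exclude numeric columns: count / unique / top / freq for what remains
text_summary = schools.describe(exclude=[np.number])

# Every column; statistics that do not apply are filled with NaN
everything = schools.describe(include='all')

print(numeric_summary)
print(text_summary)
print(everything)
```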


# Pandas Grouping

In this section we will look at a sample of the flight schedules data that is available on Kaggle [here](https://www.kaggle.com/usdot/flight-delays).

This is only a sample of the original data. You will use the original data in your Group Project!

In [17]:
flights = pd.read_csv('./data/flight_sample.csv')
flights.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAXI_IN,TAXI_OUT,DISTANCE
0,2015,8,19,3,EV,3260,7.0,20.0,1091
1,2015,9,23,3,WN,3050,4.0,9.0,837
2,2015,10,16,5,AA,1382,13.0,25.0,761
3,2015,1,19,1,WN,4274,5.0,23.0,1547
4,2015,4,22,3,WN,2237,5.0,18.0,872


## The `groupby()` Method

So far, all the calculations that we've done on **`DataFrame`** objects have looked at the values of columns as a whole.

The `groupby()` method allows you to move into deeper forms of analysis by splitting the rows of a dataset into groups according to the values in the specified column(s). You can think of this as putting rows into buckets for evaluation.

### Specifying how to Split your Dataset into Groups
Of course, before we can perform evaluations on groups, we have to create them from an existing dataframe. 

Let's explore how **`groupby()`** provides a variety of ways to split up your datasets. We'll explore some of these here, starting with the simplest.

#### Single Column Grouping

In [18]:
# NOTE THIS IS ONLY SHOWING GROUPS, LOOK BELOW ON HOW TO USE THE GROUPS
flights_by_airline = flights.groupby(['AIRLINE'])
flights_by_airline.groups

{'AA': [2, 19, 43, 55, 59, 64, 71, 74, 82, 92, 100, 134, 139, 141, 156, 160, 171, 179, 182, 186, 215, 222, 254, 268, 289, 295, 298, 307, 310, 351, 352, 361, 362, 363, 376, 384, 387, 401, 414, 417, 426, 433, 443, 445, 458, 475, 476, 497, 502, 505, 511, 512, 518, 542, 545, 570, 571, 573, 574, 581, 583, 593, 594, 612, 616, 618, 635, 641, 644, 655, 662, 683, 701, 705, 715, 722, 727, 736, 751, 767, 777, 779, 782, 791, 795, 820, 821, 825, 826, 836, 847, 848, 856, 860, 867, 873, 880, 882, 889, 892, ...], 'AS': [18, 26, 27, 79, 95, 127, 147, 167, 180, 181, 207, 313, 333, 343, 377, 383, 470, 535, 547, 560, 598, 636, 679, 741, 744, 749, 774, 792, 845, 853, 893, 938, 1043, 1074, 1105, 1137, 1171, 1172, 1187, 1245, 1272, 1313, 1324, 1345, 1346, 1377, 1437, 1500, 1526, 1528, 1543, 1571, 1598, 1599, 1646, 1722, 1783, 1785, 1800, 1801, 1821, 1902, 1904, 1943, 1984, 2023, 2139, 2212, 2216, 2219, 2259, 2332, 2393, 2418, 2461, 2468, 2511, 2554, 2676, 2748, 2814, 2821, 2849, 2854, 2884, 2964, 3006, 3050,

The **`groupby()`** method returns a type called **`DataFrameGroupBy`**. We will explore it in more depth shortly, but for now just know that it has an attribute called **`groups`**, which provides a *`dict`*-like object mapping the **label** of each group to the **index values** of the rows in the original dataframe that belong to that group.

If you look above, you can see there is a group labelled 'AA' with index values [2,   19,   43,   55,   59,   64,   71,   74,   82,   92, ...].

You can think of this as a record of all the groups that we will perform calculations on later.
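
Because the real `groups` output is long, here is a sketch on just the five rows shown by `flights.head()` above, which makes the label-to-index mapping easy to read. The `get_group()` call then pulls out the rows of a single group. The variable names are illustrative.

```python
import pandas as pd

# The five rows displayed by flights.head() above (two columns only)
flights = pd.DataFrame({
    'AIRLINE': ['EV', 'WN', 'AA', 'WN', 'WN'],
    'DISTANCE': [1091, 837, 761, 1547, 872],
})

by_airline = flights.groupby('AIRLINE')

# Convert the mapping to plain lists so it prints compactly
groups = {label: list(rows) for label, rows in by_airline.groups.items()}
print(groups)

# Retrieve the actual rows belonging to one group
wn_rows = by_airline.get_group('WN')
print(wn_rows)
```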

#### Multi Column Grouping

You can specify multiple columns if you wish to split your data up in multiple levels:

In [19]:
# NOTE THIS IS ONLY SHOWING GROUPS, LOOK BELOW ON HOW TO USE THE GROUPS
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
flights_by_airline_month.groups

{('AA', 1): [182, 476, 573, 641, 655, 722, 848, 914, 971, 1027, 1266, 1836, 1889, 1892, 2024, 2060, 2062, 2188, 2207, 2240, 2409, 2454, 2512, 2652, 2737, 2895, 2933, 2958, 2978, 3039, 3542, 3562, 3635, 3808, 4031, 4130, 4193, 4245, 4318, 4435, 4540, 4623, 4631, 4800, 4914, 4955, 5199, 5239, 5402, 5417, 5453, 5773, 5853, 5870, 5893, 5963, 6028, 6149, 6345, 6395, 6736, 6800, 6997, 7051, 7229, 7239, 7380, 7434, 7717, 7791, 7862, 7875, 7879, 8015, 8205, 8217, 8233, 8243, 8280, 8329, 8497, 8742, 8779, 8872, 9152, 9236, 9294, 9571], ('AA', 2): [512, 571, 616, 727, 860, 929, 953, 956, 1086, 1118, 1159, 1231, 1291, 1456, 1512, 1734, 1796, 1910, 1940, 1941, 1959, 2090, 2430, 2589, 2962, 3257, 3358, 3369, 3629, 3760, 3851, 4019, 4060, 4078, 4155, 4319, 4508, 4520, 4575, 4829, 5759, 5924, 6170, 6215, 6350, 6488, 6645, 6684, 6817, 7392, 7414, 7673, 7683, 7686, 7868, 7927, 7996, 8102, 8135, 8164, 8209, 8398, 8719, 8751, 8970, 9073, 9178, 9336, 9378, 9460, 9647, 9715, 9787, 9794], ('AA', 3): [19, 38

### Aggregations after GroupBy

For example, say you want to find the average distance traveled by each airline. You can do that with the following aggregate function:

In [20]:
flights.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAXI_IN,TAXI_OUT,DISTANCE
0,2015,8,19,3,EV,3260,7.0,20.0,1091
1,2015,9,23,3,WN,3050,4.0,9.0,837
2,2015,10,16,5,AA,1382,13.0,25.0,761
3,2015,1,19,1,WN,4274,5.0,23.0,1547
4,2015,4,22,3,WN,2237,5.0,18.0,872


In [21]:
flights_by_airline = flights.groupby(['AIRLINE'])

In [22]:
avg_by_airline = flights_by_airline[['DISTANCE', 'TAXI_IN']].mean()

**NOTE**: The double brackets `[[ ]]` are used when computing the summary statistics. The outer pair `[ ]` indexes into the `DataFrameGroupBy` object; the inner pair `[ ]` is a list of the columns for which you want summary statistics. 

In [23]:
avg_by_airline

Unnamed: 0_level_0,DISTANCE,TAXI_IN
AIRLINE,Unnamed: 1_level_1,Unnamed: 2_level_1
AA,1053.736842,9.092593
AS,1202.405145,6.44373
B6,1064.124444,5.966216
DL,862.416996,7.279392
EV,466.038961,7.700409
F9,1034.223776,10.188811
HA,789.768595,7.214876
MQ,433.701961,8.512397
NK,993.298578,8.908213
OO,516.424537,6.725919


## Activity


### Generalizing using GroupBy

1\. Use AIRLINE to `groupby` records into a `DataFrameGroupBy` object.

2\. Compute the median distance travelled per airline. 

3\. Extract the median DISTANCE for Southwest Airlines (WN) and assign it to a variable `median_distance_WN`. 

4\. What are the median DISTANCE, TAXI_IN and TAXI_OUT times per airline per month? 

5\. Extract the median TAXI_OUT for Southwest Airlines (WN) in December (12) and assign it to a variable `median_taxi_out_WN_12`. 

In [None]:
# Question 1
flights_by_airline = flights.groupby(['AIRLINE'])

In [24]:
# Question 2
median_distance = flights_by_airline[['DISTANCE']].median()
median_distance

Unnamed: 0_level_0,DISTANCE
AIRLINE,Unnamed: 1_level_1
AA,985.0
AS,954.0
B6,997.0
DL,666.0
EV,429.0
F9,927.0
HA,163.0
MQ,408.0
NK,977.0
OO,451.0


In [26]:
median_distance.loc['WN']['DISTANCE']

611.0

In [27]:
# Question 3
median_distance_WN = median_distance.loc['WN']['DISTANCE']

In [28]:
# Question 4
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
summary_by_airline_month = flights_by_airline_month[['DISTANCE', 'TAXI_IN', 'TAXI_OUT']].median()
summary_by_airline_month

Unnamed: 0_level_0,Unnamed: 1_level_0,DISTANCE,TAXI_IN,TAXI_OUT
AIRLINE,MONTH,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AA,1,1061.0,7.0,15.0
AA,2,932.5,8.0,14.0
AA,3,1089.0,8.0,16.0
AA,4,1045.0,9.0,17.0
AA,5,1050.0,7.0,15.0
...,...,...,...,...
WN,8,577.0,5.0,10.0
WN,9,562.0,5.0,11.0
WN,10,601.5,5.0,10.0
WN,11,577.0,5.0,11.0


In [31]:
summary_by_airline_month.loc['WN'].loc[12]['TAXI_OUT']

10.0

In [32]:
summary_by_airline_month.loc['WN',12]['TAXI_OUT']

10.0

In [33]:
#Question 5: Select WN airline in month 12, median TAXI_OUT
median_taxi_out_WN_12 = summary_by_airline_month.loc['WN'].loc[12]['TAXI_OUT']
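
The chained lookup above can also be written as a single `.loc` call that takes a tuple for the row (one entry per index level) and a column label. A sketch on invented numbers, so the values do not match the real sample:

```python
import pandas as pd

# Invented rows, for illustration only
flights = pd.DataFrame({
    'AIRLINE': ['WN', 'WN', 'WN', 'AA'],
    'MONTH': [12, 12, 11, 12],
    'TAXI_OUT': [9.0, 11.0, 10.0, 15.0],
})

summary = flights.groupby(['AIRLINE', 'MONTH'])[['TAXI_OUT']].median()

# Chained: three separate lookups
chained = summary.loc['WN'].loc[12]['TAXI_OUT']

# Direct: one lookup with a (row tuple, column) pair
direct = summary.loc[('WN', 12), 'TAXI_OUT']

print(chained, direct)
```

Both forms return the same scalar; the single-call form avoids building intermediate objects and is generally easier to read.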

### Understanding the Aggregation After GroupBy: Method Dispatching

Let us now look at how aggregations on `DataFrameGroupBy` objects work. On a **`DataFrameGroupBy`** object, any method not found on the object itself is forwarded ("**dispatched**") to each of the groups that it contains.

That is why we were able to ask for the *`median`* of a **`flights_by_airline`** object above and get something back: it is (1) "dispatching" the *`median`* method call to each group (that is each airline), (2) collecting the results and (3) presenting them to us.
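
The three steps can be imitated by hand with a loop over the groups. This is only a conceptual sketch on the five rows from `flights.head()`; pandas itself uses optimized code paths for built-in aggregations such as `median`, so this is not how the library is implemented.

```python
import pandas as pd

# The five rows displayed by flights.head() earlier (two columns only)
flights = pd.DataFrame({
    'AIRLINE': ['EV', 'WN', 'AA', 'WN', 'WN'],
    'DISTANCE': [1091, 837, 761, 1547, 872],
})

# (1) apply the method to each group, (2) collect the results
medians = {}
for airline, group in flights.groupby('AIRLINE'):
    medians[airline] = group['DISTANCE'].median()

# (3) present them as a single object
manual = pd.Series(medians, name='DISTANCE')

# The one-line version that groupby gives us
built_in = flights.groupby('AIRLINE')['DISTANCE'].median()

print(manual)
print(built_in)
```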

In [34]:
flights_by_airline = flights.groupby(['AIRLINE'])

In [35]:
flights_by_airline.median()

Unnamed: 0_level_0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAXI_IN,TAXI_OUT,DISTANCE
AIRLINE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AA,2015.0,8.0,16.0,4.0,1292.0,7.0,15.0,985.0
AS,2015.0,7.0,15.0,4.0,384.0,5.0,14.0,954.0
B6,2015.0,7.0,16.0,4.0,749.0,5.0,15.0,997.0
DL,2015.0,7.0,15.0,4.0,1654.5,6.0,15.0,666.0
EV,2015.0,6.0,15.0,4.0,4891.0,7.0,15.0,429.0
F9,2015.0,7.0,14.0,4.0,720.0,8.0,13.0,927.0
HA,2015.0,6.0,16.0,4.0,214.0,6.0,11.0,163.0
MQ,2015.0,6.0,14.0,4.0,3301.5,6.0,14.0,408.0
NK,2015.0,7.0,16.0,4.0,511.0,7.0,12.0,977.0
OO,2015.0,6.0,16.0,4.0,5265.0,5.0,16.0,451.0


In [36]:
# Compute the median for the entire DataFrameGroupBy object and then select 'DISTANCE' column 
flights_by_airline.median()[['DISTANCE']]

Unnamed: 0_level_0,DISTANCE
AIRLINE,Unnamed: 1_level_1
AA,985.0
AS,954.0
B6,997.0
DL,666.0
EV,429.0
F9,927.0
HA,163.0
MQ,408.0
NK,977.0
OO,451.0


In [37]:
# Select the 'DISTANCE' Column and then compute the median
flights_by_airline[['DISTANCE']].median()

Unnamed: 0_level_0,DISTANCE
AIRLINE,Unnamed: 1_level_1
AA,985.0
AS,954.0
B6,997.0
DL,666.0
EV,429.0
F9,927.0
HA,163.0
MQ,408.0
NK,977.0
OO,451.0


**Question**: Which of the above two methods should be preferred? 

In [38]:
# Select the 'DISTANCE' column and then compute the median. This gives you a Series object.
flights_by_airline['DISTANCE'].median()

AIRLINE
AA     985.0
AS     954.0
B6     997.0
DL     666.0
EV     429.0
F9     927.0
HA     163.0
MQ     408.0
NK     977.0
OO     451.0
UA    1023.0
US     705.5
VX    1313.5
WN     611.0
Name: DISTANCE, dtype: float64

**NOTE**: Notice the difference between double square brackets `[[ ]]` and single brackets `[ ]`. For example, ``flights_by_airline[['DISTANCE']].median()`` above returns a DataFrame with one column, whereas ``flights_by_airline['DISTANCE'].median()`` returns a Series. 
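
A quick way to confirm the difference is to check the type of each result. A sketch on the five rows from `flights.head()`:

```python
import pandas as pd

# The five rows displayed by flights.head() earlier (two columns only)
flights = pd.DataFrame({
    'AIRLINE': ['EV', 'WN', 'AA', 'WN', 'WN'],
    'DISTANCE': [1091, 837, 761, 1547, 872],
})
by_airline = flights.groupby('AIRLINE')

as_frame = by_airline[['DISTANCE']].median()   # list of columns -> DataFrame
as_series = by_airline['DISTANCE'].median()    # single column label -> Series

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```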

### Methods of `DataFrameGroupBy` Objects
Now we will look at the various operations built into the `DataFrameGroupBy` object type.

#### The `aggregate()` Method
At first, the `aggregate()` method appears to be quite similar to what we just covered when we talked about method dispatching. It performs aggregations on the groups in a **`DataFrameGroupBy`** object.

In [39]:
flights_by_airline.aggregate('mean')

Unnamed: 0_level_0,YEAR,MONTH,DAY,DAY_OF_WEEK,FLIGHT_NUMBER,TAXI_IN,TAXI_OUT,DISTANCE
AIRLINE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
AA,2015.0,7.097862,16.024671,4.055921,1266.090461,9.092593,17.785354,1053.736842
AS,2015.0,6.565916,15.356913,3.996785,386.385852,6.44373,15.073955,1202.405145
B6,2015.0,6.653333,15.771111,3.893333,905.42,5.966216,18.231982,1064.124444
DL,2015.0,6.551383,15.486825,3.887352,1631.198287,7.279392,17.189564,862.416996
EV,2015.0,6.408591,15.355644,3.851149,4742.821179,7.700409,16.813456,466.038961
F9,2015.0,7.181818,15.762238,3.79021,775.594406,10.188811,16.335664,1034.223776
HA,2015.0,6.239669,16.140496,3.809917,209.578512,7.214876,11.115702,789.768595
MQ,2015.0,6.247059,14.858824,3.917647,3304.494118,8.512397,16.628099,433.701961
NK,2015.0,6.706161,15.938389,3.976303,534.971564,8.908213,14.318841,993.298578
OO,2015.0,6.347614,16.03408,3.938656,5193.527751,6.725919,17.864945,516.424537


The difference is that the **`aggregate()`** method gives you some additional options that are not available if you rely on method dispatching as shown above.

In [41]:
# You can pass multiple aggregates as a list.
# Here we will get various aggregates for each
# column of our flights_by_airline object.
#flights_by_airline.aggregate([np.mean, 'min', 'max'])

flights_by_airline.aggregate(['mean', 'min', 'max'])

Unnamed: 0_level_0,YEAR,YEAR,YEAR,MONTH,MONTH,MONTH,DAY,DAY,DAY,DAY_OF_WEEK,...,FLIGHT_NUMBER,TAXI_IN,TAXI_IN,TAXI_IN,TAXI_OUT,TAXI_OUT,TAXI_OUT,DISTANCE,DISTANCE,DISTANCE
Unnamed: 0_level_1,mean,min,max,mean,min,max,mean,min,max,mean,...,max,mean,min,max,mean,min,max,mean,min,max
AIRLINE,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
AA,2015,2015,2015,7.097862,1,12,16.024671,1,31,4.055921,...,2580,9.092593,2.0,60.0,17.785354,7.0,110.0,1053.736842,130,3784
AS,2015,2015,2015,6.565916,1,12,15.356913,1,31,3.996785,...,895,6.44373,2.0,24.0,15.073955,3.0,88.0,1202.405145,31,2846
B6,2015,2015,2015,6.653333,1,12,15.771111,1,31,3.893333,...,2784,5.966216,2.0,38.0,18.231982,7.0,81.0,1064.124444,68,2704
DL,2015,2015,2015,6.551383,1,12,15.486825,1,31,3.887352,...,2853,7.279392,1.0,68.0,17.189564,7.0,105.0,862.416996,74,4502
EV,2015,2015,2015,6.408591,1,12,15.355644,1,31,3.851149,...,6189,7.700409,2.0,47.0,16.813456,3.0,144.0,466.038961,69,1330
F9,2015,2015,2015,7.181818,1,12,15.762238,1,31,3.79021,...,1491,10.188811,4.0,45.0,16.335664,7.0,53.0,1034.223776,373,2218
HA,2015,2015,2015,6.239669,1,12,16.140496,1,31,3.809917,...,520,7.214876,3.0,22.0,11.115702,5.0,26.0,789.768595,84,2917
MQ,2015,2015,2015,6.247059,1,12,14.858824,1,31,3.917647,...,3691,8.512397,1.0,66.0,16.628099,4.0,167.0,433.701961,89,1236
NK,2015,2015,2015,6.706161,1,12,15.938389,1,31,3.976303,...,1104,8.908213,2.0,71.0,14.318841,7.0,63.0,993.298578,177,2381
OO,2015,2015,2015,6.347614,1,12,16.03408,1,31,3.938656,...,7432,6.725919,2.0,46.0,17.864945,4.0,78.0,516.424537,67,1735


<div class="alert alert-block alert-warning">
<p>
It is important to notice that you are able to pass both strings and functions to the `aggregate()` method. It is probably best to choose one approach and stick with it rather than mixing and matching like I've done here.
</p>
</div>

In [42]:
flights_by_airline.aggregate([np.mean, np.min, np.max])

Unnamed: 0_level_0,YEAR,YEAR,YEAR,MONTH,MONTH,MONTH,DAY,DAY,DAY,DAY_OF_WEEK,...,FLIGHT_NUMBER,TAXI_IN,TAXI_IN,TAXI_IN,TAXI_OUT,TAXI_OUT,TAXI_OUT,DISTANCE,DISTANCE,DISTANCE
Unnamed: 0_level_1,mean,amin,amax,mean,amin,amax,mean,amin,amax,mean,...,amax,mean,amin,amax,mean,amin,amax,mean,amin,amax
AIRLINE,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
AA,2015,2015,2015,7.097862,1,12,16.024671,1,31,4.055921,...,2580,9.092593,2.0,60.0,17.785354,7.0,110.0,1053.736842,130,3784
AS,2015,2015,2015,6.565916,1,12,15.356913,1,31,3.996785,...,895,6.44373,2.0,24.0,15.073955,3.0,88.0,1202.405145,31,2846
B6,2015,2015,2015,6.653333,1,12,15.771111,1,31,3.893333,...,2784,5.966216,2.0,38.0,18.231982,7.0,81.0,1064.124444,68,2704
DL,2015,2015,2015,6.551383,1,12,15.486825,1,31,3.887352,...,2853,7.279392,1.0,68.0,17.189564,7.0,105.0,862.416996,74,4502
EV,2015,2015,2015,6.408591,1,12,15.355644,1,31,3.851149,...,6189,7.700409,2.0,47.0,16.813456,3.0,144.0,466.038961,69,1330
F9,2015,2015,2015,7.181818,1,12,15.762238,1,31,3.79021,...,1491,10.188811,4.0,45.0,16.335664,7.0,53.0,1034.223776,373,2218
HA,2015,2015,2015,6.239669,1,12,16.140496,1,31,3.809917,...,520,7.214876,3.0,22.0,11.115702,5.0,26.0,789.768595,84,2917
MQ,2015,2015,2015,6.247059,1,12,14.858824,1,31,3.917647,...,3691,8.512397,1.0,66.0,16.628099,4.0,167.0,433.701961,89,1236
NK,2015,2015,2015,6.706161,1,12,15.938389,1,31,3.976303,...,1104,8.908213,2.0,71.0,14.318841,7.0,63.0,993.298578,177,2381
OO,2015,2015,2015,6.347614,1,12,16.03408,1,31,3.938656,...,7432,6.725919,2.0,46.0,17.864945,4.0,78.0,516.424537,67,1735


Your textbook also talks about using a dict to apply labels to the aggregation columns so that they can have user friendly names like 'Longest Distance' rather than just 'max'.

This sort of functionality was deprecated and has since been removed from Pandas, so it no longer works in current versions.

To accomplish the same thing, we should instead append a `rename()` method after our `aggregate()` method like so:

In [45]:
# Using `rename()` to apply friendly labels to output columns
flights_by_airline['DISTANCE'].aggregate(
    [np.mean, np.min, np.max]).rename(
        columns={'mean': 'Avg. Distance', 
                 'amin': 'Shortest Distance', 
                 'amax': 'Longest Distance'})

Unnamed: 0_level_0,Avg. Distance,Shortest Distance,Longest Distance
AIRLINE,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AA,1053.736842,130,3784
AS,1202.405145,31,2846
B6,1064.124444,68,2704
DL,862.416996,74,4502
EV,466.038961,69,1330
F9,1034.223776,373,2218
HA,789.768595,84,2917
MQ,433.701961,89,1236
NK,993.298578,177,2381
OO,516.424537,67,1735


<div class="alert alert-block alert-danger">
<p>
Note, there are three main things happening in the above statement. 

<li> flights_by_airline['DISTANCE'] selects the distance column for analysis
<li> flights_by_airline['DISTANCE'].aggregate([np.mean, np.min, np.max]) computes the average, min and max of the distance column selected
<li> Finally, the .rename() method renames the columns according to the dictionary we have given  
</p>
</div>
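
An alternative to chaining `rename()` is pandas' "named aggregation" (available since pandas 0.25), where each keyword names an output column and its value is the aggregation to apply. The sketch below runs on the five rows from `flights.head()`; because the labels contain spaces and a period, they are passed by unpacking a dict with `**`. String aggregation names are used here, which also avoids depending on how NumPy functions are labelled in the output (`amin`/`amax` above).

```python
import pandas as pd

# The five rows displayed by flights.head() earlier (two columns only)
flights = pd.DataFrame({
    'AIRLINE': ['EV', 'WN', 'AA', 'WN', 'WN'],
    'DISTANCE': [1091, 837, 761, 1547, 872],
})

# Each key becomes an output column; each value is the aggregation to run
summary = flights.groupby('AIRLINE')['DISTANCE'].aggregate(**{
    'Avg. Distance': 'mean',
    'Shortest Distance': 'min',
    'Longest Distance': 'max',
})

print(summary)
```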

The recommended way of using a **`dict`** with the **`aggregate()`** method is actually to specify which aggregation(s) to perform on what columns. You can use it to specify different aggregation(s) on a per-column basis.

Here I'll use it to get the high/low values for DISTANCE and the mean for TAXI_IN on our *`flights_by_airline_month`* object.

In [46]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])

# Notice how using this style automatically filters
# out all columns you don't specify.
flights_by_airline_month.aggregate(
        {'DISTANCE': [np.min, np.max], 
         'TAXI_IN': np.mean}).tail(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,DISTANCE,DISTANCE,TAXI_IN
Unnamed: 0_level_1,Unnamed: 1_level_1,amin,amax,mean
AIRLINE,MONTH,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
VX,5,414,2704,8.0
VX,6,337,2586,8.0
VX,7,189,2454,7.285714
VX,8,337,2704,7.0
VX,9,236,2475,6.0
VX,10,236,2475,7.294118
VX,11,236,2565,7.1875
VX,12,337,2586,7.333333
WN,1,148,2447,6.519337
WN,2,148,2039,5.715278


## Activity: 

We will work again on the `college-loan-default-rates.csv` and `college-scorecard-data-scrubbed.csv` datasets. 

Use the `aggregate()` method to produce:

1. The average, minimum and maximum `full_time_retention_rate_4_year` per state using `college-scorecard-data-scrubbed.csv` dataset. 
    * After producing the above summary statistics, make sure you rename your columns for average, minimum and maximum as `Avg. Retention`, `Low Retention`, and `High Retention` respectively. 

2. Which state has the highest average four year retention rate (`full_time_retention_rate_4_year`)? Which has the lowest average? 

3. Produce per state and city, minimum and maximum for the `sat_average` column and average for the `full_time_retention_rate_4_year` column. 


In [47]:
# For this tutorial, we will need both of our datasets.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv')

college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1')

In [49]:
# Question 1

college_by_state = college_scorecard.groupby(['state'])
retention_summary = college_by_state['full_time_retention_rate_4_year'].aggregate(['mean', 'min', 'max'])
retention_summary.head()

Unnamed: 0_level_0,mean,min,max
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AK,0.66324,0.3333,0.7756
AL,0.615436,0.0,1.0
AR,0.650996,0.2564,0.8667
AS,1.0,1.0,1.0
AZ,0.6796,0.2,1.0


In [52]:
# Question 1 (contd...)
retention_summary = retention_summary.rename(columns = {'mean':'Avg. Retention',
                                                       'min': 'Low Retention',
                                                       'max': 'High Retention'})
retention_summary.head()

Unnamed: 0_level_0,Avg. Retention,Low Retention,High Retention
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AK,0.66324,0.3333,0.7756
AL,0.615436,0.0,1.0
AR,0.650996,0.2564,0.8667
AS,1.0,1.0,1.0
AZ,0.6796,0.2,1.0


In [53]:
# Question 2
retention_summary['Avg. Retention'].idxmax()

'AS'

In [54]:
retention_summary['Avg. Retention'].idxmin()

'CO'
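
One caveat when reading these answers: in the summary above, AS shows a mean, minimum and maximum that are all 1.0, which is what a state with a single reporting school would produce. Adding `'count'` to the list of aggregations makes that visible. The sketch below uses invented rows, so it only illustrates the idea and does not reproduce the real figures.

```python
import pandas as pd

# Invented rows, for illustration only
colleges = pd.DataFrame({
    'state': ['AK', 'AK', 'AS', 'AL', 'AL'],
    'full_time_retention_rate_4_year': [0.7756, 0.3333, 1.0, 0.0, 1.0],
})

summary = colleges.groupby('state')['full_time_retention_rate_4_year'].aggregate(
    ['mean', 'count'])

top_state = summary['mean'].idxmax()      # index label of the largest mean
bottom_state = summary['mean'].idxmin()   # index label of the smallest mean

print(summary)
print(top_state, bottom_state)
```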

In [55]:
# Question 3
college_by_state_city = college_scorecard.groupby(['state', 'city'])

college_by_state_city.aggregate({'sat_average':['min','max'], 
                                 'full_time_retention_rate_4_year': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,sat_average,sat_average,full_time_retention_rate_4_year
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,mean
state,city,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AK,Anchorage,1054.0,1054.0,0.7453
AK,Barrow,,,
AK,Fairbanks,,,0.7756
AK,Juneau,,,0.7167
AK,Palmer,,,0.3333
...,...,...,...,...
WY,Powell,,,
WY,Riverton,,,
WY,Rock Springs,,,
WY,Sheridan,,,


<div class="alert alert-block alert-warning">
<h3> Important Notes</h3>
<p> </p> 
When producing summary statistics with group by, you can assign intermediate results to variables. Throughout the section above, I have mostly displayed the results directly to show them to you. However, you can assign a result to a variable so that you can use it later. **See the example below.** 
</div>

In [56]:
flights_by_airline_month = flights.groupby(['AIRLINE', 'MONTH'])
summary_distanc_taxi_in = flights_by_airline_month.aggregate(
        {'DISTANCE': [np.min, np.max], 
         'TAXI_IN': np.mean})

In [57]:
summary_distanc_taxi_in.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,DISTANCE,DISTANCE,TAXI_IN
Unnamed: 0_level_1,Unnamed: 1_level_1,amin,amax,mean
AIRLINE,MONTH,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AA,1,190,3784,9.176471
AA,2,175,2611,10.267606
AA,3,175,2504,9.8
AA,4,192,2422,10.583333
AA,5,175,2585,8.350649


In [58]:
# Remember from the last class that we can do aggregations at multiple levels using Hierarchical index. 
summary_distanc_taxi_in.mean(level='AIRLINE')

Unnamed: 0_level_0,DISTANCE,DISTANCE,TAXI_IN
Unnamed: 0_level_1,amin,amax,mean
AIRLINE,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
AA,164.416667,2983.333333,9.190944
AS,144.416667,2675.666667,6.473706
B6,171.25,2586.5,5.99438
DL,120.833333,2989.666667,7.296186
EV,82.916667,1184.666667,7.680604
F9,508.833333,1829.5,10.092238
HA,98.666667,2443.25,7.181896
MQ,107.916667,1067.166667,8.579951
NK,305.583333,1844.75,8.83569
OO,70.5,1544.75,6.729811
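
Be careful when interpreting this last table: it averages the twelve monthly summaries, so the `amin` column is the mean of the monthly minimums, not the overall minimum. For AA it shows about 164.4, whereas the airline-level aggregation earlier reported a minimum of 130. The sketch below shows the same effect on invented numbers, with the level-wise mean written as `groupby(level=...)`.

```python
import pandas as pd

# Invented rows, for illustration only
flights = pd.DataFrame({
    'AIRLINE': ['AA', 'AA', 'AA', 'AA'],
    'MONTH': [1, 1, 2, 2],
    'DISTANCE': [100, 300, 200, 400],
})

# Min and max per airline per month
monthly = flights.groupby(['AIRLINE', 'MONTH'])['DISTANCE'].aggregate(['min', 'max'])

# Averaging the monthly summaries per airline
avg_of_monthly = monthly.groupby(level='AIRLINE').mean()

# Aggregating the raw rows per airline
overall = flights.groupby('AIRLINE')['DISTANCE'].aggregate(['min', 'max'])

print(avg_of_monthly)
print(overall)
```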
