# 03. Custom Aggregation

### Objectives
After this lesson you should be able to...
+ Write your own custom aggregation functions
+ Assign the GroupBy object itself to a variable
+ Use GroupBy methods other than **`agg`**

# Introduction
Pandas GroupBy objects come with many built-in aggregate functions. These are all available as strings within the **`agg`** method. There are, of course, many other possible aggregations that are not directly available. It is possible to define your own customized aggregate function. These customized functions must return a single value.

## Writing your own custom aggregation function
Let's suppose you would like to know the difference between the max and min value of a column for each group. Pandas does not have an aggregate function built to do this. You will have to define this one yourself. 

Each customized aggregate function is defined like a regular Python function with the **`def`** keyword. Each function is **implicitly** passed the aggregating column. This aggregating column is passed as a **`Series`**. This means that all Series methods will work on the passed argument.

The **`min_max`** function below takes one argument, **`s`**, which is a Series object. It returns the difference between the max and min values of that Series.

In [1]:
import pandas as pd
import numpy as np

college = pd.read_csv('../data/college.csv')

def min_max(s):
    return s.max() - s.min()

## Using your customized aggregation function
Customized aggregation functions are used similarly to the built-in aggregation functions. When using them within the **`agg`** method, use the actual function object and not the string name. 

The following finds the difference between the maximum and minimum student populations for schools with and without religious affiliation.

In [2]:
college.groupby('relaffil').agg({'ugds': min_max})

Unnamed: 0_level_0,ugds
relaffil,Unnamed: 1_level_1
0,151558.0
1,49340.0


### Implicit passing of aggregation Series
The above **`agg`** method passed the **`ugds`** column as a Series to our customized aggregation function, **`min_max`**, once for each group. The parameter **`s`** takes on this Series. We say this is implicit because we never write the call to **`min_max`** ourselves; **`agg`** makes it for us.

An **explicit** call to **`min_max`** would look like this:

In [3]:
min_max(college['ugds'])

151558.0
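
To make the implicit call concrete, the following sketch reproduces by hand what **`agg`** does for us. It uses a small made-up DataFrame (`toy` is an assumption for illustration, not the college data) and iterates over the GroupBy object, which yields each group label together with that group's rows.

```python
import pandas as pd

# Hypothetical toy data standing in for the college dataset
toy = pd.DataFrame({'relaffil': [0, 0, 0, 1, 1],
                    'ugds': [100.0, 250.0, 40.0, 30.0, 75.0]})

def min_max(s):
    return s.max() - s.min()

# Explicit version: call min_max ourselves on each group's Series
manual = {}
for label, group in toy.groupby('relaffil'):
    manual[label] = min_max(group['ugds'])

# Implicit version: agg makes the same calls for us
implicit = toy.groupby('relaffil').agg({'ugds': min_max})
```

Both `manual` and `implicit` hold the same range for each group, which suggests thinking of **`agg`** as a loop over groups that we do not have to write.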

### Custom aggregation function must return a single value
If your custom aggregation function does not return a single value, an exception will be raised. Let's create a custom aggregation that adds 5 to each value. This will return a Series the same size as the group and not a single number.

In [4]:
def add5(s):
    return s + 5

Attempting this produces an error:

In [5]:
college.groupby('relaffil').agg({'ugds': add5})

Exception: Must produce aggregated value
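
One way to catch this kind of mistake early is to call the function on a single group yourself and check what comes back. The sketch below does this on made-up data (`toy` and `one_group` are hypothetical names) using `pd.api.types.is_scalar`; it is only a quick sanity check, not something **`agg`** requires.

```python
import pandas as pd

# Hypothetical toy data; not the college dataset
toy = pd.DataFrame({'relaffil': [0, 0, 1, 1],
                    'ugds': [100.0, 250.0, 30.0, 75.0]})

def min_max(s):
    return s.max() - s.min()

def add5(s):
    return s + 5

# Pull out one group's Series and inspect what each function returns
one_group = toy.loc[toy['relaffil'] == 0, 'ugds']
min_max_ok = pd.api.types.is_scalar(min_max(one_group))  # a single value
add5_ok = pd.api.types.is_scalar(add5(one_group))        # a Series, not a single value
```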

## Combine custom aggregation function with built-ins
The custom aggregation function can be used in conjunction with any number of the built-in aggregation functions that we have previously seen. You will have to rename the columns to remove the MultiIndex as usual.

In [6]:
college.groupby(['stabbr', 'relaffil'], as_index=False) \
       .agg({'ugds': ['size', 'min', 'max', min_max]}).head(12)

Unnamed: 0_level_0,stabbr,relaffil,ugds,ugds,ugds,ugds
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,size,min,max,min_max
0,AK,0,7,109.0,12865.0,12756.0
1,AK,1,3,27.0,275.0,248.0
2,AL,0,72,12.0,29851.0,29839.0
3,AL,1,24,13.0,3033.0,3020.0
4,AR,0,68,18.0,21405.0,21387.0
5,AR,1,18,20.0,4485.0,4465.0
6,AS,0,1,1276.0,1276.0,0.0
7,AZ,0,124,1.0,151558.0,151557.0
8,AZ,1,9,25.0,4102.0,4077.0
9,CA,0,609,0.0,44744.0,44744.0
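
As a reminder of the renaming step, here is a minimal sketch on made-up data (`toy` is hypothetical). It joins the two column levels with an underscore, which is one common convention and not the only option. Note that the column created by a custom function takes the function's name.

```python
import pandas as pd

# Hypothetical toy data; not the college dataset
toy = pd.DataFrame({'stabbr': ['AK', 'AK', 'AL', 'AL', 'AL'],
                    'ugds': [100.0, 250.0, 30.0, 75.0, 10.0]})

def min_max(s):
    return s.max() - s.min()

result = toy.groupby('stabbr').agg({'ugds': ['min', 'max', min_max]})

# Each column label is a tuple such as ('ugds', 'min'); join the two levels
result.columns = ['_'.join(col) for col in result.columns]

# Move the grouping key from the index back into a regular column
result = result.reset_index()
```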


## Finding the percentage of all undergraduates represented in the top 5 most populous colleges
A slightly more involved example would be to find the percentage of undergraduates that attend the top 5 most populous colleges for each state.

To accomplish this, our custom function sorts the values within each group from greatest to least. We then select the first 5 values with **`.iloc`** and sum them. We divide this sum by the total.

In [7]:
def top5_perc(s):
    s = s.sort_values(ascending=False)
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

In [8]:
college.groupby('stabbr').agg({'ugds': top5_perc}).head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486
CA,0.076559
CO,0.378463
CT,0.296679
DC,0.755056
DE,0.855314


## Run operations that are independent of the group outside of the custom function
In general, it is best to minimize the amount of code inside the custom function. The only commands that should go inside the custom function are those that depend on the grouping.

In the above example, there is no need to sort the values inside the group. We can instead sort the values before the grouping. Pandas preserves the order of the values in each group, so you can be sure that the top 5 values are the same for both methods.

In [9]:
def top5_perc_simple(s):
    top5_total = s.iloc[:5].sum()
    total = s.sum()
    return top5_total / total

In [10]:
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486
CA,0.076559
CO,0.378463
CT,0.296679
DC,0.755056
DE,0.855314
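
If you want to convince yourself that the two approaches agree, a check along these lines can help. It is a sketch on made-up data (`toy`, with hypothetical states `AA` and `BB`); integer populations are used so that the sums are exact regardless of order.

```python
import pandas as pd

# Hypothetical toy data: state 'AA' has seven schools, 'BB' has three
toy = pd.DataFrame({
    'stabbr': ['AA', 'BB', 'AA', 'AA', 'BB', 'AA', 'AA', 'BB', 'AA', 'AA'],
    'ugds':   [10, 4, 1, 5, 6, 3, 2, 10, 8, 7],
})

def top5_perc(s):
    s = s.sort_values(ascending=False)
    return s.iloc[:5].sum() / s.sum()

def top5_perc_simple(s):
    return s.iloc[:5].sum() / s.sum()

# Sort inside every group vs. sort once before grouping
sorted_inside = toy.groupby('stabbr').agg({'ugds': top5_perc})
sorted_before = (toy.sort_values('ugds', ascending=False)
                    .groupby('stabbr').agg({'ugds': top5_perc_simple}))

largest_gap = (sorted_inside['ugds'] - sorted_before['ugds']).abs().max()
```

Here `largest_gap` is zero, and a state with fewer than 5 schools (like `BB`) comes out as 1.0, matching what we saw for `AS` above.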


### Comparing performance
The fewer operations that occur within the custom GroupBy function, the better the performance will be.

Sorting once before grouping cuts the run time roughly in half in the timings below.

In [11]:
%timeit -n 5 college.groupby('stabbr').agg({'ugds': top5_perc}).head(10)

37.3 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [12]:
%%timeit -n 5 
college.sort_values('ugds', ascending=False) \
       .groupby('stabbr').agg({'ugds': top5_perc_simple}).head(10)

19.2 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


# Pandas Power User Optimization
Performance is almost always better when custom functions are avoided. This is because Pandas only optimizes a select set of functions: the ones that we can use as strings, such as `sum`, `max`, and `min`.

The approach below uses only built-in Pandas GroupBy functions.

### Get top 5 rows with `head` GroupBy method
You can get the first 5 rows of **each** group by calling the `head` method directly after grouping. With no argument, `head` returns 5 rows per group.

In [13]:
college_top5 = college.sort_values('ugds', ascending=False) \
                      .groupby('stabbr').head()

In [14]:
college_top5.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
7116,University of Phoenix-Arizona,Tempe,AZ,0.0,0.0,0.0,0,,,0.0,...,0.1131,0.0131,0.3152,0.0,1,0.6009,0.592,,,33000
1189,Ivy Tech Community College,Indianapolis,IN,0.0,0.0,0.0,0,,,0.0,...,0.0209,0.0003,0.0354,0.635,1,0.5153,0.3384,0.478,29400.0,13000
793,Miami Dade College,Miami,FL,0.0,0.0,0.0,0,,,0.0,...,0.0035,0.0521,0.028,0.5824,1,0.5399,0.0921,0.3503,30100.0,8500
3711,Lone Star College System,The Woodlands,TX,0.0,0.0,0.0,0,,,0.0,...,0.0281,0.019,0.0292,0.6863,1,0.3405,0.1984,0.3201,32900.0,11000
3669,Houston Community College,Houston,TX,0.0,0.0,0.0,0,,,0.0,...,0.0151,0.0911,0.0198,0.7027,1,0.668,0.3348,0.4751,32500.0,10750


We can verify this by counting the rows for each state in the resulting DataFrame. Each count should be at most 5.

In [15]:
college_top5['stabbr'].value_counts().head(10)

OK    5
IN    5
SC    5
KS    5
WA    5
MO    5
WV    5
TN    5
FL    5
NV    5
Name: stabbr, dtype: int64

### Sum the school populations from this DataFrame
We can now total the populations for each state by using another call to **`groupby`**.

In [16]:
top5_total = college_top5.groupby('stabbr').agg({'ugds': 'sum'})
top5_total.head()

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,23974.0
AL,92059.0
AR,56985.0
AS,1276.0
AZ,287015.0


### Sum all the schools for each state
Use the original DataFrame to find the total population of all schools in each state with yet another call to **`groupby`**.

In [17]:
total = college.groupby('stabbr').agg({'ugds': 'sum'})
total.head()

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,24932.0
AL,248298.0
AR,134820.0
AS,1276.0
AZ,520439.0


### Divide the last two DataFrames
We get our desired result by dividing the top 5 total by the grand total. This is the same result as the other two methods.

In [18]:
(top5_total / total).head()

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,0.961575
AL,0.37076
AR,0.422675
AS,1.0
AZ,0.551486
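
The division works because Pandas lines up the two DataFrames by their index labels (here `stabbr`) rather than by row position. The sketch below illustrates this on made-up data (`toy` is hypothetical) by deliberately reversing the row order of one of the operands.

```python
import pandas as pd

# Hypothetical toy data: state 'AA' has seven schools, 'BB' has three
toy = pd.DataFrame({
    'stabbr': ['AA', 'BB', 'AA', 'AA', 'BB', 'AA', 'AA', 'BB', 'AA', 'AA'],
    'ugds':   [10, 4, 1, 5, 6, 3, 2, 10, 8, 7],
})

top5_total = (toy.sort_values('ugds', ascending=False)
                 .groupby('stabbr').head()
                 .groupby('stabbr').agg({'ugds': 'sum'}))

# Reverse the row order of the totals to show that position does not matter
total = toy.groupby('stabbr').agg({'ugds': 'sum'}).sort_index(ascending=False)

# Division matches rows by their stabbr label, not by position
ratio = top5_total / total
```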


## Performance
Running all the commands together yields the best performance. We were able to reduce the time to complete the task by about 80% compared with the original custom aggregation. Interestingly, the fastest approach used three **`groupby`** calls versus just one for the others. This shows how much more optimized the Pandas built-in grouping functions are.

In [19]:
%%timeit -n 5
college_top5 = college.sort_values('ugds', ascending=False) \
                      .groupby('stabbr').head()
top5_total = college_top5.groupby('stabbr').agg({'ugds': 'sum'})
total = college.groupby('stabbr').agg({'ugds': 'sum'})
top5_total / total

6.84 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


# Complexity vs Performance
This is usually a topic of debate when deciding which Pandas methods to use. I typically like to avoid custom aggregation functions at all costs, as they can drastically reduce performance for larger datasets.

On the other hand, readability (low complexity) is very valuable when sharing your code or looking back at it at a later date.

# More GroupBy methods
There are many more GroupBy methods that work by calling them after **`groupby`**. Let's assign the GroupBy object to a variable and call several methods directly from it.

In [20]:
g = college.groupby(['stabbr', 'relaffil'], as_index=False)

### Using an aggregation on all the columns
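It's possible to use an aggregation on all the columns at once. Simply do not specify any aggregating columns. In the Pandas version used for the output below, columns where the aggregation fails (like taking the mean of strings) are silently dropped. More recent Pandas releases (2.0 and later) no longer drop these columns automatically; pass **`numeric_only=True`** to restrict the aggregation to numeric columns.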

In [21]:
g.sum().head()

Unnamed: 0,stabbr,relaffil,hbcu,menonly,womenonly,satvrmid,satmtmid,distanceonly,ugds,ugds_white,...,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv
0,AK,0,0.0,0.0,0.0,0.0,0.0,0.0,24562.0,2.9691,...,1.3864,0.1492,0.3366,0.0523,0.765,2.6594,7,2.3862,2.322,3.9164
1,AK,1,0.0,0.0,0.0,555.0,503.0,0.0,370.0,1.4416,...,1.039,0.0109,0.1462,0.0,0.2197,0.5961,3,1.5591,1.4946,1.146
2,AL,0,9.0,0.0,0.0,6694.0,6705.0,0.0,230663.0,36.3033,...,0.4512,0.1297,0.9414,0.4844,2.5344,18.7881,68,42.7749,33.1788,28.8483
3,AL,1,6.0,0.0,1.0,3984.0,3885.0,1.0,17635.0,6.1954,...,0.0994,0.0197,0.1796,0.2687,0.922,3.2535,22,10.9474,12.1875,5.2111
4,AR,0,1.0,0.0,0.0,4330.0,4532.0,0.0,121971.0,40.2866,...,0.6433,0.0889,1.3633,0.3991,1.0406,14.2407,66,40.2682,32.1545,25.4562
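
If you prefer not to depend on which columns Pandas decides to drop, you can ask for numeric columns explicitly. This is a small sketch on made-up data (`toy` is hypothetical) using the `numeric_only` parameter of the GroupBy `sum` method.

```python
import pandas as pd

# Hypothetical toy data with one string column and one numeric column
toy = pd.DataFrame({'stabbr': ['AK', 'AK', 'AL'],
                    'instnm': ['School A', 'School B', 'School C'],
                    'ugds': [100.0, 250.0, 30.0]})

# Only numeric columns are aggregated; the string column instnm is left out
numeric_sums = toy.groupby('stabbr').sum(numeric_only=True)
```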


### Get the head and tail of each group
Get the first or last few rows of each group with the **`head`** and **`tail`** methods. This is not an aggregation, as more than one row can be returned for each group. We sort the data by the grouping keys to show that we get 3 (or fewer) schools from each state by religious affiliation.

In [22]:
g.head(3).sort_values(['stabbr', 'relaffil']).head(15)

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
60,University of Alaska Anchorage,Anchorage,AK,0.0,0.0,0.0,0,,,0.0,...,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500,19449.5
62,University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,...,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200,19355
63,University of Alaska Southeast,Juneau,AK,0.0,0.0,0.0,0,,,0.0,...,0.0686,0.0049,0.2241,0.5112,1,0.1769,0.1996,0.555,37400,16875
61,Alaska Bible College,Palmer,AK,0.0,0.0,0.0,1,,,0.0,...,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,,PrivacySuppressed
64,Alaska Pacific University,Anchorage,AK,0.0,0.0,0.0,1,555.0,503.0,0.0,...,0.0945,0.0,0.0873,0.3745,1,0.3152,0.5297,0.491,47000,23250
5417,Alaska Christian College,Soldotna,AK,0.0,0.0,0.0,1,,,0.0,...,0.0147,0.0,0.1324,0.0735,1,0.8868,0.6792,0.2264,,PrivacySuppressed
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370


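You can use **`first/last`** to get the first or last non-missing value of each column within each group. Because missing values are skipped column by column, a returned row can combine values from different original rows. In the output below, the SAT scores shown next to Alaska Bible College actually come from Alaska Pacific University.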

In [23]:
g.first().head(10)

Unnamed: 0,stabbr,relaffil,instnm,city,hbcu,menonly,womenonly,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,AK,0,University of Alaska Anchorage,Anchorage,0.0,0.0,0.0,,,0.0,...,0.098,0.0181,0.0457,0.4539,1,0.2385,0.2647,0.4386,42500,19449.5
1,AK,1,Alaska Bible College,Palmer,0.0,0.0,0.0,555.0,503.0,0.0,...,0.037,0.0,0.0,0.1481,1,0.3571,0.2857,0.4286,47000,PrivacySuppressed
2,AL,0,Alabama A & M University,Normal,1.0,0.0,0.0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888
3,AL,1,Amridge University,Montgomery,0.0,0.0,0.0,560.0,560.0,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370
4,AR,0,University of Arkansas at Little Rock,Little Rock,0.0,0.0,0.0,470.0,510.0,0.0,...,0.0755,0.0283,0.0003,0.4126,1,0.3941,0.4775,0.4062,33900,21736
5,AR,1,Arkansas Baptist College,Little Rock,1.0,0.0,0.0,505.0,528.0,0.0,...,0.0,0.0089,0.0,0.1127,1,0.8306,0.8695,0.2833,22000,38000
6,AS,0,American Samoa Community College,Pago Pago,0.0,0.0,0.0,,,0.0,...,0.0,0.0721,0.0024,0.4389,1,0.7245,0.0,0.1774,19800,PrivacySuppressed
7,AZ,0,Collins College,Phoenix,0.0,0.0,0.0,565.0,580.0,0.0,...,0.0241,0.0,0.3855,0.3373,0,0.7205,0.8228,0.4764,25700,47000
8,AZ,1,Everest College-Phoenix,Phoenix,0.0,0.0,0.0,485.0,480.0,0.0,...,0.0373,0.0,0.1026,0.4749,0,0.8291,0.7151,0.67,28600,9500
9,CA,0,Academy of Art University,San Francisco,0.0,0.0,0.0,765.0,785.0,0.0,...,0.0249,0.2523,0.2098,0.4334,1,0.4008,0.5524,0.4043,36000,35093
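
The difference between `first` and the true first row of a group is easy to miss, so here is a minimal sketch on made-up data (`toy` is hypothetical) that puts `first` next to `head(1)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first 'AK' row has a missing SAT score
toy = pd.DataFrame({'stabbr': ['AK', 'AK', 'AL'],
                    'instnm': ['School A', 'School B', 'School C'],
                    'satvrmid': [np.nan, 555.0, 424.0]})

grouped = toy.groupby('stabbr')
first_values = grouped.first()  # first non-missing value in each column
first_rows = grouped.head(1)    # actual first row of each group, NaN kept
```

For `AK`, `first_values` pairs the name `School A` with the score 555.0 from `School B`, while `first_rows` keeps the missing score that really belongs to `School A`.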


Use the **`nth`** method to select rows by integer location within that group. The following takes rows with integer location 1, 10, and 20 for each group.

In [24]:
g.nth([1, 10, 20]).head(10)

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
10,Birmingham Southern College,Birmingham,AL,0.0,0.0,0.0,1,560.0,560.0,0.0,...,0.0051,0.0,0.0051,0.0017,1,0.192,0.4809,0.0152,44200,27000.0
13,South University-Montgomery,Montgomery,AL,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0019,0.0326,0.3238,1,0.6845,0.7129,0.5726,28800,25167.0
26,Jacksonville State University,Jacksonville,AL,0.0,0.0,0.0,0,495.0,485.0,0.0,...,0.0,0.0234,0.028,0.1989,1,0.4428,0.6825,0.22,34600,23000.0
46,Samford University,Birmingham,AL,0.0,0.0,0.0,1,565.0,558.0,0.0,...,0.0115,0.0297,0.0036,0.0409,1,0.1391,0.3385,0.0601,45800,23000.0
62,University of Alaska Fairbanks,Fairbanks,AK,0.0,0.0,0.0,0,,,0.0,...,0.0401,0.011,0.306,0.3887,1,0.2263,0.255,0.4519,36200,19355.0
64,Alaska Pacific University,Anchorage,AK,0.0,0.0,0.0,1,555.0,503.0,0.0,...,0.0945,0.0,0.0873,0.3745,1,0.3152,0.5297,0.491,47000,23250.0
70,Empire Beauty School-Paradise Valley,Phoenix,AZ,0.0,0.0,0.0,1,,,0.0,...,0.04,0.0,0.0,0.16,0,0.6349,0.5873,0.4651,17800,9588.0
71,Empire Beauty School-Tucson,Tucson,AZ,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0,0.0079,0.2222,1,0.7962,0.6615,0.4229,18200,9833.0
81,Brookline College-Phoenix,Phoenix,AZ,0.0,0.0,0.0,0,,,0.0,...,0.0265,0.0007,0.0346,0.3051,1,0.687,0.6783,0.6389,22200,9500.0
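
A smaller sketch may make the behavior of `nth` easier to see. It uses made-up data (`toy` is hypothetical) and shows that a requested position is simply skipped for any group that is too short to have it.

```python
import pandas as pd

# Hypothetical toy data: 'AK' has three rows, 'AL' has two
toy = pd.DataFrame({'stabbr': ['AK', 'AL', 'AK', 'AL', 'AK'],
                    'ugds': [10, 20, 30, 40, 50]})

# Positions 0 and 2 within each group; 'AL' has no position 2,
# so it contributes only its position-0 value
picked = toy.groupby('stabbr')['ugds'].nth([0, 2])
picked_values = sorted(picked.tolist())
```

The index and row order of the result of `nth` differ between Pandas versions, which is why the sketch only looks at the sorted values.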


## `size` vs `count`
**`size`** counts the rows in each group, which is the same for every column, so Pandas returns just a single column. **`count`** counts only the non-missing values, which may differ from column to column, so it returns a new column for each original column.

In [25]:
g.size().head(10)

stabbr  relaffil
AK      0             7
        1             3
AL      0            72
        1            24
AR      0            68
        1            18
AS      0             1
AZ      0           124
        1             9
CA      0           609
dtype: int64

In [26]:
g.count().head()

Unnamed: 0,stabbr,relaffil,instnm,city,hbcu,menonly,womenonly,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,AK,0,7,7,7,7,7,0,0,7,...,7,7,7,7,7,7,7,7,7,7
1,AK,1,3,3,3,3,3,1,1,3,...,3,3,3,3,3,3,3,3,1,3
2,AL,0,72,72,72,72,72,13,13,72,...,71,71,71,71,72,71,71,71,64,72
3,AL,1,24,24,18,18,18,8,8,18,...,18,18,18,18,24,18,18,17,21,24
4,AR,0,68,68,68,68,68,9,9,68,...,68,68,68,68,68,68,68,66,64,68
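
The following sketch shows the same contrast on made-up data (`toy` is hypothetical) that is small enough to check by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing SAT score in the 'AK' group
toy = pd.DataFrame({'stabbr': ['AK', 'AK', 'AL'],
                    'satvrmid': [np.nan, 555.0, 424.0],
                    'ugds': [100.0, 250.0, 30.0]})

grouped = toy.groupby('stabbr')
sizes = grouped.size()    # Series: number of rows in each group
counts = grouped.count()  # DataFrame: non-missing values in each column
```

`AK` has a size of 2, but its `satvrmid` count is only 1 because of the missing value.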


# Exercises
Solutions are below.

Use the flights data for these problems.

In [27]:
import pandas as pd
pd.options.display.max_columns = 40
flights = pd.read_csv('../data/flights.csv')
flights.head()

Unnamed: 0,year,month,day,day_of_week,airline,flight_number,tail_number,origin_airport,destination_airport,scheduled_departure,departure_time,departure_delay,taxi_out,wheels_off,scheduled_time,elapsed_time,air_time,distance,wheels_on,taxi_in,scheduled_arrival,arrival_time,arrival_delay,diverted,cancelled,cancellation_reason,air_system_delay,security_delay,airline_delay,late_aircraft_delay,weather_delay
0,2015,1,1,4,WN,1908,N8324A,LAX,SLC,1625,1723.0,58.0,10.0,1733.0,100.0,107.0,94.0,590,2007.0,3.0,1905,2010.0,65.0,0,0,,31.0,0.0,0.0,34.0,0.0
1,2015,1,1,4,UA,581,N448UA,DEN,IAD,823,830.0,7.0,11.0,841.0,190.0,170.0,154.0,1452,1315.0,5.0,1333,1320.0,-13.0,0,0,,,,,,
2,2015,1,1,4,MQ,2851,N645MQ,DFW,VPS,1305,1341.0,36.0,18.0,1359.0,108.0,107.0,85.0,641,1524.0,4.0,1453,1528.0,35.0,0,0,,0.0,0.0,35.0,0.0,0.0
3,2015,1,1,4,AA,383,N3EUAA,DFW,DCA,1555,1602.0,7.0,13.0,1615.0,160.0,146.0,126.0,1192,1921.0,7.0,1935,1928.0,-7.0,0,0,,,,,,
4,2015,1,1,4,WN,3047,N560WN,LAX,MCI,1720,1808.0,48.0,6.0,1814.0,185.0,176.0,166.0,1363,2300.0,4.0,2225,2304.0,39.0,0,0,,0.0,0.0,17.0,22.0,0.0


## Problem 1
<span  style="color:green; font-size:16px">What are the 3 least common airlines?</span>

## Problem 2
<span  style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on the 4th day of the week. Use a custom aggregation function.</span>

## Problem 3
<span  style="color:green; font-size:16px">Redo problem 2 without using a custom aggregation function. What is the performance difference?</span>

## Problem 4
<span  style="color:green; font-size:16px">The range of undergrad populations per state was calculated using the `min_max` custom function from the top of this notebook. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

## Problem 5
<span  style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use one of the direct [GroupBy methods][1]</span>

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

## Problem 6
<span  style="color:green; font-size:16px">Find the total number of rows in each group of airline, month, and origin airport.</span>

# Solutions

In [54]:
import pandas as pd
pd.options.display.max_columns = 40
flights = pd.read_csv('../data/flights.csv')
flights.head()

Unnamed: 0,year,month,day,day_of_week,airline,flight_number,tail_number,origin_airport,destination_airport,scheduled_departure,departure_time,departure_delay,taxi_out,wheels_off,scheduled_time,elapsed_time,air_time,distance,wheels_on,taxi_in,scheduled_arrival,arrival_time,arrival_delay,diverted,cancelled,cancellation_reason,air_system_delay,security_delay,airline_delay,late_aircraft_delay,weather_delay
0,2015,1,1,4,WN,1908,N8324A,LAX,SLC,1625,1723.0,58.0,10.0,1733.0,100.0,107.0,94.0,590,2007.0,3.0,1905,2010.0,65.0,0,0,,31.0,0.0,0.0,34.0,0.0
1,2015,1,1,4,UA,581,N448UA,DEN,IAD,823,830.0,7.0,11.0,841.0,190.0,170.0,154.0,1452,1315.0,5.0,1333,1320.0,-13.0,0,0,,,,,,
2,2015,1,1,4,MQ,2851,N645MQ,DFW,VPS,1305,1341.0,36.0,18.0,1359.0,108.0,107.0,85.0,641,1524.0,4.0,1453,1528.0,35.0,0,0,,0.0,0.0,35.0,0.0,0.0
3,2015,1,1,4,AA,383,N3EUAA,DFW,DCA,1555,1602.0,7.0,13.0,1615.0,160.0,146.0,126.0,1192,1921.0,7.0,1935,1928.0,-7.0,0,0,,,,,,
4,2015,1,1,4,WN,3047,N560WN,LAX,MCI,1720,1808.0,48.0,6.0,1814.0,185.0,176.0,166.0,1363,2300.0,4.0,2225,2304.0,39.0,0,0,,0.0,0.0,17.0,22.0,0.0


## Problem 1
<span  style="color:green; font-size:16px">What are the 3 least common airlines?</span>

In [55]:
flights['airline'].value_counts().tail(3)

AS    768
B6    543
HA    112
Name: airline, dtype: int64

## Problem 2
<span  style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on the 4th day of the week. Use a custom aggregation function.</span>

In [56]:
def day_pct(s):
    return (s == 4).mean()

flights.groupby('airline').agg({'day_of_week':day_pct})

Unnamed: 0_level_0,day_of_week
airline,Unnamed: 1_level_1
AA,0.149775
AS,0.143229
B6,0.154696
DL,0.145835
EV,0.139638
F9,0.14123
HA,0.133929
MQ,0.161913
NK,0.149736
OO,0.144809


## Problem 3
<span  style="color:green; font-size:16px">Redo problem 2 without using a custom aggregation function. What is the performance difference?</span>

In [7]:
flights['is_4th'] = flights['day_of_week'] == 4
flights.groupby('airline').agg({'is_4th': 'mean'})

Unnamed: 0_level_0,is_4th
airline,Unnamed: 1_level_1
AA,0.149775
AS,0.143229
B6,0.154696
DL,0.145835
EV,0.139638
F9,0.14123
HA,0.133929
MQ,0.161913
NK,0.149736
OO,0.144809


The built-in version runs in about half the time of the custom aggregation (roughly 4.6 ms vs 9.1 ms).

In [8]:
%%timeit -n 5
flights['is_4th'] = flights['day_of_week'] == 4
flights.groupby('airline').agg({'is_4th': 'mean'})

4.62 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)


In [9]:
%timeit -n 5 flights.groupby('airline').agg({'day_of_week': day_pct})

9.08 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)


## Problem 4
<span  style="color:green; font-size:16px">The range of undergrad populations per state was calculated using the `min_max` custom function from the top of this notebook. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

In [15]:
def min_max(s):
    return s.max() - s.min()

In [16]:
flights.groupby('airline').agg({'distance': min_max})

Unnamed: 0_level_0,distance
airline,Unnamed: 1_level_1
AA,3609
AS,2425
B6,2473
DL,4396
EV,1256
F9,1845
HA,579
MQ,1161
NK,2145
OO,1668


In [21]:
dist_min_max = flights.groupby('airline').agg({'distance': ['min', 'max']}).reset_index()
dist_min_max.columns = ['airline', 'Min Dist', 'Max Dist']
dist_min_max['Dist Range'] = dist_min_max['Max Dist'] - dist_min_max['Min Dist']
dist_min_max

Unnamed: 0,airline,Min Dist,Max Dist,Dist Range
0,AA,175,3784,3609
1,AS,421,2846,2425
2,B6,231,2704,2473
3,DL,106,4502,4396
4,EV,74,1330,1256
5,F9,373,2218,1845
6,HA,2338,2917,579
7,MQ,89,1250,1161
8,NK,236,2381,2145
9,OO,67,1735,1668


## Problem 5
<span  style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use one of the direct [GroupBy methods][1]</span>

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#groupby

In [55]:
flights.groupby('airline').nth([0, -1]).head(10)

Unnamed: 0_level_0,airline_delay,air_system_delay,air_time,arrival_delay,arrival_time,cancellation_reason,cancelled,day,day_of_week,departure_delay,departure_time,destination_airport,distance,diverted,elapsed_time,flight_number,late_aircraft_delay,month,origin_airport,scheduled_arrival,scheduled_departure,scheduled_time,security_delay,tail_number,taxi_in,taxi_out,weather_delay,wheels_off,wheels_on,year
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1
AA,,,126.0,-7.0,1928.0,,0,1,4,7.0,1602.0,DCA,1192,0,146.0,383,,1,DFW,1935,1555,160.0,,N3EUAA,7.0,13.0,,1615.0,1921.0,2015
AA,,,166.0,-19.0,1026.0,,0,31,4,5.0,520.0,DFW,1464,0,186.0,1454,,12,SFO,1045,515,210.0,,N852AA,10.0,10.0,,530.0,1016.0,2015
AS,,,127.0,-25.0,1644.0,,0,31,4,-8.0,1412.0,SEA,954,0,152.0,323,,12,LAX,1709,1420,169.0,,N323AS,6.0,19.0,,1431.0,1638.0,2015
AS,,,155.0,-3.0,1659.0,,0,1,4,-2.0,1503.0,SEA,1107,0,176.0,633,,1,PHX,1702,1505,177.0,,N320AS,4.0,17.0,,1520.0,1655.0,2015
B6,,,231.0,-45.0,430.0,,0,31,4,-12.0,2224.0,BOS,2300,0,246.0,602,,12,PHX,515,2236,279.0,,N625JB,3.0,12.0,,2236.0,427.0,2015
B6,,,246.0,-27.0,1959.0,,0,1,4,0.0,1230.0,BOS,2381,0,269.0,178,,1,LAS,2026,1230,296.0,,N625JB,4.0,19.0,,1249.0,1955.0,2015
DL,,,156.0,-18.0,1202.0,,0,1,4,-5.0,708.0,MSP,1299,0,174.0,1550,,1,LAS,1220,713,187.0,,N3739P,6.0,12.0,,720.0,1156.0,2015
DL,,,64.0,-8.0,2330.0,,0,31,4,2.0,2208.0,CMH,447,0,82.0,1640,,12,ATL,2338,2206,92.0,,N841DN,4.0,14.0,,2222.0,2326.0,2015
EV,,,52.0,14.0,1026.0,,0,31,4,21.0,911.0,LFT,351,0,75.0,2758,,12,DFW,1012,850,82.0,,N633AE,4.0,19.0,,930.0,1022.0,2015
EV,,,113.0,5.0,1408.0,,0,1,4,6.0,1201.0,JAN,677,0,127.0,4589,,1,ORD,1403,1155,128.0,,N13992,4.0,10.0,,1211.0,1404.0,2015


## Problem 6
<span  style="color:green; font-size:16px">Find the total number of rows in each group of airline, month, and origin airport.</span>

In [26]:
flights.groupby(['airline', 'month', 'origin_airport']).size().head(10)

airline  month  origin_airport
AA       1      ATL                 4
                DEN                16
                DFW               365
                IAH                12
                LAS                32
                LAX                82
                MSP                 9
                ORD               121
                PHX                18
                SFO                18
dtype: int64