# Lab 05: Grouping

## #1
Reload the `planes` and `flights` DataFrames from the last lab.

In [2]:
import pandas as pd

In [3]:
planes = pd.read_csv('../data/planes.csv')
flights = pd.read_csv('../data/flights.csv')

## #2
What is the average departure delay (`dep_delay`) of all flights in this data?

In [5]:
flights['dep_delay'].mean()

12.639070257304708
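A note on missing values: if `dep_delay` contains any `NaN` entries (an assumption here; the lab does not say whether it does), `mean()` skips them by default. The sketch below uses a small made-up Series, not the real flights data, to show that behavior.

```python
import pandas as pd

# Hypothetical delays; the middle value is missing.
delays = pd.Series([10.0, float('nan'), 20.0])

# By default, mean() ignores NaN and averages the remaining values.
default_mean = delays.mean()

# With skipna=False, any NaN makes the result NaN.
strict_mean = delays.mean(skipna=False)

# Counting the missing values shows how many rows were skipped.
n_missing = delays.isna().sum()

print(default_mean, strict_mean, n_missing)
```

On the real data, `flights['dep_delay'].isna().sum()` would report how many rows the average leaves out.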

## #3
What is the average departure delay by carrier, for flights in this data?

In [6]:
flights.groupby('carrier').agg({'dep_delay': ['mean']})

,dep_delay
,mean
carrier,
9E,16.725769
AA,8.586016
AS,5.804775
B6,13.022522
DL,9.264505
EV,19.95539
F9,20.215543
FL,18.726075
HA,4.900585
MQ,10.552041


## #4
If you followed the groupby-agg workflow we covered in lecture, you passed your summary function within a list.

Try removing the brackets and rerunning #3.
What's different about the result?
Why do you think we focused on the list-based approach in class?

In [8]:
flights.groupby('carrier').agg({'dep_delay': 'mean'})

carrier,dep_delay
9E,16.725769
AA,8.586016
AS,5.804775
B6,13.022522
DL,9.264505
EV,19.95539
F9,20.215543
FL,18.726075
HA,4.900585
MQ,10.552041
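The difference is in the column labels: with a list, the result has two levels of column labels (column name, then function name); without the list, the columns stay flat. A minimal sketch on made-up data (the `toy` DataFrame is hypothetical, not the real flights table):

```python
import pandas as pd

# Hypothetical toy data standing in for the real flights table.
toy = pd.DataFrame({
    'carrier': ['AA', 'AA', 'DL', 'DL'],
    'dep_delay': [10.0, 20.0, -5.0, 5.0],
})

with_list = toy.groupby('carrier').agg({'dep_delay': ['mean']})
without_list = toy.groupby('carrier').agg({'dep_delay': 'mean'})

# With a list, the columns form a two-level MultiIndex: (column, function).
print(with_list.columns.tolist())     # [('dep_delay', 'mean')]

# Without a list, the columns are a flat Index of column names.
print(without_list.columns.tolist())  # ['dep_delay']
```

One plausible reason to prefer the list form (our guess, not something the lab states) is that the output keeps the same shape when you add more summary functions, and each column is labeled with the function that produced it.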


## #5

Working from your code for #3, calculate the minimum, mean, median, and maximum departure delay for each carrier.
Don't do this in 4 separate lines -- you can achieve this result using a single invocation of `groupby` and a single invocation of `agg`.

In [9]:
flights.groupby('carrier').agg({'dep_delay': ['min', 'mean', 'median', 'max']})

,dep_delay,dep_delay,dep_delay,dep_delay
,min,mean,median,max
carrier,,,,
9E,-24.0,16.725769,-2.0,747.0
AA,-24.0,8.586016,-3.0,1014.0
AS,-21.0,5.804775,-3.0,225.0
B6,-43.0,13.022522,-1.0,502.0
DL,-33.0,9.264505,-2.0,960.0
EV,-32.0,19.95539,-1.0,548.0
F9,-27.0,20.215543,0.5,853.0
FL,-22.0,18.726075,1.0,602.0
HA,-16.0,4.900585,-4.0,1301.0
MQ,-26.0,10.552041,-3.0,1137.0


## #6

Perhaps, after doing some research, you discover that the most meaningful statistic for departure delay is the *average delay*, but the most meaningful way to measure air time is the *median air time*.
Build a groupby-agg invocation that summarizes – at the carrier level – the average departure delay and the median air time.

In [10]:
flights.groupby('carrier').agg({'dep_delay': 'mean', 'air_time': 'median'})

carrier,dep_delay,air_time
9E,16.725769,83.0
AA,8.586016,169.0
AS,5.804775,324.0
B6,13.022522,142.0
DL,9.264505,145.0
EV,19.95539,87.0
F9,20.215543,229.0
FL,18.726075,109.0
HA,4.900585,621.5
MQ,10.552041,83.0
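When each column gets a different summary function, the output column names (`dep_delay`, `air_time`) no longer say which statistic they hold. One alternative, not covered in this lab, is pandas named aggregation, where each output column is given as `new_name=(column, function)`. A sketch on made-up data; the `toy` DataFrame and the output names `avg_dep_delay` and `median_air_time` are illustrative choices:

```python
import pandas as pd

# Hypothetical toy data standing in for the real flights table.
toy = pd.DataFrame({
    'carrier': ['AA', 'AA', 'AA', 'DL', 'DL'],
    'dep_delay': [10.0, 20.0, 30.0, -5.0, 5.0],
    'air_time': [100.0, 200.0, 900.0, 50.0, 70.0],
})

# Named aggregation: keyword = (input column, summary function).
summary = toy.groupby('carrier').agg(
    avg_dep_delay=('dep_delay', 'mean'),
    median_air_time=('air_time', 'median'),
)
print(summary)
```

The result has flat, self-describing column names, which can be easier to sort and select on later.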


## #7
What is the single most common route in this data?
That is to say, what combination of values for `origin` and `dest` occurs most frequently? *Hint: to order the data yourself, you may want to experiment with the `sort_values` method of Pandas Series and DataFrames.*

In [13]:
# as_index=False keeps the grouped columns as regular, selectable/sortable columns in the result.
# 'count' tallies non-null values only, so count a column with no missing values;
# any such column would work in place of tailnum.
route_counts = flights.groupby(['origin', 'dest'], as_index=False).agg({'tailnum': 'count'})
route_counts.head()

,origin,dest,tailnum
0,EWR,ALB,439
1,EWR,ANC,8
2,EWR,ATL,5022
3,EWR,AUS,961
4,EWR,AVL,265


In [14]:
# By default, sort_values sorts in ascending order, so pass ascending=False to put the largest counts first.
route_counts.sort_values('tailnum', ascending=False)

,origin,dest,tailnum
117,JFK,LAX,11237
156,LGA,ATL,10262
204,LGA,ORD,8717
146,JFK,SFO,8174
170,LGA,CLT,6114
...,...,...,...
152,JFK,STL,1
191,LGA,LEX,1
121,JFK,MEM,1
90,JFK,BHM,1


So JFK-to-LAX is the most common route in the data, with 11237 records.
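One caveat with counting a column: `count` only tallies non-null values, so a route's total is understated if the counted column has gaps. Whether `tailnum` has missing values in this data is not checked here. A sketch on made-up data (the `toy` DataFrame is hypothetical) comparing `count` with `size`, which counts rows regardless of nulls:

```python
import pandas as pd

# Hypothetical toy data; one JFK-LAX row has a missing tailnum.
toy = pd.DataFrame({
    'origin': ['JFK', 'JFK', 'JFK', 'LGA', 'LGA'],
    'dest': ['LAX', 'LAX', 'LAX', 'ATL', 'ATL'],
    'tailnum': ['N1', None, 'N3', 'N4', 'N5'],
})

# count() skips the missing tailnum, so JFK-LAX comes out as 2.
by_count = toy.groupby(['origin', 'dest'])['tailnum'].count()

# size() counts every row in the group, so JFK-LAX comes out as 3.
by_size = toy.groupby(['origin', 'dest']).size()

# idxmax() returns the index label of the largest value,
# here an (origin, dest) tuple.
top_route = by_size.idxmax()
print(top_route)
```

On the real data, `flights.groupby(['origin', 'dest']).size().sort_values(ascending=False)` would be the equivalent of the cells above.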

## Challenge question

Same as #7, but by carrier: what route is most common for each carrier? You will need to use approaches beyond what we've seen in class, but you should be able to find code that gets you most of the way there just by googling.

In [17]:
# Get route frequency by carrier.
carrier_route_counts = flights.groupby(['origin', 'dest', 'carrier'], as_index=False).agg({'tailnum': 'count'})
carrier_route_counts.head()

,origin,dest,carrier,tailnum
0,EWR,ALB,EV,439
1,EWR,ANC,UA,8
2,EWR,ATL,9E,4
3,EWR,ATL,DL,3153
4,EWR,ATL,EV,1762


In [20]:
# The tailnum column now holds route counts; take its max for each carrier.
max_freq_by_carrier = carrier_route_counts.groupby('carrier', as_index=False).agg({'tailnum': 'max'})
max_freq_by_carrier.head()

,carrier,tailnum
0,9E,1034
1,AA,5684
2,AS,714
3,B6,3304
4,DL,5544


In [21]:
# Join this back to the per-carrier route counts to recover origin and dest, matching on
# carrier and on the max count calculated in the last step. If two routes tie for a
# carrier's max, both rows are returned.
pd.merge(max_freq_by_carrier, carrier_route_counts, how='left', on=['tailnum', 'carrier'])

,carrier,tailnum,origin,dest
0,9E,1034,JFK,MSP
1,AA,5684,LGA,ORD
2,AS,714,EWR,SEA
3,B6,3304,JFK,MCO
4,DL,5544,LGA,ATL
5,EV,2529,EWR,DTW
6,F9,682,LGA,DEN
7,FL,2337,LGA,ATL
8,HA,342,JFK,HNL
9,MQ,3334,LGA,RDU


So, for example, the most common route on AA is LGA-to-ORD.
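The max-then-merge steps can also be written with `idxmax`, which returns the row label of each group's largest value. This is an alternative sketch, not the approach used above; the `toy_counts` DataFrame and its `n_flights` column are made up for illustration:

```python
import pandas as pd

# Hypothetical per-carrier route counts (same shape as carrier_route_counts,
# with the count column named n_flights for clarity).
toy_counts = pd.DataFrame({
    'origin': ['EWR', 'JFK', 'LGA', 'LGA'],
    'dest': ['ATL', 'MSP', 'ORD', 'ATL'],
    'carrier': ['9E', '9E', 'AA', 'AA'],
    'n_flights': [4, 30, 50, 10],
})

# For each carrier, find the row label where n_flights is largest.
top_rows = toy_counts.groupby('carrier')['n_flights'].idxmax()

# Select those rows to get one top route per carrier.
top_routes = toy_counts.loc[top_rows].reset_index(drop=True)
print(top_routes)
```

Note the difference on ties: `idxmax` keeps only the first row with the maximum, while the merge approach returns every tied route.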