# Solutions

1. [Groupby Aggregation Basics](#1.-Groupby-Aggregation-Basics)
1. [Grouping and Aggregating with Multiple Columns](#2.-Grouping-and-Aggregating-with-Multiple-Columns)
1. [Grouping with Pivot Tables](#3.-Grouping-with-Pivot-Tables)
1. [Counting with Crosstabs](#4.-Counting-with-Crosstabs)
1. [Alternate Groupby Syntax](#5.-Alternate-Groupby-Syntax)
1. [Custom Aggregation](#6.-Custom-Aggregation)
1. [Filter and Transform with Groupby](#7.-Filter-and-Transform-with-Groupby)
1. [Other Groupby Methods](#8.-Other-Groupby-Methods)
1. [Binning Numeric Columns](#9.-Binning-Numeric-Columns)
1. [Miscellaneous Grouping Functionality](#10.-Miscellaneous-Grouping-Functionality)
1. [Create your own Data Analysis](#11.-Create-your-own-Data-Analysis)

## 1. Groupby Aggregation Basics

In [1]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### Exercise 1

<span  style="color:green; font-size:16px">Find the maximum salary for each sex.</span>

In [2]:
emp.groupby('sex').agg(max_salary=('salary', 'max'))

Unnamed: 0_level_0,max_salary
sex,Unnamed: 1_level_1
Female,342784.0
Male,342784.0


### Exercise 2

<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [3]:
emp.groupby('dept').agg(median_salary=('salary', 'median')).head()

Unnamed: 0_level_0,median_salary
dept,Unnamed: 1_level_1
Fire,61921.08
Health & Human Services,50773.0
Houston Airport System,44200.0
Houston Public Works,46841.5
Library,34611.0


### Exercise 3

<span style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [4]:
emp.groupby('race').agg(avg_salary=('salary', 'mean')).round(-3).reset_index()

Unnamed: 0,race,avg_salary
0,Asian,65000.0
1,Black,52000.0
2,Hispanic,55000.0
3,Native American,58000.0
4,White,67000.0


### Exercise 4

<span style="color:green; font-size:16px">Find the number of employees in each department.</span>

It's not necessary to use a groupby.

In [5]:
emp['dept'].value_counts()

Police                     7573
Fire                       4376
Houston Public Works       4190
Other                      3373
Health & Human Services    1353
Houston Airport System     1216
Parks & Recreation         1152
Library                     563
Solid Waste Management      512
Name: dept, dtype: int64

If you do use a groupby, the choice of aggregating column does not matter, but you must use `size` and not `count`, because `count` excludes missing values. You can even use the grouping column as the aggregating column.

In [6]:
emp.groupby('dept').agg(num_employees=('dept', 'size'))

Unnamed: 0_level_0,num_employees
dept,Unnamed: 1_level_1
Fire,4376
Health & Human Services,1353
Houston Airport System,1216
Houston Public Works,4190
Library,563
Other,3373
Parks & Recreation,1152
Police,7573
Solid Waste Management,512
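
To make the difference between `size` and `count` concrete, here is a minimal sketch on a small made-up frame (the `toy` data and the column names `n_rows` and `n_non_missing` are hypothetical, not part of the employee dataset). It aggregates a column that contains missing values both ways.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five salaries are missing
toy = pd.DataFrame({
    'dept': ['Fire', 'Fire', 'Police', 'Police', 'Police'],
    'salary': [50000.0, np.nan, 60000.0, 70000.0, np.nan],
})

# 'size' counts every row in the group; 'count' skips missing values
counts = toy.groupby('dept').agg(n_rows=('salary', 'size'),
                                 n_non_missing=('salary', 'count'))
print(counts)
```

Here `n_rows` reports the number of rows per department, while `n_non_missing` is smaller wherever a salary is missing.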


### Exercise 5

<span style="color:green; font-size:16px">Find the number of unique titles there are for each department.</span>

In [7]:
emp.groupby('dept').agg(unique_titles=('title', 'nunique'))

Unnamed: 0_level_0,unique_titles
dept,Unnamed: 1_level_1
Fire,77
Health & Human Services,161
Houston Airport System,137
Houston Public Works,215
Library,66
Other,358
Parks & Recreation,109
Police,145
Solid Waste Management,44


### Exercise 6

<span style="color:green; font-size:16px">Find the index of the employee with the maximum salary for each department and then use those index values to select their entire rows from the original DataFrame.</span>

In [8]:
df = emp.groupby('dept').agg(idx_sal=('salary', 'idxmax'))
df

Unnamed: 0_level_0,idx_sal
dept,Unnamed: 1_level_1
Fire,1732
Health & Human Services,8405
Houston Airport System,3897
Houston Public Works,10704
Library,7564
Other,13338
Parks & Recreation,11679
Police,4413
Solid Waste Management,20244


In [9]:
idx = df['idx_sal']
emp.loc[idx]

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White
8405,Health & Human Services,"CHIEF PHYSICIAN,MD",2017-07-31,186685.0,Female,White
3897,Houston Airport System,AVIATION DIRECTOR,2010-06-01,275000.0,Male,Hispanic
10704,Houston Public Works,PUBLIC WORKS DIRECTOR,2005-08-10,275000.0,Female,White
7564,Library,LIBRARY DIRECTOR,2005-11-07,170000.0,Female,Black
13338,Other,CITY ATTORNEY,2016-05-02,275000.0,Male,Black
11679,Parks & Recreation,PARKS & RECREATION DIRECTOR,2017-07-05,150000.0,Male,White
4413,Police,POLICE CHIEF,2016-11-30,280000.0,Male,Hispanic
20244,Solid Waste Management,SOLID WASTE DIRECTOR,2001-05-14,195000.0,Male,Black
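
The two-step pattern above (find the index label of each group's maximum, then pass those labels to `loc`) can be sketched on a small made-up frame. The data and the custom index labels below are hypothetical and chosen only to show that `idxmax` returns index labels rather than integer positions.

```python
import pandas as pd

# Hypothetical toy data with non-default index labels
toy = pd.DataFrame({'dept': ['A', 'A', 'B', 'B'],
                    'salary': [1, 5, 7, 3]},
                   index=[10, 11, 12, 13])

# Step 1: index label of the maximum salary within each department
idx = toy.groupby('dept')['salary'].idxmax()

# Step 2: use those labels to select the full rows
top = toy.loc[idx]
print(top)
```

If several rows tie for the maximum, `idxmax` returns the label of the first one it encounters, so only one row per group is selected.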


### Use the NYC deaths dataset for the remaining exercises

Execute the cell below to read in the NYC deaths dataset and use it to answer the following exercises.

In [10]:
deaths = pd.read_csv('../data/nyc_deaths.csv')
deaths.head(3)

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71


### Exercise 7

<span style="color:green; font-size:16px">What year had the most deaths?</span>

In [11]:
year_deaths = deaths.groupby('year').agg(total=('deaths', 'sum'))
year_deaths

Unnamed: 0_level_0,total
year,Unnamed: 1_level_1
2007,53996
2008,54138
2009,52820
2010,52505
2011,52726
2012,52420
2013,53387
2014,53006


In [12]:
year_deaths.agg(['max', 'idxmax'])

Unnamed: 0,total
max,54138
idxmax,2008


### Exercise 8

<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [13]:
deaths.groupby('race').agg(total=('deaths', 'sum')).sort_values('total', ascending=False)

Unnamed: 0_level_0,total
race,Unnamed: 1_level_1
White,206487
Black,111116
Hispanic,74802
Asian,26355
Unknown,6238


### Exercise 9

<span  style="color:green; font-size:16px">Find the total number of deaths by cause and then select the five highest causes.</span>

In [14]:
deaths.groupby('cause').agg(total=('deaths', 'sum')).nlargest(5, 'total')

Unnamed: 0_level_0,total
cause,Unnamed: 1_level_1
Heart Disease,147551
Cancer,106367
Other,77999
Flu and Pneumonia,18678
Diabetes,13794


## 2. Grouping and Aggregating with Multiple Columns

Execute the following cell to read in the City of Houston employee data and use it for the first few exercises.

In [15]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### Exercise 1

<span  style="color:green; font-size:16px">For each department and sex, find the number of unique position titles, the total number of employees, and the average salary. Make sure there is no multi-level index.</span>

In [16]:
data = emp.groupby(['dept', 'sex']).agg(num_unique_titles=('title', 'nunique'),
                                        num_employees=('title', 'size'),
                                        avg_salary=('salary', 'mean')).reset_index()
data.head(10)

Unnamed: 0,dept,sex,num_unique_titles,num_employees,avg_salary
0,Fire,Female,51,240,62212.63725
1,Fire,Male,54,4136,60479.306862
2,Health & Human Services,Female,136,987,53838.31078
3,Health & Human Services,Male,110,366,59230.425956
4,Houston Airport System,Female,85,443,51099.300226
5,Houston Airport System,Male,113,773,57278.306598
6,Houston Public Works,Female,151,1195,51294.453004
7,Houston Public Works,Male,180,2995,51490.113309
8,Library,Female,55,404,41126.962921
9,Library,Male,44,159,44399.943396


### Exercise 2

<span  style="color:green; font-size:16px">For each department, race, and sex, find the minimum and maximum salaries.</span>

In [17]:
emp.groupby(['dept','race', 'sex']).agg(min_salary=('salary', 'min'),
                                          max_salary=('salary', 'max')).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,min_salary,max_salary
dept,race,sex,Unnamed: 3_level_1,Unnamed: 4_level_1
Fire,Asian,Female,39104.0,342784.0
Fire,Asian,Male,28024.0,342784.0
Fire,Black,Female,16411.0,342784.0
Fire,Black,Male,28024.0,342784.0
Fire,Hispanic,Female,28024.0,89590.02
Fire,Hispanic,Male,26000.0,342784.0
Fire,Native American,Female,48189.7,70181.28
Fire,Native American,Male,28024.0,115835.98
Fire,White,Female,16910.0,342784.0
Fire,White,Male,16515.0,342784.0


Execute the following cell to read in the college dataset and use it for the remaining exercises.

In [18]:
pd.set_option('display.max_columns', 100)
college = pd.read_csv('../data/college.csv')
college.head(3)

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0


### Exercise 3

<span  style="color:green; font-size:16px">Which city name appears the most frequently? Answer this in two different ways: once with and once without the `groupby` method.</span>

In [19]:
size = college.groupby('city').agg(size=('stabbr', 'size'))
size.head()

Unnamed: 0_level_0,size
city,Unnamed: 1_level_1
Aberdeen,3
Abilene,5
Abingdon,2
Abington,1
Ada,3


In [20]:
size.sort_values('size', ascending=False).head()

Unnamed: 0_level_0,size
city,Unnamed: 1_level_1
New York,87
Chicago,78
Houston,72
Los Angeles,56
Miami,51


You can also call `idxmax` directly on the result.

In [21]:
college.groupby('city').agg(size=('stabbr', 'size')).idxmax()

size    New York
dtype: object

### Without groupby

Use `value_counts`

In [22]:
college['city'].value_counts().head()

New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
Name: city, dtype: int64

### Exercise 4

<span style="color:green; font-size:16px">Does the city 'Houston' only appear in the state of Texas (abbreviated 'TX')?</span>

NO! It also appears in Missouri.

In [23]:
filt = college['city'] == 'Houston'
college.loc[filt, 'stabbr'].unique()

array(['TX', 'MO'], dtype=object)

You can also see the exact counts.

In [24]:
college.loc[filt, 'stabbr'].value_counts()

TX    71
MO     1
Name: stabbr, dtype: int64

You can also use a groupby to find the number of unique states for each city. This is less efficient because it computes the count for every city when only one is needed.

In [25]:
city_unique_state = college.groupby('city').agg(num_unique_states=('stabbr', 'nunique'))
city_unique_state.head()

Unnamed: 0_level_0,num_unique_states
city,Unnamed: 1_level_1
Aberdeen,2
Abilene,1
Abingdon,1
Abington,1
Ada,2


In [26]:
city_unique_state.loc['Houston']

num_unique_states    2
Name: Houston, dtype: int64

You can also use `drop_duplicates`.

In [27]:
college[['city', 'stabbr']].query('city == "Houston"') \
                           .drop_duplicates(subset='stabbr')

Unnamed: 0,city,stabbr
3617,Houston,TX
5366,Houston,MO


### Exercise 5

<span style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [28]:
college.groupby('stabbr').agg(max_ugds=('ugds', 'max')).head(10)

Unnamed: 0_level_0,max_ugds
stabbr,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0
CA,44744.0
CO,25873.0
CT,18016.0
DC,10433.0
DE,18222.0


### Exercise 6

<span style="color:green; font-size:16px">Find the largest college from each state. From those colleges, find the difference between the largest and smallest.</span>

In [29]:
largest_per_state = college.groupby('stabbr').agg(max_ugds=('ugds', 'max'))
largest_per_state.max() - largest_per_state.min()

max_ugds    150956.0
dtype: float64

### Exercise 7

<span style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

Find the index of the largest college in each state first.

In [30]:
ugds_idx = college.groupby('stabbr').agg(idx=('ugds', 'idxmax'))
ugds_idx.head()

Unnamed: 0_level_0,idx
stabbr,Unnamed: 1_level_1
AK,60
AL,5
AR,137
AS,4138
AZ,7116


In [31]:
idx = ugds_idx['idx']
idx.head()

stabbr
AK      60
AL       5
AR     137
AS    4138
AZ    7116
Name: idx, dtype: int64

Use the index to select the desired rows.

In [32]:
college.loc[idx, ['stabbr', 'instnm', 'ugds']].head(10)

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
1299,CA,Ashford University,44744.0
574,CO,University of Colorado Boulder,25873.0
641,CT,University of Connecticut,18016.0
701,DC,George Washington University,10433.0
691,DE,University of Delaware,18222.0


Second method - Set the index to `instnm` first so that `idxmax` returns the college name directly.

In [33]:
c2 = college.set_index('instnm')
max_indexes = c2.groupby('stabbr').agg(max_ugds_college=('ugds', 'idxmax'),
                                       max_ugds=('ugds', 'max'))
max_indexes.head()

Unnamed: 0_level_0,max_ugds_college,max_ugds
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,University of Alaska Anchorage,12865.0
AL,The University of Alabama,29851.0
AR,University of Arkansas,21405.0
AS,American Samoa Community College,1276.0
AZ,University of Phoenix-Arizona,151558.0


Third method - Sort the data first, then use the `first` groupby method to return the first row of each group after sorting.

In [34]:
college.sort_values('ugds', ascending=False).groupby('stabbr') \
        .agg(max_ugds_college=('instnm', 'first'), 
             max_ugds=('ugds', 'first')).head()

Unnamed: 0_level_0,max_ugds_college,max_ugds
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,University of Alaska Anchorage,12865.0
AL,The University of Alabama,29851.0
AR,University of Arkansas,21405.0
AS,American Samoa Community College,1276.0
AZ,University of Phoenix-Arizona,151558.0


Fourth method - Sort and drop duplicates, as done previously without grouping.

In [35]:
college.sort_values(['stabbr', 'ugds'], ascending=[True, False]) \
       .drop_duplicates(subset='stabbr')[['stabbr', 'instnm', 'ugds']] \
       .head()

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
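
As a quick sanity check that these approaches agree, the sketch below runs three of them on a small made-up frame. The college names and populations are hypothetical, and the variable names `by_idxmax`, `by_first`, and `by_dedup` are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data: two states, five colleges
toy = pd.DataFrame({
    'stabbr': ['AK', 'AK', 'AL', 'AL', 'AL'],
    'instnm': ['College A', 'College B', 'College C', 'College D', 'College E'],
    'ugds': [100.0, 300.0, 50.0, 80.0, 20.0],
})
cols = ['stabbr', 'instnm', 'ugds']

# Find the index label of the maximum, then select those rows
idx = toy.groupby('stabbr')['ugds'].idxmax()
by_idxmax = toy.loc[idx, cols].reset_index(drop=True)

# Sort descending, then take the first row of each group
by_first = (toy.sort_values('ugds', ascending=False)
               .groupby('stabbr')
               .agg(instnm=('instnm', 'first'), ugds=('ugds', 'first'))
               .reset_index())

# Sort by state and population, then keep the first row per state
by_dedup = (toy.sort_values(['stabbr', 'ugds'], ascending=[True, False])
               .drop_duplicates(subset='stabbr')[cols]
               .reset_index(drop=True))

print(by_idxmax)
```

All three results list the same college for each state. On real data with ties or missing values the methods may break ties differently, so it is worth checking which behavior you want.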


### Exercise 8

<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

In [36]:
# They have more
college.groupby('distanceonly').agg(mean_ugds=('ugds', 'mean'))

Unnamed: 0_level_0,mean_ugds
distanceonly,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Exercise 9

<span style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [37]:
# Less
college.groupby('distanceonly').agg(mean_relaffil=('relaffil', 'mean'))

Unnamed: 0_level_0,mean_relaffil
distanceonly,Unnamed: 1_level_1
0.0,0.149635
1.0,0.05


### Exercise 10

<span  style="color:green; font-size:16px">What state has the lowest percentage of currently operating schools of those that have religious affiliation?</span>

In [38]:
rel_oper_mean = college.query('relaffil == 1') \
                       .groupby('stabbr').agg(mean_curroper=('curroper', 'mean')) \
                       .round(2)
rel_oper_mean.head()

Unnamed: 0_level_0,mean_curroper
stabbr,Unnamed: 1_level_1
AK,1.0
AL,0.92
AR,0.94
AZ,0.44
CA,0.59


In [39]:
rel_oper_mean.sort_values('mean_curroper').head()

Unnamed: 0_level_0,mean_curroper
stabbr,Unnamed: 1_level_1
UT,0.4
AZ,0.44
NV,0.5
CA,0.59
CT,0.65


### Exercise 11

<span  style="color:green; font-size:16px">Find the five historically black colleges with the highest undergraduate white percentage (`ugds_white`).</span>

In [40]:
filt = college['hbcu'] == 1
cols = ['instnm', 'ugds_white']
college.loc[filt, cols].sort_values('ugds_white', ascending=False).head()

Unnamed: 0,instnm,ugds_white
4021,Bluefield State College,0.8437
17,Gadsden State Community College,0.6921
4050,West Virginia State University,0.5816
48,Shelton State Community College,0.5613
55,H Councill Trenholm State Community College,0.3951


## 3. Grouping with Pivot Tables

In [41]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.insert(1, 'day_of_week', flights['date'].dt.day_name())
flights.insert(2, 'month', flights['date'].dt.month_name())
flights.head(3)

Unnamed: 0,date,day_of_week,month,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,Monday,January,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,Monday,January,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,Monday,January,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0


In [42]:
flights.shape

(65923, 16)

### Exercise 1

<span style="color:green; font-size:16px">What is the average carrier delay for each day of the week for each airline? Highlight the worst day of the week for each airline.</span>

In [43]:
avg_delay = flights.pivot_table(index='airline', columns='day_of_week', 
                                values='carrier_delay').round(1)
avg_delay.style.highlight_max(axis='columns')

day_of_week,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
9E,3.4,4.7,1.8,3.3,7.7,3.0,4.2
AA,3.9,3.7,3.8,3.8,4.9,4.1,2.8
AS,3.0,1.5,3.6,3.7,2.9,2.6,2.1
B6,7.0,4.0,5.5,5.1,5.1,6.2,4.0
DL,3.5,2.6,3.0,3.6,3.9,3.3,3.3
EV,5.2,8.5,2.4,7.2,0.0,6.0,1.7
F9,7.9,3.6,9.5,4.1,7.8,6.6,1.8
MQ,2.1,4.2,0.0,5.4,4.2,1.5,2.3
NK,2.2,1.5,5.8,1.7,2.2,2.5,1.2
OH,6.3,4.9,2.5,9.2,19.9,0.5,1.5


You can highlight min and max by chaining style methods.

In [44]:
avg_delay.style.highlight_max(axis='columns') \
         .highlight_min(axis='columns', color='lightblue')

day_of_week,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
9E,3.4,4.7,1.8,3.3,7.7,3.0,4.2
AA,3.9,3.7,3.8,3.8,4.9,4.1,2.8
AS,3.0,1.5,3.6,3.7,2.9,2.6,2.1
B6,7.0,4.0,5.5,5.1,5.1,6.2,4.0
DL,3.5,2.6,3.0,3.6,3.9,3.3,3.3
EV,5.2,8.5,2.4,7.2,0.0,6.0,1.7
F9,7.9,3.6,9.5,4.1,7.8,6.6,1.8
MQ,2.1,4.2,0.0,5.4,4.2,1.5,2.3
NK,2.2,1.5,5.8,1.7,2.2,2.5,1.2
OH,6.3,4.9,2.5,9.2,19.9,0.5,1.5


### Exercise 2

<span style="color:green; font-size:16px">Use a pivot table to find the total number of cancelled flights for each origin airport and airline. Place the airlines in the columns. Use the result to find the origin airport with the most cancelled flights for each airline. Also return this maximum number of cancelled flights.</span>

In [45]:
airline_cancel = flights.pivot_table(index='origin', columns='airline', 
                                     values='cancelled', aggfunc='sum', fill_value=0)
airline_cancel.head(10)

airline,9E,AA,AS,B6,DL,EV,F9,MQ,NK,OH,OO,UA,VX,WN,YV,YX
origin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
ATL,0,5,0,2,19,0,0,0,5,1,1,2,0,8,4,9
BOS,5,41,1,31,9,0,0,1,5,0,0,18,0,2,0,12
CLT,6,33,0,2,1,2,1,0,0,4,0,0,0,0,0,11
DCA,1,27,0,3,3,5,1,0,0,2,1,2,0,6,0,31
DEN,0,3,1,0,0,0,3,0,1,0,0,10,0,9,0,0
DFW,1,33,0,0,1,0,0,1,3,0,1,1,0,0,1,7
DTW,1,4,0,3,8,0,0,4,1,2,4,0,0,1,1,8
EWR,2,10,6,8,0,0,0,3,1,0,2,27,1,1,0,15
IAH,0,7,0,0,1,0,0,1,4,0,3,7,0,0,4,4
JFK,10,6,3,17,3,0,0,0,0,0,0,0,1,0,0,4


In [46]:
airline_cancel.agg(['max', 'idxmax'])

airline,9E,AA,AS,B6,DL,EV,F9,MQ,NK,OH,OO,UA,VX,WN,YV,YX
max,10,41,9,31,19,5,3,5,5,4,14,27,6,18,4,31
idxmax,JFK,BOS,SEA,BOS,ATL,DCA,DEN,LGA,ATL,CLT,ORD,EWR,LAX,LAX,ATL,DCA


### Exercise 3

<span style="color:green; font-size:16px">Find the total distance flown for each airline for each month. Highlight the month with the most miles flown and use the style `format` method to put commas in the numbers so that they are easier to read.</span>

In [47]:
total_dist = flights.pivot_table(index='airline', columns='month', 
                                 values='distance', aggfunc='sum')
total_dist.style.format('{:,.0f}').highlight_max(axis='columns')

month,April,August,December,February,January,July,June,March,May,November,October,September
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
9E,54592.0,62216.0,46032.0,51784,47230,53868.0,50421.0,61460,42423.0,42275.0,48106.0,45745.0
AA,1586655.0,1649436.0,1444276.0,1371620,1473883,1669007.0,1619325.0,1528361,1545453.0,1409540.0,1588285.0,1482841.0
AS,454146.0,451512.0,399787.0,201275,195553,455061.0,496358.0,199288,495090.0,391304.0,409479.0,429045.0
B6,352234.0,404458.0,427097.0,348189,385517,478230.0,443151.0,382666,410877.0,384038.0,425712.0,384008.0
DL,1265266.0,1315865.0,1160997.0,997216,1017440,1396697.0,1292928.0,1215516,1253361.0,1100681.0,1214950.0,1173359.0
EV,6847.0,1194.0,3933.0,11854,10186,927.0,5926.0,4511,3569.0,1592.0,2587.0,995.0
F9,117439.0,97777.0,97846.0,97879,118067,84417.0,116116.0,80444,78807.0,110423.0,105833.0,89938.0
MQ,13060.0,15787.0,14057.0,17539,15170,20057.0,15310.0,13349,13656.0,15559.0,16884.0,14767.0
NK,250683.0,270894.0,232613.0,219678,249461,273963.0,318648.0,228829,261421.0,266838.0,253692.0,235754.0
OH,8280.0,7986.0,8596.0,9911,14802,6461.0,5808.0,14664,5296.0,4948.0,6028.0,7674.0


### Exercise 4

<span style="color:green; font-size:16px">Create a pivot table that shows the number of flights flown for every day of the week for every month.</span>

In [48]:
flights.pivot_table(index='month', columns='day_of_week', aggfunc='size')

day_of_week,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
April,786,943,644,898,818,755,754
August,1006,785,592,776,982,757,963
December,707,840,750,897,759,634,695
February,748,725,544,639,753,716,719
January,673,862,536,696,739,821,838
July,808,936,677,932,765,1012,780
June,1005,792,817,809,842,779,822
March,974,734,705,687,884,676,751
May,766,798,597,726,1058,913,894
November,887,710,624,737,871,707,709


### Exercise 5

<span style="color:green; font-size:16px">In exercise 4, the months and days of the week are ordered alphabetically. It would be better if these values were ordered chronologically. Can you return a result that has both groups in the correct order? Use Monday as the first day of the week.</span>

First convert both columns to ordered categoricals, overwriting the original columns.

In [49]:
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
           'August', 'September', 'October', 'November', 'December']
days =  ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday',
         'Sunday']
month_dtype = pd.CategoricalDtype(months, ordered=True)
day_dtype = pd.CategoricalDtype(days, ordered=True)
flights = flights.astype({'month': month_dtype, 'day_of_week': day_dtype})

Call the same pivot table and the index and columns will be automatically sorted by their category order.

In [50]:
flights.pivot_table(index='month', columns='day_of_week', aggfunc='size')

day_of_week,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
January,862,821,838,739,673,536,696
February,725,716,719,753,748,544,639
March,734,676,751,884,974,705,687
April,943,755,754,818,786,644,898
May,798,913,894,1058,766,597,726
June,792,779,822,842,1005,817,809
July,936,1012,780,765,808,677,932
August,785,757,963,982,1006,592,776
September,742,719,762,761,739,742,902
October,961,939,894,790,773,554,711
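
The effect of the ordered categorical conversion can be seen in isolation with a minimal sketch. The three-day series below is hypothetical. It compares plain string sorting with sorting after the values are cast to a `CategoricalDtype` whose categories are listed in the desired order.

```python
import pandas as pd

# Hypothetical toy data with three day names
s = pd.Series(['Sunday', 'Monday', 'Friday', 'Monday'])

# Plain strings sort alphabetically
alpha_order = sorted(s.unique())

# Ordered categoricals sort by the order given in the categories list
day_dtype = pd.CategoricalDtype(['Monday', 'Friday', 'Sunday'], ordered=True)
cat = s.astype(day_dtype)
chrono_order = cat.sort_values().drop_duplicates().tolist()

print(alpha_order)
print(chrono_order)
```

Grouping operations such as `groupby` and `pivot_table` use the same category order when they sort their keys, which is why the table above comes out chronologically.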


### Exercise 6

<span style="color:green; font-size:16px">Create a new column in the flights dataset called `'dep_time_hour'` and set it equal to the hour (this will be an integer 0 through 23) of the flight. Find the average carrier delay for every month and `dep_time_hour`. Place the month in the columns.</span>

In [51]:
flights['dep_time_hour'] = flights['dep_time'] // 100

In [52]:
flights.pivot_table(index='dep_time_hour', columns='day_of_week', 
                    values='carrier_delay', aggfunc='mean').round(1)

day_of_week,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
dep_time_hour,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0.2,3.6,1.2,2.4,6.5,1.7,2.4
1,0.0,0.0,1.1,0.7,0.0,1.7,2.3
2,,,0.0,,,0.0,0.0
4,,,,,0.0,,
5,5.9,1.5,1.8,2.4,4.4,3.0,1.8
6,2.5,2.8,2.6,2.8,5.2,4.0,3.5
7,2.0,3.5,1.8,3.7,3.8,4.1,1.8
8,3.6,3.1,7.2,4.3,2.3,4.0,3.7
9,3.1,2.8,2.5,3.8,3.3,3.4,3.0
10,2.8,2.9,2.1,3.4,3.3,3.2,2.1


### Exercise 7

<span style="color:green; font-size:16px">Use both `groupby` and `pivot_table` to compute the average and median distance flown by day of the week.</span>

In [53]:
flights.groupby('day_of_week').agg(median_dist=('distance', 'median'),
                                   mean_dist=('distance', 'mean')) \
       .style.format('{:,.0f}')

Unnamed: 0_level_0,median_dist,mean_dist
day_of_week,Unnamed: 1_level_1,Unnamed: 2_level_1
Monday,912,1071
Tuesday,888,1052
Wednesday,868,1053
Thursday,907,1066
Friday,868,1051
Saturday,937,1107
Sunday,925,1093


In [54]:
flights.pivot_table(index='day_of_week', values='distance', 
                    aggfunc=['median', 'mean']).style.format('{:,.0f}')

Unnamed: 0_level_0,median,mean
Unnamed: 0_level_1,distance,distance
day_of_week,Unnamed: 1_level_2,Unnamed: 2_level_2
Monday,912,1071
Tuesday,888,1052
Wednesday,868,1053
Thursday,907,1066
Friday,868,1051
Saturday,937,1107
Sunday,925,1093
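
The two cells above differ mainly in the shape of the column labels. The following sketch, on hypothetical toy distances, shows that `groupby` with named aggregation returns flat column names while `pivot_table` with a list of aggregation functions returns a two-level column index holding the same numbers.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'day_of_week': ['Monday', 'Monday', 'Monday', 'Friday', 'Friday'],
    'distance': [100.0, 200.0, 600.0, 400.0, 800.0],
})

# Named aggregation gives flat, user-chosen column names
g = toy.groupby('day_of_week').agg(median_dist=('distance', 'median'),
                                   mean_dist=('distance', 'mean'))

# A list of functions gives columns labeled (function, column)
p = toy.pivot_table(index='day_of_week', values='distance',
                    aggfunc=['median', 'mean'])

print(g)
print(p)
```

Selecting `p[('mean', 'distance')]` returns the same values as `g['mean_dist']`.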


## 4. Counting with Crosstabs

In [55]:
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_colwidth = 200
mh = pd.read_csv('../data/mental_health.csv')
mh.head(3)

Unnamed: 0,year,age,gender,country,family_history,treatment,work_interfere,no_employees,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor
0,2014,37,Female,United States,No,Yes,Often,6-25,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes
1,2014,44,Male,United States,No,No,Rarely,More than 1000,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No
2,2014,32,Male,Canada,No,No,Rarely,6-25,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes


### Exercise 1
<span  style="color:green; font-size:16px">Do people with a family history of mental illness seek treatment more often than those who do not?</span>

In [56]:
pd.crosstab(index=mh['family_history'], columns=mh['treatment'])

treatment,No,Yes
family_history,Unnamed: 1_level_1,Unnamed: 2_level_1
No,414,241
Yes,111,325


In [57]:
pd.crosstab(index=mh['family_history'], columns=mh['treatment'], normalize='index').round(2)

treatment,No,Yes
family_history,Unnamed: 1_level_1,Unnamed: 2_level_1
No,0.63,0.37
Yes,0.25,0.75


Yes, there is a large difference: 75% of people with a family history seek treatment versus 37% of those without one.
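
To see what `normalize='index'` computes, the sketch below reproduces it by hand on a small made-up survey (the eight responses are hypothetical). Each count is divided by its row total, so every row sums to 1.

```python
import pandas as pd

# Hypothetical toy survey responses
toy = pd.DataFrame({
    'family_history': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'treatment':      ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes'],
})

# Raw counts and row-normalized shares
counts = pd.crosstab(index=toy['family_history'], columns=toy['treatment'])
row_share = pd.crosstab(index=toy['family_history'], columns=toy['treatment'],
                        normalize='index')

# The same shares computed manually: divide each row by its total
manual = counts.div(counts.sum(axis=1), axis=0)

print(row_share)
```

In this toy data, three of the four respondents with a family history sought treatment, so the share in the `Yes`/`Yes` cell is 0.75.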

### Exercise 2
<span  style="color:green; font-size:16px">Find the total number and ratio of employees that seek treatment for companies that provide health benefits vs those that do not.</span>

In [58]:
pd.crosstab(index=mh['benefits'], columns=mh['treatment'])

treatment,No,Yes
benefits,Unnamed: 1_level_1,Unnamed: 2_level_1
Don't know,225,134
No,142,150
Yes,158,282


In [59]:
pd.crosstab(index=mh['benefits'], columns=mh['treatment'], normalize='index').round(2)

treatment,No,Yes
benefits,Unnamed: 1_level_1,Unnamed: 2_level_1
Don't know,0.63,0.37
No,0.49,0.51
Yes,0.36,0.64


### Exercise 3
<span  style="color:green; font-size:16px">You can provide a list of multiple columns to both the `index` and `columns` parameters of the `crosstab` function. Put country and number of employees in the index and benefits and treatment in the columns. It's probably easier to make separate list variables first.</span>

In [60]:
index = [mh['country'], mh['no_employees']]
columns = [mh['benefits'], mh['treatment']]
pd.crosstab(index=index, columns=columns)

Unnamed: 0_level_0,benefits,Don't know,Don't know,No,No,Yes,Yes
Unnamed: 0_level_1,treatment,No,Yes,No,Yes,No,Yes
country,no_employees,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Australia,1-5,1,0,1,1,0,0
Australia,100-500,1,0,1,2,0,2
Australia,26-100,0,0,1,3,0,0
Australia,500-1000,1,0,0,0,0,0
Australia,6-25,0,1,0,3,0,0
Australia,More than 1000,1,0,0,0,1,1
Canada,1-5,1,0,5,5,0,0
Canada,100-500,2,3,0,0,1,3
Canada,26-100,4,4,2,1,3,3
Canada,500-1000,0,0,0,0,0,1


In [61]:
import pandas as pd


### Exercise 4

<span style="color:green; font-size:16px">Read in the bikes dataset and find the distribution of total trip duration by gender and events. Normalize over all groups. You should be able to answer the question, "From the total of all trip durations, what percent were done by males on a clear day?".</span>

In [62]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


In [63]:
pd.crosstab(index=bikes['events'], columns=bikes['gender'], 
            values=bikes['tripduration'], aggfunc='sum', 
            normalize=True, margins=True).round(4) * 100

gender,Female,Male,All
events,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
clear,1.64,4.39,6.03
cloudy,6.09,17.12,23.21
fog,0.06,0.14,0.19
hazy,0.19,0.48,0.67
mostlycloudy,8.67,22.29,30.97
partlycloudy,10.39,23.95,34.34
rain,0.73,2.5,3.23
sleet,0.01,0.02,0.02
snow,0.11,0.66,0.77
tstorms,0.16,0.41,0.56
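
In the table above, males on clear days account for 4.39% of all trip duration. The mechanics of `normalize=True` combined with `values` and `aggfunc='sum'` can be sketched on a small made-up frame (the five trips below are hypothetical): each cell holds that group's summed duration divided by the grand total.

```python
import pandas as pd

# Hypothetical toy trips; durations sum to 1000
toy = pd.DataFrame({
    'events': ['clear', 'clear', 'clear', 'rain', 'rain'],
    'gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'tripduration': [300, 100, 200, 300, 100],
})

# Share of total trip duration for each events/gender combination
share = pd.crosstab(index=toy['events'], columns=toy['gender'],
                    values=toy['tripduration'], aggfunc='sum',
                    normalize=True)
print(share)
```

Males on clear days contribute 400 of the 1000 total seconds here, so their cell is 0.4, and all cells together sum to 1.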


## 5. Alternate Groupby Syntax

Execute the cell below to read in the flights dataset and then use it for the following exercises.

In [64]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0


### Exercise 1

<span style="color:green; font-size:16px">Use a dictionary in the `groupby` `agg` method to calculate the mean, median, min, and max of the air time for every airline.</span>

In [65]:
flights.groupby('airline').agg({'air_time': ['mean', 'median', 'min', 'max']})

Unnamed: 0_level_0,air_time,air_time,air_time,air_time
Unnamed: 0_level_1,mean,median,min,max
airline,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
9E,89.901705,84.0,24.0,224.0
AA,147.940078,126.0,22.0,421.0
AS,185.414344,149.0,38.0,395.0
B6,170.061845,132.0,32.0,428.0
DL,146.034775,125.0,22.0,405.0
EV,57.440252,45.0,35.0,178.0
F9,139.187223,124.0,55.0,327.0
MQ,80.21813,83.0,20.0,164.0
NK,147.173162,133.0,40.0,388.0
OH,66.418699,70.5,28.0,142.0
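
The dictionary syntax above returns columns with two levels (column name, then aggregation name), whereas the named-aggregation syntax used earlier returns flat, user-chosen column names. A small sketch of the difference, using a hypothetical `toy` frame rather than the flights data:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'airline': ['AA', 'AA', 'UA', 'UA'],
    'air_time': [100.0, 200.0, 50.0, 70.0],
})

# Dictionary syntax: result has MultiIndex columns
multi = toy.groupby('airline').agg({'air_time': ['mean', 'max']})

# Named aggregation: result has flat column names chosen by the caller
flat = toy.groupby('airline').agg(mean_air=('air_time', 'mean'),
                                  max_air=('air_time', 'max'))
```

The values are identical; only the column labels differ, so `multi[('air_time', 'mean')]` and `flat['mean_air']` select the same numbers.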


### Exercise 2

<span style="color:green; font-size:16px">Without using the `agg` method, calculate the number of unique destinations for each airline.</span>

In [66]:
flights.groupby('airline')['dest'].nunique()

airline
9E    13
AA    20
AS    18
B6    19
DL    20
EV     8
F9    17
MQ    12
NK    16
OH     9
OO    19
UA    19
VX    12
WN    15
YV    10
YX    14
Name: dest, dtype: int64

### Exercise 3

<span style="color:green; font-size:16px">Calculate the mean of every numeric column for each airline and origin without using the `agg` method.</span>

In [67]:
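# numeric_only=True keeps only numeric columns; recent pandas versions raise on string columns otherwise
flights.groupby(['airline', 'origin']).mean(numeric_only=True).head()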

Unnamed: 0_level_0,Unnamed: 1_level_0,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
airline,origin,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
9E,ATL,729.714286,840.142857,0.0,105.142857,689.0,0.0,0.0,3.857143,0.0,0.0
9E,BOS,1320.029412,1453.970588,0.04902,47.103093,191.558824,2.284314,0.029412,8.607843,0.0,4.176471
9E,CLT,1254.613636,1461.238636,0.068182,82.926829,554.397727,5.261364,0.0,4.022727,0.0,4.295455
9E,DCA,1168.218182,1309.563636,0.018182,44.62963,216.490909,10.945455,0.0,1.309091,0.0,2.163636
9E,DFW,1346.068182,1682.136364,0.011364,137.08046,1054.022727,2.840909,0.738636,3.772727,0.0,11.875


## 6. Custom Aggregation

Execute the cell below to read in the flights dataset and then use it for the following exercises.

In [68]:
import pandas as pd
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0


### Exercise 1

<span style="color:green; font-size:16px">Which three airlines have the fewest flights?</span>

In [69]:
flights['airline'].value_counts().tail(3)

MQ    373
OH    257
EV    171
Name: airline, dtype: int64

### Exercise 2

<span style="color:green; font-size:16px">For each airline, find the 75th percentile of flight distance. Use a custom aggregation function.</span>

In [70]:
def per_75(s):
    return s.quantile(.75)

In [71]:
flights.groupby('airline').agg(dist_75=('distance', per_75))

Unnamed: 0_level_0,dist_75
airline,Unnamed: 1_level_1
9E,852.0
AA,1558.0
AS,2402.0
B6,2381.0
DL,1587.0
EV,514.5
F9,1476.0
MQ,612.0
NK,1379.0
OH,500.0
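
The exercise asks for a custom function, but the same percentile is also available through the built-in groupby `quantile` method. A minimal sketch comparing the two on a hypothetical `toy` frame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'airline': ['AA'] * 4 + ['UA'] * 2,
    'distance': [100.0, 200.0, 300.0, 400.0, 10.0, 20.0],
})

def per_75(s):
    # 75th percentile of one group's values
    return s.quantile(.75)

# Custom aggregation function, as in the solution above
custom = toy.groupby('airline').agg(dist_75=('distance', per_75))['dist_75']

# Built-in groupby method, no custom function
builtin = toy.groupby('airline')['distance'].quantile(.75)
```

Both Series contain the same values per airline; the built-in method typically avoids the per-group Python call.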


### Exercise 3

<span style="color:green; font-size:16px">For each airline, find out what percentage of its flights leave on a Tuesday. Use a custom aggregation function.</span>

In [72]:
def tuesday_pct(s):
    return (s.dt.day_name() == 'Tuesday').mean()

flights.groupby('airline').agg(percent_tuesday=('date', tuesday_pct)).round(3) * 100

Unnamed: 0_level_0,percent_tuesday
airline,Unnamed: 1_level_1
9E,14.5
AA,14.6
AS,13.8
B6,13.5
DL,14.4
EV,15.8
F9,12.9
MQ,16.1
NK,12.8
OH,16.7


### Exercise 4

<span style="color:green; font-size:16px">Optimize exercise 3 without using a custom aggregation. What is the performance difference?</span>

In [73]:
flights['airline_cat'] = flights['airline'].astype('category')
flights['is_tuesday'] = flights['date'].dt.day_name() == 'Tuesday'
flights.groupby('airline_cat')['is_tuesday'].mean().round(3) * 100

airline_cat
9E    14.5
AA    14.6
AS    13.8
B6    13.5
DL    14.4
EV    15.8
F9    12.9
MQ    16.1
NK    12.8
OH    16.7
OO    14.3
UA    14.0
VX    13.3
WN    15.3
YV    13.9
YX    15.0
Name: is_tuesday, dtype: float64

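In the timings below, skipping the custom function takes about 9.5 ms instead of 14.6 ms, roughly 1.5 times faster (about a 35% shorter run time), even though the second timing includes creating the `is_tuesday` column.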

In [74]:
%timeit -r 1 -n 5 flights.groupby('airline').agg(percent_tuesday=('date', tuesday_pct))

14.6 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)


In [75]:
%%timeit -r 1 -n 5
flights['is_tuesday'] = flights['date'].dt.day_name() == 'Tuesday'
flights.groupby('airline_cat')['is_tuesday'].mean().round(3) * 100

9.51 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 5 loops each)


### Exercise 5

<span style="color:green; font-size:16px">The range of salaries per department was calculated using the `min_max` custom function from the beginning of this chapter. Use this same function to calculate the range of distance for each airline. Then calculate this range again without a custom function.</span>

In [76]:
def min_max(s):
    return s.max() - s.min()

In [77]:
flights.groupby('airline').agg(dist_range=('distance', min_max))

Unnamed: 0_level_0,dist_range
airline,Unnamed: 1_level_1
9E,1297.0
AA,2515.0
AS,2468.0
B6,2520.0
DL,2610.0
EV,876.0
F9,2042.0
MQ,831.0
NK,2166.0
OH,835.0


In [78]:
d_max = flights.groupby('airline')['distance'].max()
d_min = flights.groupby('airline')['distance'].min()
d_max - d_min

airline
9E    1297.0
AA    2515.0
AS    2468.0
B6    2520.0
DL    2610.0
EV     876.0
F9    2042.0
MQ     831.0
NK    2166.0
OH     835.0
OO    1405.0
UA    2505.0
VX    2468.0
WN    1940.0
YV    1192.0
YX    1320.0
Name: distance, dtype: float64

Alternatively, create an entire DataFrame.

In [79]:
dist_min_max = flights.groupby('airline').agg(max_dist=('distance', 'max'),
                                              min_dist=('distance', 'min'))
dist_min_max['dist range'] = dist_min_max['max_dist'] - dist_min_max['min_dist']
dist_min_max

Unnamed: 0_level_0,max_dist,min_dist,dist range
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
9E,1391.0,94.0,1297.0
AA,2611.0,96.0,2515.0
AS,2704.0,236.0,2468.0
B6,2704.0,184.0,2520.0
DL,2704.0,94.0,2610.0
EV,1075.0,199.0,876.0
F9,2446.0,404.0,2042.0
MQ,925.0,94.0,831.0
NK,2402.0,236.0,2166.0
OH,931.0,96.0,835.0


### Exercise 6

<span style="color:green; font-size:16px">Which origin airport has the highest percentage of its flights cancelled?</span>

In [80]:
# no custom aggregation function needed
flights.groupby('origin').agg(pct_cancelled=('cancelled', 'mean')) \
       .nlargest(1, 'pct_cancelled').round(3) * 100

Unnamed: 0_level_0,pct_cancelled
origin,Unnamed: 1_level_1
BOS,3.4


### Use the college dataset

Execute the following cell which reads in a few columns from the college dataset, sets the institution name as the index and converts 'stabbr' and 'relaffil' to categorical.

In [81]:
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, 
                      index_col='instnm', dtype={'stabbr': 'category', 
                                                 'relaffil': 'category'})
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


### Exercise 7

<span style="color:green; font-size:16px">How many states have more schools with a higher 'satvrmid' than 'satmtmid'? Make sure to not count schools that have missing values for either one.</span>

Make a new DataFrame that drops rows when one of the sat columns is missing.

In [82]:
col_has_sat = college.dropna(subset=['satvrmid', 'satmtmid']).copy()
col_has_sat.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
University of Alabama in Huntsville,AL,0,595.0,590.0,5451.0


Only a fraction of the schools have both scores (1,184 of 7,535, or about 16%).

In [83]:
len(col_has_sat)

1184

In [84]:
len(college)

7535

Create a new boolean column that determines which score is higher.

In [85]:
col_has_sat['higher_verbal'] = col_has_sat['satvrmid'] > col_has_sat['satmtmid']
col_has_sat.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds,higher_verbal
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0,True
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0,True
University of Alabama in Huntsville,AL,0,595.0,590.0,5451.0,True


Group by state to find the fraction of schools with a higher verbal score. Only states where this fraction is greater than .5 have more schools with higher verbal than higher math.

In [86]:
avg_verbal_higher = col_has_sat.groupby('stabbr')['higher_verbal'].mean()
avg_verbal_higher.sort_values(ascending=False).head(10)

stabbr
AK    1.000000
VI    1.000000
GA    0.690476
FL    0.684211
OR    0.588235
AL    0.571429
VA    0.564103
MN    0.560000
UT    0.500000
NH    0.500000
Name: higher_verbal, dtype: float64

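Technically, ties should be checked as well: a school with equal verbal and math scores counts as `False` here, so a state could have more verbal-higher than math-higher schools without exceeding .5. No custom function is needed.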

In [87]:
(avg_verbal_higher > .5).sum()

8
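
To make the tie issue concrete, one option is to count verbal-higher and math-higher schools separately and compare the two counts per state. This is a sketch on hypothetical data (the states `AA` and `BB` and all scores are invented), not a rerun of the college dataset:

```python
import pandas as pd

# Hypothetical toy data; 'AA' has one verbal-higher school and two ties
toy = pd.DataFrame({
    'stabbr': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
    'satvrmid': [500, 500, 480, 600, 600, 500],
    'satmtmid': [450, 500, 480, 500, 550, 600],
})

verbal_higher = toy['satvrmid'] > toy['satmtmid']
math_higher = toy['satvrmid'] < toy['satmtmid']

# Approach from the solution: fraction of verbal-higher schools above .5
n_by_mean = (verbal_higher.groupby(toy['stabbr']).mean() > .5).sum()

# Tie-aware approach: compare the two counts directly
n_verbal = verbal_higher.groupby(toy['stabbr']).sum()
n_math = math_higher.groupby(toy['stabbr']).sum()
n_by_count = (n_verbal > n_math).sum()
```

On this toy frame the two approaches disagree (`n_by_mean` is 1, `n_by_count` is 2) because the tied schools in `AA` pull its fraction below .5.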

### Exercise 8

<span style="color:green; font-size:16px">Create a pivot table that shows the percentage of schools with less than 1,000 students in each state by religious affiliation. Also return the count of schools.</span>

In [88]:
def less_1k(s):
    return (s < 1_000).mean().round(3) * 100

In [89]:
result = college.pivot_table(index='stabbr', columns='relaffil', values='ugds', 
                             aggfunc=[less_1k, 'count'])
result.head(10)

Unnamed: 0_level_0,less_1k,less_1k,count,count
relaffil,0,1,0,1
stabbr,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
AK,42.9,100.0,7,3
AL,44.4,37.5,71,18
AR,58.8,61.1,68,14
AS,0.0,,1,0
AZ,63.7,77.8,118,8
CA,61.4,27.4,579,76
CO,64.4,28.6,113,4
CT,58.8,23.5,82,7
DC,52.9,0.0,14,4
DE,75.0,0.0,16,3


## 7. Filter and Transform with Groupby

Execute the cell below to reread the college dataset and use it for the exercises below.

In [90]:
import pandas as pd
cols = ['instnm', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college = pd.read_csv('../data/college.csv', usecols=cols, index_col='instnm')
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0
Amridge University,AL,1,,,291.0


### Exercise 1

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have more than 500,000 total undergraduate students. Can you verify your results?</span>

In [91]:
def filt_500k(sub_df):
    return sub_df['ugds'].sum() > 500_000

college_large = college.groupby('stabbr').filter(filt_500k)
college_large.head()

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Prince Institute-Southeast,IL,0,,,84.0
Everest College-Phoenix,AZ,1,,,4102.0
Collins College,AZ,0,,,83.0
Empire Beauty School-Paradise Valley,AZ,1,,,25.0
Empire Beauty School-Tucson,AZ,0,,,126.0


In [92]:
college_large.groupby('stabbr').agg(ugds_total=('ugds', 'sum')) \
             .sort_values('ugds_total', ascending=False).round(-3)

Unnamed: 0_level_0,ugds_total
stabbr,Unnamed: 1_level_1
CA,2304000.0
TX,1277000.0
NY,994000.0
FL,960000.0
PA,605000.0
IL,600000.0
OH,538000.0
AZ,520000.0


### Exercise 2

<span style="color:green; font-size:16px">Filter the college DataFrame for states that have an average undergraduate student population greater than 2,500 and have more than 30 religiously affiliated schools. Can you verify your results?</span>

In [93]:
def func2(sub_df):
    return sub_df['ugds'].mean() > 2_500 and sub_df['relaffil'].sum() > 30

In [94]:
c2 = college.groupby('stabbr').filter(func2)
c2.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Academy of Art University,CA,0,,,9885.0
ITT Technical Institute-Rancho Cordova,CA,0,,,500.0
Academy of Chinese Culture and Health Sciences,CA,0,,,


In [95]:
c2.groupby('stabbr').agg(mean_ugds=('ugds', 'mean'),
                         num_relaffil=('relaffil', 'sum'))

Unnamed: 0_level_0,mean_ugds,num_relaffil
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
CA,3518.308397,164
GA,2642.571429,37
IN,2653.559055,62
MI,2643.016043,48
TX,2998.530516,96
VA,2694.9,44


### Exercise 3

<span style="color:green; font-size:16px">The maximum SAT score for each test is 800. Create a new column in the college dataset that shows each school's percentage of maximum for each SAT score.</span>

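No need to use `transform` here: the maximum (800) is the same constant for every school, so ordinary column arithmetic is enough.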

In [96]:
college['pct_max_sat_verbal'] = (college['satvrmid'] / 800).round(3) * 100
college['pct_max_sat_math'] = (college['satmtmid'] / 800).round(3) * 100
college.head(3)

Unnamed: 0_level_0,stabbr,relaffil,satvrmid,satmtmid,ugds,pct_max_sat_verbal,pct_max_sat_math
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama A & M University,AL,0,424.0,420.0,4206.0,53.0,52.5
University of Alabama at Birmingham,AL,0,570.0,565.0,11383.0,71.2,70.6
Amridge University,AL,1,,,291.0,,


### Use the City of Houston dataset

Execute the following cell to read in the City of Houston employee dataset and then use it for the following exercises.

In [97]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


### Exercise 4

<span style="color:green; font-size:16px">Filter the employee dataset so that only position titles with an average salary above 100,000 remain. Can you verify your results?</span>

In [98]:
high_sal = emp.groupby('title').filter(lambda sub_df: sub_df['salary'].mean() > 100_000)
high_sal.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race
16,Other,ASSOCIATE JUDGE OF MUNICIPAL COURTS,2005-11-09,107744.0,Male,Hispanic
17,Police,POLICE COMMANDER,1983-02-07,115821.42,Male,White
19,Other,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),2002-05-28,95783.0,Female,Hispanic
39,Houston Airport System,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,2017-08-15,112270.0,Male,Black
48,Fire,ASSISTANT FIRE CHIEF,1994-11-07,115835.98,Male,Hispanic


In [99]:
high_sal.groupby('title').agg(avg_salary=('salary', 'mean')).min()

avg_salary    100038.0
dtype: float64

### Exercise 5

<span style="color:green; font-size:16px">Filter the employee dataset so that only position titles with at least 5 employees and an average salary above 80,000 remain. Can you verify the results?</span>

In [100]:
def sal_count(sub_df):
    return sub_df['salary'].mean() > 80000 and len(sub_df) >= 5

In [101]:
high_sal_count = emp.groupby('title').filter(sal_count)
high_sal_count.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
16,Other,ASSOCIATE JUDGE OF MUNICIPAL COURTS,2005-11-09,107744.0,Male,Hispanic
17,Police,POLICE COMMANDER,1983-02-07,115821.42,Male,White
19,Other,ASSISTANT DIRECTOR (EXECUTIVE LEVEL),2002-05-28,95783.0,Female,Hispanic


In [102]:
high_sal_count.groupby('title').agg(avg_salary=('salary', 'mean'),
                                    size=('salary', 'size')).min()

avg_salary    80153.202222
size              5.000000
dtype: float64

### Exercise 6

<span style="color:green; font-size:16px">Add a column to the DataFrame that contains the median salary based on department, sex, and race.</span>

In [103]:
emp['median_drs'] = emp.groupby(['dept', 'sex', 'race'])['salary'].transform('median')
emp.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race,median_drs
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White,73479.0
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic,47445.0
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black,38813.0
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.1,Male,Hispanic,68116.62
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White,73479.0


### Exercise 7

<span  style="color:green; font-size:16px">Add a new column, `pct_max_dept_sex`, to the employee DataFrame that holds the employee's percentage of the maximum salary for each department and sex. For instance, if a male HPD employee makes 80,000 and the maximum male HPD salary is 120,000, then the value for this employee would be 80,000/120,000 or .667. Verify this value for the first employee.</span>

In [104]:
def pct_max(sub_series):
    return sub_series / sub_series.max()

In [105]:
emp['pct_max_dept_sex'] = emp.groupby(['dept', 'sex'])['salary'].transform(pct_max)
emp.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race,median_drs,pct_max_dept_sex
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White,73479.0,0.312662
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic,47445.0,0.298844
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black,38813.0,0.227809
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.1,Male,Hispanic,68116.62,0.271222
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White,73479.0,0.247697


In [106]:
filt = (emp['dept'] == 'Police') & (emp['sex'] == 'Male')
max_sal = emp.loc[filt, 'salary'].max()
max_sal

280000.0

In [107]:
emp.loc[0, 'salary'] / max_sal

0.31266207142857144
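
The custom `pct_max` function can likely be replaced by dividing the salary column by the group maximum obtained with the built-in `'max'` string in `transform`. A sketch under that assumption, using a hypothetical `toy` frame with the 80,000 / 120,000 example from the exercise text:

```python
import pandas as pd

# Hypothetical toy data standing in for the employee dataset
toy = pd.DataFrame({
    'dept': ['Police', 'Police', 'Fire', 'Fire'],
    'sex': ['Male', 'Male', 'Male', 'Female'],
    'salary': [80_000.0, 120_000.0, 50_000.0, 60_000.0],
})

# Custom function applied per group, as in the solution above
custom = toy.groupby(['dept', 'sex'])['salary'].transform(lambda s: s / s.max())

# Built-in alternative: broadcast the group max, then divide once
group_max = toy.groupby(['dept', 'sex'])['salary'].transform('max')
builtin = toy['salary'] / group_max
```

The first row gives 80,000 / 120,000, and both approaches return the same values for every row.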

## 8. Other Groupby Methods

Execute the next cell to read in some of the columns from the flights dataset and use it to answer the following exercises.

In [108]:
import pandas as pd
cols = ['date', 'airline', 'origin', 'dest', 'dep_time', 'arr_time',
       'cancelled', 'air_time', 'distance', 'carrier_delay']
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'], usecols=cols)
flights.head(3)

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0


### Exercise 1

<span style="color:green; font-size:16px">For each airline, return the first and last row of each group. Use the `nth` groupby method.</span>

In [109]:
flights.groupby('airline').nth([0, -1]).head(8)

Unnamed: 0_level_0,date,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
9E,2018-12-31,CLT,JFK,1400,1603,0,80.0,541.0,0
9E,2018-01-01,IAH,ATL,1346,1651,0,86.0,689.0,0
AA,2018-01-01,DFW,DCA,610,959,0,131.0,1192.0,0
AA,2018-12-31,DFW,SFO,2047,2245,0,194.0,1464.0,0
AS,2018-12-31,SEA,DFW,2315,502,0,210.0,1660.0,3
AS,2018-01-01,SEA,SFO,605,816,0,97.0,679.0,0
B6,2018-12-31,PHX,JFK,2234,509,0,233.0,2153.0,0
B6,2018-01-01,JFK,BOS,550,657,0,39.0,187.0,0


### Exercise 2

<span style="color:green; font-size:16px">For every origin and destination combination, select the 500th flight.</span>

Only the combinations that have at least 500 flights will have a returned value.

In [110]:
flights.groupby(['origin', 'dest']).nth(499)

Unnamed: 0_level_0,Unnamed: 1_level_0,date,airline,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
origin,dest,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
JFK,LAX,2018-11-27,DL,1925,2300,0,325.0,2475.0,0
LAS,LAX,2018-12-21,WN,545,655,0,48.0,236.0,0
LAX,JFK,2018-11-29,DL,1145,2007,0,269.0,2475.0,0
LAX,LAS,2018-12-25,AA,1955,2107,0,54.0,236.0,0
LAX,SFO,2018-10-15,WN,955,1115,0,56.0,337.0,0
LGA,ORD,2018-10-17,UA,1700,1836,0,129.0,733.0,0
ORD,LGA,2018-10-14,UA,1300,1615,0,95.0,733.0,0
SFO,LAX,2018-10-19,UA,1300,1435,0,58.0,337.0,0
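
The same selection can also be sketched with `cumcount`, which numbers the rows within each group starting at 0 and keeps the original row index. This may be useful because the index returned by `nth` has differed between pandas versions. The `toy` frame below is hypothetical and selects the 3rd row per group instead of the 500th:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'origin': ['A', 'A', 'A', 'B', 'B'],
    'dest': ['X', 'X', 'X', 'Y', 'Y'],
    'flight': [1, 2, 3, 4, 5],
})

# Position of each row within its (origin, dest) group, starting at 0
position = toy.groupby(['origin', 'dest']).cumcount()

# Third row of each group; groups with fewer than three rows return nothing
third = toy[position == 2]
```

Only the `A`/`X` combination has three rows, so `third` contains a single row.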


### Exercise 3

<span style="color:green; font-size:16px">Find the date of the 10th cancelled flight for each airline.</span>

In [111]:
flights.query('cancelled == 1').groupby('airline').nth(9)

Unnamed: 0_level_0,date,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
9E,2018-03-13,JFK,BOS,905,1030,1,,187.0,0
AA,2018-01-04,EWR,PHX,1620,2009,1,,2133.0,0
AS,2018-06-17,EWR,SFO,1725,2054,1,,2565.0,0
B6,2018-01-05,BOS,DFW,731,1105,1,,1562.0,0
DL,2018-01-04,DTW,PHL,1745,1941,1,,453.0,0
EV,2018-10-28,DCA,EWR,600,715,1,,199.0,0
F9,2018-09-15,MSP,DEN,840,951,1,,680.0,0
MQ,2018-04-16,LGA,PHL,1815,1937,1,,96.0,0
NK,2018-02-12,IAH,EWR,630,1042,1,,1400.0,0
OH,2018-09-15,DCA,CLT,1355,1534,1,,331.0,0


### Exercise 4

<span style="color:green; font-size:16px">Find the average carrier delay for each origin and destination combination with more than 300 flights.</span>

In [112]:
delay = flights.groupby(['origin', 'dest'])['carrier_delay'].agg(['size', 'mean']).round(1)
delay.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,size,mean
origin,dest,Unnamed: 2_level_1,Unnamed: 3_level_1
ATL,BOS,304,2.5
ATL,CLT,262,5.0
ATL,DCA,287,2.2
ATL,DEN,215,3.0
ATL,DFW,289,2.3


In [113]:
delay.query('size > 300').head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,size,mean
origin,dest,Unnamed: 2_level_1,Unnamed: 3_level_1
ATL,BOS,304,2.5
ATL,LGA,411,1.5
ATL,MCO,373,4.6
ATL,ORD,319,2.5
BOS,DCA,383,3.4
BOS,LGA,416,3.3
BOS,ORD,314,1.8
DCA,BOS,348,1.7
DCA,ORD,339,1.2
DEN,LAX,345,6.7


### Exercise 5

<span style="color:green; font-size:16px">Find the three shortest air times for every airline.</span>

In [114]:
flights.groupby('airline')['air_time'].nsmallest(3)

airline       
9E       32935    24.0
         45541    25.0
         2317     26.0
AA       8900     22.0
         24774    23.0
         43455    23.0
AS       54429    38.0
         55921    38.0
         24645    39.0
B6       9348     32.0
         16214    32.0
         41277    32.0
DL       13270    22.0
         10482    24.0
         22989    26.0
EV       16598    35.0
         17292    35.0
         20431    35.0
F9       10686    55.0
         59371    55.0
         53304    56.0
MQ       15949    20.0
         42097    21.0
         57860    21.0
NK       141      40.0
         2719     40.0
         36361    40.0
OH       2272     28.0
         20572    29.0
         6789     30.0
OO       31828    38.0
         58185    38.0
         63449    38.0
UA       60729    33.0
         23181    34.0
         28721    34.0
VX       2129     38.0
         3545     39.0
         3723     39.0
WN       23097    36.0
         55993    37.0
         5233     38.0
YV       38847    3

In [115]:
g = flights.groupby('origin')['air_time'].agg('nlargest')

In [116]:
g

origin       
ATL     10644    340.0
        7868     327.0
        15259    325.0
        15345    323.0
        4437     322.0
                 ...  
SFO     18084    386.0
        18103    348.0
        18079    345.0
        31160    345.0
        21411    344.0
Name: air_time, Length: 100, dtype: float64

In [117]:
g.nsmallest(3)

origin       
DEN     28399    235.0
        38792    237.0
        35394    246.0
Name: air_time, dtype: float64

## 9. Binning Numeric Columns

In [118]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes.head(3)

Unnamed: 0,gender,starttime,stoptime,tripduration,from_station_name,start_capacity,to_station_name,end_capacity,temperature,wind_speed,events
0,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,11.0,Michigan Ave & Oak St,15.0,73.9,12.7,mostlycloudy
1,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,31.0,Wells St & Walton St,19.0,69.1,6.9,partlycloudy
2,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,15.0,Dearborn St & Monroe St,23.0,73.0,16.1,mostlycloudy


### Exercise 1

<span style="color:green; font-size:16px">Find the number of rides between trip durations of 0 to 100, 101 to 1000, and 1001 and above.</span>

In [119]:
max_td = bikes['tripduration'].max()
pd.cut(bikes['tripduration'], [0, 100, 1000, max_td]).value_counts()

(100, 1000]      39669
(1000, 86188]    10178
(0, 100]           242
Name: tripduration, dtype: int64

### Exercise 2

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the width of each bin is the same size. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

In [120]:
# no, this binning puts nearly all of the data into the first bin
pd.cut(bikes['tripduration'], 5).value_counts(sort=False)

(-26.128, 17285.6]    50060
(17285.6, 34511.2]       11
(34511.2, 51736.8]        9
(51736.8, 68962.4]        3
(68962.4, 86188.0]        6
Name: tripduration, dtype: int64

### Exercise 3

<span style="color:green; font-size:16px">Cut the trip duration into five bins where the number of observations in each bin is approximately the same. Count the occurrence of each bin. Sort the resulting Series by the index. Does it make sense to use this type of binning?</span>

In [121]:
# yes, this makes more sense
pd.qcut(bikes['tripduration'], 5).value_counts(sort=False)

(59.999, 317.0]      10043
(317.0, 480.0]       10011
(480.0, 682.0]       10024
(682.0, 1007.0]       9997
(1007.0, 86188.0]    10014
Name: tripduration, dtype: int64
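
The contrast between the last two exercises can be reproduced on a tiny skewed Series: `pd.cut` with an integer makes equal-width bins, while `pd.qcut` makes bins with roughly equal counts. The values below are hypothetical and chosen so that one outlier dominates the range:

```python
import pandas as pd

# Hypothetical skewed data: nine small values and one large outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# Equal-width bins: the outlier stretches the range, so one bin is nearly empty
equal_width = pd.cut(s, 2).value_counts(sort=False)

# Equal-count bins: edges are placed at quantiles of the data
equal_count = pd.qcut(s, 2).value_counts(sort=False)
```

Here `equal_width` splits the data 9 to 1, while `equal_count` splits it 5 to 5, mirroring what happens with the trip durations.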

### Exercise 4

<span style="color:green; font-size:16px">Quantile cut trip duration and temperature into five equal-sized bins and count the occurrences using `pd.crosstab`. Do you notice any patterns?</span>

Rides taken at higher temperatures tend to last longer, and rides at lower temperatures tend to be shorter.

In [122]:
td_bins = pd.qcut(bikes['tripduration'], 5)
temp_bins = pd.qcut(bikes['temperature'], 5)
pd.crosstab(index=td_bins, columns=temp_bins)

temperature,"(-9999.001, 48.0]","(48.0, 62.1]","(62.1, 71.1]","(71.1, 78.1]","(78.1, 96.1]"
tripduration,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"(59.999, 317.0]",2712,2204,1931,1670,1526
"(317.0, 480.0]",2412,1947,2068,1927,1657
"(480.0, 682.0]",2151,2103,2067,1917,1786
"(682.0, 1007.0]",1832,1940,2180,2101,1944
"(1007.0, 86188.0]",1458,1754,2328,2331,2143


### Exercise 5

<span style="color:green; font-size:16px">Create a pivot table containing the average trip duration by gender and temperature quantile cut into 10 equal-sized bins.</span>

In [123]:
temp_bins = pd.qcut(bikes['temperature'], 10)
bikes.pivot_table(index=temp_bins, columns='gender', 
                  values='tripduration', aggfunc='mean').round(0)

gender,Female,Male
temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
"(-9999.001, 37.0]",797.0,587.0
"(37.0, 48.0]",670.0,648.0
"(48.0, 55.9]",762.0,622.0
"(55.9, 62.1]",789.0,653.0
"(62.1, 66.9]",791.0,724.0
"(66.9, 71.1]",797.0,706.0
"(71.1, 73.9]",844.0,746.0
"(73.9, 78.1]",897.0,730.0
"(78.1, 82.0]",823.0,725.0
"(82.0, 96.1]",906.0,756.0


### Exercise 6

<span style="color:green; font-size:16px">The temperature column has a single obviously wrong value. Replace this value with the numpy nan object and then cut the resulting Series into five bins, labeling them 'cold', 'cool', 'mild', 'warm', 'hot'. Choose the boundaries of the bins that make sense for these labels. Then count the occurrence of each label and include the missing values.</span>

In [124]:
# -9999 is wrong
bikes['temperature'].drop_duplicates().sort_values().head()

27168   -9999.0
2064       -8.0
10262      -6.0
2062       -5.1
21774      -4.0
Name: temperature, dtype: float64

In [125]:
import numpy as np
temp = bikes['temperature'].replace(-9999, np.nan)
tmin, tmax = temp.agg(['min', 'max'])
tmin, tmax

(-8.0, 96.1)

In [126]:
# cut the cleaned Series (temp), not the raw column
temp_bins = pd.cut(temp, [tmin, 40, 55, 65, 75, tmax], 
                   labels=['cold', 'cool', 'mild', 'warm', 'hot'], 
                   include_lowest=True)
temp_bins.head()

0    warm
1    warm
2    warm
3    warm
4    warm
Name: temperature, dtype: category
Categories (5, object): ['cold' < 'cool' < 'mild' < 'warm' < 'hot']

In [127]:
temp_bins.value_counts(dropna=False, sort=False)

cold     6433
cool     8483
mild     8571
warm    13302
hot     13299
NaN         1
Name: temperature, dtype: int64
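
The single `NaN` in the count comes from the replaced sentinel value: `pd.cut` returns a missing value for any input that is missing or falls outside the bin edges. A minimal sketch with hypothetical temperatures and simplified bin edges:

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures with one sentinel value marking bad data
s = pd.Series([-9999, 30, 60, 90])

# Replace the sentinel with a missing value before binning
clean = s.replace(-9999, np.nan)

# Simplified three-bin version of the labeling used above
binned = pd.cut(clean, [0, 50, 70, 100], labels=['cold', 'mild', 'hot'])
```

The first element of `binned` is missing, and the remaining three are labeled `cold`, `mild`, and `hot`; passing `dropna=False` to `value_counts` is what makes that missing entry show up in the counts.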

## 10. Miscellaneous Grouping Functionality

In [128]:
flights = pd.read_csv('../data/flights.csv')
flights.head(3)

Unnamed: 0,date,airline,origin,dest,dep_time,arr_time,cancelled,air_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018-01-01,UA,LAS,IAH,100,547,0,134.0,1222.0,0,0,0,0,0
1,2018-01-01,WN,DEN,PHX,515,720,0,91.0,602.0,0,0,0,0,0
2,2018-01-01,B6,JFK,BOS,550,657,0,39.0,187.0,0,83,8,0,0


### Exercise 1

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a carrier delay of 15 minutes or more. The values should be `False` if under 15 minutes and `True` if 15 minutes or over. Find the average distance flown by each group.</span>

In [129]:
has_carrier_delay = flights['carrier_delay'] >= 15
has_carrier_delay.head()

0    False
1    False
2    False
3    False
4    False
Name: carrier_delay, dtype: bool

In [130]:
dist = flights['distance']
dist.groupby(has_carrier_delay).mean()

carrier_delay
False    1067.506608
True     1100.157233
Name: distance, dtype: float64
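
The grouping key here is a separate boolean Series rather than a column name; pandas aligns it with the grouped Series by index. A small sketch of the same pattern on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'carrier_delay': [0, 20, 5, 30],
    'distance': [100.0, 300.0, 200.0, 500.0],
})

# Boolean Series used as the grouping key
flag = toy['carrier_delay'] >= 15

# Mean distance for the False group and the True group
means = toy['distance'].groupby(flag).mean()
```

The result has two entries, one for `False` (delays under 15 minutes) and one for `True`.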

### Exercise 2

<span style="color:green; font-size:16px">Create a Series of booleans determining if there is a weather delay of 15 minutes or more. Compute a cross tabulation of this Series with the similar one created above on carrier delay.</span>

In [131]:
has_weather_delay = flights['weather_delay'] >= 15

In [132]:
pd.crosstab(index=has_carrier_delay, columns=has_weather_delay)

weather_delay,False,True
carrier_delay,Unnamed: 1_level_1,Unnamed: 2_level_1
False,61902,523
True,3478,20


### Exercise 3

<span style="color:green; font-size:16px">Find the total carrier delay by airline and origin as a Series with a multi-level index.</span>

In [133]:
s = flights.groupby(['airline', 'origin'])['carrier_delay'].sum()
s.head()

airline  origin
9E       ATL         0
         BOS       233
         CLT       463
         DCA       602
         DFW       250
Name: carrier_delay, dtype: int64

### Exercise 4

<span style="color:green; font-size:16px">Using the Series from Exercise 3, calculate the total carrier delay by airline. Verify the result by calculating it directly from the original DataFrame.</span>

In [134]:
s.groupby('airline').sum()

airline
9E     4338
AA    65027
AS     8890
B6    20094
DL    43580
EV      754
F9     6627
MQ     1098
NK     6821
OH     1486
OO    11788
UA    38674
VX     1500
WN    17353
YV     4107
YX     7627
Name: carrier_delay, dtype: int64

In [135]:
# direct verification
flights.groupby('airline')['carrier_delay'].sum()

airline
9E     4338
AA    65027
AS     8890
B6    20094
DL    43580
EV      754
F9     6627
MQ     1098
NK     6821
OH     1486
OO    11788
UA    38674
VX     1500
WN    17353
YV     4107
YX     7627
Name: carrier_delay, dtype: int64
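
Grouping a Series by the name of one of its index levels, as done with `s.groupby('airline')` above, can also be written with the explicit `level` parameter. A sketch on a hypothetical `toy` frame, including the same direct verification:

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset
toy = pd.DataFrame({
    'airline': ['AA', 'AA', 'AA', 'UA'],
    'origin': ['ATL', 'ATL', 'BOS', 'ATL'],
    'carrier_delay': [5, 10, 20, 7],
})

# Series with a two-level index (airline, origin)
s = toy.groupby(['airline', 'origin'])['carrier_delay'].sum()

# Collapse the origin level by grouping on the airline index level
by_level = s.groupby(level='airline').sum()

# Direct calculation from the original frame
direct = toy.groupby('airline')['carrier_delay'].sum()
```

Summing the per-origin totals within each airline gives the same numbers as grouping the original frame by airline alone.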

### Exercise 5

<span style="color:green; font-size:16px">Read in the Sweden deaths dataset found in the covid folder. Place the year column in the index and then calculate the total number of deaths by 10-year age interval per year. Then take this DataFrame and calculate the average deaths per age group, grouped by 5-year time spans.</span>

In [136]:
df = pd.read_csv('../data/covid/sweden_deaths.csv', index_col='year')
df.columns = df.columns.astype('int64')
df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,...,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1
1980,671,54,27,29,31,24,24,23,23,22,15,14,30,26,39,42,47,54,57,63,81,65,64,90,86,100,106,88,95,118,103,131,131,139,159,176,177,163,161,149,204,176,224,211,231,252,275,295,290,366,...,429,463,510,564,655,682,823,950,1035,1198,1059,1179,1332,1412,1517,1722,1811,2104,2098,2364,2555,2598,2831,2822,2961,3072,3273,3245,3280,3387,3254,3253,3039,2991,2887,2540,2229,2119,1843,1563,1254,1151,875,693,509,370,248,166,89,167
1981,653,40,21,22,18,20,24,20,22,16,23,13,18,20,40,41,41,57,74,77,75,75,73,73,75,74,70,85,93,103,105,117,103,140,126,157,150,172,166,162,147,185,208,177,222,239,255,275,298,354,...,424,441,482,536,638,681,760,826,982,1146,1203,1099,1260,1430,1490,1579,1872,1988,2241,2304,2613,2672,2821,2939,3017,3012,3135,3312,3375,3347,3344,3181,3164,3165,2954,2746,2435,2151,1864,1621,1255,1093,922,702,528,393,281,183,97,144
1982,635,41,27,28,30,18,13,20,24,16,17,14,16,15,18,48,43,47,56,68,59,79,55,60,68,85,79,96,97,84,94,97,115,117,147,166,160,189,202,173,170,169,198,211,209,245,237,259,293,325,...,405,460,456,510,619,708,706,841,891,1044,1206,1307,1174,1355,1454,1540,1796,2015,2120,2289,2363,2671,2731,2893,2993,3119,3081,3214,3353,3480,3271,3228,3179,2996,2832,2571,2357,2146,1949,1557,1413,1082,887,712,508,383,280,178,114,180
1983,646,33,26,23,19,18,24,10,21,20,16,17,21,15,16,45,51,39,88,70,66,53,82,84,76,77,98,83,92,90,109,92,100,114,113,150,169,155,156,161,215,198,193,192,204,210,244,264,262,319,...,383,460,487,520,541,589,621,748,911,959,1119,1234,1374,1261,1487,1491,1675,1865,2040,2100,2416,2496,2695,3004,2994,3095,3201,3215,3305,3304,3436,3354,3148,3118,3034,2546,2568,2248,2000,1667,1495,1192,927,691,547,439,277,207,133,202
1984,600,40,15,22,11,13,16,19,13,17,15,23,19,18,26,39,37,35,77,70,82,63,71,83,69,75,96,90,89,88,113,101,94,121,116,141,157,156,179,183,208,186,228,198,229,226,233,259,263,293,...,336,421,423,491,552,630,657,739,818,895,1035,1229,1329,1498,1402,1444,1586,1783,1922,2105,2332,2556,2619,2784,3085,3299,3223,3338,3324,3309,3413,3359,3254,3091,2924,2645,2474,2288,2043,1757,1436,1209,980,673,556,427,313,192,134,221


In [137]:
age_bins = pd.cut(df.columns.astype('int64'), 
                  [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 101], 
                  right=False)
deaths_grouped = df.groupby(age_bins, axis=1).sum()
deaths_grouped.head()

Unnamed: 0_level_0,"[0, 10)","[10, 20)","[20, 30)","[30, 40)","[40, 50)","[50, 60)","[60, 70)","[70, 80)","[80, 90)","[90, 101)"
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1980,928,387,893,1489,2524,6516,15432,29001,27542,7085
1981,856,404,796,1398,2360,6142,15308,29200,28351,7219
1982,852,342,762,1460,2316,5917,15011,28707,28009,7294
1983,840,378,801,1319,2301,5592,14505,28521,28756,7777
1984,766,359,806,1361,2323,5381,14123,28665,28800,7898
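
Recent pandas releases deprecate the `axis=1` argument of `groupby`. One possible workaround, sketched here on a tiny invented table (`toy`, its numbers, and the two bins are hypothetical), is to transpose, group the rows, and transpose back.

```python
import pandas as pd

# hypothetical data: rows are years, columns are single ages 0-5
toy = pd.DataFrame([[5, 1, 1, 2, 3, 4],
                    [4, 0, 2, 2, 1, 6]],
                   index=pd.Index([1980, 1981], name='year'),
                   columns=[0, 1, 2, 3, 4, 5])

# left-closed bins [0, 3) and [3, 6), one label per column
age_bins = pd.cut(toy.columns, [0, 3, 6], right=False)

# transpose so ages become rows, group them, then transpose back
grouped = toy.T.groupby(age_bins, observed=False).sum().T
```

The result keeps the years in the index and has one column per age interval, the same layout that the `axis=1` version produces.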


In [138]:
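# pd.cut builds right-closed intervals by default, so 1980 falls outside
# the first bin (1980, 1985] and is dropped; the first year used is 1981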
year_bins = pd.cut(deaths_grouped.index, 
                   [1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020])
deaths_grouped.groupby(year_bins).mean()

Unnamed: 0,"[0, 10)","[10, 20)","[20, 30)","[30, 40)","[40, 50)","[50, 60)","[60, 70)","[70, 80)","[80, 90)","[90, 101)"
"(1980, 1985]",832.8,369.4,782.6,1378.8,2327.2,5677.4,14652.8,28872.0,28925.2,7783.4
"(1985, 1990]",850.0,356.8,806.8,1243.2,2606.6,4903.0,13272.0,28054.0,32420.8,9608.6
"(1990, 1995]",755.0,264.0,671.2,1076.8,2615.2,4745.0,11287.8,26259.4,35024.6,11844.2
"(1995, 2000]",465.8,231.2,549.4,917.0,2119.8,5115.4,9670.6,24168.4,35999.4,14546.4
"(2000, 2005]",415.8,234.6,530.2,798.4,1856.0,5238.6,9442.8,20645.0,36548.8,17082.6
"(2005, 2010]",389.6,235.8,542.6,709.0,1693.4,4471.0,10413.2,18020.6,35769.4,18739.8
"(2010, 2015]",368.8,184.8,610.6,706.0,1574.2,3915.8,10539.0,18172.2,32979.8,21381.0
"(2015, 2020]",356.2,182.2,622.6,757.6,1415.2,3645.0,9419.6,21104.4,32077.4,22825.6
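
If the first year should be counted as well, one option is to build left-closed bins with `right=False`. The sketch below uses invented yearly totals (`toy` is hypothetical, and the bin edges are chosen only for illustration) to contrast the two behaviours.

```python
import pandas as pd

# hypothetical yearly totals indexed by year
toy = pd.Series([10, 20, 30, 40, 50, 60],
                index=pd.Index([1980, 1981, 1984, 1985, 1989, 1990], name='year'))

# default: right-closed bins (1980, 1985] and (1985, 1990]; 1980 gets no bin
right_closed = pd.cut(toy.index, [1980, 1985, 1990])

# left-closed bins [1980, 1985), [1985, 1990), [1990, 1995); 1980 is included
left_closed = pd.cut(toy.index, [1980, 1985, 1990, 1995], right=False)

# rows whose bin is missing are left out of the groupby result
mean_default = toy.groupby(right_closed, observed=True).mean()
mean_left = toy.groupby(left_closed, observed=True).mean()
```

With left-closed bins an extra upper edge is needed so that the last year still falls inside an interval.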
