# Solutions

1. [Groupby Aggregation Basics](#01.-Groupby-Aggregation-Basics)
1. [Grouping and Aggregating with Multiple Columns](#02.-Grouping-and-Aggregating-with-Multiple-Columns)

# 01. Groupby Aggregation Basics

In [1]:
import pandas as pd
nyc = pd.read_csv('../data/nyc_deaths.csv')
nyc.head()

Unnamed: 0,year,cause,sex,race,deaths
0,2007,Accidents,F,Asian,32
1,2007,Accidents,F,Black,87
2,2007,Accidents,F,Hispanic,71
3,2007,Accidents,F,White,162
4,2007,Accidents,M,Asian,53


### Problem 1
<span  style="color:green; font-size:16px">What year had the most deaths?</span>

In [2]:
year_deaths = nyc.groupby('year').agg({'deaths':'sum'})
year_deaths.idxmax()

deaths    2008
dtype: int64

In [3]:
# one line
nyc.groupby('year').agg({'deaths':'sum'}).idxmax()

deaths    2008
dtype: int64
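
Calling `idxmax` on a one-column DataFrame returns a one-element Series. If a plain scalar is preferred, one option is to select the column before aggregating, so the result is a Series and `idxmax` returns the label itself. A minimal sketch on a small made-up table (the `toy` frame and its numbers are illustrative, not taken from `nyc_deaths.csv`):

```python
import pandas as pd

# Toy stand-in for the nyc DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'year':   [2007, 2007, 2008, 2008, 2009],
    'deaths': [10, 20, 25, 15, 12],
})

# Selecting the column first yields a Series indexed by year
deaths_by_year = toy.groupby('year')['deaths'].sum()

# idxmax on a Series returns the index label (a scalar), not a Series
top_year = deaths_by_year.idxmax()
print(top_year)
```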

### Problem 2
<span  style="color:green; font-size:16px">Find the total number of deaths by race and sort by most to least.</span>

In [4]:
nyc.groupby('race').agg({'deaths':'sum'}).sort_values('deaths', ascending=False)

Unnamed: 0_level_0,deaths
race,Unnamed: 1_level_1
White,206487
Black,111116
Hispanic,74802
Asian,26355
Unknown,6238


### Use the employee dataset for the remaining problems

In [5]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22


### Problem 3
<span  style="color:green; font-size:16px">Find the maximum salary for each gender.</span>

In [6]:
emp.groupby('gender').agg({'salary':'max'})

Unnamed: 0_level_0,salary
gender,Unnamed: 1_level_1
Female,178331.0
Male,275000.0


### Problem 4
<span  style="color:green; font-size:16px">Find the median salary for each department.</span>

In [7]:
emp.groupby('dept').agg({'salary':'median'}).head()

Unnamed: 0_level_0,salary
dept,Unnamed: 1_level_1
Admn. & Regulatory Affairs,37710.0
City Controller's Office,57054.0
City Council,54000.0
Convention and Entertainment,38397.0
Dept of Neighborhoods (DON),43742.0


### Problem 5
<span  style="color:green; font-size:16px">Find the average salary for each race. Return a DataFrame with the race as a column.</span>

In [8]:
emp.groupby('race').agg({'salary':'mean'}).reset_index()

Unnamed: 0,race,salary
0,Asian,61660.304762
1,Black,50137.801493
2,Hispanic,52345.562771
3,Native American,60272.1
4,Other,51278.0
5,White,64419.799012
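
An alternative to `reset_index` is to pass `as_index=False` to `groupby`, which keeps the grouping column as a regular column from the start. A short sketch, assuming a made-up `toy` frame rather than the employee data:

```python
import pandas as pd

# Toy stand-in for the emp DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'race':   ['A', 'A', 'B', 'B'],
    'salary': [40000.0, 60000.0, 30000.0, 50000.0],
})

# as_index=False leaves 'race' as an ordinary column in the result
by_race = toy.groupby('race', as_index=False).agg({'salary': 'mean'})
print(by_race)
```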


# 02. Grouping and Aggregating with Multiple Columns

In [9]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date', 'job_date'])
emp['experience'] = 2016 - emp['hire_date'].dt.year
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date,experience
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13,10
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18,16
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03,1
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25,34
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22,27


### Problem 1
<span  style="color:green; font-size:16px">For each department and gender find the number of unique position titles, the total number of employees and the average salary. Make sure there is no multi-index for the index or columns.</span>

In [10]:
data = emp.groupby(['dept', 'gender']).agg({'title':['nunique','size'],
                                            'salary':'mean'}).reset_index()
data.columns = ['dept', 'gender', 'num unique positions', 'size', 'mean salary']
data.head(10)

Unnamed: 0,dept,gender,num unique positions,size,mean salary
0,Admn. & Regulatory Affairs,Female,16,22,48758.181818
1,Admn. & Regulatory Affairs,Male,7,7,57592.285714
2,City Controller's Office,Female,2,4,58979.5
3,City Controller's Office,Male,1,1,42640.0
4,City Council,Female,5,7,59260.0
5,City Council,Male,4,4,58491.5
6,Convention and Entertainment,Female,1,1,38397.0
7,Dept of Neighborhoods (DON),Female,8,8,50577.5
8,Dept of Neighborhoods (DON),Male,6,9,43995.444444
9,Finance,Female,4,4,83254.25
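
Renaming the columns by hand works, but it depends on getting the column order right. Named aggregation (available in pandas 0.25 and later) is one way to avoid the column MultiIndex altogether. The sketch below uses a made-up `toy` frame, and the output column names are just illustrative choices:

```python
import pandas as pd

# Toy stand-in for the emp DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'dept':   ['Library', 'Library', 'Library', 'Finance'],
    'gender': ['Female', 'Female', 'Male', 'Female'],
    'title':  ['CLERK', 'CLERK', 'ANALYST', 'ANALYST'],
    'salary': [30000.0, 32000.0, 50000.0, 80000.0],
})

# Each keyword is an output column name; each tuple is (input column, function)
summary = (
    toy.groupby(['dept', 'gender'])
       .agg(num_unique_positions=('title', 'nunique'),
            num_employees=('title', 'size'),
            mean_salary=('salary', 'mean'))
       .reset_index()
)
print(summary)
```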


### Problem 2
<span  style="color:green; font-size:16px">For each department, race and gender find the maximum years of experience and salary.</span>

In [11]:
emp.groupby(['dept','race','gender']).agg({'experience': 'max',
                                           'salary': 'max'}).reset_index().head(10)

Unnamed: 0,dept,race,gender,experience,salary
0,Admn. & Regulatory Affairs,Asian,Female,15,130416.0
1,Admn. & Regulatory Affairs,Black,Female,23,72741.0
2,Admn. & Regulatory Affairs,Black,Male,4,30098.0
3,Admn. & Regulatory Affairs,Hispanic,Female,12,47341.0
4,Admn. & Regulatory Affairs,Hispanic,Male,9,35318.0
5,Admn. & Regulatory Affairs,White,Female,21,62129.0
6,Admn. & Regulatory Affairs,White,Male,15,140416.0
7,City Controller's Office,Asian,Female,3,59077.0
8,City Controller's Office,Black,Female,24,57054.0
9,City Controller's Office,Black,Male,3,42640.0


## Use the college dataset for the rest of the problems

In [12]:
college = pd.read_csv('../data/college.csv')
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,...,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


### Problem 3
<span  style="color:green; font-size:16px">Which city name appears most frequently? Do this in two different ways: once with and once without the `groupby` method.</span>

In [13]:
size = college.groupby('city').agg({'stabbr': 'size'})
size.head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
ARTESIA,1
Aberdeen,3
Abilene,5
Abingdon,2
Abington,1


In [14]:
size.sort_values('stabbr', ascending=False).head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
New York,87
Chicago,78
Houston,72
Los Angeles,56
Miami,51


You can also call `size` directly on the groupby object and sort the resulting Series.

In [15]:
college.groupby('city').size().sort_values(ascending=False).head()

city
New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
dtype: int64

### Without groupby
Just use **`value_counts`**! Much easier

In [16]:
college['city'].value_counts().head()

New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
Name: city, dtype: int64
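
If only the single most frequent name is needed, `idxmax` on the counts returns it directly. A minimal sketch on a made-up Series of city names (not the college data):

```python
import pandas as pd

# Invented city names for illustration
cities = pd.Series(
    ['Houston', 'Chicago', 'Houston', 'Miami', 'Houston', 'Chicago'],
    name='city',
)

# value_counts sorts from most to least frequent
counts = cities.value_counts()

# idxmax returns the label with the highest count
most_common = counts.idxmax()
print(most_common, counts.loc[most_common])
```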

### Problem 4
<span  style="color:green; font-size:16px">Does the city **`Houston`** only appear in the state of **`Texas`**?</span>

NO! It also appears in Missouri.

In [17]:
filt = college['city'] == 'Houston'
college.loc[filt, 'stabbr'].unique()

array(['TX', 'MO'], dtype=object)

We can also see the exact counts per state:

In [18]:
college.loc[filt, 'stabbr'].value_counts()

TX    71
MO     1
Name: stabbr, dtype: int64

You can also use a groupby to find the number of unique states for each city. This is less efficient here, since it computes the count for every city when only Houston is needed.

In [19]:
city_unique_state = college.groupby('city').agg({'stabbr': 'nunique'})
city_unique_state.head()

Unnamed: 0_level_0,stabbr
city,Unnamed: 1_level_1
ARTESIA,1
Aberdeen,2
Abilene,1
Abingdon,1
Abington,1


In [20]:
city_unique_state.loc['Houston']

stabbr    2
Name: Houston, dtype: int64
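
The groupby approach does pay off when the question is asked of every city at once, for example "which city names appear in more than one state?". A sketch of that idea on a made-up `toy` frame:

```python
import pandas as pd

# Toy stand-in for the college DataFrame; rows are invented for illustration
toy = pd.DataFrame({
    'city':   ['Houston', 'Houston', 'Houston', 'Austin', 'Austin'],
    'stabbr': ['TX', 'TX', 'MO', 'TX', 'TX'],
})

# Number of distinct states for each city name
states_per_city = toy.groupby('city')['stabbr'].nunique()

# Keep only the city names found in more than one state
multi_state = states_per_city[states_per_city > 1]
print(multi_state)
```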

### Problem 5
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [21]:
college.groupby('stabbr').agg({'ugds': 'max'}).head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,12865.0
AL,29851.0
AR,21405.0
AS,1276.0
AZ,151558.0
CA,44744.0
CO,25873.0
CT,18016.0
DC,10433.0
DE,18222.0


### Problem 6
<span  style="color:green; font-size:16px">Among colleges that have the largest undergrad population for each state, what is the difference between the most and least populous college?</span>

In [22]:
# reuses the groupby from Problem 5
largest_per_state = college.groupby('stabbr').agg({'ugds': 'max'})
largest_per_state.max() - largest_per_state.min()

ugds    150956.0
dtype: float64

### Problem 7: Advanced
<span  style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

The following returns the index of the maximum value of population for each state. 

In [23]:
max_indexes = college.groupby('stabbr').agg({'ugds': 'idxmax'})
max_indexes.head(10)

Unnamed: 0_level_0,ugds
stabbr,Unnamed: 1_level_1
AK,60
AL,5
AR,137
AS,4138
AZ,7116
CA,1299
CO,574
CT,641
DC,701
DE,691


For instance, the row with index label 60 has the maximum population for Alaska. Let's verify this by selecting the institution name and population of this specific row.

In [24]:
cols = ['instnm', 'ugds']
college.loc[60, cols]

instnm    University of Alaska Anchorage
ugds                               12865
Name: 60, dtype: object

Verify by selecting only Alaska colleges and getting the max value:

In [25]:
filt = college['stabbr'] == 'AK'
college.loc[filt, 'ugds'].max()

12865.0

We need to get the index locations as a Series or a NumPy array to use with **`.loc`**. Currently **`max_indexes`** is a DataFrame.

In [26]:
locs = max_indexes['ugds']
locs.head()

stabbr
AK      60
AL       5
AR     137
AS    4138
AZ    7116
Name: ugds, dtype: int64

We can pass this Series to **`.loc`** which will select just those indexes, along with the columns we want.

In [27]:
cols = ['stabbr', 'instnm', 'ugds']
college.loc[locs, cols].head()

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
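
The same idea can be sketched in fewer steps by selecting the `ugds` column before calling `idxmax`, which returns a Series of index labels that can go straight into **`.loc`**. The `toy` frame below is made up, and its index deliberately does not start at 0 to show that `idxmax` returns labels rather than integer positions:

```python
import pandas as pd

# Toy stand-in for the college DataFrame; values are invented for illustration
toy = pd.DataFrame(
    {
        'stabbr': ['AK', 'AK', 'AL', 'AL'],
        'instnm': ['College A', 'College B', 'College C', 'College D'],
        'ugds':   [100.0, 300.0, 250.0, 50.0],
    },
    index=[10, 11, 12, 13],
)

# Selecting the column first gives a Series of index labels, one per state
locs = toy.groupby('stabbr')['ugds'].idxmax()

# Use the labels to pull the full rows for the largest college in each state
largest = toy.loc[locs, ['stabbr', 'instnm', 'ugds']]
print(largest)
```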


### Alternative method if the index is `instnm`

In [28]:
college_instm = college.set_index('instnm')
cols = ['stabbr', 'ugds']
college_instm = college_instm[cols]
college_instm.head()

Unnamed: 0_level_0,stabbr,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,AL,4206.0
University of Alabama at Birmingham,AL,11383.0
Amridge University,AL,291.0
University of Alabama in Huntsville,AL,5451.0
Alabama State University,AL,4811.0


In [29]:
# group by state and use idxmax; the labels returned are now institution names
max_colleges = college_instm.groupby('stabbr').agg({'ugds': 'idxmax'})
max_names = max_colleges['ugds']
college_instm.loc[max_names].head()

Unnamed: 0_level_0,stabbr,ugds
instnm,Unnamed: 1_level_1,Unnamed: 2_level_1
University of Alaska Anchorage,AK,12865.0
The University of Alabama,AL,29851.0
University of Arkansas,AR,21405.0
American Samoa Community College,AS,1276.0
University of Phoenix-Arizona,AZ,151558.0


## Yet another way
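Use the **`first`** groupby method after sorting to return the top entry of each group. Note that **`first`** takes the first non-missing value of each column separately, so this matches the first row only when that row has no missing values.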

In [30]:
cols = ['stabbr', 'instnm', 'ugds']
college_trim = college[cols]

# sort by state then by population descending
college_trim_sort = college_trim.sort_values(['stabbr', 'ugds'], ascending=[True, False])


# group by state and take the first in the group
college_trim_sort.groupby('stabbr').first().head()

Unnamed: 0_level_0,instnm,ugds
stabbr,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,University of Alaska Anchorage,12865.0
AL,The University of Alabama,29851.0
AR,University of Arkansas,21405.0
AS,American Samoa Community College,1276.0
AZ,University of Phoenix-Arizona,151558.0
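
One caveat is worth a quick sketch: by default the `first` groupby method skips missing values column by column, so it can combine values from different rows. The made-up `toy` frame below has a missing name in its largest row to show the difference from `head(1)`, which returns actual rows:

```python
import pandas as pd

# Toy data invented for illustration; the largest school has a missing name
toy = pd.DataFrame({
    'stabbr': ['AK', 'AK'],
    'instnm': [None, 'College B'],
    'ugds':   [300.0, 100.0],
})

toy_sorted = toy.sort_values(['stabbr', 'ugds'], ascending=[True, False])

# first() takes the first non-missing value of each column independently
by_first = toy_sorted.groupby('stabbr').first()

# head(1) returns the actual first row of each group, missing values included
by_head = toy_sorted.groupby('stabbr').head(1)

print(by_first)
print(by_head)
```

Here `by_first` pairs the name from the second row with the population from the first row, while `by_head` keeps the first row intact.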


## Use `sort_values` with `drop_duplicates`
We've done this in previous notebooks. No grouping.

In [31]:
college_trim.sort_values(['stabbr', 'ugds'], ascending=[True, False]) \
            .drop_duplicates(subset='stabbr').head(10)

Unnamed: 0,stabbr,instnm,ugds
60,AK,University of Alaska Anchorage,12865.0
5,AL,The University of Alabama,29851.0
137,AR,University of Arkansas,21405.0
4138,AS,American Samoa Community College,1276.0
7116,AZ,University of Phoenix-Arizona,151558.0
1299,CA,Ashford University,44744.0
574,CO,University of Colorado Boulder,25873.0
641,CT,University of Connecticut,18016.0
701,DC,George Washington University,10433.0
691,DE,University of Delaware,18222.0


### Problem 8
<span  style="color:green; font-size:16px">Do distance-only schools tend to have a larger or smaller student population than non-distance-only schools?</span>

In [32]:
# They have more students on average
college.groupby('distanceonly').agg({'ugds': 'mean'})

Unnamed: 0_level_0,ugds
distanceonly,Unnamed: 1_level_1
0.0,2334.648135
1.0,6245.74359


### Problem 9
<span  style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [34]:
# Less
college.groupby('distanceonly').agg({'relaffil': 'mean'})

Unnamed: 0_level_0,relaffil
distanceonly,Unnamed: 1_level_1
0.0,0.149635
1.0,0.05
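
Because `relaffil` is coded as 0/1, its mean within each group is the share of schools in that group with a religious affiliation. Reporting the group size next to that share helps judge how much weight each mean deserves. A minimal sketch on a made-up `toy` frame (the output column names are illustrative):

```python
import pandas as pd

# Toy stand-in for the college DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'distanceonly': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'relaffil':     [1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the proportion of ones; size counts the rows
summary = toy.groupby('distanceonly').agg(
    share_religious=('relaffil', 'mean'),
    n_schools=('relaffil', 'size'),
)
print(summary)
```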


### Problem 10
<span  style="color:green; font-size:16px">Among religiously affiliated schools, which state has the lowest percentage that are currently operating?</span>

In [35]:
filt = college['relaffil'] == 1
cr = college[filt]
rel_oper_mean = cr.groupby('stabbr').agg({'curroper': 'mean'})
rel_oper_mean.head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
AK,1.0
AL,0.916667
AR,0.944444
AZ,0.444444
CA,0.585366


In [36]:
# Utah. Answer makes sense.
rel_oper_mean.sort_values('curroper').head()

Unnamed: 0_level_0,curroper
stabbr,Unnamed: 1_level_1
UT,0.4
AZ,0.444444
NV,0.5
CA,0.585366
CT,0.647059


### Problem 11
<span  style="color:green; font-size:16px">Trim the **`college`** DataFrame to only the 'race' columns - those beginning with **`ugds_`**. Create a new column called **`ugds_other`** that is the sum of any race column that averages under 4% for the entire dataset.</span>

In [37]:
pd.options.display.max_columns = 100

In [38]:
college.head()

Unnamed: 0,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [39]:
# trim dataframe; copy so a new column can be added later without a SettingWithCopyWarning
df_race = college.loc[:, 'ugds_white':'ugds_unkn'].copy()

race_average = df_race.mean()

race_average

ugds_white    0.510207
ugds_black    0.189997
ugds_hisp     0.161635
ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
ugds_unkn     0.045181
dtype: float64

In [40]:
# keep only those less than 4%
other_race = race_average[race_average < .04]

other_race

ugds_asian    0.033544
ugds_aian     0.013813
ugds_nhpi     0.004569
ugds_2mor     0.023950
ugds_nra      0.016086
dtype: float64

In [41]:
# get the column names
race_columns = other_race.index

race_columns

Index(['ugds_asian', 'ugds_aian', 'ugds_nhpi', 'ugds_2mor', 'ugds_nra'], dtype='object')

In [42]:
# grab the columns and sum across the rows
df_race['ugds_other'] = df_race[race_columns].sum(axis='columns')

# can drop the low percentage columns
df_race.drop(race_columns, axis=1).head(10)

Unnamed: 0,ugds_white,ugds_black,ugds_hisp,ugds_unkn,ugds_other
0,0.0333,0.9353,0.0055,0.0138,0.0121
1,0.5922,0.26,0.0283,0.01,0.1094
2,0.299,0.4192,0.0069,0.2715,0.0034
3,0.6988,0.1255,0.0382,0.035,0.1025
4,0.0158,0.9208,0.0121,0.0137,0.0376
5,0.7825,0.1119,0.0348,0.0026,0.0682
6,0.7255,0.2613,0.0044,0.0019,0.0069
7,0.7823,0.12,0.0191,0.0334,0.0451
8,0.5328,0.3376,0.0074,0.0246,0.0975
9,0.8507,0.0704,0.0248,0.014,0.0401
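
The steps above can be condensed into a few lines. The sketch below runs on a made-up `toy` frame and selects the race columns by prefix with `filter(regex='^ugds_')` instead of a label slice. That is a swapped-in technique, chosen because it does not depend on column order and leaves out a plain `ugds` column, which has no underscore after the prefix:

```python
import pandas as pd

# Toy stand-in for the college DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'ugds':       [1000.0, 2000.0],
    'ugds_white': [0.60, 0.50],
    'ugds_black': [0.30, 0.40],
    'ugds_asian': [0.03, 0.02],
    'ugds_nhpi':  [0.01, 0.00],
})

# Keep only columns whose names start with 'ugds_'; copy before adding a column
race = toy.filter(regex='^ugds_').copy()

# Names of the columns whose overall mean is under 4%
means = race.mean()
small = means[means < 0.04].index

# Sum the small columns across each row, then drop them
race['ugds_other'] = race[small].sum(axis='columns')
trimmed = race.drop(columns=small)
print(trimmed)
```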


### Problem 12
<span  style="color:green; font-size:16px">Which are the top 5 historically black colleges with the highest white percentage?</span>

In [43]:
filt = college['hbcu'] == 1
cols = ['instnm', 'ugds_white']
college.loc[filt, cols].sort_values('ugds_white', ascending=False).head()

Unnamed: 0,instnm,ugds_white
4021,Bluefield State College,0.8437
17,Gadsden State Community College,0.6921
4050,West Virginia State University,0.5816
48,Shelton State Community College,0.5613
55,H Councill Trenholm State Community College,0.3951
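
As an alternative to sorting and taking `head`, the `nlargest` method returns the top rows by a column in one call. A short sketch on a made-up `toy` frame, taking the top 2 rather than the top 5:

```python
import pandas as pd

# Toy stand-in for the college DataFrame; values are invented for illustration
toy = pd.DataFrame({
    'instnm':     ['College A', 'College B', 'College C', 'College D'],
    'hbcu':       [1.0, 1.0, 0.0, 1.0],
    'ugds_white': [0.2, 0.8, 0.9, 0.5],
})

# Filter to HBCUs, then take the rows with the largest white percentage
filt = toy['hbcu'] == 1
top2 = toy.loc[filt, ['instnm', 'ugds_white']].nlargest(2, 'ugds_white')
print(top2)
```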
