# 6. Standardizing `groupby`

pandas accepts several different syntaxes for the `groupby` method. I suggest choosing a single one so that all of your code looks the same.


In [1]:
import pandas as pd
pd.set_option('display.max_columns', 100)
college = pd.read_csv('data/college.csv')
college.head(5)

,instnm,city,stabbr,hbcu,menonly,womenonly,relaffil,satvrmid,satmtmid,distanceonly,ugds,ugds_white,ugds_black,ugds_hisp,ugds_asian,ugds_aian,ugds_nhpi,ugds_2mor,ugds_nra,ugds_unkn,pptug_ef,curroper,pctpell,pctfloan,ug25abv,md_earn_wne_p10,grad_debt_mdn_supp
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


## The three components of `groupby`

Typically, when calling the `groupby` method, you will be performing an aggregation. This is by far the most common scenario. When you perform an aggregation during a `groupby`, there will always be three components.

* **Grouping column** - Unique values form independent groups
* **Aggregating column** - Column whose values will get aggregated. Usually numeric
* **Aggregating function** - How the values will get aggregated (sum, min, max, mean, median, etc.)

![][1]

### Identify each component from the image above
* Grouping column - Dept
* Aggregating columns - salary, experience
* Aggregating functions - sum, average

All groupby aggregations will have these three components.

### My syntax of choice for `groupby`
pandas allows a few different syntaxes for performing a groupby aggregation. The following is the one I use.

```df.groupby('grouping column').agg({'aggregating column': 'aggregating function'})```
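As a minimal sketch, the template can be filled in with the three components identified from the image. The `emp` DataFrame and its values are invented for illustration, and pairing `salary` with `sum` and `experience` with `mean` is an assumption, since the image is not reproduced here.

```python
import pandas as pd

# Hypothetical employee data: column names follow the image, values are made up
emp = pd.DataFrame({
    'Dept': ['Sales', 'Sales', 'IT', 'IT', 'IT'],
    'salary': [50000, 60000, 80000, 90000, 100000],
    'experience': [2, 4, 5, 10, 15],
})

# Grouping column:       Dept
# Aggregating columns:   salary, experience
# Aggregating functions: sum (assumed for salary), mean (assumed for experience)
summary = emp.groupby('Dept').agg({'salary': 'sum', 'experience': 'mean'})
print(summary)
```

Each key in the dictionary names an aggregating column and each value names the function applied to it, so all three components are visible in a single line.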

[1]: images/sac.png

### A buffet of `groupby` for finding the maximum math SAT score per state
Below, we will go through several different syntaxes that all return the same result: the maximum math SAT score per state.

This is my preferred way, since it also handles more complex cases.

In [4]:
df = college[['stabbr', 'satmtmid', 'satvrmid', 'ugds']]
df.head()

,stabbr,satmtmid,satvrmid,ugds
0,AL,420.0,424.0,4206.0
1,AL,565.0,570.0,11383.0
2,AL,,,291.0
3,AL,590.0,595.0,5451.0
4,AL,430.0,425.0,4811.0


In [6]:
df_result = df.groupby('stabbr').agg({'satmtmid': 'max'})
df_result.head()

stabbr,satmtmid
AK,503.0
AL,590.0
AR,600.0
AS,
AZ,580.0


The aggregating column can be selected within brackets following the call to `groupby`.

In [7]:
college.groupby('stabbr')['satmtmid'].agg('max').head()

stabbr
AK    503.0
AL    590.0
AR    600.0
AS      NaN
AZ    580.0
Name: satmtmid, dtype: float64

`aggregate` is an alias for `agg`. Always use `agg`.

In [None]:
college.groupby('stabbr')['satmtmid'].aggregate('max').head()

You can call the aggregating method directly without calling `agg`.

In [None]:
college.groupby('stabbr')['satmtmid'].max().head()
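To check that these syntaxes really agree, here is a small sketch that runs all four on the same data. It uses a hypothetical `toy` DataFrame with invented values rather than the college dataset, so it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the college data; values are invented
toy = pd.DataFrame({
    'stabbr': ['AK', 'AL', 'AL', 'AR', 'AR'],
    'satmtmid': [503.0, 420.0, 590.0, None, 600.0],
})

# Dictionary syntax returns a one-column DataFrame
as_frame = toy.groupby('stabbr').agg({'satmtmid': 'max'})

# The bracket-selection variants each return a Series
via_agg = toy.groupby('stabbr')['satmtmid'].agg('max')
via_aggregate = toy.groupby('stabbr')['satmtmid'].aggregate('max')
via_method = toy.groupby('stabbr')['satmtmid'].max()

# Same values in every case; only the container type differs
assert via_agg.equals(via_aggregate)
assert via_agg.equals(via_method)
assert as_frame['satmtmid'].equals(via_agg)
```

The one practical difference is the return type: the dictionary syntax gives back a DataFrame, while selecting the column in brackets gives back a Series.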

### Major benefits of preferred syntax
I choose this syntax because it can handle more complex grouping problems. For instance, to find the minimum and maximum math and verbal SAT scores along with the average undergraduate population per state, we would do the following.

In [8]:
df = college.groupby('stabbr').agg({'satmtmid': ['min', 'max'],
                                    'satvrmid': ['min', 'max'],
                                    'ugds': 'mean'}).round(0)
df.head(10)

,satmtmid,satmtmid,satvrmid,satvrmid,ugds
stabbr,min,max,min,max,mean
AK,503.0,503.0,555.0,555.0,2493.0
AL,400.0,590.0,420.0,595.0,2790.0
AR,427.0,600.0,410.0,600.0,1644.0
AS,,,,,1276.0
AZ,480.0,580.0,485.0,565.0,4130.0
CA,441.0,785.0,435.0,765.0,3518.0
CO,424.0,680.0,475.0,635.0,2325.0
CT,430.0,750.0,425.0,755.0,1874.0
DC,445.0,710.0,430.0,710.0,2645.0
DE,430.0,605.0,430.0,585.0,2491.0


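The other syntaxes shown above cannot express this in a single call, because each one operates on a single selected column. The dictionary syntax covers both the simple and the complex cases, and if everyone on your team uses the same syntax, code becomes easier to read.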
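A multi-function aggregation like the one above returns columns with two header levels, one for the aggregating column and one for the aggregating function. The sketch below, again on a hypothetical `toy` DataFrame with invented values, shows how to read a value out of such a result and one possible way to flatten the headers. The underscore-joined names are my own convention, not something pandas produces.

```python
import pandas as pd

# Hypothetical stand-in for the college data; values are invented
toy = pd.DataFrame({
    'stabbr': ['AK', 'AL', 'AL'],
    'satmtmid': [503.0, 400.0, 590.0],
    'ugds': [2000.0, 1000.0, 3000.0],
})

result = toy.groupby('stabbr').agg({'satmtmid': ['min', 'max'],
                                    'ugds': 'mean'})

# Each column label is a (column, function) pair
print(result.columns.tolist())

# Select a single value with a tuple as the column key
al_max = result.loc['AL', ('satmtmid', 'max')]

# One possible way to flatten the two header levels into single strings
flat = result.copy()
flat.columns = ['_'.join(pair) for pair in result.columns]
print(flat)
```

Flattening is optional. It mainly helps when the result is written to a file or passed to code that expects plain string column names.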