# Grouping by Time and another Column

In the previous chapters, we learned how to group by an amount of time with `resample`. Let's do that again here, finding the average salary of employees hired within each 5-year span.

In [1]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'], index_col='hire_date')
emp.head(3)

hire_date,dept,title,salary,sex,race
2001-12-03,Police,POLICE SERGEANT,87545.38,Male,White
2010-11-15,Other,ASSISTANT CITY ATTORNEY II,82182.0,Male,Hispanic
2006-01-09,Houston Public Works,SENIOR SLUDGE PROCESSOR,49275.0,Male,Black


In [2]:
emp.resample('5Y').agg({'salary': 'mean'}).round(-3)

hire_date,salary
1968-12-31,
1973-12-31,65000.0
1978-12-31,78000.0
1983-12-31,74000.0
1988-12-31,71000.0
1993-12-31,69000.0
1998-12-31,68000.0
2003-12-31,63000.0
2008-12-31,60000.0
2013-12-31,57000.0


### Replicating `resample` with `groupby` + `Grouper`
The following syntax is a bit strange, so it may take a few readings to understand what is going on. You can group by time within the `groupby` method, but you must use a `pd.Grouper` object to specify the frequency (the offset alias).

### Specify the frequency
The main parameter for `Grouper` is `freq`. Set it to the offset alias. I like using the variable name `tg`, which stands for 'time grouper'.

In [3]:
tg = pd.Grouper(freq='5Y')

### Think of `pd.Grouper` like a dictionary that holds information
The main use case for this object is to pass it into `groupby` (or into `pivot_table`, as shown later in this chapter). It might be easier to just think of it as a dictionary that holds the frequency. Once we pass it to `groupby`, we can aggregate like we normally do and get the same result as we did with `resample`.

In [4]:
emp.groupby(tg).agg({'salary':'mean'}).round(-3)

hire_date,salary
1968-12-31,
1973-12-31,65000.0
1978-12-31,78000.0
1983-12-31,74000.0
1988-12-31,71000.0
1993-12-31,69000.0
1998-12-31,68000.0
2003-12-31,63000.0
2008-12-31,60000.0
2013-12-31,57000.0
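
To make the equivalence concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its values are hypothetical and not taken from the employee dataset. It also passes a `pd.offsets.YearEnd(5)` offset object instead of the `'5Y'` string, on the assumption that the object form is the safer choice because the spelling of the year-end alias string differs between pandas versions.

```python
import pandas as pd

# Hypothetical toy data standing in for the employee table
toy = pd.DataFrame(
    {'salary': [50000.0, 60000.0, 70000.0, 80000.0]},
    index=pd.to_datetime(['2001-03-01', '2003-07-15', '2007-01-09', '2012-11-30']),
)
toy.index.name = 'hire_date'

# Offset object for a 5-year span ending at year end (assumed equivalent to '5Y')
five_years = pd.offsets.YearEnd(5)

# Group by time with resample
via_resample = toy.resample(five_years).agg({'salary': 'mean'})

# Group by time with groupby + Grouper
tg = pd.Grouper(freq=five_years)
via_grouper = toy.groupby(tg).agg({'salary': 'mean'})

# Compare the two results
print(via_resample.equals(via_grouper))
```

Both calls hand the same frequency to pandas, so the bins and the aggregated values should match.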


## Grouping by an amount of time and another column

There are two different ways to group by time and another column. The difference is subtle but can change the result. The datetime column and the other column can either be grouped **together** or grouped **independently**. Let's say we wanted to find the average salary over 5-year time periods for each sex.

### Group together

To group sex and a 5-year time span together, we must use `groupby`. We simply pass a list containing both the `Grouper` object and the column name to `groupby`.

In [5]:
tg = pd.Grouper(freq='5Y')
groups = ['sex', tg]
emp.groupby(groups).agg({'salary':'mean'}).round(-3)

sex,hire_date,salary
Female,1973-12-31,59000.0
Female,1978-12-31,76000.0
Female,1983-12-31,63000.0
Female,1988-12-31,63000.0
Female,1993-12-31,62000.0
Female,1998-12-31,60000.0
Female,2003-12-31,58000.0
Female,2008-12-31,57000.0
Female,2013-12-31,56000.0
Female,2018-12-31,48000.0


### Datetimes are the same
Notice how the datetimes for both the female and male groups are the same. This will not be the case when we group independently further below.

In [6]:
emp.query('sex == "Male"').index.min()

Timestamp('1968-12-13 00:00:00')

In [7]:
emp

hire_date,dept,title,salary,sex,race
2001-12-03,Police,POLICE SERGEANT,87545.38,Male,White
2010-11-15,Other,ASSISTANT CITY ATTORNEY II,82182.00,Male,Hispanic
2006-01-09,Houston Public Works,SENIOR SLUDGE PROCESSOR,49275.00,Male,Black
1997-05-27,Police,SENIOR POLICE OFFICER,75942.10,Male,Hispanic
2006-01-23,Police,SENIOR POLICE OFFICER,69355.26,Male,White
...,...,...,...,...,...
2001-12-03,Police,SENIOR POLICE OFFICER,75942.10,Male,Black
2016-03-28,Other,SENIOR PROCUREMENT SPECIALIST,76175.00,Female,Black
2015-09-14,Houston Public Works,WATER SERVICE INSPECTOR I,35173.00,Male,Black
2008-05-19,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,67198.00,Female,Black


### Group independently
To group independently, we first group by the non-datetime column with the `groupby` method. The resulting GroupBy object has a `resample` method, which lets you then group by an amount of time **within** the groups you just created. You use it just as if it were called from a DataFrame. Notice how the hire dates for males and females are different.

In [8]:
emp.groupby('sex').resample('5Y').agg({'salary':'mean'}).round(-3)

sex,hire_date,salary
Female,1969-12-31,75000.0
Female,1974-12-31,43000.0
Female,1979-12-31,67000.0
Female,1984-12-31,61000.0
Female,1989-12-31,62000.0
Female,1994-12-31,63000.0
Female,1999-12-31,58000.0
Female,2004-12-31,58000.0
Female,2009-12-31,57000.0
Female,2014-12-31,55000.0


### Different results
It's important to see that you will get different results depending on whether you group together or group independently. The results differ because the earliest male and female hire dates are not separated by an exact multiple of 5 years. The earliest hire date for female employees is in 1969, while it is in 1968 for males. If the first male and female employees had both been hired in 1968 (or both in 1969), the returned datetime index would have been the same.
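
The difference can be reproduced on a tiny example. The sketch below uses hypothetical data in which the first male is hired in 1968 and the first female in 1969, and again assumes `pd.offsets.YearEnd(5)` as a stand-in for the `'5Y'` alias. It selects the `salary` column before resampling, which is a slight variation on the calls above.

```python
import pandas as pd

# Hypothetical data: earliest male hire in 1968, earliest female hire in 1969
toy = pd.DataFrame(
    {
        'sex': ['Male', 'Female', 'Male', 'Female'],
        'salary': [50000.0, 60000.0, 70000.0, 80000.0],
    },
    index=pd.to_datetime(['1968-12-13', '1969-06-01', '1975-02-01', '1976-08-20']),
)
toy.index.name = 'hire_date'

five_years = pd.offsets.YearEnd(5)  # assumed equivalent of the '5Y' alias

# Grouped together: one set of bins, anchored on the earliest hire date overall
together = toy.groupby(['sex', pd.Grouper(freq=five_years)])['salary'].mean()

# Grouped independently: each sex gets bins anchored on its own earliest hire date
independent = toy.groupby('sex')['salary'].resample(five_years).mean()

# Bin labels for the Female group under each approach
female_together = together.loc['Female'].index
female_independent = independent.loc['Female'].index
print(female_together)
print(female_independent)
```

Printing the two indexes side by side shows whether the Female bins line up with the shared bins or start from the first female hire date.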

## Using a pivot table with `Grouper` for easier comparisons

You can pass a `Grouper` object to a pivot table to get a nice final product. This groups sex together with time.

In [12]:
emp.pivot_table(index=tg, columns='sex', values='salary').round(-3)

sex,Female,Male
hire_date,,
1973-12-31,59000.0,70000.0
1978-12-31,76000.0,79000.0
1983-12-31,63000.0,76000.0
1988-12-31,63000.0,74000.0
1993-12-31,62000.0,73000.0
1998-12-31,60000.0,71000.0
2003-12-31,58000.0,65000.0
2008-12-31,57000.0,62000.0
2013-12-31,56000.0,57000.0
2018-12-31,48000.0,48000.0
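
One way to check that the pivot table groups sex *together* with time is to compare it with a grouped-together `groupby` whose `sex` level has been unstacked into columns. The following is a sketch on hypothetical data, with `pd.offsets.YearEnd(5)` assumed as a stand-in for the `'5Y'` alias.

```python
import pandas as pd

# Hypothetical toy data, not taken from the employee dataset
toy = pd.DataFrame(
    {
        'sex': ['Male', 'Female', 'Male', 'Female'],
        'salary': [50000.0, 60000.0, 70000.0, 80000.0],
    },
    index=pd.to_datetime(['1968-12-13', '1969-06-01', '1975-02-01', '1976-08-20']),
)
toy.index.name = 'hire_date'

tg = pd.Grouper(freq=pd.offsets.YearEnd(5))  # assumed equivalent of '5Y'

# Pivot table with the time grouper as the index
wide = toy.pivot_table(index=tg, columns='sex', values='salary', aggfunc='mean')

# Same grouping done with groupby, then reshaped so sex becomes the columns
unstacked = toy.groupby([tg, 'sex'])['salary'].mean().unstack('sex')

print(wide)
print(unstacked)
```

Because both results share a single set of time bins, each row lets you compare the Female and Male averages for the same span directly.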


### Using `Grouper` on a datetime column
If your datetime column is not in the index, you can still use `Grouper`. Just specify the column name with the `key` parameter. See the example below with `hire_date` not in the index.

In [13]:
emp2 = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp2.head(3)

,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


In [14]:
tg2 = pd.Grouper(freq='10Y', key='hire_date')
emp2.groupby(['sex', tg2]).agg({'salary':'mean'})

sex,hire_date,salary
Female,1978-12-31,72408.3
Female,1988-12-31,62609.175556
Female,1998-12-31,60755.045322
Female,2008-12-31,57086.636143
Female,2018-12-31,50754.220978
Male,1968-12-31,
Male,1978-12-31,78184.825833
Male,1988-12-31,74938.171306
Male,1998-12-31,71673.462248
Male,2008-12-31,63441.740493
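
As a quick sanity check, grouping with `key` on a column should give the same result as moving that column into the index first and using a `Grouper` without `key`. The sketch below uses hypothetical data and assumes `pd.offsets.YearEnd(10)` as a stand-in for the `'10Y'` alias.

```python
import pandas as pd

# Hypothetical toy data with hire_date as an ordinary column
toy2 = pd.DataFrame(
    {
        'hire_date': pd.to_datetime(['1995-04-01', '1999-09-12', '2003-02-20', '2011-01-20']),
        'sex': ['Female', 'Male', 'Female', 'Male'],
        'salary': [40000.0, 50000.0, 60000.0, 70000.0],
    }
)

ten_years = pd.offsets.YearEnd(10)  # assumed equivalent of the '10Y' alias

# Grouper pointed at the hire_date column through the key parameter
by_key = toy2.groupby(['sex', pd.Grouper(key='hire_date', freq=ten_years)])['salary'].mean()

# Same grouping after moving hire_date into the index
by_index = (
    toy2.set_index('hire_date')
    .groupby(['sex', pd.Grouper(freq=ten_years)])['salary']
    .mean()
)

print(by_key)
print(by_index)
```

Using `key` saves the `set_index` step and leaves the original DataFrame's index untouched.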


## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the energy consumption dataset. Find the average energy consumption per sector per 10-year time span, beginning from the first year of data. Return the results using both a groupby and a pivot table. Experiment with adding 'S' to the end of your offset alias. How does this change the results?</span>