# Grouping by Time and Another Column

In this chapter, we take a look at a special scenario where we group together periods of time alongside another column. We'll use the employee dataset, which does not contain typical time series data, but does allow us to group the hire date along with other columns. Let's begin by reading it in, putting the `'hire_date'` column in the index and sorting it.

In [None]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'], 
                  index_col='hire_date').sort_index()
emp.head()

As a review, let's find the average salary and number of employees for each ten-year period using the `groupby` method.

In [None]:
emp.groupby(pd.Grouper(freq='10YS')).agg({'salary': ['mean', 'size']}).round()

## Grouping by an amount of time and another column

There are two different ways to group by time and another column. The difference is subtle but can change the result. The datetime column and the other column can either be grouped **together** or grouped **independently**. Let's say we wanted to find the average salary over ten-year time periods for each sex.

### Group together

To group sex and a ten-year time span together, we use `groupby`. Pass a list containing both the `Grouper` object and the column name to the `groupby` method.

In [None]:
tg = pd.Grouper(freq='10YS')
groups = ['sex', tg]
emp.groupby(groups).agg({'salary':['mean', 'size']}).round()

### Datetimes are the same

Notice how the datetimes for the female and male groups are the same. This will not be the case below.

### Group independently

To group independently, we first group by the non-datetime column with the `groupby` method. The resulting GroupBy object has a `resample` method that then groups by an amount of time **within** each of the groups just created. You use it as if it were called from a DataFrame. Notice how the hire dates for males and females are different.

In [None]:
emp.groupby('sex').resample('10YS').agg({'salary':['mean', 'size']}).round()

### Different results

It's important to see that you will get different results depending on whether you group together or group independently. The results differ because the earliest male and female employees don't have a hire date in the same year. The earliest hire date for female employees is 1969, while it is 1968 for males. If the first male and female employees had both been hired in 1968 (or 1969), then the returned datetime index would have been the same.
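
The difference can be reproduced on a tiny made-up dataset. The sketch below does not use the employee data: the `toy` DataFrame, its four hire dates, and its salaries are invented purely for illustration, with the first male hired in 1968 and the first female in 1969.

```python
import pandas as pd

# Hypothetical toy data (not the employee dataset)
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "salary": [50_000, 60_000, 70_000, 80_000],
    },
    index=pd.DatetimeIndex(
        ["1968-03-01", "1969-06-15", "1980-01-10", "1981-09-30"],
        name="hire_date",
    ),
)

# Grouped together: one set of ten-year bins shared by both sexes
together = toy.groupby(["sex", pd.Grouper(freq="10YS")])["salary"].mean()

# Grouped independently: each sex gets bins based on its own first hire date
independent = toy.groupby("sex")["salary"].resample("10YS").mean()

print(together)
print(independent)
```

When grouped together, the bins for both sexes begin at the start of 1968, the year of the earliest hire date overall. When grouped independently, the female bins begin at the start of 1969 instead, because that is the year of the earliest female hire date.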

## Using a pivot table with `Grouper` for easier comparisons

You can pass a `Grouper` object to a pivot table to get a nice final product. This groups sex together with time.

In [None]:
emp.pivot_table(index=tg, columns='sex', values='salary', aggfunc=['mean', 'size']).round()
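
As a rough check, the pivot table should hold the same numbers as grouping together with `groupby` and then moving `sex` into the columns with `unstack`. The sketch below compares the two on an invented `toy` DataFrame (hypothetical values, not the employee data).

```python
import pandas as pd

# Hypothetical toy data (not the employee dataset)
toy = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "salary": [50_000.0, 60_000.0, 70_000.0, 80_000.0],
    },
    index=pd.DatetimeIndex(
        ["1968-03-01", "1969-06-15", "1980-01-10", "1981-09-30"],
        name="hire_date",
    ),
)

# Pivot table with a Grouper in the index
pivoted = toy.pivot_table(
    index=pd.Grouper(freq="10YS"), columns="sex", values="salary", aggfunc="mean"
)

# Group time and sex together, then pivot sex into the columns
unstacked = (
    toy.groupby([pd.Grouper(freq="10YS"), "sex"])["salary"].mean().unstack("sex")
)

print(pivoted)
print(unstacked)
```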

## Rolling windows within a group

Rolling window calculations within a group are also possible. To show this, we need time series data where each date appears only once. In the employee dataset, multiple people may be hired on the same date. We begin by grouping by department and week, returning the size of each group. This represents the number of employees hired in each week for that department.

In [None]:
emp.groupby(['dept', pd.Grouper(freq='W')]).size().head()

We'll move the department out of the index and give a name to the values.

In [None]:
df = emp.groupby(['dept', pd.Grouper(freq='W')]).size()
df = df.reset_index('dept', name='size')
df.head()
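
The `reset_index` call above is a Series method: its first argument selects the index level to move into a column, and `name` labels the column that holds the Series values. Here is a minimal sketch on an invented Series (the `counts` name, the departments, and the numbers are made up for illustration).

```python
import pandas as pd

# Hypothetical Series with a two-level index, shaped like the groupby result
idx = pd.MultiIndex.from_tuples(
    [
        ("Fire", pd.Timestamp("2020-01-05")),
        ("Fire", pd.Timestamp("2020-01-12")),
        ("Police", pd.Timestamp("2020-01-05")),
    ],
    names=["dept", "hire_date"],
)
counts = pd.Series([3, 1, 2], index=idx)

# Move only 'dept' into a column; the values become a column called 'size'
weekly = counts.reset_index("dept", name="size")
print(weekly)
```

The `hire_date` level stays in the index, which is what the time-based rolling window needs later on.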

The tail of the DataFrame has the most recent weeks of hire dates for the last department.

In [None]:
df.tail()

Using this proper time series data, let's find the total number hired in each department over a rolling 6-week period.

In [None]:
df.groupby('dept').rolling('42D')['size'].sum().tail(10)
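
To see what the time-based window does within each group, here is a small sketch on invented weekly counts (the `weekly` DataFrame and its numbers are hypothetical, not derived from the employee data). With the default settings, a `'42D'` window at a given date covers the preceding 42 days up to and including that date, and it never reaches across departments.

```python
import pandas as pd

# Hypothetical weekly hire counts, sorted by date within each department
weekly = pd.DataFrame(
    {
        "dept": ["A", "A", "A", "B", "B"],
        "size": [1, 2, 3, 10, 20],
    },
    index=pd.DatetimeIndex(
        ["2020-01-05", "2020-01-12", "2020-03-01", "2020-01-05", "2020-01-19"],
        name="hire_date",
    ),
)

# Rolling 42-day sum computed separately for each department
rolled = weekly.groupby("dept").rolling("42D")["size"].sum()
print(rolled)
```

For department A, the value on 2020-03-01 counts only that week's hires, because the two January dates are more than 42 days earlier. Department B's window on 2020-01-19 includes both of its rows.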

## Exercises

Execute the following cell to read in the energy consumption dataset.

In [None]:
energy = pd.read_csv('../data/energy_consumption.csv', parse_dates=['date'], 
                     index_col='date')
energy.head()

### Exercise 1

<span style="color:green; font-size:16px">Find the average energy consumption per sector per 10-year time span beginning from the first year of data. Return the results using both `groupby` and `pivot_table`.</span>

### Use the bikes dataset for the remaining exercises

Execute the following cell to read in the bikes dataset. Note that it does NOT set the index to be a datetime.

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

### Exercise 2

<span style="color:green; font-size:16px">Filter the data so that it only contains rows from the five most frequent `from_station_name` values. Then find the mean temperature at every station for every quarter. Present the result as a pivot table.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the number of rides per day from each `from_station_name`.</span>

### Exercise 4

<span style="color:green; font-size:16px">Reset the `from_station_name` index level from the solution to Exercise 3, then compute the number of rides in a 100-day rolling window for each `from_station_name`.</span>