# Grouping by Time

In previous chapters, we learned how to select a single period of time series data. In this chapter, we will place each row into a group based on its time period and then perform an operation on each group. For example, we will find the average closing price of a stock for every month. This type of analysis is similar to the material presented in the **Grouping Data** part, except that instead of grouping by the unique values in a particular column, we group by time periods. Let's begin by reading in our stock dataset.

In [None]:
import pandas as pd
df = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], 
                 index_col='date')
df.head(3)

## Grouping with the `resample` method

The `resample` method is available to group by particular time periods. It's actually possible to use the `groupby` method to get the same result, but we will begin with `resample`, as it is a bit simpler and was built just for this purpose.

### Find the average closing price of Amazon for every month

If we are interested in finding the average closing price of Amazon for every month, then we need to group by month and aggregate the closing price with the mean function.

### Grouping column, aggregating column, and aggregating method

This procedure is very similar to how we grouped and aggregated columns in the Groupby chapters. The only difference is that our grouping column will now be the datetime index. The syntax is similar to the `groupby` method. Pass the `resample` method an [offset alias][1] to determine the grouping time period. As with `groupby`, calling the `resample` method does not produce a result by itself; it only informs pandas how to create the groups. You must take action on these groups by chaining a method to it. Here, we chain the `agg` method to perform an aggregation and name the resulting column.

```python
df.resample('offset alias').agg(new_column=('aggregating column', 'aggregating function'))
```

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Here, we use the offset alias `'M'` to group by month end and then choose to aggregate the AMZN and WMT columns with the mean and median, respectively.

In [None]:
df.resample('M').agg(AMZN_mean=('AMZN', 'mean'),
                     WMT_median=('WMT', 'median')).head(3)
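To see the pattern on data small enough to check by hand, here is a minimal sketch on a synthetic DataFrame. The column names `AAA` and `BBB`, the dates, and the values are made up for illustration and are not part of the stocks dataset. The month-start alias `'MS'` is used because the spelling of the month-end alias differs between pandas releases.

```python
import numpy as np
import pandas as pd

# Synthetic daily "prices" for the first quarter of 2020 (hypothetical data)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
toy = pd.DataFrame({'AAA': np.arange(len(dates), dtype='float64'),
                    'BBB': np.arange(len(dates), dtype='float64') * 2},
                   index=dates)

# Group by calendar month, then name each aggregated column explicitly
monthly = toy.resample('MS').agg(AAA_mean=('AAA', 'mean'),
                                 BBB_median=('BBB', 'median'))
print(monthly)
```

Each keyword argument becomes a column of the result, and each tuple pairs the aggregating column with the aggregating function.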

### Other `resample` syntax

All the other groupby aggregation syntaxes that we covered previously are available with `resample`. We replicate the result from above using a dictionary to map the aggregating column to the aggregating function.

In [None]:
df.resample('M').agg({'AMZN': 'mean', 'WMT': 'median'}).head(3)

### Map each column name to a list of aggregations

Compute multiple aggregations per column by using a list as the values part of the dictionary passed to the `agg` method.

In [None]:
df.resample('M').agg({'AMZN': ['size', 'min', 'mean', 'max'], 
                      'WMT': ['max']}).head(3)
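Mapping columns to lists of aggregations produces two-level column labels. The sketch below, on a synthetic DataFrame with made-up column names `AAA` and `BBB`, shows what those labels look like and one way to pull out a single statistic.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for the first quarter of 2020 (hypothetical)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
toy = pd.DataFrame({'AAA': np.arange(len(dates), dtype='float64'),
                    'BBB': np.arange(len(dates), dtype='float64') * 2},
                   index=dates)

stats = toy.resample('MS').agg({'AAA': ['size', 'min', 'mean', 'max'],
                                'BBB': ['max']})

# The columns are (original column, aggregation) pairs
print(stats.columns.tolist())

# Select one statistic with a tuple
aaa_max = stats[('AAA', 'max')]
```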

### Aggregation methods

All the normal DataFrame aggregations are available directly as methods and will perform their aggregation on each column. Here, the mean of all columns for each month is taken.

In [None]:
df.resample('M').mean().tail(3).round(1)

The `size` method returns the total number of rows per group. Since this number is the same per column, a Series is returned.

In [None]:
df.resample('M').size().head()

The `count` method returns the number of non-missing values for each time period per column. Notice that some of the stocks did not exist in 1999.

In [None]:
df.resample('M').count().head(3)
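The difference between `size` and `count` is easiest to see when one column has missing values. This is a small sketch on synthetic data, where the column `BBB` is assumed (for illustration only) to have no values before January 21.

```python
import numpy as np
import pandas as pd

# Two months of synthetic daily data (hypothetical)
dates = pd.date_range('2020-01-01', '2020-02-29', freq='D')
toy = pd.DataFrame({'AAA': 1.0, 'BBB': 1.0}, index=dates)

# Pretend BBB has no data for the first 20 days of January
toy.loc[:'2020-01-20', 'BBB'] = np.nan

rows_per_month = toy.resample('MS').size()     # Series: rows in each group
values_per_month = toy.resample('MS').count()  # DataFrame: non-missing values per column
print(rows_per_month)
print(values_per_month)
```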

## Grouping by different time periods

Let's look at several more offset aliases, beginning with `'W'`, which groups by weeks ending on Sunday.

In [None]:
df.resample('W').mean().head().round(1)

Grouping by quarter end.

In [None]:
df.resample('Q').mean().head().round(1)

Grouping by year end.

In [None]:
df.resample('Y').mean().round(1)

In [None]:
df.resample('Y').mean().head().round(1)

### Use start of period instead of end as label for group

The single-character offset aliases `'Y'`, `'Q'`, `'M'`, and `'W'` all use the end of the period as the index label for the group. For year, quarter, and month, appending the character `'S'` (`'YS'`, `'QS'`, `'MS'`) groups by the same span of time but uses the start of the period as the label. Here, we group by year start.

In [None]:
df.resample('YS').mean().head().round(1)

Month start is used below.

In [None]:
df.resample('MS').mean().head(3).round(1)
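The following sketch compares end and start labels side by side on a synthetic Series. It passes offset objects from `pd.offsets` to `resample` instead of alias strings, as an assumption made for portability, since the spelling of the end-of-period aliases has changed across pandas releases. `pd.offsets.MonthEnd()` plays the role of the month-end alias and `pd.offsets.MonthBegin()` the role of `'MS'`.

```python
import pandas as pd

# One value per day for the first quarter of 2020 (hypothetical data)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
toy = pd.Series(1, index=dates)

# Same calendar months, different labels
by_month_end = toy.resample(pd.offsets.MonthEnd()).sum()
by_month_start = toy.resample(pd.offsets.MonthBegin()).sum()

print(by_month_end.index.strftime('%Y-%m-%d').tolist())
print(by_month_start.index.strftime('%Y-%m-%d').tolist())
```

Both results contain the same values; only the index labels differ.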

### Grouping by anchored offset aliases

Year, quarter, and week can all be anchored to a different month or day of the week. Here, we group by quarter, where the quarter-end months are February, May, August, and November.

In [None]:
df.resample('Q-Feb').mean().head(4).round(1)

Here, each year is defined to run from July 1 through June 30. Note that, in the pandas version used in this chapter, you must use `'A'` and not `'Y'` as the offset alias when anchoring the year.

In [None]:
df.resample('A-Jun').mean().head().round(1)
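Anchored periods can be verified on synthetic data by counting the rows in each group. In this sketch, `pd.offsets.QuarterEnd(startingMonth=2)` and `pd.offsets.YearEnd(month=6)` are used as stand-ins for the anchored aliases above. That substitution is an assumption made so the snippet does not depend on how the aliases are spelled in a given pandas release.

```python
import pandas as pd

# Two full calendar years of synthetic daily data (hypothetical)
dates = pd.date_range('2020-01-01', '2021-12-31', freq='D')
toy = pd.Series(1, index=dates)

# Quarters ending in February, May, August, and November
quarters = toy.resample(pd.offsets.QuarterEnd(startingMonth=2)).size()

# Years running from July 1 through June 30
years = toy.resample(pd.offsets.YearEnd(month=6)).size()

print(quarters.head(4))
print(years)
```

The first and last groups are partial because the data begins and ends in the middle of an anchored period.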

## Grouping by more than one consecutive offset alias period

As we've learned, it's possible to place an integer before the offset alias to represent consecutive time periods. Here, we group by two consecutive months.

In [None]:
df.resample('2M').size().head()

The `size` method was chosen on purpose to draw attention to the first time period, which spans September 1 to October 31, 1999. Although it does cover two months, the result is probably not intuitive. The very first row of data is dated October 25, 1999, so you might expect the first time period to start on October 1, 1999 and end on November 30, 1999. The remaining groups are also two-month periods, but it is this first group that often confuses users. For the time span to begin in the first month of actual data, you must use a month-start offset alias, which is exactly what we do below.

In [None]:
df.resample('2MS').size().head()

When a month-end alias is used, the first time period (confusingly, in my opinion) always ends in the month of the first row. Here, we group 5 consecutive months at a time.

In [None]:
df.resample('5M').size().head()

Switching the offset alias to use month start, we get the more intuitive result. 

In [None]:
df.resample('5MS').size().head()

The same rule applies when grouping by multiple years. Here we group together two years using the end-of-year offset alias. The first time period spans from January 1, 1998 to December 31, 1999.

In [None]:
df.resample('2A').size().head(3)

Using the start-of-year offset alias, the first time period begins on January 1, 1999 and ends on December 31, 2000.

In [None]:
df.resample('2AS').size().head(3)
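The first-group behavior described in this section can be checked on a small synthetic Series whose first row, like the stocks data, falls on October 25, 1999. This is a sketch under stated assumptions: the data is made up (one row per calendar day), and the offset objects `pd.offsets.MonthEnd(2)` and `pd.offsets.MonthBegin(2)` stand in for the two-month end and start aliases.

```python
import pandas as pd

# Synthetic daily data starting on October 25, 1999 (hypothetical)
dates = pd.date_range('1999-10-25', '2000-03-31', freq='D')
toy = pd.Series(1, index=dates)

# Two-month groups labeled by period end: the first group ends in October
end_labeled = toy.resample(pd.offsets.MonthEnd(2)).size()

# Two-month groups labeled by period start: the first group begins in October
start_labeled = toy.resample(pd.offsets.MonthBegin(2)).size()

print(end_labeled.head(2))
print(start_labeled.head(2))
```

With end labels, the first group holds only the 7 days from October 25 through October 31. With start labels, it holds the 37 days from October 25 through November 30.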

## Grouping by time with the `groupby` method

Grouping by time is also possible with the `groupby` method. Instead of passing the offset alias directly to the method, you need to pass it to the `pd.Grouper` constructor by setting the `freq` parameter. It technically creates a `TimeGrouper` object, which you can think of as a dictionary containing information on how the time periods will be grouped. Here, we tell pandas to group by month end.

In [None]:
tg = pd.Grouper(freq='M')
type(tg)

Pass this newly created object to the `groupby` method and then finish the aggregation as usual.

In [None]:
df.groupby(tg).mean().round(1).head()

It's uncommon to assign the result of `pd.Grouper` to a variable name. You can pass it directly to `groupby`. All the normal functionality is available when using `groupby`.

In [None]:
(df.groupby(pd.Grouper(freq='4MS'))
   .agg(mean_msft=('MSFT', 'mean'), 
        max_slb=('SLB', 'max'), 
        obs=('SLB', 'size'))
   .head(3))

### Choosing between `resample` and `groupby` with `pd.Grouper`

Because the `groupby` method has more methods you can chain to it (and more options within those methods) than `resample`, you may want to use it when grouping by time. For example, selecting the first two rows from every four-month period is only possible using `groupby`. The first `head` method below is applied to each group. The last `head` method works on the entire DataFrame to shorten the output.

In [None]:
df.groupby(pd.Grouper(freq='4MS')).head(2).head(6)

Attempting to chain the `head` method to `resample` raises an error because the object returned by `resample` has no `head` method.

In [None]:
df.resample('4MS').head(2)
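As a quick check of the two approaches, the sketch below computes monthly means both ways on a synthetic DataFrame (the column name `AAA` and its values are made up) and then uses the group-wise `head` that is available through `groupby`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data for the first quarter of 2020 (hypothetical)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D')
toy = pd.DataFrame({'AAA': np.arange(len(dates), dtype='float64')}, index=dates)

# The same monthly means, computed two ways
via_resample = toy.resample('MS').mean()
via_groupby = toy.groupby(pd.Grouper(freq='MS')).mean()

# First two rows of every month; the original daily index is kept
first_two = toy.groupby(pd.Grouper(freq='MS')).head(2)
print(first_two)
```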

## Calling `resample` on a datetime column

By default, the `resample` method works on DataFrames with datetimes, timedeltas, or periods in the index. It is possible to make it work on DataFrames that have these values in a column and not in the index. Let's place our current index as the first column by calling the `reset_index` method.

In [None]:
df2 = df.reset_index()
df2.head(3)

Specify the column to be grouped with the `on` parameter. The result is exactly the same as when the dates are in the index.

In [None]:
(df2.resample('W-WED', on='date')
    .agg({'AMZN': ['size', 'min']})
    .head(3))

To achieve the same result with `groupby`, set the `key` parameter within `pd.Grouper` to the column to be grouped.

In [None]:
(df2.groupby(pd.Grouper(freq='QS', key='date'))
    .agg({'XOM': 'max', 'SLB': 'min'})
    .head())
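The claim that all three spellings agree can be verified directly. This is a minimal sketch on synthetic data, where the column name `AAA` is made up and `date` mirrors the column name used above. It groups by weeks ending on Wednesday using the dates in the index, in a column via `on`, and in a column via `pd.Grouper` with `key`.

```python
import numpy as np
import pandas as pd

# Synthetic daily data with the dates in the index (hypothetical)
dates = pd.date_range('2020-01-01', '2020-03-31', freq='D', name='date')
indexed = pd.DataFrame({'AAA': np.arange(len(dates), dtype='float64')}, index=dates)

# Move the dates into an ordinary column
flat = indexed.reset_index()

from_index = indexed.resample('W-WED')['AAA'].max()
from_column = flat.resample('W-WED', on='date')['AAA'].max()
from_grouper = flat.groupby(pd.Grouper(freq='W-WED', key='date'))['AAA'].max()

print(from_column.head(3))
```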

## Calling `resample` on a Series

Above, we called `resample` on a DataFrame. We can also use it on a Series. Let's select Amazon's closing price as a Series.

In [None]:
amzn_close = df['AMZN']
amzn_close.head(3)

For a Series, the aggregating column is just the values. It's not necessary to use the `agg` method in order to aggregate. Instead, we can call aggregation methods directly. Here, we find the mean closing price by month.

In [None]:
amzn_close.resample('M').mean().head()

To compute multiple aggregations, use the `agg` method and pass it a list of the aggregating functions as strings. Here, we find the total number of trading days along with the minimum and maximum closing price for every three-year period.

In [None]:
amzn_close.resample('3AS').agg(['size', 'min', 'max']).head(3)

The `groupby` method is also available for Series.

In [None]:
amzn_close.groupby(pd.Grouper(freq='3AS')).agg(['size', 'min', 'max']).head(3)
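Here is the same idea on a synthetic Series that is small enough to verify by hand. The Series name `AAA` and its values are made up, and the quarter-start alias `'QS'` is used in place of the multi-year alias purely to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices for the first half of 2020 (hypothetical)
dates = pd.date_range('2020-01-01', '2020-06-30', freq='D')
close = pd.Series(np.arange(len(dates), dtype='float64'), index=dates, name='AAA')

# A list of aggregations on a Series returns a DataFrame, one column per function
quarterly = close.resample('QS').agg(['size', 'min', 'max'])
same_via_groupby = close.groupby(pd.Grouper(freq='QS')).agg(['size', 'min', 'max'])

print(quarterly)
```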

## Exercises

Execute the following cell that reads in 20 years of Microsoft stock data and use it for the first few exercises.

In [None]:
msft = pd.read_csv('../data/stocks/msft20.csv', parse_dates=['date'], index_col='date')
msft.head(3)

### Exercise 1

<span style="color:green; font-size:16px">In which week did MSFT have the greatest number of its shares (volume) traded?</span>

### Exercise 2

<span style="color:green; font-size:16px">With help from the `diff` method, find the quarter containing the greatest number of "up" days. An up day is when the adjusted close of the current day is greater than that of the previous day.</span>

### Exercise 3

<span style="color:green; font-size:16px">Find the mean price per year along with the minimum and maximum volume.</span>

### Exercise 4

<span style="color:green; font-size:16px">Find the mean of each column for every 6 month time period. The first time period should start on the month in the first row.</span>

### Exercise 5

<span style="color:green; font-size:16px">Repeat exercise 4 using a time span of 3 years where the year begins July 1.</span>

### Exercise 6

<span style="color:green; font-size:16px">Repeat exercise 5 using the `groupby` method instead of `resample`.</span>

### Use the temperature dataset for the remaining exercises

Execute the following cell to read in the temperature dataset, which places the `datetime` column in the index.

In [None]:
temp = pd.read_csv('../data/weather/temperature.csv', 
                   parse_dates=['datetime'], index_col='datetime')
temp.head()

### Exercise 7

<span style="color:green; font-size:16px">Find the mean temperature of every city for every 8 hour time period.</span>

### Exercise 8

<span style="color:green; font-size:16px">Verify that there are 24 rows for each day.</span>

### Exercise 9

<span style="color:green; font-size:16px">For each month, return the maximum temperature amongst all cities.</span>

### Exercise 10

<span style="color:green; font-size:16px">For each month, return the maximum temperature amongst all cities along with the city name where the maximum occurred. Return a two-column DataFrame, where the first column is the maximum temperature, and the second is the city. The index should be the month.</span>