# Grouping by Time

In previous notebooks, we learned how to downsample/upsample time series data. In this notebook, we will group spans of time together to get a result. For instance, we can find out the number of up or down days for a stock within each trading month, or calculate the number of flights per day for an airline. pandas gives you the ability to group by a period of time. Let's begin by reading in our stock dataset.

In [None]:
import pandas as pd
stocks = pd.read_csv('../data/stocks/stocks10.csv', parse_dates=['date'], index_col='date')
stocks.head(3)

### Find the average closing price of Amazon for every month
If we are interested in finding the average closing price of Amazon for every month, then we need to group by month and aggregate the closing price with the mean function.

### Grouping column, aggregating column, and aggregating method
This procedure is very similar to how we grouped and aggregated columns in previous notebooks. The only difference is that our grouping column will now be a datetime column with an additional specification for the amount of time.

### Use the `resample` method
Instead of the `groupby` method, we use a special method for grouping spans of time called `resample`. We must pass the `resample` method an offset alias string. The rest of the process is exactly the same as with the `groupby` method. We call the `agg` method and pass it a dictionary mapping the aggregating columns to the aggregating functions.

### `resample` syntax

The first parameter we pass to `resample` is the [offset alias][1]. Here, we choose to group by month. We then chain the `agg` method and must use one of the alternative syntaxes as the pandas developers have not yet implemented column renaming for the `resample` method.

[1]: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

In [None]:
stocks.resample('M').agg({'AMZN': 'mean'}).head(3)
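The pattern above can be tried without the CSV file. The following is a minimal sketch on made-up data (the `toy` DataFrame and its values are hypothetical). It groups by week rather than by month because the `'W'` alias is spelled the same way across pandas versions, and it compares `resample` with a `groupby` that uses `pd.Grouper` to show that the two follow the same grouping logic.

```python
import pandas as pd

# Hypothetical data: 14 consecutive daily prices, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
toy = pd.DataFrame({'AMZN': range(100, 114)}, index=idx)

# Group by week and aggregate with the mean
weekly = toy.resample('W').agg({'AMZN': 'mean'})

# The same grouping expressed with groupby and a time-based Grouper
same = toy.groupby(pd.Grouper(freq='W')).agg({'AMZN': 'mean'})

print(weekly)
print(weekly['AMZN'].tolist() == same['AMZN'].tolist())
```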

### Use any number of aggregation functions
Map the aggregating column to a list of aggregating functions.

In [None]:
stocks.resample('M').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(3)
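When a column is mapped to a list of functions, the result has two levels of column labels: the aggregating column and the function name. The sketch below uses a small made-up DataFrame (`toy` is hypothetical) and weekly groups to show the shape of the result and how to select a single aggregated column.

```python
import pandas as pd

# Hypothetical data: 14 consecutive daily prices, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
toy = pd.DataFrame({'AMZN': range(100, 114)}, index=idx)

summary = toy.resample('W').agg({'AMZN': ['size', 'min', 'mean', 'max']})

# The columns form a MultiIndex of (column, function) pairs
print(summary.columns.tolist())

# Select one aggregated column with a tuple
print(summary[('AMZN', 'mean')])
```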

### Group by Quarter

In [None]:
stocks.resample('Q').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(4)

### Label as the entire Period
Notice that the last day of each quarter is used as the returned index label for the time period. We can change the index labels so that they show the entire time period we are aggregating over by setting the `kind` parameter to `'period'`.

In [None]:
amzn_period = stocks.resample('Q', kind='period').agg({'AMZN': ['size', 'min', 'mean', 'max']})
amzn_period.head(4)
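Newer pandas releases (2.2 and later) deprecate the `kind` parameter of `resample`. One alternative, sketched here on made-up data (`toy` is hypothetical), is to resample as usual and then convert the resulting DatetimeIndex with its `to_period` method. This is only one possible approach, shown with weekly groups.

```python
import pandas as pd

# Hypothetical data: 14 consecutive daily prices, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
toy = pd.DataFrame({'AMZN': range(100, 114)}, index=idx)

weekly = toy.resample('W').agg({'AMZN': 'mean'})

# Replace the end-of-week timestamps with labels for the whole week
weekly.index = weekly.index.to_period('W')

print(type(weekly.index).__name__)
print(weekly)
```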

## The PeriodIndex
We no longer have a DatetimeIndex. Pandas has a completely separate type of object for this called the **PeriodIndex**. The index label '2016Q1' refers to the entire period of the first quarter of 2016. Let's inspect the index to see the new type.

In [None]:
amzn_period.index[:10]

## The Period data type
Pandas also has a completely separate data type called a **Period** to represent **columns** of data in a DataFrame that are specific **periods of time**. This is directly analogous to the PeriodIndex, but for DataFrame columns. Examples of a Period are the entire month of June 2014, or the entire 15 minute period from June 12, 2014 5:15 to June 12, 2014 5:30.
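The two examples just mentioned can be built directly with the `pd.Period` constructor. This is a small sketch; the variable names are chosen only for illustration. The `start_time` and `end_time` attributes return the timestamps that bound each period.

```python
import pandas as pd

# The entire month of June 2014
june = pd.Period('2014-06', freq='M')
print(june.start_time, june.end_time)

# The 15 minute period beginning June 12, 2014 at 5:15
quarter_hour = pd.Period('2014-06-12 05:15', freq='15min')
print(quarter_hour.start_time, quarter_hour.end_time)
```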

### Convert a datetime column to a Period
We can use the `to_period` method available with the `dt` accessor to convert datetimes to Period data types. You must pass it an offset alias to denote the length of the time period. Let's convert the `date` column in the weather dataset to a monthly Period column.

In [None]:
weather = pd.read_csv('../data/weather.csv', parse_dates=['date'])
weather.head(3)

Let's make the conversion from datetime to period and assign the result as a new column in the DataFrame.

In [None]:
date = weather['date']
weather['date_period'] = weather['date'].dt.to_period('M')
weather.head(3)
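A Period column can also serve as an ordinary grouping column. The sketch below uses a tiny made-up DataFrame (`toy_weather` and its `temp` column are hypothetical, not taken from the weather dataset) to convert dates to monthly periods and then group by them.

```python
import pandas as pd

# Hypothetical weather-like data
toy_weather = pd.DataFrame({
    'date': pd.to_datetime(['2014-06-12', '2014-06-30', '2014-07-01']),
    'temp': [70, 75, 80],
})

# Each date is mapped to the month that contains it
toy_weather['date_period'] = toy_weather['date'].dt.to_period('M')
print(toy_weather)

# Group by the monthly period and take the mean temperature
monthly = toy_weather.groupby('date_period')['temp'].mean()
print(monthly)
```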

### Why is the data type "object"?
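In older versions of pandas (before 0.24), a column of Period objects is reported with the generic `object` data type when outputting the data types. Newer versions report a dedicated period data type such as `period[M]`. Either way, if we inspect an individual element, we see that it is indeed a Period object.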

In [None]:
weather.dtypes

Inspect an individual element.

In [None]:
weather.loc[0, 'date_period']

### The `dt` accessor works for Period columns

Even if the column is labeled as object, pandas still provides attributes and methods specific to periods through the `dt` accessor.

In [None]:
weather['date_period'].dt.month.head(3)

Return the span of time with the `freq` attribute.

In [None]:
weather['date_period'].dt.freq
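Besides `month` and `freq`, the `dt` accessor on a Period column exposes the boundaries of each period. The following sketch builds a small Series of monthly periods from made-up dates and reads a few of these attributes.

```python
import pandas as pd

# Hypothetical dates converted to monthly periods
periods = pd.Series(pd.to_datetime(['2014-06-12', '2014-07-01'])).dt.to_period('M')

print(periods.dt.month)       # month number of each period
print(periods.dt.start_time)  # first timestamp inside each period
print(periods.dt.end_time)    # last timestamp inside each period
```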

## Anchored offsets

By default, when grouping by week, pandas chooses to end the week on Sunday. Let's verify this by grouping by week and taking the resulting index label and determining its weekday name.

In [None]:
week_mean = stocks.resample('W').agg({'AMZN': ['size', 'min', 'mean', 'max']})
week_mean.head(3)

In [None]:
week_mean.index[0].day_name()

### Anchor by a different day

You can anchor the week to any day you choose by appending a dash and then the first three letters of the day of the week. Let's anchor the week to Wednesday.

In [None]:
stocks.resample('W-WED').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(3)
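To check where each week ends, we can look at the weekday names of the resulting index labels. This sketch uses a made-up daily Series (`toy` is hypothetical) that starts on a Monday and compares the default weekly anchor with the Wednesday anchor.

```python
import pandas as pd

# Hypothetical data: 14 consecutive days, starting on Monday 2024-01-01
idx = pd.date_range('2024-01-01', periods=14, freq='D')
toy = pd.Series(range(14), index=idx)

sun = toy.resample('W').sum()      # default anchor
wed = toy.resample('W-WED').sum()  # weeks ending on Wednesday

print(sun.index.day_name().unique())
print(wed.index.day_name().unique())
print(wed)
```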

### Longer intervals of time with numbers appended to offset aliases
We can actually add more details to our offset aliases by using a number to specify an amount of that particular offset alias. For instance, **`5M`** will group in 5 month intervals.

In [None]:
stocks.resample('5M').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(3)

Group by every 22 weeks anchored to Thursday.

In [None]:
stocks.resample('22W-THU').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(3)
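A quick way to confirm what a numbered alias does is to look at the spacing between consecutive index labels. The sketch below uses made-up daily data (`toy` is hypothetical) and the `'2W'` alias, then measures the gap between labels.

```python
import pandas as pd

# Hypothetical data: 28 consecutive days, starting on a Monday
idx = pd.date_range('2024-01-01', periods=28, freq='D')
toy = pd.Series(range(28), index=idx)

# Count the rows that fall into each two-week group
two_week = toy.resample('2W').size()
print(two_week)

# Distance between consecutive group labels
gaps = two_week.index.to_series().diff().dropna()
print(gaps.unique())
```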

## Calling `resample` on a datetime column
The `resample` method can still work without a DatetimeIndex. If there is a column of the datetime data type, you can use the `on` parameter to specify that column. Let's reset the index and then call `resample` on that DataFrame.

In [None]:
amzn_reset = stocks.reset_index()
amzn_reset.head(3)

The only difference is that we specify the grouping column with the `on` parameter. The result is the exact same.

In [None]:
amzn_reset.resample('W-WED', on='date').agg({'AMZN': ['size', 'min', 'mean', 'max']}).head(3)
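The equivalence of the two approaches can be checked directly. This sketch uses a made-up DataFrame (`toy` is hypothetical) and compares resampling on the index with resampling on a column through the `on` parameter.

```python
import pandas as pd

# Hypothetical data: 14 consecutive daily prices with a named datetime index
idx = pd.date_range('2024-01-01', periods=14, freq='D', name='date')
toy = pd.DataFrame({'AMZN': range(100, 114)}, index=idx)

# Resample using the DatetimeIndex
by_index = toy.resample('W-WED').agg({'AMZN': 'mean'})

# Resample using the datetime column after resetting the index
by_column = toy.reset_index().resample('W-WED', on='date').agg({'AMZN': 'mean'})

print(by_index)
print(by_column)
```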

## Calling `resample` on a Series

Above, we called `resample` on a DataFrame. We can also use it for Series. Let's select Amazon's closing price as a Series.

In [None]:
amzn_close = stocks['AMZN']
amzn_close.head(3)

For a Series, the aggregating column is just the values. It's not necessary to use the `agg` method in order to aggregate. Instead, we can call aggregation methods directly. Here, we find the mean closing price by month.

In [None]:
amzn_close.resample('M').mean().head()

To compute multiple aggregations, use the `agg` method and pass it a list of the aggregating functions as strings. Here we find the number of trading days (`'size'`) along with the min, max, and mean of the closing price for every three-year period.

In [None]:
amzn_close.resample('3Y', kind='period').agg(['size', 'min', 'max', 'mean'])
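The difference between the two Series calls is the shape of the result: a single aggregation method returns a Series, while `agg` with a list returns a DataFrame with one column per function. The sketch below shows this on a made-up Series (`toy_close` is hypothetical) grouped by week.

```python
import pandas as pd

# Hypothetical closing prices: 14 consecutive days, starting on a Monday
idx = pd.date_range('2024-01-01', periods=14, freq='D')
toy_close = pd.Series(range(100, 114), index=idx, name='AMZN')

# A single aggregation called directly returns a Series
direct = toy_close.resample('W').mean()
print(type(direct).__name__)

# A list of aggregations returns a DataFrame, one column per function
multi = toy_close.resample('W').agg(['size', 'min', 'max', 'mean'])
print(multi)
```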

## Exercises

Execute the following cell that reads in 20 years of Microsoft stock data and use it for the first few exercises.

In [None]:
msft = pd.read_csv('../data/stocks/msft20.csv', parse_dates=['date'], index_col='date')
msft.head(3)

### Exercise 1
<span  style="color:green; font-size:16px">In which week did MSFT have the greatest number of its shares (volume) traded?</span>

### Exercise 2

<span  style="color:green; font-size:16px">With help from the `diff` method, find the quarter containing the greatest number of up days.</span>

### Exercise 3

<span  style="color:green; font-size:16px">Find the mean price per year along with the minimum and maximum volume.</span>

### Exercise 4

<span  style="color:green; font-size:16px">Use the `to_datetime` function to convert the hire date column into datetimes. Reassign this column in the `emp` DataFrame.</span>

### Exercise 5

<span  style="color:green; font-size:16px">Without putting `hire_date` into the index, find the mean salary based on `hire_date` over 5 year periods. Also return the number of salaries used in the mean calculation for each period.</span>