In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio

pio.templates.default = "plotly_dark"

import warnings
warnings.filterwarnings("ignore")

### Load data

In [2]:
df = pd.read_csv('../data/train.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

df['store'] = df['store'].astype('category')
df['product'] = df['product'].astype('category')
df.head()

| Date | store | product | number_sold |
|---|---|---|---|
| 2010-01-01 | 0 | 0 | 801 |
| 2010-01-02 | 0 | 0 | 810 |
| 2010-01-03 | 0 | 0 | 818 |
| 2010-01-04 | 0 | 0 | 796 |
| 2010-01-05 | 0 | 0 | 808 |


<img src="../img/TrainingIcons/Grumpy Bear Icon.png" alt="Image" width="50" height="50"> Wondering what the above means? Check out `EDA.ipynb`.

# Understanding Rolling Features

Rolling features are features calculated over a rolling window of time. For example, a 7-day rolling mean is the average of the last 7 days. Every rolling feature has two components:
- the window to calculate over
- the aggregation function to apply

A visual example of what a 3-value rolling average looks like:

![Rolling Mean](../img/TimeSeries_window3_rollingMean.png)
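
The same idea can be sketched in a few lines of `pandas`. The values below are made up for illustration (they are not taken from `train.csv`), and the variable names are hypothetical. The window is `3` and the aggregation is the mean:

```python
import pandas as pd

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
sales = pd.Series([10, 20, 30, 40, 50], name="sales")

# Component 1: the window (3 values). Component 2: the aggregation (mean).
rolling_mean_3 = sales.rolling(window=3).mean()

# The first two entries are NaN because fewer than 3 values are available
print(rolling_mean_3.tolist())  # [nan, nan, 20.0, 30.0, 40.0]
```

Swapping `.mean()` for another aggregation such as `.sum()` or `.max()` changes the second component while keeping the same window.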


As with our lagged features (`FeatureLagging.ipynb`), we group by `store` and `product` so that each rolling average is computed within a single time series and is not polluted by values from another `store`/`product` combination.
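
To see why the grouping matters, here is a minimal sketch on a made-up two-store frame (the values and the `naive`/`grouped` column names are assumptions for illustration only). Without grouping, the first window of store 1 mixes in the last value of store 0:

```python
import pandas as pd

# Hypothetical data: two stores selling the same product at very different volumes
toy = pd.DataFrame({
    "store": [0, 0, 0, 1, 1, 1],
    "product": [0, 0, 0, 0, 0, 0],
    "number_sold": [10, 20, 30, 1000, 2000, 3000],
})

# Ungrouped: the window slides straight across the store boundary
toy["naive"] = toy["number_sold"].rolling(window=2).mean()

# Grouped: each store/product series gets its own rolling window
toy["grouped"] = (
    toy.groupby(["store", "product"])["number_sold"]
       .transform(lambda s: s.rolling(window=2).mean())
)

print(toy)
```

In the `naive` column, the first row of store 1 averages 30 and 1000; in the `grouped` column that row is `NaN` because store 1 has no earlier history.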

# Implementing and Visualizing Rolling Features

We'll implement a rolling mean of the number of units sold for each product and store, using a window of 7 days.

In [3]:
# Rolling 7-day mean per store/product; transform keeps the result aligned with df's rows
df['rolling_7D_average'] = df.groupby(['store', 'product'])['number_sold'].transform(lambda s: s.rolling(window=7).mean())
df.head(10)

| Date | store | product | number_sold | rolling_7D_average |
|---|---|---|---|---|
| 2010-01-01 | 0 | 0 | 801 | NaN |
| 2010-01-02 | 0 | 0 | 810 | NaN |
| 2010-01-03 | 0 | 0 | 818 | NaN |
| 2010-01-04 | 0 | 0 | 796 | NaN |
| 2010-01-05 | 0 | 0 | 808 | NaN |
| 2010-01-06 | 0 | 0 | 812 | NaN |
| 2010-01-07 | 0 | 0 | 830 | 810.714286 |
| 2010-01-08 | 0 | 0 | 812 | 812.285714 |
| 2010-01-09 | 0 | 0 | 817 | 813.285714 |
| 2010-01-10 | 0 | 0 | 832 | 815.285714 |


Visually, a rolling mean has a smoothing effect on any given set of values.

In [4]:
# Plot the time series and its rolling 7D average for a randomly chosen store and product
random_store, random_product = df['store'].sample(1).values[0], df['product'].sample(1).values[0]
df_ = df[(df['store'] == random_store) & (df['product'] == random_product)]

px.line(df_, 
        x=df_.index, 
        y=['number_sold', 'rolling_7D_average'], 
        title=f'Number of products sold for store {random_store} and product {random_product}')

<img src="../img/TrainingIcons/Warning.png" alt="Image" width="80" height="80"> 

Note: The plot and table above show how `pandas` performs rolling calculations by default: the window includes the current observation. This is good enough for descriptive analytics, where whether a rolling 7-day average includes today or excludes it generally makes little material difference. *However*, care has to be taken when rolling calculations are used for feature engineering in forecasting.

Put simply, the datasets used for training and inference should never contain information that the model won't have at the time of inference.

This implies that if you build a rolling average from the target and use it as a feature, the rolling average *cannot* include the value of the target at the current point in time or at any future point. Instead, use lagged values: if you're calculating a rolling average feature to predict a given day's value, the rolling average should only include data up to the previous day, not that day itself.

In practice, this means adjusting the rolling window to exclude the current observation, thereby preventing data leakage and ensuring that the model's training aligns closely with the conditions under which it will be making predictions in the real world. This careful approach to feature engineering is crucial for developing robust, accurate forecasting models that perform well not just in training but also in evaluation and production environments.

These concerns apply to any feature you build rolling aggregates from. In general, though, the model has access to the value of an independent feature at the time of inference, so leakage is less of a concern there.

In [5]:
# Shifted rolling 7D average: shift(1) drops the current day, then roll within each store/product series
df['lag_7D_rolling'] = df.groupby(['store', 'product'])['number_sold'].transform(lambda s: s.shift(1).rolling(window=7).mean())
df.head(10)

| Date | store | product | number_sold | rolling_7D_average | lag_7D_rolling |
|---|---|---|---|---|---|
| 2010-01-01 | 0 | 0 | 801 | NaN | NaN |
| 2010-01-02 | 0 | 0 | 810 | NaN | NaN |
| 2010-01-03 | 0 | 0 | 818 | NaN | NaN |
| 2010-01-04 | 0 | 0 | 796 | NaN | NaN |
| 2010-01-05 | 0 | 0 | 808 | NaN | NaN |
| 2010-01-06 | 0 | 0 | 812 | NaN | NaN |
| 2010-01-07 | 0 | 0 | 830 | 810.714286 | NaN |
| 2010-01-08 | 0 | 0 | 812 | 812.285714 | 810.714286 |
| 2010-01-09 | 0 | 0 | 817 | 813.285714 | 812.285714 |
| 2010-01-10 | 0 | 0 | 832 | 815.285714 | 813.285714 |
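
One way to sanity-check a rolling feature for leakage is to perturb the most recent observation and confirm that the feature on that row does not move. The sketch below is a minimal illustration that reuses the first ten `number_sold` values shown for store 0, product 0. The helper names `leaky_feature` and `lagged_feature` are hypothetical and not part of the notebook:

```python
import pandas as pd

# First ten number_sold values for store 0 / product 0, as shown in the table
sales = pd.Series([801, 810, 818, 796, 808, 812, 830, 812, 817, 832])

def leaky_feature(s: pd.Series) -> pd.Series:
    """Default rolling mean: the window includes the current observation."""
    return s.rolling(window=7).mean()

def lagged_feature(s: pd.Series) -> pd.Series:
    """Shift by one step first, so the window ends at the previous observation."""
    return s.shift(1).rolling(window=7).mean()

# Change only the last observation, i.e. the value we would be predicting
perturbed = sales.copy()
perturbed.iloc[-1] += 1000

# A leak-free feature must be identical on the last row before and after the change
leaky_changed = leaky_feature(sales).iloc[-1] != leaky_feature(perturbed).iloc[-1]
lagged_changed = lagged_feature(sales).iloc[-1] != lagged_feature(perturbed).iloc[-1]

print(f"default rolling mean reacts to the current target: {leaky_changed}")
print(f"shifted rolling mean reacts to the current target: {lagged_changed}")
```

If the shifted feature reacted to the perturbation, the window would still be reading the value it is meant to help predict.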


# Conclusion

In exploring rolling aggregates with pandas, we've seen both their utility and the complications that arise in forecasting, chiefly the risk of leaking the current target value into a feature. Beyond averages, other statistical measures can capture historical trends and help predict future outcomes. For instance, a rolling standard deviation tracks how much sales or prices fluctuate over time, acting as a barometer for volatility. Features like this enrich our analytical toolkit and underscore the importance of understanding the patterns and variability in our data before we forecast.
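
As a minimal sketch of that idea, the snippet below computes a lagged rolling standard deviation per `store`/`product` on made-up data. The values, the 3-step window, and the column name `lag_3D_rolling_std` are assumptions for illustration and are not taken from the notebook's dataset:

```python
import pandas as pd

# Hypothetical data: store 0 has steady sales, store 1 has volatile sales
toy = pd.DataFrame({
    "store": [0] * 5 + [1] * 5,
    "product": [0] * 10,
    "number_sold": [10, 12, 11, 13, 12, 100, 140, 90, 150, 80],
})

# shift(1) keeps the current value out of the window; std() is the aggregation
toy["lag_3D_rolling_std"] = (
    toy.groupby(["store", "product"])["number_sold"]
       .transform(lambda s: s.shift(1).rolling(window=3).std())
)

print(toy)
```

The volatile store ends up with a much larger rolling standard deviation than the steady one, which is the kind of signal a forecasting model can use alongside the rolling mean.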