# Feature Lagging

Lagged features use values from earlier time steps as inputs for forecasting future observations. They rest on a core assumption of time series analysis: past observations influence future ones.

Lag features let a model capture temporal dependencies in the data, such as seasonality and trend. For instance, last month's sales figure is often a strong indicator of this month's sales figure. Giving a forecasting model access to this history can improve the accuracy and robustness of its predictions.

Even simple forecasting models can produce surprisingly robust results with lagged features, because the features expose the temporal structure of the data.

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px

import warnings
warnings.filterwarnings("ignore")

### Load data

In [2]:
df = pd.read_csv('../data/train.csv')
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

df['store'] = df['store'].astype('category')
df['product'] = df['product'].astype('category')
df.head()

Date,store,product,number_sold
2010-01-01,0,0,801
2010-01-02,0,0,810
2010-01-03,0,0,818
2010-01-04,0,0,796
2010-01-05,0,0,808


<img src="../img/TrainingIcons/Grumpy Bear Icon.png" alt="Image" width="50" height="50"> Wondering what the above means? Check out `EDA.ipynb`.

## Understanding the lagged feature
Imagine a simple time series in which a feature of interest has a value at each timestamp. A lagged feature can then be visualized as follows:

<img src="../img/TimeSeries_lag.png" alt="Image" width="550" height="500"> 

<img src="../img/TrainingIcons/Warning.png" alt="Image" width="80" height="80"> 

Note: Lagging a feature creates nulls at the earliest timestamps, because there is no earlier value to pull forward. In the given example, `2010-01-01` has no value for either `Feature Lag 1` or `Feature Lag 2`. You can either impute these values (fill them based on domain knowledge or a specific strategy) or drop the records where the nulls were created.
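
As a minimal sketch of the note above, the toy series below shows where the nulls appear. The values and the column names `feature_lag_1` and `feature_lag_2` are made up for illustration and do not come from `train.csv`.

```python
import pandas as pd

# Toy daily series; the values are illustrative only
toy = pd.DataFrame(
    {"feature": [10, 12, 15, 11]},
    index=pd.date_range("2010-01-01", periods=4, freq="D"),
)

# shift(k) moves each value k rows down, so the first k rows have nothing to show
toy["feature_lag_1"] = toy["feature"].shift(1)
toy["feature_lag_2"] = toy["feature"].shift(2)

# Count the nulls introduced by each lag
null_counts = toy[["feature_lag_1", "feature_lag_2"]].isna().sum()
print(toy)
print(null_counts)
```

A lag of `k` rows leaves `k` nulls at the start of the series, so longer lags cost more usable history if the rows are dropped.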

Also, be careful as to what features you're lagging and how. You don't want to accidentally attribute store 2's sales to store 1. Since they all exist in the same dataframe, this is easy to do.
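
To make the misattribution risk concrete, here is a small sketch on made-up data with two stores stacked in one dataframe. The column names `lag_naive` and `lag_grouped` are hypothetical and used only for this comparison.

```python
import pandas as pd

# Two stores stacked in one frame, each with three days of made-up sales
toy = pd.DataFrame(
    {
        "store": [1, 1, 1, 2, 2, 2],
        "number_sold": [100, 110, 120, 900, 910, 920],
    },
    index=pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03"] * 2),
)

# Naive lag: shifts the whole column, ignoring store boundaries
toy["lag_naive"] = toy["number_sold"].shift(1)

# Grouped lag: shifts within each store separately
toy["lag_grouped"] = toy.groupby("store")["number_sold"].shift(1)

# First row of store 2: the naive lag carries over store 1's last value
first_store2 = toy[toy["store"] == 2].iloc[0]
print(toy)
```

In the naive column, the first day of store 2 receives store 1's last sales figure, while the grouped column correctly leaves it null.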

## Where would you use a lag feature?

In time series forecasting, lagged features can be created for both the independent features (the predictors) and the dependent feature(s) (the predicted values).

Dependent feature lagging is useful in industries where a weekly pattern exists (e.g., Walmart sales spike every weekend): knowing the value of the target variable 7 days ago can have high predictive power. In this example, you could use units sold yesterday, last week, and 4 weeks ago as predictors for the sales on any given day.

<img src="../img/TimeSeries_TargetLag.png" alt="Image" width="700" height="550"> 
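
The target lags described above (yesterday, last week, 4 weeks ago) could be built along these lines. This is a sketch under assumptions: `add_target_lags` is a hypothetical helper, not part of pandas, the toy data is synthetic, and the lags are row-based, so they only equal 1, 7, and 28 days if every group has exactly one row per day.

```python
import numpy as np
import pandas as pd


def add_target_lags(frame, target, lags, group_cols):
    """Hypothetical helper: add one lagged copy of `target` per entry in `lags`."""
    out = frame.copy()
    # Group so that lags never cross store/product boundaries
    grouped = frame.groupby(group_cols, observed=True)[target]
    for lag in lags:
        out[f"{target}_lag_{lag}"] = grouped.shift(lag)
    return out


# Synthetic data: one store, one product, 35 consecutive days
dates = pd.date_range("2010-01-01", periods=35, freq="D")
toy = pd.DataFrame(
    {"store": 0, "product": 0, "number_sold": np.arange(35)},
    index=dates,
)

lagged = add_target_lags(
    toy, "number_sold", lags=(1, 7, 28), group_cols=["store", "product"]
)
print(lagged.tail(3))
```

The longest lag determines how many leading rows per group end up with at least one null, which is worth keeping in mind when choosing between imputing and dropping.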

On the other hand, independent features can be lagged to account for the delayed effects of past events. For example, to understand the effect of sporting events on sales, we might add lagged copies of an event indicator so the model can see that an event took place one or more days earlier.

<img src="../img/TimeSeries_FeatureLag.png" alt="Image" width="700" height="150"> 
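
A lagged independent feature might look like the sketch below. The `is_event_day` flag and its values are invented for illustration, and `fill_value=0` encodes the assumption that no event occurred before the data starts.

```python
import pandas as pd

# Made-up indicator: 1 on days with a sporting event, 0 otherwise
dates = pd.date_range("2010-01-01", periods=6, freq="D")
events = pd.DataFrame({"is_event_day": [0, 1, 0, 0, 1, 0]}, index=dates)

# Lagged copies let a model see that an event happened 1 or 2 days earlier.
# fill_value=0 assumes there was no event before the first recorded day.
events["event_lag_1"] = events["is_event_day"].shift(1, fill_value=0)
events["event_lag_2"] = events["is_event_day"].shift(2, fill_value=0)

print(events)
```

Passing `fill_value` avoids nulls and keeps the columns as integers, but it is only appropriate when the default value is a defensible guess for the period before the data begins.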

## Implementing lag in pandas

In `pandas`, features can be lagged using the `shift` method. To account for other `products` and `stores` existing within the same dataframe, we will group by them so as to not misattribute sales from one store onto another.

In [3]:
df['lag_1'] = df.groupby(['store', 'product'])['number_sold'].shift(1)
df

Date,store,product,number_sold,lag_1
2010-01-01,0,0,801,
2010-01-02,0,0,810,801.0
2010-01-03,0,0,818,810.0
2010-01-04,0,0,796,818.0
2010-01-05,0,0,808,796.0
...,...,...,...,...
2018-12-27,6,9,890,896.0
2018-12-28,6,9,892,890.0
2018-12-29,6,9,895,892.0
2018-12-30,6,9,899,895.0


At this point we can either fill in the nulls or drop the null records entirely. There are a variety of ways to impute these null values. A few examples follow.

In [4]:
# Backfill replaces each missing value with the next non-missing value in the column.
# This relies on the rows being ordered by store, product, then date, so that the null
# at the start of each group is filled from the next row of that same group.
df_new = df.bfill()
df_new

Date,store,product,number_sold,lag_1
2010-01-01,0,0,801,801.0
2010-01-02,0,0,810,801.0
2010-01-03,0,0,818,810.0
2010-01-04,0,0,796,818.0
2010-01-05,0,0,808,796.0
...,...,...,...,...
2018-12-27,6,9,890,896.0
2018-12-28,6,9,892,890.0
2018-12-29,6,9,895,892.0
2018-12-30,6,9,899,895.0


In [5]:
# Replace every missing value with a constant (here 0)
df_new = df.fillna(0)
df_new

Date,store,product,number_sold,lag_1
2010-01-01,0,0,801,0.0
2010-01-02,0,0,810,801.0
2010-01-03,0,0,818,810.0
2010-01-04,0,0,796,818.0
2010-01-05,0,0,808,796.0
...,...,...,...,...
2018-12-27,6,9,890,896.0
2018-12-28,6,9,892,890.0
2018-12-29,6,9,895,892.0
2018-12-30,6,9,899,895.0


In [6]:
# Drop every row that contains a missing value (the first date of each store/product group)
df_new = df.dropna()
df_new

Date,store,product,number_sold,lag_1
2010-01-02,0,0,810,801.0
2010-01-03,0,0,818,810.0
2010-01-04,0,0,796,818.0
2010-01-05,0,0,808,796.0
2010-01-06,0,0,812,808.0
...,...,...,...,...
2018-12-27,6,9,890,896.0
2018-12-28,6,9,892,890.0
2018-12-29,6,9,895,892.0
2018-12-30,6,9,899,895.0


More involved imputation strategies can be used as well.
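
One possible example of a more involved strategy, offered as a sketch rather than a recommendation, is to fill each null with the mean of the lagged values in the same group. The toy data and the `lag_1_imputed` column name are assumptions for illustration. Note that a mean computed over the whole series uses later observations, which may leak future information into early rows.

```python
import pandas as pd

# Made-up sales for two stores, three rows each
toy = pd.DataFrame(
    {
        "store": [0, 0, 0, 1, 1, 1],
        "number_sold": [10, 20, 30, 100, 200, 300],
    }
)

# Grouped lag, as before: the first row of each store becomes null
toy["lag_1"] = toy.groupby("store")["number_sold"].shift(1)

# Per-store mean of the lag column, broadcast back to every row of that store
group_mean = toy.groupby("store")["lag_1"].transform("mean")

# Fill only the nulls with the mean of their own store
toy["lag_1_imputed"] = toy["lag_1"].fillna(group_mean)
print(toy)
```

Filling per group keeps one store's scale from distorting another's, which a single global mean would not do.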