## Feature Engineering


Before moving from traditional models to machine learning models, we need features to train on, and we get them through feature engineering.

Feature engineering in time series forecasting means creating useful variables (such as day of the week, past values, or trends) from raw time data. We do it to help models recognize patterns and improve prediction accuracy.

### 0. Setup and Sample Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf


In [None]:
url = 'https://drive.google.com/uc?id=1M1ryHCBP55fhv8wZPWURun4MgNQfN6zb'

data = pd.read_csv(url)
data['invoice_date'] = pd.to_datetime(data['invoice_date'])
data.set_index('invoice_date', inplace=True)

In [None]:
data.head()

| invoice_date | total_transaction | total_quantity | total_sales |
|---|---|---|---|
| 2009-12-01 | 154 | 20736 | 45958.31 |
| 2009-12-02 | 125 | 25657 | 54826.26 |
| 2009-12-03 | 144 | 44557 | 57521.87 |
| 2009-12-04 | 100 | 19550 | 37222.23 |
| 2009-12-05 | 31 | 4636 | 8803.86 |


As in the examples from the earlier notebooks, our target variable is `total_quantity`.

In [None]:
print(data.columns)


Index(['total_transaction', 'total_quantity', 'total_sales'], dtype='object')


### 1. Lag Features

**Lag features are created by shifting the time series by a specified number of time steps**, so that each row carries the value observed that many steps earlier. They capture the influence of previous values on the current value. Time series data often has autocorrelation, meaning past values affect future ones. Lag features help models learn these dependencies.

In [None]:
# Add lagged values
df_lag = data.copy()

df_lag['total_transaction_lag_1'] = df_lag['total_transaction'].shift(1)
df_lag['total_transaction_lag_2'] = df_lag['total_transaction'].shift(2)

df_lag['total_transaction_lag_7'] = df_lag['total_transaction'].shift(7)

df_lag.head()

| invoice_date | total_transaction | total_quantity | total_sales | total_transaction_lag_1 | total_transaction_lag_2 | total_transaction_lag_7 |
|---|---|---|---|---|---|---|
| 2009-12-01 | 154 | 20736 | 45958.31 | NaN | NaN | NaN |
| 2009-12-02 | 125 | 25657 | 54826.26 | 154.0 | NaN | NaN |
| 2009-12-03 | 144 | 44557 | 57521.87 | 125.0 | 154.0 | NaN |
| 2009-12-04 | 100 | 19550 | 37222.23 | 144.0 | 125.0 | NaN |
| 2009-12-05 | 31 | 4636 | 8803.86 | 100.0 | 144.0 | NaN |


When you create lag_1, lag_2, etc., you're giving the model a historical memory.

For example: A Random Forest model sees those lags as just numeric features and finds patterns like:

> When lag_1 is high and lag_2 is rising, value is likely to increase.

We use **ACF (Autocorrelation Function) and PACF (Partial Autocorrelation Function) plots to identify which lag values are important** in a time series. These tools help determine the specific lag points that have significant correlations with the current value. The example above simply demonstrates the basic concept of creating lag features, but in practice, ACF and PACF guide us in selecting meaningful lags rather than choosing them arbitrarily.

**For example:** if the ACF plot shows significant **spikes at lag 1 and lag 2** while the other lags stay close to zero, it means that the **sales value today is strongly correlated with sales 1 day and 2 days ago**.
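As a rough illustration of how lags might be screened, the sketch below computes autocorrelations with pandas and keeps the lags that fall outside an approximate 95% band of ±1.96/√n. The synthetic weekly series, the seed, and the helper name `significant_lags` are assumptions made for this example and are not part of the dataset above; `plot_acf` and `plot_pacf` present this kind of information graphically.

```python
import numpy as np
import pandas as pd


def significant_lags(series, max_lag=14):
    """Return {lag: autocorrelation} for lags outside the approximate 95% band."""
    band = 1.96 / np.sqrt(len(series))  # large-sample band for white noise
    result = {}
    for lag in range(1, max_lag + 1):
        r = series.autocorr(lag=lag)  # Pearson correlation with the shifted series
        if abs(r) > band:
            result[lag] = round(float(r), 3)
    return result


# Synthetic daily series with a weekly cycle (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(210)
values = 100 + 30 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, size=t.size)
demo = pd.Series(values, index=pd.date_range("2021-01-01", periods=t.size, freq="D"))

lags = significant_lags(demo)
print(lags)
```

On a series like this, lag 7 should stand out, which would suggest adding a `shift(7)` feature.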

### 2. Rolling Statistics

Rolling (or moving) statistics aggregate data over a sliding window of time steps. Common statistics include the mean, standard deviation, min, max, and median. They smooth the data and capture local trends and volatility.

In [None]:
# Add rolling mean and standard deviation
df_roll = data.copy()

df_roll['total_transaction_roll_mean_3'] = df_roll['total_transaction'].rolling(window=3).mean()
df_roll['total_transaction_roll_std_3'] = df_roll['total_transaction'].rolling(window=3).std()

df_roll.head()

| invoice_date | total_transaction | total_quantity | total_sales | total_transaction_roll_mean_3 | total_transaction_roll_std_3 |
|---|---|---|---|---|---|
| 2009-12-01 | 154 | 20736 | 45958.31 | NaN | NaN |
| 2009-12-02 | 125 | 25657 | 54826.26 | NaN | NaN |
| 2009-12-03 | 144 | 44557 | 57521.87 | 141.0 | 14.73092 |
| 2009-12-04 | 100 | 19550 | 37222.23 | 123.0 | 22.068076 |
| 2009-12-05 | 31 | 4636 | 8803.86 | 91.666667 | 56.95905 |


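Together, the rolling mean and standard deviation tell the model about the level and the variability of recent data, which can help it make more accurate predictions. Note that `rolling(window=3)` as used here includes the current row in each window; the Data Leakage section below explains why a forecasting feature should shift first.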

For Example:
> When the rolling mean is increasing and rolling standard deviation is low, the value is likely to be stable and rise steadily.

### 3. Time-Based Features

These are features extracted from the timestamp column of the time series (e.g., hour, day, month, season). They help models learn seasonality and daily, weekly, or monthly patterns.

In [None]:
# Extract features from date
df_timeBasedFeature = data.copy()

df_timeBasedFeature['day_of_week'] = df_timeBasedFeature.index.dayofweek          # 0 = Monday, 6 = Sunday
df_timeBasedFeature['month'] = df_timeBasedFeature.index.month
df_timeBasedFeature['is_weekend'] = df_timeBasedFeature['day_of_week'].isin([5, 6]).astype(int)

df_timeBasedFeature.head()

| invoice_date | total_transaction | total_quantity | total_sales | day_of_week | month | is_weekend |
|---|---|---|---|---|---|---|
| 2009-12-01 | 154 | 20736 | 45958.31 | 1 | 12 | 0 |
| 2009-12-02 | 125 | 25657 | 54826.26 | 2 | 12 | 0 |
| 2009-12-03 | 144 | 44557 | 57521.87 | 3 | 12 | 0 |
| 2009-12-04 | 100 | 19550 | 37222.23 | 4 | 12 | 0 |
| 2009-12-05 | 31 | 4636 | 8803.86 | 5 | 12 | 1 |


For example:

> On a weekday such as Monday (0), the value might be higher due to weekday effects, while weekends (5 or 6) could show different patterns, such as lower activity or sales.

## Use Cases

We don’t always create lag, rolling, or time-based features for every time series problem. These feature engineering steps are done only when necessary and based on the characteristics of the data we have.

* **Lag features** are useful when past values directly influence future values, but if the data has little autocorrelation, lags may not help much.

* **Rolling features** like rolling mean and standard deviation are valuable when recent trends or volatility matter for prediction, but if the data is very stable or noisy without patterns, they might add little value.

* **Time-based features** such as day of the week or month help when the data shows seasonal or calendar-related effects, but if the data is random over time without such cycles, these features may not improve the model.

In short, feature engineering should be guided by understanding the data and its underlying patterns, not applied blindly to every dataset.

## Exogenous Variables in Time Series Forecasting

**Exogenous variables are external factors that influence target variables,** even though they’re not part of the sales data itself. These variables provide context and help improve forecasting accuracy when used properly. We can split them into two types:

#### 1. Past-Known Covariates (Known Until Today)

These are variables that affect sales but are only known up to the current day; their future values are unknown. In forecasting, **we must use lagged values or rolling summaries of them.**

**Examples:**

* Past foot traffic (e.g., how many customers visited recently)
* Weather data (e.g., rain might reduce in-store sales)
* Stock levels / Inventory availability
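A minimal sketch of how a past-known covariate could be turned into usable features, assuming a hypothetical `foot_traffic` column (the column name and the numbers are made up for illustration and are not in the dataset above):

```python
import pandas as pd

# Hypothetical daily data: `foot_traffic` is only known up to today
df = pd.DataFrame(
    {
        "total_quantity": [120, 135, 128, 150, 160, 142, 155, 149],
        "foot_traffic": [300, 320, 310, 360, 380, 340, 370, 350],
    },
    index=pd.date_range("2024-01-01", periods=8, freq="D"),
)

# Yesterday's foot traffic: available when forecasting today
df["foot_traffic_lag_1"] = df["foot_traffic"].shift(1)

# Mean over the three days before today (shift first, then roll)
df["foot_traffic_roll_mean_3"] = df["foot_traffic"].shift(1).rolling(window=3).mean()

print(df)
```

The raw `foot_traffic` column would then be left out of the model inputs, since today's value is not known at prediction time.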


#### 2. Future-Known Covariates (Known in Advance)

These are features that can be known ahead of time and used directly as features during prediction.

**Examples:**

* Holidays & festivals (e.g., Diwali, Christmas, national holidays)
* Planned promotions or discount periods
* Advertising campaign schedules
* Store opening hours or maintenance shutdowns
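Because these values are known ahead of time, they can be attached directly to the dates being forecast. The sketch below assumes a hypothetical holiday list and promotion window; the dates and the column names `is_holiday` and `is_promo` are illustrative only.

```python
import pandas as pd

# Hypothetical calendar of events known in advance
holidays = pd.to_datetime(["2024-12-25", "2025-01-01"])
promo_start = pd.Timestamp("2024-12-26")
promo_end = pd.Timestamp("2024-12-31")

# Future dates we want to forecast
future = pd.DataFrame(index=pd.date_range("2024-12-24", "2025-01-02", freq="D"))

# Flags can be computed for future rows without any shifting
future["is_holiday"] = future.index.isin(holidays).astype(int)
future["is_promo"] = (
    (future.index >= promo_start) & (future.index <= promo_end)
).astype(int)

print(future)
```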



## Data Leakage in Sales Forecasting

Data leakage is one of the most critical and common mistakes in time series forecasting. It occurs when the model has access to future information (i.e., data that wouldn’t be available at prediction time) during training, leading to over-optimistic accuracy and poor real-world performance.


### Example of Data Leakage in Time Series Forecasting



Suppose you're building a model to predict today’s total sales for a retail store. You include features like:

* Today’s total number of transactions

* Today’s average basket size

* A 7-day rolling average of total sales (calculated over the full dataset)

This seems reasonable — but here’s the problem:

* At the start of the day, you don’t yet know the total number of transactions or the average basket size for today.

* The **7-day rolling average includes today’s sales value**, which is exactly what you’re trying to predict.

This is data leakage — your model is “cheating” by using future or unavailable information.

**Correct Approach:**

* Use yesterday’s transaction count: `total_transaction.shift(1)`

* Calculate the 7-day rolling average on past data only: `total_sales.shift(1).rolling(7).mean()`

* Always split the data chronologically, and fit any scaler on the training split only

This way, your model mimics a real-life forecasting scenario where only past and known features are available at prediction time.
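The difference between a leaky and a safe rolling feature can be seen on a tiny made-up series. This sketch uses a 3-day window instead of 7 to keep the output short; the values and column names are illustrative assumptions.

```python
import pandas as pd

sales = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="total_sales",
)

features = pd.DataFrame({"total_sales": sales})

# Leaky: the window ending today includes today's value (the target)
features["roll_mean_3_leaky"] = sales.rolling(3).mean()

# Safe: shift first so the window ends yesterday
features["roll_mean_3_safe"] = sales.shift(1).rolling(3).mean()

print(features)
```

On 2024-01-04 the leaky column averages 20, 30, and 40 (including that day's own sales), while the safe column averages only 10, 20, and 30.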



### Best Practices to Avoid Data Leakage in Time Series



* Split data **chronologically**, not randomly
* Use `.shift()` to access only **past values**
* Apply **rolling windows after shifting**, not before
* Fit scalers (e.g., `StandardScaler`) **only on training data**
* Use only **features available at prediction time**
* Avoid using **same-day values** of the target or related variables
* Build **time-based aggregations** using only historical data
* Always ask: **“Would I know this value when forecasting?”**
* Plot timelines to verify **no future data is included**
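As one possible way to apply the first and fourth points together, the sketch below splits a synthetic feature chronologically and standardizes it with statistics taken from the training rows only. The data, the 80/20 split ratio, and the variable names are assumptions for illustration; the manual scaling mirrors what fitting `StandardScaler` on the training data and then transforming the test data would do.

```python
import numpy as np
import pandas as pd

# Synthetic daily feature (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame(
    {"feature": rng.normal(50, 10, size=100)},
    index=pd.date_range("2024-01-01", periods=100, freq="D"),
)

# Chronological split: first 80% of rows for training, last 20% for testing
split_point = int(len(df) * 0.8)
train, test = df.iloc[:split_point], df.iloc[split_point:]

# Compute scaling statistics from the training rows only
mean = train["feature"].mean()
std = train["feature"].std(ddof=0)  # population std

train_scaled = (train["feature"] - mean) / std
test_scaled = (test["feature"] - mean) / std  # reuse the training statistics

print(train.index.max(), test.index.min())
```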


We will export only the DataFrame with the time-based features, as the other features are unnecessary for our simple dataset. This keeps the exported data straightforward and easy to use.

In [None]:
df_timeBasedFeature.to_csv('dataset_after_feature_engineering.csv', index=True)