# Session 3: Feature Engineering for Time Series Forecasting

## Objective
Transform raw time series data into a leakage-safe, model-ready feature set
for multi-horizon forecasting.

## Key Principles
- All features must be known at prediction time
- No future information leakage
- Feature windows should match the cycles in demand (here: 7-, 14-, and 28-day windows)


## Importing libraries & loading processed data

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path
import sys

# Dynamically locate project root
PROJECT_ROOT = Path.cwd()
while not (PROJECT_ROOT / "config").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.append(str(PROJECT_ROOT))


from config.paths import DATA_DIR

daily_sales = pd.read_parquet(
    DATA_DIR / "daily_sales.parquet"
)

daily_sales.head()

,date,sales,dow
0,2011-01-29,32631,Saturday
1,2011-01-30,31749,Sunday
2,2011-01-31,23783,Monday
3,2011-02-01,25412,Tuesday
4,2011-02-02,19146,Wednesday


In [2]:
daily_sales.info()
daily_sales.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1913 entries, 0 to 1912
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   date    1913 non-null   datetime64[ns]
 1   sales   1913 non-null   int64         
 2   dow     1913 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 45.0+ KB


date     0
sales    0
dow      0
dtype: int64

## Forecast Horizons

We engineer features that support the following horizons:
- 1 day
- 7 days
- 14 days
- 28 days

## Forecasting Strategy

We use a **direct multi-horizon approach**:
- Separate models for each forecast horizon
- Each model predicts sales at t + h directly
- Avoids the error accumulation of recursive forecasting, where each prediction is fed back in as an input for the next step
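
The strategy above can be sketched on a toy problem. This is a minimal illustration, not the project's model: the synthetic series, the two lag features, and scikit-learn's `LinearRegression` are all stand-ins chosen for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic daily series with a weekly cycle (assumption: stands in for real sales)
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=200, freq="D")
sales = pd.Series(
    100 + 10 * np.sin(2 * np.pi * np.arange(200) / 7) + rng.normal(0, 1, 200),
    index=dates,
)

# Features at row t only use values from before t
features = pd.DataFrame({"lag_1": sales.shift(1), "lag_7": sales.shift(7)})

# One independent model per horizon; each predicts sales at t + h directly
models = {}
for h in [1, 7]:
    target = sales.shift(-h).rename("target")
    frame = features.join(target).dropna()
    models[h] = LinearRegression().fit(frame[["lag_1", "lag_7"]], frame["target"])
```

No model ever consumes another model's prediction, which is what keeps errors from compounding across horizons.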


In [3]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## Calendar-Based Features (Zero Leakage, High Signal)
### Calendar Features

Calendar features are:
- Known at prediction time
- Stable across environments
- Strong drivers of seasonality


In [4]:
daily_sales.columns

Index(['date', 'sales', 'dow'], dtype='object')

In [5]:
daily_sales["date"] = pd.to_datetime(daily_sales["date"])

In [6]:
daily_sales = daily_sales.sort_values("date")
daily_sales = daily_sales.set_index("date")

In [7]:
type(daily_sales.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [8]:
daily_sales["dayofweek"] = daily_sales.index.dayofweek
daily_sales["week"] = daily_sales.index.isocalendar().week.astype(int)
daily_sales["month"] = daily_sales.index.month
daily_sales["year"] = daily_sales.index.year
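
One detail worth checking when combining `week` with `year`: ISO week numbers follow the ISO year, which can differ from the calendar year around New Year. The sketch below uses two dates inside this dataset's range; the `iso_year` column is a hypothetical addition, not part of the notebook's feature set.

```python
import pandas as pd

idx = pd.DatetimeIndex(["2012-01-01", "2012-12-30"])
iso = idx.isocalendar()  # DataFrame with columns: year, week, day

calendar = pd.DataFrame(
    {
        "year": idx.year,                      # calendar year
        "iso_year": iso["year"].astype(int),   # ISO year (hypothetical extra column)
        "week": iso["week"].astype(int),       # ISO week number
    },
    index=idx,
)
print(calendar)
```

Both rows share `year == 2012` and `week == 52`, even though they are almost a year apart; only `iso_year` tells them apart. Whether this matters depends on the model, so treat it as something to verify rather than a required fix.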

In [9]:
daily_sales = daily_sales.drop(columns=["dow"])  # redundant with the numeric dayofweek column

## Lag Features (Core Temporal Memory)

In [10]:
LAGS = [1, 7, 14, 28]

for lag in LAGS:
    daily_sales[f"lag_{lag}"] = daily_sales["sales"].shift(lag)
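
Because `shift` moves values by rows rather than by calendar days, the lags above are only correct if the index has no missing dates. A minimal sketch of that check on a toy index follows; the helper name `missing_dates` is hypothetical.

```python
import pandas as pd

def missing_dates(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return the calendar days absent from a daily DatetimeIndex."""
    full = pd.date_range(index.min(), index.max(), freq="D")
    return full.difference(index)

# Toy index with 2021-01-03 missing
idx = pd.DatetimeIndex(["2021-01-01", "2021-01-02", "2021-01-04"])
s = pd.Series([10, 20, 40], index=idx)

gaps = missing_dates(idx)

# Row-based shift pairs 2021-01-04 with 2021-01-02 (two days back)
positional_lag = s.shift(1)

# Filling the calendar first makes the one-day lag honest (NaN where unknown)
calendar_lag = s.asfreq("D").shift(1)
```

Running `missing_dates(daily_sales.index)` before building lags would confirm that an empty result comes back for the real data.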

## Rolling Statistics

In [11]:
WINDOWS = [7, 14, 28]

for w in WINDOWS:
    daily_sales[f"rolling_mean_{w}"] = (
        daily_sales["sales"]
        .shift(1)
        .rolling(window=w)
        .mean()
    )
    
    daily_sales[f"rolling_std_{w}"] = (
        daily_sales["sales"]
        .shift(1)
        .rolling(window=w)
        .std()
    )
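
The `shift(1)` before `rolling` is what keeps the current day's sales out of its own window. A toy comparison, using a made-up five-value series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Leaky: the window at row t includes s[t] itself
leaky = s.rolling(window=3).mean()

# Leakage-safe: the window at row t covers s[t-3] .. s[t-1]
safe = s.shift(1).rolling(window=3).mean()

print(pd.DataFrame({"value": s, "leaky": leaky, "safe": safe}))
```

At row 3 the leaky mean is 3.0 (it averages 2, 3, 4) while the safe mean is 2.0 (it averages 1, 2, 3). The safe version also needs one extra warm-up row before its first valid value.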

In [12]:
feature_df = daily_sales.dropna().copy()

In [13]:
feature_df.head(1)

date,sales,dayofweek,week,month,year,lag_1,lag_7,lag_14,lag_28,rolling_mean_7,rolling_std_7,rolling_mean_14,rolling_std_14,rolling_mean_28,rolling_std_28
2011-02-26,29908,5,8,2,2011,22529.0,31689.0,34833.0,32631.0,24143.142857,4576.62825,25112.214286,5575.475946,26238.678571,5320.795395


In [14]:
feature_df.columns

Index(['sales', 'dayofweek', 'week', 'month', 'year', 'lag_1', 'lag_7',
       'lag_14', 'lag_28', 'rolling_mean_7', 'rolling_std_7',
       'rolling_mean_14', 'rolling_std_14', 'rolling_mean_28',
       'rolling_std_28'],
      dtype='object')

In [15]:
feature_df.isna().sum()

sales              0
dayofweek          0
week               0
month              0
year               0
lag_1              0
lag_7              0
lag_14             0
lag_28             0
rolling_mean_7     0
rolling_std_7      0
rolling_mean_14    0
rolling_std_14     0
rolling_mean_28    0
rolling_std_28     0
dtype: int64

## Direct Multi-Horizon Target Construction

In [22]:
HORIZONS = [1, 7, 14, 28]

for h in HORIZONS:
    feature_df[f"target_t_plus_{h}"] = feature_df["sales"].shift(-h)
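
The direction of the shift is easy to get wrong, so here is a toy check (made-up values) that a negative shift places the value from `h` days ahead on row `t`:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="D")
sales = pd.Series([10, 20, 30, 40, 50, 60], index=idx)

h = 2
target = sales.shift(-h)  # row t holds sales at t + h

# The last h rows have no future value yet and become NaN
n_missing = int(target.isna().sum())

# Verify alignment for the first row: target at t equals sales at t + h days
t = idx[0]
aligned = target.loc[t] == sales.loc[t + pd.Timedelta(days=h)]
```

A positive shift would instead copy past sales into the target column, which turns the problem into predicting something already known.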


In [23]:
model_df = feature_df.dropna().copy()

In [24]:
for h in HORIZONS:
    TARGET_COL = f"target_t_plus_{h}"

    model_df = feature_df.dropna(subset=[TARGET_COL]).copy()

    # Drop current sales and every target column (all horizons), so that
    # another horizon's target cannot leak into the feature matrix
    target_cols = [c for c in model_df.columns if c.startswith("target_t_plus_")]
    X = model_df.drop(columns=["sales"] + target_cols)
    y = model_df[TARGET_COL]

    # train model for horizon h

In [25]:
X.head(15)

date,dayofweek,week,month,year,lag_1,lag_7,lag_14,lag_28,rolling_mean_7,rolling_std_7,rolling_mean_14,rolling_std_14,rolling_mean_28,rolling_std_28
2011-02-26,5,8,2,2011,22529.0,31689.0,34833.0,32631.0,24143.142857,4576.62825,25112.214286,5575.475946,26238.678571,5320.795395
2011-02-27,6,8,2,2011,29908.0,29283.0,36380.0,31749.0,23888.714286,4113.263859,24760.428571,5045.10774,26141.428571,5223.630967
2011-02-28,0,9,2,2011,28707.0,23966.0,21804.0,23783.0,23806.428571,3991.319742,24212.357143,3992.744296,26032.785714,5133.540619
2011-03-01,1,9,3,2011,21240.0,20501.0,24070.0,25412.0,23417.0,4104.536312,24172.071429,4021.653821,25941.964286,5196.921314
2011-03-02,2,9,3,2011,22872.0,20757.0,21443.0,19146.0,23755.714286,3917.358537,24086.5,4036.70987,25851.25,5228.586542
2011-03-03,3,9,3,2011,22046.0,20277.0,20318.0,29211.0,23939.857143,3780.821402,24129.571429,4009.459214,25954.821429,5118.406719
2011-03-04,4,9,3,2011,23475.0,22529.0,23721.0,28010.0,24396.714286,3442.533065,24355.071429,3864.765939,25749.964286,5098.002924
2011-03-05,5,9,3,2011,23572.0,29908.0,31689.0,37932.0,24545.714286,3370.029511,24344.428571,3866.850969,25591.464286,5094.123183
2011-03-06,6,9,3,2011,31202.0,28707.0,29283.0,32736.0,24730.571429,3729.508898,24309.642857,3797.262062,25351.107143,4627.710988
2011-03-07,0,10,3,2011,34876.0,21240.0,23966.0,25572.0,25611.857143,5246.212997,24709.142857,4575.257939,25427.535714,4769.686405


In [26]:
y.head(15)

date
2011-02-26    29620.0
2011-02-27    29866.0
2011-02-28    21449.0
2011-03-01    19581.0
2011-03-02    18928.0
2011-03-03    21742.0
2011-03-04    28309.0
2011-03-05    33478.0
2011-03-06    33058.0
2011-03-07    24852.0
2011-03-08    23581.0
2011-03-09    22656.0
2011-03-10    22901.0
2011-03-11    24754.0
2011-03-12    31745.0
Name: target_t_plus_28, dtype: float64

In [29]:
model_df.index

DatetimeIndex(['2011-02-26', '2011-02-27', '2011-02-28', '2011-03-01',
               '2011-03-02', '2011-03-03', '2011-03-04', '2011-03-05',
               '2011-03-06', '2011-03-07',
               ...
               '2016-03-18', '2016-03-19', '2016-03-20', '2016-03-21',
               '2016-03-22', '2016-03-23', '2016-03-24', '2016-03-25',
               '2016-03-26', '2016-03-27'],
              dtype='datetime64[ns]', name='date', length=1857, freq=None)

In [32]:
model_df["dayofweek"] = model_df.index.dayofweek
model_df["week"] = model_df.index.isocalendar().week.astype(int)
model_df["month"] = model_df.index.month
model_df["year"] = model_df.index.year

## Time-Aware Train / Validation Split

In [33]:
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(X), 1):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    print(
        f"Fold {fold}: "
        f"Train end = {X_train.index[-1].date()}, "
        f"Val start = {X_val.index[0].date()}"
    )

Fold 1: Train end = 2012-01-03, Val start = 2012-01-04
Fold 2: Train end = 2012-11-07, Val start = 2012-11-08
Fold 3: Train end = 2013-09-12, Val start = 2013-09-13
Fold 4: Train end = 2014-07-18, Val start = 2014-07-19
Fold 5: Train end = 2015-05-23, Val start = 2015-05-24
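
One caveat with direct targets: the target of training row t is sales at t + h, so without a buffer the last h training rows carry labels dated inside the validation window. A minimal sketch of one way to handle this, assuming scikit-learn 0.24 or later (where `TimeSeriesSplit` accepts `gap`). The array `X_demo` is a hypothetical placeholder with the same row count as the h = 28 table, which is the `X` left over from the loop above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

h = 28         # forecast horizon in days
n_rows = 1857  # rows in the h = 28 table (from the index output above)

X_demo = np.zeros((n_rows, 1))  # placeholder; only the row count matters here

# Leave h rows out between train and validation so that the last
# training target (sales at t + h) is dated before the first validation row
tscv_gap = TimeSeriesSplit(n_splits=5, gap=h)

separations = []
for fold, (train_idx, val_idx) in enumerate(tscv_gap.split(X_demo), 1):
    separations.append(val_idx[0] - train_idx[-1])
    print(f"Fold {fold}: train rows 0..{train_idx[-1]}, val rows {val_idx[0]}..{val_idx[-1]}")
```

The validation folds stay where they were; only the tail of each training fold is trimmed. Whether the buffer should be h or h - 1 rows depends on whether sales on day t count as known at row t; this notebook treats them as unknown, hence h.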


## Exporting Parquet Files and Metadata

In [41]:
from pathlib import Path
import sys

# Locate project root
PROJECT_ROOT = Path.cwd()
while not (PROJECT_ROOT / "config").exists():
    PROJECT_ROOT = PROJECT_ROOT.parent

sys.path.append(str(PROJECT_ROOT))

from config.paths import DATA_DIR

DATA_DIR.mkdir(parents=True, exist_ok=True)

HORIZONS = [1, 7, 14, 28]

FEATURE_SCHEMA_SAVED = False

for h in HORIZONS:
    TARGET_COL = f"target_t_plus_{h}"

    model_df = feature_df.dropna(subset=[TARGET_COL]).copy()

    # Drop other horizon targets
    other_targets = [
        col for col in model_df.columns
        if col.startswith("target_t_plus_") and col != TARGET_COL
    ]
    model_df = model_df.drop(columns=other_targets)

    FEATURE_COLS = [
        col for col in model_df.columns
        if col not in ["sales", TARGET_COL]
    ]

    # Save feature schema ONCE
    if not FEATURE_SCHEMA_SAVED:
        pd.Series(FEATURE_COLS).to_csv(
            DATA_DIR / "feature_schema.csv",
            index=False
        )
        FEATURE_SCHEMA_SAVED = True

    model_df.to_parquet(
        DATA_DIR / f"model_df_h{h}.parquet",
        index=True
    )

    metadata = {
        "horizon": h,
        "n_rows": len(model_df),
        "start_date": model_df.index.min(),
        "end_date": model_df.index.max(),
        "n_features": len(FEATURE_COLS)
    }

    pd.Series(metadata).to_csv(
        DATA_DIR / f"metadata_h{h}.csv"
    )

## Sanity Validation

In [42]:
for h in HORIZONS:
    df = pd.read_parquet(DATA_DIR / f"model_df_h{h}.parquet")
    print(
        f"Horizon {h}: "
        f"rows={df.shape[0]}, "
        f"features={df.shape[1]-2}, "
        f"last_date={df.index.max().date()}"
    )

Horizon 1: rows=1884, features=14, last_date=2016-04-23
Horizon 7: rows=1878, features=14, last_date=2016-04-17
Horizon 14: rows=1871, features=14, last_date=2016-04-10
Horizon 28: rows=1857, features=14, last_date=2016-03-27
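
The row counts printed above follow from simple arithmetic. The 28-day lag and the shifted 28-day rolling window both leave the first 28 rows incomplete, and shifting the target by -h leaves the last h rows without a target. A small check, with the constants copied from the outputs in this notebook (the function name `expected_rows` is hypothetical):

```python
N_RAW = 1913       # rows in daily_sales, from .info()
MAX_LOOKBACK = 28  # largest lag and largest (shifted) rolling window

def expected_rows(h: int) -> int:
    """Rows left after dropping the warm-up period and the last h rows."""
    return N_RAW - MAX_LOOKBACK - h

expected = {h: expected_rows(h) for h in [1, 7, 14, 28]}
print(expected)
```

A mismatch between these numbers and the saved tables would point to unexpected missing values or an altered lag configuration.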


## Session 3 – Final Insights

1. Direct target construction avoids recursive error accumulation.
2. Targets must be shifted forward to represent future demand.
3. Feature matrices must exclude current sales and the target columns of every horizon.
4. TimeSeriesSplit keeps each validation fold strictly after its training fold, unlike a random split.
5. Explicit feature tables improve reproducibility and reliability.

## Common Pitfalls Avoided

- Shifting targets in the wrong direction
- Including current sales as a feature
- Recursive forecasting without justification
- Manual or random data splitting