# Modeling vs. Baseline Comparison

The goal of this notebook is to evaluate whether machine learning models provide meaningful improvements over simple, well-designed baselines for weekly sales forecasting. Rather than immediately optimizing a complex model, this notebook focuses on understanding how much predictive signal is already captured by historical sales patterns.

Several baseline approaches are constructed using lagged sales features, including short-term momentum and yearly seasonality. These baselines are then compared against an initial machine learning model using a time-based validation split. This comparison helps determine whether additional model complexity is justified and identifies the primary drivers of sales behavior in the data.


In [1]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_absolute_error
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

train_df = pd.read_csv("../data/interim/model_features.csv", parse_dates=["Date"])
train_df = train_df.sort_values(["Store","Dept","Date"]).reset_index(drop=True)
train_df.head()

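| Unnamed: 0 | Store | Dept | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | ... | sales_lag_1 | sales_lag_2 | sales_roll_4 | Year | Holiday_Lead | Holiday_Lag | Sin_Month | Cos_Month | Sin_Week | Cos_Week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | 0 | 42.31 | 2.572 | NaN | NaN | NaN | ... | NaN | NaN | NaN | 2010 | 1 | 0 | 0.866025 | 0.5 | 0.568065 | 0.822984 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | 1 | 38.51 | 2.548 | NaN | NaN | NaN | ... | 24924.5 | NaN | NaN | 2010 | 0 | 0 | 0.866025 | 0.5 | 0.663123 | 0.748511 |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | 0 | 39.93 | 2.514 | NaN | NaN | NaN | ... | 46039.49 | 24924.5 | NaN | 2010 | 0 | 1 | 0.866025 | 0.5 | 0.748511 | 0.663123 |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | 0 | 46.63 | 2.561 | NaN | NaN | NaN | ... | 41595.55 | 46039.49 | NaN | 2010 | 0 | 0 | 0.866025 | 0.5 | 0.822984 | 0.568065 |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | 0 | 46.5 | 2.625 | NaN | NaN | NaN | ... | 19403.54 | 41595.55 | 32990.77 | 2010 | 0 | 0 | 1.0 | 6.123234000000001e-17 | 0.885456 | 0.464723 |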


In [2]:
# Build per-store markdown availability rates using only dates after the markdown rollout (based on EDA)
markdown_cols = ["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"]

md_start_date = train_df.loc[
    train_df[markdown_cols].notna().any(axis=1), "Date"
].min()


# Same cutoff as the train/validation split, so availability rates use training-period data only
cutoff_date = pd.Timestamp("2012-02-01")

md_era = train_df[(train_df["Date"] >= md_start_date) & (train_df["Date"] < cutoff_date)].copy()

# store-week level presence: any dept non-null counts as "available"
store_week_md = (
    md_era.groupby(["Store","Date"])[markdown_cols]
    .apply(lambda g: g.notna().any())
    .reset_index()
)

md_availability = (
    store_week_md.groupby("Store")[markdown_cols]
    .mean()
    .reset_index()
)

# Rename to availability-rate columns and merge back onto the training data
md_availability.columns = ["Store"] + [f"{c}_avail_rate" for c in markdown_cols]

train_df = train_df.merge(md_availability, on="Store", how="left")

avail_cols = [f"{c}_avail_rate" for c in markdown_cols]
train_df[avail_cols] = train_df[avail_cols].fillna(0)


## Markdown Availability Feature Construction

Markdown data is sparse and inconsistently reported across departments. To handle this, markdown availability is defined at the store–week level: if any department in a store reports a markdown in a given week, that markdown is treated as available for the store that week.

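Availability rates are then computed per store as the fraction of weeks, between the first reported markdown and the validation cutoff, in which each markdown type was available. These rates are merged back into the training data, and stores with no rate are assigned 0. This approach preserves information about promotional coverage while avoiding misleading zeros caused by missing or unreported markdown values.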
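
As a minimal sketch of the store–week logic described above, the toy example below computes an availability rate for a single markdown column. The data values are hypothetical and chosen only to make the result easy to check by hand; the notebook applies the same idea to all five markdown columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): two stores, two weeks, two departments each.
toy = pd.DataFrame({
    "Store": [1, 1, 1, 1, 2, 2, 2, 2],
    "Dept":  [1, 2, 1, 2, 1, 2, 1, 2],
    "Date":  pd.to_datetime(
        ["2011-11-11", "2011-11-11", "2011-11-18", "2011-11-18"] * 2
    ),
    "MarkDown1": [100.0, np.nan, np.nan, np.nan, 50.0, np.nan, np.nan, 75.0],
})

# Store-week presence: True if any department reported MarkDown1 that week.
store_week = (
    toy["MarkDown1"].notna()
    .groupby([toy["Store"], toy["Date"]])
    .any()
    .astype(float)
)

# Per-store availability rate: share of weeks with at least one report.
avail_rate = store_week.groupby(level="Store").mean()
print(avail_rate)
```

Store 1 reports the markdown in one of its two weeks and Store 2 in both, so the two rates differ even though most department-level values are missing in both stores.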


In [3]:
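# Build presence indicators before filling, so "missing" stays distinguishable from a true zero (based on EDA)
for c in markdown_cols:
    train_df[f"{c}_present"] = train_df[c].notna().astype(int)

# Fill missing markdowns with 0 only after the indicators are recorded
train_df[markdown_cols] = train_df[markdown_cols].fillna(0)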


In [18]:
# Build a 52-week lag feature for a weighted baseline that the model from the first run has to beat
train_df = train_df.sort_values(["Store", "Dept", "Date"])

train_df["sales_lag_52"] = (
    train_df.groupby(["Store", "Dept"])["Weekly_Sales"]
            .shift(52)
)


In [19]:
# Define a time-based split at the cutoff date
cutoff_date = pd.Timestamp("2012-02-01")

train = train_df[train_df["Date"] < cutoff_date].copy()
val   = train_df[train_df["Date"] >= cutoff_date].copy()

In [5]:
# Drop rows with missing lag features from the training set only
# (train is already limited to dates before the cutoff)
lag_cols = ["sales_lag_1","sales_lag_2","sales_roll_4"]
train_clean = train.dropna(subset=lag_cols)


In [6]:
# Naive baseline MAE: assume this week's sales equal last week's sales (lag-1)
val_base = val.copy()
val_base["y_pred_naive"] = val_base["sales_lag_1"]

mask = val_base["y_pred_naive"].notna()

mae_naive = mean_absolute_error(
    val_base.loc[mask,"Weekly_Sales"],
    val_base.loc[mask,"y_pred_naive"]
)

mae_naive


1676.1818784559068

## Seasonal Lag Baseline Construction

To capture strong yearly seasonality observed in the data, a 52-week lag feature is constructed at the store–department level. This feature represents sales from the same week in the prior year and is used to build a stronger baseline model.

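A time-based split at 2012-02-01 then separates the training and validation periods. The first baseline simply predicts that each week's sales equal the previous week's sales (lag-1), which gives a validation MAE of about 1,676. This is the reference point for the more complex models that follow.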
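
The lag and split described above can be illustrated on toy data. This is a sketch under stated assumptions: the two series and their sales values are hypothetical, and each series has no missing weeks. That second assumption matters because `shift(52)` counts rows rather than calendar weeks, so it lines up with the same week of the prior year only when a series has no gaps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two store-department series, 54 weekly rows each.
dates = pd.date_range("2010-02-05", periods=54, freq="7D")
toy = pd.DataFrame({
    "Store": 1,
    "Dept": np.repeat([1, 2], len(dates)),
    "Date": np.tile(dates.values, 2),
    "Weekly_Sales": np.arange(108, dtype=float),
})

# Shift within each store-department group so values never cross series.
toy = toy.sort_values(["Store", "Dept", "Date"])
toy["sales_lag_52"] = toy.groupby(["Store", "Dept"])["Weekly_Sales"].shift(52)

# Only the last two weeks of each series have a value from 52 rows earlier.
n_valid = int(toy["sales_lag_52"].notna().sum())

# Time-based split: rows on or after the cutoff are held out for validation.
cutoff = dates[52]
train_toy = toy[toy["Date"] < cutoff]
val_toy = toy[toy["Date"] >= cutoff]
print(n_valid, len(train_toy), len(val_toy))
```

Because the shift is grouped, the first 52 rows of the second series are null rather than being filled with values from the first series.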


In [7]:
# Build a random forest regressor to compare against the baseline
feature_cols = [
    "sales_lag_1","sales_lag_2","sales_roll_4",
    "Sin_Week","Cos_Week","Sin_Month","Cos_Month","Year",
    "IsHoliday","Holiday_Lag","Holiday_Lead",
    "Temperature","Fuel_Price","CPI","Unemployment",
    "Size","Store","Dept","Type",
    "MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5",
    "MarkDown1_present","MarkDown2_present","MarkDown3_present","MarkDown4_present","MarkDown5_present",
    "MarkDown1_avail_rate","MarkDown2_avail_rate","MarkDown3_avail_rate","MarkDown4_avail_rate","MarkDown5_avail_rate"
]

#split features by dtype for preprocessing
categorical = ["Store","Dept","Type"]
numeric = [c for c in feature_cols if c not in categorical]

X_train = train_clean[feature_cols]
y_train = train_clean["Weekly_Sales"]
X_val = val[feature_cols]
y_val = val["Weekly_Sales"]

# Minimal preprocessing for now; this will become more intricate for the final model
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Use reasonable default hyperparameters for now; a grid search comes later
rf = RandomForestRegressor(
    n_estimators=200,
    max_depth=12,
    min_samples_leaf=3,
    random_state=42,
    n_jobs=-1
)

# Fit the pipeline on the training data, then score on the full validation set
model = Pipeline([("prep", preprocess), ("rf", rf)])
model.fit(X_train, y_train)

y_pred = model.predict(X_val)
mean_absolute_error(y_val, y_pred)


1517.2076850574074

In [20]:
# Evaluate a stronger weighted baseline that combines the 1-week lag and the 52-week lag
# valid_mask keeps only rows where both lags are available
valid_mask = val["sales_lag_1"].notna() & val["sales_lag_52"].notna()

hybrid_pred = (
    0.7 * val.loc[valid_mask, "sales_lag_1"] +
    0.3 * val.loc[valid_mask, "sales_lag_52"]
)

mae_hybrid = mean_absolute_error(
    val.loc[valid_mask, "Weekly_Sales"],
    hybrid_pred
)

mae_hybrid

1474.8523001528415

## Baseline vs Model Performance

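I evaluated a stronger baseline that combines recent sales (lag-1) and yearly seasonality (lag-52) with fixed weights of 0.7 and 0.3. This simple hybrid baseline reached an MAE of about 1,475, compared with about 1,517 for the Random Forest and about 1,676 for the naive lag-1 baseline, so it outperformed the Random Forest by roughly 3% on MAE.

One caveat: the three scores are not computed on identical rows. Each baseline is scored only on validation rows where its lag inputs are available, while the Random Forest is scored on every validation row, so the gaps should be read as indicative rather than exact.

This result suggests that most of the predictable signal in weekly sales is driven by temporal patterns, especially short-term momentum and annual seasonality. The Random Forest had access to markdowns, holidays, and macro variables but not to the 52-week lag, and it improved on the naive baseline by only about 9%. Those additional features therefore provided limited incremental improvement beyond the lags.

This is a common outcome in retail demand forecasting and highlights the importance of using strong baselines when evaluating more complex models.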
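
One way to make a baseline-versus-model comparison like the one above stricter is to score every predictor on the same rows. The sketch below assumes a hypothetical helper, `mae_on_common_rows`, which is not part of the notebook, and uses toy values in place of the validation data. It computes MAE directly with pandas rather than `mean_absolute_error`, so the block has no scikit-learn dependency.

```python
import numpy as np
import pandas as pd

def mae_on_common_rows(y_true, preds):
    """Return MAE per predictor, scored only on rows where every predictor is non-null.

    Hypothetical helper for illustration; not part of the notebook.
    """
    frame = pd.DataFrame(preds)          # aligns all predictions on the index
    common = frame.notna().all(axis=1)   # rows that every predictor can score
    errors = frame.loc[common].sub(y_true[common], axis=0).abs()
    return errors.mean().to_dict()

# Toy values (hypothetical) standing in for the validation data.
y = pd.Series([100.0, 110.0, 120.0, 130.0])
lag_1 = pd.Series([np.nan, 100.0, 110.0, 120.0])
lag_52 = pd.Series([90.0, 100.0, np.nan, 140.0])
model_pred = pd.Series([105.0, 108.0, 119.0, 131.0])

scores = mae_on_common_rows(y, {
    "naive": lag_1,
    "hybrid": 0.7 * lag_1 + 0.3 * lag_52,
    "model": model_pred,
})
print(scores)
```

In the notebook, the same helper could be called with `val["Weekly_Sales"]`, the two lag-based predictions, and `pd.Series(y_pred, index=val.index)`, so that all three MAE values refer to an identical set of store-department weeks.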
