<h1 style='color:Green'>Baseline Modeling</h1>
<h3>📌 Objectives</h3>
<pre>
  Establish simple, interpretable benchmark models to:
  - Validate feature quality
  - Create RMSLE reference scores
  - Compare against advanced ML/DL later

  No heavy tuning, no leakage, no deep learning.
</pre>

<h2 style='color:Green'>Load Feature Data</h2>
<h3>🎯 Goal</h3>
<pre>
  - Load engineered features
  - Drop only lag/rolling NaNs
  - Keep time order intact
</pre>

<h3>Clone GitHub Repository</h3>

In [19]:
# Clone GitHub Repository
!git clone https://github.com/sabin74/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform.git

Cloning into 'Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 36 (delta 6), reused 26 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 21.62 MiB | 36.41 MiB/s, done.
Resolving deltas: 100% (6/6), done.
Filtering content: 100% (11/11), 315.52 MiB | 52.06 MiB/s, done.


## Set Project Root

Changing into the cloned repository keeps relative paths identical to the local setup.

In [20]:
# Set PROJECT_ROOT
import os

PROJECT_ROOT = "/content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform"
os.chdir(PROJECT_ROOT)

print("Current directory:", os.getcwd())


Current directory: /content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform


## Load Feature Data
🎯 Goal
 - Load engineered data
 - Drop only lag/rolling NaNs
 - Preserve time order
 - Avoid memory waste
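
The "avoid memory waste" goal is not spelled out in the notebook. One common approach, shown here as a minimal sketch on a toy frame, is to downcast numeric columns after loading. The helper name `downcast_numeric` is hypothetical and not part of the repository, and `float32` loses some precision, so whether it suits the real feature table is an assumption to verify.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with float64 cast to float32 and int64 cast to the smallest safe integer type."""
    out = df.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].astype("float32")
    for col in out.select_dtypes(include="int64").columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Toy stand-in for the feature table (not the real data)
toy = pd.DataFrame({
    "store_nbr": np.arange(1000, dtype="int64") % 54 + 1,
    "sales_log": np.random.default_rng(0).random(1000),
})

small = downcast_numeric(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

For a wide parquet file, passing `columns=[...]` to `pd.read_parquet` is another way to avoid loading data that the baselines never use.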

In [21]:
# Import libraries
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_log_error

# Silence library warnings to keep notebook output readable
warnings.filterwarnings("ignore")

In [22]:
# Load Feature Dataset from Repo
FEATURE_DATA = Path("data/features")

train = pd.read_parquet(FEATURE_DATA / "train_features.parquet")


In [23]:
# Convert Date and Sort
train['date'] = pd.to_datetime(train['date'])

train = train.sort_values(
    ['store_nbr', 'family', 'date']
).reset_index(drop=True)

In [24]:
# Identify Lag and Rolling Columns
lag_roll_cols = [
    col for col in train.columns
    if ("lag" in col) or ("roll" in col)
]

In [25]:
# Drop NaN in lag/rolling features
train_model = train.dropna(
    subset = lag_roll_cols
).reset_index(drop=True)
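
To illustrate why `dropna` is restricted to the lag/rolling columns, here is a small sketch on toy data (column values are made up). A grouped `shift` leaves NaNs at the start of each series, and `subset=` removes only those warm-up rows while keeping rows that have gaps in other columns.

```python
import pandas as pd

# Toy data: two stores, four days each, with one unrelated gap in onpromotion
toy = pd.DataFrame({
    "store_nbr": [1] * 4 + [2] * 4,
    "date": list(pd.date_range("2017-01-01", periods=4)) * 2,
    "sales_log": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "onpromotion": [0, 1, None, 0, 0, 0, 1, 1],
})

# A lag of 2 leaves the first two rows of every store without a value
toy["sales_lag_2"] = toy.groupby("store_nbr")["sales_log"].shift(2)

lag_cols = [c for c in toy.columns if "lag" in c or "roll" in c]
kept = toy.dropna(subset=lag_cols).reset_index(drop=True)

print(lag_cols)   # only the lag column is used for filtering
print(len(kept))  # warm-up rows are gone, the onpromotion gap survives
```

Note that the substring test also matches any column whose name merely contains "lag" or "roll" (for example a hypothetical `holiday_flag`), so it is worth printing `lag_roll_cols` once to confirm the selection.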

## Define Target & Feature Sets
### Target Options
  - sales_log (preferred for RMSLE)
### Feature Sets
  - Time features
  - Lag features
  - Rolling features
  - Promotions
  - Oil
  - Encodings

In [26]:
# Target Variable
TARGET = 'sales_log'

In [27]:
# Define Feature Groups
TIME_FEATURES = [
    "day", "week_of_year", "monty", "year",
    "day_of_week", "is_weekend", "is_payday"
]

LAG_FEATURES = [
    "sales_lag_1", "sales_lag_7",
    "sales_lag_14", "sales_lag_28"
]

ROLL_FEATURES = [
    "sales_roll_mean_7", "sales_roll_mean_14", "sales_roll_mean_28",
    "sales_roll_std_7", "sales_roll_std_14", "sales_roll_std_28"
]

PROMO_FEATURES = [
    "onpromotion",
    "promo_lag_1", "promo_lag_7",
    "promo_freq_7", "promo_freq_14", "promo_freq_28",
    "promo_roll_sum_7", "promo_roll_sum_14", "promo_roll_sum_28"
]

OIL_FEATURES = [
    "dcoilwtico", "oil_lag_7", "oil_lag_14", "oil_lag_28"
]

ENCODING_FEATURES = [
    "store_te", "family_te"
]


In [28]:
# Final Feature List
FEATURES = (
    TIME_FEATURES + LAG_FEATURES + ROLL_FEATURES +
    PROMO_FEATURES + OIL_FEATURES + ENCODING_FEATURES
)

In [29]:
# Check which features actually exist
missing_features = [f for f in FEATURES if f not in train_model.columns]
print(f"Total features in list: {len(FEATURES)}")
print("Features missing from train_model:")
for f in missing_features:
    print(f"  - {f}")

Total features in list: 32
Features missing from train_model:


In [30]:
# Prepare Modeling Matrices
X = train_model[FEATURES]
y = train_model[TARGET]

## Time-Based Validation Strategy
**Why**: Time series ≠ random split. Shuffling rows would let future observations leak into training.
### Validation Method
  - Last N days as validation
  - Example:
    - Train: up to 2017-07-15
    - Valid: 2017-07-16 → 2017-08-15
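
The cutoff-based split described above can be sketched on a toy frame as follows. The helper `time_split` is a hypothetical name used only for illustration; the notebook itself builds boolean masks directly.

```python
import pandas as pd


def time_split(df: pd.DataFrame, date_col: str, train_end: str):
    """Split so every training row is dated on or before the cutoff and every validation row after it."""
    cutoff = pd.Timestamp(train_end)
    train_part = df[df[date_col] <= cutoff]
    valid_part = df[df[date_col] > cutoff]
    return train_part, valid_part


# Ten consecutive days around the cutoff (toy values)
toy = pd.DataFrame({
    "date": pd.date_range("2017-07-10", periods=10),
    "sales_log": range(10),
})

tr, va = time_split(toy, "date", "2017-07-15")
print(tr["date"].max().date(), "|", va["date"].min().date())
```

Because the split is by date rather than by row index, it stays correct even when the frame is sorted by store and family first, as it is in this notebook.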

In [31]:
# Define Validation Cutoff Date
TRAIN_END_DATE = "2017-07-15"

In [32]:
# Create Time Based Split
train_mask = train_model['date'] <= TRAIN_END_DATE
valid_mask = train_model['date'] > TRAIN_END_DATE

X_train = X[train_mask]
y_train = y[train_mask]

X_valid = X[valid_mask]
y_valid = y[valid_mask]

In [33]:
# Also keep the full train_model rows for validation (for accessing lag columns)
train_valid = train_model[valid_mask].reset_index(drop=True)

print("DATA SPLIT SUMMARY:\n")
print(f"Train dates: {train_model[train_mask]['date'].min().date()} to {train_model[train_mask]['date'].max().date()}")
print(f"Validation dates: {train_model[valid_mask]['date'].min().date()} to {train_model[valid_mask]['date'].max().date()}")
print(f"Train samples: {len(X_train):,}")
print(f"Validation samples: {len(X_valid):,}")

DATA SPLIT SUMMARY:

Train dates: 2013-01-29 to 2017-07-15
Validation dates: 2017-07-16 to 2017-08-15
Train samples: 2,949,210
Validation samples: 55,242


## BASELINE MODEL 1 – NAIVE & MOVING AVERAGE
### Goal
  - Establish simple benchmark scores
  - Validate feature correctness
  - Create RMSLE reference

In [45]:
# Define Evaluation Metric (RMSLE)
def rmsle(y_true, y_pred):
    """
    Root Mean Squared Log Error on original-scale values.
    Negative inputs are clipped to 0 so that log1p is defined.
    """
    y_true = np.maximum(0, y_true)
    y_pred = np.maximum(0, y_pred)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
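
As a worked check of what this metric measures, the NumPy-only sketch below computes RMSLE from its definition and compares it with plain RMSE taken in log space. It assumes `sales_log` was built as `log1p(sales)`, which the notebook's use of `expm1` suggests but does not state. The function names `rmsle_np` and `rmse` are illustrative.

```python
import numpy as np


def rmsle_np(y_true, y_pred) -> float:
    """RMSLE from the definition: sqrt(mean((log1p(pred) - log1p(true))^2))."""
    y_true = np.maximum(0, np.asarray(y_true, dtype=float))
    y_pred = np.maximum(0, np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def rmse(a, b) -> float:
    """Plain root mean squared error."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))


# Toy sales on the original scale
sales_true = np.array([0.0, 9.0, 99.0])
sales_pred = np.array([0.0, 9.0, 9.0])

score_original = rmsle_np(sales_true, sales_pred)
score_log = rmse(np.log1p(sales_true), np.log1p(sales_pred))
print(score_original, score_log)
```

The two numbers agree, which is why a model trained on a `log1p` target can be scored with RMSE in log space and still be comparable to RMSLE on the original scale.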



In [46]:
# Prepare Validation Ground Truth
y_valid_true = np.expm1(y_valid)

In [47]:
# Extract Prediction (log-space)
naive_pred_log = X_valid["sales_lag_1"]


In [48]:
# Convert Back to Original Scale
naive_pred = np.expm1(naive_pred_log)


In [51]:
# Define a safe inverse transform
def safe_expm1(x, clip_max=20):
    """
    Safely inverse log transform:
    - clips extreme log values
    - prevents overflow
    """
    x = np.clip(x, a_min=None, a_max=clip_max)
    return np.expm1(x)


In [52]:
# Evaluate RMSLE
naive_pred_log = X_valid["sales_lag_1"]
naive_pred = safe_expm1(naive_pred_log)

naive_rmsle = rmsle(y_valid_true, naive_pred)

print(f"Naive Forecast RMSLE: {naive_rmsle:.5f}")


Naive Forecast RMSLE: 11.49610
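
An RMSLE above 11 corresponds to predictions that are off by many orders of magnitude, which is unusual for a lag-1 forecast. One possible explanation, offered as an assumption and not confirmed anywhere in this notebook, is a scale mismatch: if `sales_lag_1` stores raw sales instead of `log1p(sales)`, applying `expm1` to it saturates at the clip value. The toy sketch below reproduces that effect with synthetic numbers only.

```python
import numpy as np


def rmsle_np(y_true, y_pred) -> float:
    """RMSLE computed directly with NumPy."""
    y_true = np.maximum(0, np.asarray(y_true, dtype=float))
    y_pred = np.maximum(0, np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def safe_expm1(x, clip_max=20):
    """Same idea as the notebook helper: clip, then invert log1p."""
    return np.expm1(np.clip(x, None, clip_max))


rng = np.random.default_rng(0)
sales_today = rng.uniform(50, 500, size=1000)                       # synthetic, original scale
sales_yesterday = sales_today * rng.uniform(0.8, 1.25, size=1000)   # synthetic lag-1 values

# Case A: the lag column is stored in log space, so expm1 is the right inverse
score_a = rmsle_np(sales_today, safe_expm1(np.log1p(sales_yesterday)))

# Case B: the lag column is stored as raw sales, so expm1 inflates every value to the clip ceiling
score_b = rmsle_np(sales_today, safe_expm1(sales_yesterday))

print(f"log-space lag: {score_a:.3f}")
print(f"raw-scale lag: {score_b:.3f}")
```

In the notebook, comparing `train_model["sales_lag_1"].describe()` with `train_model["sales_log"].describe()` would show whether the two columns live on the same scale. If the lag is on the raw scale, it should be passed to `rmsle` directly, without `safe_expm1`.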


## BASELINE MODEL 1 – MOVING AVERAGE
### Predict sales using rolling mean of past sales

In [59]:
# 7 day, 14 day & 28 day Average
ma7_pred = safe_expm1(X_valid["sales_roll_mean_7"])
ma14_pred = safe_expm1(X_valid["sales_roll_mean_14"])
ma28_pred = safe_expm1(X_valid["sales_roll_mean_28"])

ma7_rmsle = rmsle(y_valid_true, ma7_pred)
ma14_rmsle = rmsle(y_valid_true, ma14_pred)
ma28_rmsle = rmsle(y_valid_true, ma28_pred)

print(f"MA-7 RMSLE:  {ma7_rmsle:.5f}")
print(f"MA-14 RMSLE: {ma14_rmsle:.5f}")
print(f"MA-28 RMSLE: {ma28_rmsle:.5f}")


MA-7 RMSLE:  11.57560
MA-14 RMSLE: 11.57804
MA-28 RMSLE: 11.58401
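
The moving-average baselines read precomputed `sales_roll_mean_*` columns, and this notebook does not show how they were built. A leak-free construction typically shifts the series by one day before rolling, so that each row only sees earlier sales. The sketch below illustrates that pattern on toy data with a window of 3; the column name `roll_mean_3` and the choice to roll over `sales_log` are assumptions.

```python
import pandas as pd

# One store/family series with five days of toy log-sales
toy = pd.DataFrame({
    "store_nbr": [1] * 5,
    "family": ["GROCERY I"] * 5,
    "date": pd.date_range("2017-08-01", periods=5),
    "sales_log": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Shift first so the current day's sales never enter its own feature, then average
toy["roll_mean_3"] = (
    toy.groupby(["store_nbr", "family"])["sales_log"]
       .transform(lambda s: s.shift(1).rolling(3).mean())
)

print(toy[["date", "sales_log", "roll_mean_3"]])
```

If the real rolling means were computed over raw sales instead of `sales_log`, the same scale caveat applies as for the naive forecast: `safe_expm1` would be the wrong inverse, which could also explain RMSLE values above 11.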
