<h1 style='color:Green'>Baseline Modeling</h1>
<h3>📌 Objectives</h3>
<pre>
  Establish simple, interpretable benchmark models to:
  - Validate feature quality
  - Create RMSLE reference scores
  - Compare against advanced ML/DL later

  No heavy tuning, no leakage, no deep learning.
</pre>

<h2 style='color:Green'>Load Feature Data</h2>
<h3>🎯 Goal</h3>
<pre>
  - Load engineered features
  - Drop only lag/rolling NaNs
  - Keep time order intact
</pre>

<h3>Clone GitHub Repository</h3>

In [19]:
# Clone GitHub Repository
!git clone https://github.com/sabin74/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform.git

Cloning into 'Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 36 (delta 6), reused 26 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 21.62 MiB | 36.41 MiB/s, done.
Resolving deltas: 100% (6/6), done.
Filtering content: 100% (11/11), 315.52 MiB | 52.06 MiB/s, done.


## Set Project Root

Changing into the cloned repository keeps relative paths identical to the local setup.

In [20]:
# Set PROJECT_ROOT
import os

PROJECT_ROOT = "/content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform"
os.chdir(PROJECT_ROOT)

print("Current directory:", os.getcwd())


Current directory: /content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform


## Load Feature Data
🎯 Goal
 - Load engineered data
 - Drop only lag/rolling NaNs
 - Preserve time order
 - Avoid memory waste

In [60]:
# Import and Load Data
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")
from pathlib import Path

In [22]:
# Load Feature Dataset from Repo
FEATURE_DATA = Path("data/features")

train = pd.read_parquet(FEATURE_DATA / "train_features.parquet")


In [23]:
# Convert Date and Sort
train['date'] = pd.to_datetime(train['date'])

train = train.sort_values(
    ['store_nbr', 'family', 'date']
).reset_index(drop=True)

In [24]:
# Identify Lag and Rolling Columns
lag_roll_cols = [
    col for col in train.columns
    if ("lag" in col) or ("roll" in col)
]

In [25]:
# Drop NaN in lag/rolling features
train_model = train.dropna(
    subset = lag_roll_cols
).reset_index(drop=True)
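
To illustrate why only the lag/rolling columns are passed to `dropna`, here is a minimal sketch on a toy frame. The column `sales_lag_2` and the toy values are made up for illustration and are not part of the project data. A lag of 2 is undefined for the first two rows of each series, so exactly those rows are dropped while every other row survives.

```python
import pandas as pd

# Toy frame: two hypothetical stores, four days each (illustrative values only).
toy = pd.DataFrame({
    "store_nbr": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04"] * 2
    ),
    "sales": [10.0, 12.0, 11.0, 13.0, 5.0, 6.0, 7.0, 8.0],
})
toy = toy.sort_values(["store_nbr", "date"]).reset_index(drop=True)

# A lag-2 feature has no value for the first two rows of every store.
toy["sales_lag_2"] = toy.groupby("store_nbr")["sales"].shift(2)

# Same selection rule as above: any column whose name contains "lag" or "roll".
lag_cols = [c for c in toy.columns if ("lag" in c) or ("roll" in c)]
toy_model = toy.dropna(subset=lag_cols).reset_index(drop=True)

rows_dropped = len(toy) - len(toy_model)
print(f"Rows dropped: {rows_dropped}")
```

Restricting `subset` to these columns means NaNs in unrelated columns do not silently remove additional rows.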

## Define Target & Feature Sets
### Target Options
  - sales_log (preferred for RMSLE)
### Feature Sets
  - Time features
  - Lag features
  - Rolling features
  - Promotions
  - Oil
  - Encodings

In [26]:
# Target Variable
TARGET = 'sales_log'

In [27]:
# Define Feature Groups
TIME_FEATURES = [
    "day", "week_of_year", "monty", "year",
    "day_of_week", "is_weekend", "is_payday"
]

LAG_FEATURES = [
    "sales_lag_1", "sales_lag_7",
    "sales_lag_14", "sales_lag_28"
]

ROLL_FEATURES = [
    "sales_roll_mean_7", "sales_roll_mean_14", "sales_roll_mean_28",
    "sales_roll_std_7", "sales_roll_std_14", "sales_roll_std_28"
]

PROMO_FEATURES = [
    "onpromotion",
    "promo_lag_1", "promo_lag_7",
    "promo_freq_7", "promo_freq_14", "promo_freq_28",
    "promo_roll_sum_7", "promo_roll_sum_14", "promo_roll_sum_28"
]

OIL_FEATURES = [
    "dcoilwtico", "oil_lag_7", "oil_lag_14", "oil_lag_28"
]

ENCODING_FEATURES = [
    "store_te", "family_te"
]


In [28]:
# Final Feature List
FEATURES = (
    TIME_FEATURES + LAG_FEATURES + ROLL_FEATURES +
    PROMO_FEATURES + OIL_FEATURES + ENCODING_FEATURES
)

In [29]:
# Check which features actually exist
print(f"Total features in list: {len(FEATURES)}")
print(f"Features missing from train_model:")
missing_features = [f for f in FEATURES if f not in train_model.columns]
for f in missing_features:
    print(f"  - {f}")

Total features in list: 32
Features missing from train_model:


In [30]:
# Prepare Modeling Matrices (features X, target y)
X = train_model[FEATURES]
y = train_model[TARGET]

## Time-Based Validation Strategy
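**Why**: Time series ≠ random split. A random split would let the model train on dates that come after the validation rows, leaking future information.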
### Validation Method
  - Last N days as validation
  - Example:
    - Train: up to 2017-07-15
    - Valid: 2017-07-16 → 2017-08-15
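
As a rough sketch of the idea, the split can be wrapped in a small function that also checks that no training date falls on or after the first validation date. The helper name `time_split` and the toy frame are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

def time_split(df, date_col, train_end):
    """Hypothetical helper: rows up to and including train_end go to train."""
    cutoff = pd.Timestamp(train_end)
    train_part = df[df[date_col] <= cutoff]
    valid_part = df[df[date_col] > cutoff]
    # Leakage guard: every training date must precede every validation date.
    if not train_part.empty and not valid_part.empty:
        assert train_part[date_col].max() < valid_part[date_col].min()
    return train_part, valid_part

# Ten consecutive days around the cutoff (illustrative data only).
toy = pd.DataFrame({
    "date": pd.date_range("2017-07-10", periods=10, freq="D"),
    "sales": range(10),
})
train_part, valid_part = time_split(toy, "date", "2017-07-15")
print(len(train_part), len(valid_part))
```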

In [31]:
# Define Validation Cutoff Date
TRAIN_END_DATE = "2017-07-15"

In [32]:
# Create Time Based Split
train_mask = train_model['date'] <= TRAIN_END_DATE
valid_mask = train_model['date'] > TRAIN_END_DATE

X_train = X[train_mask]
y_train = y[train_mask]

X_valid = X[valid_mask]
y_valid = y[valid_mask]

In [33]:
# Also keep the full train_model rows for validation (for accessing lag columns)
train_valid = train_model[valid_mask].reset_index(drop=True)

print("DATA SPLIT SUMMARY:\n")
print(f"Train dates: {train_model[train_mask]['date'].min().date()} to {train_model[train_mask]['date'].max().date()}")
print(f"Validation dates: {train_model[valid_mask]['date'].min().date()} to {train_model[valid_mask]['date'].max().date()}")
print(f"Train samples: {len(X_train):,}")
print(f"Validation samples: {len(X_valid):,}")

DATA SPLIT SUMMARY:

Train dates: 2013-01-29 to 2017-07-15
Validation dates: 2017-07-16 to 2017-08-15
Train samples: 2,949,210
Validation samples: 55,242


## BASELINE MODEL 1: NAIVE FORECAST
### Goal
  - Establish simple benchmark scores
  - Validate feature correctness
  - Create RMSLE reference

In [45]:
## Define Evaluation Metrics (RMSLE)
def rmsle(y_true, y_pred):
    """
    Root Mean Squared Log Error
    """
    y_true = np.maximum(0, y_true)
    y_pred = np.maximum(0, y_pred)
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
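
For reference, RMSLE is the root mean squared error computed on `log1p`-transformed values. The sketch below spells that out with NumPy only; `rmsle_manual` is a hypothetical name used here for illustration, and the numbers are toy values. It also shows that under-predicting by a fixed amount is penalized more than over-predicting by the same amount.

```python
import numpy as np

def rmsle_manual(y_true, y_pred):
    """RMSLE written out: RMSE between log1p(y_pred) and log1p(y_true)."""
    y_true = np.maximum(0, np.asarray(y_true, dtype=float))
    y_pred = np.maximum(0, np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Perfect predictions give zero error.
perfect = rmsle_manual([0, 9, 99], [0, 9, 99])

# Predicting 0 when the truth is e - 1 gives a log error of exactly 1.
one_unit = rmsle_manual([np.e - 1], [0.0])

# Same absolute miss of 50 units, in opposite directions.
under = rmsle_manual([100], [50])
over = rmsle_manual([100], [150])

print(perfect, round(one_unit, 6), under > over)
```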



In [46]:
# Prepare Validation Ground Truth
y_valid_true = np.expm1(y_valid)

In [47]:
# Extract Prediction (log-space)
naive_pred_log = X_valid["sales_lag_1"]


In [48]:
# Convert Back to Original Scale
naive_pred = np.expm1(naive_pred_log)


In [76]:
# Define a safe inverse transform
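def safe_expm1(x, clip_max=1):
    """
    Safely inverse log transform:
    - clips extreme log values
    - prevents overflow

    Note: with the default clip_max=1, every prediction is capped at
    expm1(1) ≈ 1.72, which likely inflates the baseline RMSLE values.
    """
    x = np.clip(x, a_min=None, a_max=clip_max)
    return np.expm1(x)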
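
To make the effect of the clipping threshold concrete, the sketch below applies the same clip-then-`expm1` logic to a few toy log values with two thresholds. The function name `safe_expm1_sketch` and the alternative threshold of 20 are assumptions chosen for illustration, not values taken from this notebook.

```python
import numpy as np

def safe_expm1_sketch(x, clip_max):
    """Clip log-scale values from above, then invert log1p."""
    return np.expm1(np.clip(x, a_min=None, a_max=clip_max))

# Toy log-scale predictions (illustrative only).
log_values = np.array([0.0, 1.0, 3.0, 6.0])

tight = safe_expm1_sketch(log_values, clip_max=1)   # cap at expm1(1)
loose = safe_expm1_sketch(log_values, clip_max=20)  # assumed looser cap

print("clip_max=1 :", np.round(tight, 2))
print("clip_max=20:", np.round(loose, 2))
```

With a threshold of 1, every log value above 1 collapses to the same prediction, so the threshold should be chosen with the expected range of the target in mind.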


In [77]:
# Evaluate RMSLE
naive_pred_log = X_valid["sales_lag_1"]
naive_pred = safe_expm1(naive_pred_log)

naive_rmsle = rmsle(y_valid_true, naive_pred)

print(f"Naive Forecast RMSLE: {naive_rmsle:.5f}")


Naive Forecast RMSLE: 3.78181


## BASELINE MODEL 2: MOVING AVERAGE
### Predict sales using the rolling mean of past sales
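
The feature pipeline that produced the `sales_roll_mean_*` columns is not shown in this notebook. As a hedged sketch of one common way to build such a feature without leakage, the series is shifted by one day before the rolling window is applied, so each row only sees earlier sales. The column name `roll_mean_3` and the toy values are hypothetical.

```python
import pandas as pd

# Two hypothetical stores with five days of sales each (illustrative values).
toy = pd.DataFrame({
    "store_nbr": [1] * 5 + [2] * 5,
    "sales": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# shift(1) excludes the current day; rolling(3) then averages the 3 prior days.
toy["roll_mean_3"] = (
    toy.groupby("store_nbr")["sales"]
       .transform(lambda s: s.shift(1).rolling(window=3).mean())
)

print(toy)
```

The first three rows of each store have no complete window of past values, which is why such columns start with NaNs.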

In [73]:
# 7 day, 14 day & 28 day Average
ma7_pred = safe_expm1(X_valid["sales_roll_mean_7"])
ma14_pred = safe_expm1(X_valid["sales_roll_mean_14"])
ma28_pred = safe_expm1(X_valid["sales_roll_mean_28"])

ma7_rmsle = rmsle(y_valid_true, ma7_pred)
ma14_rmsle = rmsle(y_valid_true, ma14_pred)
ma28_rmsle = rmsle(y_valid_true, ma28_pred)

print(f"MA-7 RMSLE:  {ma7_rmsle:.5f}")
print(f"MA-14 RMSLE: {ma14_rmsle:.5f}")
print(f"MA-28 RMSLE: {ma28_rmsle:.5f}")


MA-7 RMSLE:  4.33326
MA-14 RMSLE: 4.33324
MA-28 RMSLE: 4.33324


## Baseline Model 3: Linear Regression (OLS)
### 🎯 Goal

#### Learn linear relationships between:
 - time patterns
 - past sales (lags & rolling)
 - promotions
 - oil price
 - store / product encodings
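
A minimal sketch of the same scale-then-fit pattern on synthetic data is shown below, using a scikit-learn pipeline so the scaler statistics come from the training rows only. The data, coefficients, and variable names are invented for illustration and do not reflect the project features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
true_coef = np.array([0.5, 0.002, 30.0])
y_toy = X_toy @ true_coef + 1.0  # noiseless linear target

# Ordered split: first 150 rows for training, last 50 for validation.
split = 150
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_toy[:split], y_toy[:split])

valid_pred = model.predict(X_toy[split:])
max_abs_error = float(np.max(np.abs(valid_pred - y_toy[split:])))
print(f"Max absolute validation error: {max_abs_error:.2e}")
```

Because the target here is exactly linear in the features, OLS recovers it almost perfectly; real sales data will not behave this way.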


In [61]:
# Inverse of the log transform applied to the target
def inverse_log(x):
    return np.expm1(x)

# Standardize features; fit the scaler on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)

# Fit OLS on the log-scale target
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_valid_pred_log = lr.predict(X_valid_scaled)

# Back-transform to the sales scale and clip negatives before scoring
y_valid_true = inverse_log(y_valid)
y_valid_pred = np.maximum(inverse_log(y_valid_pred_log), 0)
rmsle_lr = np.sqrt(
    mean_squared_log_error(y_valid_true, y_valid_pred)
)

print(f"Linear Regression RMSLE: {rmsle_lr:.4f}")


Linear Regression RMSLE: 0.9663


In [78]:
from sklearn.linear_model import Ridge

# Initialize Ridge (alpha=1.0 is default, higher alpha = more regularization)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)

# Predict and Inverse Transform
y_valid_pred_log_ridge = ridge.predict(X_valid_scaled)
y_valid_pred_ridge = np.maximum(np.expm1(y_valid_pred_log_ridge), 0)

# Calculate RMSLE
rmsle_ridge = np.sqrt(mean_squared_log_error(y_valid_true, y_valid_pred_ridge))

print(f"Ridge Regression RMSLE: {rmsle_ridge:.4f}")

Ridge Regression RMSLE: 0.9663


In [79]:
from sklearn.linear_model import Lasso

# Initialize Lasso
lasso = Lasso(alpha=0.01) # Lasso is sensitive to alpha; smaller values usually work better for log-data
lasso.fit(X_train_scaled, y_train)

# Predict and Inverse Transform
y_valid_pred_log_lasso = lasso.predict(X_valid_scaled)
y_valid_pred_lasso = np.maximum(np.expm1(y_valid_pred_log_lasso), 0)

# Calculate RMSLE
rmsle_lasso = np.sqrt(mean_squared_log_error(y_valid_true, y_valid_pred_lasso))

print(f"Lasso Regression RMSLE: {rmsle_lasso:.4f}")

Lasso Regression RMSLE: 0.9657


In [81]:
import xgboost as xgb

# Initialize XGBoost Regressor
# We use small learning_rate and more estimators for better generalization
xgb_model = xgb.XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    n_jobs=-1,
    random_state=42
)

# Fit model
xgb_model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    verbose=False
)

# Predict and Inverse Transform
y_pred_log_xgb = xgb_model.predict(X_valid)
y_pred_xgb = np.maximum(np.expm1(y_pred_log_xgb), 0)

# Calculate RMSLE
rmsle_xgb = np.sqrt(mean_squared_log_error(y_valid_true, y_pred_xgb))
print(f"XGBoost RMSLE: {rmsle_xgb:.4f}")

XGBoost RMSLE: 0.3863


In [83]:
import lightgbm as lgb

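# Initialize LightGBM
# Note: bagging_fraction only takes effect when bagging_freq > 0;
# bagging_freq is left at its default of 0 here, so no row subsampling occurs.
lgb_model = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    feature_fraction=0.8,
    bagging_fraction=0.8,
    random_state=42,
    n_jobs=-1
)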

# Fit model
lgb_model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric='rmse')

# Predict and Inverse Transform
y_pred_log_lgb = lgb_model.predict(X_valid)
y_pred_lgb = np.maximum(np.expm1(y_pred_log_lgb), 0)

# Calculate RMSLE
rmsle_lgb = np.sqrt(mean_squared_log_error(y_valid_true, y_pred_lgb))
print(f"LightGBM RMSLE: {rmsle_lgb:.4f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.523150 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 5206
[LightGBM] [Info] Number of data points in the train set: 2949210, number of used features: 32
[LightGBM] [Info] Start training from score 2.932884
LightGBM RMSLE: 0.3838
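
To compare the models side by side, the scores printed in the cells above can be collected and ranked. The sketch below simply copies those reported values into a dictionary; the labels are shorthand chosen here, and nothing is recomputed.

```python
# Validation RMSLE values copied from the cell outputs above.
reported_rmsle = {
    "Naive (lag-1)": 3.78181,
    "MA-7": 4.33326,
    "MA-14": 4.33324,
    "MA-28": 4.33324,
    "Linear Regression": 0.9663,
    "Ridge": 0.9663,
    "Lasso": 0.9657,
    "XGBoost": 0.3863,
    "LightGBM": 0.3838,
}

# Lower RMSLE is better, so sort ascending.
ranking = sorted(reported_rmsle.items(), key=lambda kv: kv[1])
for name, score in ranking:
    print(f"{name:<18} {score:.4f}")

best_name = ranking[0][0]
```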
