<h1 style='color:Green'>Baseline Modeling</h1>
<h3>📌 Objectives</h3>
<pre>
  Establish simple, interpretable benchmark models to:
  - Validate feature quality
  - Create RMSLE reference scores
  - Compare against advanced ML/DL later

 No heavy tuning, no leakage, no deep learning.
</pre>

<h2 style='color:Green'>Load Feature Data</h2>
<h3>🎯 Goal</h3>
<pre>
  - Load engineered features
  - Drop only lag/rolling NaNs
  - Keep time order intact
</pre>

<h3>Clone GitHub Repository</h3>

In [19]:
# Clone GitHub Repository
!git clone https://github.com/sabin74/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform.git

Cloning into 'Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform'...
remote: Enumerating objects: 36, done.
remote: Counting objects: 100% (36/36), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 36 (delta 6), reused 26 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (36/36), 21.62 MiB | 36.41 MiB/s, done.
Resolving deltas: 100% (6/6), done.
Filtering content: 100% (11/11), 315.52 MiB | 52.06 MiB/s, done.


## Set Project Root

Setting the project root keeps relative paths identical to the local setup.

In [20]:
# Set PROJECT_ROOT
import os

PROJECT_ROOT = "/content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform"
os.chdir(PROJECT_ROOT)

print("Current directory:", os.getcwd())


Current directory: /content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform


## Load Feature Data
🎯 Goal
 - Load engineered data
 - Drop only lag/rolling NaNs
 - Preserve time order
 - Avoid memory waste

In [21]:
# Import and Load Data
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_log_error
import warnings

warnings.filterwarnings("ignore")
from pathlib import Path

In [22]:
# Load Feature Dataset from Repo
FEATURE_DATA = Path("data/features")

train = pd.read_parquet(FEATURE_DATA / "train_features.parquet")


In [23]:
# Convert Date and Sort
train['date'] = pd.to_datetime(train['date'])

train = train.sort_values(
    ['store_nbr', 'family', 'date']
).reset_index(drop=True)

In [24]:
# Identify Lag and Rolling Columns
lag_roll_cols = [
    col for col in train.columns
    if ("lag" in col) or ("roll" in col)
]

In [25]:
# Drop NaN in lag/rolling features
train_model = train.dropna(
    subset = lag_roll_cols
).reset_index(drop=True)
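
As a rough illustration of why only the lag/rolling columns are passed to `dropna`, the sketch below builds a tiny synthetic table and shows that the rows removed are exactly the warm-up rows at the start of each series. The data, the `demo` frame, and the `sales_roll_mean_3` column are hypothetical stand-ins, not the project's actual feature pipeline.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table: 2 stores x 10 days (hypothetical data).
dates = pd.date_range("2017-01-01", periods=10, freq="D")
demo = pd.DataFrame({
    "store_nbr": np.repeat([1, 2], len(dates)),
    "date": np.tile(dates, 2),
    "sales": np.arange(20, dtype=float),
})
demo = demo.sort_values(["store_nbr", "date"]).reset_index(drop=True)

# Per-series lag and rolling mean; shift(1) keeps a day's own sales out of its feature.
grouped = demo.groupby("store_nbr")["sales"]
demo["sales_lag_7"] = grouped.shift(7)
demo["sales_roll_mean_3"] = grouped.transform(lambda s: s.shift(1).rolling(3).mean())

# Same selection rule as the notebook: any column containing "lag" or "roll".
lag_roll_cols = [c for c in demo.columns if ("lag" in c) or ("roll" in c)]
demo_model = demo.dropna(subset=lag_roll_cols).reset_index(drop=True)

rows_dropped = len(demo) - len(demo_model)
print(f"Dropped {rows_dropped} warm-up rows, kept {len(demo_model)}")
```

Restricting `subset` to the lag/rolling columns means a NaN in any other column would not cause a row to be dropped here.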

## Define Target & Feature Sets
### Target Options
  - sales_log (preferred for RMSLE)
### Feature Sets
  - Time features
  - Lag features
  - Rolling features
  - Promotions
  - Oil
  - Encodings

In [26]:
# Target Variable
TARGET = 'sales_log'

In [27]:
# Define Feature Groups
TIME_FEATURES = [
    "day", "week_of_year", "monty", "year",
    "day_of_week", "is_weekend", "is_payday"
]

LAG_FEATURES = [
    "sales_lag_1", "sales_lag_7",
    "sales_lag_14", "sales_lag_28"
]

ROLL_FEATURES = [
    "sales_roll_mean_7", "sales_roll_mean_14", "sales_roll_mean_28",
    "sales_roll_std_7", "sales_roll_std_14", "sales_roll_std_28"
]

PROMO_FEATURES = [
    "onpromotion",
    "promo_lag_1", "promo_lag_7",
    "promo_freq_7", "promo_freq_14", "promo_freq_28",
    "promo_roll_sum_7", "promo_roll_sum_14", "promo_roll_sum_28"
]

OIL_FEATURES = [
    "dcoilwtico", "oil_lag_7", "oil_lag_14", "oil_lag_28"
]

ENCODING_FEATURES = [
    "store_te", "family_te"
]


In [28]:
# Final Feature List
FEATURES = (
    TIME_FEATURES + LAG_FEATURES + ROLL_FEATURES +
    PROMO_FEATURES + OIL_FEATURES + ENCODING_FEATURES
)

In [29]:
# Check which features actually exist
print(f"Total features in list: {len(FEATURES)}")
print(f"Features missing from train_model:")
missing_features = [f for f in FEATURES if f not in train_model.columns]
for f in missing_features:
    print(f"  - {f}")

Total features in list: 32
Features missing from train_model:


In [30]:
# Prepare Modeling Matrices (features X, target y)
X = train_model[FEATURES]
y = train_model[TARGET]

## Time-Based Validation Strategy
**Why**: Time series ≠ random split (a random split would leak future observations into training)
### Validation Method
  - Last N days as validation
  - Example:
    - Train: up to 2017-07-15
    - Valid: 2017-07-16 → 2017-08-15
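
The "last N days" rule above can be sketched as a small helper that derives the cutoff from the data instead of hard-coding it. This is a minimal sketch on synthetic dates; `time_split` and its `valid_days` parameter are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def time_split(df, date_col="date", valid_days=31):
    """Hold out the last `valid_days` calendar days as the validation set."""
    last_date = df[date_col].max()
    cutoff = last_date - pd.Timedelta(days=valid_days)
    train_mask = df[date_col] <= cutoff
    return df[train_mask], df[~train_mask], cutoff

# Synthetic daily series ending on the same date as the example above.
dates = pd.date_range("2017-06-01", "2017-08-15", freq="D")
demo = pd.DataFrame({"date": dates, "sales": np.arange(len(dates), dtype=float)})

train_part, valid_part, cutoff = time_split(demo, valid_days=31)
print("Cutoff:", cutoff.date())
print("Train ends:", train_part["date"].max().date())
print("Valid starts:", valid_part["date"].min().date())
```

With 31 validation days and data ending on 2017-08-15, the derived cutoff matches the 2017-07-15 example.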

In [31]:
# Define Validation Cutoff Date
TRAIN_END_DATE = "2017-07-15"

In [32]:
# Create Time Based Split
train_mask = train_model['date'] <= TRAIN_END_DATE
valid_mask = train_model['date'] > TRAIN_END_DATE

X_train = X[train_mask]
y_train = y[train_mask]

X_valid = X[valid_mask]
y_valid = y[valid_mask]

In [33]:
# Also keep the full train_model rows for validation (for accessing lag columns)
train_valid = train_model[valid_mask].reset_index(drop=True)

print("DATA SPLIT SUMMARY:\n")
print(f"Train dates: {train_model[train_mask]['date'].min().date()} to {train_model[train_mask]['date'].max().date()}")
print(f"Validation dates: {train_model[valid_mask]['date'].min().date()} to {train_model[valid_mask]['date'].max().date()}")
print(f"Train samples: {len(X_train):,}")
print(f"Validation samples: {len(X_valid):,}")

DATA SPLIT SUMMARY:

Train dates: 2013-01-29 to 2017-07-15
Validation dates: 2017-07-16 to 2017-08-15
Train samples: 2,949,210
Validation samples: 55,242


## BASELINE MODEL 1 – NAIVE & MOVING AVERAGE
### Goal
  - Establish simple benchmark scores
  - Validate feature correctness
  - Create RMSLE reference

In [34]:
# Define Evaluation Metric (RMSLE)
def rmsle(y_true, y_pred):
    """
    Root Mean Squared Log Error on raw-scale (non-log) values.
    """
    # Clip to [0, MAX_VALUE]: negative values are invalid for the log,
    # and the upper bound keeps inf or overflowed values finite.
    MAX_VALUE = 1e10  # 10 billion
    y_true = np.clip(y_true, 0, MAX_VALUE)
    y_pred = np.clip(y_pred, 0, MAX_VALUE)

    return np.sqrt(mean_squared_log_error(y_true, y_pred))
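
To show why a `log1p`-transformed target pairs naturally with RMSLE, the sketch below checks that plain RMSE in log space equals RMSLE computed on the back-transformed values. It uses NumPy only and synthetic data; `rmsle_raw` and `rmse` are hypothetical helpers, and it assumes the target was built with `log1p`, which the notebook's use of `expm1` suggests but does not state.

```python
import numpy as np

def rmsle_raw(y_true, y_pred):
    """RMSLE on raw-scale values, using log1p as scikit-learn's MSLE does."""
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def rmse(a, b):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
sales = rng.gamma(shape=2.0, scale=50.0, size=1_000)   # synthetic raw sales
sales_log = np.log1p(sales)                            # assumed target transform

# Synthetic log-scale predictions, clipped so the back-transform stays non-negative.
pred_log = np.clip(sales_log + rng.normal(0.0, 0.3, size=sales_log.shape), 0, None)

score_log_space = rmse(sales_log, pred_log)
score_raw_space = rmsle_raw(np.expm1(sales_log), np.expm1(pred_log))
print(f"RMSE in log space: {score_log_space:.6f}")
print(f"RMSLE in raw space: {score_raw_space:.6f}")
```

The two scores agree up to floating-point error, so the equivalence only holds when both inputs to the metric are on the same scale.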


In [35]:
# Prepare Validation Ground Truth
y_valid_true = np.expm1(y_valid)

In [36]:
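# Naive forecast - Use sales_lag_1 from train_valid
# Note: expm1 is only appropriate if sales_lag_1 is stored on the log1p scale;
# the debug prints below expose the actual range.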
print("\n1. Naive Forecast (Lag-1):")
if "sales_lag_1" in train_valid.columns:
    naive_pred_log = train_valid["sales_lag_1"]
    naive_pred = np.expm1(naive_pred_log)

    # Debug info
    print(f"   sales_lag_1 min: {naive_pred_log.min():.4f}, max: {naive_pred_log.max():.4f}")
    print(f"   After expm1 min: {naive_pred.min():.2f}, max: {naive_pred.max():.2f}")

    naive_rmsle = rmsle(y_valid_true, naive_pred)
    print(f"✅ Naive Forecast RMSLE: {naive_rmsle:.5f}")
else:
    print("❌ ERROR: 'sales_lag_1' column not found!")
    naive_rmsle = None



1. Naive Forecast (Lag-1):
   sales_lag_1 min: 0.0000, max: 18340.0000
   After expm1 min: 0.00, max: inf
✅ Naive Forecast RMSLE: 13.56406
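
A maximum of 18340 for `sales_lag_1` reads like raw unit sales rather than `log1p` values. If that is the case, `expm1` overflows to `inf`, the metric's clip caps predictions at 1e10, and the resulting RMSLE reflects the overflow rather than forecast quality. The sketch below reproduces that effect on synthetic data and compares it with scoring the lag on its own scale. It assumes the lag is on the raw scale, and every array in it is a hypothetical stand-in.

```python
import numpy as np

def rmsle(y_true, y_pred, max_value=1e10):
    """RMSLE on raw-scale values with the same clipping idea as the notebook."""
    y_true = np.clip(y_true, 0, max_value)
    y_pred = np.clip(y_pred, 0, max_value)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

rng = np.random.default_rng(42)
sales_today = rng.gamma(2.0, 200.0, size=500)                 # raw units
sales_lag_1 = sales_today * rng.uniform(0.8, 1.2, size=500)   # raw units (assumed)

y_valid_log = np.log1p(sales_today)    # plays the role of sales_log
y_valid_true = np.expm1(y_valid_log)   # back to raw units

# Mismatched scales: treating a raw-scale lag as if it were a log value.
with np.errstate(over="ignore"):
    overflowed_pred = np.expm1(sales_lag_1)
bad_score = rmsle(y_valid_true, overflowed_pred)

# Matched scales: the raw-scale lag is already a raw-scale forecast.
good_score = rmsle(y_valid_true, sales_lag_1)

print(f"RMSLE with expm1 on raw lag: {bad_score:.3f}")
print(f"RMSLE with lag used as-is:   {good_score:.3f}")
```

Checking a feature's min and max before any back-transform is a cheap guard against this kind of scale mismatch.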


In [37]:
# MA-7 forecast - Use sales_roll_mean_7 from train_valid
print("\n2. Moving Average (7-Day):")
if "sales_roll_mean_7" in train_valid.columns:
    ma7_pred_log = train_valid["sales_roll_mean_7"]
    ma7_pred = np.expm1(ma7_pred_log)

    # Debug info
    print(f"   sales_roll_mean_7 min: {ma7_pred_log.min():.4f}, max: {ma7_pred_log.max():.4f}")
    print(f"   After expm1 min: {ma7_pred.min():.2f}, max: {ma7_pred.max():.2f}")

    ma7_rmsle = rmsle(y_valid_true, ma7_pred)
    print(f"✅ Moving Average (7-Day) RMSLE: {ma7_rmsle:.5f}")
else:
    print("❌ ERROR: 'sales_roll_mean_7' column not found!")
    # Try alternative column names
    print("   Searching for alternative MA columns...")
    alt_cols = [col for col in train_valid.columns if "roll" in col and "7" in col]
    print(f"   Found: {alt_cols}")
    ma7_rmsle = None



2. Moving Average (7-Day):
   sales_roll_mean_7 min: 0.0000, max: 12649.4286
   After expm1 min: 0.00, max: inf
✅ Moving Average (7-Day) RMSLE: 13.65953


In [38]:
# MA-14 forecast - Use sales_roll_mean_14 from train_valid
print("\n3. Moving Average (14-Day):")
if "sales_roll_mean_14" in train_valid.columns:
    ma14_pred_log = train_valid["sales_roll_mean_14"]
    ma14_pred = np.expm1(ma14_pred_log)

    # Debug info
    print(f"   sales_roll_mean_14 min: {ma14_pred_log.min():.4f}, max: {ma14_pred_log.max():.4f}")
    print(f"   After expm1 min: {ma14_pred.min():.2f}, max: {ma14_pred.max():.2f}")

    ma14_rmsle = rmsle(y_valid_true, ma14_pred)
    print(f"✅ Moving Average (14-Day) RMSLE: {ma14_rmsle:.5f}")
else:
    print("❌ ERROR: 'sales_roll_mean_14' column not found!")
    # Try alternative column names
    alt_cols = [col for col in train_valid.columns if "roll" in col and "14" in col]
    print(f"   Found: {alt_cols}")
    ma14_rmsle = None


3. Moving Average (14-Day):
   sales_roll_mean_14 min: 0.0000, max: 11905.1429
   After expm1 min: 0.00, max: inf
✅ Moving Average (14-Day) RMSLE: 13.66350


In [39]:
# Only create results if all RMSLE values were computed
if all(r is not None for r in [naive_rmsle, ma7_rmsle, ma14_rmsle]):
    baseline_results = pd.DataFrame({
        "Model": [
            "Naive Forecast (Lag-1)",
            "Moving Average (7-Day)",
            "Moving Average (14-Day)"
        ],
        "RMSLE": [
            naive_rmsle,
            ma7_rmsle,
            ma14_rmsle
        ]
    })

In [40]:

print("BASELINE RESULTS SUMMARY:\n")
print(baseline_results)

BASELINE RESULTS SUMMARY:

                     Model      RMSLE
0   Naive Forecast (Lag-1)  13.564059
1   Moving Average (7-Day)  13.659528
2  Moving Average (14-Day)  13.663498


In [44]:
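# Naive Forecast using previous sales value
# Note: y_valid is the log-scale target (sales_log) and is passed here without
# back-transforming, so this score is not comparable to the RMSLE values above.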
y_pred_naive = X_valid['sales_lag_1']

# RMSLE calculation
naive_rmsle = np.sqrt(mean_squared_log_error(y_valid, y_pred_naive))

print(f"Naive Forecast RMSLE: {naive_rmsle:.4f}")


Naive Forecast RMSLE: 2.9857
