<h1 style='color:Green'>Baseline Modeling</h1>
<h3>📌 Objectives</h3>
<pre>
  Establish simple, interpretable benchmark models to:
  - Validate feature quality
  - Create RMSLE reference scores
  - Compare against advanced ML/DL later

 No heavy tuning, no leakage, no deep learning.
</pre>

<h2 style='color:Green'>Load Feature Data</h2>
<h3>🎯 Goal</h3>
<pre>
  - Load engineered features
  - Drop only lag/rolling NaNs
  - Keep time order intact
</pre>

<h3>Clone GitHub Repository</h3>

In [1]:
# Clone GitHub Repository
!git clone https://github.com/sabin74/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform.git

Cloning into 'Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (29/29), done.
remote: Total 32 (delta 4), reused 26 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (32/32), 21.62 MiB | 34.69 MiB/s, done.
Resolving deltas: 100% (4/4), done.
Filtering content: 100% (11/11), 315.52 MiB | 47.89 MiB/s, done.


## Set Project Root

Changing into the repository root keeps relative paths identical to the local setup.

In [2]:
# Set PROJECT_ROOT
import os

PROJECT_ROOT = "/content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform"
os.chdir(PROJECT_ROOT)

print("Current directory:", os.getcwd())


Current directory: /content/Enterprise-Intelligent-Demand-Forecasting-Decision-Optimization-Platform


## Load Feature Data
### 🎯 Goal
 - Load engineered data
 - Drop only lag/rolling NaNs
 - Preserve time order
 - Avoid memory waste

In [3]:
# Import libraries
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

# Silence library warnings to keep notebook output readable
warnings.filterwarnings("ignore")

In [4]:
# Load Feature Dataset from Repo
FEATURE_DATA = Path("data/features")

train = pd.read_parquet(FEATURE_DATA / "train_features.parquet")


In [5]:
# Convert Date and Sort
train['date'] = pd.to_datetime(train['date'])

train = train.sort_values(
    ['store_nbr', 'family', 'date']
).reset_index(drop=True)

In [6]:
# Identify Lag and Rolling Columns
lag_roll_cols = [
    col for col in train.columns
    if ("lag" in col) or ("roll" in col)
]

In [7]:
# Drop NaN in lag/rolling features
train_model = train.dropna(
    subset = lag_roll_cols
).reset_index(drop=True)
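
To illustrate why only the lag/rolling columns are used for the NaN drop, here is a minimal sketch on a toy frame. The data and the `sales_lag_2` column are made up for illustration and are not part of the project's feature set; the point is that a lag computed per group leaves the earliest rows of each group empty, and those are the rows removed.

```python
import pandas as pd

# Toy frame: two stores, four days each (values are made up).
toy = pd.DataFrame({
    "store_nbr": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": list(pd.date_range("2017-01-01", periods=4)) * 2,
    "sales": [5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0],
})

# A lag of 2 within each store leaves the first two rows of each store as NaN.
toy["sales_lag_2"] = toy.groupby("store_nbr")["sales"].shift(2)

# Same selection rule as the notebook: any column with "lag" or "roll" in its name.
lag_cols = [col for col in toy.columns if ("lag" in col) or ("roll" in col)]
clean = toy.dropna(subset=lag_cols).reset_index(drop=True)

print(len(toy), "rows before,", len(clean), "rows after")
```

Comparing row counts before and after the drop in the real notebook is a quick way to confirm that only the warm-up period of each series was lost.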

## Define Target & Feature Sets
### Target Options
  - sales_log (preferred for RMSLE)
### Feature Sets
  - Time features
  - Lag features
  - Rolling features
  - Promotions
  - Oil
  - Encodings
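
The link between `sales_log` and RMSLE can be sketched as follows. This assumes `sales_log` was created as `log1p(sales)`, which the notebook does not show; under that assumption, plain RMSE on the log target equals RMSLE on raw sales. The helper names `rmsle` and `rmse_log_space` are illustrative.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the raw sales scale."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negative predictions to zero so log1p stays defined.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse_log_space(y_true_log, y_pred_log):
    """RMSE computed directly on log1p-transformed values."""
    diff = np.asarray(y_pred_log, dtype=float) - np.asarray(y_true_log, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy values, not taken from the dataset.
sales_true = np.array([0.0, 10.0, 25.0, 100.0])
sales_pred = np.array([1.0, 8.0, 30.0, 90.0])

score_raw = rmsle(sales_true, sales_pred)
score_log = rmse_log_space(np.log1p(sales_true), np.log1p(sales_pred))
print(score_raw, score_log)  # the two scores agree
```

If the target was built with a different transform, the equivalence no longer holds and predictions should be mapped back to the raw scale before scoring.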

In [8]:
# Target Variable
TARGET = 'sales_log'

In [9]:
# Define Feature Groups
TIME_FEATURES = [
    "day", "week_of_year", "month", "year",
    "day_of_week", "is_weekend", "is_payday"
]

LAG_FEATURES = [
    "sales_lag_1", "sales_lag_7",
    "sales_lag_14", "sales_lag_28"
]

ROLL_FEATURES = [
    "sales_roll_mean_7", "sales_roll_mean_14", "sales_roll_mean_28",
    "sales_roll_std_7", "sales_roll_std_14", "sales_roll_std_28"
]

PROMO_FEATURES = [
    "onpromotion",
    "promo_lag_1", "promo_lag_7",
    "promo_freq_7", "promo_freq_14", "promo_freq_28",
    "promo_roll_sum_7", "promo_roll_sum_14", "promo_roll_sum_28"
]

OIL_FEATURES = [
    "dcoilwtico", "oil_lag_7", "oil_lag_14", "oil_lag_28"
]

ENCODING_FEATURES = [
    "store_te", "family_te"
]


In [10]:
# Final Feature List
FEATURES = (
    TIME_FEATURES + LAG_FEATURES + ROLL_FEATURES +
    PROMO_FEATURES + OIL_FEATURES + ENCODING_FEATURES
)

In [11]:
# Prepare Modeling Matrices
X = train_model[FEATURES]
y = train_model[TARGET]
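
Selecting a long hand-written feature list raises a `KeyError` if any name is misspelled or absent from the parquet file. A small check before building `X` can make such problems easier to diagnose. The sketch below uses a toy frame and a hypothetical helper, `missing_columns`; neither comes from the project.

```python
import pandas as pd

def missing_columns(df, columns):
    """Return the requested column names that are absent from df, in order."""
    present = set(df.columns)
    return [col for col in columns if col not in present]

# Toy frame standing in for train_model; the columns are illustrative.
toy = pd.DataFrame({"day": [1, 2], "year": [2017, 2017], "sales_log": [0.0, 1.1]})
requested = ["day", "week_of_year", "year"]

absent = missing_columns(toy, requested)
print("Missing columns:", absent)
```

In the notebook this would be called as `missing_columns(train_model, FEATURES + [TARGET])`, with an empty list meaning every feature group is available.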

## Time-Based Validation Strategy
**Why**: Time series ≠ random split. Shuffling rows would let the model train on dates later than the ones it is validated on.
### Validation Method
  - Last N days as validation
  - Example:
    - Train: up to 2017-07-15
    - Valid: 2017-07-16 → 2017-08-15
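
The "last N days" idea can be sketched on a toy frame. The helper name `time_split` and the data are assumptions for illustration; the cutoff mirrors the example dates above.

```python
import pandas as pd

def time_split(df, date_col, train_end):
    """Split df into (train, valid): rows on or before train_end vs. rows after."""
    cutoff = pd.Timestamp(train_end)
    is_train = df[date_col] <= cutoff
    return df[is_train], df[~is_train]

# Ten consecutive days around the cutoff (toy data).
toy = pd.DataFrame({
    "date": pd.date_range("2017-07-10", periods=10, freq="D"),
    "sales_log": range(10),
})
train_part, valid_part = time_split(toy, "date", "2017-07-15")

# Leakage check: every validation date must come after the last training date.
assert train_part["date"].max() < valid_part["date"].min()
print(len(train_part), "train rows,", len(valid_part), "validation rows")
```

The same leakage assertion can be run on the real split as a cheap guard against an accidental random shuffle.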

In [12]:
# Define Validation Cutoff Date
TRAIN_END_DATE = "2017-07-15"

In [13]:
# Create Time-Based Split
cutoff = pd.Timestamp(TRAIN_END_DATE)
train_mask = train_model['date'] <= cutoff
valid_mask = train_model['date'] > cutoff

X_train, y_train = X[train_mask], y[train_mask]
X_valid, y_valid = X[valid_mask], y[valid_mask]

## Baseline Model 1 – Naive & Moving Average
### Goal
  - Establish simple benchmark scores
  - Validate feature correctness
  - Create RMSLE reference
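
A minimal sketch of what such benchmarks might look like, on a made-up daily series rather than the project data. It covers a naive forecast (yesterday's value), a 7-day moving average, and, as an extra variant not named in the heading, a seasonal naive forecast (same weekday last week). The `rmsle` helper is illustrative and works on raw sales values.

```python
import numpy as np
import pandas as pd

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (non-log) values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Toy daily sales with a repeating weekly pattern (made-up numbers).
sales = pd.Series([10, 12, 11, 13, 20, 30, 25] * 4, dtype=float)

# Naive: today's forecast is yesterday's value.
naive_pred = sales.shift(1)
# Seasonal naive: today's forecast is the value from the same weekday last week.
seasonal_pred = sales.shift(7)
# Moving average of the previous 7 days; shift(1) keeps today's value out of its own forecast.
ma_pred = sales.shift(1).rolling(7).mean()

# Score only the rows where every baseline has a prediction.
valid = naive_pred.notna() & seasonal_pred.notna() & ma_pred.notna()
scores = {
    "naive": rmsle(sales[valid], naive_pred[valid]),
    "seasonal_naive": rmsle(sales[valid], seasonal_pred[valid]),
    "moving_average_7": rmsle(sales[valid], ma_pred[valid]),
}
print(scores)
```

On the real data, precomputed columns such as `sales_lag_1`, `sales_lag_7`, and `sales_roll_mean_7` could serve as these predictions directly, provided they are on the same scale as the values they are scored against and were built without using the current day's sales.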