# HMData — Q4 Forecast Pedagogical Notebook

Goal: simulate what a retailer (H&M) might do if it wants to **predict demand in Q4 (Oct, Nov, Dec)** for a set of products it hasn't seen in Q4 yet.

We will:
1. Load a wide CSV from GitHub.
2. Lightly clean (boolean → 0/1, `month_` → month name, drop `start`).
3. Filter to a single product type (using the `name` column) to keep the example small.
4. Do a **product-level** train/test split:
   - Train products: we use **all months**.
   - Test products: we only evaluate on **Q4 rows** (Oct, Nov, Dec).
5. Fit 4 models:
   - Model 1: Linear regression on existing columns.
   - Model 2: Same as 1, but with LASSO.
   - Model 3: Add feature engineering (lags, 3-month MA, price change) + LASSO.
   - Model 4: Add a few **focused** interactions + simple nonlinear terms + LASSO.

The emphasis is on clarity, not maximum accuracy.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

MONTHS = [
    "January","February","March","April","May","June",
    "July","August","September","October","November","December"
]

def report(y_true, y_pred, label=""):
    r2 = r2_score(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{label}R²={r2:.3f}, RMSE={rmse:.2f}, MAE={mae:.2f}")

print("[INFO] Imports OK")

## 1. Load and clean from GitHub
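
The three cleaning steps are easier to see on a tiny frame first. The sketch below is an illustration only: the toy columns are hypothetical, and it casts boolean columns with `astype` rather than the `replace` call used in the notebook cell.

```python
import pandas as pd

# Hypothetical toy frame; only the column naming pattern mirrors the real file.
toy = pd.DataFrame({
    "id": [1, 2],
    "month_January": [True, False],
    "month_February": [False, True],
    "start": ["2020-01-01", "2020-02-01"],
    "demand": [10.0, 12.0],
})

# 1) Booleans -> 0/1 (cast only the columns whose dtype is bool).
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({c: int for c in bool_cols})

# 2) Strip the "month_" prefix so the columns carry plain month names.
toy = toy.rename(columns=lambda c: c[len("month_"):] if c.startswith("month_") else c)

# 3) Drop 'start' if it exists (errors="ignore" makes this a no-op otherwise).
toy = toy.drop(columns=["start"], errors="ignore")

print(toy)
```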

In [None]:
GITHUB_URL = "https://raw.githubusercontent.com/ucla-anderson-SSAI/SSAI/main/HMData.csv"

df = pd.read_csv(GITHUB_URL)
print("[INFO] loaded:", df.shape)

# Replace booleans/strings with 1/0 if present
df = df.replace({True: 1, False: 0, "TRUE": 1, "FALSE": 0})

# Rename month_... columns to plain month names
new_cols = []
for c in df.columns:
    if c.startswith("month_"):
        new_cols.append(c.replace("month_", ""))
    else:
        new_cols.append(c)
df.columns = new_cols

# Drop 'start' if present
if "start" in df.columns:
    df = df.drop(columns=["start"])
    print("[INFO] Dropped 'start' column")

# Filter to one product type (column is called 'name' in this wide file)
SELECTED_NAME = "Vest top"   # change if you want a different product type
df = df[df["name"] == SELECTED_NAME].copy()
if df.empty:
    raise ValueError(f"No rows found for name == {SELECTED_NAME!r}")
print(f"[INFO] after name filter: {df.shape}")

## 2. Q4-style train/test split (product-level)

We simulate: "these products exist, but we haven't seen their Q4 sales yet".

- Train products → all months.
- Test products → only October/November/December rows.
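
A minimal sketch of the same idea on synthetic data, assuming one row per product per month and one 0/1 dummy column per month. The toy frame and the names `ids_train`, `ids_test`, and `is_q4` are made up for the illustration. The point to check is that no product id ends up on both sides of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical data: 5 products x 12 months, with month dummies.
rows = []
for pid in range(5):
    for m in months:
        row = {"id": pid, "demand": 1.0}
        row.update({name: int(name == m) for name in months})
        rows.append(row)
toy = pd.DataFrame(rows)

# Split the product ids, not the rows.
ids_train, ids_test = train_test_split(toy["id"].unique(), test_size=0.2, random_state=0)

# Row mask for Q4: any of the three Q4 dummies equals 1.
is_q4 = toy[["October", "November", "December"]].eq(1).any(axis=1)

toy_train = toy[toy["id"].isin(ids_train)]           # all months
toy_test = toy[toy["id"].isin(ids_test) & is_q4]     # Q4 rows only

overlap = set(toy_train["id"]) & set(toy_test["id"])
print(len(toy_train), len(toy_test), overlap)
```

Splitting rows at random instead would let the same product appear in both sets, which would make the test score look better than a true new-product forecast.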

In [None]:
Q4_MONTHS = ["October", "November", "December"]

# product-level split
unique_ids = df["id"].unique()
train_ids, test_ids = train_test_split(unique_ids, test_size=0.2, random_state=0)
print(f"[INFO] Train products: {len(train_ids)}, Test products: {len(test_ids)}")

# train: all rows for train products
train_df = df[df["id"].isin(train_ids)].copy()

# test: only Q4 rows for test products
test_df = df[df["id"].isin(test_ids) & (
    (df["October"] == 1) | (df["November"] == 1) | (df["December"] == 1)
)].copy()

print("[INFO] Train rows (all months):", len(train_df))
print("[INFO] Test rows (Q4 only):", len(test_df))

## 3. Model 1 — Linear regression on existing columns (predict Q4 rows for held-out products)

We drop identifier and target columns, convert everything else to numeric, and fit.
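
It helps to know what the coercion step does to non-numeric cells. In this small sketch (the toy values are hypothetical), a string that cannot be parsed becomes `NaN` and is then filled with 0, so the model treats it as a real zero.

```python
import pandas as pd

# Hypothetical feature frame with one unparseable entry in 'price'.
toy_X = pd.DataFrame({
    "price": ["9.99", "12.5", "n/a"],
    "Ladieswear": [1, 0, 1],
})

# Same pattern as the notebook: coerce column by column, then fill gaps with 0.
toy_X_num = toy_X.apply(pd.to_numeric, errors="coerce").fillna(0)

print(toy_X_num)
print(toy_X_num.dtypes)
```

If many cells in a column are coerced this way, it may be worth inspecting that column before trusting its coefficient.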

In [None]:
drop_cols = ["id", "name", "demand"]

X_train = train_df.drop(columns=[c for c in drop_cols if c in train_df.columns])
X_train = X_train.apply(pd.to_numeric, errors="coerce").fillna(0)
y_train = train_df["demand"].astype(float)

X_test = test_df.drop(columns=[c for c in drop_cols if c in test_df.columns])
X_test = X_test.apply(pd.to_numeric, errors="coerce").fillna(0)
y_test = test_df["demand"].astype(float)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred1 = linreg.predict(X_test)
report(y_test, y_pred1, label="[Model 1 — Linear] ")

## 4. Model 2 — Same features, but LASSO

We scale first, then fit `LassoCV`.
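
Scaling matters because the LASSO penalty applies one `alpha` to every coefficient, so a feature measured on a large scale (and hence carrying a small coefficient) is penalized less than one on a small scale. The sketch below, on a made-up array, shows what `StandardScaler(with_mean=False)` does: it divides each column by its standard deviation but does not center it. `LassoCV` fits an intercept by default, which absorbs the missing centering.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

scaled = StandardScaler(with_mean=False).fit_transform(toy)

print(scaled.std(axis=0))   # each column now has unit standard deviation
print(scaled.mean(axis=0))  # column means are rescaled, not shifted to zero
```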

In [None]:
scaler2 = StandardScaler(with_mean=False)
Xtr2 = scaler2.fit_transform(X_train)
Xte2 = scaler2.transform(X_test)

lasso2 = LassoCV(cv=3, max_iter=5000, n_jobs=-1, random_state=0)
lasso2.fit(Xtr2, y_train)
y_pred2 = lasso2.predict(Xte2)
report(y_test, y_pred2, label="[Model 2 — LASSO] ")

coef2 = pd.Series(lasso2.coef_, index=X_train.columns).sort_values(ascending=False)
print("\n[Model 2] 10 largest (most positive) coefficients:\n", coef2.head(10))

## 5. Model 3 — Add feature engineering (lags, MA3, price_change) + LASSO

Here we need a month order, so we infer `month_num` from the month dummies, sort by `id, month_num`, and create:
- lag_demand_1, lag_demand_2, lag_demand_3
- ma3_demand (3-month moving average of **previous** demand)
- price_change (pct change of price)

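Then we re-do the **same** product-level Q4 split on the engineered frame, reusing `train_ids` and `test_ids`. Because the lags are built from each product's actual demand, the November and December test rows see that product's own earlier Q4 demand. The evaluation is therefore one step ahead, not a forecast of the whole quarter made in September.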
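
A small worked example of the grouped features, assuming rows are already sorted by `id` and `month_num`. The toy numbers are made up. The shift and the rolling window are both computed inside each product, so the first row of product `B` gets no history from product `A`.

```python
import pandas as pd

# Hypothetical two-product panel.
toy = pd.DataFrame({
    "id":        ["A", "A", "A", "A", "B", "B", "B"],
    "month_num": [1, 2, 3, 4, 1, 2, 3],
    "demand":    [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0],
    "price":     [5.0, 5.0, 4.0, 4.0, 8.0, 10.0, 10.0],
}).sort_values(["id", "month_num"]).reset_index(drop=True)

# Previous month's demand within the same product.
toy["lag_demand_1"] = toy.groupby("id")["demand"].shift(1)

# Mean of up to three previous months, never crossing a product boundary.
toy["ma3_demand"] = toy.groupby("id")["demand"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

# Month-over-month relative price change within the same product.
toy["price_change"] = toy.groupby("id")["price"].pct_change()

print(toy)
```

The first row of each product has no history, so all three engineered columns are `NaN` there until they are filled.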

In [None]:
df_fe = df.copy()

# infer month_num
def infer_month(row):
    for i, m in enumerate(MONTHS, start=1):
        if m in row and row[m] == 1:
            return i
    return np.nan

df_fe["month_num"] = df_fe.apply(infer_month, axis=1)
df_fe = df_fe.sort_values(["id", "month_num"]).reset_index(drop=True)

# lags
df_fe["lag_demand_1"] = df_fe.groupby("id")["demand"].shift(1)
df_fe["lag_demand_2"] = df_fe.groupby("id")["demand"].shift(2)
df_fe["lag_demand_3"] = df_fe.groupby("id")["demand"].shift(3)

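# 3-month moving average of previous demand, computed within each product.
# Shifting and rolling inside transform() keeps the window from spanning two ids.
df_fe["ma3_demand"] = (
    df_fe.groupby("id")["demand"]
         .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)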

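# price change (a previous price of 0 would give ±inf; treat that as missing)
df_fe["price_change"] = (
    df_fe.groupby("id")["price"].pct_change().replace([np.inf, -np.inf], np.nan)
)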

# fill engineered NaNs
eng_cols = ["lag_demand_1","lag_demand_2","lag_demand_3","ma3_demand","price_change"]
df_fe[eng_cols] = df_fe[eng_cols].fillna(0)

# re-run product-level split on engineered frame
train_fe = df_fe[df_fe["id"].isin(train_ids)].copy()
test_fe = df_fe[df_fe["id"].isin(test_ids) & (
    (df_fe["October"] == 1) | (df_fe["November"] == 1) | (df_fe["December"] == 1)
)].copy()

drop_cols_fe = ["id","name","demand","month_num"]
X3_train = train_fe.drop(columns=[c for c in drop_cols_fe if c in train_fe.columns])
X3_train = X3_train.apply(pd.to_numeric, errors="coerce").fillna(0)
y3_train = train_fe["demand"].astype(float)

X3_test = test_fe.drop(columns=[c for c in drop_cols_fe if c in test_fe.columns])
X3_test = X3_test.apply(pd.to_numeric, errors="coerce").fillna(0)
y3_test = test_fe["demand"].astype(float)

scaler3 = StandardScaler(with_mean=False)
X3tr = scaler3.fit_transform(X3_train)
X3te = scaler3.transform(X3_test)

lasso3 = LassoCV(cv=3, max_iter=6000, n_jobs=-1, random_state=0)
lasso3.fit(X3tr, y3_train)
y3_pred = lasso3.predict(X3te)
report(y3_test, y3_pred, label="[Model 3 — FE + LASSO] ")

coef3 = pd.Series(lasso3.coef_, index=X3_train.columns).sort_values(ascending=False)
print("\n[Model 3] 10 largest (most positive) coefficients:\n", coef3.head(10))

## 6. Model 4 — FE + a few interactions + simple nonlinear terms + LASSO

To keep it pedagogical, we **do not** explode the feature space. We add just:
- `lag_demand_1 * price`
- `price_change * price`
- `lag_demand_1 * Ladieswear` (if available)
- `log(1 + lag_demand_1)`
- `log(1 + ma3_demand)`
- `price^2`
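
To make the list concrete, here is a minimal sketch that builds the six terms on a two-row toy frame. The input values are hypothetical, and the output column names are chosen to match the ones used in the cell below.

```python
import numpy as np
import pandas as pd

# Hypothetical engineered rows (the first row has no history, so lags are 0).
toy = pd.DataFrame({
    "price":        [2.0, 3.0],
    "lag_demand_1": [0.0, 9.0],
    "ma3_demand":   [0.0, 6.0],
    "price_change": [0.0, 0.5],
    "Ladieswear":   [1, 0],
})

toy_plus = toy.assign(
    # interactions
    lag1_x_price=toy["lag_demand_1"] * toy["price"],
    pch_x_price=toy["price_change"] * toy["price"],
    lag1_x_ladies=toy["lag_demand_1"] * toy["Ladieswear"],
    # simple nonlinear terms; log1p(x) = log(1 + x) is defined at x = 0
    log_lag1=np.log1p(toy["lag_demand_1"]),
    log_ma3=np.log1p(toy["ma3_demand"]),
    price_sq=toy["price"] ** 2,
)

print(toy_plus.T)
```

Using `log1p` rather than a plain logarithm keeps rows with zero lagged demand finite, which matters here because missing lags were filled with 0.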

In [None]:
df_dec_int = test_fe.copy()  # engineered Q4 test rows (Oct, Nov, Dec); new features are added below
df_train_int = train_fe.copy()

# 1) Nonlinear features on both train and test engineered frames
for _df in [df_train_int, df_dec_int]:
    if "lag_demand_1" in _df.columns:
        _df["log_lag1"] = np.log1p(_df["lag_demand_1"])
    if "ma3_demand" in _df.columns:
        _df["log_ma3"] = np.log1p(_df["ma3_demand"])
    if "price" in _df.columns:
        _df["price_sq"] = _df["price"] ** 2

# 2) Interactions on both train and test
def add_interactions(_df):
    if {"lag_demand_1", "price"}.issubset(_df.columns):
        _df["lag1_x_price"] = _df["lag_demand_1"] * _df["price"]
    if {"price_change", "price"}.issubset(_df.columns):
        _df["pch_x_price"] = _df["price_change"] * _df["price"]
    if "Ladieswear" in _df.columns and "lag_demand_1" in _df.columns:
        _df["lag1_x_ladies"] = _df["lag_demand_1"] * _df["Ladieswear"]
    return _df

df_train_int = add_interactions(df_train_int)
df_dec_int = add_interactions(df_dec_int)

# 3) Build X/y again
drop_cols_int = ["id","name","demand","month_num"]
X4_train = df_train_int.drop(columns=[c for c in drop_cols_int if c in df_train_int.columns])
X4_train = X4_train.apply(pd.to_numeric, errors="coerce").fillna(0)
y4_train = df_train_int["demand"].astype(float)

X4_test = df_dec_int.drop(columns=[c for c in drop_cols_int if c in df_dec_int.columns])
X4_test = X4_test.apply(pd.to_numeric, errors="coerce").fillna(0)
y4_test = df_dec_int["demand"].astype(float)

# 4) Scale + LASSO
scaler4 = StandardScaler(with_mean=False)
X4tr = scaler4.fit_transform(X4_train)
X4te = scaler4.transform(X4_test)

lasso4 = LassoCV(cv=3, max_iter=8000, n_jobs=-1, random_state=0)
lasso4.fit(X4tr, y4_train)
y4_pred = lasso4.predict(X4te)

report(y4_test, y4_pred, label="[Model 4 — nonlinear + interactions + LASSO] ")

coef4 = pd.Series(lasso4.coef_, index=X4_train.columns).sort_values(ascending=False)
print("\n[Model 4] 10 largest (most positive) coefficients:\n", coef4.head(10))