# MGMT298D: Science and Strategy of AI
**Instructor:** Professor Siddiq  
**Student:** _Your Name Here_

This notebook walks through a small demand-modeling exercise using H&M-style SKU data.
You will:
1. Load and clean the data.
2. Focus on a single product type.
3. Split SKUs into train/test so we test on **new products**.
4. Fit a baseline LASSO.
5. (Optionally) add a feature-engineered version.
6. Reflect on the modeling choices.


In [ ]:
# ---------------------------------------------------------
# 1. Imports and basic setup
# ---------------------------------------------------------
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

MONTHS = [
    "January","February","March","April","May","June",
    "July","August","September","October","November","December"
]

def report_metrics(y_true, y_pred, label=""):
    """Print R², RMSE, and MAE for easy comparison."""
    r2 = r2_score(y_true, y_pred)
    rmse = mean_squared_error(y_true, y_pred) ** 0.5
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{label}R²={r2:.3f}, RMSE={rmse:.2f}, MAE={mae:.2f}")

def show_top_coeffs(model, feature_names, k=12):
    """Show the k coefficients with the largest absolute value, keeping their signs."""
    coefs = pd.Series(model.coef_, index=feature_names)
    # Rank by magnitude but print the signed values, so direction of effect stays visible
    order = np.argsort(-np.abs(coefs.to_numpy()))[:k]
    top = coefs.iloc[order]
    print("\n[Top features by |coefficient|]")
    print(top)


In [ ]:
# ---------------------------------------------------------
# Helper: visualize LASSO cross-validation path and sparsity
# ---------------------------------------------------------
import matplotlib.pyplot as plt
import numpy as np

def visualize_lasso_cv(model, model_name="LASSO"):
    """Plot CV error vs alpha and print sparsity info."""
    if not hasattr(model, "alphas_") or not hasattr(model, "mse_path_"):
        print(f"[WARN] {model_name}: no CV path available.")
        return

    mean_mse = np.mean(model.mse_path_, axis=1)
    std_mse = np.std(model.mse_path_, axis=1)

    plt.figure(figsize=(6, 4))
    plt.semilogx(model.alphas_, mean_mse)
    plt.fill_between(model.alphas_, mean_mse - std_mse, mean_mse + std_mse, alpha=0.2)
    plt.axvline(model.alpha_, linestyle="--")
    plt.xlabel("Alpha (penalty)")
    plt.ylabel("Cross-validated MSE")
    plt.title(f"{model_name}: CV Error vs Alpha")
    plt.show()

    n_nonzero = np.sum(model.coef_ != 0)
    print(f"[INFO] {model_name} selected alpha = {model.alpha_:.4g}")
    print(f"[INFO] Nonzero coefficients: {n_nonzero} / {len(model.coef_)}")


## 2. Load and lightly clean the dataset

- The CSV has columns such as `id, demand, price, name, ...` plus month indicator columns (`April, August, ...`).
- We rename the oddly named `Transparent.1` column if it exists.
- We then filter to **one** product name that actually exists in the data (default: `Vest top`).
- Change `CSV_PATH` or `PRODUCT_NAME` as needed.


## 2A. Product Types Available

Below is the full list of product types found in the dataset. Please **choose one** and assign it to `PRODUCT_NAME` in the code cell below (e.g., `"Hoodie"` or `"Bra"`).

```
Alice band
Baby Bib
Backpack
Bag
Ballerinas
Beanie
Belt
Bikini top
Blanket
Blazer
Blouse
Bodysuit
Bootie
Boots
Bra
Bra extender
Bracelet
Braces
Bucket hat
Cap
Cap/peaked
Cardigan
Chem. cosmetics
Coat
Costumes
Cushion
Dog Wear
Dress
Dungarees
Earring
Earrings
Felt hat
Fine cosmetics
Flat shoe
Flat shoes
Flip flop
Garment Set
Giftbox
Gloves
Hair clip
Hair string
Hair/alice band
Hairband
Hat/beanie
Hat/brim
Heeled sandals
Heels
Hoodie
Jacket
Jumpsuit/Playsuit
Keychain
Kids Underwear top
Leg warmers
Leggings/Tights
Long John
Necklace
Night gown
Nipple covers
Other accessories
Other shoe
Outdoor overall
Outdoor trousers
Outdoor Waistcoat
Polo shirt
Pumps
Pyjama bottom
Pyjama jumpsuit/playsuit
Pyjama set
Ring
Robe
Sandals
Sarong
Scarf
Sewing kit
Shirt
Shorts
Shoulder bag
Side table
Skirt
Sleep Bag
Sleeping sack
Slippers
Sneakers
Socks
Soft Toys
Straw hat
Sunglasses
Sweater
Swimsuit
Swimwear bottom
Swimwear set
Swimwear top
T-shirt
Tailored Waistcoat
Tie
Top
Tote bag
Trousers
Umbrella
Underdress
Underwear body
Underwear bottom
Underwear corset
Underwear set
Underwear Tights
Unknown
Vest top
Wallet
Watch
Waterbottle
Wedge
Weekend/Gym bag
Wood balls
```
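
To regenerate this list from your own copy of the data, or to catch a misspelled `PRODUCT_NAME` before filtering, a sketch like the one below may help. It runs on a tiny made-up frame rather than `HMData.csv`, and `suggest_product_name` is a hypothetical helper, not part of the course code.

```python
import difflib
import pandas as pd

# Toy stand-in for the real CSV (assumption: the product type lives in the `name` column).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Vest top", "Hoodie", "Bra", "Vest top"],
})

def suggest_product_name(df, requested, n=3):
    """Return `requested` if it exists in df['name']; otherwise raise with close matches."""
    available = sorted(df["name"].dropna().unique())
    if requested in available:
        return requested
    close = difflib.get_close_matches(requested, available, n=n)
    raise ValueError(f"{requested!r} not found. Did you mean one of {close}?")

# Sorted list of distinct product types, as shown in the listing above.
product_types = sorted(toy["name"].dropna().unique())
print(product_types)

chosen = suggest_product_name(toy, "Hoodie")
```

Calling the helper with a typo such as `"Hodie"` raises a `ValueError` that names the nearest matches, which is usually easier to act on than an empty filtered frame.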

In [ ]:
# ---------------------------------------------------------
# 2. Load and basic cleaning
# ---------------------------------------------------------
CSV_PATH = "HMData.csv"  # change to your actual path if needed

print("[INFO] Loading data…")
df = pd.read_csv(CSV_PATH)
print("[INFO] Raw shape:", df.shape)

# Fix awkward column name
if "Transparent.1" in df.columns:
    df = df.rename(columns={"Transparent.1": "Transparent_color"})

# Choose a product that exists in this CSV
PRODUCT_NAME = "Vest top"  # e.g., "Bra", "Underwear Tights"
df = df[df["name"] == PRODUCT_NAME].copy()
if df.empty:
    raise ValueError(f"No rows found for name == {PRODUCT_NAME!r}")

print(f"[INFO] After filtering to product '{PRODUCT_NAME}':", df.shape)


## 3. Train/test split by SKU and keep Q4 in test

We want to test on **new products**:
1. Split unique `id` into train/test.
2. Train = all rows for train SKUs.
3. Test = Q4 rows (Oct/Nov/Dec) for test SKUs.
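
The same idea can be sketched on toy data, assuming month indicator columns like those in the CSV. This version swaps in scikit-learn's `GroupShuffleSplit` as an alternative to calling `train_test_split` on the unique ids, and then checks that no SKU appears on both sides. All values are made up.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 5 SKUs, each with a September row and an October row (made-up values).
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "demand": [10, 12, 7, 9, 20, 25, 3, 4, 15, 11],
    "September": [1, 0] * 5,
    "October": [0, 1] * 5,
})

# Split on groups so every row of a SKU lands on the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["id"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Keep only Q4 rows (here just October) for the held-out SKUs.
toy_test_q4 = toy_test[toy_test["October"] == 1]

# An empty set means the test SKUs are genuinely unseen during training.
overlap = set(toy_train["id"]) & set(toy_test["id"])
print("Overlapping SKUs:", overlap)
```

Either approach gives a SKU-level split; the disjointness check is the part worth keeping whichever splitter you use.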


In [ ]:
# ---------------------------------------------------------
# 3. SKU-level split
# ---------------------------------------------------------
Q4_MONTHS = ["October", "November", "December"]

unique_ids = df["id"].unique()
train_ids, test_ids = train_test_split(unique_ids, test_size=0.2, random_state=0)

train_df = df[df["id"].isin(train_ids)].copy()

# test: Q4 rows for unseen SKUs
test_condition = df["id"].isin(test_ids)
q4_cols = [m for m in Q4_MONTHS if m in df.columns]

if q4_cols:
    # a row is Q4 if any of the available Oct/Nov/Dec indicator columns equals 1
    q4_mask_any = df[q4_cols].eq(1).any(axis=1)
    test_df = df[test_condition & q4_mask_any].copy()
else:
    # fallback: if Q4 columns absent, just use all rows for test SKUs
    test_df = df[test_condition].copy()

print(f"[INFO] Train SKUs: {len(train_ids)}, Test SKUs: {len(test_ids)}")
print(f"[INFO] Train rows: {len(train_df)}")
print(f"[INFO] Test rows (Q4 only): {len(test_df)}")

print(f"[INFO] Unique products in TRAIN: {train_df['id'].nunique()} | TEST: {test_df['id'].nunique()}")


## 4. Baseline LASSO

We drop obvious non-features (`id`, `name`, `demand`) and keep everything else. Then we scale and fit `LassoCV`.
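
One way to keep the scale-then-fit steps together is a scikit-learn pipeline, so the scaler statistics learned on the training rows are reused automatically at prediction time. The sketch below runs on synthetic data (the features and true coefficients are invented), so it illustrates the mechanics only and says nothing about the course dataset.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 rows, 10 features, only the first two drive the target.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# The scaler is fit on the training rows only, then applied unchanged to the test rows.
pipe = make_pipeline(StandardScaler(with_mean=False), LassoCV(cv=5))
pipe.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name.
lasso = pipe.named_steps["lassocv"]
n_nonzero = int(np.sum(lasso.coef_ != 0))
test_r2 = pipe.score(X_test, y_test)
print(f"alpha={lasso.alpha_:.4g}, nonzero={n_nonzero}/{lasso.coef_.size}, test R2={test_r2:.3f}")
```

Scaling matters for LASSO because the penalty acts on coefficient size, so features measured in large units would otherwise be penalized differently from features in small units.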


In [ ]:
# ---------------------------------------------------------
# 4. Baseline LASSO
# ---------------------------------------------------------
drop_cols = ["id", "name", "demand"]

X_train_base = train_df.drop(columns=[c for c in drop_cols if c in train_df.columns])
X_train_base = X_train_base.apply(pd.to_numeric, errors="coerce").fillna(0)
y_train_base = train_df["demand"].astype(float)

X_test_base = test_df.drop(columns=[c for c in drop_cols if c in test_df.columns])
X_test_base = X_test_base.apply(pd.to_numeric, errors="coerce").fillna(0)
y_test_base = test_df["demand"].astype(float)

scaler_base = StandardScaler(with_mean=False)
Xtr_base = scaler_base.fit_transform(X_train_base)
Xte_base = scaler_base.transform(X_test_base)

lasso_base = LassoCV()
lasso_base.fit(Xtr_base, y_train_base)
y_pred_base = lasso_base.predict(Xte_base)

report_metrics(y_test_base, y_pred_base, label="[Baseline LASSO] ")
show_top_coeffs(lasso_base, X_train_base.columns, k=12)

visualize_lasso_cv(lasso_base, "Baseline LASSO")


## 5. (Optional) Light time-style features

If the file has multiple months per `id`, this cell shows how to build lag-style features. With one row per `id`, every lag feature is simply 0, which is still fine for teaching.
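
As a quick illustration of what lag-style features look like when an `id` does have several months, here is a toy frame with made-up numbers. The computation stays inside each `id` group, so one SKU's history never feeds another SKU's features.

```python
import pandas as pd

# Toy panel: two SKUs, three consecutive months each (made-up values).
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "month_num": [1, 2, 3, 1, 2, 3],
    "demand": [10.0, 20.0, 30.0, 5.0, 7.0, 9.0],
}).sort_values(["id", "month_num"]).reset_index(drop=True)

by_id = toy.groupby("id")["demand"]

# Previous month's demand within the same SKU (NaN for a SKU's first month).
toy["lag_demand_1"] = by_id.shift(1)

# Mean of up to three previous months, computed separately per SKU.
toy["ma3_demand"] = by_id.transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)

# First-month rows have no history; fill them with 0 as the notebook does.
toy = toy.fillna(0)
print(toy)
```

Shifting before rolling keeps the current month's demand out of its own moving average, which would otherwise leak the target into the features.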


In [ ]:
# ---------------------------------------------------------
# 5. Feature-engineered version (lags, MA, price change)
# ---------------------------------------------------------
df_fe = df.copy()

def infer_month_num(row):
    for i, m in enumerate(MONTHS, start=1):
        if m in row and row[m] == 1:
            return i
    return np.nan

df_fe["month_num"] = df_fe.apply(infer_month_num, axis=1)
df_fe = df_fe.sort_values(["id", "month_num"]).reset_index(drop=True)

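# Create simple time features per id
df_fe["lag_demand_1"] = df_fe.groupby("id")["demand"].shift(1)
# Moving average of up to 3 previous months, computed within each id.
# (Calling .rolling() on the result of groupby().shift() would roll across id boundaries.)
df_fe["ma3_demand"] = df_fe.groupby("id")["demand"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
# pct_change yields inf when the previous price is 0; treat those as missing
df_fe["price_change"] = (
    df_fe.groupby("id")["price"].pct_change().replace([np.inf, -np.inf], np.nan)
)
fe_cols = ["lag_demand_1", "ma3_demand", "price_change"]
df_fe[fe_cols] = df_fe[fe_cols].fillna(0)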

# Rebuild train/test on FE frame (same SKU split and Q4 rule as section 3)
train_fe = df_fe[df_fe["id"].isin(train_ids)].copy()
test_mask_fe = df_fe["id"].isin(test_ids)
q4_cols_fe = [m for m in Q4_MONTHS if m in df_fe.columns]
if q4_cols_fe:
    test_mask_fe &= df_fe[q4_cols_fe].eq(1).any(axis=1)
# if Q4 columns are absent, fall back to all rows for test SKUs, matching the baseline
test_fe = df_fe[test_mask_fe].copy()

drop_cols_fe = ["id", "name", "demand", "month_num"]
X_train_fe = train_fe.drop(columns=[c for c in drop_cols_fe if c in train_fe.columns])
X_train_fe = X_train_fe.apply(pd.to_numeric, errors="coerce").fillna(0)
y_train_fe = train_fe["demand"].astype(float)

X_test_fe = test_fe.drop(columns=[c for c in drop_cols_fe if c in test_fe.columns])
X_test_fe = X_test_fe.apply(pd.to_numeric, errors="coerce").fillna(0)
y_test_fe = test_fe["demand"].astype(float)

scaler_fe = StandardScaler(with_mean=False)
Xtr_fe = scaler_fe.fit_transform(X_train_fe)
Xte_fe = scaler_fe.transform(X_test_fe)

lasso_fe = LassoCV()
lasso_fe.fit(Xtr_fe, y_train_fe)
y_pred_fe = lasso_fe.predict(Xte_fe)

report_metrics(y_test_fe, y_pred_fe, label="[FE LASSO] ")
show_top_coeffs(lasso_fe, X_train_fe.columns, k=12)

visualize_lasso_cv(lasso_fe, "FE LASSO")


## 6. Questions and Reflections

1. **Question:** In the baseline model, which variables had the largest coefficients, and do they make sense for retail demand?
   **Response:**

2. **Question:** Why is splitting by SKU (`id`) a better test of generalization than a random row-wise split here?
   **Response:**

3. **Question:** If we had more months per SKU, what other time-based features would you add?
   **Response:**
