# HandM Sales — Numeric vs Embeddings vs Interactions

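This notebook trains three LASSO models on an enriched HandM monthly dataset **pulled directly from GitHub** and compares them on the final month, which is held out as the test set. Each model is fit on `log1p(demand)` and scored on the original demand scale.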

1. **Model A**: numeric only
2. **Model B**: numeric + text embeddings
3. **Model C**: numeric + text embeddings + interactions with `age_m`

Assumptions:
- Your GitHub repo hosts `HandMSales_final.csv` at a raw URL.
- That CSV already contains the target `demand`, the timestamp `month_ts`, numeric features (`mean_price`, `lag_m1`, `price_change`, `age_m`), dummy columns prefixed `month_` and `channel_`, plus `detail_desc` and `article_id`.


In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

# GitHub raw CSV location — change if your path is different
GITHUB_CSV_URL = "https://raw.githubusercontent.com/ucla-anderson-SSAI/SSAI/main/HandMSales_final.csv"

print("[INFO] Downloading CSV from GitHub…")
df = pd.read_csv(GITHUB_CSV_URL, parse_dates=["month_ts"])
print("[INFO] Data loaded from GitHub:", df.shape)

# train = all but last month, test = last month
months = np.sort(df["month_ts"].unique())
train = df[df["month_ts"] < months[-1]].copy()
test  = df[df["month_ts"] == months[-1]].copy()

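# fit on log1p(demand); predictions are mapped back with expm1,
# so the test target stays on the raw demand scale
y_tr = np.log1p(train["demand"].to_numpy())
y_te = test["demand"].to_numpy()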

print(f"[INFO] Train rows: {len(train)} | Test rows: {len(test)}")

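The split and target transform above can be illustrated on a tiny frame. This is a minimal sketch, assuming made-up values; `toy`, `toy_train`, and `toy_test` are hypothetical names and are not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV (values are invented for illustration only)
toy = pd.DataFrame({
    "month_ts": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-03-01"]
    ),
    "demand": [0, 9, 99, 4],
})

# Hold out the most recent month, train on everything before it
last_month = toy["month_ts"].max()
toy_train = toy[toy["month_ts"] < last_month]
toy_test = toy[toy["month_ts"] == last_month]

# log1p is defined at zero demand; expm1 inverts it up to float error
y_log = np.log1p(toy_train["demand"].to_numpy())
y_back = np.expm1(y_log)
```

Splitting on time rather than at random keeps future months out of training, which is closer to how a forecast would be used.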

## Model A — numeric only
Uses engineered numeric features and month/channel dummies already in the CSV.
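
One thing worth checking when selecting dummies by prefix: a match on `month_` also matches the `month_ts` timestamp column, which is not numeric. The sketch below uses a hypothetical column list (the dummy names `month_2`, `channel_1`, and so on are assumptions about how the CSV is laid out).

```python
# Hypothetical column names; the real CSV may name its dummies differently
cols = ["month_ts", "month_2", "month_3", "channel_1", "channel_2", "demand"]

# A plain prefix match picks up the timestamp column as well
naive = [c for c in cols if c.startswith(("month_", "channel_"))]

# Excluding the timestamp explicitly leaves only the dummy columns
dummies = [c for c in naive if c != "month_ts"]
```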

In [None]:
# numeric features + pre-made dummies
# ("month_ts" also starts with "month_" but is the timestamp, so exclude it)
num_cols = [
    "mean_price", "lag_m1", "price_change", "age_m"
] + [
    c for c in df.columns
    if c.startswith(("month_", "channel_")) and c != "month_ts"
]

X_tr_A = train[num_cols].to_numpy()
X_te_A = test[num_cols].to_numpy()

model_A = LassoCV(cv=3, random_state=0).fit(X_tr_A, y_tr)
pred_A = np.expm1(model_A.predict(X_te_A))
r2_A = r2_score(y_te, pred_A)
rmse_A = np.sqrt(mean_squared_error(y_te, pred_A))
mae_A = mean_absolute_error(y_te, pred_A)

print(f"Model A (numeric) — R²={r2_A:.3f}, RMSE={rmse_A:.2f}, MAE={mae_A:.2f}")

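The metrics above are computed after mapping predictions back with `np.expm1`, so they are on the raw demand scale. A NumPy-only sketch of the same calculation follows; `report_metrics` is a hypothetical helper introduced here for illustration, not something the notebook defines.

```python
import numpy as np

def report_metrics(y_true, pred_log):
    """Back-transform log1p-scale predictions and score them on the raw scale."""
    y_true = np.asarray(y_true, dtype=float)
    pred = np.expm1(np.asarray(pred_log, dtype=float))
    resid = y_true - pred
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
    }
```

Because the model is fit in log space but scored in raw units, a few high-demand articles can dominate RMSE and R² on the test month.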

## Build embeddings (detail_desc → vector)
Each unique `article_id` is embedded once from its `detail_desc` text in the CSV fetched from GitHub, and the resulting vectors are merged back onto the monthly rows.
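
The embed-once-then-merge pattern can be sketched without the sentence-transformers model. Here random vectors stand in for the real embeddings, and the names `toy_rows`, `toy_uniq`, and `emb_toy` are hypothetical; only the pandas mechanics are the point.

```python
import numpy as np
import pandas as pd

# Toy monthly rows: article 101 appears twice, article 202 has no description
toy_rows = pd.DataFrame({
    "article_id": [101, 101, 202, 303],
    "detail_desc": ["Soft jersey top", "Soft jersey top", None, "Denim jacket"],
})

# One row per article, with missing text replaced by an empty string
toy_uniq = toy_rows.drop_duplicates("article_id").fillna({"detail_desc": ""})

# Stand-in for embed_model.encode(...): random 4-dimensional vectors
rng = np.random.default_rng(0)
vecs = rng.normal(size=(len(toy_uniq), 4))
emb_toy = pd.DataFrame(vecs, columns=[f"emb_{i}" for i in range(4)])
emb_toy.insert(0, "article_id", toy_uniq["article_id"].to_numpy())

# validate="many_to_one" raises if an article_id is duplicated on the right
merged = toy_rows.merge(emb_toy, on="article_id", how="left",
                        validate="many_to_one")
```

A left merge keeps the row count and order of the left frame, which is what keeps the merged features aligned with a target vector built before the merge.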

In [None]:
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

uniq = df[["article_id", "detail_desc"]].drop_duplicates("article_id").fillna("")
emb = embed_model.encode(uniq["detail_desc"].tolist(), show_progress_bar=False)
emb_cols = [f"emb_{i}" for i in range(emb.shape[1])]
emb_df = pd.DataFrame(emb, columns=emb_cols)
emb_df.insert(0, "article_id", uniq["article_id"].values)

trainE = train.merge(emb_df, on="article_id", how="left")
testE  = test.merge(emb_df, on="article_id", how="left")
print("[INFO] Embeddings merged:", trainE.shape, testE.shape)


## Model B — numeric + embeddings

In [None]:
num_plus_emb = num_cols + emb_cols
X_tr_B = trainE[num_plus_emb].to_numpy()
X_te_B = testE[num_plus_emb].to_numpy()

model_B = LassoCV(cv=3, random_state=0).fit(X_tr_B, y_tr)
pred_B = np.expm1(model_B.predict(X_te_B))
r2_B = r2_score(y_te, pred_B)
rmse_B = np.sqrt(mean_squared_error(y_te, pred_B))
mae_B = mean_absolute_error(y_te, pred_B)

print(f"Model B (numeric + emb) — R²={r2_B:.3f}, RMSE={rmse_B:.2f}, MAE={mae_B:.2f}")


## Model C — numeric + embeddings + interactions
We multiply each embedding dimension by `age_m` to let text effects vary over the product lifecycle.
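
In a linear model, adding `age_m * emb_j` alongside `emb_j` makes the effective coefficient on `emb_j` equal to `beta_j + gamma_j * age_m`, so the contribution of the text can grow or shrink with age. The product itself is a row-wise scaling, sketched here with NumPy broadcasting on made-up numbers (`toy_age` and `toy_emb` are hypothetical).

```python
import numpy as np

# Three rows with ages 1, 2, 3 and a 2-dimensional toy embedding
toy_age = np.array([1.0, 2.0, 3.0])
toy_emb = np.array([
    [0.5, -1.0],
    [0.25, 0.0],
    [1.0, 2.0],
])

# Broadcasting: each row of the embedding is scaled by that row's age
toy_inter = toy_age[:, None] * toy_emb

# Design matrix with main effects and interactions side by side
toy_X = np.hstack([toy_age[:, None], toy_emb, toy_inter])
```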

In [None]:
for c in emb_cols:
    trainE[f"age_x_{c}"] = trainE["age_m"] * trainE[c]
    testE[f"age_x_{c}"] = testE["age_m"] * testE[c]

# list the columns explicitly rather than prefix-matching on the frame
num_emb_int = num_cols + emb_cols + [f"age_x_{c}" for c in emb_cols]

X_tr_C = trainE[num_emb_int].to_numpy()
X_te_C = testE[num_emb_int].to_numpy()

model_C = LassoCV(cv=3, random_state=0).fit(X_tr_C, y_tr)
pred_C = np.expm1(model_C.predict(X_te_C))
r2_C = r2_score(y_te, pred_C)
rmse_C = np.sqrt(mean_squared_error(y_te, pred_C))
mae_C = mean_absolute_error(y_te, pred_C)

print(f"Model C (numeric + emb + interactions) — R²={r2_C:.3f}, RMSE={rmse_C:.2f}, MAE={mae_C:.2f}")


## Compare models

In [None]:
labels = ["A numeric", "B +emb", "C +emb×age"]
scores = [r2_A, r2_B, r2_C]
plt.figure(figsize=(5,3))
plt.bar(labels, scores)
plt.ylabel("R² (test)")
plt.title("Embeddings Ablation (GitHub CSV)")
plt.show()
