# 03_train_value_model.ipynb

## Notebook Purpose
- Train a **conditional value model** on **buyers only** (target: `log1p(y_revenue)`).
- Generate `rev_hat` for **all VALID/TEST snapshots** (predicted conditional revenue) and compute expected value:
  - `ev = p_hat * rev_hat` (Expected Value = purchase probability * predicted revenue)
- Update and overwrite prediction files in `artifacts/predictions/`:
  - `predictions_valid_raw.csv` (adds `rev_hat`, `ev`)
  - `predictions_test_raw.csv` (adds `rev_hat`, `ev`)
- Save the value model artifact to `artifacts/models/`.
- Save value model evaluation metrics to `artifacts/metrics/value_metrics_valid.csv`.

## Context
- Shared inputs/outputs and execution conventions are documented in the project README.

---

## Key Outcome (Read this first)
- This notebook implements a **two-stage (hurdle) setup**:
  - Stage A (Notebook 02): purchase probability `p_hat`
  - Stage B (this notebook): conditional revenue `rev_hat | purchase=1`
  - Combined: `ev = p_hat * rev_hat` for ranking and campaign value targeting
- The value model was trained and validated on a **small buyer-only sample**:
  - Buyers: **train=939**, **valid=576** (the purchase base rate is only about 0.14% on TRAIN and 0.21% on VALID, so buyer-only data is limited)
- Validation quality (buyers only):
  - `rmse_log1p = 0.8170`, `mae_log1p = 0.5655`
  - Revenue-space errors are large because revenue is heavy-tailed, which is why the model is trained on the log target:
    - `rmse_revenue = 413.48`, `mae_revenue = 182.61` (valid buyers)
- `rev_hat` is produced for every snapshot (including non-buyers) because it is interpreted as:
  - “If this user buys in the next 7 days, what revenue is expected?”
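
As a worked illustration of the `ev = p_hat * rev_hat` combination, the sketch below recomputes EV for the five VALID rows shown in this notebook's prediction preview and ranks them. The helper name `expected_value` is an assumption made for illustration; the notebook itself computes the product directly on DataFrame columns.

```python
# Rows copied from the VALID prediction preview in this notebook:
# (user_id, p_hat, rev_hat).
rows = [
    (1515915625353230683, 0.128091, 370.060394),
    (1515915625353234047, 0.110201, 31.809345),
    (1515915625353294441, 0.531393, 224.598969),
    (1515915625353400724, 0.573701, 99.288689),
    (1515915625353416040, 0.734995, 177.451233),
]


def expected_value(p_hat: float, rev_hat: float) -> float:
    """Hurdle combination: purchase probability times conditional revenue."""
    return p_hat * rev_hat


# Score every row, then rank from highest to lowest EV.
scored = [(user_id, expected_value(p, r)) for user_id, p, r in rows]
ranked = sorted(scored, key=lambda item: item[1], reverse=True)

for user_id, ev in ranked:
    print(user_id, round(ev, 2))
```

Note that the user with the largest `rev_hat` (about 370) does not rank first: a low `p_hat` pulls its EV below users with moderate revenue but a higher purchase probability.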

---

## Experiment Design
- **Conditional modeling choice (buyers-only)**:
  - Revenue is zero-inflated (most users do not buy), so modeling `y_revenue` directly mixes two problems:
    - (1) purchase likelihood and (2) order value
  - This notebook isolates (2) by fitting on `y_purchase=1` only.
- **Target transformation**:
  - Fit on `log1p(y_revenue)` to reduce heavy-tail impact and stabilize training.
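  - Convert predictions back to revenue space with `expm1` and clip at 0 to create a non-negative `rev_hat`. Back-transforming a log-scale prediction tends to understate the conditional mean of a right-skewed target, so `rev_hat` is best read as a typical order value rather than an unbiased mean.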
- **Splits and leakage control**:
  - Uses the same time-based TRAIN/VALID/TEST snapshot splits as the purchase model.
  - Trains on TRAIN buyers, evaluates on VALID buyers, then scores VALID/TEST snapshots.
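
The target transformation can be sketched with the standard library alone. The revenue values below are made up for illustration (they are not taken from the dataset). The sketch shows that `expm1` inverts `log1p` exactly, and that back-transforming an average taken in log space yields a value below the arithmetic mean when the data are right-skewed.

```python
import math
from statistics import mean

# Hypothetical buyer revenues with one very large order (heavy right tail).
revenues = [12.0, 35.5, 80.0, 126.7, 240.0, 2800.0]

# Training target: log1p compresses the tail so large orders dominate less.
log_targets = [math.log1p(r) for r in revenues]

# expm1 is the exact inverse of log1p, so the round trip recovers the input.
round_trip = [math.expm1(t) for t in log_targets]

# Averaging in log space and transforming back gives a geometric-style
# average, which is never larger than the arithmetic mean (Jensen's inequality).
back_transformed_mean = math.expm1(mean(log_targets))
arithmetic_mean = mean(revenues)

print(f"back-transformed log mean: {back_transformed_mean:.2f}")
print(f"arithmetic mean:           {arithmetic_mean:.2f}")
```

This gap is one reason to treat a back-transformed prediction as a typical order value; whether a bias correction is worthwhile depends on how the EV score is used downstream.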

---

## Model Choice (Value Model: **CatBoost Regressor**)
- Selected to produce a **stable `rev_hat`** for downstream **EV ranking** (`ev = p_hat * rev_hat`): because the two heads are multiplied, overestimation in the value head can dominate the combined score.
- A practical choice for **buyer-only training** (small sample, noisy tabular signals): CatBoost tends to be **robust with strong default performance** and, like other tree ensembles, does not require feature scaling or normalization.
- **Lower tuning burden**: good performance is often achievable with minimal hyperparameter search, which suits the goal of a **quick, reliable baseline value model** rather than squeezing out marginal RMSE gains.
- **LightGBM trade-off**: while often faster, LightGBM typically needs more careful tuning (e.g., number of leaves, regularization) to generalize as stably in small, heavy-tailed revenue settings; given the EV multiplication, that extra instability risk is not worth taking at this stage.
- Early stopping selects the best iteration on **VALID buyers** (iteration 93 of up to 3000) and shrinks the model to 94 trees. Because VALID is also used for evaluation, the reported VALID metrics are likely somewhat optimistic.
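
The early-stopping behaviour described in the last point can be illustrated with a generic patience rule. This is a simplified sketch, not CatBoost's internal overfitting detector; the function name and the loss curve are assumptions made for illustration.

```python
def best_iteration_with_patience(valid_losses, patience):
    """Return (best_index, stop_index) for a patience-based stopping rule.

    Training stops once `patience` iterations have passed without
    improving on the best validation loss seen so far.
    """
    best_index = 0
    best_loss = float("inf")
    for i, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            return best_index, i
    # Never triggered: training ran through every iteration.
    return best_index, len(valid_losses) - 1


# Hypothetical validation curve: improves, then drifts upward (overfitting).
losses = [1.32, 1.10, 0.95, 0.86, 0.82, 0.83, 0.84, 0.85, 0.86]
best, stopped = best_iteration_with_patience(losses, patience=3)

# Iterations are zero-indexed, so keeping the best model means best + 1 trees.
kept_trees = best + 1
print(best, stopped, kept_trees)
```

The zero-based indexing mirrors the training log in this notebook, where `bestIteration = 93` corresponds to shrinking the model to its first 94 iterations.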

---

## Sanity Checks / Interpretation Notes
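- `rev_hat` distribution summaries (all snapshots) show a similar scale on VALID and TEST: mean 158.07 vs 167.91, median 105.04 vs 110.52, and an identical 99th percentile (656.77).
- Buyer vs non-buyer separation is directionally correct on VALID:
  - Mean `rev_hat` is higher for `y_purchase=1` (247.64) than for `y_purchase=0` (157.88).
  - Mean `ev` is substantially higher for buyers (184.18) than for non-buyers (50.85), as expected since `ev` includes `p_hat`.
  - `ev` is built from the raw (not yet calibrated) `p_hat`, so treat it as a ranking score rather than a revenue forecast.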
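
To put the separation in relative terms, the short sketch below turns the VALID group means printed by this notebook into buyer/non-buyer ratios. The helper `buyer_lift` is a hypothetical name used only here.

```python
# Group means copied from the VALID sanity-check output (y_purchase -> mean).
mean_rev_hat = {0: 157.881317, 1: 247.642517}
mean_ev = {0: 50.853775, 1: 184.175293}


def buyer_lift(group_means: dict) -> float:
    """Ratio of the buyer mean to the non-buyer mean."""
    return group_means[1] / group_means[0]


rev_hat_lift = buyer_lift(mean_rev_hat)
ev_lift = buyer_lift(mean_ev)

print(f"rev_hat lift: {rev_hat_lift:.2f}x")
print(f"ev lift:      {ev_lift:.2f}x")
```

The `ev` ratio (roughly 3.6x) is larger than the `rev_hat` ratio (roughly 1.6x), which is consistent with most of the separation coming from `p_hat` rather than from the value head.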

---

## Outputs
- Model artifact:
  - `artifacts/models/value_model_catboost.cbm`
- Metrics:
  - `artifacts/metrics/value_metrics_valid.csv`
- Updated prediction files (adds `rev_hat`, `ev`):
  - `artifacts/predictions/predictions_valid_raw.csv`
  - `artifacts/predictions/predictions_test_raw.csv`


In [1]:
# ============ Common PATH / ENV block ============
from pathlib import Path

PROJECT_ROOT = Path(r"C:\Users\seony\Desktop\personal_project\purchase_prediction")

DATA_DIR = PROJECT_ROOT / "data"
RAW_DIR = DATA_DIR / "raw"
PROCESSED_DIR = DATA_DIR / "processed"

ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
MODELS_DIR = ARTIFACTS_DIR / "models"
PRED_DIR = ARTIFACTS_DIR / "predictions"
REPORTS_DIR = ARTIFACTS_DIR / "reports"
METRICS_DIR = ARTIFACTS_DIR / "metrics"
FIGURES_DIR = ARTIFACTS_DIR / "figures"

for d in [RAW_DIR, PROCESSED_DIR, MODELS_DIR, PRED_DIR, REPORTS_DIR, METRICS_DIR, FIGURES_DIR]:
    d.mkdir(parents=True, exist_ok=True)

print("PROJECT_ROOT:", PROJECT_ROOT)
print("PROCESSED_DIR:", PROCESSED_DIR)
print("PRED_DIR:", PRED_DIR)

PROJECT_ROOT: C:\Users\seony\Desktop\personal_project\purchase_prediction
PROCESSED_DIR: C:\Users\seony\Desktop\personal_project\purchase_prediction\data\processed
PRED_DIR: C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\predictions


In [2]:
# ============ Imports ============
import numpy as np
import pandas as pd

from catboost import CatBoostRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error

In [3]:
# ============ Inputs / outputs ============
USER_DATASET_PATH = PROCESSED_DIR / "user_dataset_hist23_label7_snapshots_v1.parquet"

PRED_VALID_RAW = PRED_DIR / "predictions_valid_raw.csv"
PRED_TEST_RAW  = PRED_DIR / "predictions_test_raw.csv"

VALUE_MODEL_OUT = MODELS_DIR / "value_model_catboost.cbm"
VALUE_METRICS_OUT = METRICS_DIR / "value_metrics_valid.csv"

print("USER_DATASET_PATH:", USER_DATASET_PATH)
print("PRED_VALID_RAW:", PRED_VALID_RAW, "| exists:", PRED_VALID_RAW.exists())
print("PRED_TEST_RAW :", PRED_TEST_RAW,  "| exists:", PRED_TEST_RAW.exists())

USER_DATASET_PATH: C:\Users\seony\Desktop\personal_project\purchase_prediction\data\processed\user_dataset_hist23_label7_snapshots_v1.parquet
PRED_VALID_RAW: C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\predictions\predictions_valid_raw.csv | exists: True
PRED_TEST_RAW : C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\predictions\predictions_test_raw.csv | exists: True


In [4]:
# ============ Load user snapshot dataset ============
df = pd.read_parquet(USER_DATASET_PATH)
print("Loaded:", df.shape)

# Required columns
required = {"user_id", "cutoff", "split", "y_purchase", "y_revenue"}
missing = sorted(required - set(df.columns))
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Feature columns: exclude ids/labels/meta
drop_cols = {"user_id", "cutoff", "split", "y_purchase", "y_revenue"}
feature_cols = [c for c in df.columns if c not in drop_cols]

print("n_features:", len(feature_cols))
print("features:", feature_cols)

# Ensure numeric (simple guard)
non_numeric = [c for c in feature_cols if not pd.api.types.is_numeric_dtype(df[c])]
if non_numeric:
    raise ValueError(f"Non-numeric features found (please remove or encode): {non_numeric}")

# Split datasets
train = df[df["split"] == "train"].copy()
valid = df[df["split"] == "valid"].copy()
test  = df[df["split"] == "test"].copy()

print("Splits:", train.shape, valid.shape, test.shape)
print("Base rates:", train["y_purchase"].mean(), valid["y_purchase"].mean(), test["y_purchase"].mean())

# Buyers only for value model training
train_b = train[train["y_purchase"] == 1].copy()
valid_b = valid[valid["y_purchase"] == 1].copy()

print("Buyers (train/valid):", len(train_b), len(valid_b))
print("Train buyers revenue summary:")
display(train_b["y_revenue"].astype(float).describe(percentiles=[0.5, 0.9, 0.95, 0.99]))

Loaded: (1183222, 19)
n_features: 14
features: ['n_events', 'n_sessions', 'n_products', 'n_categories', 'price_mean', 'price_max', 'price_min', 'n_cart', 'n_purchase_hist', 'n_view', 'recency_days', 'cart_view_ratio', 'purchase_cart_ratio_hist', 'events_per_session']
Splits: (651349, 19) (278847, 19) (253026, 19)
Base rates: 0.0014416234614622883 0.0020656489042378077 0.0019760815094101002
Buyers (train/valid): 939 576
Train buyers revenue summary:


count     939.000000
mean      242.158371
std       337.849356
min         1.570000
50%       126.680000
90%       598.607983
95%       965.560980
99%      1830.556357
max      2862.850098
Name: y_revenue, dtype: float64

In [5]:
# ============ Build training matrices (buyers only) ============
X_train = train_b[feature_cols]
y_train = np.log1p(train_b["y_revenue"].astype(float).values)

X_valid = valid_b[feature_cols]
y_valid = np.log1p(valid_b["y_revenue"].astype(float).values)

print("X_train/y_train:", X_train.shape, y_train.shape)
print("X_valid/y_valid:", X_valid.shape, y_valid.shape)

X_train/y_train: (939, 14) (939,)
X_valid/y_valid: (576, 14) (576,)


In [6]:
# ============ Train value model (CatBoostRegressor) ============
model = CatBoostRegressor(
    loss_function="RMSE",
    iterations=3000,
    learning_rate=0.05,
    depth=8,
    l2_leaf_reg=5.0,
    random_seed=42,
    early_stopping_rounds=200,
    verbose=200,
)

model.fit(
    X_train, y_train,
    eval_set=(X_valid, y_valid),
    use_best_model=True,
)

model.save_model(VALUE_MODEL_OUT)
print("Saved model:", VALUE_MODEL_OUT)

0:	learn: 1.2091521	test: 1.3226943	best: 1.3226943 (0)	total: 132ms	remaining: 6m 34s
200:	learn: 0.5864884	test: 0.8296902	best: 0.8170357 (93)	total: 625ms	remaining: 8.7s
Stopped by overfitting detector  (200 iterations wait)

bestTest = 0.817035672
bestIteration = 93

Shrink model to first 94 iterations.
Saved model: C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\models\value_model_catboost.cbm


In [7]:
# ============ Validate value model (buyers only) ============
pred_valid_log = model.predict(X_valid)

mse_log = mean_squared_error(y_valid, pred_valid_log)
rmse_log = float(np.sqrt(mse_log))
mae_log = float(mean_absolute_error(y_valid, pred_valid_log))

# Back to revenue scale (still conditional)
pred_valid_rev = np.expm1(pred_valid_log)
pred_valid_rev = np.clip(pred_valid_rev, 0.0, None)

y_valid_rev = valid_b["y_revenue"].astype(float).values

mse_rev = mean_squared_error(y_valid_rev, pred_valid_rev)
rmse_rev = float(np.sqrt(mse_rev))
mae_rev = float(mean_absolute_error(y_valid_rev, pred_valid_rev))

metrics = pd.DataFrame([{
    "metric": "rmse_log1p",
    "value": rmse_log,
}, {
    "metric": "mae_log1p",
    "value": mae_log,
}, {
    "metric": "rmse_revenue",
    "value": rmse_rev,
}, {
    "metric": "mae_revenue",
    "value": mae_rev,
}, {
    "metric": "n_valid_buyers",
    "value": int(len(valid_b)),
}])

metrics.to_csv(VALUE_METRICS_OUT, index=False)
print("Saved:", VALUE_METRICS_OUT)
display(metrics)

Saved: C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\metrics\value_metrics_valid.csv


,metric,value
0,rmse_log1p,0.817036
1,mae_log1p,0.565461
2,rmse_revenue,413.484198
3,mae_revenue,182.608939
4,n_valid_buyers,576.0
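
One way to read the log-space metrics above: an error of size `d` on the `log1p` scale corresponds to a multiplicative factor of `exp(d)` on `1 + revenue`. The sketch below applies that conversion to the reported values; it is an interpretation aid, not an additional metric computed by the notebook.

```python
import math

# Values copied from the metrics table above.
rmse_log1p = 0.817036
mae_log1p = 0.565461

# exp(d) is the multiplicative miss on (1 + revenue) implied by a
# log1p-scale error of size d.
rmse_factor = math.exp(rmse_log1p)
mae_factor = math.exp(mae_log1p)

print(f"RMSE-sized error -> factor of about {rmse_factor:.2f}")
print(f"MAE-sized error  -> factor of about {mae_factor:.2f}")
```

An MAE-sized miss therefore means being off by a factor of roughly 1.8 on `1 + revenue`, which gives a sense of how coarse the per-user value estimates are with only 939 training buyers.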


In [8]:
# ============ Predict rev_hat for VALID/TEST (all users) ============
# The model is trained on buyers, but we generate rev_hat for all rows for EV ranking.

valid_all = valid.copy()
test_all  = test.copy()

rev_hat_valid = np.expm1(model.predict(valid_all[feature_cols]))
rev_hat_test  = np.expm1(model.predict(test_all[feature_cols]))

rev_hat_valid = np.clip(rev_hat_valid, 0.0, None).astype("float32")
rev_hat_test  = np.clip(rev_hat_test, 0.0, None).astype("float32")

valid_all["rev_hat"] = rev_hat_valid
test_all["rev_hat"]  = rev_hat_test

print("rev_hat VALID summary:")
display(pd.Series(rev_hat_valid).describe(percentiles=[0.5, 0.9, 0.95, 0.99]))
print("rev_hat TEST summary:")
display(pd.Series(rev_hat_test).describe(percentiles=[0.5, 0.9, 0.95, 0.99]))

rev_hat VALID summary:


count    278847.000000
mean        158.066742
std         153.967285
min           4.639130
50%         105.041504
90%         395.953174
95%         490.927582
99%         656.772644
max         764.813171
dtype: float64

rev_hat TEST summary:


count    253026.000000
mean        167.910706
std         159.685059
min           4.541190
50%         110.524063
90%         410.902618
95%         509.188370
99%         656.772644
max         766.440430
dtype: float64

In [9]:
# ============ Update predictions files from notebook 02 (add rev_hat + ev) ============
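# Join on (user_id, cutoff); cutoff must be a tz-aware UTC datetime on both sides.

pred_valid = pd.read_csv(PRED_VALID_RAW)
pred_test  = pd.read_csv(PRED_TEST_RAW)

# The files are overwritten below, so drop columns left by a previous run;
# otherwise a re-run would create rev_hat_x / rev_hat_y and break the checks.
pred_valid = pred_valid.drop(columns=["rev_hat", "ev"], errors="ignore")
pred_test  = pred_test.drop(columns=["rev_hat", "ev"], errors="ignore")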

# Parse cutoff
pred_valid["cutoff"] = pd.to_datetime(pred_valid["cutoff"], utc=True, errors="coerce")
pred_test["cutoff"]  = pd.to_datetime(pred_test["cutoff"],  utc=True, errors="coerce")

join_cols = ["user_id", "cutoff"]

add_valid = valid_all[join_cols + ["rev_hat"]]
add_test  = test_all[join_cols + ["rev_hat"]]

m_valid = pred_valid.merge(add_valid, on=join_cols, how="left")
m_test  = pred_test.merge(add_test,  on=join_cols, how="left")

if m_valid["rev_hat"].isna().any() or m_test["rev_hat"].isna().any():
    n1 = int(m_valid["rev_hat"].isna().sum())
    n2 = int(m_test["rev_hat"].isna().sum())
    raise ValueError(f"rev_hat merge produced NaNs (valid={n1}, test={n2}). Check join keys and cutoff parsing.")

m_valid["rev_hat"] = m_valid["rev_hat"].astype("float32")
m_test["rev_hat"]  = m_test["rev_hat"].astype("float32")

# EV for ranking
m_valid["ev"] = (m_valid["p_hat"].astype(float) * m_valid["rev_hat"].astype(float)).astype("float32")
m_test["ev"]  = (m_test["p_hat"].astype(float) * m_test["rev_hat"].astype(float)).astype("float32")

# Overwrite raw prediction files (still "raw" for p_hat; needed before calibration)
m_valid.to_csv(PRED_VALID_RAW, index=False)
m_test.to_csv(PRED_TEST_RAW, index=False)

print("Updated predictions:")
print(" -", PRED_VALID_RAW)
print(" -", PRED_TEST_RAW)

display(m_valid.head(5))

Updated predictions:
 - C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\predictions\predictions_valid_raw.csv
 - C:\Users\seony\Desktop\personal_project\purchase_prediction\artifacts\predictions\predictions_test_raw.csv


,user_id,cutoff,y_purchase,y_revenue,p_hat,p_hat_model,rev_hat,ev
0,1515915625353230683,2020-12-27 00:00:00+00:00,0,0.0,0.128091,catboost,370.060394,47.401447
1,1515915625353234047,2020-12-27 00:00:00+00:00,0,0.0,0.110201,catboost,31.809345,3.505434
2,1515915625353294441,2020-12-27 00:00:00+00:00,0,0.0,0.531393,catboost,224.598969,119.350327
3,1515915625353400724,2020-12-27 00:00:00+00:00,0,0.0,0.573701,catboost,99.288689,56.962048
4,1515915625353416040,2020-12-27 00:00:00+00:00,0,0.0,0.734995,catboost,177.451233,130.425858
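
The join on `(user_id, cutoff)` can be made stricter with two standard `DataFrame.merge` options. The sketch below uses made-up miniature frames (the names `pred` and `scored` and all values are hypothetical) and assumes the keys are unique on both sides, which the notebook itself does not verify.

```python
import pandas as pd

# Hypothetical miniature prediction file (cutoff arrives as text after a CSV round trip).
pred = pd.DataFrame({
    "user_id": [1, 2, 3],
    "cutoff": ["2020-12-27 00:00:00+00:00"] * 3,
    "p_hat": [0.10, 0.50, 0.70],
})

# Hypothetical scored snapshots with a tz-aware UTC cutoff.
scored = pd.DataFrame({
    "user_id": [1, 2, 3],
    "cutoff": pd.to_datetime(["2020-12-27 00:00:00+00:00"] * 3, utc=True),
    "rev_hat": [370.0, 32.0, 225.0],
})

# Parse the text cutoff so both key columns share the same dtype.
pred["cutoff"] = pd.to_datetime(pred["cutoff"], utc=True)

merged = pred.merge(
    scored,
    on=["user_id", "cutoff"],
    how="left",
    validate="one_to_one",  # raises MergeError if either side has duplicate keys
    indicator=True,         # adds a _merge column: "both" or "left_only"
)

# Rows that found no match on the right-hand side.
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")
merged["ev"] = merged["p_hat"] * merged["rev_hat"]
print(unmatched, len(merged))
```

With `validate`, duplicate keys fail loudly instead of silently multiplying rows, and the `_merge` indicator gives an unmatched-row count that does not rely on `rev_hat` being NaN.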


In [10]:
# ============ Quick sanity: buyers should have higher rev_hat and ev on average ============
tmp = m_valid.copy()
tmp["y_purchase"] = tmp["y_purchase"].astype(int)

print("Mean rev_hat by y_purchase (VALID):")
print(tmp.groupby("y_purchase")["rev_hat"].mean())

print("\nMean ev by y_purchase (VALID):")
print(tmp.groupby("y_purchase")["ev"].mean())

Mean rev_hat by y_purchase (VALID):
y_purchase
0    157.881317
1    247.642517
Name: rev_hat, dtype: float32

Mean ev by y_purchase (VALID):
y_purchase
0     50.853775
1    184.175293
Name: ev, dtype: float32
