
# Add-to-Bag 

This notebook follows the strategy you outlined:

1) **Data Collection & Feature Engineering** — user behavior, item attributes, user history (from available fields), time & query context, and smoothed popularity histories (product/brand/type/color/user).  
2) **Model Choice** — Random Forest, LightGBM, and XGBoost.  
3) **Training & Prediction** — time-based split (train ≤ 2010‑01‑05, validate 2010‑01‑06..07), probability outputs.  
4) **Evaluation** — PR‑AUC, LogLoss, ROC‑AUC (optional), and ranking metrics: MAP@10, NDCG@10, HR@10.  
5) **Deployment** — retrain on full train (2010‑01‑01..07), score holdout (2010‑01‑08), save `predictions.csv` with required columns.


In [15]:

import pandas as pd
import numpy as np

from sklearn.metrics import average_precision_score, log_loss, roc_auc_score

from sklearn.ensemble import RandomForestClassifier

# Optional boosters (install if missing)
try:
    import lightgbm as lgb
    HAS_LGB = True
except Exception:
    HAS_LGB = False

try:
    from xgboost import XGBClassifier
    HAS_XGB = True
except Exception:
    HAS_XGB = False

print("HAS_LGB:", HAS_LGB, "HAS_XGB:", HAS_XGB)


HAS_LGB: True HAS_XGB: True


## Input (adjust paths if needed)

In [None]:
# Define file paths for the three parquet datasets
TRAIN_PATH   = "interactions_train.parquet"
HOLDOUT_PATH = "interactions_holdout_predictions.parquet"
PRODUCTS_PATH = "products.parquet"

# Load parquet files into pandas DataFrames
train = pd.read_parquet(TRAIN_PATH)
holdout = pd.read_parquet(HOLDOUT_PATH)
products = pd.read_parquet(PRODUCTS_PATH)

# Inspect the training dataset
train.info(memory_usage='deep')

# Print the shapes of holdout and products datasets
print("\nHoldout shape:", holdout.shape)
print("\nProducts shape:", products.shape)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4974954 entries, 0 to 4974953
Data columns (total 8 columns):
 #   Column                     Dtype         
---  ------                     -----         
 0   search_query_time          datetime64[ns]
 1   user_id                    int32         
 2   search_query_id            int32         
 3   product_id                 int32         
 4   rank                       int32         
 5   price_discount_percentage  float64       
 6   viewed                     bool          
 7   added_to_bag               bool          
dtypes: bool(2), datetime64[ns](1), float64(1), int32(4)
memory usage: 161.3 MB

Holdout shape: (567384, 8)

Products shape: (175934, 4)


## Time-based split (train ≤ Jan 5, validate Jan 6–7)

We split the dataset by time to mimic the real-world scenario: training on past interactions (Jan 1–5) and validating on future interactions (Jan 6–7). This prevents data leakage, ensures evaluation is realistic, and tests whether the model generalizes to unseen days.

In [None]:
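# Define the validation window.
# Note: pd.Timestamp("2010-01-07") is midnight at the start of Jan 7, so the
# inclusive `<= val_end` filter below keeps Jan 6 plus only rows stamped exactly
# 2010-01-07 00:00:00. To cover all of Jan 7, use `< pd.Timestamp("2010-01-08")`.
val_start = pd.Timestamp("2010-01-06")
val_end   = pd.Timestamp("2010-01-07")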

# Create training subset
train_fit = train[train["search_query_time"] < val_start].copy()

# Create validation subset
val_df    = train[(train["search_query_time"] >= val_start) & (train["search_query_time"] <= val_end)].copy()

# Print row counts for sanity check
print("Train_fit rows:", len(train_fit), " | Val rows:", len(val_df))


Train_fit rows: 3558753  | Val rows: 679725
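
A half-open window (`start <= t < end`) sidesteps the midnight boundary that an inclusive comparison against a date-only timestamp runs into when rows carry a time of day. The sketch below illustrates the idea on toy data; `split_by_time` and the toy frame are hypothetical and are not part of the notebook's pipeline.

```python
import pandas as pd

def split_by_time(df, time_col, val_start, val_end_exclusive):
    """Return (fit, val): fit is t < val_start, val is val_start <= t < val_end_exclusive."""
    t = df[time_col]
    fit = df[t < val_start].copy()
    val = df[(t >= val_start) & (t < val_end_exclusive)].copy()
    return fit, val

# Toy timestamps around the split boundaries (illustrative only)
toy = pd.DataFrame({
    "search_query_time": pd.to_datetime([
        "2010-01-05 23:59:59",
        "2010-01-06 00:00:00",
        "2010-01-07 00:00:00",
        "2010-01-07 18:30:00",
        "2010-01-08 00:00:00",
    ])
})

fit, val = split_by_time(
    toy, "search_query_time",
    pd.Timestamp("2010-01-06"), pd.Timestamp("2010-01-08"),
)
print(len(fit), len(val))  # 1 row before Jan 6, 3 rows in Jan 6-7
```

With the exclusive upper bound set to Jan 8, the 18:30 row on Jan 7 lands in the validation set, and the Jan 8 row is left out of both subsets.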


## Helpers — ID normalization, metrics, features, smoothed histories

In [None]:

def normalize_ids(df):
    """
    Ensure consistency of ID columns by converting them to numeric int64.

    Args:
        df (pd.DataFrame): Input DataFrame containing ID columns.

    Returns:
        pd.DataFrame: Copy of the DataFrame with normalized ID columns.
    """
    df = df.copy()  # work on a copy so the caller's frame is not mutated
    for c in ["product_id","brand_id","product_type_id","colour_id","user_id"]:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors="coerce").fillna(-1).astype("int64")
    return df

# Ranking metric: Average Precision at K (AP@K)
def ap_at_k(group, k=10):
    """
    Compute the Average Precision at K (AP@K) for a single ranked group.

    AP@K measures the quality of the ranking by averaging precision
    values at the positions of relevant items, up to the top-k results.
    It rewards placing positive examples (y_true=1) earlier in the list.

    Args:
        group (pd.DataFrame): Subset of rows belonging to a single query
            or session, containing at least:
              - 'y_true': ground-truth labels (0 or 1)
              - 'y_score': model scores or predicted probabilities
        k (int, optional): Cutoff rank (default=10). Only the top-k
            items are considered when computing AP.

    Returns:
        float: Average precision score for this group at cutoff k.
    """
    # Sort rows by predicted score (highest first) and keep only top-k
    g = group.sort_values("y_score", ascending=False).head(k)

    # Extract true labels for these k items
    y = g["y_true"].to_numpy()

    # If there are no positives in the top-k, AP is defined as 0
    if y.sum() == 0:
        return 0.0
    precisions, hits = [], 0

    # Iterate through each rank position (1..k)
    for i, t in enumerate(y, 1):
        if t == 1:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

# Ranking metric: Normalized Discounted Cumulative Gain at K (NDCG@K)
def ndcg_at_k(group, k=10):
    """
    Compute Normalized Discounted Cumulative Gain (NDCG) at cutoff K
    for a single ranked group.

    NDCG@K evaluates how well the model ranks relevant items in the
    top-K results. It rewards placing relevant items earlier in the list
    by applying a logarithmic discount to lower-ranked positions.

    Args:
        group (pd.DataFrame): Subset of rows belonging to a single query
            or session, containing at least:
              - 'y_true': ground-truth labels (0 or 1)
              - 'y_score': model scores or predicted probabilities
        k (int, optional): Cutoff rank (default=10). Only the top-k
            items are considered in the calculation.

    Returns:
        float: NDCG score between 0 and 1 for this group at cutoff k.
               1.0 indicates perfect ranking of relevant items.
    """
    g = group.sort_values("y_score", ascending=False).head(k)
    y = g["y_true"].to_numpy()

    # If there are no positives in the top-k, NDCG is defined as 0
    if y.sum() == 0:
        return 0.0

    # Compute logarithmic discount factors: 1/log2(rank+1)
    discounts = 1.0 / np.log2(np.arange(2, len(y)+2))

    # DCG: sum of discounted gains for predicted ranking
    dcg  = float((y * discounts).sum())

    # IDCG: best DCG achievable by reordering the retrieved top-k items;
    # positives ranked below the cutoff are not counted in this ideal
    idcg = float((np.sort(y)[::-1] * discounts).sum())

    # Normalize: NDCG = DCG / IDCG
    return dcg/idcg if idcg>0 else 0.0

# Ranking metric: Hit Ratio at K (HR@K)
def hr_at_k(group, k=10):
    """
    Compute the Hit Ratio at cutoff K (HR@K) for a single ranked group.

    HR@K measures whether at least one relevant item appears in the
    top-K results for a query/session. It is a binary metric per group:
    - Returns 1.0 if there is at least one positive label in the top-K
    - Returns 0.0 otherwise

    Args:
        group (pd.DataFrame): Subset of rows for a single query/session,
            containing at least:
              - 'y_true': ground-truth labels (0 or 1)
              - 'y_score': model scores or predicted probabilities
        k (int, optional): Cutoff rank (default=10).
                           Only the top-K items are checked.

    Returns:
        float: 1.0 if at least one relevant item is in top-K, else 0.0
    """

    # Sort rows by predicted score (highest first) and keep only top-k
    g = group.sort_values("y_score", ascending=False).head(k)

    # If any relevant item (y_true=1) exists in the top-k, it's a hit
    return 1.0 if g["y_true"].sum() > 0 else 0.0

# Feature Engineering: Query-level, Time-based, and Discount Features
def add_basic_feats(d):
    """
    Enrich a dataframe of search interactions with basic contextual features.

    This function creates features that describe the search query context,
    temporal information, and relative discount signals. All features are
    safe to compute at inference time (no label leakage).

    Args:
        d (pd.DataFrame): Input interactions dataframe containing at least:
            - 'search_query_id' (int)
            - 'rank' (int)
            - 'search_query_time' (datetime64)
            - 'price_discount_percentage' (float)

    Returns:
        pd.DataFrame: Copy of input dataframe with new feature columns added.
    """
    d = d.copy()

    # results_per_query: how many products were shown in this query
    qsize = d.groupby("search_query_id").size().rename("results_per_query").astype("int32")
    d["results_per_query"] = d["search_query_id"].map(qsize)

    # rank_norm: rank divided by query size (near 0 = top, 1 = bottom,
    # assuming `rank` is 1-based within the query)
    d["rank_norm"] = d["rank"] / d["results_per_query"].clip(lower=1)

    # Time-based features
    d["hour"] = d["search_query_time"].dt.hour.astype("int16")
    d["dow"]  = d["search_query_time"].dt.dayofweek.astype("int16")

    # Query discount statistics
    q_disc = d.groupby("search_query_id")["price_discount_percentage"].agg(["mean","std"]).rename(
        columns={"mean":"q_disc_mean","std":"q_disc_std"}
    )

    # Map mean and std discount back to each row
    d["q_disc_mean"] = d["search_query_id"].map(q_disc["q_disc_mean"])
    d["q_disc_std"]  = d["search_query_id"].map(q_disc["q_disc_std"]).fillna(0.0)
    
    # Session-ish relative features
    d["rank_pct"] = d.groupby("search_query_id")["rank"].rank(pct=True).astype("float32")
    d["disc_rel"] = (d["price_discount_percentage"] - d.groupby("search_query_id")["price_discount_percentage"].transform("mean")).astype("float32")
    return d

# Smoothed historical add-rates (fit on past only)
def smoothed_histories(df_fit, global_add, alpha=5.0, keys=("product_id","brand_id","product_type_id","colour_id","user_id")):
    """
    Compute smoothed historical add-to-bag rates for specified keys.

    This function calculates, for each entity type (e.g., product, brand),
    the probability that an item is added to bag given it was displayed.
    Rates are smoothed with a Bayesian prior to avoid extreme values for
    rare entities.

    Args:
        df_fit (pd.DataFrame): Training subset used to compute histories.
                               Must contain 'added_to_bag' and the keys.
        global_add (float): Global add-to-bag rate, used as prior mean.
        alpha (float, optional): Smoothing factor; higher values pull
                                 entity rates closer to the global rate.
                                 Default is 5.0.
        keys (tuple of str, optional): Column names to compute histories for.
                                       Defaults to ('product_id', 'brand_id',
                                       'product_type_id', 'colour_id', 'user_id').

    Returns:
        dict: Mapping {key_name: DataFrame} where each DataFrame contains:
              - key column (e.g., 'product_id')
              - smoothed add-rate column (e.g., 'product_id_add_rate')
    """
    tables = {}
    for key in keys:
        if key not in df_fit.columns: 
            # Skip keys not present in the dataset
            continue

        # Aggregate counts per entity: how many adds, how many exposures
        g = df_fit.groupby(key).agg(adds=("added_to_bag","sum"),
                                    n=("added_to_bag","size")).reset_index()
        
        # Apply Bayesian smoothing with global prior
        g[f"{key}_add_rate"] = (g["adds"] + alpha*global_add) / (g["n"] + alpha)

        # Keep only the identifier and its computed rate
        tables[key] = g[[key, f"{key}_add_rate"]]
    return tables

# Merge Smoothed History Tables Back into Main Dataset
def merge_histories(df, tables, fill_val):
    """
    Enrich a DataFrame with smoothed add-to-bag history features.

    This function merges precomputed historical add-rates (from
    `smoothed_histories`) back onto the main interactions DataFrame
    for each specified key (e.g., product_id, brand_id).

    Any missing values (e.g., new IDs unseen in training) are filled
    with a global prior rate to ensure consistent coverage.

    Args:
        df (pd.DataFrame): Main DataFrame of interactions. Must contain
                           the identifier columns present in `tables`.
        tables (dict): Dictionary of lookup DataFrames as returned by
                       `smoothed_histories`, keyed by column name.
                       Each table should contain:
                         - the key column (e.g., "product_id")
                         - a corresponding add-rate column (e.g.,
                           "product_id_add_rate")
        fill_val (float): Value to fill in for missing add-rates
                          (typically the global add-to-bag rate).

    Returns:
        pd.DataFrame: Copy of input DataFrame with additional
                      `*_add_rate` columns merged in.
    """
    d = df.copy()

    # Iterate through each entity type and merge its add-rate features
    for key, t in tables.items():
        # Left join ensures all rows in d are preserved
        d = d.merge(t, on=key, how="left")
        # Replace NaNs (IDs unseen in training) with global prior rate
        d[f"{key}_add_rate"] = d[f"{key}_add_rate"].fillna(fill_val)
    return d
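
To make the smoothing concrete: with prior rate `p`, strength `alpha`, and an entity that has `adds` successes in `n` exposures, the smoothed rate is `(adds + alpha*p) / (n + alpha)`. The sketch below applies that formula to made-up counts; the `smoothed_rate` helper and all numbers are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

def smoothed_rate(adds, n, prior, alpha=5.0):
    """Shrink an observed rate toward `prior` with pseudo-count `alpha`."""
    return (adds + alpha * prior) / (n + alpha)

prior = 0.0015  # hypothetical global add-to-bag rate

counts = pd.DataFrame({
    "product_id": [101, 102, 103],
    "adds":       [1,   0,   30],
    "n":          [1,   0,   10000],
})

# Raw rate is undefined for an unseen product (n = 0), so mask it to NaN
counts["raw_rate"] = counts["adds"] / counts["n"].where(counts["n"] > 0)
counts["smoothed"] = smoothed_rate(counts["adds"], counts["n"], prior)
print(counts)
```

A product seen once and added once has a raw rate of 1.0 but a smoothed rate of about 0.17. An unseen product falls back to the prior exactly, and a product with 10,000 exposures keeps a rate very close to its raw 0.003.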


## Join product attributes & build features

In [None]:

# Join products
train_fit = train_fit.merge(products, on="product_id", how="left")
val_df    = val_df.merge(products,    on="product_id", how="left")
holdout_df   = holdout.merge(products,   on="product_id", how="left")

# Normalize IDs
train_fit = normalize_ids(train_fit)
val_df    = normalize_ids(val_df)
holdout_df   = normalize_ids(holdout_df)

# Add safe features
train_fit = add_basic_feats(train_fit)
val_df    = add_basic_feats(val_df)
holdout_df   = add_basic_feats(holdout_df)

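# Histories from train_fit only, so validation and holdout labels never leak in.
# Caveat: merging these rates back onto train_fit means each training row's own
# label contributes to its history features, which can overstate their value in training.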
global_add = train_fit["added_to_bag"].mean()
H = smoothed_histories(train_fit, global_add, alpha=5.0)

train_fit = merge_histories(train_fit, H, fill_val=global_add)
val_df    = merge_histories(val_df,    H, fill_val=global_add)
holdout_df   = merge_histories(holdout_df,   H, fill_val=global_add)

# Feature set
FEATS = [
    "rank","rank_norm","results_per_query","rank_pct",
    "hour","dow",
    "price_discount_percentage","q_disc_mean","q_disc_std","disc_rel",
    "product_id_add_rate","brand_id_add_rate","product_type_id_add_rate","colour_id_add_rate","user_id_add_rate",
    "product_type_id","brand_id","colour_id"
]

X_train = train_fit[FEATS]; y_train = train_fit["added_to_bag"].astype(int)
X_val   = val_df[FEATS];    y_val   = val_df["added_to_bag"].astype(int)

print("X_train:", X_train.shape, "  X_val:", X_val.shape)


X_train: (3558753, 18)   X_val: (679725, 18)
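
One way to keep a training row's own label out of its history features is to compute each row's rate from strictly earlier days only. The sketch below is a hedged illustration of that idea, not the notebook's method: `past_only_rate`, the `day` column, and the toy data are hypothetical. In this notebook a day column could be derived with `search_query_time.dt.floor("D")`.

```python
import pandas as pd

def past_only_rate(df, key, day_col, prior, label="added_to_bag", alpha=5.0):
    """Smoothed add rate per row, using only rows of the same `key` from earlier days."""
    daily = (
        df.groupby([key, day_col])[label]
          .agg(adds="sum", n="count")
          .reset_index()
          .sort_values([key, day_col])
    )
    # Running totals per key, minus the current day, give strictly-earlier counts
    by_key = daily.groupby(key)
    daily["adds_before"] = by_key["adds"].cumsum() - daily["adds"]
    daily["n_before"] = by_key["n"].cumsum() - daily["n"]
    daily["rate"] = (daily["adds_before"] + alpha * prior) / (daily["n_before"] + alpha)
    # Left merge preserves the row order of df
    merged = df.merge(daily[[key, day_col, "rate"]], on=[key, day_col], how="left")
    return merged["rate"].to_numpy()

toy = pd.DataFrame({
    "product_id":   [1, 1, 1, 2],
    "day":          pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02", "2010-01-02"]),
    "added_to_bag": [1, 1, 0, 0],
})
toy["product_rate"] = past_only_rate(toy, "product_id", "day", prior=0.1, alpha=1.0)
print(toy)
```

On the first day every product has no history and receives the prior (0.1). On the second day product 1 gets (2 + 0.1) / (2 + 1) = 0.7, built from day-one rows only.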


## Training & evaluation helpers

In [None]:
# Model Evaluation Helper
def evaluate_probs(name, probs, y_val, val_df): 
    """
    Evaluate a model's probability predictions on validation data.

    Computes both probability-based metrics and ranking-based metrics,
    then prints a concise summary and returns results in a dictionary.

    Metrics computed:
        - PR-AUC : Area under the Precision-Recall curve (good for imbalance)
        - LogLoss: Penalizes poorly calibrated probabilities
        - MAP@10 : Mean Average Precision at cutoff 10, averaged per query
        - NDCG@10: Normalized Discounted Cumulative Gain at cutoff 10
        - HR@10  : Hit Ratio at cutoff 10

    Args:
        name (str): Label/name of the model (for reporting).
        probs (array-like): Predicted probabilities for the positive class.
        y_val (array-like): Ground-truth binary labels (0 or 1).
        val_df (pd.DataFrame): Validation dataframe containing
                               at least 'search_query_id' for grouping.

    Returns:
        dict: Dictionary with model name and all metric values.
    """

    # Build evaluation DataFrame with required fields
    ve = pd.DataFrame({
        "search_query_id": val_df["search_query_id"].values,
        "y_true": y_val.values,
        "y_score": probs
    })

    # Compute ranking metrics 
    map10  = ve.groupby("search_query_id").apply(ap_at_k,  10).mean()
    ndcg10 = ve.groupby("search_query_id").apply(ndcg_at_k, 10).mean()
    hr10   = ve.groupby("search_query_id").apply(hr_at_k,  10).mean()

    # Probability-based metrics
    prauc  = average_precision_score(y_val, probs)
    ll     = log_loss(y_val, np.clip(probs, 1e-9, 1-1e-9))

    # Print formatted one-line summary (easy to compare models)
    print(f"[{name}] PR-AUC={prauc:.6f}  LogLoss={ll:.6f}  MAP@10={map10:.6f}  NDCG@10={ndcg10:.6f}  HR@10={hr10:.6f}")

    # Return metrics as a dict for leaderboard aggregation
    return dict(model=name, pr_auc=prauc, logloss=ll, map10=map10, ndcg10=ndcg10, hr10=hr10)
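
As a worked example of the three ranking metrics, consider one query whose items, sorted by descending score, have labels `[0, 1, 0, 1, 0]`. The compact functions below are assumed restatements of the notebook's definitions that take an already-sorted label list: AP averages precision over the hits found in the top-k, and the ideal DCG is built from the top-k labels.

```python
import numpy as np

def ap_at_k(labels_by_score, k=10):
    y = np.asarray(labels_by_score)[:k]
    if y.sum() == 0:
        return 0.0
    # Precision at each position, averaged over the positions that hold a positive
    precision_at_i = np.cumsum(y) / np.arange(1, len(y) + 1)
    return float(precision_at_i[y == 1].mean())

def ndcg_at_k(labels_by_score, k=10):
    y = np.asarray(labels_by_score)[:k]
    if y.sum() == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, len(y) + 2))
    dcg = (y * discounts).sum()
    idcg = (np.sort(y)[::-1] * discounts).sum()
    return float(dcg / idcg)

def hr_at_k(labels_by_score, k=10):
    return 1.0 if np.asarray(labels_by_score)[:k].sum() > 0 else 0.0

labels = [0, 1, 0, 1, 0]  # positives sit at ranks 2 and 4
print(ap_at_k(labels), ndcg_at_k(labels), hr_at_k(labels))
print(ap_at_k(labels, k=1), hr_at_k(labels, k=1))
```

The positives at ranks 2 and 4 give precisions 1/2 and 2/4, so AP@10 is 0.5. NDCG@10 is (1/log2(3) + 1/log2(5)) / (1 + 1/log2(3)), roughly 0.65, and HR@10 is 1.0. With `k=1` the only retrieved item is a negative, so both AP@1 and HR@1 are 0.0.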





## Modeling

I selected Random Forest as a robust baseline, and LightGBM and XGBoost as gradient boosting models that scale well to large tabular datasets and can compensate for class imbalance through `scale_pos_weight`. This combination lets us benchmark a simple bagged baseline against boosted models on the same features and the same validation split.

In [None]:
# Compute class imbalance ratio for tree-based models
results = [] # container to store evaluation metrics for each model

# Count positives (label=1) and negatives (label=0) in the training set
pos = int(y_train.sum()); neg = int((y_train==0).sum())
spw = neg / max(1, pos)
print("scale_pos_weight:", spw)

scale_pos_weight: 670.8431187464603


### Model 1 — Random Forest (baseline)

In [None]:
# Initialize RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=400, max_depth=None, min_samples_leaf=2,
    class_weight="balanced", n_jobs=-1, random_state=42
)

# Train the model on the training subset
rf.fit(X_train, y_train)

# Predict probabilities for the validation set
probs_rf = rf.predict_proba(X_val)[:,1]

# Evaluate model performance on ranking + probability metrics and append results to the leaderboard list
results.append(evaluate_probs("RandomForest", probs_rf, y_val, val_df))


[RandomForest] PR-AUC=0.022005  LogLoss=0.028168  MAP@10=0.028708  NDCG@10=0.033864  HR@10=0.049992


### Model 2 — LightGBM

In [None]:

if HAS_LGB:
    # Initialize LightGBM
    lgbm = lgb.LGBMClassifier(
        objective="binary", learning_rate=0.05, n_estimators=2000,
        num_leaves=63, min_child_samples=200,
        subsample=0.9, colsample_bytree=0.9, reg_lambda=2.0,
        scale_pos_weight=spw, random_state=42, n_jobs=-1
    )

    # Train the model on the training subset
    lgbm.fit(X_train, y_train,
             eval_set=[(X_val, y_val)],
             eval_metric=["average_precision","binary_logloss"],
             callbacks=[lgb.early_stopping(200, verbose=False)])
    
    # Predict probabilities for the validation set
    probs_lgb = lgbm.predict_proba(X_val)[:,1]

    # Evaluate model performance on ranking + probability metrics and append results to the leaderboard list
    results.append(evaluate_probs("LightGBM", probs_lgb, y_val, val_df))
else:
    print("LightGBM not installed; skipping.")


[LightGBM] [Info] Number of positive: 5297, number of negative: 3553456
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.039320 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3500
[LightGBM] [Info] Number of data points in the train set: 3558753, number of used features: 18
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.001488 -> initscore=-6.508535
[LightGBM] [Info] Start training from score -6.508535
[LightGBM] PR-AUC=0.001385  LogLoss=0.014787  MAP@10=0.022670  NDCG@10=0.026775  HR@10=0.039892


### Model 3 — XGBoost

In [None]:

if HAS_XGB:
    # Initialize XGBoost
    xgb = XGBClassifier(
        objective="binary:logistic", eval_metric="logloss",
        learning_rate=0.05, n_estimators=1200, max_depth=8,
        subsample=0.9, colsample_bytree=0.9, reg_lambda=2.0,
        scale_pos_weight=spw, random_state=42, n_jobs=-1, tree_method="hist"
    )

    # Train the model on the training subset
    xgb.fit(X_train, y_train)

    # Predict probabilities for the validation set
    probs_xgb = xgb.predict_proba(X_val)[:,1]

    # Evaluate model performance on ranking + probability metrics and append results to the leaderboard list
    results.append(evaluate_probs("XGBoost", probs_xgb, y_val, val_df))
else:
    print("XGBoost not installed; skipping.")




[XGBoost] PR-AUC=0.018531  LogLoss=0.032062  MAP@10=0.027957  NDCG@10=0.034063  HR@10=0.053358


## Leaderboard & position-blended MAP@10

In [24]:

leader = pd.DataFrame(results).sort_values(["map10","pr_auc"], ascending=False).reset_index(drop=True)
leader


          model    pr_auc   logloss     map10    ndcg10      hr10
0  RandomForest  0.022005  0.028168  0.028708  0.033864  0.049992
1       XGBoost  0.018531  0.032062  0.027957  0.034063  0.053358
2      LightGBM  0.001385  0.014787  0.022670  0.026775  0.039892


Product rank in search results is already a strong predictor: items shown higher are more likely to be clicked or added.
To use this, we blend the model’s predicted probability `p` with a simple rank-based prior,
`blended = α·p + (1 − α)/(1 + rank)` with α = 0.7, and check whether MAP@10 improves.

In [25]:

# Optional: quick position blend for the top model in 'results'
if results:
    best = max(results, key=lambda r: (r["map10"], r["pr_auc"]))
    best_name = best["model"]
    if best_name == "RandomForest":
        probs = probs_rf
    elif best_name == "LightGBM" and HAS_LGB:
        probs = probs_lgb
    elif best_name == "XGBoost" and HAS_XGB:
        probs = probs_xgb
    else:
        probs = probs_rf

    pos_prior = 1.0 / (1.0 + val_df["rank"].values)
    alpha = 0.7
    blended = alpha*probs + (1-alpha)*pos_prior
    ve = pd.DataFrame({"search_query_id": val_df["search_query_id"].values,
                       "y_true": y_val.values, "y_score": blended})
    map10_blend = ve.groupby("search_query_id").apply(ap_at_k, 10).mean()
    print(f"Best model={best_name}  |  MAP@10 blended (α=0.7): {map10_blend:.6f}")


Best model=RandomForest  |  MAP@10 blended (α=0.7): 0.031544
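
The blend can be examined on a toy query to see how the two terms interact. In the sketch below, `position_blend` is a hypothetical wrapper around the same formula and the probabilities are made up. It illustrates that when model probabilities are small relative to the prior (which ranges up to 0.5 at rank 1), the prior can decide the final ordering at α = 0.7.

```python
import numpy as np
import pandas as pd

def position_blend(probs, rank, alpha=0.7):
    """alpha * model probability + (1 - alpha) * 1 / (1 + rank)."""
    pos_prior = 1.0 / (1.0 + np.asarray(rank, dtype=float))
    return alpha * np.asarray(probs, dtype=float) + (1 - alpha) * pos_prior

# One toy query: the model prefers the item shown at rank 2
toy = pd.DataFrame({"rank": [1, 2, 3], "prob": [0.002, 0.010, 0.004]})
toy["blended"] = position_blend(toy["prob"], toy["rank"])

print(toy.sort_values("blended", ascending=False))
```

Here the blended scores are about 0.151, 0.107, and 0.078 for ranks 1, 2, and 3, so the blended order follows the displayed rank even though the model scored the rank-2 item highest. Rescaling the model scores (for example to per-query ranks) before blending is one option to consider if the model should carry more weight.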


## Uplift

I calculated uplift to translate model performance into business impact: it shows how much more effective the model is at identifying add-to-bag events than random selection. I used Random Forest because it was the best-performing model in the evaluation, with the highest PR-AUC and MAP@10.

In [None]:
def uplift_table(y_true, y_score, n_bins=20):
    """
    Build an uplift table by sorting on y_score (desc), splitting into n_bins equal-sized
    groups (ventiles), and computing Class-1 rate and uplift vs the overall average.
    
    Returns:
        table: DataFrame with per-ventile metrics (1 = top scores, n_bins = lowest)
        summary: dict with overall_rate, uplift_top_5, uplift_top_10
    """
    df = pd.DataFrame({"y_true": np.asarray(y_true).astype(int),
                       "y_score": np.asarray(y_score).astype(float)}).copy()
    n = len(df)
    if n == 0:
        raise ValueError("Empty inputs.")

    # sort by score (highest first)
    df = df.sort_values("y_score", ascending=False).reset_index(drop=True)

    # assign ventiles by position (ensures exactly-equal sized bins when possible)
    # ventile 1 = top 5%, ventile 20 = bottom 5%
    idx = np.arange(1, n + 1)
    df["ventile"] = np.ceil(idx / (n / n_bins)).astype(int)
    df["ventile"] = df["ventile"].clip(1, n_bins)

    overall_rate = df["y_true"].mean()

    # per-ventile stats
    g = df.groupby("ventile", as_index=False).agg(
        rows=("y_true", "size"),
        positives=("y_true", "sum"),
    )
    g["class1_rate"] = g["positives"] / g["rows"]
    g["uplift_vs_overall"] = np.where(
        overall_rate > 0, g["class1_rate"] / overall_rate, np.nan
    )

    # make ventile 1 the top bin in the display
    g = g.sort_values("ventile", ascending=True).reset_index(drop=True)
    g["ventile_pct"] = 100.0 / n_bins  # each bin is equal sized by construction

    # cumulative (useful context)
    g["cum_rows"] = g["rows"].cumsum()
    g["cum_rows_pct"] = 100 * g["cum_rows"] / n
    g["cum_positives"] = g["positives"].cumsum()
    g["cum_class1_rate"] = g["cum_positives"] / g["cum_rows"]
    g["cum_uplift_vs_overall"] = np.where(
        overall_rate > 0, g["cum_class1_rate"] / overall_rate, np.nan
    )

    # summary uplifts
    top5 = g.loc[g["ventile"] == 1, "class1_rate"].mean() if not g.empty else np.nan
    top10 = g.loc[g["ventile"].isin([1, 2]), "positives"].sum() / g.loc[g["ventile"].isin([1, 2]), "rows"].sum()
    uplift_top_5 = (top5 / overall_rate) if overall_rate > 0 else np.nan
    uplift_top_10 = (top10 / overall_rate) if overall_rate > 0 else np.nan

    summary = {
        "overall_rate": overall_rate,
        "uplift_top_5pct": uplift_top_5,   # ventile 1
        "uplift_top_10pct": uplift_top_10, # ventiles 1–2
    }
    return g, summary

In [32]:
# Calculate uplift
table, summ = uplift_table(y_val, probs_rf, n_bins=20)

print("Overall add-to-bag rate:", f"{summ['overall_rate']:.6f}")
print("Uplift (Top 5% vs overall):", f"{summ['uplift_top_5pct']:.3f}x")
print("Uplift (Top 10% vs overall):", f"{summ['uplift_top_10pct']:.3f}x")

# Show the first few ventiles (top bins)
display_cols = ["ventile", "ventile_pct", "rows", "positives",
                "class1_rate", "uplift_vs_overall",
                "cum_rows_pct", "cum_class1_rate", "cum_uplift_vs_overall"]
print(table[display_cols].head(5))

Overall add-to-bag rate: 0.001502
Uplift (Top 5% vs overall): 3.957x
Uplift (Top 10% vs overall): 2.272x
   ventile  ventile_pct   rows  positives  class1_rate  uplift_vs_overall  \
0        1          5.0  33986        202     0.005944           3.956934   
1        2          5.0  33986         30     0.000883           0.587663   
2        3          5.0  33986         31     0.000912           0.607252   
3        4          5.0  33987         42     0.001236           0.822705   
4        5          5.0  33986         51     0.001501           0.999028   

   cum_rows_pct  cum_class1_rate  cum_uplift_vs_overall  
0      4.999963         0.005944               3.956934  
1      9.999926         0.003413               2.272299  
2     14.999890         0.002579               1.717283  
3     20.000000         0.002244               1.493634  
4     24.999963         0.002095               1.394713  


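The Random Forest model concentrates add-to-bag events in a small slice of top-scored items: the top 5% has an add-to-bag rate of 0.59%, nearly 4× the overall 0.15%. The lift does not extend further down the ranking, though. Ventiles 2–4 fall below the overall rate (0.59× to 0.82×), so the model is best suited to use cases that target only the very top of the score distribution, such as focused marketing or re-ranking the first few results.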
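
The headline uplift figures can be reproduced by hand from the first two rows of the table. The sketch below uses the printed counts and the printed, rounded overall rate, so it matches the reported values only to about two decimals.

```python
# Counts for ventiles 1 and 2, copied from the printed uplift table
rows = [33986, 33986]
positives = [202, 30]
overall_rate = 0.001502  # rounded value as printed above

top5_rate = positives[0] / rows[0]        # add-to-bag rate in the top 5%
top10_rate = sum(positives) / sum(rows)   # add-to-bag rate in the top 10%

uplift_top5 = top5_rate / overall_rate
uplift_top10 = top10_rate / overall_rate
print(f"top 5%: {uplift_top5:.2f}x   top 10%: {uplift_top10:.2f}x")
```

The top-10% figure is much lower than the top-5% figure because ventile 2 contributes only 30 positives against 202 in ventile 1.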

## Holdout predictions

Generate predictions for the holdout dataset.

In [26]:
X_hold = holdout_df[FEATS]
# Choose the best model from the earlier leaderboard; default to LightGBM if available
best_model_name = leader.iloc[0]["model"] if len(leader) else ("LightGBM" if HAS_LGB else "RandomForest")
print("Best selected for full training:", best_model_name)

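# Look up the fitted estimator that matches the selected name (fallback: rf).
# Note: models are not refit on Jan 1–7 here; the estimator fit on train_fit is reused.
fitted = {"RandomForest": rf}
if HAS_LGB:
    fitted["LightGBM"] = lgbm
if HAS_XGB:
    fitted["XGBoost"] = xgb
hold_probs = fitted.get(best_model_name, rf).predict_proba(X_hold)[:,1]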

# Save predictions.csv with required columns
submission = holdout.copy()
submission["add_to_bag_probability"] = np.clip(hold_probs, 1e-9, 1-1e-9)
submission = submission[[
    "search_query_time","user_id","search_query_id","product_id",
    "rank","price_discount_percentage","add_to_bag_probability"
]]
submission.to_csv("predictions.csv", index=False)
print("Wrote predictions.csv")


Best selected for full training: RandomForest
Wrote predictions.csv


## Conclusion

In this project, I built models to predict the probability that a product is added to the shopping bag after appearing in search results. Of the three models evaluated, Random Forest performed best on PR-AUC and MAP@10, while XGBoost was marginally ahead on NDCG@10 and HR@10. To translate performance into business value, I calculated uplift: the top 5% of predictions had an add-to-bag rate nearly 4× the overall average, which demonstrates targeting power at the top of the ranking. These results suggest the model can meaningfully prioritize products that users are more likely to add to their bag.

## Future work

If more time and resources were available, I would:

- **Enhance feature engineering**: include richer session features (e.g., number of products viewed in the same query, dwell time), user history (previous purchases, browsing patterns), and query semantics (NLP embeddings of search terms if available).

- **Improve model tuning**: perform systematic hyperparameter optimization (e.g., Optuna, Bayesian optimization) for Random Forest, LightGBM, and XGBoost.

- **Explore advanced models**: try CatBoost for better handling of categorical features, and neural network approaches for sequential/session modeling.

- **Calibrate probabilities**: apply Platt scaling or isotonic regression to improve probability calibration, making outputs more actionable.

- **Deploy and monitor**: integrate the model into a live pipeline, then monitor uplift stability across time, categories, and user segments.