## Phase 2 – Step 2.3 : Cold-Start Mapping & Fallbacks  
**Goal** Map anonymous / first-session users to one of the five K-means segments using only attributes available on the very first page view (Age, traffic source, country). The rules built in this step use Age and traffic source only.  


Imports & paths

In [1]:
import pandas as pd, numpy as np, yaml, joblib, pathlib
from collections import defaultdict

ROOT        = pathlib.Path("..")                 # repo root (one level up)
PARQUET_DIR = ROOT / "data" / "parquet"
CONFIG_DIR  = ROOT / "config"
CONFIG_DIR.mkdir(exist_ok=True)

# session-level features (already in RAM earlier, but reload for self-containment)
FEATS = pd.read_parquet(PARQUET_DIR / "features.parquet")
print("features  →", FEATS.shape)


features  → (1004534, 14)


1 Load event-level columns that carry Age & source, then attach segment

In [2]:
# ❶  Read only the columns we need from every events chunk
event_cols = ["user_pseudo_id", "Age", "source", "transaction_id"]
events = pd.read_parquet(
    list(PARQUET_DIR.glob("dataset1_final_part*.parquet")),
    columns=event_cols
).drop_duplicates()
print("events  →", events.shape)

# ❷  Bring in the 5-cluster label we built in Step 2.2
seg_map = pd.read_parquet(PARQUET_DIR / "segment_map.parquet")
events = events.merge(seg_map, on="user_pseudo_id", how="inner")
print("events+seg →", events.shape)


events  → (3036407, 4)
events+seg → (4958492, 5)
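The row count grows from 3,036,407 to 4,958,492 across the inner merge. An inner merge can only add rows when the join key repeats on the right-hand side, so `segment_map.parquet` appears to hold more than one row per `user_pseudo_id`. The sketch below shows one way to detect and guard against that fan-out. The frames `events_demo` and `seg_map_demo` and their values are made-up stand-ins, not the real data, and de-duplicating with `drop_duplicates` is an assumption that only makes sense if the repeated rows agree on the segment.

```python
import pandas as pd

# Toy stand-ins for `events` and `seg_map` (hypothetical values).
events_demo = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2"],
    "source": ["google", "direct", "google"],
})
seg_map_demo = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u2"],   # u1 is listed twice
    "segment": [1, 1, 3],
})

# How many right-hand rows repeat an already-seen key?
dup_users = int(seg_map_demo["user_pseudo_id"].duplicated().sum())

# validate="many_to_one" makes pandas raise if the right key is not unique.
try:
    events_demo.merge(seg_map_demo, on="user_pseudo_id",
                      how="inner", validate="many_to_one")
    fanout_detected = False
except pd.errors.MergeError:
    fanout_detected = True

# Keep the first row per user (assumes duplicates carry the same segment).
seg_map_unique = seg_map_demo.drop_duplicates(subset="user_pseudo_id")
merged_demo = events_demo.merge(seg_map_unique, on="user_pseudo_id",
                                how="inner", validate="many_to_one")

print(dup_users, fanout_detected, len(merged_demo))
```

With a unique right-hand key the merged frame can never have more rows than `events`, which keeps the majority vote in the next step from over-weighting users that appear several times in the segment map.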


2 Majority-vote rules (Age × source → segment)

In [3]:
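# For every (Age, source) pair keep the most frequent segment.
# value_counts() sorts by count, so idxmax() returns the modal label.
rule_tbl = (
    events.groupby(["Age", "source"])["segment"]
          .agg(lambda x: x.value_counts().idxmax())
          .reset_index()
)
print("distinct rules:", len(rule_tbl))

#  nested dict  →  YAML   (Age → source → segment)
nested = defaultdict(dict)
for _, r in rule_tbl.iterrows():
    nested[r["Age"]][r["source"]] = int(r["segment"])

# Context manager so the file is flushed and closed after writing
with open(CONFIG_DIR / "segment_fallback.yaml", "w") as fh:
    yaml.safe_dump(dict(nested), fh)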

print("✅  segment_fallback.yaml written →", (CONFIG_DIR / "segment_fallback.yaml").relative_to(ROOT))


distinct rules: 3174
✅  segment_fallback.yaml written → config\segment_fallback.yaml


3 Map Transaction_ID to segment (for popular items)

In [4]:
# ❶  Build lookup transaction_id  →  segment
tx2seg = events[["transaction_id", "segment"]].dropna().drop_duplicates()
tx2seg["transaction_id"] = tx2seg["transaction_id"].astype("string")

print("tx2seg  →", tx2seg.shape)

# ❷  Load purchase table and attach segment
purch = pd.read_parquet(PARQUET_DIR / "dataset2_final_part000.parquet")
purch["Transaction_ID"] = purch["Transaction_ID"].astype("string")

purch_seg = purch.merge(
    tx2seg, left_on="Transaction_ID", right_on="transaction_id", how="left"
)
print("purch_seg →", purch_seg.shape, "|  segment nulls:", purch_seg["segment"].isna().sum())


tx2seg  → (18262, 2)
purch_seg → (27500, 10) |  segment nulls: 150


4 Top-30 popular items per segment

In [5]:
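TOP_N = 30   # NOTE: not applied below; the file stores the full ranking per segment

# Count purchase rows per (segment, ItemID) and rank within each segment.
# groupby drops rows whose segment is null (the unmatched purchases above).
popular_tbl = (
    purch_seg.groupby(["segment", "ItemID"])
             .size().reset_index(name="qty")   # qty = number of purchase rows
             .sort_values(["segment", "qty"], ascending=[True, False])
)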

popular_tbl.to_parquet(PARQUET_DIR / "popular_items.parquet", index=False)
print("✅  popular_items.parquet written:", popular_tbl.shape)


✅  popular_items.parquet written: (1226, 3)
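The written table has 1,226 rows, so it seems to hold every ranked (segment, ItemID) pair rather than 30 rows per segment. A consumer that wants the Top-N list can trim the ranking on read. The following is a minimal sketch on made-up data; `ranked`, `TOP_N_DEMO` and the item IDs are hypothetical, and the real table would come from `popular_items.parquet`.

```python
import pandas as pd

TOP_N_DEMO = 2  # small value for the toy data; the notebook sets TOP_N = 30

# Toy ranking with the same columns and sort order as popular_tbl.
ranked = pd.DataFrame({
    "segment": [0, 0, 0, 1, 1],
    "ItemID":  ["a", "b", "c", "a", "d"],
    "qty":     [5, 9, 1, 4, 7],
}).sort_values(["segment", "qty"], ascending=[True, False])

# head(n) on a groupby keeps the first n rows of each group in their
# existing order, i.e. the n best-selling items per segment.
top_items = ranked.groupby("segment").head(TOP_N_DEMO).reset_index(drop=True)

print(top_items)
```

Trimming at read time leaves the full ranking on disk, so a different N can be served later without recomputing the counts.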


5 (OPTIONAL) tiny decision tree for explanation

In [10]:
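# scikit-learn pieces used in this cell (not imported in In [1])
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

seg_map = pd.read_parquet(PARQUET_DIR / "segment_map.parquet")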
FEATS_seg = FEATS.merge(seg_map, on="user_pseudo_id", how="left")

X = FEATS_seg[["recency_days", "frequency", "monetary_value"]]
y = FEATS_seg["segment"]

Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=0.2, random_state=42)
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(Xtr, ytr)

print(export_text(tree, feature_names=list(X.columns)))

joblib.dump(tree, ROOT / "src" / "routing_tree.pkl")
print("✅  routing_tree.pkl saved")



|--- frequency <= 625.50
|   |--- recency_days <= 117.50
|   |   |--- recency_days <= 100.50
|   |   |   |--- class: 1
|   |   |--- recency_days >  100.50
|   |   |   |--- class: 1
|   |--- recency_days >  117.50
|   |   |--- recency_days <= 185.50
|   |   |   |--- class: 4
|   |   |--- recency_days >  185.50
|   |   |   |--- class: 3
|--- frequency >  625.50
|   |--- class: 2

✅  routing_tree.pkl saved
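For explanation purposes, the printed tree can be read as a handful of if/else rules. The function below is a hand transcription of the thresholds shown in the output above, not something generated by scikit-learn; the name `route_by_tree` is hypothetical. Note that the printed tree never splits on `monetary_value`, and only class labels 1, 2, 3 and 4 appear in its leaves.

```python
def route_by_tree(recency_days: float, frequency: float) -> int:
    """Return the class the printed depth-3 tree would assign."""
    if frequency > 625.5:
        return 2
    # frequency <= 625.5 from here on
    if recency_days <= 117.5:
        # Both sub-leaves (split at 100.5) predict class 1.
        return 1
    if recency_days <= 185.5:
        return 4
    return 3


print(route_by_tree(recency_days=30, frequency=12))
print(route_by_tree(recency_days=150, frequency=12))
print(route_by_tree(recency_days=300, frequency=12))
print(route_by_tree(recency_days=30, frequency=900))
```

The thresholds depend on the fitted tree, so this transcription would need to be redone whenever `routing_tree.pkl` is retrained.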


6 Markdown – cold-start flow (add as a Markdown cell)

### Cold-Start Logic

1. First page-view provides **Age** and **traffic source**.  
2. `segment_fallback.yaml` maps (Age, source) → one of 5 segments.  
   If no match, fall back to the global modal segment, computed as `FEATS_seg["segment"].mode()[0]`.  
3. Recommend Top-30 items from `popular_items.parquet` for that segment until the first session ends.  
4. After session close, user is re-clustered with the full K-means model.
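
Steps 1 and 2 of the flow above amount to a two-level dictionary lookup with a default. The sketch below assumes the rules have already been loaded into a nested dict shaped like `segment_fallback.yaml` (Age → source → segment). The function name `cold_start_segment`, the example Age and source values, and the default segment `1` are all hypothetical placeholders, not values taken from the data.

```python
from typing import Hashable, Mapping


def cold_start_segment(
    age: Hashable,
    source: Hashable,
    rules: Mapping[Hashable, Mapping[Hashable, int]],
    default_segment: int,
) -> int:
    """Look up (age, source); fall back to default_segment when unseen."""
    return rules.get(age, {}).get(source, default_segment)


# Toy rules in the same nested shape as the YAML file (made-up values).
demo_rules = {
    "18-24": {"google": 2, "direct": 4},
    "25-34": {"google": 3},
}
DEFAULT_SEGMENT = 1  # placeholder for the global modal segment

print(cold_start_segment("18-24", "google", demo_rules, DEFAULT_SEGMENT))
print(cold_start_segment("25-34", "direct", demo_rules, DEFAULT_SEGMENT))
print(cold_start_segment(None, "google", demo_rules, DEFAULT_SEGMENT))
```

Chaining `.get` calls means an unseen Age, an unseen source, or a missing value all fall through to the same default instead of raising a `KeyError`.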
