## Phase 2 – Step 2.3: Cold-Start Mapping & Fallbacks  
**Goal** Map anonymous / first-session users to one of the five K-means segments using only attributes available on the very first page view (region, device category, age, traffic source).  


In [1]:
# Cell 1: Enhanced Imports & Paths
import pandas as pd, numpy as np, yaml, joblib, pathlib
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

ROOT = pathlib.Path("..")
PARQUET_DIR = ROOT / "data" / "parquet"
CONFIG_DIR = ROOT / "config"
CONFIG_DIR.mkdir(exist_ok=True)

# Load new segmentation data
seg_map = pd.read_parquet(PARQUET_DIR / "segmented_users.parquet")
print("✅ Segmented users loaded:", seg_map.shape)


✅ Segmented users loaded: (836214, 8)


In [2]:
# Cell 2: Enhanced Event-Segment Mapping
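event_cols = ["user_pseudo_id", "Age", "source", "region", "category", "transaction_id"]
attr_cols = ["Age", "source", "region", "category"]  # first-page-view attributes
events = pd.read_parquet(
    list(PARQUET_DIR.glob("dataset1_final_part*.parquet")),
    columns=event_cols
).drop_duplicates()
# Fill only the attribute columns; transaction_id stays null for non-purchase
# events so the later .dropna() calls can actually filter them out.
events[attr_cols] = events[attr_cols].fillna("Unknown")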

# Merge with new segmentation
events = events.merge(
    seg_map[["user_pseudo_id", "segment"]], 
    on="user_pseudo_id", 
    how="inner"
)
print("Enhanced events+seg →", events.shape)


Enhanced events+seg → (4149078, 7)


In [3]:
# Cell 3: Multi-Attribute Rule Engine
print("Building enhanced rule table...")
rule_tbl = (
    events.groupby(["region", "category", "Age", "source"])["segment"]
    .agg(lambda x: x.mode()[0] if not x.empty else 4)
    .reset_index()
)
print("Distinct rules:", len(rule_tbl))

# Build nested fallback structure: region -> device category -> Age -> source -> segment
nested = defaultdict(dict)
for _, r in rule_tbl.iterrows():
    region_dict = nested.setdefault(r["region"], {})
    # `category` is used here as the device level (e.g. mobile / desktop)
    device_dict = region_dict.setdefault(r["category"], {})
    age_dict = device_dict.setdefault(r["Age"], {})
    age_dict[r["source"]] = int(r["segment"])

# Add global fallback
global_fallback = int(events["segment"].mode()[0])
nested["fallback"] = global_fallback

# Save enhanced config (context manager so the file handle is flushed and closed)
with open(CONFIG_DIR / "enhanced_fallback.yaml", "w") as f:
    yaml.safe_dump(dict(nested), f)
print("✅ Enhanced rule config saved")


Building enhanced rule table...
Distinct rules: 48005
✅ Enhanced rule config saved
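
The mapping saved above is nested as region → device category → Age → source, with a top-level `fallback` entry. The following is a minimal sketch of how a serving layer might read it, assuming exact-match lookup with a jump straight to the global fallback on any miss. The helper name `lookup_segment` and the toy values are hypothetical and not part of the notebook.

```python
from typing import Any, Mapping


def lookup_segment(rules: Mapping[str, Any], region: str, device: str,
                   age: str, source: str) -> int:
    """Walk region -> device -> age -> source; return the global fallback on any miss."""
    node: Any = rules
    for key in (region, device, age, source):
        if not isinstance(node, Mapping) or key not in node:
            return int(rules["fallback"])
        node = node[key]
    return int(node)


# Toy rules in the same shape as enhanced_fallback.yaml (values are made up).
toy_rules = {
    "California": {"mobile": {"25-34": {"google": 1}}},
    "fallback": 2,
}

hit = lookup_segment(toy_rules, "California", "mobile", "25-34", "google")
miss = lookup_segment(toy_rules, "Texas", "desktop", "Unknown", "direct")
print(hit, miss)
```

Because the notebook fills missing attributes with the string `"Unknown"`, a caller would presumably pass `"Unknown"` for any attribute it cannot observe rather than `None`.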


In [5]:
# Cell 4: Transaction-Segment Mapping with Null Handling
tx2seg = events[["transaction_id", "segment"]].dropna().drop_duplicates()
tx2seg["transaction_id"] = tx2seg["transaction_id"].astype("string")

purch = pd.read_parquet(PARQUET_DIR / "dataset2_final_part000.parquet")
purch["Transaction_ID"] = purch["Transaction_ID"].astype("string")

purch_seg = purch.merge(
    tx2seg, 
    left_on="Transaction_ID", 
    right_on="transaction_id", 
    how="left"
)
purch_seg["segment"] = purch_seg["segment"].fillna(4)  # At-Risk fallback
print("Fixed null transactions:", purch_seg["segment"].isnull().sum())


Fixed null transactions: 0


In [6]:
# Cell 4b: Add region to the transaction-segment mapping
# Get region from events data (one row per distinct transaction_id/region pair)
region_map = events[["transaction_id", "region"]].dropna().drop_duplicates()

# Merge region into purch_seg
purch_seg = purch_seg.merge(
    region_map,
    left_on="Transaction_ID",
    right_on="transaction_id",
    how="left"
)

# Fill null regions
purch_seg["region"] = purch_seg["region"].fillna("Unknown")
print("Region column added. Null regions:", purch_seg["region"].isnull().sum())


Region column added. Null regions: 0
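
Since `region_map` is de-duplicated on the pair (transaction_id, region), a transaction observed with two regions would appear twice and the left merge would duplicate purchase rows. The sketch below shows one way to guard against that, assuming a single region per transaction is wanted and that keeping the first one seen is an acceptable tie-break. The toy frames are illustrative, not data from the notebook.

```python
import pandas as pd

events_toy = pd.DataFrame({
    "transaction_id": ["t1", "t1", "t2"],
    "region": ["California", "Nevada", "Florida"],
})
purch_toy = pd.DataFrame({"Transaction_ID": ["t1", "t2", "t3"]})

# Keep a single region per transaction (first occurrence; an arbitrary tie-break).
region_one = events_toy.drop_duplicates(subset="transaction_id", keep="first")

merged = purch_toy.merge(
    region_one,
    left_on="Transaction_ID",
    right_on="transaction_id",
    how="left",
    validate="many_to_one",  # raises MergeError if right-hand keys are not unique
).drop(columns="transaction_id")  # drop the redundant key column
merged["region"] = merged["region"].fillna("Unknown")
print(merged)
```

Dropping the right-hand key after each merge also avoids the `transaction_id_x` / `transaction_id_y` suffixes that appear when the same key column is merged in twice.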


In [7]:
# VERIFY COLUMNS
print("purch_seg columns:", purch_seg.columns.tolist())
print("Region values:", purch_seg["region"].unique()[:5])


purch_seg columns: ['Date', 'Transaction_ID', 'Item_purchase_quantity', 'Item_revenue', 'ItemName', 'ItemBrand', 'ItemCategory', 'ItemID', 'transaction_id_x', 'segment', 'transaction_id_y', 'region']
Region values: ['Unknown' 'California' 'Florida' 'New York' 'Maryland']


In [8]:
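# Cell 5: Region- and Category-Aware Popular Items
print("Building geo-device aware popular items...")
# Group by segment, item, region, and item category (purch_seg has no device column).
# `qty` counts purchase rows, not the sum of Item_purchase_quantity.
popular_tbl = (
    purch_seg.groupby(["segment", "ItemID", "region", "ItemCategory"])
    .size().reset_index(name="qty")
    .sort_values(["segment", "qty"], ascending=[True, False])
)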

# Save with new structure
popular_tbl.to_parquet(PARQUET_DIR / "enhanced_popular_items.parquet", index=False)
print("✅ Geo-device popular items saved")


Building geo-device aware popular items...
✅ Geo-device popular items saved
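
To illustrate how the saved table could be queried at serving time, here is a small sketch that returns the top items for a segment and falls back to all regions when the requested region has no rows. The function name `top_items`, its parameters, and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy table with the same columns as enhanced_popular_items.parquet.
popular_toy = pd.DataFrame({
    "segment": [1, 1, 1, 2],
    "ItemID": ["A", "B", "C", "A"],
    "region": ["California", "California", "Texas", "California"],
    "ItemCategory": ["Shoes", "Hats", "Shoes", "Shoes"],
    "qty": [5, 9, 3, 7],
})


def top_items(tbl, segment, region=None, n=2):
    """Return up to n ItemIDs for a segment, optionally restricted to one region."""
    rows = tbl[tbl["segment"] == segment]
    # Only narrow to the region if it has at least one row; otherwise use all regions.
    if region is not None and (rows["region"] == region).any():
        rows = rows[rows["region"] == region]
    # Sum across regions/categories so an item listed several times is ranked once.
    ranked = rows.groupby("ItemID")["qty"].sum().sort_values(ascending=False)
    return ranked.head(n).index.tolist()


print(top_items(popular_toy, segment=1, region="California"))
print(top_items(popular_toy, segment=1, region="Nevada"))
```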


In [9]:
# Cell 6: Real-Time Trigger Integration
# Trigger entries sit at the top level of `nested`, keyed channel -> device category.
# Paid social triggers
nested["PaidSocial"] = {"mobile": 1, "desktop": 3}
# Email triggers
nested["Email"] = {"desktop": 2}

# Save final config
with open(CONFIG_DIR / "final_fallback.yaml", "w") as f:
    yaml.safe_dump(dict(nested), f)
print("✅ Real-time triggers integrated")


✅ Real-time triggers integrated
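
The trigger entries share the top level of the config with the region keys, so a reader of the config has to tell the two apart. A minimal sketch of applying a trigger as an override, assuming a trigger is any top-level entry whose value for the device is a plain integer; `apply_trigger` is a hypothetical helper, not something defined in the notebook.

```python
def apply_trigger(rules, source, device, default_segment):
    """Return the trigger segment for (source, device) if one exists, else the default."""
    channel = rules.get(source)
    # Region entries map device -> nested dict, so only integer leaves count as triggers.
    if isinstance(channel, dict) and isinstance(channel.get(device), int):
        return channel[device]
    return default_segment


toy_config = {
    "PaidSocial": {"mobile": 1, "desktop": 3},
    "Email": {"desktop": 2},
    "California": {"mobile": {"25-34": {"google": 1}}},
    "fallback": 2,
}

print(apply_trigger(toy_config, "PaidSocial", "mobile", default_segment=4))
print(apply_trigger(toy_config, "Email", "mobile", default_segment=4))
```

Keeping triggers under a dedicated key (for example a separate `triggers` mapping) would remove the ambiguity, at the cost of changing the config layout used here.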


### Enhanced Cold-Start Logic
1. **First Page View**: Uses `region`, `device`, `Age`, `source`  
2. **Multi-Stage Fallback**: region → device → age → source, ending at the global fallback segment  
3. **Real-Time Triggers**:  
   - Paid Social → Mobile: VIP items, Desktop: High-value items  
   - Email → Desktop: High-frequency items  
4. **Null Handling**:  
   - Unknown regions → Global trending items  
   - Unknown devices → Region-agnostic recommendations


In [10]:
# Cell 8: Validation Suite
# 1. Rule coverage
print("Rule coverage by region:")
print(rule_tbl["region"].value_counts())

# 2. Trigger test
print("PaidSocial+mobile → Segment", nested["PaidSocial"]["mobile"])

# 3. Fallback test
print("Fallback segment:", nested["fallback"])


Rule coverage by region:
region
California      1192
New York        1021
Texas            978
Florida          876
Illinois         795
                ... 
Sing Buri          1
Sivas              1
Jilin              1
Wakayama           1
North Maluku       1
Name: count, Length: 1593, dtype: int64
PaidSocial+mobile → Segment 1
Fallback segment: 2
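
The checks above confirm that the rules exist but not how often they match a user's actual segment. One possible extra check, sketched here on toy data with a reduced key set, is the share of events whose modal-segment rule agrees with the true segment. This is an in-sample figure and therefore optimistic; a held-out split would give a more honest estimate.

```python
import pandas as pd

events_toy = pd.DataFrame({
    "region": ["CA", "CA", "CA", "TX"],
    "source": ["google", "google", "google", "email"],
    "segment": [1, 1, 3, 2],
})
keys = ["region", "source"]  # the notebook groups on four keys; two are enough here

# Modal segment per key combination, as in the rule table.
rules = (
    events_toy.groupby(keys)["segment"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index(name="predicted")
)

# Score every event against the rule for its own key combination.
scored = events_toy.merge(rules, on=keys, how="left")
agreement = (scored["segment"] == scored["predicted"]).mean()
print(f"Rule agreement: {agreement:.2%}")
```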


graph LR
A[Load Segments] --> B[Build Rules]
B --> C[Map Transactions]
C --> D[Popular Items]
D --> E[Add Triggers]
E --> F[Validate]


### Fallback Logic Hierarchy
1. **Primary Signals**: region → device → age → source  
2. **Missing Region**: Use device → age → source  
3. **Missing Device**: Use age → source  
4. **All Missing**: Global fallback, Segment 2 (the most common segment in `events`) and its popular items  
5. **Real-Time Overrides**:  
   - PaidSocial: Mobile→VIP, Desktop→High-Value  
   - Email: Desktop→High-Frequency  
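
The hierarchy above can be sketched as a set of back-off tables, one per level, each holding the modal segment for progressively fewer attributes. This is an illustrative pure-Python version under stated assumptions: missing attributes are represented as `None` (the notebook itself uses the string `"Unknown"`), real-time overrides are left out, and `build_backoff` and `predict` are hypothetical names.

```python
from collections import Counter, defaultdict

# Most specific level first, matching hierarchy steps 1 to 3.
LEVELS = [
    ("region", "device", "age", "source"),
    ("device", "age", "source"),
    ("age", "source"),
]


def build_backoff(rows):
    """Return one {key_tuple: modal_segment} table per level plus the global mode."""
    counts = [defaultdict(Counter) for _ in LEVELS]
    overall = Counter()
    for row in rows:
        overall[row["segment"]] += 1
        for table, cols in zip(counts, LEVELS):
            table[tuple(row[c] for c in cols)][row["segment"]] += 1
    tables = [{k: c.most_common(1)[0][0] for k, c in t.items()} for t in counts]
    return tables, overall.most_common(1)[0][0]


def predict(tables, global_segment, user):
    """Try the most specific level first; skip levels with a missing attribute."""
    for table, cols in zip(tables, LEVELS):
        if any(user.get(c) is None for c in cols):
            continue
        key = tuple(user[c] for c in cols)
        if key in table:
            return table[key]
    return global_segment


rows = [
    {"region": "CA", "device": "mobile", "age": "25-34", "source": "google", "segment": 1},
    {"region": "CA", "device": "mobile", "age": "25-34", "source": "google", "segment": 1},
    {"region": "TX", "device": "mobile", "age": "25-34", "source": "google", "segment": 3},
    {"region": "TX", "device": "desktop", "age": "35-44", "source": "email", "segment": 2},
    {"region": "NY", "device": "desktop", "age": "35-44", "source": "email", "segment": 2},
    {"region": "NY", "device": "desktop", "age": "45-54", "source": "direct", "segment": 2},
]
tables, global_segment = build_backoff(rows)

no_region = {"region": None, "device": "mobile", "age": "25-34", "source": "google"}
print(predict(tables, global_segment, no_region))
```

Unlike the single nested dictionary, this layout can answer a query with a missing region directly, because the coarser levels are aggregated over all regions instead of being stored beneath a region key.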


In [11]:
# Cell 9: Save production config (run after Cell 6 so the triggers are included)
with open(CONFIG_DIR / "production_fallback.yaml", "w") as f:
    yaml.safe_dump(dict(nested), f)
print("✅ Production config saved")


✅ Production config saved
