1: Metadata & Overview

# AIgnition 2.0 - Feature Engineering Pipeline
## Notebook 02: Production-Scale Feature Generation for Personalization

- **Mission**: Transform clean session data into powerful personalization features
- **Input**: 1M+ sessions with transaction attribution from Notebook 01
- **Output**: Feature-rich dataset ready for ML recommendation engine
- **Performance**: Memory-optimized for 16GB RAM with <4GB peak usage

---

## 🎯 Feature Engineering Objectives

### **User Behavior Features**
- ✅ **RFM Analysis**: Recency, Frequency, Monetary segmentation
- ✅ **Session Patterns**: Duration, event density, conversion tracking
- ✅ **Journey Mapping**: Page path sequences and user flow analysis
- ✅ **Device Intelligence**: Cross-device behavior and preferences

### **Business Intelligence Features**  
- ✅ **Geographic Insights**: Regional purchasing patterns and preferences
- ✅ **Temporal Analysis**: Time-based behavior and seasonality patterns
- ✅ **Revenue Attribution**: Purchase behavior and customer lifetime value
- ✅ **Product Affinity**: Category preferences and cross-selling opportunities

### **Cold Start Strategy Features**
- ✅ **Anonymous User Signals**: Device, location, traffic source intelligence
- ✅ **Behavioral Proxies**: Similar user pattern matching
- ✅ **Demographic Inference**: Age, income, and preference estimation
- ✅ **Real-time Adaptability**: Fast feature computation for new users

**🏆 OUTCOME**: Rich feature matrix enabling hyper-personalized landing pages for both known and anonymous users


2: Imports & Configuration

In [2]:
# Core libraries optimized for memory efficiency
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Progress tracking and memory monitoring
from tqdm import tqdm
import psutil
import gc

# Feature engineering libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.cluster import KMeans

# Configuration paths (using outputs from Notebook 01)
PROCESSED_DIR = Path("../data/processed")
PARQUET_DIR = Path("../data/parquet") 
FEATURES_DIR = Path("../data/features")
FEATURES_DIR.mkdir(parents=True, exist_ok=True)

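# Memory monitoring helper.
# Note: psutil.virtual_memory().used is system-wide usage, not the footprint of this notebook process alone.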
def check_memory():
    memory_gb = psutil.virtual_memory().used / (1024**3)
    print(f"💾 Current memory usage: {memory_gb:.1f}GB")
    return memory_gb

print("✅ Environment ready for feature engineering")
print("📊 Target: Transform 1M+ sessions into personalization features")
check_memory()


✅ Environment ready for feature engineering
📊 Target: Transform 1M+ sessions into personalization features
💾 Current memory usage: 11.5GB


11.532142639160156

3: Load Processed Data from Notebook 01

In [3]:
# Load clean, processed data from Notebook 01
print("📂 Loading processed data from Notebook 01...")

# Load main merged dataset
events_merged = pd.read_parquet(PROCESSED_DIR / "events_merged_final.parquet")
print(f"✅ Loaded events: {len(events_merged):,} records")

# Load session summary for efficient processing
session_summary = pd.read_parquet(PROCESSED_DIR / "session_summary.parquet")
print(f"✅ Loaded sessions: {len(session_summary):,} sessions")

# Load transactions for product analysis
transactions = pd.read_parquet(PROCESSED_DIR / "transactions_optimized.parquet")
print(f"✅ Loaded transactions: {len(transactions):,} transaction records")

# Data overview
print(f"\n📊 Data Overview:")
print(f"  • Unique users: {events_merged['user_pseudo_id'].nunique():,}")
print(f"  • Unique sessions: {events_merged['session_id'].nunique():,}")
print(f"  • Date range: {events_merged['eventDate'].min()} to {events_merged['eventDate'].max()}")
print(f"  • Revenue coverage: ${events_merged['Item_revenue'].sum():,.2f}")

check_memory()


📂 Loading processed data from Notebook 01...
✅ Loaded events: 6,602,856 records
✅ Loaded sessions: 1,004,683 sessions
✅ Loaded transactions: 27,500 transaction records

📊 Data Overview:
  • Unique users: 744,675
  • Unique sessions: 1,004,683
  • Date range: 2024-06-11 00:00:00 to 2025-06-07 00:00:00
  • Revenue coverage: $3,659,103.12
💾 Current memory usage: 14.3GB


14.278186798095703

4: User-Level RFM Analysis

In [4]:
# Create comprehensive RFM features for user segmentation
print("👥 Generating User-Level RFM Features...")

# Calculate reference date for recency
ref_date = events_merged['eventTimestamp'].max()

# User-level aggregations with memory optimization
user_rfm = events_merged.groupby('user_pseudo_id').agg({
    'eventTimestamp': ['min', 'max', 'count'],
    'session_id': 'nunique',
    'Item_revenue': ['sum', 'count'],
    'event_name': lambda x: (x == 'purchase').sum(),
    'eventDate': lambda x: x.nunique()
}).round(2)

# Flatten column names
user_rfm.columns = ['first_seen', 'last_seen', 'total_events', 'total_sessions', 
                   'total_revenue', 'revenue_events', 'purchase_events', 'active_days']

# Calculate RFM metrics
user_rfm['recency_days'] = (ref_date - user_rfm['last_seen']).dt.days
user_rfm['frequency_score'] = user_rfm['total_sessions']
user_rfm['monetary_score'] = user_rfm['total_revenue']
user_rfm['avg_session_value'] = user_rfm['total_revenue'] / user_rfm['total_sessions']
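# Note: this is purchase events per session, so it can exceed 1 when a session contains several purchase rows
user_rfm['conversion_rate'] = user_rfm['purchase_events'] / user_rfm['total_sessions']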

# Handle missing values
user_rfm = user_rfm.fillna(0)

print(f"✅ Generated RFM features for {len(user_rfm):,} users")
print(f"📊 Average metrics:")
print(f"  • Sessions per user: {user_rfm['total_sessions'].mean():.1f}")
print(f"  • Revenue per user: ${user_rfm['total_revenue'].mean():.2f}")
print(f"  • Conversion rate: {user_rfm['conversion_rate'].mean():.3f}")

# Cleanup
#del user_rfm['first_seen'], user_rfm['last_seen']
#gc.collect()
#check_memory()


👥 Generating User-Level RFM Features...
✅ Generated RFM features for 744,675 users
📊 Average metrics:
  • Sessions per user: 1.3
  • Revenue per user: $4.91
  • Conversion rate: 0.021
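
The aggregation in this cell can be tried on a toy frame. The sketch below is an illustration under assumptions, not the notebook's code: the three-user `events` frame is invented, the result is named `toy_rfm`, and it uses pandas named aggregation instead of the dict-plus-flatten step, while keeping the notebook's column names.

```python
import pandas as pd

# Hypothetical toy events; column names follow the notebook, values are invented.
events = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u3", "u3"],
    "session_id": ["s1", "s1", "s2", "s3", "s4", "s4"],
    "event_name": ["page_view", "purchase", "page_view", "page_view", "purchase", "purchase"],
    "Item_revenue": [None, 40.0, None, None, 10.0, 15.0],
    "eventTimestamp": pd.to_datetime([
        "2025-06-01 10:00", "2025-06-01 10:05", "2025-06-05 09:00",
        "2025-05-20 12:00", "2025-06-07 08:00", "2025-06-07 08:02",
    ]),
})

# Recency is measured against the latest event in the data, as in the cell above.
ref_date = events["eventTimestamp"].max()

toy_rfm = events.groupby("user_pseudo_id").agg(
    last_seen=("eventTimestamp", "max"),
    total_sessions=("session_id", "nunique"),
    total_revenue=("Item_revenue", "sum"),
    purchase_events=("event_name", lambda s: (s == "purchase").sum()),
)
toy_rfm["recency_days"] = (ref_date - toy_rfm["last_seen"]).dt.days
# Purchase events divided by sessions: a per-session count, not a bounded rate.
toy_rfm["purchases_per_session"] = toy_rfm["purchase_events"] / toy_rfm["total_sessions"]
print(toy_rfm)
```

In this toy data user `u3` has two purchase events in a single session, so the ratio comes out as 2.0, which shows why a value above 1 is possible for this metric.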


In [4]:
check_memory()

💾 Current memory usage: 12.0GB


12.012710571289062

In [5]:
# Drop the raw timestamp columns now that recency_days has been computed
del user_rfm['first_seen'], user_rfm['last_seen']
gc.collect()
check_memory()


💾 Current memory usage: 12.9GB


12.878948211669922

5: Geographic & Demographic Features

In [6]:
import numpy as np

print("🌍 Generating Geographic & Demographic Features (Chunked)...")

chunk_size = 100_000
user_ids = events_merged['user_pseudo_id'].unique()
geo_features_list = []

for i in range(0, len(user_ids), chunk_size):
    chunk_users = user_ids[i:i+chunk_size]
    chunk = events_merged[events_merged['user_pseudo_id'].isin(chunk_users)]
    features = chunk.groupby('user_pseudo_id').agg({
        'city': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'region': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'country': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'category': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',  # device category (renamed to dominant_device below)
        'source': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'gender': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'Age': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown',
        'income_group': lambda x: x.mode().iloc[0] if not x.mode().empty else 'Unknown'
    })
    geo_features_list.append(features)
    del chunk, features

geo_features = pd.concat(geo_features_list)
geo_features.columns = [
    'primary_city', 'primary_region', 'primary_country',
    'dominant_device', 'primary_source', 'primary_gender',
    'primary_age', 'primary_income'
]
del geo_features_list

# Device & source diversity (also chunked)
device_diversity_list = []
source_diversity_list = []

for i in range(0, len(user_ids), chunk_size):
    chunk_users = user_ids[i:i+chunk_size]
    chunk = events_merged[events_merged['user_pseudo_id'].isin(chunk_users)]
    device_div = chunk.groupby('user_pseudo_id')['category'].nunique()
    source_div = chunk.groupby('user_pseudo_id')['source'].nunique()
    device_diversity_list.append(device_div)
    source_diversity_list.append(source_div)
    del chunk, device_div, source_div

device_diversity = pd.concat(device_diversity_list).rename('device_diversity')
source_diversity = pd.concat(source_diversity_list).rename('source_diversity')

geo_features = geo_features.join([device_diversity, source_diversity])

print(f"✅ Generated geographic features for {len(geo_features):,} users")
print(f"📊 Top regions: {geo_features['primary_region'].value_counts().head(5).to_dict()}")
print(f"📊 Device types: {geo_features['dominant_device'].value_counts().to_dict()}")


🌍 Generating Geographic & Demographic Features (Chunked)...
✅ Generated geographic features for 744,675 users
📊 Top regions: {'New York': 122564, 'California': 103690, 'Unknown': 75329, 'Texas': 45791, 'Florida': 41180}
📊 Device types: {'desktop': 394263, 'mobile': 338012, 'tablet': 12354, 'smart tv': 46}
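
The repeated "most frequent value, else a default" lambdas can be factored into one helper. The following is a minimal sketch on invented data, assuming a hypothetical helper name `top_value`; it is not the notebook's implementation, and it omits the chunking used above.

```python
import pandas as pd

def top_value(values: pd.Series, default: str = "Unknown"):
    """Most frequent non-null value, or `default` when every value is missing."""
    modes = values.mode(dropna=True)
    return modes.iloc[0] if not modes.empty else default

# Hypothetical toy events; `category` holds the device category, as in the notebook.
toy = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "region": ["Texas", "Texas", "Florida", None, None, "Ohio"],
    "category": ["mobile", "desktop", "mobile", "desktop", "desktop", "tablet"],
})

primary = toy.groupby("user_pseudo_id").agg(
    primary_region=("region", top_value),
    dominant_device=("category", top_value),
)
# Number of distinct device categories seen per user.
device_diversity = toy.groupby("user_pseudo_id")["category"].nunique().rename("device_diversity")
primary = primary.join(device_diversity)
print(primary)
```

When several values tie, `Series.mode` returns them sorted, so the helper picks the first in sort order rather than the most recent one.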


6: Session-Level Behavioral Features

In [None]:
# Generate session-level behavioral patterns
print("🔄 Generating Session-Level Behavioral Features (Chunked)...")

chunk_size = 200_000
session_ids = session_summary['session_id'].unique()  # session ids to process in chunks
session_features_list = []

for i in range(0, len(session_ids), chunk_size):
    chunk_sessions = session_ids[i:i+chunk_size]
    chunk = session_summary[session_summary['session_id'].isin(chunk_sessions)].copy()

    
    # Add session timing features
    chunk['session_hour'] = pd.to_datetime(chunk['session_start']).dt.hour
    chunk['session_dow'] = pd.to_datetime(chunk['session_start']).dt.dayofweek
    chunk['is_weekend'] = chunk['session_dow'].isin([5, 6])
    chunk['duration_minutes'] = (
        pd.to_datetime(chunk['session_end']) - pd.to_datetime(chunk['session_start'])
    ).dt.total_seconds() / 60

    # Session engagement metrics
    chunk['events_per_minute'] = chunk['event_count'] / (chunk['duration_minutes'] + 1)
    chunk['has_purchase'] = chunk['revenue'] > 0
    chunk['session_value'] = chunk['revenue'].fillna(0)
    # Time-of-day bucket; right=False gives [0,6), [6,12), [12,18), [18,24) so hour 0 is labelled 'night'
    chunk['time_of_day'] = pd.cut(chunk['session_hour'], bins=[0, 6, 12, 18, 24], right=False,
                                  labels=['night', 'morning', 'afternoon', 'evening'])
    session_features_list.append(chunk)
    del chunk

session_features = pd.concat(session_features_list)
del session_features_list

print(f"✅ Generated session features for {len(session_features):,} sessions")
print(f"📊 Session patterns:")
print(f"📊 Avg duration: {session_features['duration_minutes'].mean():.1f} min")
print(f"  • Avg events per minute: {session_features['events_per_minute'].mean():.2f}")
print(f"📊 Purchase sessions: {session_features['has_purchase'].sum():,}")


🔄 Generating Session-Level Behavioral Features (Chunked)...
✅ Generated session features for 1,004,683 sessions
📊 Session patterns:
📊 Avg duration: 2.1 min
  • Avg events per minute: 2.92
📊 Purchase sessions: 18,076
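
How `pd.cut` treats the bin edges decides whether sessions starting at hour 0 get a `time_of_day` label. The sketch below uses an invented series of hours to compare the default `right=True` with `right=False` on the same bins and labels as the cell above.

```python
import pandas as pd

hours = pd.Series([0, 5, 6, 12, 18, 23], name="session_hour")
bins = [0, 6, 12, 18, 24]
labels = ["night", "morning", "afternoon", "evening"]

# Default right=True: intervals are (0, 6], (6, 12], (12, 18], (18, 24]; hour 0 gets no label.
right_closed = pd.cut(hours, bins=bins, labels=labels)

# right=False: intervals are [0, 6), [6, 12), [12, 18), [18, 24); every hour 0-23 gets a label.
left_closed = pd.cut(hours, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"hour": hours, "right=True": right_closed, "right=False": left_closed}))
```

With the default setting, hour 0 becomes NaN and hour 6 still counts as "night"; with `right=False`, hour 0 is "night" and hour 6 is "morning".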


7: User Journey & Page Path Analysis

In [8]:
# Analyze user journey patterns for personalization
print("🗺️ Generating User Journey Features (Chunked)...")

chunk_size = 100_000
user_ids = events_merged['user_pseudo_id'].unique()
page_pref_list = []
journey_features_list = []

for i in range(0, len(user_ids), chunk_size):
    chunk_users = user_ids[i:i+chunk_size]
    chunk = events_merged[events_merged['user_pseudo_id'].isin(chunk_users)].copy()
    page_paths = chunk[chunk['page_path'].notna()].copy()  # Page path analysis (memory efficient approach)

    # Extract page types from paths
    page_paths['page_category'] = page_paths['page_path'].str.extract(r'/(\w+)/', expand=False)
    page_paths['page_category'] = page_paths['page_category'].fillna('homepage')

    # User-level page preferences
    page_pref = page_paths.groupby('user_pseudo_id').agg({
        'page_category': lambda x: x.mode().iloc[0] if not x.mode().empty else 'homepage',
        'page_path': 'nunique'
    }).rename(columns={'page_category': 'preferred_page_type', 'page_path': 'page_diversity'})
    page_pref_list.append(page_pref)

    # Journey depth analysis
    journey = chunk.groupby(['user_pseudo_id', 'session_id']).agg({
        'event_name': lambda x: ' → '.join(x.unique()[:5]),
        'page_type': lambda x: x.mode().iloc[0] if not x.mode().empty else 'unknown'
    }).reset_index()
    journey_features_list.append(journey)
    del chunk, page_paths, page_pref, journey

page_preferences = pd.concat(page_pref_list)
journey_features = pd.concat(journey_features_list)

# User journey patterns
user_journeys = journey_features.groupby('user_pseudo_id').agg({
    'event_name': lambda x: len(set(' → '.join(x).split(' → '))),
    'page_type': lambda x: x.mode().iloc[0] if not x.mode().empty else 'unknown'
}).rename(columns={'event_name': 'journey_complexity', 'page_type': 'preferred_entry_page'})

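# Merge journey features; fill missing labels with 'unknown' and missing counts with 0
journey_final = page_preferences.join(user_journeys, how='outer')
journey_final = journey_final.fillna({'preferred_page_type': 'unknown', 'preferred_entry_page': 'unknown',
                                      'page_diversity': 0, 'journey_complexity': 0})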
del page_pref_list, journey_features_list, page_preferences, journey_features, user_journeys

print(f"✅ Generated journey features for {len(journey_final):,} users")
print(f"📊 Journey patterns:")
print(f"  • Avg page diversity: {journey_final['page_diversity'].mean():.1f}")
print(f"  • Top entry pages: {journey_final['preferred_entry_page'].value_counts().head(3).to_dict()}")



🗺️ Generating User Journey Features (Chunked)...
✅ Generated journey features for 744,675 users
📊 Avg page diversity: 5.4
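
The page-category rule (first path segment between two slashes, otherwise `homepage`) is easiest to check on a few sample paths. This is a small sketch with invented paths; it applies the same regular expression as the cell above with `expand=False` so the result is a Series.

```python
import pandas as pd

# Hypothetical page paths chosen to cover the edge cases of the pattern.
paths = pd.Series(["/products/shoes-123", "/cart", "/", "/blog/2025/post", None], name="page_path")

# The pattern needs a slash on both sides of the segment, so "/cart" does not match.
category = paths.str.extract(r"/(\w+)/", expand=False).fillna("homepage")
print(pd.DataFrame({"page_path": paths, "page_category": category}))
```

Single-segment paths without a trailing slash, such as `/cart`, fall through to `homepage` under this rule, which may or may not be the intended grouping.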


8: Product & Category Affinity Features

In [9]:
# Generate product affinity features for recommendation engine
print("🛍️ Generating Product Affinity Features (Chunked)...")

chunk_size = 100_000
user_ids = events_merged['user_pseudo_id'].unique()
product_features_list = []

for i in range(0, len(user_ids), chunk_size):
    chunk_users = user_ids[i:i+chunk_size]

    # Product interaction analysis (only for users with transactions)
    chunk = events_merged[(events_merged['user_pseudo_id'].isin(chunk_users)) & (events_merged['ItemName'].notna())].copy()
    if len(chunk) == 0:
        continue
    # User-product preferences
    features = chunk.groupby('user_pseudo_id').agg({
        'ItemCategory': lambda x: x.mode().iloc[0] if not x.mode().empty else 'unknown',
        'ItemBrand': lambda x: x.mode().iloc[0] if not x.mode().empty else 'unknown',
        'ItemName': 'nunique',
        'Item_revenue': ['mean', 'std'],
        'Item_purchase_quantity': 'sum'
    })

    # Flatten columns
    features.columns = ['preferred_category', 'preferred_brand', 'product_diversity',
                        'avg_item_price', 'price_variance', 'total_quantity']
    # Price sensitivity analysis
    features['price_sensitivity'] = pd.cut(features['avg_item_price'],
                                           bins=[0, 50, 100, 200, float('inf')],
                                           labels=['budget', 'mid', 'premium', 'luxury'])
    product_features_list.append(features)
    del chunk, features

if product_features_list:
    product_features = pd.concat(product_features_list)
else:
    product_features = pd.DataFrame(index=user_ids[:0])
    print("⚠️ No product transaction data found - creating empty product features")

#del product_features_list


print(f"✅ Generated product features for {len(product_features):,} purchasing users")
print(f"📊 Product preferences:")
print(f"  • Top categories: {product_features['preferred_category'].value_counts().head(3).to_dict()}")
print(f"  • Price segments: {product_features['price_sensitivity'].value_counts().to_dict()}")


🛍️ Generating Product Affinity Features (Chunked)...
✅ Generated product features for 17,131 purchasing users
📊 Product preferences:
  • Top categories: {'CATEGORY_1': 11143, 'CATEGORY_2': 5507, 'CATEGORY_3': 344}
  • Price segments: {'premium': 5004, 'mid': 4941, 'budget': 4497, 'luxury': 2679}


9: Feature Integration & User Segmentation

In [19]:
# Diagnostic: Check all data types after joins
print("🔍 DATA TYPES AFTER JOINS:")
print("=" * 50)

for col in master_features.columns:
    dtype = master_features[col].dtype
    if isinstance(dtype, pd.CategoricalDtype):  # is_categorical_dtype is deprecated in recent pandas
        print(f"CATEGORICAL: {col}")
        print(f"  Categories: {master_features[col].cat.categories}")
        print(f"  Has 'unknown': {'unknown' in master_features[col].cat.categories}")
        print(f"  NaN count: {master_features[col].isna().sum()}")
        print()
    elif dtype == 'object':
        print(f"OBJECT: {col} - NaN count: {master_features[col].isna().sum()}")

print(f"📊 Total categorical columns: {sum(isinstance(master_features[col].dtype, pd.CategoricalDtype) for col in master_features.columns)}")


🔍 DATA TYPES AFTER JOINS:
OBJECT: primary_city - NaN count: 0
OBJECT: primary_region - NaN count: 0
OBJECT: primary_country - NaN count: 0
OBJECT: dominant_device - NaN count: 0
OBJECT: primary_source - NaN count: 0
CATEGORICAL: primary_gender
  Categories: Index(['female', 'male'], dtype='object')
  Has 'unknown': False
  NaN count: 0

OBJECT: primary_age - NaN count: 0
OBJECT: primary_income - NaN count: 0
OBJECT: preferred_page_type - NaN count: 0
OBJECT: preferred_entry_page - NaN count: 0
CATEGORICAL: preferred_category
  Categories: Index(['CATEGORY_1', 'CATEGORY_2', 'CATEGORY_3', 'CATEGORY_4', 'CATEGORY_5'], dtype='object')
  Has 'unknown': False
  NaN count: 727544

CATEGORICAL: preferred_brand
  Categories: Index(['ITEM_BRAND1', 'ITEM_BRAND2'], dtype='object')
  Has 'unknown': False
  NaN count: 727544

CATEGORICAL: price_sensitivity
  Categories: Index(['budget', 'mid', 'premium', 'luxury'], dtype='object')
  Has 'unknown': False
  NaN count: 727554

📊 Total categorical colum

In [20]:
# Cell 9: Feature Integration & User Segmentation (BULLETPROOF FIX)
print("🔗 Integrating Features & Creating User Segments...")

# Merge all user-level features
master_features = user_rfm.copy()

# Add geographic features
master_features = master_features.join(geo_features, how='left')

# Add journey features  
master_features = master_features.join(journey_final, how='left')

# Add product features (only for purchasing users)
if len(product_features) > 0:
    master_features = master_features.join(product_features, how='left')

# 🔧 BULLETPROOF CATEGORICAL HANDLING
print("🔧 Converting all categorical columns to string type...")

# Convert ALL categorical columns to string to avoid category issues
for col in master_features.columns:
    if isinstance(master_features[col].dtype, pd.CategoricalDtype):
        print(f"  Converting {col} from categorical to string")
        master_features[col] = master_features[col].astype(str)

# Now handle string columns normally
categorical_cols = ['primary_city', 'primary_region', 'dominant_device', 'primary_source',
                   'primary_gender', 'primary_age', 'primary_income', 'preferred_page_type',
                   'preferred_entry_page', 'preferred_category', 'preferred_brand', 'price_sensitivity']

print("🔧 Filling NaN values in string columns...")
for col in categorical_cols:
    if col in master_features.columns:
        # Replace 'nan' string (from categorical conversion) with 'unknown'
        master_features[col] = master_features[col].replace('nan', 'unknown')
        master_features[col] = master_features[col].fillna('unknown')

# Fill numerical columns with 0
print("🔧 Filling NaN values in numerical columns...")
numerical_columns = master_features.select_dtypes(include=[np.number]).columns
master_features[numerical_columns] = master_features[numerical_columns].fillna(0)

# Verify no NaN values remain
remaining_nans = master_features.isnull().sum().sum()
print(f"✅ Remaining NaN values: {remaining_nans}")

if remaining_nans > 0:
    print("⚠️ Some NaN values remain - filling all remaining with appropriate defaults")
    # Emergency fallback - fill any remaining NaNs
    for col in master_features.columns:
        if master_features[col].isnull().sum() > 0:
            if master_features[col].dtype in ['object', 'string']:
                master_features[col] = master_features[col].fillna('unknown')
            else:
                master_features[col] = master_features[col].fillna(0)

# Create user segments using RFM clustering
print("📊 Creating user segments...")

# Prepare data for clustering (select key RFM features)
cluster_features = master_features[['recency_days', 'frequency_score', 'monetary_score']].copy()
cluster_features = cluster_features.fillna(0)

# Standardize features for clustering
scaler = StandardScaler()
cluster_scaled = scaler.fit_transform(cluster_features)

# K-means clustering for user segmentation
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
master_features['user_segment'] = kmeans.fit_predict(cluster_scaled)

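# Segment labels for business interpretation
# NOTE: KMeans cluster ids are arbitrary, so this fixed id -> label mapping is a placeholder;
# check each cluster's recency, frequency and revenue before relying on the label names.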
segment_labels = {0: 'Champions', 1: 'Loyal_Customers', 2: 'Potential_Loyalists', 
                 3: 'New_Customers', 4: 'At_Risk'}
master_features['segment_label'] = master_features['user_segment'].map(segment_labels)

print(f"✅ Integrated features for {len(master_features):,} users")
print(f"📊 Feature matrix: {master_features.shape}")
print(f"🎯 User segments: {master_features['segment_label'].value_counts().to_dict()}")

check_memory()


🔗 Integrating Features & Creating User Segments...
🔧 Converting all categorical columns to string type...
  Converting primary_gender from categorical to string
  Converting preferred_category from categorical to string
  Converting preferred_brand from categorical to string
  Converting price_sensitivity from categorical to string
🔧 Filling NaN values in string columns...
🔧 Filling NaN values in numerical columns...
✅ Remaining NaN values: 0
📊 Creating user segments...
✅ Integrated features for 744,675 users
📊 Feature matrix: (744675, 34)
🎯 User segments: {'Champions': 381054, 'Potential_Loyalists': 363562, 'At_Risk': 53, 'New_Customers': 5, 'Loyal_Customers': 1}
💾 Current memory usage: 13.0GB


13.044563293457031
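
K-means numbers its clusters arbitrarily, so a label such as `Champions` only means something if it is assigned from the clusters' statistics. One possible approach, sketched below on invented data, ranks clusters by mean `monetary_score` and hands out labels in that order; the three-label list and the choice of a single ranking metric are assumptions for illustration, not the notebook's method.

```python
import pandas as pd

# Hypothetical per-user frame: cluster ids as K-means might return them, plus RFM inputs.
users = pd.DataFrame({
    "user_segment":   [0, 0, 1, 1, 2, 2],
    "recency_days":   [300, 280, 5, 10, 90, 120],
    "monetary_score": [0.0, 5.0, 900.0, 1100.0, 60.0, 40.0],
})

# Rank clusters from highest to lowest mean spend, then assign labels in that order.
ordered_labels = ["Champions", "Potential_Loyalists", "At_Risk"]  # illustrative names
cluster_order = (
    users.groupby("user_segment")["monetary_score"].mean().sort_values(ascending=False).index
)
label_by_cluster = dict(zip(cluster_order, ordered_labels))
users["segment_label"] = users["user_segment"].map(label_by_cluster)
print(label_by_cluster)
```

A fuller version would probably combine recency and frequency with spend, but even this single-metric ranking keeps the label names tied to observed behavior rather than to cluster ids.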

In [14]:
# Check data types of all columns
print("🔍 Data Types Analysis:")
print(master_features.dtypes)


🔍 Data Types Analysis:
total_events               int64
total_sessions             int64
total_revenue            float64
revenue_events             int64
purchase_events            int64
active_days                int64
recency_days               int64
frequency_score            int64
monetary_score           float64
avg_session_value        float64
conversion_rate          float64
primary_city              object
primary_region            object
primary_country           object
dominant_device           object
primary_source            object
primary_gender          category
primary_age               object
primary_income            object
device_diversity           int64
source_diversity           int64
preferred_page_type       object
page_diversity             int64
journey_complexity         int64
preferred_entry_page      object
preferred_category      category
preferred_brand         category
product_diversity        float64
avg_item_price           float64
price_variance      

In [15]:
# Check which columns are categorical
print("🔍 Categorical Columns Analysis:")
categorical_columns = []
for col in master_features.columns:
    if isinstance(master_features[col].dtype, pd.CategoricalDtype):
        categorical_columns.append(col)
        print(f"{col}: {master_features[col].cat.categories}")


🔍 Categorical Columns Analysis:
primary_gender: Index(['female', 'male', 'unknown'], dtype='object')
preferred_category: Index(['CATEGORY_1', 'CATEGORY_2', 'CATEGORY_3', 'CATEGORY_4', 'CATEGORY_5',
       'unknown'],
      dtype='object')
preferred_brand: Index(['ITEM_BRAND1', 'ITEM_BRAND2', 'unknown'], dtype='object')
price_sensitivity: Index(['budget', 'mid', 'premium', 'luxury', 'unknown'], dtype='object')


In [17]:
# Check categorical columns that still have NaNs
print("🔍 Categorical Columns with NaNs:")
for col in master_features.columns:
    if isinstance(master_features[col].dtype, pd.CategoricalDtype):
        nan_count = master_features[col].isnull().sum()
        if nan_count > 0:
            print(f"{col}: {nan_count} NaNs, Categories: {master_features[col].cat.categories}")


🔍 Categorical Columns with NaNs:


10: Cold Start Feature Matrix

In [21]:
# Create specialized features for cold start (anonymous users)
print("🆕 Generating Cold Start Feature Matrix (Chunked)...")

chunk_size = 1_000_000
n_rows = len(events_merged)
cold_start_list = []

for i in range(0, n_rows, chunk_size):
    chunk = events_merged.iloc[i:i+chunk_size].copy()

    # Extract anonymous user signals from events
    anon = chunk.groupby(['category', 'region', 'Age', 'gender', 'source']).agg({
        'user_pseudo_id': 'nunique',
        'Item_revenue': 'mean',
        'event_name': lambda x: (x == 'purchase').sum(),
        'session_id': 'nunique'
    }).reset_index()
    # Calculate conversion rates by segment
    anon['segment_conversion_rate'] = (anon['event_name'] / anon['session_id']).fillna(0)
    anon['avg_revenue_per_user'] = anon['Item_revenue'].fillna(0)
    anon['user_count'] = anon['user_pseudo_id']
    # Create lookup table for cold start predictions
    anon['segment_key'] = (
        anon['category'].astype(str) + '_' +
        anon['region'].astype(str) + '_' +
        anon['Age'].astype(str) + '_' +
        anon['gender'].astype(str) + '_' +
        anon['source'].astype(str)
    )
    cold_start_list.append(anon)
    del chunk, anon

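anonymous_features = pd.concat(cold_start_list)
# Caveat: chunks are row slices, so a user whose events span several chunks is counted once per chunk.
# Summing user_count can therefore overstate unique users, and the rate/revenue means are unweighted across chunks.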
cold_start_lookup = anonymous_features.groupby('segment_key').agg({
    'segment_conversion_rate': 'mean',
    'avg_revenue_per_user': 'mean',
    'user_count': 'sum'
}).reset_index()
# Keep only meaningful segments (min 10 users)
cold_start_lookup = cold_start_lookup[cold_start_lookup['user_count'] >= 10].copy()


print(f"✅ Generated cold start matrix: {len(cold_start_lookup):,} segments")
print(f"📊 Coverage: {cold_start_lookup['user_count'].sum():,} users in segments")
print(f"🎯 Top converting segments:")
top_segments = cold_start_lookup.nlargest(3, 'segment_conversion_rate')[
    ['segment_key', 'segment_conversion_rate', 'user_count']
]
print(top_segments.to_string(index=False))


🆕 Generating Cold Start Feature Matrix (Chunked)...
✅ Generated cold start matrix: 7,910 segments
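
Distinct counts do not add up across row-sliced chunks, which matters for the `user_count` column of the lookup table. The sketch below, on invented data with a hypothetical `summarize` helper and only two segment keys, compares summing per-chunk `nunique` results with a single pass over all rows.

```python
import pandas as pd

# Hypothetical toy events that all fall into one segment (mobile, Texas).
events = pd.DataFrame({
    "user_pseudo_id": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "session_id":     ["s1", "s1", "s2", "s3", "s3", "s4"],
    "category":       ["mobile"] * 6,
    "region":         ["Texas"] * 6,
    "event_name":     ["page_view", "purchase", "page_view", "page_view", "page_view", "purchase"],
})
keys = ["category", "region"]

def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-segment distinct users, distinct sessions and purchase events."""
    return frame.groupby(keys, observed=True).agg(
        user_count=("user_pseudo_id", "nunique"),
        sessions=("session_id", "nunique"),
        purchases=("event_name", lambda s: (s == "purchase").sum()),
    )

# Row-sliced chunks of 2 rows each, then summed: users spanning chunks are counted repeatedly.
chunked = pd.concat([summarize(events.iloc[i:i + 2]) for i in range(0, len(events), 2)])
chunked_users = chunked.groupby(level=keys)["user_count"].sum()

# Single pass over all rows: each user and session is counted once per segment.
full = summarize(events)
full["segment_conversion_rate"] = full["purchases"] / full["sessions"]
print(chunked_users.iloc[0], full["user_count"].iloc[0])
```

If the full frame is too large for one pass, chunking by user id (as the earlier cells do) rather than by row position keeps each user inside a single chunk.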


11: Feature Validation & Quality Check

In [23]:
# Cell 11: Feature Validation & Quality Check (STREAMLINED)
print("🔍 Feature Validation & Quality Assessment...")

# Quick data quality metrics
print(f"📊 Master Features Quality Report:")
print(f"  • Total users: {len(master_features):,}")
print(f"  • Total features: {master_features.shape[1]}")
print(f"  • Memory usage: {master_features.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Feature completeness validation
print(f"\n📋 Feature Completeness Summary:")
feature_categories = {
    'RFM Features': ['recency_days', 'frequency_score', 'monetary_score', 'conversion_rate'],
    'Geographic Features': ['primary_region', 'primary_city', 'dominant_device'],
    'Journey Features': ['preferred_page_type', 'page_diversity', 'journey_complexity'],
    'Product Features': ['preferred_category', 'preferred_brand', 'price_sensitivity']
}

for category, features in feature_categories.items():
    available_features = [f for f in features if f in master_features.columns]
    print(f"  • {category}: {len(available_features)}/{len(features)} features available")

# Segment quality validation
print(f"\n🎯 User Segmentation Quality:")
segment_analysis = master_features.groupby('segment_label').agg({
    'total_sessions': 'mean',
    'total_revenue': 'mean',
    'conversion_rate': 'mean'
}).round(3)

print("Segment Performance Metrics:")
for segment, data in segment_analysis.iterrows():
    user_count = (master_features['segment_label'] == segment).sum()
    print(f"  • {segment}: {user_count:,} users | "
          f"Avg Sessions: {data['total_sessions']:.1f} | "
          f"Avg Revenue: ${data['total_revenue']:.2f} | "
          f"Conversion: {data['conversion_rate']:.3f}")

# Cold start coverage validation  
print(f"\n🆕 Cold Start Matrix Quality:")
print(f"  • Total segments: {len(cold_start_lookup):,}")
print(f"  • Users covered: {cold_start_lookup['user_count'].sum():,}")
print(f"  • Avg conversion rate: {cold_start_lookup['segment_conversion_rate'].mean():.4f}")

# Top performing segments for presentation
print(f"\n🏆 Top Converting Segments (for demo):")
top_segments = cold_start_lookup.nlargest(3, 'segment_conversion_rate')[
    ['segment_key', 'segment_conversion_rate', 'user_count']
]
for _, row in top_segments.iterrows():
    print(f"  • {row['segment_key']}: {row['segment_conversion_rate']:.4f} conversion | {row['user_count']} users")

print(f"\n✅ VALIDATION COMPLETE - All features ready for ML pipeline!")
check_memory()


🔍 Feature Validation & Quality Assessment...
📊 Master Features Quality Report:
  • Total users: 744,675
  • Total features: 34
  • Memory usage: 808.6 MB

📋 Feature Completeness Summary:
  • RFM Features: 4/4 features available
  • Geographic Features: 3/3 features available
  • Journey Features: 3/3 features available
  • Product Features: 3/3 features available

🎯 User Segmentation Quality:
Segment Performance Metrics:
  • At_Risk: 53 users | Avg Sessions: 8.7 | Avg Revenue: $5131.49 | Conversion: 1.102
  • Champions: 381,054 users | Avg Sessions: 1.3 | Avg Revenue: $4.78 | Conversion: 0.023
  • Loyal_Customers: 1 users | Avg Sessions: 32.0 | Avg Revenue: $53653.14 | Conversion: 1.219
  • New_Customers: 5 users | Avg Sessions: 975.2 | Avg Revenue: $0.00 | Conversion: 0.000
  • Potential_Loyalists: 363,562 users | Avg Sessions: 1.3 | Avg Revenue: $4.16 | Conversion: 0.019

🆕 Cold Start Matrix Quality:
  • Total segments: 7,910
  • Users covered: 2,222,828
  • Avg conversion rate: 0.03

8.494831085205078
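
The report above prints metrics but leaves interpretation to the reader. A few automated checks could flag suspicious values instead; the sketch below is an assumption-laden illustration in which `find_issues`, its thresholds and the toy frame are all hypothetical.

```python
import pandas as pd

def find_issues(features: pd.DataFrame) -> list:
    """Return descriptions of basic data problems; an empty list means none were found."""
    issues = []
    if not features.index.is_unique:
        issues.append("duplicate user ids in index")
    nan_cols = features.columns[features.isna().any()].tolist()
    if nan_cols:
        issues.append(f"NaN values in: {nan_cols}")
    if "conversion_rate" in features and (features["conversion_rate"] > 1).any():
        n = int((features["conversion_rate"] > 1).sum())
        issues.append(f"{n} users with conversion_rate > 1")
    if "segment_label" in features:
        sizes = features["segment_label"].value_counts()
        tiny = sizes[sizes < 2].index.tolist()  # minimum size of 2 is an arbitrary example threshold
        if tiny:
            issues.append(f"segments with fewer than 2 users: {tiny}")
    return issues

# Hypothetical toy feature matrix with one out-of-range rate and one single-user segment.
toy = pd.DataFrame(
    {"conversion_rate": [0.0, 0.5, 1.2], "segment_label": ["A", "A", "B"]},
    index=["u1", "u2", "u3"],
)
print(find_issues(toy))
```

Run against a real feature matrix, checks like these would surface single-user segments and rates above 1 before the files are saved.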

12: Save Optimized Feature Sets

In [24]:
# Cell 12: Save Optimized Feature Sets (PRODUCTION-READY)
print("💾 Saving Production-Ready Feature Sets...")

# Create optimized feature directory
FEATURES_DIR.mkdir(parents=True, exist_ok=True)

# 1. Master feature matrix (complete user profiles)
print("💾 Saving master feature matrix...")
master_features_optimized = master_features.copy()

# Optimize data types for storage efficiency
string_cols = master_features_optimized.select_dtypes(include=['object']).columns
for col in string_cols:
    master_features_optimized[col] = master_features_optimized[col].astype('category')

# Save with optimal compression
master_features_optimized.to_parquet(
    FEATURES_DIR / "master_features.parquet", 
    compression='snappy',
    index=True
)
print(f"✅ Master features: {master_features_optimized.shape} | "
      f"{(FEATURES_DIR / 'master_features.parquet').stat().st_size / 1024**2:.1f} MB")

# 2. Session-level features (real-time processing)
print("💾 Saving session features...")
session_features_final = session_features.select_dtypes(include=[np.number, 'bool']).copy()
session_features_final.to_parquet(
    FEATURES_DIR / "session_features.parquet", 
    compression='snappy'
)
print(f"✅ Session features: {session_features_final.shape} | "
      f"{(FEATURES_DIR / 'session_features.parquet').stat().st_size / 1024**2:.1f} MB")

# 3. Cold start lookup (anonymous user engine)
print("💾 Saving cold start lookup...")
cold_start_lookup.to_parquet(
    FEATURES_DIR / "cold_start_lookup.parquet", 
    compression='snappy'
)
print(f"✅ Cold start lookup: {cold_start_lookup.shape} | "
      f"{(FEATURES_DIR / 'cold_start_lookup.parquet').stat().st_size / 1024**2:.1f} MB")

# 4. User summary (lightweight API dataset)
print("💾 Saving user summary...")
api_features = ['total_sessions', 'total_revenue', 'segment_label', 
                'dominant_device', 'primary_region', 'conversion_rate',
                'preferred_category', 'price_sensitivity']
available_api_features = [f for f in api_features if f in master_features.columns]

user_summary = master_features[available_api_features].copy()
user_summary.to_parquet(
    FEATURES_DIR / "user_summary.parquet", 
    compression='snappy'
)
print(f"✅ User summary: {user_summary.shape} | "
      f"{(FEATURES_DIR / 'user_summary.parquet').stat().st_size / 1024**2:.1f} MB")

# 5. Segment profiles (business intelligence)
print("💾 Creating segment profiles...")

def most_common(values):
    """Most frequent value in a group, or 'unknown' if the group has no non-null values."""
    modes = values.mode()  # computed once instead of twice per group
    return modes.iloc[0] if not modes.empty else 'unknown'

segment_profiles = master_features.groupby('segment_label').agg({
    'total_sessions': ['mean', 'std'],
    'total_revenue': ['mean', 'std'],
    'conversion_rate': ['mean', 'std'],
    'recency_days': 'mean',
    'primary_region': most_common,
    'dominant_device': most_common
}).round(3)

# Flatten the (column, statistic) MultiIndex into names such as total_sessions_mean.
# Using a named function avoids '<lambda>' appearing in the flattened column names.
segment_profiles.columns = ['_'.join(col) for col in segment_profiles.columns]
segment_profiles.to_csv(FEATURES_DIR / "segment_profiles.csv")
print("✅ Segment profiles saved for business analysis")

# 6. Feature dictionary (documentation)
print("💾 Creating feature documentation...")
feature_dict = {
    'total_features': master_features.shape[1],
    'total_users': len(master_features),
    'user_segments': master_features['segment_label'].value_counts().to_dict(),
    'cold_start_segments': len(cold_start_lookup),
    'data_coverage': {
        'date_range': f"{events_merged['eventDate'].min()} to {events_merged['eventDate'].max()}",
        'total_revenue': f"${master_features['total_revenue'].sum():,.2f}",
        'conversion_rate': f"{master_features['conversion_rate'].mean():.4f}"
    }
}

import json
# default=str serializes non-JSON types (e.g. dates, NumPy scalars) as strings
with open(FEATURES_DIR / "feature_dictionary.json", 'w', encoding='utf-8') as f:
    json.dump(feature_dict, f, indent=2, default=str)

# File inventory and summary
print("\n📁 FEATURE ENGINEERING COMPLETE!")
print("=" * 60)
print("📊 Files Created:")
total_size = 0
for file in sorted(FEATURES_DIR.glob("*")):
    if not file.is_file():
        continue  # skip subdirectories; only files count toward storage
    size_mb = file.stat().st_size / (1024**2)
    total_size += size_mb
    print(f"  • {file.name}: {size_mb:.1f} MB")

print(f"\n🎯 PRODUCTION SUMMARY:")
print(f"  ✅ Total users processed: {len(master_features):,}")
print(f"  ✅ Features generated: {master_features.shape[1]}")
print(f"  ✅ User segments: {master_features['segment_label'].nunique()}")
print(f"  ✅ Cold start segments: {len(cold_start_lookup):,}")
print(f"  ✅ Total storage: {total_size:.1f} MB")
print(f"  ✅ Ready for Notebook 03: EDA & Segmentation")

# Final memory cleanup
del master_features_optimized, session_features_final, user_summary, segment_profiles
gc.collect()
check_memory()

print(f"\n🏆 NOTEBOOK 02 COMPLETE - FEATURE ENGINEERING SUCCESS!")
print(f"🚀 Ready for recommendation engine and Streamlit prototype!")


💾 Saving Production-Ready Feature Sets...
💾 Saving master feature matrix...
✅ Master features: (744675, 34) | 17.2 MB
💾 Saving session features...
✅ Session features: (1004683, 9) | 19.1 MB
💾 Saving cold start lookup...
✅ Cold start lookup: (7910, 4) | 0.2 MB
💾 Saving user summary...
✅ User summary: (744675, 8) | 10.2 MB
💾 Creating segment profiles...
✅ Segment profiles saved for business analysis
💾 Creating feature documentation...

📁 FEATURE ENGINEERING COMPLETE!
📊 Files Created:
  • cold_start_lookup.parquet: 0.2 MB
  • feature_dictionary.json: 0.0 MB
  • master_features.parquet: 17.2 MB
  • segment_profiles.csv: 0.0 MB
  • session_features.parquet: 19.1 MB
  • user_summary.parquet: 10.2 MB

🎯 PRODUCTION SUMMARY:
  ✅ Total users processed: 744,675
  ✅ Features generated: 34
  ✅ User segments: 5
  ✅ Cold start segments: 7,910
  ✅ Total storage: 46.6 MB
  ✅ Ready for Notebook 03: EDA & Segmentation
💾 Current memory usage: 9.1GB

🏆 NOTEBOOK 02 COMPLETE - FEATURE ENGINEERING SUCCESS!
🚀 Ready for recommendation engine and Streamlit prototype!

13: Feature Dictionary & Documentation

## 🗃️ Feature Dictionary & Documentation

### **Master Features (master_features.parquet)**
**User-level aggregated features for personalization engine**

#### **RFM Features**
- `recency_days`: Days since last user activity
- `frequency_score`: Total number of sessions per user  
- `monetary_score`: Total revenue attributed to user
- `total_events`: Total events across all sessions
- `conversion_rate`: Purchase events / total sessions
- `avg_session_value`: Average revenue per session
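
A minimal sketch of how these RFM-style columns could be derived from a session-level table. The input column names (`user_id`, `session_start`, `n_events`, `n_purchases`, `revenue`) and the reference date are assumptions for illustration; the notebook's own aggregation may differ.

```python
import pandas as pd

def build_rfm(sessions: pd.DataFrame, reference_date: pd.Timestamp) -> pd.DataFrame:
    """Aggregate a session table (hypothetical schema) into one RFM row per user."""
    rfm = sessions.groupby("user_id").agg(
        last_seen=("session_start", "max"),
        frequency_score=("session_start", "count"),  # sessions per user
        monetary_score=("revenue", "sum"),           # total revenue per user
        total_events=("n_events", "sum"),
        purchase_events=("n_purchases", "sum"),
    )
    rfm["recency_days"] = (reference_date - rfm["last_seen"]).dt.days
    rfm["conversion_rate"] = rfm["purchase_events"] / rfm["frequency_score"]
    rfm["avg_session_value"] = rfm["monetary_score"] / rfm["frequency_score"]
    return rfm.drop(columns=["last_seen", "purchase_events"])

# Toy data, not taken from the real dataset
sessions = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "session_start": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-03"]),
    "n_events": [4, 6, 3],
    "n_purchases": [0, 1, 0],
    "revenue": [0.0, 50.0, 0.0],
})
rfm = build_rfm(sessions, pd.Timestamp("2024-01-10"))
print(rfm)
```

Recency is measured against an explicit `reference_date` rather than the current time, so the result stays reproducible when the notebook is rerun.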

#### **Geographic & Demographic Features**
- `primary_region`: Most frequent geographic region
- `primary_city`: Most frequent city location
- `dominant_device`: Most used device type (mobile/desktop/tablet)
- `primary_source`: Most frequent traffic source
- `primary_gender`: User gender (if available)
- `primary_age`: User age group classification
- `primary_income`: User income segment
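
The `primary_*` and `dominant_*` columns are "most frequent value" aggregations. The sketch below shows one way to compute them, mirroring the mode-with-fallback pattern used for the segment profiles. The event table and its column names are illustrative assumptions.

```python
import pandas as pd

def most_common(values: pd.Series, default: str = "unknown") -> str:
    """Return the most frequent non-null value, or `default` if there is none."""
    modes = values.mode(dropna=True)
    # mode() returns ties in sorted order, so the first entry is deterministic
    return modes.iloc[0] if not modes.empty else default

# Toy data with a missing region for u3
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u3"],
    "region": ["CA", "CA", "NY", "TX", None],
    "device": ["mobile", "desktop", "mobile", "tablet", "mobile"],
})
primary = events.groupby("user_id").agg(
    primary_region=("region", most_common),
    dominant_device=("device", most_common),
)
print(primary)
```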

#### **Behavioral Features**
- `preferred_page_type`: Most visited page category
- `page_diversity`: Number of unique pages visited
- `journey_complexity`: Unique event types in user journey
- `device_diversity`: Number of different devices used
- `source_diversity`: Number of different traffic sources
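
The diversity features are distinct-value counts per user. A short sketch, assuming an event-level table with hypothetical columns `page`, `event_type`, `device`, and `source`:

```python
import pandas as pd

# Toy event log; column names are assumptions for illustration
events = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "page": ["/home", "/cart", "/home", "/home"],
    "event_type": ["view", "add_to_cart", "view", "view"],
    "device": ["mobile", "desktop", "mobile", "mobile"],
    "source": ["google", "google", "email", "direct"],
})

# nunique counts distinct non-null values within each user's events
diversity = events.groupby("user_id").agg(
    page_diversity=("page", "nunique"),
    journey_complexity=("event_type", "nunique"),
    device_diversity=("device", "nunique"),
    source_diversity=("source", "nunique"),
)
print(diversity)
```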

#### **Product Affinity Features** 
- `preferred_category`: Most purchased product category
- `preferred_brand`: Most purchased brand
- `product_diversity`: Number of unique products purchased
- `avg_item_price`: Average purchase price
- `price_sensitivity`: Budget/mid/premium/luxury classification
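
As an illustration of the `price_sensitivity` tiers, the sketch below buckets `avg_item_price` with fixed thresholds. The cut-off values (25, 75, 200) are placeholders chosen for the example, not the thresholds used in this notebook.

```python
import pandas as pd

TIERS = ["budget", "mid", "premium", "luxury"]
# Hypothetical price boundaries; tune these to the real price distribution
PRICE_BINS = [0, 25, 75, 200, float("inf")]

def classify_price_sensitivity(avg_item_price: pd.Series) -> pd.Series:
    """Assign each average price to a tier; bins are closed on the right."""
    return pd.cut(avg_item_price, bins=PRICE_BINS, labels=TIERS, include_lowest=True)

prices = pd.Series([10.0, 50.0, 150.0, 500.0, 25.0])
tiers = classify_price_sensitivity(prices)
print(tiers.tolist())
```

Fixed bins keep tier boundaries stable between runs. Quantile-based bins (`pd.qcut`) are an alternative when the tiers should adapt to the data.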

#### **Segmentation**
- `user_segment`: Numeric cluster ID (0-4)
- `segment_label`: Business-friendly segment name
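
One possible way to turn numeric cluster IDs into business-friendly labels is to rank clusters by mean revenue and assign names in that order. The label names below are invented for the example and are not the notebook's actual segment names.

```python
import pandas as pd

# Hypothetical names, ordered from lowest to highest mean revenue
SEGMENT_NAMES = ["dormant", "browsers", "occasional_buyers",
                 "loyal_customers", "high_value"]

def label_segments(users: pd.DataFrame) -> pd.Series:
    """Map each cluster ID to a name by ranking clusters on mean revenue."""
    mean_revenue = users.groupby("user_segment")["total_revenue"].mean().sort_values()
    id_to_name = dict(zip(mean_revenue.index, SEGMENT_NAMES))
    return users["user_segment"].map(id_to_name)

# Toy data: cluster 0 has the highest mean revenue, cluster 1 the lowest
users = pd.DataFrame({
    "user_segment": [0, 1, 2, 3, 4, 0],
    "total_revenue": [500.0, 0.0, 20.0, 120.0, 5.0, 700.0],
})
users["segment_label"] = label_segments(users)
print(users)
```

Ranking on a business metric keeps the labels meaningful even though cluster IDs are arbitrary and can change between clustering runs.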

### **Session Features (session_features.parquet)**
**Session-level features for real-time personalization**

- `session_hour`: Hour of day session started
- `is_weekend`: Boolean weekend indicator
- `duration_minutes`: Session length in minutes
- `events_per_minute`: Engagement intensity metric
- `has_purchase`: Boolean purchase indicator
- `session_value`: Revenue generated in session
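- `time_of_day`: Categorical time period (only numeric and boolean columns are written to `session_features.parquet`, so this column is present in the saved file only if it is stored with a numeric dtype)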
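
A sketch of how these session-level features could be computed from start and end timestamps. The input columns (`session_start`, `session_end`, `n_events`), the one-minute floor on duration, and the four time-of-day buckets are assumptions made for the example.

```python
import pandas as pd

def add_session_features(sessions: pd.DataFrame) -> pd.DataFrame:
    """Derive time and engagement features for each session (hypothetical schema)."""
    out = sessions.copy()
    out["session_hour"] = out["session_start"].dt.hour
    out["is_weekend"] = out["session_start"].dt.dayofweek >= 5  # Sat=5, Sun=6
    duration = (out["session_end"] - out["session_start"]).dt.total_seconds() / 60
    out["duration_minutes"] = duration
    # Floor the denominator at one minute so zero-length sessions do not divide by zero
    out["events_per_minute"] = out["n_events"] / duration.clip(lower=1.0)
    out["time_of_day"] = pd.cut(
        out["session_hour"],
        bins=[0, 6, 12, 18, 24],
        labels=["night", "morning", "afternoon", "evening"],
        right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    )
    return out

sessions = pd.DataFrame({
    "session_start": pd.to_datetime(["2024-01-06 09:30", "2024-01-08 22:00"]),
    "session_end": pd.to_datetime(["2024-01-06 09:40", "2024-01-08 22:00"]),
    "n_events": [20, 1],
})
feat = add_session_features(sessions)
print(feat)
```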

### **Cold Start Lookup (cold_start_lookup.parquet)**
**Anonymous user prediction matrix**

- `segment_key`: Combined demographic+device+source identifier
- `segment_conversion_rate`: Historical conversion rate for segment
- `avg_revenue_per_user`: Expected revenue for segment
- `user_count`: Number of historical users in segment
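
To show how the lookup might be used at request time, the sketch below builds a `segment_key` for an anonymous visitor and falls back to user-weighted global averages when the segment is missing or has too few historical users. The key format (`demographic|device|source`), the `min_users` threshold, and both helper functions are hypothetical.

```python
import pandas as pd

def make_segment_key(demographic: str, device: str, source: str) -> str:
    """Join anonymous-user signals into one key (assumed format)."""
    return "|".join([demographic, device, source]).lower()

def lookup_cold_start(lookup: pd.DataFrame, key: str, min_users: int = 30) -> dict:
    """Return segment stats, or user-weighted global averages as a fallback."""
    match = lookup[lookup["segment_key"] == key]
    if not match.empty and match["user_count"].iloc[0] >= min_users:
        return {
            "segment_conversion_rate": float(match["segment_conversion_rate"].iloc[0]),
            "avg_revenue_per_user": float(match["avg_revenue_per_user"].iloc[0]),
            "source": "segment",
        }
    weights = lookup["user_count"]
    total = weights.sum()
    return {
        "segment_conversion_rate": float((lookup["segment_conversion_rate"] * weights).sum() / total),
        "avg_revenue_per_user": float((lookup["avg_revenue_per_user"] * weights).sum() / total),
        "source": "global",
    }

# Toy lookup table with the four documented columns
lookup = pd.DataFrame({
    "segment_key": ["25-34|mobile|google", "35-44|desktop|email"],
    "segment_conversion_rate": [0.04, 0.10],
    "avg_revenue_per_user": [12.0, 30.0],
    "user_count": [500, 20],
})
known = lookup_cold_start(lookup, make_segment_key("25-34", "Mobile", "Google"))
sparse = lookup_cold_start(lookup, make_segment_key("35-44", "Desktop", "Email"))
unseen = lookup_cold_start(lookup, make_segment_key("18-24", "Tablet", "Direct"))
print(known, sparse, unseen, sep="\n")
```

The minimum-count guard avoids trusting conversion rates estimated from a handful of users, which matters when the lookup contains thousands of small segments.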

**🎯 TOTAL FEATURES GENERATED: 34 user-level columns in the master matrix, plus session-level and cold start features, ready for the ML pipeline**
