# Generate Continuous Prediction Customer Data

This notebook generates synthetic data that mimics customers with an established transaction history.

**Continuous Scenario**: Customers with 3 to 18 months of account history (90 to 540 days since signup) and rich behavioral data.

We'll generate:
- 50,000 customers
- 500,000+ raw transactions
- Customer interactions and engagement events

We then derive RFM and other features from the raw data.

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import random

random.seed(42)
np.random.seed(42)

## Configuration

In [None]:
NUM_CUSTOMERS = 50000
OBSERVATION_DATE = datetime(2024, 6, 30)
MIN_HISTORY_DAYS = 90
MAX_HISTORY_DAYS = 540

## Generate Customer Profiles

In [None]:
customer_segments = ['high_value', 'medium_value', 'low_value', 'at_risk', 'churned']
segment_weights = [0.15, 0.35, 0.30, 0.12, 0.08]

# Lookup lists are defined once, outside the loop.
age_groups = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_group_weights = [0.10, 0.30, 0.27, 0.18, 0.10, 0.05]
regions = ['Northeast', 'Southeast', 'Midwest', 'Southwest', 'West']

customers = []

for customer_id in range(1, NUM_CUSTOMERS + 1):
    # History length is uniform between the configured minimum and maximum.
    history_days = random.randint(MIN_HISTORY_DAYS, MAX_HISTORY_DAYS)
    signup_date = OBSERVATION_DATE - timedelta(days=history_days)
    
    segment = np.random.choice(customer_segments, p=segment_weights)
    age_group = np.random.choice(age_groups, p=age_group_weights)
    region = np.random.choice(regions)  # regions are equally likely
    
    customers.append({
        'customer_id': customer_id,
        'signup_date': signup_date,
        'age_group': age_group,
        'region': region,
        'segment': segment,
        'history_days': history_days
    })

customers_df = pd.DataFrame(customers)
print(f"Generated {len(customers_df)} customer profiles")
customers_df.head()

## Generate Raw Transactions

**Why raw transactions**: In real-world scenarios, we don't have pre-computed features. We start with transactional data and must derive meaningful metrics.

In [None]:
transactions = []

product_categories = ['electronics', 'clothing', 'home_goods', 'beauty', 'sports', 'books', 'toys', 'grocery']

for _, customer in customers_df.iterrows():
    customer_id = customer['customer_id']
    signup_date = customer['signup_date']
    segment = customer['segment']
    
    if segment == 'high_value':
        num_transactions = int(np.random.uniform(20, 60))
        avg_amount = np.random.uniform(100, 300)
        purchase_frequency_days = 7
    elif segment == 'medium_value':
        num_transactions = int(np.random.uniform(8, 25))
        avg_amount = np.random.uniform(50, 120)
        purchase_frequency_days = 15
    elif segment == 'low_value':
        num_transactions = int(np.random.uniform(3, 10))
        avg_amount = np.random.uniform(20, 60)
        purchase_frequency_days = 30
    elif segment == 'at_risk':
        num_transactions = int(np.random.uniform(5, 15))
        avg_amount = np.random.uniform(40, 100)
        purchase_frequency_days = 45
    else:
        num_transactions = int(np.random.uniform(1, 5))
        avg_amount = np.random.uniform(25, 70)
        purchase_frequency_days = 60
    
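    # Start the purchase clock within a week of signup.
    current_date = signup_date + timedelta(days=random.randint(0, 7))
    
    for txn in range(num_transactions):
        # Exponential gaps between purchases, at least one day apart.
        days_since_last = int(np.random.exponential(purchase_frequency_days))
        current_date = current_date + timedelta(days=max(1, days_since_last))
        
        # Stop at the observation date; a customer can end up with fewer
        # purchases than planned, or none at all.
        if current_date > OBSERVATION_DATE:
            break
        
        # Gamma-distributed order value with mean avg_amount, floored at 5.
        amount = max(5, np.random.gamma(shape=2, scale=avg_amount/2))
        amount = round(amount, 2)
        
        category = np.random.choice(product_categories)
        # Poisson count shifted by one so every order has at least one item.
        quantity = int(np.random.poisson(lam=2) + 1)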
        
        transactions.append({
            'transaction_id': len(transactions) + 1,
            'customer_id': customer_id,
            'transaction_date': current_date,
            'amount': amount,
            'product_category': category,
            'quantity': quantity
        })

transactions_df = pd.DataFrame(transactions)
print(f"Generated {len(transactions_df)} transactions")
transactions_df.head(10)

## Generate Customer Interaction Events

In [None]:
interactions = []
event_types = ['website_visit', 'email_open', 'email_click', 'support_ticket', 'product_view', 'cart_add']

for _, customer in customers_df.iterrows():
    customer_id = customer['customer_id']
    signup_date = customer['signup_date']
    segment = customer['segment']
    
    if segment == 'high_value':
        num_interactions = int(np.random.uniform(100, 300))
    elif segment == 'medium_value':
        num_interactions = int(np.random.uniform(50, 120))
    elif segment == 'low_value':
        num_interactions = int(np.random.uniform(20, 60))
    elif segment == 'at_risk':
        num_interactions = int(np.random.uniform(15, 50))
    else:
        num_interactions = int(np.random.uniform(5, 20))
    
    for _ in range(num_interactions):
        # Offsets never exceed history_days, so every event falls on or
        # before OBSERVATION_DATE and needs no date filter.
        days_offset = random.randint(0, customer['history_days'])
        event_date = signup_date + timedelta(days=days_offset)
        
        event_type = np.random.choice(event_types, p=[0.35, 0.20, 0.10, 0.05, 0.20, 0.10])
        
        interactions.append({
            'interaction_id': len(interactions) + 1,
            'customer_id': customer_id,
            'event_date': event_date,
            'event_type': event_type
        })

interactions_df = pd.DataFrame(interactions)
print(f"Generated {len(interactions_df)} customer interactions")
interactions_df.head(10)

## Derive RFM Features from Raw Transactions

**RFM Metrics Explained**:
- **Recency**: Days since last purchase. Recent customers are more likely to purchase again.
- **Frequency**: Number of purchases. Frequent buyers show loyalty and habit.
- **Monetary**: Total or average spend. High spenders have higher lifetime value potential.

These three metrics are fundamental because they capture:
1. Current engagement (Recency)
2. Behavioral patterns (Frequency)
3. Economic value (Monetary)
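
As a small illustration of these definitions, the sketch below computes the three metrics for a hand-made transaction table. The `toy_txns` data and the `as_of` date are made up for the example and are not part of the generated dataset.

```python
import pandas as pd

# Hypothetical toy data: three purchases for customer 1, one for customer 2.
toy_txns = pd.DataFrame({
    'customer_id': [1, 1, 1, 2],
    'transaction_date': pd.to_datetime(
        ['2024-01-10', '2024-03-05', '2024-06-20', '2024-02-01']),
    'amount': [40.0, 60.0, 50.0, 25.0],
})
as_of = pd.Timestamp('2024-06-30')  # assumed observation date for the example

toy_rfm = toy_txns.groupby('customer_id').agg(
    last_purchase=('transaction_date', 'max'),
    frequency=('amount', 'count'),
    monetary_total=('amount', 'sum'),
).reset_index()

# Recency is measured back from the observation date, not from today.
toy_rfm['recency_days'] = (as_of - toy_rfm['last_purchase']).dt.days

print(toy_rfm[['customer_id', 'recency_days', 'frequency', 'monetary_total']])
```

Customer 1 comes out with a recency of 10 days, 3 purchases, and 150.0 in total spend; customer 2 has a recency of 150 days, 1 purchase, and 25.0 in total spend.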

In [None]:
rfm_features = transactions_df.groupby('customer_id').agg(
    recency_days=('transaction_date', lambda x: (OBSERVATION_DATE - x.max()).days),
    frequency=('transaction_id', 'count'),
    monetary_total=('amount', 'sum'),
    monetary_avg=('amount', 'mean'),
    first_purchase_date=('transaction_date', 'min'),
    last_purchase_date=('transaction_date', 'max')
).reset_index()

# Tenure is measured from the first purchase, not from signup_date.
rfm_features['customer_tenure_days'] = (OBSERVATION_DATE - rfm_features['first_purchase_date']).dt.days

print("RFM Features derived from raw transactions:")
rfm_features.head(10)

## Derive Purchase Pattern Features

**Why these matter**:
- **Inter-purchase time**: Identifies purchase rhythm and consistency
- **Product diversity**: Customers who buy across categories tend to have higher engagement
- **Average order value**: Indicates spending capacity per transaction
- **Trend indicators**: Growing vs declining purchase patterns predict future behavior
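
To make the first and last of these concrete, here is a minimal sketch for a single customer. The four purchases in `toy` are invented for illustration; the half-versus-half comparison is one simple way to express a trend, not the only one.

```python
import pandas as pd

# Hypothetical purchase history for one customer, already sorted by date.
toy = pd.DataFrame({
    'transaction_date': pd.to_datetime(
        ['2024-01-01', '2024-01-11', '2024-01-31', '2024-03-01']),
    'amount': [20.0, 40.0, 60.0, 60.0],
})

# Inter-purchase time: day gaps between consecutive purchases.
gaps = toy['transaction_date'].diff().dt.days.dropna()
avg_gap = gaps.mean()

# Trend indicator: relative change in average order value,
# second half of the history versus the first half.
mid = len(toy) // 2
first_half_avg = toy['amount'].iloc[:mid].mean()
second_half_avg = toy['amount'].iloc[mid:].mean()
spending_trend = (second_half_avg - first_half_avg) / first_half_avg

print(gaps.tolist(), avg_gap, spending_trend)
```

Here the gaps are 10, 20, and 30 days (average 20), and the average order value doubles from 30 to 60, giving a trend of 1.0.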

In [None]:
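purchase_patterns = []

# Sort and group once. Filtering transactions_df inside the loop would
# rescan every transaction for each of the customers.
sorted_txns = transactions_df.sort_values('transaction_date')
txns_by_customer = {cid: grp for cid, grp in sorted_txns.groupby('customer_id')}
empty_txns = transactions_df.iloc[0:0]  # used for customers with no purchases

for customer_id in customers_df['customer_id']:
    cust_txns = txns_by_customer.get(customer_id, empty_txns)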
    
    if len(cust_txns) > 1:
        dates = pd.to_datetime(cust_txns['transaction_date'])
        inter_purchase_times = dates.diff().dt.days.dropna()
        avg_inter_purchase_days = inter_purchase_times.mean() if len(inter_purchase_times) > 0 else None
        std_inter_purchase_days = inter_purchase_times.std() if len(inter_purchase_times) > 0 else None
    else:
        avg_inter_purchase_days = None
        std_inter_purchase_days = None
    
    unique_categories = cust_txns['product_category'].nunique()
    total_quantity = cust_txns['quantity'].sum()
    
    recent_30d_txns = cust_txns[cust_txns['transaction_date'] >= (OBSERVATION_DATE - timedelta(days=30))]
    recent_30d_amount = recent_30d_txns['amount'].sum()
    recent_30d_count = len(recent_30d_txns)
    
    recent_90d_txns = cust_txns[cust_txns['transaction_date'] >= (OBSERVATION_DATE - timedelta(days=90))]
    recent_90d_amount = recent_90d_txns['amount'].sum()
    recent_90d_count = len(recent_90d_txns)
    
    if len(cust_txns) >= 4:
        mid_point = len(cust_txns) // 2
        first_half_avg = cust_txns.iloc[:mid_point]['amount'].mean()
        second_half_avg = cust_txns.iloc[mid_point:]['amount'].mean()
        spending_trend = (second_half_avg - first_half_avg) / first_half_avg if first_half_avg > 0 else 0
    else:
        spending_trend = 0
    
    purchase_patterns.append({
        'customer_id': customer_id,
        'avg_inter_purchase_days': avg_inter_purchase_days,
        'std_inter_purchase_days': std_inter_purchase_days,
        'unique_categories_purchased': unique_categories,
        'total_items_purchased': total_quantity,
        'recent_30d_amount': recent_30d_amount,
        'recent_30d_count': recent_30d_count,
        'recent_90d_amount': recent_90d_amount,
        'recent_90d_count': recent_90d_count,
        'spending_trend': spending_trend
    })

purchase_patterns_df = pd.DataFrame(purchase_patterns)
print("Purchase pattern features:")
purchase_patterns_df.head(10)

## Derive Engagement Features from Interactions

**Why these matter**: Engagement beyond purchases indicates interest and intent. High engagement without recent purchases may signal opportunity or friction.
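
One compact way to count events per customer and type is `pd.crosstab`, sketched below on invented data. This is an illustrative alternative rather than the approach used in this notebook, and `toy_events` is hypothetical. The example also shows why a click-to-open rate needs a guard for customers with no email opens.

```python
import numpy as np
import pandas as pd

# Hypothetical events: customer 1 engages with email, customer 2 only browses.
toy_events = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'event_type': ['email_open', 'email_open', 'email_click',
                   'website_visit', 'product_view'],
})

# One row per customer, one column per event type, cells hold event counts.
counts = pd.crosstab(toy_events['customer_id'], toy_events['event_type'])

# Click-to-open rate; customers with zero opens get 0 instead of NaN or inf.
opens = counts['email_open'].replace(0, np.nan)
counts['email_engagement_rate'] = (counts['email_click'] / opens).fillna(0)

print(counts)
```

Customer 1 has one click on two opens, so the rate is 0.5. Customer 2 has no opens, so the rate falls back to 0.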

In [None]:
engagement_features = interactions_df.groupby('customer_id').agg(
    total_interactions=('interaction_id', 'count'),
    website_visits=('event_type', lambda x: (x == 'website_visit').sum()),
    email_opens=('event_type', lambda x: (x == 'email_open').sum()),
    email_clicks=('event_type', lambda x: (x == 'email_click').sum()),
    support_tickets=('event_type', lambda x: (x == 'support_ticket').sum()),
    product_views=('event_type', lambda x: (x == 'product_view').sum()),
    cart_adds=('event_type', lambda x: (x == 'cart_add').sum())
).reset_index()

engagement_features['email_engagement_rate'] = (
    engagement_features['email_clicks'] / engagement_features['email_opens'].replace(0, np.nan)
).fillna(0)

print("Engagement features derived from interactions:")
engagement_features.head(10)

## Combine All Features into Final Dataset

In [None]:
final_df = customers_df.merge(rfm_features, on='customer_id', how='left')
final_df = final_df.merge(purchase_patterns_df, on='customer_id', how='left')
final_df = final_df.merge(engagement_features, on='customer_id', how='left')

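# Customers whose simulated purchases all fell after OBSERVATION_DATE have no
# transactions. Filling recency with 0 would make them look like the most
# recent buyers, so fall back to days since signup instead.
final_df['recency_days'] = final_df['recency_days'].fillna(final_df['history_days'])

# Remaining numeric gaps (counts, amounts, rates) are genuine zeros.
# Purchase-date columns stay NaT for customers without transactions.
numeric_cols = final_df.select_dtypes(include='number').columns
final_df[numeric_cols] = final_df[numeric_cols].fillna(0)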

print(f"Final dataset shape: {final_df.shape}")
final_df.head()
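
A left merge keeps customers that have no rows in a feature table and leaves NaN in the merged columns. The sketch below shows one way to spot such customers with the `indicator=True` option of `DataFrame.merge`. The `profiles` and `toy_rfm` tables are hypothetical stand-ins for the real DataFrames.

```python
import pandas as pd

# Hypothetical toy tables: customer 3 has a profile but no transactions.
profiles = pd.DataFrame({'customer_id': [1, 2, 3],
                         'history_days': [200, 120, 95]})
toy_rfm = pd.DataFrame({'customer_id': [1, 2],
                        'recency_days': [10, 45],
                        'frequency': [12, 3]})

merged = profiles.merge(toy_rfm, on='customer_id', how='left', indicator=True)

# '_merge' is 'left_only' for customers missing from the feature table.
no_purchases = merged['_merge'] == 'left_only'
print(merged.loc[no_purchases, 'customer_id'].tolist())
```

Knowing which rows are unmatched makes it easier to choose a fill value per column instead of applying one default everywhere.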

## Generate Target Variable: Forward-Looking 12-Month CLV

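**Why this is the label**: We predict future value, not historical spend. Because this dataset is synthetic, the label (stored as `future_12m_ltv`) is simulated from the derived features: past spend scaled by recency, frequency, engagement, and trend factors, with random noise of up to ±30%. This gives the model a learnable but imperfect signal.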
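
To make the multiplicative structure concrete, the sketch below evaluates the factors for one hypothetical customer with the random noise left out. The function name `expected_future_ltv` and the input values are assumptions for illustration; the constants mirror those in the cell that follows.

```python
def expected_future_ltv(monetary_total, recency_days, frequency,
                        total_interactions, spending_trend):
    """Noise-free version of the label formula, for illustration only."""
    base = monetary_total * 0.6                            # 60% of past spend
    recency_factor = max(0.5, 1.5 - recency_days / 180)    # range 0.5 to 1.5
    frequency_factor = min(2.0, 1 + frequency / 20)        # capped at 2.0
    engagement_factor = 1 + total_interactions / 500       # uncapped
    trend_factor = max(0.5, min(2.0, 1 + spending_trend))  # clipped to [0.5, 2.0]
    return (base * recency_factor * frequency_factor
            * engagement_factor * trend_factor)

# Hypothetical customer: 1000 spent, last purchase 90 days ago,
# 10 purchases, 250 interactions, flat spending trend.
value = expected_future_ltv(1000.0, 90, 10, 250, 0.0)
print(value)
```

For this customer the factors are 1.0 (recency), 1.5 (frequency), 1.5 (engagement), and 1.0 (trend), so the expected value is 600 × 1.0 × 1.5 × 1.5 × 1.0 = 1350.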

In [None]:
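def calculate_future_ltv(row):
    """Simulate a customer's next-12-month value from historical features."""
    # Baseline: 60% of historical spend carries forward.
    base_ltv = row['monetary_total'] * 0.6
    
    # Recent buyers score up to 1.5x; the factor floors at 0.5x from 180 days on.
    recency_factor = max(0.5, 1.5 - (row['recency_days'] / 180))
    # Frequent buyers score higher, capped at 2.0x from 20 purchases on.
    frequency_factor = min(2.0, 1 + (row['frequency'] / 20))
    
    # Engagement adds 1x per 500 interactions (uncapped).
    engagement_factor = 1 + (row['total_interactions'] / 500)
    
    # Spending trend is clipped to the range [0.5, 2.0].
    trend_factor = 1 + row['spending_trend']
    trend_factor = max(0.5, min(2.0, trend_factor))
    
    future_ltv = base_ltv * recency_factor * frequency_factor * engagement_factor * trend_factor
    
    # Multiplicative noise of up to +/-30% so the label is not a pure function of the features.
    future_ltv = future_ltv * np.random.uniform(0.7, 1.3)
    
    return round(max(0, future_ltv), 2)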

final_df['future_12m_ltv'] = final_df.apply(calculate_future_ltv, axis=1)

print(f"Average future 12-month LTV: ${final_df['future_12m_ltv'].mean():.2f}")
print(f"Median future 12-month LTV: ${final_df['future_12m_ltv'].median():.2f}")

## Data Quality Checks

In [None]:
print(f"Total customers in final dataset: {len(final_df)}")
print(f"Total transactions: {len(transactions_df)}")
print(f"Total interactions: {len(interactions_df)}")
print(f"\nAverage transactions per customer: {final_df['frequency'].mean():.2f}")
print(f"Average total spend per customer: ${final_df['monetary_total'].mean():.2f}")
print(f"Average recency (days): {final_df['recency_days'].mean():.2f}")

print("\nSegment distribution:")
print(final_df['segment'].value_counts())

print("\nKey feature statistics:")
print(final_df[['recency_days', 'frequency', 'monetary_avg', 'future_12m_ltv']].describe())

## Save All Datasets

In [None]:
customers_df.to_csv('continuous_customers_profile.csv', index=False)
transactions_df.to_csv('continuous_transactions.csv', index=False)
interactions_df.to_csv('continuous_interactions.csv', index=False)
final_df.to_csv('continuous_customers_features.csv', index=False)

print("Saved all datasets:")
print("  - continuous_customers_profile.csv")
print("  - continuous_transactions.csv")
print("  - continuous_interactions.csv")
print("  - continuous_customers_features.csv")

## Summary

This notebook demonstrated:
1. **Raw data generation** - Starting with transactional reality
2. **Feature derivation** - Computing RFM and behavioral metrics from raw data
3. **Feature engineering rationale** - Explaining why each feature matters for CLV prediction

The resulting dataset is ready for model training with rich, realistic features derived from transactional patterns.