# BNPL Feature Engineering

**Objective**: Create domain-specific features for BNPL risk prediction

**Feature Categories**:
1. **Payment Velocity**: Speed and consistency of payments
2. **Risk Indicators**: Historical default patterns and red flags
3. **Customer Profile**: Demographic and behavioral features
4. **Transaction Patterns**: Spending behavior and trends
5. **Temporal Features**: Time-based patterns and seasonality

**Business Context**: BNPL approval decisions must be made in under 100 ms with high accuracy

In [None]:
# Environment setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# BigQuery integration
from google.cloud import bigquery
from flit_ml.config import config

# Configuration
pd.set_option('display.max_columns', 50)
sns.set_style("whitegrid")

print("Feature engineering environment ready!")

## 1. Data Loading Strategy

Feature engineering works on customer-level aggregations, so we sample whole customers and load every transaction for each one.

In [None]:
# Connect to BigQuery
client = config.get_client()

# Load a larger sample for feature engineering (random sample of customers, all of their transactions)
feature_data_query = """
WITH customer_sample AS (
  SELECT DISTINCT customer_id
  FROM `flit-data-platform.flit_staging.stg_bnpl_raw_transactions`
  ORDER BY RAND()
  LIMIT 5000  -- Sample customers for faster iteration
)
SELECT t.*
FROM `flit-data-platform.flit_staging.stg_bnpl_raw_transactions` t
INNER JOIN customer_sample cs ON t.customer_id = cs.customer_id
ORDER BY t.customer_id, t.transaction_date
"""

print("📥 Loading customer transaction data...")
df = client.query(feature_data_query).to_dataframe()

print(f"Data loaded: {df.shape[0]:,} transactions for {df['customer_id'].nunique():,} customers")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
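The query above samples customers rather than rows, so each sampled customer keeps a complete transaction history. A minimal in-memory sketch of the same idea follows; `sample_customers` and the toy data are hypothetical and only illustrate the sampling logic, not the BigQuery call.

```python
import numpy as np
import pandas as pd

def sample_customers(transactions, n_customers, seed=42):
    """Keep every transaction for a random subset of customers (hypothetical helper)."""
    ids = transactions["customer_id"].unique()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(ids, size=min(n_customers, len(ids)), replace=False)
    return transactions[transactions["customer_id"].isin(chosen)]

# Toy data: 4 customers with different numbers of transactions
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "c", "c", "d"],
    "transaction_amount": [10.0, 20.0, 30.0, 5.0, 7.5, 2.5, 99.0],
})

sampled = sample_customers(toy, n_customers=2)

# Each sampled customer should retain all of their rows
full_counts = toy.groupby("customer_id").size()
sampled_counts = sampled.groupby("customer_id").size()
print(sampled_counts)
```

Sampling rows directly would truncate histories and bias per-customer aggregates such as counts and standard deviations, which is why the customer is the sampling unit here.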

## 2. Payment Velocity Features

Key insight: Payment speed and consistency are strong predictors of default risk

In [None]:
def calculate_payment_velocity_features(df):
    """
    Calculate payment velocity and consistency features.
    
    Business Logic:
    - Faster payers = lower default risk
    - Consistent payment patterns = lower risk
    - Late payment history = red flag
    """
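    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Ensure datetime conversion (payment_date is needed for both differences below)
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    df['payment_due_date'] = pd.to_datetime(df['payment_due_date'])
    df['payment_date'] = pd.to_datetime(df['payment_date'])
    
    # Calculate days to payment
    df['days_to_payment'] = (df['payment_date'] - df['transaction_date']).dt.days
    # Positive = paid before the due date, negative = paid late
    df['days_early_late'] = (df['payment_due_date'] - df['payment_date']).dt.days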
    
    # Customer-level aggregations
    velocity_features = df.groupby('customer_id').agg({
        'days_to_payment': ['mean', 'std', 'min', 'max'],
        'days_early_late': ['mean', 'std'],
        'transaction_amount': ['count', 'mean', 'std', 'sum'],
        'payment_status': lambda x: (x == 'late').sum(),  # Count late payments
    }).round(2)
    
    # Flatten column names
    velocity_features.columns = [f"velocity_{col[0]}_{col[1]}" for col in velocity_features.columns]
    
    # Calculate derived features
    velocity_features['velocity_late_payment_rate'] = (
        velocity_features['velocity_payment_status_<lambda>'] / 
        velocity_features['velocity_transaction_amount_count']
    ).round(3)
    
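    # Consistency score in (0, 1]: 1.0 means identical days-to-payment every time.
    # std is NaN for customers with a single transaction, so their score is NaN too.
    velocity_features['velocity_payment_consistency'] = (
        1 / (1 + velocity_features['velocity_days_to_payment_std'])
    ).round(3)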
    
    return velocity_features

# Calculate velocity features
print("⚡ Calculating payment velocity features...")
# velocity_features = calculate_payment_velocity_features(df)
# print(f"Velocity features created: {velocity_features.shape}")
# velocity_features.head()

# Note: This is a template - actual implementation depends on available columns
print("📝 Velocity feature template created (awaiting actual schema)")
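To make the velocity features concrete, here is a small sketch on toy data. It assumes the same column names as the template (`transaction_date`, `payment_date`, `payment_status` with a `'late'` value), which are not yet confirmed against the real schema. It uses pandas named aggregation instead of the dict form so the output columns do not need flattening, and it fills a missing standard deviation with 0, which is a choice made for this sketch rather than part of the template.

```python
import pandas as pd

# Toy data; column names mirror the template and are assumptions
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "transaction_date": ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-10", "2024-02-10"],
    "payment_date": ["2024-01-15", "2024-02-15", "2024-03-15", "2024-01-20", "2024-03-31"],
    "payment_status": ["on_time", "on_time", "on_time", "on_time", "late"],
})
for col in ["transaction_date", "payment_date"]:
    toy[col] = pd.to_datetime(toy[col])

toy["days_to_payment"] = (toy["payment_date"] - toy["transaction_date"]).dt.days
toy["is_late"] = toy["payment_status"].eq("late")

# Named aggregation gives flat, predictable column names
velocity = toy.groupby("customer_id").agg(
    days_to_payment_mean=("days_to_payment", "mean"),
    days_to_payment_std=("days_to_payment", "std"),
    late_payment_rate=("is_late", "mean"),
)

# Consistency score: 1 / (1 + std); a missing std is treated as 0 in this sketch
velocity["payment_consistency"] = 1 / (1 + velocity["days_to_payment_std"].fillna(0))
print(velocity.round(3))
```

Customer `a` always pays after 14 days, so the standard deviation is 0 and the consistency score is 1.0. Customer `b` pays after 10 and 50 days, which pushes the score close to 0.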

## 3. Risk Indicator Features

Features that historically correlate with default risk

In [None]:
def calculate_risk_indicator_features(df):
    """
    Calculate risk indicators based on BNPL domain knowledge.
    
    Risk Factors:
    - High transaction amounts relative to income
    - Frequent BNPL usage (over-borrowing)
    - Declining payment performance
    - Multiple failed payments
    """
    
    customer_risk = df.groupby('customer_id').agg({
        'transaction_amount': ['sum', 'mean', 'max', 'count'],
        'credit_score': 'first',  # Assume this exists
        'annual_income': 'first',  # Assume this exists
        'failed_payment_count': 'sum',  # Historical failures
        'days_since_last_transaction': 'min',  # Recency
    })
    
    # Flatten columns
    customer_risk.columns = [f"risk_{col[0]}_{col[1]}" for col in customer_risk.columns]
    
    # Calculate derived risk indicators
    customer_risk['risk_debt_to_income_ratio'] = (
        customer_risk['risk_transaction_amount_sum'] / 
        customer_risk['risk_annual_income_first']
    ).clip(0, 1).round(3)
    
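    # NOTE: count / 30 is transactions per day over an assumed 30-day window,
    # not transactions per month. Replace 30 with the observed history length
    # once the schema is known.
    customer_risk['risk_transaction_frequency'] = (
        customer_risk['risk_transaction_amount_count'] / 30
    ).round(2)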
    
    customer_risk['risk_avg_transaction_to_income'] = (
        customer_risk['risk_transaction_amount_mean'] / 
        (customer_risk['risk_annual_income_first'] / 12)  # Monthly income
    ).round(3)
    
    return customer_risk

print("🚨 Risk indicator feature template created")
print("Will implement after schema exploration")
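One edge case in the ratio features is worth illustrating: when `annual_income` is 0, the division yields `inf`, and `.clip(0, 1)` turns that into 1.0, so a zero-income record looks the same as a customer at the maximum ratio. The sketch below shows one possible handling, treating zero income as missing and adding an indicator column. The `risk_income_missing` name and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy customer-level frame using the template's flattened column names
risk = pd.DataFrame(
    {
        "risk_transaction_amount_sum": [1200.0, 500.0, 90000.0, 300.0],
        "risk_annual_income_first": [48000.0, 0.0, 60000.0, np.nan],
    },
    index=pd.Index(["a", "b", "c", "d"], name="customer_id"),
)

# Treat zero income as missing so the ratio is NaN rather than inf
income = risk["risk_annual_income_first"].replace(0, np.nan)

risk["risk_debt_to_income_ratio"] = (
    risk["risk_transaction_amount_sum"] / income
).clip(0, 1).round(3)

# Hypothetical indicator so a model can still use the "income unknown" signal
risk["risk_income_missing"] = income.isna().astype(int)
print(risk)
```

Whether missing income should be imputed, flagged, or rejected upstream is a modeling decision; the sketch only keeps the two cases distinguishable.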

## 4. Temporal Features

Time-based patterns that influence BNPL behavior

In [None]:
def calculate_temporal_features(df):
    """
    Extract temporal patterns from transaction data.
    
    Temporal Insights:
    - Weekend vs weekday spending patterns
    - Month-end vs month-start behavior
    - Seasonal spending (holidays, back-to-school)
    - Time of day preferences
    """
    
    df['transaction_date'] = pd.to_datetime(df['transaction_date'])
    
    # Extract temporal components
    df['day_of_week'] = df['transaction_date'].dt.dayofweek
    df['day_of_month'] = df['transaction_date'].dt.day
    df['month'] = df['transaction_date'].dt.month
    df['quarter'] = df['transaction_date'].dt.quarter
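    # Only meaningful if transaction_date carries a time component; for a pure DATE column this is always 0
    df['hour'] = df['transaction_date'].dt.hour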
    
    # Create categorical features
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['is_month_end'] = (df['day_of_month'] > 25).astype(int)
    df['is_holiday_season'] = df['month'].isin([11, 12]).astype(int)
    # between() is inclusive on both ends, so 17:00-17:59 counts as business hours
    df['is_business_hours'] = df['hour'].between(9, 17).astype(int)
    
    # Customer temporal preferences
    temporal_features = df.groupby('customer_id').agg({
        'is_weekend': 'mean',
        'is_month_end': 'mean', 
        'is_holiday_season': 'mean',
        'is_business_hours': 'mean',
        'day_of_week': lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else 0,
        'hour': ['mean', 'std']
    }).round(3)
    
    temporal_features.columns = [f"temporal_{col[0]}_{col[1]}" if isinstance(col, tuple) else f"temporal_{col}" for col in temporal_features.columns]
    
    return temporal_features

print("⏰ Temporal feature engineering template ready")
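The flags above can be checked quickly on a handful of timestamps. This sketch uses the same rules as the template on toy data (the timestamps are made up) and aggregates them to per-customer shares.

```python
import pandas as pd

toy = pd.DataFrame({
    "customer_id": ["a", "a", "b", "b"],
    "transaction_date": pd.to_datetime([
        "2024-11-30 10:15",  # Saturday, day 30, November
        "2024-06-03 17:45",  # Monday, day 3, June
        "2024-12-26 20:00",  # Thursday, day 26, December
        "2024-03-09 09:00",  # Saturday, day 9, March
    ]),
})

ts = toy["transaction_date"]
toy["is_weekend"] = ts.dt.dayofweek.isin([5, 6]).astype(int)      # Monday=0 ... Sunday=6
toy["is_month_end"] = (ts.dt.day > 25).astype(int)
toy["is_holiday_season"] = ts.dt.month.isin([11, 12]).astype(int)
toy["is_business_hours"] = ts.dt.hour.between(9, 17).astype(int)  # inclusive on both ends

# Share of each customer's transactions that fall under each flag
flag_cols = ["is_weekend", "is_month_end", "is_holiday_season", "is_business_hours"]
shares = toy.groupby("customer_id")[flag_cols].mean()
print(shares)
```

Note that the 17:45 transaction counts as business hours because the comparison is on the hour value and `between` includes the upper bound.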

## 5. Feature Engineering Pipeline

Combine all feature engineering into a reproducible pipeline

In [None]:
def create_bnpl_features(df):
    """
    Master feature engineering pipeline for BNPL risk prediction.
    
    Returns:
        DataFrame with engineered features at customer level
    """
    
    print("🔧 Starting feature engineering pipeline...")
    
    # 1. Basic customer statistics
    basic_features = df.groupby('customer_id').agg({
        'transaction_amount': ['count', 'sum', 'mean', 'std', 'min', 'max'],
        'customer_age': 'first',
        'credit_score': 'first',
        'annual_income': 'first'
    })
    
    basic_features.columns = [f"basic_{col[0]}_{col[1]}" if isinstance(col, tuple) else f"basic_{col}" for col in basic_features.columns]
    
    # 2. Payment velocity features
    # velocity_features = calculate_payment_velocity_features(df)
    
    # 3. Risk indicators
    # risk_features = calculate_risk_indicator_features(df)
    
    # 4. Temporal patterns
    temporal_features = calculate_temporal_features(df)
    
    # 5. Combine all features
    # feature_df = pd.concat([
    #     basic_features,
    #     velocity_features,
    #     risk_features,
    #     temporal_features
    # ], axis=1)
    
    # For now, just return basic + temporal
    feature_df = pd.concat([basic_features, temporal_features], axis=1)
    
    print(f"✅ Feature engineering complete: {feature_df.shape}")
    return feature_df

# This will be implemented once we understand the actual schema
print("🏗️  Feature engineering pipeline framework ready")
print("Next: Explore actual data schema to implement concrete features")

## 6. Feature Validation Framework

Ensure features are predictive and production-ready

In [None]:
def validate_features(feature_df, target_col='default_risk'):
    """
    Validate engineered features for production readiness.
    
    Validation Checks:
    1. Missing values per feature
    2. Feature distributions (zero variance, extreme skew)
    3. Highly correlated feature pairs (potential multicollinearity)
    
    Planned but not yet implemented: correlation with identifiers and
    predictive power against `target_col` (currently unused).
    """
    
    validation_report = {}
    
    # 1. Missing values
    missing_pct = (feature_df.isnull().sum() / len(feature_df) * 100).round(2)
    validation_report['missing_values'] = missing_pct[missing_pct > 0].to_dict()
    
    # 2. Feature distributions
    numeric_features = feature_df.select_dtypes(include=[np.number]).columns
    
    distribution_issues = []
    for col in numeric_features:
        skew = feature_df[col].skew()
        if feature_df[col].std() == 0:
            distribution_issues.append(f"{col}: No variance")
        elif abs(skew) > 10:  # Flag heavy skew in either direction
            distribution_issues.append(f"{col}: Highly skewed ({skew:.1f})")
    
    validation_report['distribution_issues'] = distribution_issues
    
    # 3. Feature correlations (high correlation = potential multicollinearity)
    corr_matrix = feature_df[numeric_features].corr().abs()
    high_corr_pairs = []
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if corr_matrix.iloc[i, j] > 0.8:
                high_corr_pairs.append({
                    'feature1': corr_matrix.columns[i],
                    'feature2': corr_matrix.columns[j],
                    'correlation': round(corr_matrix.iloc[i, j], 3)
                })
    
    validation_report['high_correlations'] = high_corr_pairs
    
    return validation_report

print("✅ Feature validation framework ready")
print("Will validate features once they are generated")
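The predictive power check listed in the docstring is not implemented in the function above. One way to approach it, sketched here under the assumption of a binary `default_risk` target, is a univariate AUC per feature computed from ranks (the Mann-Whitney identity). `univariate_auc` and the toy data are hypothetical and not part of the existing framework.

```python
import pandas as pd

def univariate_auc(feature, target):
    """AUC of one feature against a binary target via the rank-sum identity."""
    mask = feature.notna() & target.notna()
    x, y = feature[mask], target[mask].astype(int)
    n_pos, n_neg = int(y.sum()), int((1 - y).sum())
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # AUC is undefined with a single class
    ranks = x.rank(method="average")  # ties share the average rank
    rank_sum_pos = ranks[y == 1].sum()
    return float((rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

# Toy data: one informative feature, one uninformative feature
toy = pd.DataFrame({
    "late_payment_rate": [0.0, 0.1, 0.2, 0.6, 0.7, 0.9],
    "noise": [3, 1, 2, 1, 3, 2],
    "default_risk": [0, 0, 0, 1, 1, 1],
})

auc = {
    col: univariate_auc(toy[col], toy["default_risk"])
    for col in ["late_payment_rate", "noise"]
}
print(auc)
```

An AUC near 0.5 suggests the feature carries little signal on its own, while values near 0 or 1 indicate strong separation. Univariate AUC ignores interactions, so it works as a screening step rather than a final selection criterion.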

## 7. Next Steps

**Immediate Actions**:
1. Run data exploration notebook to understand actual schema
2. Implement concrete feature engineering based on available columns
3. Validate feature quality and predictive power
4. Create target variable for supervised learning

**Feature Engineering Priorities**:
1. **Payment Velocity** - Core BNPL predictor
2. **Debt-to-Income Ratios** - Financial health indicator  
3. **Historical Payment Performance** - Past behavior predicts future
4. **Transaction Patterns** - Behavioral fingerprints

In [None]:
# Feature engineering roadmap
roadmap = {
    'phase_1_basic': [
        'transaction_count_per_customer',
        'avg_transaction_amount',
        'total_exposure',
        'customer_age',
        'credit_score'
    ],
    'phase_2_velocity': [
        'avg_days_to_payment',
        'payment_consistency_score',
        'late_payment_rate',
        'payment_amount_variance'
    ],
    'phase_3_risk': [
        'debt_to_income_ratio',
        'transaction_frequency_trend',
        'failed_payment_count',
        'credit_utilization'
    ],
    'phase_4_temporal': [
        'weekend_spending_ratio',
        'month_end_behavior',
        'seasonal_patterns',
        'time_of_day_preferences'
    ]
}

print("🗺️  Feature Engineering Roadmap:")
for phase, features in roadmap.items():
    print(f"\n{phase.upper()}:")
    for feature in features:
        print(f"  - {feature}")