# BNPL Feature Engineering Strategy

**Objective**: Create predictive features for BNPL default risk using only available data

**Critical Constraints**:
- No payment behavior data available (major limitation acknowledged)
- Features must be computable at transaction time (<100ms)
- Must beat the current 3.5x risk discrimination baseline

**Available Data Sources**: Transaction records with customer demographics and transaction context

In [1]:
# Add project root to Python path
# Allows us to use flit-ml modules directly in notebooks
import sys
import os
project_root = os.path.abspath(os.path.join(os.getcwd(), '../..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    
print(f"Project root added to path: {project_root}")

Project root added to path: /Users/kevin/Documents/repos/flit-ml


In [2]:
# Environment setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# BigQuery integration
from google.cloud import bigquery
from flit_ml.config import config

# Configuration
pd.set_option('display.max_columns', 50)
sns.set_style("whitegrid")

print("Feature engineering environment ready!")

Feature engineering environment ready!


In [5]:
# Connect to BigQuery and load feature engineering dataset
client = config.get_client()

# Based on EDA findings, load key features for ML model development
# Focus on validated pre-transaction features that showed predictive power
feature_data_query = """
WITH customer_sample AS (
  SELECT DISTINCT customer_id
  FROM `flit-data-platform.flit_staging.stg_bnpl_raw_transactions`
  ORDER BY RAND()
  LIMIT 1000
)
SELECT 
    t.customer_id,
    t.transaction_id,
    t.amount,
    t.will_default,
    t.transaction_timestamp,
    t.days_to_first_missed_payment,  -- outcome-derived (known only after a missed payment): for label analysis, not a model input
    
    -- EDA-validated customer features (legitimate pre-transaction)
    t.customer_credit_score_range,
    t.customer_age_bracket,
    t.customer_income_bracket,
    t.customer_verification_level,
    t.customer_tenure_days,
    t.customer_state,
    
    -- EDA-validated transaction context features
    t.product_category,
    t.product_risk_category,
    t.product_price,
    t.device_type,
    t.device_is_trusted,
    
    -- EDA-validated current underwriting features (baseline to beat)
    t.risk_score,
    t.risk_level,
    t.risk_scenario,
    
    -- Additional context features
    -- first_payment_amount is excluded: it is expected to be a fixed percentage of amount
    -- payment_frequency is excluded: it is expected to be a single standard term (biweekly)
    t.payment_provider,
    t.installment_count,
    t.payment_credit_limit,
    t.payment_type,
    t.time_on_site_seconds,
    t.purchase_context,
    t.price_comparison_time
    
FROM `flit-data-platform.flit_staging.stg_bnpl_raw_transactions` t
INNER JOIN customer_sample cs ON t.customer_id = cs.customer_id
ORDER BY t.customer_id, t.transaction_timestamp
"""

print("📥 Loading feature engineering dataset based on EDA insights...")
df = client.query(feature_data_query).to_dataframe()

print(f"✅ Data loaded: {df.shape[0]:,} transactions for {df['customer_id'].nunique():,} customers")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"Default rate: {df['will_default'].mean():.1%}")

# Display schema for feature engineering reference
print(f"\n📋 Available columns for feature engineering:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2}. {col}")
    
print(f"\n🎯 Target variable: 'will_default' (Binary: {df['will_default'].value_counts().to_dict()})")

📥 Loading feature engineering dataset based on EDA insights...




✅ Data loaded: 88,309 transactions for 1,000 customers
Memory usage: 94.5 MB
Default rate: 4.8%

📋 Available columns for feature engineering:
 1. customer_id
 2. transaction_id
 3. amount
 4. will_default
 5. transaction_timestamp
 6. days_to_first_missed_payment
 7. customer_credit_score_range
 8. customer_age_bracket
 9. customer_income_bracket
10. customer_verification_level
11. customer_tenure_days
12. customer_state
13. product_category
14. product_risk_category
15. product_price
16. device_type
17. device_is_trusted
18. risk_score
19. risk_level
20. risk_scenario
21. payment_provider
22. installment_count
23. payment_credit_limit
24. payment_type
25. time_on_site_seconds
26. purchase_context
27. price_comparison_time

🎯 Target variable: 'will_default' (Binary: {False: 84027, True: 4282})


In [6]:
# Data exploration: Understand what we actually have
print("📊 Available Fields for Feature Engineering:")
print("=" * 50)

for i, col in enumerate(df.columns, 1):
    dtype = df[col].dtype
    n_unique = df[col].nunique()  # compute once instead of up to three times per column
    unique_vals = n_unique if n_unique < 10 else f"{n_unique:,} unique values"
    print(f"{i:2}. {col:<30} | {str(dtype):<15} | {unique_vals}")

print(f"\n🎯 Target Variable: will_default")
print(f"   Distribution: {df['will_default'].value_counts().to_dict()}")
print(f"   Default rate: {df['will_default'].mean():.1%}")

📊 Available Fields for Feature Engineering:
 1. customer_id                    | object          | 1,000 unique values
 2. transaction_id                 | object          | 46,458 unique values
 3. amount                         | float64         | 28,372 unique values
 4. will_default                   | boolean         | 2
 5. transaction_timestamp          | datetime64[us, UTC] | 88,118 unique values
 6. days_to_first_missed_payment   | object          | 6
 7. customer_credit_score_range    | object          | 4
 8. customer_age_bracket           | object          | 5
 9. customer_income_bracket        | object          | 5
10. customer_verification_level    | object          | 3
11. customer_tenure_days           | float64         | 731 unique values
12. customer_state                 | object          | 59 unique values
13. product_category               | object          | 5
14. product_risk_category          | object          | 3
15. product_price                  | float64    

# Feature Engineering Strategy

## Data Reality Check
**What We DON'T Have** (Critical Gap):
- Payment history/behavior data
- Previous default information
- Account balance information
- Payment timing/lateness data

**What We DO Have**:
- Customer demographics at transaction time
- Transaction context and characteristics
- Current underwriting features (baseline)
- Device and behavioral signals

## Proposed Feature Engineering Steps

### Step 1: Temporal Feature Extraction
**What**: Extract time-based components from transaction_timestamp

**Why**: Temporal patterns often correlate with financial behavior:

- Hour of day (impulse vs planned purchases)
- Day of week (weekend vs weekday spending)
- Month/season (holiday spending, month-end behavior)

**Features**: hour, day_of_week, month, is_weekend, is_month_end, is_holiday_season
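
A minimal sketch of how these temporal features could be derived with pandas. The function name `add_temporal_features` is hypothetical, and two definitions are assumptions not stated above: "month end" is taken as the last three calendar days of the month, and "holiday season" as November and December. Hours are computed in UTC because the loaded timestamps are UTC; customer-local time would need a timezone that this dataset may not provide.

```python
import pandas as pd


def add_temporal_features(df: pd.DataFrame, ts_col: str = "transaction_timestamp") -> pd.DataFrame:
    """Return a copy of df with calendar features derived from ts_col."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col], utc=True)

    out["hour"] = ts.dt.hour
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month
    out["is_weekend"] = ts.dt.dayofweek >= 5

    # Assumption: "month end" means the last 3 calendar days of the month.
    out["is_month_end"] = (ts.dt.days_in_month - ts.dt.day) < 3

    # Assumption: "holiday season" means November and December.
    out["is_holiday_season"] = ts.dt.month.isin([11, 12])
    return out


# Tiny illustrative frame (made-up timestamps, not project data).
demo = pd.DataFrame(
    {"transaction_timestamp": ["2024-11-30 23:15:00+00:00", "2024-03-12 09:00:00+00:00"]}
)
temporal_demo = add_temporal_features(demo)
print(temporal_demo.drop(columns="transaction_timestamp"))
```

All of these are simple vectorized accessors on a single timestamp, so they should fit comfortably within a per-transaction latency budget.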

### Step 2: Categorical Variable Encoding
**What**: Convert categorical variables to numerical representations

**Why**: ML algorithms require numerical inputs

**Categorical Variables Identified**:

- customer_credit_score_range
- customer_age_bracket  
- customer_income_bracket
- customer_verification_level
- product_category
- product_risk_category
- device_type
- risk_level
- payment_provider
- purchase_context

**Method**: Ordinal encoding for ordered categories, one-hot encoding for nominal ones
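
The encoding split could look roughly like the sketch below. The category labels in `ORDINAL_LEVELS` are hypothetical placeholders, since the actual values were not printed above and must be read from the data before fixing an order. The helper name `encode_categoricals` is likewise illustrative, and only a subset of the listed columns is shown.

```python
import pandas as pd

# Hypothetical orderings: replace the labels with the values actually present in the data.
ORDINAL_LEVELS = {
    "customer_credit_score_range": ["poor", "fair", "good", "excellent"],
    "customer_verification_level": ["unverified", "partial", "verified"],
}
# Columns with no natural order get one-hot encoded.
NOMINAL_COLS = ["device_type", "payment_provider"]


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Add *_ord columns for ordered categories and one-hot columns for nominal ones."""
    out = df.copy()
    for col, levels in ORDINAL_LEVELS.items():
        cat = pd.Categorical(out[col], categories=levels, ordered=True)
        # Codes run 0..n-1 in the declared order; labels outside `levels` become -1.
        out[col + "_ord"] = cat.codes
    return pd.get_dummies(out, columns=NOMINAL_COLS, dtype=int)


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "customer_credit_score_range": ["good", "poor", "unknown"],
        "customer_verification_level": ["verified", "unverified", "partial"],
        "device_type": ["mobile", "desktop", "mobile"],
        "payment_provider": ["a", "b", "a"],
    }
)
encoded_demo = encode_categoricals(demo)
print(encoded_demo.filter(regex="_ord$|^device_type_|^payment_provider_"))
```

At scoring time the one-hot columns would need to be reindexed to the column set seen in training, because `pd.get_dummies` only creates columns for the values present in the frame it is given.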

### Step 3: Customer Historical Aggregation Features
**What**: Create customer-level behavioral patterns from transaction history

**Why**: Even without payment data, transaction patterns reveal behavior:

- Spending consistency/volatility
- Purchase frequency patterns
- Product category preferences
- Device/channel preferences

**Features**: transaction_count, avg_amount, amount_volatility, category_diversity, device_consistency
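
One way to build these aggregates without leaking future information is to let each transaction see only the same customer's earlier transactions. The sketch below assumes that point-in-time design, which the plan above does not spell out. The helper name `add_customer_history_features` is hypothetical, and `device_consistency` is omitted for brevity.

```python
import pandas as pd


def add_customer_history_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-transaction aggregates over the same customer's *earlier* transactions."""
    out = (
        df.sort_values(["customer_id", "transaction_timestamp"])
        .reset_index(drop=True)
    )
    cust = out["customer_id"]

    # Number of earlier transactions by this customer (0 for the first one).
    out["transaction_count"] = out.groupby("customer_id", sort=False).cumcount()

    # shift(1) removes the current row, so the expanding stats only see the past.
    prior_amount = out.groupby("customer_id", sort=False)["amount"].shift(1)
    prior_stats = prior_amount.groupby(cust, sort=False).expanding()
    out["avg_amount"] = prior_stats.mean().reset_index(level=0, drop=True)
    out["amount_volatility"] = prior_stats.std().reset_index(level=0, drop=True)

    # Distinct product categories seen before this transaction:
    # cumulative count of first-time categories, minus the current row's own flag.
    is_new = (~out.duplicated(["customer_id", "product_category"])).astype(int)
    out["category_diversity"] = is_new.groupby(cust, sort=False).cumsum() - is_new
    return out


# Made-up transactions for illustration only.
demo = pd.DataFrame(
    {
        "customer_id": ["c1", "c1", "c1", "c2"],
        "transaction_timestamp": pd.to_datetime(
            ["2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02"], utc=True
        ),
        "amount": [100.0, 200.0, 300.0, 50.0],
        "product_category": ["electronics", "fashion", "electronics", "home"],
    }
)
history_demo = add_customer_history_features(demo)
print(history_demo[["customer_id", "transaction_count", "avg_amount", "category_diversity"]])
```

A customer's first transaction has no history, so `avg_amount` and `amount_volatility` are NaN there and need an explicit fill or a model that tolerates missing values. In production these running aggregates would likely be served from a precomputed store rather than recomputed per request.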

### Step 4: Transaction Context Risk Indicators
**What**: Engineer risk signals from available transaction context

**Why**: Certain transaction characteristics correlate with default risk:

- High amounts relative to income/credit limit
- High-risk product categories
- Unverified customers
- Untrusted devices

**Features**: amount_to_income_ratio, amount_to_credit_ratio, risk_score_normalized
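
A rough sketch of the ratio features follows. Because income is only available as a bracket, it assumes each bracket is mapped to a representative midpoint. The labels and dollar values in `INCOME_MIDPOINTS` are hypothetical, as is the helper name `add_context_risk_features`. The `risk_score` bounds are passed in explicitly because the score's scale is not documented above; in practice they would be fixed from training data.

```python
import pandas as pd

# Hypothetical bracket labels and annual-income midpoints; replace with the real ones.
INCOME_MIDPOINTS = {
    "<25k": 20_000,
    "25k-50k": 37_500,
    "50k-75k": 62_500,
    "75k-100k": 87_500,
    "100k+": 125_000,
}


def add_context_risk_features(
    df: pd.DataFrame, risk_score_min: float, risk_score_max: float
) -> pd.DataFrame:
    """Add amount ratios and a min-max scaled risk score."""
    out = df.copy()

    # Unmapped brackets become NaN rather than silently defaulting to a value.
    income = out["customer_income_bracket"].map(INCOME_MIDPOINTS)
    out["amount_to_income_ratio"] = out["amount"] / income

    # Treat non-positive credit limits as missing to avoid division by zero.
    limit = out["payment_credit_limit"].where(out["payment_credit_limit"] > 0)
    out["amount_to_credit_ratio"] = out["amount"] / limit

    # Min-max scaling with fixed bounds; out-of-range scores are clipped to [0, 1].
    span = risk_score_max - risk_score_min
    out["risk_score_normalized"] = ((out["risk_score"] - risk_score_min) / span).clip(0, 1)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "amount": [500.0, 100.0],
        "customer_income_bracket": ["25k-50k", "unknown"],
        "payment_credit_limit": [1000.0, 0.0],
        "risk_score": [40, 120],
    }
)
context_demo = add_context_risk_features(demo, risk_score_min=0, risk_score_max=100)
print(context_demo[["amount_to_income_ratio", "amount_to_credit_ratio", "risk_score_normalized"]])
```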

### Step 5: Interaction Features
**What**: Create features that capture relationships between variables

**Why**: Risk often emerges from combinations of factors

**Examples**: 

- young_customer_high_amount (age + amount interaction)
- unverified_high_risk_product (verification + product risk)
- mobile_impulse_purchase (device + context)
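
The three example interactions could be expressed as boolean conjunctions, as in the sketch below. The category labels it matches on (`"18-24"`, `"unverified"`, `"high"`, `"mobile"`, `"impulse"`) are hypothetical and need to be checked against the actual values. The helper `add_interaction_features` and its `high_amount_threshold` parameter are assumptions for illustration.

```python
import pandas as pd


def add_interaction_features(df: pd.DataFrame, high_amount_threshold: float) -> pd.DataFrame:
    """Add 0/1 flags for combinations of individually weak risk signals."""
    out = df.copy()

    # Hypothetical label for the youngest age bracket.
    is_young = out["customer_age_bracket"].isin(["18-24"])
    is_high_amount = out["amount"] >= high_amount_threshold
    out["young_customer_high_amount"] = (is_young & is_high_amount).astype(int)

    # Hypothetical labels for verification level and product risk.
    is_unverified = out["customer_verification_level"] == "unverified"
    is_high_risk_product = out["product_risk_category"] == "high"
    out["unverified_high_risk_product"] = (is_unverified & is_high_risk_product).astype(int)

    # Hypothetical labels for device type and purchase context.
    is_mobile = out["device_type"] == "mobile"
    is_impulse = out["purchase_context"] == "impulse"
    out["mobile_impulse_purchase"] = (is_mobile & is_impulse).astype(int)
    return out


# Made-up rows for illustration only.
demo = pd.DataFrame(
    {
        "customer_age_bracket": ["18-24", "35-44", "18-24"],
        "amount": [900.0, 950.0, 40.0],
        "customer_verification_level": ["unverified", "verified", "unverified"],
        "product_risk_category": ["high", "high", "low"],
        "device_type": ["mobile", "desktop", "mobile"],
        "purchase_context": ["impulse", "planned", "impulse"],
    }
)
interaction_demo = add_interaction_features(demo, high_amount_threshold=500.0)
print(interaction_demo.iloc[:, -3:])
```

Tree-based models can learn such conjunctions on their own, so explicit flags like these tend to matter most for linear models or when the combined segment is small.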

### Step 6: Feature Validation and Selection
**What**: Validate feature quality and remove redundant/low-value features

**Why**: Ensure features are predictive and production-ready

**Methods**: 

- Correlation analysis
- Feature importance
- Missing value analysis
- Distribution validation
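
A first pass over the missing-value and correlation checks might look like the sketch below. The helper names and the 0.95 redundancy threshold are assumptions, not project settings. Model-based feature importance is left out because it depends on the estimator chosen later.

```python
import numpy as np
import pandas as pd


def feature_quality_report(features: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
    """Per-feature missing rate, cardinality, and linear correlation with the target."""
    numeric = features.select_dtypes(include="number")
    return pd.DataFrame(
        {
            "missing_rate": numeric.isna().mean(),
            "n_unique": numeric.nunique(),
            "corr_with_target": numeric.corrwith(target.astype(float)),
        }
    )


def redundant_features(features: pd.DataFrame, threshold: float = 0.95) -> list:
    """Columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Made-up feature matrix for illustration only.
demo_features = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 4.0, 6.0, 8.0],  # exact multiple of "a", so redundant
        "c": [1.0, 0.0, 1.0, None],
    }
)
demo_target = pd.Series([False, False, True, True])

report_demo = feature_quality_report(demo_features, demo_target)
redundant_demo = redundant_features(demo_features)
print(report_demo)
print("Redundant:", redundant_demo)
```

Pearson correlation only captures linear, pairwise redundancy. With a default rate near 5%, a rank-based or information-based measure may be a useful complement.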

## Expected Limitations
1. **No Payment Behavior**: Severely limits predictive power
2. **Cross-sectional Only**: No longitudinal payment performance
3. **Synthetic Data**: May not reflect real-world patterns
4. **Limited Risk Signals**: Current features may not capture default risk effectively

## Success Metrics
- Beat current 3.5x risk discrimination ratio
- Achieve >40% precision on high-risk segment
- Maintain feature computation <100ms
- Demonstrate statistical significance vs baseline
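
The first two metrics could be computed as sketched below. The notebook does not define "risk discrimination ratio". This sketch assumes it means the default rate in the flagged high-risk segment divided by the default rate in the rest of the population, which may differ from how the 3.5x baseline was measured. The function name `segment_metrics` and the score threshold are hypothetical.

```python
import pandas as pd


def segment_metrics(scores: pd.Series, defaults: pd.Series, threshold: float) -> dict:
    """Precision on the flagged segment and an assumed discrimination ratio."""
    flagged = scores >= threshold
    defaults = defaults.astype(float)

    # Precision on the high-risk segment: share of flagged transactions that default.
    high_rate = float(defaults[flagged].mean())
    low_rate = float(defaults[~flagged].mean())

    # Assumed definition: flagged default rate over unflagged default rate.
    ratio = high_rate / low_rate if low_rate > 0 else float("inf")
    return {"precision_high_risk": high_rate, "discrimination_ratio": ratio}


# Made-up scores and outcomes for illustration only.
demo_scores = pd.Series([0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.4, 0.1, 0.2, 0.3])
demo_defaults = pd.Series([1, 1, 0, 0, 0, 1, 0, 0, 0, 0])

metrics_demo = segment_metrics(demo_scores, demo_defaults, threshold=0.7)
print(metrics_demo)
```

Comparing against the baseline would mean running the same function on `risk_score` with a threshold that flags a segment of comparable size, then testing whether the difference holds up on held-out customers.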