# Modeling: Offer Completion Prediction

**Goal:** Build and compare ML models to predict which customers will complete offers.

**Models to Train:**
1. Logistic Regression (baseline)
2. Decision Tree (baseline)
3. Random Forest (ensemble)
4. XGBoost (ensemble)

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                            f1_score, roc_auc_score, confusion_matrix,
                            classification_report, roc_curve, precision_recall_curve)
import joblib
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')
RANDOM_STATE = 42

print("Environment ready! ✓")

Environment ready! ✓


## Load Processed Data

In [2]:
processed_dir = '../Cafe_Rewards_Offers/processed'

X_train = joblib.load(f'{processed_dir}/X_train_scaled.pkl')
X_test = joblib.load(f'{processed_dir}/X_test_scaled.pkl')
y_train = joblib.load(f'{processed_dir}/y_train.pkl')
y_test = joblib.load(f'{processed_dir}/y_test.pkl')
feature_names = joblib.load(f'{processed_dir}/feature_names.pkl')
scaler = joblib.load(f'{processed_dir}/scaler.pkl')

print("="*60)
print("DATA LOADED")
print("="*60)
print(f"\nTraining set: {X_train.shape[0]:,} samples × {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]:,} samples × {X_test.shape[1]} features")
print(f"\nTarget distribution in train set:")
print(y_train.value_counts(normalize=True).round(3))
print(f"\nTarget distribution in test set:")
print(y_test.value_counts(normalize=True).round(3))

DATA LOADED

Training set: 69,145 samples × 26 features
Test set: 17,287 samples × 26 features

Target distribution in train set:
target
1    0.534
0    0.466
Name: proportion, dtype: float64

Target distribution in test set:
target
1    0.534
0    0.466
Name: proportion, dtype: float64


## 🔍 Check for Data Leakage

**Important:** Perfect metrics (1.0 accuracy) usually indicate data leakage.
Let's check if any features directly reveal the target.
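
To make the claim concrete, here is a minimal sketch on synthetic data (not the cafe dataset) showing how a single column that copies the target pushes accuracy to near-perfect, even when every other feature is pure noise. The names `X_noise`, `X_leaky`, and `holdout_accuracy` are illustrative assumptions and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: 5 noise features and a random binary target
n_samples = 1000
X_noise = rng.normal(size=(n_samples, 5))
y = rng.integers(0, 2, size=n_samples)

# Leaky variant: append a column that is an exact copy of the target
X_leaky = np.column_stack([X_noise, y])


def holdout_accuracy(X, y):
    """Fit a logistic regression and return accuracy on a 25% hold-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return model.score(X_te, y_te)


acc_clean = holdout_accuracy(X_noise, y)  # features carry no signal
acc_leaky = holdout_accuracy(X_leaky, y)  # copied column gives the answer away

print(f"Accuracy without leaked column: {acc_clean:.3f}")
print(f"Accuracy with leaked column:    {acc_leaky:.3f}")
```

The gap between the two scores is the signature to look for: a jump to near-perfect performance that comes from one column is a reason to ask whether that column would exist at prediction time.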

In [3]:
print("="*60)
print("CHECKING FOR DATA LEAKAGE")
print("="*60)

# Check feature names
print(f"\nFeatures ({len(feature_names)} total):")
for i, feat in enumerate(feature_names):
    print(f"  {i:2}. {feat}")

# Check if target has perfect correlation with any feature
print("\n" + "="*60)
print("CHECKING FOR PERFECT CORRELATION")
print("="*60)

# Combine X and y for correlation check
train_df = X_train.copy()
train_df['target'] = y_train.values

# Calculate correlation with target
correlations = train_df.corr()['target'].sort_values(ascending=False)

print("\nTop correlations with target:")
for feat, corr in correlations.head(10).items():
    print(f"  {feat:30}: {corr:.4f}")

# Flag potential data leaks (|correlation| = 1.0 or near 1.0).
# Exclude 'target' itself: it always correlates perfectly with itself,
# so leaving it in would trigger the warning on every run.
feature_corrs = correlations.drop('target')
is_perfect = np.isclose(feature_corrs.abs(), 1.0)
perfect_leaks = feature_corrs[is_perfect]
if len(perfect_leaks) > 0:
    print(f"\n⚠️  DATA LEAKAGE DETECTED!")
    print(f"Features with perfect correlation (|r| = 1.0):")
    for feat in perfect_leaks.index:
        print(f"  - {feat}")
    print("\n⚠️  ACTION REQUIRED: Remove these features before modeling!")
else:
    print("\n✓ No perfect data leaks detected (|correlation| < 1.0)")

# Check for near-perfect leaks (|correlation| > 0.95), positive or negative
near_leaks = feature_corrs[(feature_corrs.abs() > 0.95) & ~is_perfect]
if len(near_leaks) > 0:
    print(f"\n⚠️  NEAR-PERFECT DATA LEAKAGE DETECTED!")
    print(f"Features with near-perfect correlation (|r| > 0.95):")
    for feat, corr in near_leaks.items():
        print(f"  - {feat:30}: {corr:.4f}")

CHECKING FOR DATA LEAKAGE

Features (26 total):
   0. received_time
   1. difficulty
   2. duration
   3. in_email
   4. in_mobile
   5. in_social
   6. in_web
   7. offer_received
   8. offer_viewed
   9. offer_completed
  10. age
  11. income
  12. membership_year
  13. is_demographics_missing
  14. membership_duration_days
  15. membership_month
  16. offer_type_bogo
  17. offer_type_discount
  18. offer_type_informational
  19. gender_F
  20. gender_M
  21. gender_Missing
  22. gender_O
  23. age_group_encoded
  24. income_bracket_encoded
  25. tenure_group_encoded

CHECKING FOR PERFECT CORRELATION

Top correlations with target:
  target                        : 1.0000
  offer_completed               : 1.0000
  duration                      : 0.3518
  income                        : 0.3164
  income_bracket_encoded        : 0.3081
  difficulty                    : 0.2695
  offer_type_discount           : 0.2497
  tenure_group_encoded          : 0.2294
  in_web                       

In [4]:


print("=" * 60)
print("CHECKING FOR DATA LEAKAGE")
print("=" * 60)

# Check if offer_completed is in features (THIS IS THE PROBLEM!)
if 'offer_completed' in X_train.columns:
    print("\n" + "=" * 60)
    print("⚠️  DATA LEAKAGE DETECTED!")
    print("=" * 60)
    print("Column 'offer_completed' found in features.")
    print("This is IDENTICAL to target, causing perfect 1.0 predictions.")
    print("Dropping 'offer_completed' from train and test sets...")
    
    X_train.drop('offer_completed', axis=1, inplace=True)
    X_test.drop('offer_completed', axis=1, inplace=True)
    
    print(f"✓ Dropped. New train shape: {X_train.shape}")
    print(f"✓ Features remaining: {len(X_train.columns)}")
else:
    print("\n✓ No 'offer_completed' column found (already removed)")

# Also check for offer_viewed (less severe leak, but worth noting)
if 'offer_viewed' in X_train.columns:
    print("\nℹ️  INFO: 'offer_viewed' feature present")
    print("=" * 60)
    print("This feature is a potential data leak.")
    print("It's available before completion in real-time scenarios.")
    print("For now, we'll keep it to see full model performance.")
    print("=" * 60)
    print("\nRECOMMENDATION:")
    print("For true real-time prediction models, consider:")
    print("  1. Train models WITHOUT 'offer_viewed'")
    print("  2. For 'post-notification' prediction, keep 'offer_viewed'")
    print("\nFor now, we'll keep it to see full model performance.")
else:
    print("✓ No 'offer_viewed' feature found")

# Final verification
print("\n" + "=" * 60)
print("FINAL DATA SHAPE")
print("=" * 60)
print(f"Train: {X_train.shape[0]:,} samples × {X_train.shape[1]} features")
print(f"Test: {X_test.shape[0]:,} samples × {X_test.shape[1]} features")
print(f"Features: {len(X_train.columns)}")


CHECKING FOR DATA LEAKAGE

⚠️  DATA LEAKAGE DETECTED!
Column 'offer_completed' found in features.
This is IDENTICAL to target, causing perfect 1.0 predictions.
Dropping 'offer_completed' from train and test sets...
✓ Dropped. New train shape: (69145, 25)
✓ Features remaining: 25

ℹ️  INFO: 'offer_viewed' feature present
This feature is a potential data leak.
It's available before completion in real-time scenarios.
For now, we'll keep it to see full model performance.

RECOMMENDATION:
For true real-time prediction models, consider:
  1. Train models WITHOUT 'offer_viewed'
  2. For 'post-notification' prediction, keep 'offer_viewed'

For now, we'll keep it to see full model performance.

FINAL DATA SHAPE
Train: 69,145 samples × 25 features
Test: 17,287 samples × 25 features
Features: 25


In [None]:
# Check if offer_completed is in features and remove it.
# Note: if an earlier cell already dropped 'offer_completed', this cell is a no-op.
if 'offer_completed' in X_train.columns:
    print("=" * 60)
    print("⚠️  DATA LEAKAGE DETECTED!")
    print("=" * 60)
    print("Column 'offer_completed' found in features.")
    print("This is IDENTICAL to target, causing perfect predictions.")
    print("\nDropping 'offer_completed' and 'offer_viewed' from train and test sets...\n")

    # drop() returns a new DataFrame. Assigning the result of a call that
    # uses inplace=True would set X_train and X_test to None.
    leak_cols = ['offer_completed', 'offer_viewed']
    X_train = X_train.drop(columns=leak_cols, errors='ignore')
    X_test = X_test.drop(columns=leak_cols, errors='ignore')

    # Update feature names list (no `global` needed at notebook top level)
    feature_names = [f for f in feature_names if f not in leak_cols]

    print(f"✓ Dropped. New shape: {X_train.shape}")
    print(f"✓ Features remaining: {len(feature_names)}")
else:
    print("✓ No data leakage columns detected")


In [None]:
X_test.columns

## Evaluation Functions

In [None]:
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Train and evaluate a model, returning metrics and predictions."""
    
    print(f"\n{'='*60}")
    print(f"TRAINING: {model_name}")
    print(f"{'='*60}")
    
    # Train model
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    metrics = {
        'model': model_name,
        'train_accuracy': accuracy_score(y_train, y_train_pred),
        'test_accuracy': accuracy_score(y_test, y_test_pred),
        'precision': precision_score(y_test, y_test_pred),
        'recall': recall_score(y_test, y_test_pred),
        'f1': f1_score(y_test, y_test_pred),
        'roc_auc': roc_auc_score(y_test, y_test_proba)
    }
    
    # Cross-validation score
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    metrics['cv_f1_mean'] = cv_scores.mean()
    metrics['cv_f1_std'] = cv_scores.std()
    
    # Print results
    print(f"\n✓ Model trained successfully")
    print(f"\nTest Set Performance:")
    print(f"  Accuracy:  {metrics['test_accuracy']:.4f}")
    print(f"  Precision: {metrics['precision']:.4f}")
    print(f"  Recall:    {metrics['recall']:.4f}")
    print(f"  F1-Score:  {metrics['f1']:.4f}")
    print(f"  AUC-ROC:   {metrics['roc_auc']:.4f}")
    print(f"\n5-Fold CV F1-Score: {metrics['cv_f1_mean']:.4f} (±{metrics['cv_f1_std']:.4f})")
    
    # Check for overfitting/underfitting (more nuanced check)
    train_test_diff = metrics['train_accuracy'] - metrics['test_accuracy']
    cv_test_diff = abs(metrics['cv_f1_mean'] - metrics['f1'])
    
    print(f"\nOverfitting Analysis:")
    print(f"  Train acc: {metrics['train_accuracy']:.4f}")
    print(f"  Test acc: {metrics['test_accuracy']:.4f}")
    print(f"  Difference: {train_test_diff:+.4f}")
    print(f"  CV F1: {metrics['cv_f1_mean']:.4f} vs Test F1: {metrics['f1']:.4f} (diff: {cv_test_diff:+.4f})")
    
    # Overfitting detection criteria
    if train_test_diff > 0.15:
        print(f"\n⚠️  OVERFITTING DETECTED: Train acc exceeds test by {train_test_diff:.3f}")
    elif train_test_diff < -0.10:
        print(f"\n⚠️  UNDERFITTING: Test acc exceeds train by {abs(train_test_diff):.3f}")
    elif cv_test_diff > 0.10:
        # cv_test_diff is an absolute gap, so the direction is not known here
        print(f"\n⚠️  OVERFITTING WARNING: CV F1 and Test F1 differ by {cv_test_diff:.3f}")
    else:
        print(f"\n✓ No significant overfitting/underfitting")
    
    return model, metrics, y_test_pred, y_test_proba


def plot_confusion_matrix(y_true, y_pred, model_name):
    """Plot confusion matrix for a model."""
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=['Not Completed', 'Completed'],
                yticklabels=['Not Completed', 'Completed'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.tight_layout()
    plt.show()


def plot_roc_curve(y_true, y_proba, model_name):
    """Plot ROC curve for a model."""
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    auc_score = roc_auc_score(y_true, y_proba)
    
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {auc_score:.3f})', linewidth=2)
    plt.plot([0, 1], [0, 1], 'k--', label='Random', linewidth=1)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc='lower right')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    plt.show()

print("Evaluation functions defined! ✓")
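
As a sanity check on the metrics that `evaluate_model` reports, the sketch below derives precision, recall, and F1 by hand from a confusion matrix and compares them with scikit-learn's helpers. The toy labels `y_true_demo` and `y_pred_demo` are made up for illustration and are not drawn from the cafe dataset.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (illustrative only)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of predicted completions, how many were real
recall = tp / (tp + fn)     # of real completions, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"manual : precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"sklearn: precision={precision_score(y_true_demo, y_pred_demo):.4f} "
      f"recall={recall_score(y_true_demo, y_pred_demo):.4f} "
      f"f1={f1_score(y_true_demo, y_pred_demo):.4f}")
```

Both rows should agree, which confirms that the values printed by `evaluate_model` refer to the positive class (label 1, "Completed").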

## Baseline Models

Start with simple, interpretable models to establish performance baselines.
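
Before reading any model's score, it helps to know what a trivial predictor achieves. The sketch below is an addition, not part of the original pipeline: it uses scikit-learn's `DummyClassifier` on stand-in labels (`y_demo`, an assumption) built to match the 53.4% / 46.6% class split reported for the training set, and always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Stand-in labels with the same class split as the training set
y_demo = np.array([1] * 534 + [0] * 466)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class (here: 1, "Completed")
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_dummy_pred = dummy.predict(X_demo)

floor_accuracy = accuracy_score(y_demo, y_dummy_pred)
floor_f1 = f1_score(y_demo, y_dummy_pred)

print(f"Majority-class accuracy: {floor_accuracy:.3f}")
print(f"Majority-class F1:       {floor_f1:.3f}")
```

Because predicting "Completed" for everyone yields recall 1.0 and precision 0.534, the F1 floor is roughly 0.70. A baseline model is only informative to the extent that it beats that figure.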

In [None]:
results = []
predictions = {}
probabilities = {}

lr_model = LogisticRegression(
    random_state=RANDOM_STATE,
    max_iter=1000,
    class_weight='balanced'
)
lr_model, lr_metrics, lr_pred, lr_proba = evaluate_model(
    lr_model, X_train, X_test, y_train, y_test, "Logistic Regression"
)
results.append(lr_metrics)
predictions['Logistic Regression'] = lr_pred
probabilities['Logistic Regression'] = lr_proba

In [None]:
dt_model = DecisionTreeClassifier(
    random_state=RANDOM_STATE,
    max_depth=10,
    min_samples_split=50,
    class_weight='balanced'
)
dt_model, dt_metrics, dt_pred, dt_proba = evaluate_model(
    dt_model, X_train, X_test, y_train, y_test, "Decision Tree"
)
results.append(dt_metrics)
predictions['Decision Tree'] = dt_pred
probabilities['Decision Tree'] = dt_proba

## Ensemble Models

Train more powerful ensemble models to improve performance.

In [None]:
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    class_weight='balanced'
)
rf_model, rf_metrics, rf_pred, rf_proba = evaluate_model(
    rf_model, X_train, X_test, y_train, y_test, "Random Forest"
)
results.append(rf_metrics)
predictions['Random Forest'] = rf_pred
probabilities['Random Forest'] = rf_proba

In [None]:
try:
    import xgboost as xgb
    print("XGBoost available! ✓")
    
    # use_label_encoder is omitted: it is deprecated in recent XGBoost releases
    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        random_state=RANDOM_STATE,
        n_jobs=-1,
        eval_metric='logloss'
    )
    xgb_model, xgb_metrics, xgb_pred, xgb_proba = evaluate_model(
        xgb_model, X_train, X_test, y_train, y_test, "XGBoost"
    )
    results.append(xgb_metrics)
    predictions['XGBoost'] = xgb_pred
    probabilities['XGBoost'] = xgb_proba
except ImportError:
    # Nothing is installed automatically; the model is skipped
    print("\n⚠️  XGBoost not installed. Skipping XGBoost.")
    print("Run: pip install xgboost")
    xgb_model = None

## Model Comparison

In [None]:
# Create comparison DataFrame
results_df = pd.DataFrame(results)
results_df = results_df.set_index('model')[
    ['test_accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'cv_f1_mean', 'cv_f1_std']
]
results_df.columns = ['Accuracy', 'Precision', 'Recall', 'F1', 'AUC-ROC', 'CV F1 (mean)', 'CV F1 (std)']
results_df = results_df.sort_values('F1', ascending=False)

print("="*70)
print("MODEL COMPARISON TABLE")
print("="*70)
print(results_df.round(4))

print("\n" + "="*70)
print("BEST MODEL")
print("="*70)
best_model = results_df.index[0]
best_f1 = results_df.loc[best_model, 'F1']
print(f"\nüèÜ  Best Model: {best_model}")
print(f"   F1-Score: {best_f1:.4f}")
print(f"   AUC-ROC:   {results_df.loc[best_model, 'AUC-ROC']:.4f}")

In [None]:
# Visual comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Model Performance Comparison', fontsize=16, fontweight='bold')

# Accuracy & F1
ax1 = axes[0, 0]
results_df[['Accuracy', 'F1']].plot(kind='bar', ax=ax1, rot=45)
ax1.set_title('Accuracy & F1-Score')
ax1.set_ylim(0.5, 1.0)
ax1.legend(loc='lower right')
ax1.grid(axis='y', alpha=0.3)

# Precision & Recall
ax2 = axes[0, 1]
results_df[['Precision', 'Recall']].plot(kind='bar', ax=ax2, rot=45)
ax2.set_title('Precision & Recall')
ax2.set_ylim(0.5, 1.0)
ax2.legend(loc='lower right')
ax2.grid(axis='y', alpha=0.3)

# AUC-ROC
ax3 = axes[1, 0]
results_df[['AUC-ROC']].sort_values('AUC-ROC').plot(kind='barh', ax=ax3, color='green')
ax3.set_title('AUC-ROC (Higher is Better)')
ax3.set_xlim(0.7, 1.0)
ax3.grid(axis='x', alpha=0.3)

# CV Stability
ax4 = axes[1, 1]
results_df['CV F1 (std)'].plot(kind='bar', ax=ax4, color='orange', rot=45)
ax4.set_title('CV Stability (Lower Std is Better)')
ax4.set_ylim(0, 0.02)
ax4.axhline(y=0.01, color='red', linestyle='--', label='0.01 threshold')
ax4.legend()
ax4.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## Best Model Analysis

In [None]:
# Get best model predictions
best_model_name = results_df.index[0]
plot_confusion_matrix(y_test, predictions[best_model_name], best_model_name)

# Print classification report
print(f"\n{'='*60}")
print(f"CLASSIFICATION REPORT: {best_model_name}")
print(f"{'='*60}")
print(classification_report(y_test, predictions[best_model_name], 
                          target_names=['Not Completed', 'Completed']))

In [None]:
# ROC Curve for best model
plot_roc_curve(y_test, probabilities[best_model_name], best_model_name)

## Feature Importance

In [None]:
# Feature importance from Random Forest (tree-based model)
if hasattr(rf_model, 'feature_importances_'):
    importances = rf_model.feature_importances_
    # Use actual column names from X_train (in case features were removed)
    feature_imp_df = pd.DataFrame({
        'feature': X_train.columns.tolist(),
        'importance': importances
    }).sort_values('importance', ascending=False)
    
    print("\n" + "="*60)
    print("TOP 20 FEATURE IMPORTANCE (Random Forest)")
    print("="*60)
    print(feature_imp_df.head(20).to_string(index=False))
    
    # Plot top 15 features
    plt.figure(figsize=(10, 8))
    top_15 = feature_imp_df.head(15)
    plt.barh(top_15['feature'], top_15['importance'])
    plt.xlabel('Feature Importance')
    plt.title('Top 15 Most Important Features')
    plt.gca().invert_yaxis()
    plt.tight_layout()
    plt.show()
else:
    print("\n⚠️  Random Forest model not available for feature importance")

In [None]:
# Logistic Regression coefficients (for interpretability)
if hasattr(lr_model, 'coef_'):
    # Use actual column names from X_train (in case features were removed)
    coef_df = pd.DataFrame({
        'feature': X_train.columns.tolist(),
        'coefficient': lr_model.coef_[0]
    })
    coef_df['abs_coef'] = coef_df['coefficient'].abs()
    coef_df = coef_df.sort_values('abs_coef', ascending=False)
    
    print("\n" + "="*60)
    print("TOP 15 LOGISTIC REGRESSION COEFFICIENTS")
    print("="*60)
    print("(Positive coef = increases completion probability)")
    print("(Negative coef = decreases completion probability)\n")
    print(coef_df.head(15)[['feature', 'coefficient']].to_string(index=False))
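
Logistic regression coefficients live on the log-odds scale, so exponentiating them gives an odds ratio that is often easier to read. The sketch below uses hypothetical coefficients (`demo_coefs` and its feature names are placeholders, not values from `lr_model`). Because the features were loaded from `X_train_scaled.pkl`, a "one-unit increase" refers to the scaled feature, which would be one standard deviation if a standard scaler was used; that is an assumption about the preprocessing step.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration; real values come from lr_model.coef_[0]
demo_coefs = pd.DataFrame({
    "feature": ["feature_a", "feature_b", "feature_c"],
    "coefficient": [0.50, -0.25, 0.00],
})

# exp(coefficient) is the multiplicative change in the odds of completion
# for a one-unit increase in the (scaled) feature, other features held fixed
demo_coefs["odds_ratio"] = np.exp(demo_coefs["coefficient"])

print(demo_coefs.round(3).to_string(index=False))
```

An odds ratio above 1 raises the odds of completion, a value below 1 lowers them, and exactly 1 means no effect.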

## Hyperparameter Tuning

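Tune the Random Forest with an exhaustive grid search to see whether performance improves further. Note that the tuning cell always tunes the Random Forest, whichever model ranked best in the comparison.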
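
An exhaustive grid gets expensive quickly. The sketch below counts the fits implied by the grid used in this section and shows `RandomizedSearchCV` as one cheaper alternative that samples a fixed number of candidates. The randomized search is a swapped-in technique, not what this notebook runs, and the synthetic data (`X_demo`, `y_demo`) plus the `n_iter=5` and `cv=3` settings are assumptions chosen so the example finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Same grid as the tuning cell in this section
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [20, 50, 100],
    "min_samples_leaf": [1, 5, 10],
}

# Cost of the exhaustive search: every combination is fitted once per fold
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * 5  # 5-fold cross-validation
print(f"{n_combinations} combinations x 5 folds = {n_fits} fits")

# Small synthetic dataset so the sketch runs in seconds
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=42
)

# Randomized search: evaluate only n_iter sampled combinations
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(f"Sampled candidates: {len(search.cv_results_['params'])}")
print(f"Best sampled parameters: {search.best_params_}")
```

The trade-off is coverage: a randomized search may miss the best cell of the grid, but its cost is set by `n_iter` rather than by the product of all value lists.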

In [None]:
print("="*60)
print("HYPERPARAMETER TUNING: RANDOM FOREST")
print("="*60)

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [20, 50, 100],
    'min_samples_leaf': [1, 5, 10]
}

print(f"\nParameter grid: {len(param_grid['n_estimators']) * len(param_grid['max_depth']) * len(param_grid['min_samples_split']) * len(param_grid['min_samples_leaf'])} combinations")

# Grid search
rf_tuned = RandomForestClassifier(
    random_state=RANDOM_STATE,
    n_jobs=-1,
    class_weight='balanced'
)

grid_search = GridSearchCV(
    rf_tuned,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

print("\nStarting GridSearchCV... (this may take several minutes)")
grid_search.fit(X_train, y_train)

print(f"\n{'='*60}")
print("TUNING COMPLETE")
print(f"{'='*60}")
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV F1-Score: {grid_search.best_score_:.4f}")

# Evaluate tuned model
rf_tuned = grid_search.best_estimator_
rf_tuned_pred = rf_tuned.predict(X_test)
rf_tuned_proba = rf_tuned.predict_proba(X_test)[:, 1]

tuned_f1 = f1_score(y_test, rf_tuned_pred)
tuned_auc = roc_auc_score(y_test, rf_tuned_proba)

print(f"\nTuned Model Test Performance:")
print(f"  F1-Score: {tuned_f1:.4f} (baseline: {results_df.loc['Random Forest', 'F1']:.4f})")
print(f"  AUC-ROC:   {tuned_auc:.4f} (baseline: {results_df.loc['Random Forest', 'AUC-ROC']:.4f})")

improvement = (tuned_f1 - results_df.loc['Random Forest', 'F1']) / results_df.loc['Random Forest', 'F1'] * 100
print(f"\nImprovement: {improvement:+.2f}%")

## Save Models

Save all trained models for future use.
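
For reference, a saved model can be restored with `joblib.load` and used for prediction right away. The sketch below round-trips a small model through a temporary directory to show the pattern; the synthetic data and the file name `demo_model.pkl` are hypothetical and unrelated to the files this notebook writes.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic model standing in for one of the trained models
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(model, path)      # persist the fitted estimator
    reloaded = joblib.load(path)  # restore it from disk

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(
    model.predict(X_demo), reloaded.predict(X_demo)
)
print(f"Predictions identical after reload: {same_predictions}")
```

When reusing the real models, new data needs the same columns in the same order as `X_train` and the same scaling (the `scaler.pkl` loaded earlier), and pickles are best loaded with the same scikit-learn version that created them.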

In [None]:
# Save all models
models_dir = '../Cafe_Rewards_Offers/models'
os.makedirs(models_dir, exist_ok=True)

models_to_save = {
    'logistic_regression.pkl': lr_model,
    'decision_tree.pkl': dt_model,
    'random_forest.pkl': rf_model,
    'random_forest_tuned.pkl': rf_tuned
}

if xgb_model is not None:
    models_to_save['xgboost.pkl'] = xgb_model

for filename, model in models_to_save.items():
    joblib.dump(model, f'{models_dir}/{filename}')
    print(f"✓ Saved: {filename}")

# Save results
results_df.to_csv(f'{models_dir}/model_comparison.csv')
print(f"\n✓ Saved: model_comparison.csv")

print(f"\n{'='*60}")
print("ALL MODELS SAVED")
print(f"{'='*60}")

## Modeling Summary

**Completed Steps:**
1. ✓ Loaded processed data
2. ✓ Checked for data leakage
3. ✓ Trained 4 baseline/ensemble models
4. ✓ Evaluated using multiple metrics (Accuracy, Precision, Recall, F1, AUC-ROC)
5. ✓ Compared model performance
6. ✓ Identified best performing model
7. ✓ Analyzed feature importance
8. ✓ Performed hyperparameter tuning
9. ✓ Saved all models

**Next Steps:**
1. PCA for dimensionality reduction
2. SHAP analysis for model explainability
3. Bias & fairness analysis
4. Create presentations (technical & business)
5. Upload to GitHub

In [None]:
print("\n" + "="*60)
print("MODELING COMPLETE! ✓")
print("="*60)
print(f"\nBest Model: {best_model}")
print(f"Best F1-Score: {results_df.loc[best_model, 'F1']:.4f}")
print(f"\nAll models saved to: {models_dir}/")