# End-to-End Risk Model Pipeline
## Complete Implementation with All Features

This notebook demonstrates the complete risk model pipeline including:
- Data loading and preprocessing
- Feature engineering and selection
- WOE transformation
- Model training and evaluation
- PSI monitoring
- Calibration analysis
- Risk band optimization
- SHAP analysis
- Comprehensive reporting

## 1. Setup and Imports

import sys
import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Install/update the risk-pipeline package from GitHub.
# %pip installs into the environment of the running kernel.
print("Installing/updating risk-pipeline from GitHub...")
%pip uninstall -y -q risk-pipeline risk-model-pipeline
%pip install -q --upgrade --force-reinstall git+https://github.com/selimoksuz/risk-model-pipeline.git@development
print("Install command finished; the import check below confirms whether it succeeded")

# Verify installation
try:
    import risk_pipeline
    print(f"✅ risk_pipeline version: {getattr(risk_pipeline, '__version__', 'Unknown')}")
except ImportError as e:
    print(f"❌ Error importing risk_pipeline: {e}")
    raise  # the imports below cannot work without the package

# Import pipeline components
from risk_pipeline.core.config import Config
from risk_pipeline.core.data_processor import DataProcessor
from risk_pipeline.core.splitter import DataSplitter
from risk_pipeline.core.feature_engineer import FeatureEngineer
from risk_pipeline.core.feature_selector import FeatureSelector
from risk_pipeline.core.woe_transformer import WOETransformer
from risk_pipeline.core.model_builder import ModelBuilder
from risk_pipeline.core.psi_calculator import PSICalculator
from risk_pipeline.core.calibration_analyzer import CalibrationAnalyzer
from risk_pipeline.core.risk_band_optimizer import RiskBandOptimizer
from risk_pipeline.core.reporter import Reporter
from risk_pipeline.utils.metrics import calculate_metrics
from risk_pipeline.utils.visualization import VisualizationHelper
from risk_pipeline.utils.error_handler import ErrorHandler

# Complete pipeline imports
from risk_pipeline.complete_pipeline import CompletePipeline
from risk_pipeline.advanced_pipeline import AdvancedRiskPipeline

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✅ All modules imported successfully")
print(f"Pipeline initialized at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Configuration

In [None]:
# Create configuration
config = Config(
    target_column='target',
    test_size=0.2,
    validation_size=0.1,
    random_state=42,
    cv_folds=5,
    
    # Feature engineering
    create_polynomial=True,
    create_interactions=True,
    max_poly_degree=2,
    
    # Feature selection
    selection_method='all',  # Use all methods
    variance_threshold=0.01,
    correlation_threshold=0.95,
    top_k_features=50,
    
    # WOE parameters
    max_bins=5,
    min_samples_leaf=0.05,
    
    # Model parameters
    scoring_metric='roc_auc',
    n_jobs=-1,
    
    # Advanced features
    calculate_shap=True,
    monitor_psi=True,
    optimize_risk_bands=True,
    perform_calibration=True,
    
    # Output
    output_folder='outputs/end_to_end_pipeline',
    save_plots=True,
    verbose=True
)

print("Configuration set:")
print(f"  - Target column: {config.target_column}")
print(f"  - Test size: {config.test_size}")
print(f"  - CV folds: {config.cv_folds}")
print(f"  - Output folder: {config.output_folder}")

## 3. Data Loading and Initial Exploration

In [None]:
# Load data
data_path = '../data/processed/model_data.csv'
df = pd.read_csv(data_path)

print(f"Data loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")
print(f"\nTarget distribution:")
print(df['target'].value_counts())
print(f"\nTarget rate: {df['target'].mean():.2%}")

# Basic info
print("\nData types:")
print(df.dtypes.value_counts())

# Missing values
missing = df.isnull().sum()
if missing.sum() > 0:
    print("\nMissing values:")
    print(missing[missing > 0].sort_values(ascending=False))

## 4. Data Preprocessing

In [None]:
# Initialize processor
processor = DataProcessor(config)

# Validate and freeze data
df_processed = processor.validate_and_freeze(df)

# Identify variable types
numeric_vars = processor.get_numeric_columns(df_processed)
categorical_vars = processor.get_categorical_columns(df_processed)

print(f"Numeric variables: {len(numeric_vars)}")
print(f"Categorical variables: {len(categorical_vars)}")

# Handle missing values
from sklearn.impute import SimpleImputer

if numeric_vars:
    imputer = SimpleImputer(strategy='median')
    df_processed[numeric_vars] = imputer.fit_transform(df_processed[numeric_vars])

if categorical_vars:
    df_processed[categorical_vars] = df_processed[categorical_vars].fillna('missing')

print("\nMissing values handled successfully")

## 5. Train/Test/OOT Split

In [None]:
# Split data
splitter = DataSplitter(config)

# Check if we have a time column for OOT split
if 'date' in df_processed.columns or 'created_at' in df_processed.columns:
    time_col = 'date' if 'date' in df_processed.columns else 'created_at'
    train, test, oot = splitter.split_with_oot(df_processed, time_column=time_col)
else:
    # Random split with validation as OOT
    train, test, oot = splitter.split_train_test_validation(df_processed)

print(f"Train set: {train.shape[0]:,} rows ({train.shape[0]/df_processed.shape[0]:.1%})")
print(f"Test set: {test.shape[0]:,} rows ({test.shape[0]/df_processed.shape[0]:.1%})")
print(f"OOT set: {oot.shape[0]:,} rows ({oot.shape[0]/df_processed.shape[0]:.1%})")

print("\nTarget rates:")
print(f"  Train: {train['target'].mean():.2%}")
print(f"  Test: {test['target'].mean():.2%}")
print(f"  OOT: {oot['target'].mean():.2%}")
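
For readers without access to `DataSplitter`, a time-based out-of-time split can be sketched in plain pandas. The helper name `split_out_of_time`, its `oot_fraction` parameter, and the toy columns are hypothetical and not part of `risk_pipeline`; the package's own `split_with_oot` may behave differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_out_of_time(df, time_column, target_column, oot_fraction=0.2,
                      test_size=0.2, random_state=42):
    """Hold out the most recent rows as OOT, then split the rest at random."""
    ordered = df.sort_values(time_column).reset_index(drop=True)
    n_oot = int(round(len(ordered) * oot_fraction))
    cutoff = len(ordered) - n_oot
    in_time, oot = ordered.iloc[:cutoff], ordered.iloc[cutoff:]
    # Stratify so train and test keep a similar target rate
    train, test = train_test_split(
        in_time, test_size=test_size, random_state=random_state,
        stratify=in_time[target_column],
    )
    return train, test, oot


# Toy data (illustrative only): 1,000 daily records, roughly 10% target rate
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=1000, freq="D"),
    "x": rng.normal(size=1000),
    "target": (rng.random(1000) < 0.1).astype(int),
})
toy_train, toy_test, toy_oot = split_out_of_time(toy, "date", "target")
print(len(toy_train), len(toy_test), len(toy_oot))
```

Taking the OOT sample from the latest period means every OOT row is more recent than any training row, which is what makes it a check on stability over time rather than a second random test set.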

## 6. Feature Engineering

In [None]:
# Initialize feature engineer
engineer = FeatureEngineer(config)

# Create features
print("Creating engineered features...")
train_eng = engineer.create_features(train)
test_eng = engineer.transform(test)
oot_eng = engineer.transform(oot)

print(f"\nFeatures after engineering:")
print(f"  Original: {train.shape[1]} features")
print(f"  After engineering: {train_eng.shape[1]} features")
print(f"  New features created: {train_eng.shape[1] - train.shape[1]}")

# Show some engineered features
new_features = [col for col in train_eng.columns if col not in train.columns]
if new_features:
    print(f"\nSample of new features: {new_features[:5]}")

## 7. Feature Selection

In [None]:
# Initialize selector
selector = FeatureSelector(config)

# Select features
selected_features = selector.select_features(
    train_eng.drop(columns=['target']),
    train_eng['target']
)

print(f"Selected {len(selected_features)} features from {train_eng.shape[1]-1} candidates")

# Get feature importance
if hasattr(selector, 'feature_importance_'):
    importance_df = pd.DataFrame({
        'feature': selected_features,
        'importance': selector.feature_importance_[:len(selected_features)]
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 important features:")
    print(importance_df.head(10))

# Apply selection
train_selected = train_eng[selected_features + ['target']]
test_selected = test_eng[selected_features + ['target']]
oot_selected = oot_eng[selected_features + ['target']]
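
The `variance_threshold` and `correlation_threshold` settings in the configuration suggest two common filters. The sketch below shows one plausible reading of them in plain pandas; the function name `drop_low_variance_and_correlated` is hypothetical, and `FeatureSelector` may combine its methods differently.

```python
import numpy as np
import pandas as pd


def drop_low_variance_and_correlated(X, variance_threshold=0.01,
                                     correlation_threshold=0.95):
    """Drop near-constant columns, then one column of each highly correlated pair."""
    kept = X.loc[:, X.var() > variance_threshold]
    corr = kept.corr().abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns
               if (upper[col] > correlation_threshold).any()]
    return kept.drop(columns=to_drop)


# Toy data (illustrative only): b nearly duplicates a, d is constant
rng = np.random.default_rng(5)
a = rng.normal(size=500)
toy_X = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=500),
    "c": rng.normal(size=500),
    "d": np.ones(500),
})
filtered = drop_low_variance_and_correlated(toy_X)
print(list(filtered.columns))
```

Within a correlated pair this sketch keeps whichever column appears first; a production selector would more likely keep the one with the stronger relationship to the target.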

## 8. WOE Transformation

In [None]:
# Initialize WOE transformer
woe_transformer = WOETransformer(config)

# Fit and transform
train_woe = woe_transformer.fit_transform(
    train_selected.drop(columns=['target']),
    train_selected['target']
)
test_woe = woe_transformer.transform(test_selected.drop(columns=['target']))
oot_woe = woe_transformer.transform(oot_selected.drop(columns=['target']))

# Add target back
train_woe['target'] = train_selected['target'].values
test_woe['target'] = test_selected['target'].values
oot_woe['target'] = oot_selected['target'].values

print("WOE transformation completed")
print(f"  Train shape: {train_woe.shape}")
print(f"  Test shape: {test_woe.shape}")
print(f"  OOT shape: {oot_woe.shape}")

# Show WOE mapping for a sample variable
if woe_transformer.woe_mapping_:
    sample_var = list(woe_transformer.woe_mapping_.keys())[0]
    print(f"\nWOE mapping for '{sample_var}':")
    print(woe_transformer.woe_mapping_[sample_var])
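
To make the WOE mapping concrete, here is a minimal sketch of the calculation for one numeric variable. It assumes the convention WOE = ln(share of goods / share of bads) per bin, quantile binning, and a small smoothing constant; `woe_table` and its parameters are hypothetical, and `WOETransformer` may use a different binning method or sign convention.

```python
import numpy as np
import pandas as pd


def woe_table(values, target, n_bins=5, eps=0.5):
    """Quantile-bin a numeric variable and compute WOE and IV per bin."""
    bins = pd.qcut(values, q=n_bins, duplicates="drop")
    grouped = (pd.DataFrame({"bin": bins, "target": target})
               .groupby("bin", observed=True)["target"])
    table = pd.DataFrame({"bads": grouped.sum(), "total": grouped.count()})
    table["goods"] = table["total"] - table["bads"]
    # eps smooths empty cells so the logarithm stays finite
    dist_good = (table["goods"] + eps) / (table["goods"].sum() + eps * len(table))
    dist_bad = (table["bads"] + eps) / (table["bads"].sum() + eps * len(table))
    table["woe"] = np.log(dist_good / dist_bad)
    table["iv"] = (dist_good - dist_bad) * table["woe"]
    return table


# Toy data (illustrative only): higher x means higher default probability
rng = np.random.default_rng(1)
toy_x = pd.Series(rng.normal(size=2000))
toy_y = pd.Series((rng.random(2000) < 1 / (1 + np.exp(-(toy_x - 2)))).astype(int))
toy_woe = woe_table(toy_x, toy_y)
print(toy_woe[["bads", "goods", "woe"]].round(3))
print(f"Information value: {toy_woe['iv'].sum():.3f}")
```

Under this convention, bins with a higher share of bads get a lower WOE, so the WOE values fall as `toy_x` rises.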

## 9. Model Training

In [None]:
# Initialize model builder
model_builder = ModelBuilder(config)

# Train models
X_train = train_woe.drop(columns=['target'])
y_train = train_woe['target']
X_test = test_woe.drop(columns=['target'])
y_test = test_woe['target']

best_model, best_score, all_models = model_builder.build_models(X_train, y_train, X_test, y_test)

print(f"\nBest model: {model_builder.best_model_name_}")
print(f"Best CV score: {best_score:.4f}")

# Show all model scores
print("\nAll model scores:")
for name, score in model_builder.cv_scores_.items():
    print(f"  {name}: {score:.4f}")

## 10. Model Evaluation

In [None]:
# Predictions
train_pred = best_model.predict_proba(X_train)[:, 1]
test_pred = best_model.predict_proba(X_test)[:, 1]
oot_pred = best_model.predict_proba(oot_woe.drop(columns=['target']))[:, 1]

# Calculate metrics
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

metrics = {
    'Train': {
        'AUC': roc_auc_score(y_train, train_pred),
        'AP': average_precision_score(y_train, train_pred),
        'Brier': brier_score_loss(y_train, train_pred)
    },
    'Test': {
        'AUC': roc_auc_score(y_test, test_pred),
        'AP': average_precision_score(y_test, test_pred),
        'Brier': brier_score_loss(y_test, test_pred)
    },
    'OOT': {
        'AUC': roc_auc_score(oot_woe['target'], oot_pred),
        'AP': average_precision_score(oot_woe['target'], oot_pred),
        'Brier': brier_score_loss(oot_woe['target'], oot_pred)
    }
}

# Display metrics
metrics_df = pd.DataFrame(metrics).T
print("Model Performance:")
print(metrics_df.round(4))

# Check for overfitting
overfit_score = metrics_df.loc['Train', 'AUC'] - metrics_df.loc['Test', 'AUC']
print(f"\nOverfitting check (Train-Test AUC): {overfit_score:.4f}")
if overfit_score > 0.05:
    print("  ⚠️ Warning: Potential overfitting detected")
else:
    print("  ✅ No significant overfitting")

## 11. PSI Monitoring

In [None]:
# Calculate PSI
psi_calculator = PSICalculator()

# Feature PSI
feature_psi = {}
for col in X_train.columns:
    psi_test = psi_calculator.calculate(X_train[col], X_test[col])
    psi_oot = psi_calculator.calculate(X_train[col], oot_woe[col])
    feature_psi[col] = {'test': psi_test, 'oot': psi_oot}

# Score PSI
score_psi_test = psi_calculator.calculate(train_pred, test_pred)
score_psi_oot = psi_calculator.calculate(train_pred, oot_pred)

print("PSI Analysis:")
print(f"\nScore PSI:")
print(f"  Test: {score_psi_test:.4f}")
print(f"  OOT: {score_psi_oot:.4f}")

# Check stability
if score_psi_oot < 0.1:
    print("  ✅ Model is stable")
elif score_psi_oot < 0.25:
    print("  ⚠️ Minor shift detected")
else:
    print("  ❌ Significant shift detected")

# Top shifting features
psi_df = pd.DataFrame(feature_psi).T
unstable_features = psi_df[psi_df['oot'] > 0.25]
if not unstable_features.empty:
    print(f"\n⚠️ Unstable features (PSI > 0.25):")
    print(unstable_features.sort_values('oot', ascending=False).head())
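
The 0.1 and 0.25 thresholds above are applied to the Population Stability Index, PSI = Σ (actual% − expected%) · ln(actual% / expected%) over bins. The sketch below is one common way to compute it, with bin edges taken from quantiles of the reference sample; the function name and the `n_bins` and `eps` parameters are assumptions, and `PSICalculator` may bin differently.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a comparison sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the reference (expected) distribution
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    # Interior edges only, so values outside the reference range land in the end bins
    exp_idx = np.searchsorted(edges[1:-1], expected, side="right")
    act_idx = np.searchsorted(edges[1:-1], actual, side="right")
    n = len(edges) - 1
    exp_pct = np.bincount(exp_idx, minlength=n) / len(expected)
    act_pct = np.bincount(act_idx, minlength=n) / len(actual)
    # Clip to avoid log(0) when a bin is empty
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))


# Toy data (illustrative only)
rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)
shifted = rng.normal(0.5, 1.0, 10_000)
psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI, same distribution: {psi_same:.4f}")
print(f"PSI, shifted mean:      {psi_shifted:.4f}")
```

Every term of the sum is non-negative, so PSI is zero only when the binned distributions match exactly and grows with the size of the shift.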

## 12. Calibration Analysis

In [None]:
# Calibration analysis
calibration_analyzer = CalibrationAnalyzer()

# Analyze calibration
calibration_results = {
    'test': calibration_analyzer.analyze_calibration(y_test, test_pred),
    'oot': calibration_analyzer.analyze_calibration(oot_woe['target'], oot_pred)
}

print("Calibration Analysis:")
for dataset, results in calibration_results.items():
    print(f"\n{dataset.upper()}:")
    print(f"  ECE: {results['ece']:.4f}")
    print(f"  MCE: {results['mce']:.4f}")
    print(f"  Brier Score: {results['brier_score']:.4f}")

# Calibration plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for idx, (dataset, results) in enumerate(calibration_results.items()):
    ax = axes[idx]
    bins = results['bins']
    
    ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    ax.scatter(bins['mean_predicted'], bins['mean_actual'], s=100, alpha=0.7)
    ax.plot(bins['mean_predicted'], bins['mean_actual'], 'b-', alpha=0.5)
    
    ax.set_xlabel('Mean Predicted Probability')
    ax.set_ylabel('Fraction of Positives')
    ax.set_title(f'Calibration Plot - {dataset.upper()}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

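# Apply calibration if needed
if calibration_results['oot']['ece'] > 0.05:
    print("\n📊 Applying isotonic calibration...")
    # Calibrator is fitted on test labels/scores and applied to the OOT scores
    # (assumed from the argument order)
    oot_pred_calibrated = calibration_analyzer.calibrate_predictions(
        y_test, test_pred, oot_pred, method='isotonic'
    )
    # Isotonic calibration is monotone, so AUC should move little; Brier shows the calibration effect
    print(f"Calibrated OOT AUC: {roc_auc_score(oot_woe['target'], oot_pred_calibrated):.4f}")
    print(f"Calibrated OOT Brier: {brier_score_loss(oot_woe['target'], oot_pred_calibrated):.4f}")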
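
The ECE and MCE figures reported above can be read as follows: predictions are grouped into probability bins, and each bin's mean prediction is compared with its observed positive rate. ECE is the sample-weighted average of those gaps and MCE is the largest gap. The sketch below assumes equal-width bins; `calibration_errors` is a hypothetical helper, and `CalibrationAnalyzer` may bin differently.

```python
import numpy as np


def calibration_errors(y_true, y_prob, n_bins=10):
    """Return (ECE, MCE) using equal-width probability bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1
    idx = np.searchsorted(edges[1:-1], y_prob, side="right")
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight by the share of samples in the bin
        mce = max(mce, gap)
    return ece, mce


# Toy data (illustrative only): outcomes drawn with probability p,
# so p is well calibrated and p ** 2 systematically under-predicts
rng = np.random.default_rng(3)
p = rng.random(20_000)
outcomes = (rng.random(20_000) < p).astype(int)
ece_good, mce_good = calibration_errors(outcomes, p)
ece_bad, mce_bad = calibration_errors(outcomes, p ** 2)
print(f"Calibrated scores: ECE={ece_good:.4f}, MCE={mce_good:.4f}")
print(f"Distorted scores:  ECE={ece_bad:.4f}, MCE={mce_bad:.4f}")
```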

## 13. Risk Band Optimization

In [None]:
# Optimize risk bands
risk_band_optimizer = RiskBandOptimizer()

# Find optimal bands
risk_bands = risk_band_optimizer.optimize_bands(
    y_true=oot_woe['target'],
    y_scores=oot_pred,
    n_bands=5,
    method='quantile'
)

print("Optimized Risk Bands:")
print(risk_bands[['band', 'min_score', 'max_score', 'bad_rate', 'volume_pct']].round(4))

# Visualize risk bands
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bad rate by band
axes[0].bar(risk_bands['band'], risk_bands['bad_rate'], color='coral')
axes[0].set_xlabel('Risk Band')
axes[0].set_ylabel('Bad Rate')
axes[0].set_title('Bad Rate by Risk Band')
axes[0].grid(True, alpha=0.3)

# Volume distribution
axes[1].bar(risk_bands['band'], risk_bands['volume_pct'], color='skyblue')
axes[1].set_xlabel('Risk Band')
axes[1].set_ylabel('Volume %')
axes[1].set_title('Volume Distribution by Risk Band')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

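# Calculate Gini from bands: weight each band's bad rate by its volume share
volume_share = risk_bands['volume_pct'] / risk_bands['volume_pct'].sum()
bad_share = risk_bands['bad_rate'] * volume_share
bad_share = bad_share / bad_share.sum()
cum_volume = np.concatenate([[0.0], volume_share.cumsum().to_numpy()])
cum_bads = np.concatenate([[0.0], bad_share.cumsum().to_numpy()])
# Trapezoidal area under the cumulative-bads curve; abs() makes the result independent of band order
area = np.sum(np.diff(cum_volume) * (cum_bads[1:] + cum_bads[:-1]) / 2)
gini_from_bands = abs(2 * area - 1)
print(f"\nGini coefficient from bands: {gini_from_bands:.4f}")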
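
For reference, `method='quantile'` banding can be approximated in plain pandas as below. The output columns mirror the ones printed above (`band`, `min_score`, `max_score`, `bad_rate`, `volume_pct`), but `quantile_risk_bands` itself is a hypothetical helper and not the `RiskBandOptimizer` implementation.

```python
import numpy as np
import pandas as pd


def quantile_risk_bands(y_true, y_scores, n_bands=5):
    """Cut scores into equal-volume bands and summarise each band."""
    frame = pd.DataFrame({"target": np.asarray(y_true),
                          "score": np.asarray(y_scores)})
    # Rank first so tied scores cannot collapse two quantile edges into one
    ranks = frame["score"].rank(method="first")
    frame["band"] = pd.qcut(ranks, q=n_bands, labels=False) + 1
    summary = frame.groupby("band").agg(
        min_score=("score", "min"),
        max_score=("score", "max"),
        bad_rate=("target", "mean"),
        count=("target", "size"),
    ).reset_index()
    summary["volume_pct"] = summary["count"] / len(frame)
    return summary


# Toy data (illustrative only): the score is the true default probability
rng = np.random.default_rng(4)
toy_scores = rng.random(5000)
toy_target = (rng.random(5000) < toy_scores).astype(int)
toy_bands = quantile_risk_bands(toy_target, toy_scores)
print(toy_bands.round(4))
```

Quantile bands have equal volume by construction. A useful sanity check is that the bad rate rises from band to band; if it does not, the bands are not separating risk.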

## 14. SHAP Analysis

In [None]:
# SHAP analysis for feature importance
try:
    import shap
    
    # Create explainer
    explainer = shap.Explainer(best_model, X_train)
    
    # Calculate SHAP values for a test subset (for speed)
    X_shap = X_test.iloc[:1000]
    shap_values = explainer(X_shap)
    
    # Some classifiers return one SHAP array per class; keep the positive class
    shap_array = shap_values.values
    if shap_array.ndim == 3:
        shap_array = shap_array[:, :, 1]
    
    # Summary plot
    plt.figure(figsize=(10, 6))
    shap.summary_plot(shap_array, X_shap, show=False)
    plt.title('SHAP Feature Importance')
    plt.tight_layout()
    plt.show()
    
    # Get feature importance (mean absolute SHAP value per feature)
    shap_importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': np.abs(shap_array).mean(axis=0)
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 Features by SHAP:")
    print(shap_importance.head(10))
    
except Exception as e:
    print(f"SHAP analysis not available: {e}")
    print("Using permutation importance instead...")
    
    from sklearn.inspection import permutation_importance
    
    perm_importance = permutation_importance(
        best_model, X_test, y_test, n_repeats=10, random_state=42
    )
    
    importance_df = pd.DataFrame({
        'feature': X_train.columns,
        'importance': perm_importance.importances_mean
    }).sort_values('importance', ascending=False)
    
    print("\nTop 10 Features by Permutation Importance:")
    print(importance_df.head(10))

## 15. Complete Pipeline Run (Alternative)

In [None]:
# Alternative: Run everything with CompletePipeline class
print("Running Complete Pipeline Class...\n")

# Initialize pipeline
complete_pipeline = CompletePipeline(config)

# Run pipeline
results = complete_pipeline.run(
    df=df,
    test_df=None,  # Will be split automatically
    oot_df=None    # Will be split automatically
)

print("\nPipeline Results:")
print(f"  Best Model: {results['best_model_name']}")
print(f"  Best Score: {results['best_score']:.4f}")
print(f"  Selected Features: {len(results['selected_features'])}")
print(f"  Reports saved to: {config.output_folder}")

# Advanced pipeline with custom models
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Custom models
custom_models = {
    'rf_custom': RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        min_samples_split=50,
        random_state=42
    ),
    'xgb_custom': XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.01,
        random_state=42
    ),
    'lgbm_custom': LGBMClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.01,
        random_state=42,
        verbose=-1
    )
}

# Initialize advanced pipeline
advanced_pipeline = AdvancedRiskPipeline(config)

# Set custom models
advanced_pipeline.model_builder.models.update(custom_models)

# Run advanced pipeline
advanced_results = advanced_pipeline.run(
    df=df,
    external_validation_df=None
)

print("\nAdvanced Pipeline Results:")
print(f"  Best Model: {advanced_results['best_model_name']}")
print(f"  Best Score: {advanced_results['best_score']:.4f}")
print(f"  Model Comparison available: {advanced_results['model_comparison'] is not None}")
print(f"  Monitoring metrics calculated: {bool(advanced_results['monitoring_metrics'])}")

## 17. Generate Comprehensive Reports

In [None]:
# Generate all reports
reporter = Reporter(config)

# Create comprehensive report
report_data = {
    'model_performance': metrics_df,
    'feature_importance': shap_importance if 'shap_importance' in locals() else importance_df,
    'risk_bands': risk_bands,
    'psi_analysis': pd.DataFrame(feature_psi).T,
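    'calibration_metrics': pd.DataFrame({
        name: {k: v for k, v in res.items() if k != 'bins'}  # keep scalar metrics only
        for name, res in calibration_results.items()
    }).T,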
    'model_comparison': pd.DataFrame(model_builder.cv_scores_, index=['CV Score']).T
}

# Save Excel report
report_path = reporter.save_excel_report(
    report_data,
    filename='end_to_end_pipeline_report.xlsx'
)

print(f"\n📊 Comprehensive report saved to: {report_path}")

# Generate model documentation
model_docs = {
    'Model Type': model_builder.best_model_name_,
    'Training Date': datetime.now().strftime('%Y-%m-%d'),
    'Training Samples': len(X_train),
    'Features Used': len(selected_features),
    'Cross-Validation Score': f"{best_score:.4f}",
    'Test AUC': f"{metrics_df.loc['Test', 'AUC']:.4f}",
    'OOT AUC': f"{metrics_df.loc['OOT', 'AUC']:.4f}",
    'PSI (OOT)': f"{score_psi_oot:.4f}",
    'ECE (OOT)': f"{calibration_results['oot']['ece']:.4f}",
    'Number of Risk Bands': len(risk_bands)
}

print("\n📋 Model Documentation:")
for key, value in model_docs.items():
    print(f"  {key}: {value}")

## 18. Save Models and Artifacts

In [None]:
import joblib
import json

# Create output directory (same folder the pipeline was configured with)
output_dir = config.output_folder
os.makedirs(output_dir, exist_ok=True)

# Save best model
model_path = os.path.join(output_dir, 'best_model.pkl')
joblib.dump(best_model, model_path)
print(f"✅ Model saved to: {model_path}")

# Save transformers
transformers = {
    'feature_engineer': engineer,
    'feature_selector': selector,
    'woe_transformer': woe_transformer,
    'imputer': imputer if 'imputer' in locals() else None
}

for name, transformer in transformers.items():
    if transformer is not None:
        transformer_path = os.path.join(output_dir, f'{name}.pkl')
        joblib.dump(transformer, transformer_path)
        print(f"✅ {name} saved to: {transformer_path}")

# Save configuration
config_dict = {
    'target_column': config.target_column,
    'selected_features': selected_features,
    'model_name': model_builder.best_model_name_,
    'model_score': float(best_score),
    'risk_bands': risk_bands.to_dict('records'),
    'psi_thresholds': {'warning': 0.1, 'critical': 0.25},
    'training_date': datetime.now().isoformat()
}

config_path = os.path.join(output_dir, 'pipeline_config.json')
with open(config_path, 'w') as f:
    json.dump(config_dict, f, indent=2)
print(f"✅ Configuration saved to: {config_path}")

print("\n🎉 End-to-End Pipeline Complete!")
print(f"All artifacts saved to: {output_dir}")

## 19. Model Deployment Readiness Check

In [None]:
# Deployment readiness checklist
print("🚀 DEPLOYMENT READINESS CHECK")
print("="*50)

checklist = {
    'Model Performance': {
        'Test AUC > 0.7': metrics_df.loc['Test', 'AUC'] > 0.7,
        'OOT AUC > 0.7': metrics_df.loc['OOT', 'AUC'] > 0.7,
        'Train-Test AUC gap < 0.05': overfit_score < 0.05
    },
    'Stability': {
        'Score PSI < 0.25': score_psi_oot < 0.25,
        'No unstable features': unstable_features.empty if 'unstable_features' in locals() else True
    },
    'Calibration': {
        'ECE < 0.1': calibration_results['oot']['ece'] < 0.1,
        'Brier Score < 0.25': calibration_results['oot']['brier_score'] < 0.25
    },
    'Documentation': {
        'Model saved': os.path.exists(model_path),
        'Config saved': os.path.exists(config_path),
        'Report generated': os.path.exists(report_path) if 'report_path' in locals() else False
    }
}

all_passed = True
for category, checks in checklist.items():
    print(f"\n{category}:")
    for check, passed in checks.items():
        status = "✅" if passed else "❌"
        print(f"  {status} {check}")
        if not passed:
            all_passed = False

print("\n" + "="*50)
if all_passed:
    print("✅ MODEL IS READY FOR DEPLOYMENT")
else:
    print("⚠️ Some checks failed. Review before deployment.")

## 20. Quick Scoring Function

In [None]:
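def score_new_data(new_df, model_artifacts_dir='outputs/end_to_end_pipeline'):
    """
    Score new data using saved pipeline artifacts.

    new_df must already have missing values handled the same way as in
    training (section 4); the saved imputer is not applied here.
    """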
    # Load artifacts
    model = joblib.load(os.path.join(model_artifacts_dir, 'best_model.pkl'))
    feature_engineer = joblib.load(os.path.join(model_artifacts_dir, 'feature_engineer.pkl'))
    feature_selector = joblib.load(os.path.join(model_artifacts_dir, 'feature_selector.pkl'))
    woe_transformer = joblib.load(os.path.join(model_artifacts_dir, 'woe_transformer.pkl'))
    
    # Load config
    with open(os.path.join(model_artifacts_dir, 'pipeline_config.json'), 'r') as f:
        config = json.load(f)
    
    # Process new data
    df_eng = feature_engineer.transform(new_df)
    df_selected = df_eng[config['selected_features']]
    df_woe = woe_transformer.transform(df_selected)
    
    # Score
    scores = model.predict_proba(df_woe)[:, 1]
    
    # Add risk bands
    risk_bands = pd.DataFrame(config['risk_bands'])
    
    def assign_band(score):
        for _, band in risk_bands.iterrows():
            if band['min_score'] <= score <= band['max_score']:
                return band['band']
        return 'Unknown'
    
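    # Create results, aligned to the input index
    results = pd.DataFrame({
        'score': scores,
        'risk_band': [assign_band(s) for s in scores],
        # include_lowest=True keeps a score of exactly 0.0 in the first bin
        'risk_level': pd.cut(scores, bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
                            labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'],
                            include_lowest=True)
    }, index=new_df.index)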
    
    return results

# Test scoring function
test_scores = score_new_data(test.head(100))
print("Sample Scoring Results:")
print(test_scores.head(10))
print("\nRisk Distribution:")
print(test_scores['risk_level'].value_counts())

## Summary

This notebook demonstrated the complete end-to-end risk model pipeline including:

- ✅ **Data Processing**: Loading, validation, and preprocessing
- ✅ **Feature Engineering**: Creating polynomial and interaction features
- ✅ **Feature Selection**: Multiple selection methods
- ✅ **WOE Transformation**: Weight of Evidence encoding
- ✅ **Model Training**: Multiple algorithms with cross-validation
- ✅ **Evaluation**: Comprehensive metrics on train/test/OOT
- ✅ **PSI Monitoring**: Population stability tracking
- ✅ **Calibration**: Analysis and correction
- ✅ **Risk Bands**: Optimized segmentation
- ✅ **SHAP Analysis**: Feature importance and interpretability
- ✅ **Reporting**: Comprehensive Excel reports
- ✅ **Model Persistence**: Saving all artifacts for deployment

Once every item in the deployment readiness check (section 19) passes, the saved artifacts can be handed over for production deployment.