# Risk Model Pipeline - Dual Pipeline Example

## ⚠️ IMPORTANT: Installation from GitHub

### Known Issues and Solutions

#### If you get an `llvmlite` uninstall error:
```bash
# Option 1: Ignore the installed version
pip install --ignore-installed llvmlite
pip install git+https://github.com/selimoksuz/risk-model-pipeline.git

# Option 2: Use conda to manage llvmlite
conda update llvmlite
pip install git+https://github.com/selimoksuz/risk-model-pipeline.git

# Option 3: Force reinstall without dependencies
pip install --force-reinstall --no-deps git+https://github.com/selimoksuz/risk-model-pipeline.git
pip install numpy==1.24.3 pandas==1.5.3 scikit-learn==1.3.0
```

### Standard Installation
```bash
pip install git+https://github.com/selimoksuz/risk-model-pipeline.git
```

### Create Clean Environment (Recommended)
```bash
# Create new environment
python -m venv risk_env
risk_env\Scripts\activate  # Windows
source risk_env/bin/activate  # Linux/Mac

# Install in clean environment
pip install git+https://github.com/selimoksuz/risk-model-pipeline.git
```
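After activating the environment, it can help to confirm that Python is actually running from it. The following is a minimal sketch, assuming an environment created with `python -m venv` as above; the helper name `in_virtualenv` is illustrative and not part of the package. Conda environments are not detected by this check.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # In an environment created with `python -m venv`, sys.prefix points at the
    # environment, while sys.base_prefix points at the base installation.
    return sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If this reports `False` inside Jupyter, the notebook kernel is probably using a different interpreter than the one the package was installed into.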

## 1. Environment Check

In [1]:
# Simple package check - no subprocess needed
print("✓ risk-model-pipeline package is ready to use!")
print("✓ Compatible with pandas 2.x")

✓ risk-model-pipeline package is ready to use!
✓ Compatible with pandas 2.x
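The cell above prints its messages unconditionally. A check that actually looks for the package could be sketched as follows. It assumes the importable module is named `risk_pipeline` (the name used in the import cell of this notebook); `is_importable` is a hypothetical helper.

```python
import importlib.util

def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Still no subprocess needed: find_spec only consults the import system
for name in ("risk_pipeline", "pandas"):
    status = "✓ found" if is_importable(name) else "✗ not found"
    print(f"{status}: {name}")
```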


In [2]:
# Check Python and package versions
import sys
print(f"Python: {sys.version}")
print(f"Python executable: {sys.executable}")
print("-" * 50)

# Try importing packages and show versions
import importlib
packages = ['numpy', 'pandas', 'sklearn']

import_success = True
for package_name in packages:
    try:
        module = importlib.import_module(package_name)
        print(f"✓ {package_name}: {module.__version__}")
    except ImportError:
        print(f"✗ {package_name}: Not installed")
        import_success = False
    except Exception as e:
        print(f"✗ {package_name}: Error - {e}")
        import_success = False

if not import_success:
    print("\n⚠️ Please install missing packages:")
    print("pip install git+https://github.com/selimoksuz/risk-model-pipeline.git")
else:
    print("\n✓ All packages imported successfully!")

# Output should appear here when cell is run

Python: 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]
Python executable: C:\Users\Acer\anaconda3\python.exe
--------------------------------------------------
✓ numpy: 1.24.3
✓ pandas: 2.3.2
✓ sklearn: 1.6.1

✓ All packages imported successfully!


## 2. Setup and Imports

In [None]:
# Reinstall package from GitHub to get latest changes
import subprocess
import sys

print("Updating risk-model-pipeline from GitHub...")

# Uninstall existing version
print("1. Uninstalling existing version...")
subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", "risk-model-pipeline"], capture_output=True)

# Install fresh from GitHub (will install all requirements automatically)
print("2. Installing from GitHub (with all requirements)...")
result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "git+https://github.com/selimoksuz/risk-model-pipeline.git"],
    capture_output=True, text=True
)

if result.returncode == 0:
    print("✓ Package installed successfully!")
else:
    print(f"✗ Installation failed: {result.stderr}")

# Clear import cache to ensure fresh import (sys is already imported above)
modules_to_clear = ['risk_pipeline', 'risk_pipeline.pipeline', 'risk_pipeline.core']
for module in modules_to_clear:
    if module in sys.modules:
        del sys.modules[module]

print("✓ Ready to import pipeline")
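The cell above clears a fixed list of module names. A variant that removes every cached submodule by prefix might look like the sketch below; `clear_package_from_cache` is a hypothetical helper, not part of the package. Objects already imported into the notebook namespace keep pointing at the old code, so restarting the kernel remains the more reliable option.

```python
import sys

def clear_package_from_cache(package: str) -> list:
    """Remove `package` and all of its submodules from sys.modules."""
    # Collect names first so sys.modules is not modified while iterating over it
    stale = [name for name in sys.modules
             if name == package or name.startswith(package + ".")]
    for name in stale:
        del sys.modules[name]
    return stale

removed = clear_package_from_cache("risk_pipeline")
print(f"Cleared {len(removed)} cached module(s)")
```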

In [None]:
# Import pipeline components
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

from risk_pipeline.pipeline import Config, RiskModelPipeline

print("✓ Pipeline imported successfully!")

## 3. Generate Sample Data

In [5]:
def create_sample_data(n_samples=10000, seed=42, oot_shift=True):
    """
    Create synthetic credit risk data with controlled characteristics for testing
    
    Parameters:
    -----------
    n_samples : int
        Total number of samples
    seed : int
        Random seed for reproducibility
    oot_shift : bool
        If True, create distribution shift in OOT period for some features
    """
    np.random.seed(seed)
    import random
    random.seed(seed)
    
    # Time periods (70% train+test, 30% OOT)
    train_test_size = int(n_samples * 0.7)
    oot_size = n_samples - train_test_size
    
    # === STRONG PREDICTIVE FEATURES (stable) ===
    # These will have high IV and remain stable
    risk_score = np.concatenate([
        np.random.beta(2, 5, train_test_size),
        np.random.beta(2, 5, oot_size)  # Same distribution in OOT
    ])
    
    payment_score = np.concatenate([
        np.random.beta(3, 2, train_test_size),
        np.random.beta(3, 2, oot_size)  # Same distribution in OOT
    ])
    
    debt_ratio = np.concatenate([
        np.random.beta(2, 3, train_test_size),
        np.random.beta(2, 3, oot_size)  # Same distribution in OOT
    ])
    
    # === MODERATE PREDICTIVE FEATURES (with PSI shift) ===
    # These will have decent IV but high PSI in OOT
    income_level = np.concatenate([
        np.random.lognormal(10, 1.5, train_test_size),
        np.random.lognormal(10.5, 1.2, oot_size) if oot_shift else np.random.lognormal(10, 1.5, oot_size)
    ])
    
    credit_history_months = np.concatenate([
        np.random.gamma(3, 10, train_test_size),
        np.random.gamma(4, 12, oot_size) if oot_shift else np.random.gamma(3, 10, oot_size)
    ])
    
    # === WEAK/NOISY FEATURES ===
    # These should be filtered out by feature selection
    noise_feature1 = np.random.randn(n_samples)
    noise_feature2 = np.random.uniform(0, 1, n_samples)
    
    # === CATEGORICAL FEATURES ===
    employment_type = np.concatenate([
        np.random.choice(['Full-time', 'Part-time', 'Self-employed', 'Unemployed'], 
                        train_test_size, p=[0.6, 0.2, 0.15, 0.05]),
        np.random.choice(['Full-time', 'Part-time', 'Self-employed', 'Unemployed'], 
                        oot_size, p=[0.6, 0.2, 0.15, 0.05])
    ])
    
    # Region (shifts in OOT - new categories appear)
    region_train = np.random.choice(['North', 'South', 'East', 'West'], 
                                   train_test_size, p=[0.3, 0.3, 0.2, 0.2])
    if oot_shift:
        # Introduce new categories in OOT
        region_oot = np.random.choice(['North', 'South', 'East', 'West', 'Central', 'International'], 
                                     oot_size, p=[0.2, 0.2, 0.15, 0.15, 0.2, 0.1])
    else:
        region_oot = np.random.choice(['North', 'South', 'East', 'West'], 
                                     oot_size, p=[0.3, 0.3, 0.2, 0.2])
    region = np.concatenate([region_train, region_oot])
    
    product_type = np.random.choice(['A', 'B', 'C'], n_samples, p=[0.5, 0.3, 0.2])
    
    # === HIGHLY CORRELATED FEATURES (for correlation filtering) ===
    utilization_rate = debt_ratio + np.random.normal(0, 0.1, n_samples)
    utilization_rate = np.clip(utilization_rate, 0, 1)
    
    num_credit_lines = (credit_history_months / 10 + np.random.poisson(2, n_samples)).astype(int)
    num_credit_lines = np.clip(num_credit_lines, 0, 20)
    
    num_inquiries = np.random.poisson(2, n_samples)
    
    # === TARGET VARIABLE ===
    # Create target with strong signal from stable features
    risk_factor = (
        3.0 * risk_score +                    # Strong positive (bad is high)
        2.5 * payment_score +                  # Strong positive
        2.0 * debt_ratio +                     # Strong positive
        1.0 * utilization_rate +               # Moderate positive
        0.5 * (income_level < np.median(income_level)).astype(float) +
        0.3 * (credit_history_months < 24).astype(float) +
        0.5 * (employment_type == 'Unemployed').astype(float) +
        0.2 * (employment_type == 'Part-time').astype(float) +
        0.1 * noise_feature1 + 0.1 * noise_feature2
    )
    
    # Convert to probability
    default_prob = 1 / (1 + np.exp(-2 * (risk_factor - np.median(risk_factor))))
    target = np.random.binomial(1, default_prob)
    
    # Adjust to get ~20-30% default rate
    if target.mean() > 0.30:
        threshold = np.percentile(default_prob, 70)
        target = (default_prob > threshold).astype(int)
    elif target.mean() < 0.20:
        threshold = np.percentile(default_prob, 80)
        target = (default_prob > threshold).astype(int)
    
    # === ADD MISSING VALUES ===
    missing_idx = np.random.choice(n_samples, size=int(0.05 * n_samples), replace=False)
    income_level[missing_idx] = np.nan
    
    missing_idx = np.random.choice(n_samples, size=int(0.03 * n_samples), replace=False)
    credit_history_months[missing_idx] = np.nan
    
    # === CREATE DATAFRAME ===
    df = pd.DataFrame({
        'app_id': range(1, n_samples + 1),
        'app_dt': pd.date_range(start='2022-01-01', periods=n_samples, freq='H'),
        'risk_score': risk_score,
        'payment_score': payment_score,
        'debt_ratio': debt_ratio,
        'income_level': income_level,
        'credit_history_months': credit_history_months,
        'noise_feature1': noise_feature1,
        'noise_feature2': noise_feature2,
        'employment_type': employment_type,
        'region': region,
        'product_type': product_type,
        'utilization_rate': utilization_rate,
        'num_credit_lines': num_credit_lines,
        'num_inquiries': num_inquiries,
        'target': target
    })
    
    print(f"Dataset created:")
    print(f"  Shape: {df.shape}")
    print(f"  Default rate: {df['target'].mean():.2%}")
    print(f"  Date range: {df['app_dt'].min().date()} to {df['app_dt'].max().date()}")
    print(f"  Missing values: {df.isnull().sum().sum()}")
    
    # Show feature characteristics
    print(f"\nFeature characteristics:")
    print(f"  Strong predictors: risk_score, payment_score, debt_ratio")
    print(f"  PSI shift features: income_level, credit_history_months, region")
    print(f"  Noise features: noise_feature1, noise_feature2")
    print(f"  Correlated pairs: (debt_ratio, utilization_rate)")
    
    return df

# Generate data with fixed seed
try:
    df = create_sample_data(n_samples=10000, seed=42, oot_shift=True)
    print("\n✓ Data generated successfully!")
    display(df.head())
except Exception as e:
    print(f"✗ Error generating data: {e}")

Dataset created:
  Shape: (10000, 16)
  Default rate: 30.00%
  Date range: 2022-01-01 to 2023-02-21
  Missing values: 800

Feature characteristics:
  Strong predictors: risk_score, payment_score, debt_ratio
  PSI shift features: income_level, credit_history_months, region
  Noise features: noise_feature1, noise_feature2
  Correlated pairs: (debt_ratio, utilization_rate)

✓ Data generated successfully!


| | app_id | app_dt | risk_score | payment_score | debt_ratio | income_level | credit_history_months | noise_feature1 | noise_feature2 | employment_type | region | product_type | utilization_rate | num_credit_lines | num_inquiries | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-01-01 00:00:00 | 0.353677 | 0.576435 | 0.604789 | 66096.391145 | 7.45465 | 0.162789 | 0.210952 | Full-time | South | C | 0.666422 | 2 | 0 | 1 |
| 1 | 2 | 2022-01-01 01:00:00 | 0.248558 | 0.935896 | 0.245044 | 13009.413385 | 26.72538 | -2.072867 | 0.277412 | Unemployed | North | B | 0.308693 | 6 | 1 | 1 |
| 2 | 3 | 2022-01-01 02:00:00 | 0.415959 | 0.92408 | 0.405689 | 65966.324008 | 27.662485 | 0.282163 | 0.837427 | Full-time | South | A | 0.591985 | 2 | 6 | 1 |
| 3 | 4 | 2022-01-01 03:00:00 | 0.159968 | 0.626639 | 0.602617 | 46467.378787 | 44.511552 | 0.550439 | 0.929937 | Full-time | East | B | 0.439253 | 10 | 4 | 0 |
| 4 | 5 | 2022-01-01 04:00:00 | 0.550283 | 0.854347 | 0.577085 | 14137.477402 | 11.4831 | 0.385806 | 0.915711 | Self-employed | North | A | 0.421884 | 4 | 1 | 1 |
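The generator shifts `income_level` and `credit_history_months` in the last 30% of rows so that they show up in a PSI check. As a rough illustration of what such a check measures, here is a minimal PSI sketch using decile bins taken from the baseline sample. It is an assumption-laden example, not the pipeline's own implementation: the function name `psi`, the binning scheme, and the `eps` floor are all choices made for this sketch.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a comparison sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Drop missing values before binning
    expected = expected[~np.isnan(expected)]
    actual = actual[~np.isnan(actual)]
    # Interior quantile edges from the baseline define n_bins buckets
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_counts = np.bincount(np.searchsorted(edges, expected, side="right"), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(edges, actual, side="right"), minlength=n_bins)
    # Floor the proportions so empty buckets do not produce log(0)
    e_pct = np.clip(e_counts / e_counts.sum(), eps, None)
    a_pct = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.lognormal(10, 1.5, 7000)      # same parameters as income_level (train/test)
same = rng.lognormal(10, 1.5, 3000)          # no shift
shifted = rng.lognormal(10.5, 1.2, 3000)     # same parameters as the shifted OOT block
print(f"PSI without shift: {psi(baseline, same):.3f}")
print(f"PSI with shift:    {psi(baseline, shifted):.3f}")
```

Whether a given feature crosses `psi_threshold` depends on the binning the pipeline actually uses, so values from this sketch are only indicative.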


## 4. Configure Pipeline

In [6]:
# Create configuration
try:
    config = Config(
        # Core columns
        id_col='app_id',
        time_col='app_dt',
        target_col='target',
        
        # Enable DUAL PIPELINE
        enable_dual_pipeline=True,
        
        # Raw pipeline settings
        raw_imputation_strategy='median',
        raw_outlier_method='iqr',
        raw_outlier_threshold=1.5,
        
        # Data split
        use_test_split=True,
        test_size_row_frac=0.2,
        oot_window_months=3,  # last 3 months held out as OOT (~22% of this dataset)
        
        # Feature engineering - optimized for synthetic data
        rare_threshold=0.01,      # 1% threshold for rare categories
        psi_threshold=0.25,        # PSI threshold (some features will exceed)
        iv_min=0.02,               # Minimum IV (noise features below this)
        rho_threshold=0.90,        # Correlation threshold
        vif_threshold=5.0,         # VIF threshold
        
        # Model settings - balanced for performance
        cv_folds=3,
        hpo_method='random',       # Fast hyperparameter optimization
        hpo_timeout_sec=30,
        hpo_trials=10,
        
        # Output
        output_folder='outputs_dual_example',
        output_excel_path='dual_pipeline_results.xlsx',
        
        random_state=42
    )
    
    print("✓ Configuration created successfully!")
    print(f"\nSettings:")
    print(f"  Dual Pipeline: {config.enable_dual_pipeline}")
    print(f"  PSI Threshold: {config.psi_threshold}")
    print(f"  IV Minimum: {config.iv_min}")
    print(f"  Correlation Threshold: {config.rho_threshold}")
    print(f"  HPO Trials: {config.hpo_trials}")
    print(f"  Output: {config.output_folder}")
    
except Exception as e:
    print(f"✗ Error creating configuration: {e}")

✓ Configuration created successfully!

Settings:
  Dual Pipeline: True
  PSI Threshold: 0.25
  IV Minimum: 0.02
  Correlation Threshold: 0.9
  HPO Trials: 10
  Output: outputs_dual_example
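The `iv_min` setting refers to Information Value. For readers unfamiliar with it, the sketch below computes IV for a numeric feature from quantile bins, using the usual definition (sum over bins of the difference in good/bad shares times the WoE). The function name `information_value`, the binning, and the `eps` floor are assumptions made for illustration; the pipeline's own WoE binning may differ.

```python
import numpy as np
import pandas as pd

def information_value(feature, target, n_bins=10, eps=1e-6):
    """IV of a numeric feature against a binary target (1 = bad, 0 = good)."""
    data = pd.DataFrame({"x": feature, "y": target}).dropna()
    # Quantile bins; duplicates="drop" merges bins when many values repeat
    data["bin"] = pd.qcut(data["x"], q=n_bins, duplicates="drop")
    counts = data.groupby("bin", observed=True)["y"].agg(bad="sum", total="count")
    counts["good"] = counts["total"] - counts["bad"]
    # Share of all goods / all bads that falls in each bin, floored to avoid log(0)
    dist_good = np.clip(counts["good"] / counts["good"].sum(), eps, None)
    dist_bad = np.clip(counts["bad"] / counts["bad"].sum(), eps, None)
    woe = np.log(dist_good / dist_bad)
    return float(((dist_good - dist_bad) * woe).sum())

rng = np.random.default_rng(42)
signal = rng.normal(size=5000)
noise = rng.normal(size=5000)
y = (signal + rng.normal(size=5000) > 1).astype(int)  # target driven by `signal` only
print(f"IV(signal) = {information_value(signal, y):.3f}")
print(f"IV(noise)  = {information_value(noise, y):.3f}")
```

A feature whose IV falls below `iv_min` would be a candidate for removal under this definition.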


## 5. Run Pipeline

In [7]:
# Run pipeline with error handling
print("Preparing to run pipeline...")

try:
    # Set random seed before pipeline run for consistency
    import random
    import time  # needed for the timing calls below
    import numpy as np
    np.random.seed(42)
    random.seed(42)
    
    # Create pipeline instance
    pipeline = RiskModelPipeline(config)
    print("✓ Pipeline instance created")
    
    # Run pipeline
    print("\n" + "="*60)
    print("STARTING DUAL PIPELINE EXECUTION")
    print("="*60 + "\n")
    
    start_time = time.time()
    pipeline.run(df)
    elapsed = time.time() - start_time
    
    print(f"\n✓ Pipeline completed in {elapsed:.2f} seconds")
    
except Exception as e:
    print(f"\n✗ Pipeline error: {e}")
    print("\nPossible solutions:")
    print("  1. Check if all required packages are installed")
    print("  2. Verify numpy/pandas compatibility")
    print("  3. Run: pip install git+https://github.com/selimoksuz/risk-model-pipeline.git")
    print("\nDetailed error:")
    import traceback
    traceback.print_exc()

# Note: Pipeline output will appear here when cell is run

Preparing to run pipeline...
✓ Pipeline instance created

STARTING DUAL PIPELINE EXECUTION

[20:39:59] >> 1) Data loading & preparation starting | CPU=8% RAM=23%
   - Data size: 10,000 rows x 16 columns
   - Target rate: 30.00%
   - Random seed: 42
[20:39:59] 1) Data loading & preparation finished (0.10s) - OK | CPU=0% RAM=23%
[20:39:59] >> 2) Input validation & fixing starting | CPU=1% RAM=23%
[20:40:00] 2) Input validation & fixing finished (0.11s) - OK | CPU=1% RAM=23%
[20:40:00] >> 3) Variable classification starting | CPU=0% RAM=23%
   - numeric=10, categorical=4
[20:40:00] 3) Variable classification finished (0.11s) - OK | CPU=0% RAM=23%
[20:40:00] >> 4) Missing & rare value policy starting | CPU=0% RAM=23%
[20:40:00] 4) Missing & rare value policy finished (0.11s) - OK | CPU=0% RAM=23%
[20:40:00] >> 5) Time split (Train/Test/OOT) starting | CPU=0% RAM=23%
   - Train=6233, Test=1558, OOT=2209
[20:40:00] 5) Time split (Train/Test/OOT) finished (0.12s) - OK | CPU=0%
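The log reports an OOT block of 2,209 rows for `oot_window_months=3`. One plausible reading, shown in the sketch below, is that OOT is everything from three calendar months before the latest `app_dt` onward. This is an assumption for illustration (the pipeline's actual rule is not shown in this notebook), and `oot_mask` is a hypothetical helper. With an inclusive cutoff, this reading gives 2,209 rows on the sample timeline, which agrees with the log.

```python
import numpy as np
import pandas as pd

def oot_mask(dates: pd.Series, window_months: int) -> pd.Series:
    """Flag rows that fall within the last `window_months` calendar months."""
    cutoff = dates.max() - pd.DateOffset(months=window_months)
    return dates >= cutoff  # inclusive cutoff (an assumption)

# Same hourly timeline as the sample data: 10,000 rows starting 2022-01-01
app_dt = pd.Series(pd.Timestamp("2022-01-01") + pd.to_timedelta(np.arange(10_000), unit="h"))
mask = oot_mask(app_dt, window_months=3)
print(f"OOT rows: {int(mask.sum())} ({mask.mean():.1%})")
```

Note that the data generator applies its distribution shift to the last 30% of rows, so a three-month window covers only part of the shifted block.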

## 6. Review Results

In [8]:
# Review results with error handling
try:
    if hasattr(pipeline, 'models_summary_') and pipeline.models_summary_ is not None:
        print("="*60)
        print("MODEL PERFORMANCE SUMMARY")
        print("="*60)
        
        summary = pipeline.models_summary_
        
        # Check for Gini column
        gini_col = None
        for col in ['Gini_OOT', 'gini_oot', 'Gini_Test', 'gini_test']:
            if col in summary.columns:
                gini_col = col
                break
        
        if gini_col:
            print(f"\nTop 5 Models by {gini_col}:")
            top_models = summary.nlargest(5, gini_col)
            print(top_models[['model_name', gini_col]].to_string())
        else:
            print("\nModel Summary:")
            print(summary.head().to_string())
        
        # Check for pipeline comparison
        if 'pipeline' in summary.columns:
            print("\n" + "="*60)
            print("PIPELINE COMPARISON")
            print("="*60)
            
            for pipeline_type in ['WOE', 'RAW']:
                pipeline_models = summary[summary['pipeline'] == pipeline_type]
                if not pipeline_models.empty:
                    print(f"\n{pipeline_type} Pipeline:")
                    print(f"  Models: {len(pipeline_models)}")
                    if gini_col and gini_col in pipeline_models.columns:
                        print(f"  Best {gini_col}: {pipeline_models[gini_col].max():.4f}")
                        print(f"  Mean {gini_col}: {pipeline_models[gini_col].mean():.4f}")
        
        # Feature Selection Analysis
        print("\n" + "="*60)
        print("FEATURE SELECTION ANALYSIS")
        print("="*60)
        
        if hasattr(pipeline, 'final_vars'):
            print(f"\nFinal Variables Selected: {len(pipeline.final_vars)}")
            print(f"Variables: {pipeline.final_vars}")
            
            # Expected behavior with synthetic data
            print("\n📊 Expected Behavior:")
            print("✓ Strong predictors kept: risk_score, payment_score, debt_ratio")
            print("✓ Noise features dropped: noise_feature1, noise_feature2")
            print("✓ High PSI features dropped: income_level, region (if PSI > threshold)")
            print("✓ Correlated features: utilization_rate may be dropped (corr with debt_ratio)")
            
        # PSI Analysis
        if hasattr(pipeline, 'psi_summary'):
            print("\n" + "="*60)
            print("PSI STABILITY ANALYSIS")
            print("="*60)
            
            psi_df = pipeline.psi_summary
            if not psi_df.empty:
                high_psi = psi_df[psi_df['psi'] > config.psi_threshold]
                print(f"\nFeatures with High PSI (>{config.psi_threshold}): {len(high_psi)}")
                if not high_psi.empty:
                    for _, row in high_psi.iterrows():
                        print(f"  - {row['variable']}: PSI={row['psi']:.3f}")
    else:
        print("No model summary available.")
        print("\n⚠️ Possible reasons:")
        print("  1. All features were filtered out by feature selection")
        print("  2. Check iv_min threshold - may be too high")
        print("  3. Check if WoE transformation is working correctly")
        
except Exception as e:
    print(f"Error reviewing results: {e}")

MODEL PERFORMANCE SUMMARY

Top 5 Models by Gini_OOT:
          model_name  Gini_OOT
11           RAW_GAM  0.981621
9        RAW_XGBoost  0.977926
10      RAW_LightGBM  0.976085
7   RAW_RandomForest  0.972122
8     RAW_ExtraTrees  0.969128

PIPELINE COMPARISON

WOE Pipeline:
  Models: 6
  Best Gini_OOT: 0.0000
  Mean Gini_OOT: 0.0000

RAW Pipeline:
  Models: 6
  Best Gini_OOT: 0.9816
  Mean Gini_OOT: 0.9741

FEATURE SELECTION ANALYSIS
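The summary ranks models by `Gini_OOT`. Assuming the conventional definition, Gini = 2 × AUC − 1, the metric can be sketched with NumPy alone as below. The function name `gini` is illustrative, and the pairwise comparison uses memory proportional to (number of positives × number of negatives), so it is meant for small checks rather than large datasets.

```python
import numpy as np

def gini(y_true, y_score) -> float:
    """Gini coefficient of a score against a binary target, as 2 * AUC - 1."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # AUC = P(score_pos > score_neg) + 0.5 * P(score_pos == score_neg)
    diff = pos[:, None] - neg[None, :]
    auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return float(2 * auc - 1)

print(gini([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.5
```

Under this definition, a Gini of 0 corresponds to a score that ranks no better than chance, and values near 1 to an almost perfect ranking.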


## 7. Export Reports

In [9]:
# Export reports
try:
    pipeline.export_reports()
    print("✓ Reports exported successfully!")
    
    # List generated files
    import os
    if os.path.exists(config.output_folder):
        files = os.listdir(config.output_folder)
        print(f"\nGenerated {len(files)} files in '{config.output_folder}':")
        for f in sorted(files)[:10]:
            size = os.path.getsize(os.path.join(config.output_folder, f)) / 1024
            print(f"  - {f} ({size:.1f} KB)")
        if len(files) > 10:
            print(f"  ... and {len(files)-10} more files")
            
except Exception as e:
    print(f"Error exporting reports: {e}")

✓ Reports exported successfully!

Generated 4 files in 'outputs_dual_example':
  - best_model_20250904_203942_db996278.joblib (210.1 KB)
  - dual_pipeline_results.xlsx (26.6 KB)
  - final_vars_20250904_203942_db996278.json (0.1 KB)
  - woe_mapping_20250904_203942_db996278.json (31.6 KB)
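The exported file names carry a timestamp and a run identifier, so a later session needs to locate the most recent one. A minimal sketch, assuming the `final_vars_*.json` naming seen in the listing above and making no assumption about the JSON structure beyond it being valid JSON; `latest_artifact` is a hypothetical helper:

```python
import json
from pathlib import Path

def latest_artifact(folder, pattern):
    """Return the newest file matching `pattern`, or None if there is none."""
    # The YYYYMMDD_HHMMSS timestamp in the name sorts chronologically as text
    matches = sorted(Path(folder).glob(pattern))
    return matches[-1] if matches else None

path = latest_artifact("outputs_dual_example", "final_vars_*.json")
if path is None:
    print("No final_vars file found")
else:
    with open(path, encoding="utf-8") as fh:
        final_vars = json.load(fh)
    print(f"Loaded {path.name}: {final_vars}")
```

The same helper could be pointed at `best_model_*.joblib` and the result passed to `joblib.load`, provided the environment matches the one that produced the file.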


## Troubleshooting

If you encounter any errors:

1. **numpy.dtype size changed error**:
   ```bash
   pip install git+https://github.com/selimoksuz/risk-model-pipeline.git
   ```

2. **Import errors**:
   ```bash
   pip install --force-reinstall git+https://github.com/selimoksuz/risk-model-pipeline.git
   ```

3. **Memory issues**:
   - Reduce `n_samples` in `create_sample_data()`
   - Reduce `hpo_trials` in the config

4. **Create fresh environment**:
   ```bash
   python -m venv fresh_env
   fresh_env\Scripts\activate  # Windows
   source fresh_env/bin/activate  # Linux/Mac
   pip install git+https://github.com/selimoksuz/risk-model-pipeline.git
   ```