# Loan Approval Model Implementation

This notebook implements a loan approval model with comprehensive tracking. The implementation follows these steps:
1. Setup and Imports
2. Data Loading and Initial Processing
3. Feature Classification
4. Model Configuration
5. Training Pipeline Setup
6. Model Training and Evaluation
7. Results and Artifact Storage

1. Feature Classification and Processing
2. Model Training and Validation
3. Fairness Metrics and Protected Attributes
4. Model Explainability
5. Monitoring and Drift Detection
6. Governance and Compliance

In [38]:
# Load the dataset
import pandas as pd 
print("Loading the dataset...")
df = pd.read_csv('Data/Loan_dataset_india_110000.csv')

# Display initial information
print("\nDataset Shape:", df.shape)
print("\nColumns in the dataset:")
for col in df.columns:
    print(f"- {col}")

print("\nFirst few rows:")
display(df.head())

Loading the dataset...

Dataset Shape: (200000, 29)

Columns in the dataset:
- person_id
- education_level
- person_age
- person_emp_exp_months
- number_of_jobs_switched
- total_experience_months
- avg_tenure_months
- person_income_monthly_inr
- person_income_annual_inr
- currency
- loan_purpose
- loan_amount_inr
- credit_utilization_ratio
- credit_approved
- loan_id
- person_gender
- marital_status
- loan_int_rate
- credit_score
- previous_loan_defaults_on_file
- number_of_open_credit_lines
- number_of_hard_inquiries
- months_since_last_delinquency
- number_of_public_records
- co_applicant_income
- co_applicant_credit_score
- application_date
- geographic_region
- loan_tenure_years

First few rows:


  df = pd.read_csv('Data/Loan_dataset_india_110000.csv')


Unnamed: 0,person_id,education_level,person_age,person_emp_exp_months,number_of_jobs_switched,total_experience_months,avg_tenure_months,person_income_monthly_inr,person_income_annual_inr,currency,...,previous_loan_defaults_on_file,number_of_open_credit_lines,number_of_hard_inquiries,months_since_last_delinquency,number_of_public_records,co_applicant_income,co_applicant_credit_score,application_date,geographic_region,loan_tenure_years
0,P13694459720974,Bachelor,16y 10m,0.0,0.0,0.0,0.0,0.0,0.0,INR,...,No,3.0,0.0,999.0,0.0,0.0,0.0,7/5/2025,South-West,5.0
1,P61387552181268,High School,23y 6m,43.0,0.0,43.0,43.0,63961.0,767532.0,INR,...,No,6.0,3.0,999.0,0.0,0.0,0.0,9/22/2022,North,4.0
2,P82370628326614,Bachelor,16y 11m,0.0,0.0,0.0,0.0,0.0,0.0,INR,...,No,3.0,1.0,999.0,0.0,0.0,0.0,6/8/2025,Central,5.0
3,P29011497711050,Masters,30y 3m,47.0,2.0,79.0,26.33,108176.0,1298112.0,INR,...,No,3.0,2.0,999.0,0.0,0.0,0.0,8/17/2024,Central,4.0
4,P71947746577780,High School,26y 10m,36.0,0.0,36.0,36.0,31971.0,383652.0,INR,...,No,4.0,1.0,999.0,1.0,0.0,0.0,9/28/2022,North,6.0


In [39]:
# # Create metadata DataFrame 
# metadata_data = [
#     ['Personal Information', 'person_id', 'Unique identifier for each person', 'object'],
#     ['Personal Information', 'education_level', 'Education qualification of the applicant', 'object'],
#     ['Personal Information', 'person_age', 'Age of the applicant', 'object'],
#     ['Personal Information', 'person_gender', 'Gender of the applicant', 'object'],
#     ['Personal Information', 'marital_status', 'Marital status of the applicant', 'object'],
    
#     ['Employment Information', 'person_emp_exp_months', 'Current employment experience in months', 'float64'],
#     ['Employment Information', 'number_of_jobs_switched', 'Number of job changes', 'float64'],
#     ['Employment Information', 'total_experience_months', 'Total work experience in months', 'float64'],
#     ['Employment Information', 'avg_tenure_months', 'Average time spent in each job', 'float64'],
#     ['Employment Information', 'person_income_monthly_inr', 'Monthly income in INR', 'float64'],
#     ['Employment Information', 'person_income_annual_inr', 'Annual income in INR', 'float64'],
    
#     ['Loan Information', 'loan_purpose', 'Purpose of the loan', 'object'],
#     ['Loan Information', 'loan_amount_inr', 'Loan amount in INR', 'float64'],
#     ['Loan Information', 'loan_id', 'Unique identifier for the loan', 'object'],
#     ['Loan Information', 'loan_int_rate', 'Interest rate of the loan', 'float64'],
#     ['Loan Information', 'loan_tenure_years', 'Duration of the loan in years', 'float64'],
#     ['Loan Information', 'currency', 'Currency of the loan (all in INR)', 'object'],
#     ['Loan Information', 'credit_approved', 'Whether the loan was approved (1) or not (0)', 'float64'],
    
#     ['Credit Information', 'credit_utilization_ratio', 'Ratio of credit used to credit available', 'float64'],
#     ['Credit Information', 'credit_score', 'Credit score of the applicant', 'float64'],
#     ['Credit Information', 'previous_loan_defaults_on_file', 'Previous loan defaults record', 'object'],
#     ['Credit Information', 'number_of_open_credit_lines', 'Number of active credit lines', 'float64'],
#     ['Credit Information', 'number_of_hard_inquiries', 'Number of credit inquiries', 'float64'],
#     ['Credit Information', 'months_since_last_delinquency', 'Months since last late payment', 'float64'],
#     ['Credit Information', 'number_of_public_records', 'Number of public records', 'float64'],
    
#     ['Co-applicant Information', 'co_applicant_income', 'Income of the co-applicant', 'float64'],
#     ['Co-applicant Information', 'co_applicant_credit_score', 'Credit score of the co-applicant', 'float64'],
    
#     ['Application Details', 'application_date', 'Date of loan application', 'object'],
#     ['Application Details', 'geographic_region', 'Region where loan was applied', 'object']
# ]

# # Create DataFrame
# metadata_df = pd.DataFrame(metadata_data, columns=['Category', 'Feature Name', 'Description', 'Datatype'])

# # Save to CSV
# metadata_df.to_csv('Data/metadata.csv', index=False)

# print("Metadata file has been created and saved as 'metadata.csv' in the Data folder")
# display(metadata_df)

In [40]:
# Define the new column order
new_column_order = [
    'person_id',
    'loan_id',
    'education_level',
    'loan_purpose',
    'person_age',
    'total_experience_months',
    'number_of_jobs_switched',
    'person_emp_exp_months',
    'person_income_annual_inr',
    'application_date',
    'loan_amount_inr',
    'loan_int_rate',
    'loan_tenure_years',
    'credit_score',
    'credit_utilization_ratio',
    'number_of_open_credit_lines',
    'number_of_hard_inquiries',
    'months_since_last_delinquency',
    'number_of_public_records',
    'previous_loan_defaults_on_file',
    'co_applicant_income',
    'co_applicant_credit_score',
    'person_gender',
    'marital_status',
    'geographic_region',
    'credit_approved'
]

# Columns to drop
columns_to_drop = ['avg_tenure_months', 'person_income_monthly_inr', 'currency']

# Create new DataFrame with reordered columns and dropped columns
df_processed = df.drop(columns=columns_to_drop)[new_column_order]

# Display info about the processed dataset
print("Processed Dataset Information:")
print("-" * 30)
print(f"Number of rows: {len(df_processed)}")
print(f"Number of columns: {len(df_processed.columns)}")
print("\nFirst few rows of the processed dataset:")
display(df_processed.head())

# Save the processed dataset
df_processed.to_csv('Data/loan_data_processed.csv', index=False)
print("\nProcessed dataset has been saved as 'loan_data_processed.csv' in the Data folder")

Processed Dataset Information:
------------------------------
Number of rows: 200000
Number of columns: 26

First few rows of the processed dataset:


Unnamed: 0,person_id,loan_id,education_level,loan_purpose,person_age,total_experience_months,number_of_jobs_switched,person_emp_exp_months,person_income_annual_inr,application_date,...,number_of_hard_inquiries,months_since_last_delinquency,number_of_public_records,previous_loan_defaults_on_file,co_applicant_income,co_applicant_credit_score,person_gender,marital_status,geographic_region,credit_approved
0,P13694459720974,L56161784365314,Bachelor,Education,16y 10m,0.0,0.0,0.0,0.0,7/5/2025,...,0.0,999.0,0.0,No,0.0,0.0,Male,Single,South-West,0.0
1,P61387552181268,L65594787311146,High School,Education,23y 6m,43.0,0.0,43.0,767532.0,9/22/2022,...,3.0,999.0,0.0,No,0.0,0.0,Female,Married,North,0.0
2,P82370628326614,L37940392626487,Bachelor,Education,16y 11m,0.0,0.0,0.0,0.0,6/8/2025,...,1.0,999.0,0.0,No,0.0,0.0,Female,Single,Central,0.0
3,P29011497711050,L98417122446195,Masters,Personal,30y 3m,79.0,2.0,47.0,1298112.0,8/17/2024,...,2.0,999.0,0.0,No,0.0,0.0,Female,Single,Central,0.0
4,P71947746577780,L11789357160290,High School,Personal,26y 10m,36.0,0.0,36.0,383652.0,9/28/2022,...,1.0,999.0,1.0,No,0.0,0.0,Male,Single,North,0.0



Processed dataset has been saved as 'loan_data_processed.csv' in the Data folder


In [41]:
# Define feature classifications
protected_features = [
    'person_gender',
    'person_age',
    'marital_status',
    'education_level',
    'geographic_region'
]

derived_features = [
    'credit_utilization_ratio',  # Calculated from credit usage
    'avg_tenure_months',         # Calculated from experience
    'total_experience_months',   # Sum of all experience
    'person_income_annual_inr'   # Calculated from monthly income
]

# All other features are considered given features
given_features = [col for col in df_processed.columns 
                 if col not in protected_features + derived_features]

# Create a feature classification dictionary
feature_classification = {
    'Protected Features': protected_features,
    'Derived Features': derived_features,
    'Given Features': given_features
}

# Display feature classifications
print("Feature Classifications:")
print("-" * 50)
for category, features in feature_classification.items():
    print(f"\n{category}:")
    for feature in features:
        print(f"- {feature}")

# Create a DataFrame with feature classifications
feature_class_data = []
for feature in df_processed.columns:
    if feature in protected_features:
        category = 'Protected'
    elif feature in derived_features:
        category = 'Derived'
    else:
        category = 'Given'
    feature_class_data.append([feature, category])

feature_classification_df = pd.DataFrame(feature_class_data, 
                                      columns=['Feature', 'Classification'])

# Save feature classification to CSV
feature_classification_df.to_csv('Data/feature_classification.csv', index=False)
print("\nFeature classification has been saved to 'feature_classification.csv'")

# Display the classification DataFrame
print("\nFeature Classification Summary:")
print("-" * 50)
display(feature_classification_df)

Feature Classifications:
--------------------------------------------------

Protected Features:
- person_gender
- person_age
- marital_status
- education_level
- geographic_region

Derived Features:
- credit_utilization_ratio
- avg_tenure_months
- total_experience_months
- person_income_annual_inr

Given Features:
- person_id
- loan_id
- loan_purpose
- number_of_jobs_switched
- person_emp_exp_months
- application_date
- loan_amount_inr
- loan_int_rate
- loan_tenure_years
- credit_score
- number_of_open_credit_lines
- number_of_hard_inquiries
- months_since_last_delinquency
- number_of_public_records
- previous_loan_defaults_on_file
- co_applicant_income
- co_applicant_credit_score
- credit_approved

Feature classification has been saved to 'feature_classification.csv'

Feature Classification Summary:
--------------------------------------------------


Unnamed: 0,Feature,Classification
0,person_id,Given
1,loan_id,Given
2,education_level,Protected
3,loan_purpose,Given
4,person_age,Protected
5,total_experience_months,Derived
6,number_of_jobs_switched,Given
7,person_emp_exp_months,Given
8,person_income_annual_inr,Derived
9,application_date,Given


In [42]:
# Import required libraries
import pandas as pd
import numpy as np
from datetime import datetime
import json
import joblib
from dataclasses import dataclass, asdict
from typing import Dict, List, Any, Optional
# ML libraries
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (roc_auc_score, precision_recall_curve, average_precision_score,
                           precision_score, recall_score, f1_score, brier_score_loss,
                           confusion_matrix)
from scipy.stats import chi2_contingency

# Try to import SMOTE from imblearn, otherwise set to None
try:
    from imblearn.over_sampling import SMOTE
except Exception:
    SMOTE = None

# Define tracking classes for model metadata
@dataclass
class FeatureMetadata:
    name: str
    category: str  # 'protected', 'derived', or 'given'
    feature_type: str  # 'numeric' or 'categorical'
    preprocessing: str  # preprocessing method
    description: str

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Initialize tracking dictionary
tracking = {
    "run_id": datetime.now().strftime("%Y%m%d_%H%M%S"),
    "model_version": "1.0.0",
    "environment": {
        "python_version": pd.__version__,
        "sklearn_version": pd.__version__,
        "random_state": RANDOM_STATE
    }
}

print("Initial setup completed.")

Initial setup completed.


In [43]:
# Helper function to convert to JSON serializable format
def convert_to_json_serializable(obj):
    """Convert an object to a JSON serializable format."""
    if isinstance(obj, (np.int64, np.int32, np.float64, np.float32)):
        return obj.item()
    elif isinstance(obj, dict):
        return {k: convert_to_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_json_serializable(v) for v in obj]
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    return obj

In [44]:
# Define model configuration
model_config = {
    'solvers': ['saga', 'lbfgs'],  # Multiple solvers to try
    'cv_folds': 5,  # Number of cross-validation folds
    'solver_params': {
        'saga': {
            'C': [0.1, 1.0, 10.0, 100.0],  # Focus on higher C values for better fit
            'penalty': ['l1', 'l2'],  # Remove elasticnet for simplicity
            'class_weight': ['balanced'],  # Force balanced class weights
            'max_iter': [2000]  # Increase max iterations further
        },
        'lbfgs': {
            'C': [0.1, 1.0, 10.0, 100.0],  # Focus on higher C values
            'penalty': ['l2'],
            'class_weight': ['balanced'],  # Force balanced class weights
            'max_iter': [2000]  # Increase max iterations
        }
    }
}

In [45]:
# Use the processed dataset
print("Using the processed dataset...")
df = df_processed

# Define the features we want to keep and their types
feature_metadata = {
    # Protected features (only for fairness monitoring, not for training)
    'protected_features': {
        'person_gender': FeatureMetadata(
            name='person_gender',
            category='protected',
            feature_type='categorical',
            preprocessing='none',
            description='Gender of applicant'
        ),
        'person_age': FeatureMetadata(
            name='person_age',
            category='protected',
            feature_type='categorical',
            preprocessing='none',
            description='Age of applicant'
        ),
        'marital_status': FeatureMetadata(
            name='marital_status',
            category='protected',
            feature_type='categorical',
            preprocessing='none',
            description='Marital status'
        ),
        'education_level': FeatureMetadata(
            name='education_level',
            category='protected',
            feature_type='categorical',
            preprocessing='none',
            description='Education qualification'
        ),
        'geographic_region': FeatureMetadata(
            name='geographic_region',
            category='protected',
            feature_type='categorical',
            preprocessing='none',
            description='Geographic region'
        )
    },
    
    # Given features (raw input features)
    'given_features': {
        'person_id': FeatureMetadata(
            name='person_id',
            category='given',
            feature_type='categorical',
            preprocessing='none',
            description='Unique identifier for person'
        ),
        'loan_id': FeatureMetadata(
            name='loan_id',
            category='given',
            feature_type='categorical',
            preprocessing='none',
            description='Unique identifier for loan'
        ),
        'loan_purpose': FeatureMetadata(
            name='loan_purpose',
            category='given',
            feature_type='categorical',
            preprocessing='onehot',
            description='Purpose of the loan'
        ),
        'number_of_jobs_switched': FeatureMetadata(
            name='number_of_jobs_switched',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Number of jobs changed'
        ),
        'person_emp_exp_months': FeatureMetadata(
            name='person_emp_exp_months',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Current employment experience in months'
        ),
        'application_date': FeatureMetadata(
            name='application_date',
            category='given',
            feature_type='categorical',
            preprocessing='none',
            description='Date of loan application'
        ),
        'loan_amount_inr': FeatureMetadata(
            name='loan_amount_inr',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Loan amount requested'
        ),
        'loan_int_rate': FeatureMetadata(
            name='loan_int_rate',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Interest rate'
        ),
        'loan_tenure_years': FeatureMetadata(
            name='loan_tenure_years',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Loan tenure in years'
        ),
        'credit_score': FeatureMetadata(
            name='credit_score',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Credit score'
        ),
        'number_of_open_credit_lines': FeatureMetadata(
            name='number_of_open_credit_lines',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Number of active credit lines'
        ),
        'number_of_hard_inquiries': FeatureMetadata(
            name='number_of_hard_inquiries',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Number of credit inquiries'
        ),
        'months_since_last_delinquency': FeatureMetadata(
            name='months_since_last_delinquency',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Months since last late payment'
        ),
        'number_of_public_records': FeatureMetadata(
            name='number_of_public_records',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Number of public records'
        ),
        'previous_loan_defaults_on_file': FeatureMetadata(
            name='previous_loan_defaults_on_file',
            category='given',
            feature_type='categorical',
            preprocessing='onehot',
            description='Previous loan defaults record'
        ),
        'co_applicant_income': FeatureMetadata(
            name='co_applicant_income',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Income of co-applicant'
        ),
        'co_applicant_credit_score': FeatureMetadata(
            name='co_applicant_credit_score',
            category='given',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Credit score of co-applicant'
        )
    },
    
    # Derived features (calculated from other features)
    'derived_features': {
        'credit_utilization_ratio': FeatureMetadata(
            name='credit_utilization_ratio',
            category='derived',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Ratio of credit used to credit available'
        ),
        'total_experience_months': FeatureMetadata(
            name='total_experience_months',
            category='derived',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Total work experience in months'
        ),
        'person_income_annual_inr': FeatureMetadata(
            name='person_income_annual_inr',
            category='derived',
            feature_type='numeric',
            preprocessing='standard_scaler',
            description='Annual income in INR'
        )
    }
}

# Get lists of features by type
protected_features = list(feature_metadata['protected_features'].keys())
derived_features = list(feature_metadata['derived_features'].keys())
given_features = list(feature_metadata['given_features'].keys())

# Update tracking dictionary with feature metadata
tracking['feature_metadata'] = {
    category: {name: asdict(metadata) 
              for name, metadata in features.items()}
    for category, features in feature_metadata.items()
}

# Save feature metadata
with open('Data/feature_metadata.json', 'w') as f:
    json.dump(tracking['feature_metadata'], f, indent=2)

print("\nFeature metadata has been saved to 'feature_metadata.json'")
print("\nFeature counts:")
print(f"Protected features: {len(protected_features)}")
for f in protected_features:
    print(f"  - {f}")
print(f"\nDerived features: {len(derived_features)}")
for f in derived_features:
    print(f"  - {f}")
print(f"\nGiven features: {len(given_features)}")
for f in given_features:
    print(f"  - {f}")

Using the processed dataset...

Feature metadata has been saved to 'feature_metadata.json'

Feature counts:
Protected features: 5
  - person_gender
  - person_age
  - marital_status
  - education_level
  - geographic_region

Derived features: 3
  - credit_utilization_ratio
  - total_experience_months
  - person_income_annual_inr

Given features: 17
  - person_id
  - loan_id
  - loan_purpose
  - number_of_jobs_switched
  - person_emp_exp_months
  - application_date
  - loan_amount_inr
  - loan_int_rate
  - loan_tenure_years
  - credit_score
  - number_of_open_credit_lines
  - number_of_hard_inquiries
  - months_since_last_delinquency
  - number_of_public_records
  - previous_loan_defaults_on_file
  - co_applicant_income
  - co_applicant_credit_score


In [46]:
# Model training and evaluation functions
import os
def find_optimal_threshold(y_true, y_prob):
    """Find optimal threshold that maximizes F1 score. Return 0.5 if undefined."""
    try:
        precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
        f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-12)
        if len(thresholds) == 0:
            return 0.5
        optimal_idx = np.nanargmax(f1_scores)
        # thresholds length is len(f1_scores)-1, guard index
        if optimal_idx >= len(thresholds):
            return thresholds[-1]
        return thresholds[optimal_idx]
    except Exception:
        return 0.5


def evaluate_model(model, X, y, protected_features=None):
    """Evaluate model performance and fairness with threshold optimization"""
    # Get predictions - support pipelines and grid search wrappers
    try:
        if hasattr(model, 'predict_proba'):
            y_prob = model.predict_proba(X)[:, 1]
        else:
            # GridSearchCV wraps estimator in best_estimator_
            if hasattr(model, 'best_estimator_'):
                y_prob = model.best_estimator_.predict_proba(X)[:, 1]
            else:
                # Fallback: use decision_function and map to 0-1
                scores = model.decision_function(X)
                y_prob = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    except Exception as e:
        raise RuntimeError(f"Could not get predicted probabilities: {e}")
    
    # Handle degenerate y (single class)
    if len(np.unique(y)) < 2:
        # return trivial metrics with zero/NA values
        metrics = {
            'roc_auc': float('nan'),
            'avg_precision': float('nan'),
            'precision': 0.0,
            'recall': 0.0,
            'f1': 0.0,
            'brier': float('nan'),
            'optimal_threshold': 0.5
        }
        return metrics, {}
    
    # Find optimal threshold
    optimal_threshold = find_optimal_threshold(y, y_prob)
    y_pred = (y_prob >= optimal_threshold).astype(int)
    
    # Safely compute metrics with try/except for edge cases
    def safe_score(func, y_true, y_pred_or_prob, use_prob=False):
        try:
            return float(func(y_true, y_pred_or_prob))
        except Exception:
            return float('nan')
    
    metrics = {
        'roc_auc': safe_score(roc_auc_score, y, y_prob),
        'avg_precision': safe_score(average_precision_score, y, y_prob),
        'precision': safe_score(precision_score, y, y_pred),
        'recall': safe_score(recall_score, y, y_pred),
        'f1': safe_score(f1_score, y, y_pred),
        'brier': safe_score(brier_score_loss, y, y_prob),
        'optimal_threshold': float(optimal_threshold)
    }
    
    # Calculate confusion matrix
    try:
        cm = confusion_matrix(y, y_pred)
        metrics['confusion_matrix'] = cm.tolist()
    except Exception:
        metrics['confusion_matrix'] = None
    
    # Calculate fairness metrics if protected features are provided
    fairness_metrics = {}
    if protected_features is not None and len(protected_features) > 0:
        for col in protected_features.columns:
            group_metrics = {}
            for group in protected_features[col].unique():
                mask = protected_features[col] == group
                if mask.sum() == 0:
                    continue
                try:
                    gt = y[mask]
                    pred = y_pred[mask]
                    # Guard against empty slices or single-class in subgroup
                    precision = precision_score(gt, pred) if len(np.unique(gt)) > 1 else float('nan')
                    recall = recall_score(gt, pred) if len(np.unique(gt)) > 1 else float('nan')
                    f1 = f1_score(gt, pred) if len(np.unique(gt)) > 1 else float('nan')
                    group_metrics[group] = {
                        'size': int(mask.sum()),
                        'approval_rate': float(pred.mean()),
                        'precision': precision,
                        'recall': recall,
                        'f1': f1
                    }
                except Exception:
                    group_metrics[group] = {'size': int(mask.sum()), 'approval_rate': float('nan')}
            fairness_metrics[col] = group_metrics
    
    return metrics, fairness_metrics

def log_metrics(metrics, log_dir='logs'):
    os.makedirs(log_dir, exist_ok=True)
    for name, value in metrics.items():
        if name == 'confusion_matrix':
            continue
        path = os.path.join(log_dir, f"{name}.log")
        try:
            with open(path, "a") as f:
                f.write(f"{value}\n")
        except Exception as e:
            print(f"Could not log {name}: {e}")


In [47]:
# Create preprocessing configuration
preprocessing_config = {
    'numerical_features': {
        'strategy': 'standard_scaler',
        'features': [
            feature for feature, metadata in {
                **feature_metadata['given_features'],
                **feature_metadata['derived_features']
            }.items()
            if metadata.feature_type == 'numeric'
        ]
    },
    'categorical_features': {
        'strategy': 'onehot',
        'features': [
            feature for feature, metadata in {
                **feature_metadata['given_features'],
                **feature_metadata['derived_features']
            }.items()
            if metadata.feature_type == 'categorical' and metadata.preprocessing == 'onehot'
        ]
    },
    'excluded_features': [
        feature for feature, metadata in {
            **feature_metadata['given_features'],
            **feature_metadata['derived_features']
        }.items()
        if metadata.preprocessing == 'none'
    ]
}

# Add preprocessing config to tracking
tracking['preprocessing_config'] = preprocessing_config

# Save preprocessing configuration
with open('Data/preprocessing_config.json', 'w') as f:
    json.dump(preprocessing_config, f, indent=2)

print("Preprocessing configuration has been saved to 'preprocessing_config.json'")

Preprocessing configuration has been saved to 'preprocessing_config.json'


In [48]:
# Create preprocessing pipeline function
def create_preprocessing_pipeline(feature_metadata):
    """Create a preprocessing pipeline based on feature metadata"""
    
    # Get numerical and categorical features
    numerical_features = []
    categorical_features = []
    excluded_features = []
    
    # Combine given and derived features
    all_features = {**feature_metadata['given_features'], 
                   **feature_metadata['derived_features']}
    
    # Categorize features based on their type and preprocessing
    for feature, metadata in all_features.items():
        if metadata.preprocessing == 'standard_scaler':
            numerical_features.append(feature)
        elif metadata.preprocessing == 'onehot':
            categorical_features.append(feature)
        elif metadata.preprocessing == 'none':
            excluded_features.append(feature)
    
    # Create transformers with improved handling
    numeric_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),  # Use median for robustness
        ('scaler', RobustScaler())  # Use RobustScaler for better handling of outliers
    ])
    
    # OneHotEncoder arguments vary across sklearn versions; try to be compatible
    try:
        categorical_transformer = Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(drop='first', sparse=False, handle_unknown='ignore'))
        ])
    except TypeError:
        # Fallback for older sklearn versions
        categorical_transformer = Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
        ])
    
    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ],
        remainder='drop'  # Drop any columns not specified
    )
    
    return preprocessor

# Model Execution with Data
Now we'll load the data and execute the model training pipeline with all the tracking configurations set up above.

In [None]:
# Use the processed dataset
print("Using processed dataset...")
df = df_processed.copy()  # Make a copy to avoid modifying original data

# Handle missing values in target variable
print("\nHandling missing values...")
df = df.dropna(subset=['credit_approved'])

# Ensure target is binary numeric
df['credit_approved'] = df['credit_approved'].astype(int)

# Analyze class distribution
print("\nTarget variable distribution:")
target_dist = df['credit_approved'].value_counts(normalize=True)
print(target_dist)

# If target is extremely imbalanced, note it
imbalance_ratio = target_dist.min() / target_dist.max()
print(f"Imbalance ratio (min/max): {imbalance_ratio:.4f}")

# Handle extreme values in numeric features
numeric_features = [f for f in df.columns if df[f].dtype in ['float64', 'int64'] 
                   and f not in ['credit_approved', 'person_id', 'loan_id']]

print("\nHandling extreme values in numeric features...")
for feature in numeric_features:
    if df[feature].isnull().any():
        print(f"Warning: {feature} contains null values, filling with median")
        df[feature] = df[feature].fillna(df[feature].median())
    
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df[feature] = df[feature].clip(lower_bound, upper_bound)


In [None]:

# Separate features and target
try:
    training_features = [f for f in df.columns 
                        if f in given_features + derived_features 
                        and f not in ['person_id', 'loan_id', 'application_date']]
    
    X = df[training_features].copy()
    y = df['credit_approved'].astype(int)  # Ensure target is integer type
    protected_data = df[protected_features].copy()

    # Feature importance-based selection using correlation with target
    feature_correlations = {}
    for feature in training_features:
        if df[feature].dtype in ['float64', 'int64']:
            correlation = df[feature].corr(df['credit_approved'])
            feature_correlations[feature] = abs(correlation)
        else:
            try:
                # For categorical features, use Cramer's V
                contingency = pd.crosstab(df[feature], df['credit_approved'])
                chi2, _ = chi2_contingency(contingency)
                n = contingency.sum().sum()
                correlation = np.sqrt(chi2 / (n * (min(contingency.shape) - 1)))
                feature_correlations[feature] = correlation
            except Exception as e:
                print(f"Warning: Could not calculate correlation for {feature}: {str(e)}")
                feature_correlations[feature] = 0

    # Sort features by importance
    sorted_features = sorted(feature_correlations.items(), key=lambda x: abs(x[1]), reverse=True)
    print("\nTop 10 most important features:")
    for feature, correlation in sorted_features[:10]:
        print(f"{feature}: {correlation:.4f}")

    # Keep only the most important features
    top_k = min(15, len(sorted_features))
    top_features = [feature for feature, _ in sorted_features[:top_k]]  # Keep top features
    X = X[top_features]
    training_features = top_features
    if set(training_features).intersection(protected_features):
        raise ValueError('Protected features included in training features')

    print("\nFeature sets for training:")
    print(f"Training features ({len(training_features)}):")
    for f in training_features:
        print(f"- {f}")
    print(f"\nProtected features ({len(protected_features)}) - not used in training:")
    for f in protected_features:
        print(f"- {f}")

    print("\nData shapes:")
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")
    print(f"protected_data shape: {protected_data.shape}")

    # Check for any remaining missing values
    if X.isnull().any().any():
        print("\nWarning: Some features still contain missing values:")
        print(X.isnull().sum()[X.isnull().sum() > 0])

    # Split the data with stratification
    print("\nSplitting data into train and test sets...")
    X_train, X_test, y_train, y_test, protected_train, protected_test = train_test_split(
        X, y, protected_data, test_size=0.2, random_state=RANDOM_STATE, stratify=y
    )

    # Create preprocessing pipeline
    print("\nCreating preprocessing pipeline...")
    preprocessor = create_preprocessing_pipeline(feature_metadata)

    # Optionally apply SMOTE to training set if available and imbalance is high
    apply_smote = False
    if SMOTE is not None and imbalance_ratio < 0.5:
        apply_smote = True

    if apply_smote:
        print("\nApplying SMOTE to balance classes in training set...")
        # Fit transformer on preprocessed numeric/categorical separately via pipeline
        # For simplicity, apply SMOTE on the numeric representation after preprocessor
        try:
            X_train_pre = preprocessor.fit_transform(X_train)
            sm = SMOTE(random_state=RANDOM_STATE)
            X_res, y_res = sm.fit_resample(X_train_pre, y_train)
            # After SMOTE we won't have column names - proceed with resampled arrays
            use_resampled = True
        except Exception as e:
            print(f"SMOTE failed: {e}")
            use_resampled = False
    else:
        use_resampled = False

    # Dictionary to store results for each solver
    results = {}

    print("\nTraining models with different solvers...")
    for solver in model_config['solvers']:
        print(f"\nTraining with {solver} solver...")
        params = model_config['solver_params'][solver]
        print(f"Solver params: {params}")
        
        # Build compatible parameter grid
        param_grid = {
            'classifier__C': params.get('C', [1.0]),
            'classifier__class_weight': params.get('class_weight', [None])
        }
        # Only include penalty if supported for the solver
        allowed_penalties = []
        if solver == 'saga':
            allowed_penalties = ['l1', 'l2']
        elif solver == 'lbfgs':
            allowed_penalties = ['l2']
        penalties = [p for p in params.get('penalty', []) if p in allowed_penalties]
        if penalties:
            param_grid['classifier__penalty'] = penalties

        try:
            # Build pipeline
            pipeline = Pipeline([
                ('preprocessor', preprocessor),
                ('classifier', LogisticRegression(solver=solver, random_state=RANDOM_STATE, max_iter=params.get('max_iter',[1000])[0]))
            ])

            # If SMOTE used and resampled arrays available, fit directly without GridSearch
            if use_resampled:
                print("Fitting on resampled data (no grid search)...")
                # Fit best estimator directly
                pipeline.fit(X_res, y_res)
                best_model = pipeline
                best_params = {'solver': solver, 'resampled': True}
            else:
                # Perform GridSearchCV
                cv = StratifiedKFold(n_splits=min(model_config['cv_folds'], max(2, int(len(y_train)/10))), shuffle=True, random_state=RANDOM_STATE)
                grid_search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='f1', n_jobs=-1, verbose=1)
                grid_search.fit(X_train, y_train)
                best_model = grid_search
                best_params = grid_search.best_params_

            # Evaluate
            metrics, fairness = evaluate_model(best_model, X_test, y_test, protected_features=protected_test)
            try:
                log_metrics(metrics)
            except Exception as e:
                print(f"Logging metrics failed: {e}")
            metrics = convert_to_json_serializable(metrics)
            fairness = convert_to_json_serializable(fairness)
            best_params_serializable = convert_to_json_serializable(best_params)

            results[solver] = {
                'model': best_model,
                'metrics': metrics,
                'fairness': fairness,
                'best_params': best_params_serializable
            }

            print(f"Stored results for {solver} with f1={results[solver]['metrics'].get('f1')}")

        except Exception as e:
            print(f"Error training with {solver}: {type(e).__name__}: {e}")
            import traceback
            traceback.print_exc()
            continue

    # Fallback: if no logistic models trained, try DecisionTree and Dummy
    if not results:
        print("\nNo logistic models trained successfully, trying fallback models...")
        try:
            # Simple preprocessing using pandas get_dummies for categorical
            X_simple = pd.get_dummies(X, drop_first=True)
            X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_simple, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y)

            # Decision Tree fallback
            dt = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=6)
            dt.fit(X_train_s, y_train_s)
            dt_metrics, dt_fairness = evaluate_model(dt, X_test_s, y_test_s, protected_features=protected_test.reset_index(drop=True))

            # Dummy classifier baseline
            dummy = DummyClassifier(strategy='most_frequent')
            dummy.fit(X_train_s, y_train_s)
            dummy_metrics, _ = evaluate_model(dummy, X_test_s, y_test_s, protected_features=None)

            results['decision_tree_fallback'] = {'model': dt, 'metrics': convert_to_json_serializable(dt_metrics), 'fairness': convert_to_json_serializable(dt_fairness), 'best_params': {'max_depth': 6}}
            results['dummy_fallback'] = {'model': dummy, 'metrics': convert_to_json_serializable(dummy_metrics), 'fairness': {}, 'best_params': {'strategy': 'most_frequent'}}

            print("Fallback models trained and evaluated.")
        except Exception as e:
            print(f"Fallback training also failed: {type(e).__name__}: {e}")
            import traceback
            traceback.print_exc()
            raise Exception("No models could be trained (including fallbacks)")

    if not results:
        raise Exception("No models were successfully trained. Check logs above.")

    # Find best model based on F1 score (handle NaN)
    def safe_f1(s):
        try:
            v = results[s]['metrics'].get('f1')
            return -1.0 if v is None or (isinstance(v, float) and np.isnan(v)) else float(v)
        except Exception:
            return -1.0

    best_solver = max(results.keys(), key=safe_f1)
    best_model = results[best_solver]['model']

    print(f"\nBest model is {best_solver} solver")
    print("\nPerformance metrics:")
    for metric, value in results[best_solver]['metrics'].items():
        if metric not in ['confusion_matrix', 'optimal_threshold']:
            try:
                print(f"{metric}: {value:.4f}")
            except Exception:
                print(f"{metric}: {value}")
    print(f"Optimal threshold: {results[best_solver]['metrics']['optimal_threshold']}")

    print("\nFairness metrics for protected features:")
    for feature, metrics in results[best_solver]['fairness'].items():
        print(f"\n{feature}:")
        for group, values in metrics.items():
            print(f"  {group}:")
            for metric, value in values.items():
                print(f"    {metric}: {value}")

    # Save best model and results
    model_artifacts = {
        'model_metadata': tracking,
        'best_model_info': {
            'solver': best_solver,
            'metrics': results[best_solver]['metrics'],
            'fairness': results[best_solver]['fairness'],
            'best_params': results[best_solver]['best_params'],
            'training_features': training_features,
            'protected_features': protected_features
        }
    }

    with open('Data/model_artifacts.json', 'w') as f:
        json.dump(model_artifacts, f, indent=2)

    # Save the model itself
    try:
        joblib.dump(best_model, 'Data/best_model.joblib')
        print("Saved best model to Data/best_model.joblib")
    except Exception as e:
        print(f"Could not save model: {e}")

    print("\nModel artifacts and best model have been saved to the Data folder")

except Exception as e:
    print(f"Error in model training pipeline: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()
    raise


Top 10 most important features:
loan_amount_inr: 0.1075
total_experience_months: 0.0685
loan_tenure_years: 0.0664
person_emp_exp_months: 0.0571
person_income_annual_inr: 0.0496
credit_utilization_ratio: 0.0426
number_of_open_credit_lines: 0.0239
number_of_jobs_switched: 0.0220
number_of_hard_inquiries: 0.0049
loan_int_rate: 0.0002

Feature sets for training:
Training features (15):
- loan_amount_inr
- total_experience_months
- loan_tenure_years
- person_emp_exp_months
- person_income_annual_inr
- credit_utilization_ratio
- number_of_open_credit_lines
- number_of_jobs_switched
- number_of_hard_inquiries
- loan_int_rate
- loan_purpose
- months_since_last_delinquency
- number_of_public_records
- previous_loan_defaults_on_file
- co_applicant_income

Protected features (5) - not used in training:
- person_gender
- person_age
- marital_status
- education_level
- geographic_region

Data shapes:
X shape: (110000, 15)
y shape: (110000,)
protected_data shape: (110000, 5)

Splitting data into tr

  c /= stddev[:, None]
  c /= stddev[None, :]
  c /= stddev[:, None]
  c /= stddev[None, :]


Error training with saga: ValueError: 
All the 40 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'credit_score'

The above exception was the dir

Traceback (most recent call last):
  File "/var/folders/6y/6bdr99fj10b8r3lfc1hyq1_40000gn/T/ipykernel_41243/3085048135.py", line 134, in <module>
    grid_search.fit(X_train, y_train)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1024, in fit
    self._run_search(evaluate_candidates)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1571, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1001, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Error training with lbfgs: ValueError: 
All the 20 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'credit_score'

The above exception was the di

Traceback (most recent call last):
  File "/var/folders/6y/6bdr99fj10b8r3lfc1hyq1_40000gn/T/ipykernel_41243/3085048135.py", line 134, in <module>
    grid_search.fit(X_train, y_train)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1024, in fit
    self._run_search(evaluate_candidates)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1571, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/lib/python3.13/site-packages/sklearn/model_selection/_search.py", line 1001, in evaluate_candidates
    _warn_or_raise_about_fit_failures(out, self.error_score)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fallback models trained and evaluated.

Best model is decision_tree_fallback solver

Performance metrics:
roc_auc: 0.6313
avg_precision: 0.1830
precision: 0.1693
recall: 0.7966
f1: 0.2793
brier: 0.1138
Optimal threshold: 0.13434163701067617

Fairness metrics for protected features:

person_gender:
  Female:
    size: 9944
    approval_rate: nan
  Male:
    size: 12056
    approval_rate: nan

person_age:
  18y 11m:
    size: 49
    approval_rate: nan
  38y 8m:
    size: 36
    approval_rate: nan
  58y 10m:
    size: 35
    approval_rate: nan
  47y 7m:
    size: 37
    approval_rate: nan
  20y 1m:
    size: 42
    approval_rate: nan
  64y 2m:
    size: 30
    approval_rate: nan
  49y 11m:
    size: 30
    approval_rate: nan
  16y 11m:
    size: 58
    approval_rate: nan
  30y 1m:
    size: 29
    approval_rate: nan
  40y 7m:
    size: 34
    approval_rate: nan
  32y 7m:
    size: 40
    approval_rate: nan
  53y 10m:
    size: 45
    approval_rate: nan
  16y 7m:
    size: 56
    approval_

Exception ignored in: <function ResourceTracker.__del__ at 0x1026ddbc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x106f3dbc0>
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 82, in __del__
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 91, in _stop
  File "/opt/anaconda3/lib/python3.13/multiprocessing/resource_tracker.py", line 116, in _stop_locked
ChildProcessError: [Errno 10] No child processes
Exception ignored in: <function ResourceTracker.__del__ at 0x102ba1bc0>
Traceback (most recent call last