<h2> Predictive Modelling </h2>

<h3> Importing Libraries and Loading Data </h3>

First, we'll import all the necessary libraries for data manipulation, feature engineering, modeling, and evaluation. We'll also load the cleaned and transformed dataset prepared in the previous notebooks.

In [5]:
import sys

import numpy as np
import pandas as pd
from scipy import stats
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.over_sampling import ADASYN, SMOTE
from lightgbm import LGBMClassifier
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

<h4> Feature Engineering </h4>

In this step, we'll create new, insightful features from the existing data. This process, known as feature engineering, can significantly improve the performance of our predictive models by providing them with more relevant information. We will create features such as:

- credit_history_length_days: The length of the borrower's credit history in days, measured from the earliest credit line to the loan's issue date.
- loan_to_income_ratio: The ratio of the loan amount to the borrower's annual income.
- instalment_to_income_ratio: The ratio of the loan's monthly instalment to the borrower's monthly income (annual income divided by 12).
- open_account_ratio: The ratio of the borrower's open accounts to their total accounts.

In [6]:
df = pd.read_csv('loan_payments_versions/loan_payments_post_null_imputation.csv')
df = df.drop(['id', 'member_id'], axis=1)

df['earliest_credit_line'] = pd.to_datetime(df['earliest_credit_line'])
df['issue_date'] = pd.to_datetime(df['issue_date'])
df['credit_history_length_days'] = (df['issue_date'] - df['earliest_credit_line']).dt.days

df['loan_to_income_ratio'] = df['loan_amount'] / df['annual_inc']

df['open_account_ratio'] = df['open_accounts'] / df['total_accounts']
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which pandas warns will stop working in pandas 3.0.
df['open_account_ratio'] = df['open_account_ratio'].fillna(0)

df['instalment_to_income_ratio'] = (
    df['instalment'] / (df['annual_inc']/12)
).replace([np.inf, -np.inf], np.nan)

df['instalment_to_income_ratio'] = df['instalment_to_income_ratio'].fillna(
    df['instalment_to_income_ratio'].median()
)

df['earliest_credit_line'] = df['earliest_credit_line'].astype(str)
df['issue_date'] = df['issue_date'].astype(str)


In [7]:
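def skew_transformations(df: pd.DataFrame, skew_threshold: float = 0.5, tolerance: float = 0.05, recommend: bool = True, transform: bool = False):
    """Recommend and optionally apply Box-Cox or Yeo-Johnson transforms to skewed numeric columns.

    Columns whose absolute skew is at or below `skew_threshold` are left alone.
    With `transform=True` the input DataFrame is modified in place and returned.
    """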
    recommendations = {
        'box_cox': [],
        'yeo_johnson': [],
        'no_transform': []
    }
    
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

    for col in numeric_cols:
        original_skew = df[col].skew()

        if abs(original_skew) <= skew_threshold:
            continue

        has_non_positive = (df[col] <= 0).any()
        
        yj_transformed, _ = stats.yeojohnson(df[col])
        skew_yj = pd.Series(yj_transformed).skew()

        skew_bc = np.inf
        if not has_non_positive:
            bc_transformed, _ = stats.boxcox(df[col])
            skew_bc = pd.Series(bc_transformed).skew()

        best_transform_skew = min(abs(skew_bc), abs(skew_yj))

        if best_transform_skew < skew_threshold:
            if abs(skew_bc) <= abs(skew_yj) + tolerance:
                recommendations['box_cox'].append(col)
            else:
                recommendations['yeo_johnson'].append(col)
        else:
            recommendations['no_transform'].append(col)

    if recommend:
        print("Columns recommended for Box-Cox transformation:")
        if recommendations['box_cox']:
            for col in recommendations['box_cox']:
                print(f"  - {col}")
        else:
            print("  - None")

        print("\nColumns recommended for Yeo-Johnson transformation:")
        if recommendations['yeo_johnson']:
            for col in recommendations['yeo_johnson']:
                print(f"  - {col}")
        else:
            print("  - None")

        print("\nSkewed columns where no transformation was effective:")
        if recommendations['no_transform']:
            for col in recommendations['no_transform']:
                print(f"  - {col}")
        else:
            print("  - None")

    if transform:
        for col in recommendations['box_cox']:
            df[col], _ = stats.boxcox(df[col])
            print(f"Applied Box-Cox transformation to '{col}'.")
        for col in recommendations['yeo_johnson']:
            df[col], _ = stats.yeojohnson(df[col])
            print(f"Applied Yeo-Johnson transformation to '{col}'.")

    return df
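
To make the two transformations more concrete, here is a minimal sketch on synthetic data. It assumes a log-normal sample standing in for a right-skewed, strictly positive column; the variable names are illustrative and not part of the notebook. Box-Cox requires strictly positive values, which is why the function above only tries it when a column has no zeros or negatives, while Yeo-Johnson accepts any real values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed, strictly positive sample (an assumption for illustration)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)

# Both functions return the transformed values and the fitted lambda
bc_values, bc_lambda = stats.boxcox(sample)
yj_values, yj_lambda = stats.yeojohnson(sample)

skew_before = stats.skew(sample)
skew_bc = stats.skew(bc_values)
skew_yj = stats.skew(yj_values)

print(f"Skew before:            {skew_before:.3f}")
print(f"Skew after Box-Cox:     {skew_bc:.3f} (lambda={bc_lambda:.3f})")
print(f"Skew after Yeo-Johnson: {skew_yj:.3f} (lambda={yj_lambda:.3f})")
```

Comparing the absolute skew after each transform mirrors the selection rule used in `skew_transformations`.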

In [8]:
skew_transformations(df,transform=True)

Columns recommended for Box-Cox transformation:
  - loan_amount
  - funded_amount
  - instalment
  - annual_inc
  - open_accounts
  - total_accounts
  - total_payment
  - total_rec_int
  - credit_history_length_days
  - loan_to_income_ratio
  - open_account_ratio
  - instalment_to_income_ratio

Columns recommended for Yeo-Johnson transformation:
  - funded_amount_inv
  - inq_last_6mths
  - total_payment_inv
  - total_rec_prncp
  - last_payment_amount

Skewed columns where no transformation was effective:
  - delinq_2yrs
  - out_prncp
  - out_prncp_inv
  - total_rec_late_fee
  - recoveries
  - collection_recovery_fee
  - collections_12_mths_ex_med
Applied Box-Cox transformation to 'loan_amount'.
Applied Box-Cox transformation to 'funded_amount'.
Applied Box-Cox transformation to 'instalment'.
Applied Box-Cox transformation to 'annual_inc'.
Applied Box-Cox transformation to 'open_accounts'.
Applied Box-Cox transformation to 'total_accounts'.
Applied Box-Cox transformation to 'total_payme

Unnamed: 0,loan_amount,funded_amount,funded_amount_inv,term,int_rate,instalment,grade,sub_grade,employment_length,home_ownership,...,last_payment_date,last_payment_amount,last_credit_pull_date,collections_12_mths_ex_med,policy_code,application_type,credit_history_length_days,loan_to_income_ratio,open_account_ratio,instalment_to_income_ratio
0,68.067206,66.565134,118.808918,36 months,7.49,16.250750,A,A4,5 years,MORTGAGE,...,2022-01,4.910558,2022-01,0.0,1,INDIVIDUAL,6.439166,-1.170138,-0.699891,-1.517138
1,82.085119,80.149885,148.884013,36 months,6.99,19.755785,A,A3,9 years,RENT,...,2022-01,5.293713,2022-01,0.0,1,INDIVIDUAL,6.192767,-0.975149,-0.636054,-1.396256
2,88.179744,86.049365,162.331735,36 months,7.49,21.350927,A,A4,8 years,MORTGAGE,...,2021-10,7.763081,2021-10,0.0,1,INDIVIDUAL,6.259452,-1.072907,-0.796190,-1.455587
3,86.087774,84.024811,157.691813,36 months,14.31,21.634826,C,C4,1 year,RENT,...,2021-06,7.815112,2021-06,0.0,1,INDIVIDUAL,5.989336,-0.806591,-0.671795,-1.249544
4,86.087774,84.024811,157.691813,36 months,6.03,20.649467,A,A1,10+ years,MORTGAGE,...,2022-01,5.380851,2022-01,0.0,1,INDIVIDUAL,6.178572,-1.362764,-0.674296,-1.644080
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54095,30.698122,30.206701,0.000000,36 months,16.08,6.929430,F,F2,< 1 year,RENT,...,2016-03,5.507457,2016-03,0.0,1,INDIVIDUAL,6.001760,-1.644617,-0.465160,-1.805182
54096,61.087638,59.792063,33.294103,36 months,9.64,14.650246,B,B4,1 year,RENT,...,2016-12,4.712596,2016-12,0.0,1,INDIVIDUAL,5.843142,-0.847326,0.000000,-1.301705
54097,52.402161,51.353703,56.261077,36 months,7.75,12.241126,A,A3,1 year,OWN,...,2016-09,4.371804,2016-08,0.0,1,INDIVIDUAL,6.001760,-1.498827,-0.272802,-1.725990
54098,59.115307,57.876854,99.876726,36 months,13.16,14.434660,C,C3,< 1 year,RENT,...,2016-10,4.057488,2021-04,0.0,1,INDIVIDUAL,6.473221,-0.979261,-0.176267,-1.369149


In [9]:
def drop_outlier_rows(df: pd.DataFrame, column_name: str, z_score_threshold: float) -> pd.DataFrame:
    """Return only the rows whose absolute z-score in `column_name` is below the threshold."""
    mean = df[column_name].mean()
    std = df[column_name].std(ddof=0)  # population standard deviation, as np.std computes
    z_scores = (df[column_name] - mean) / std
    mask = z_scores.abs() < z_score_threshold
    return df[mask]
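
As a small illustration of the z-score rule, the sketch below applies the same calculation to a toy array. The values are made up for the example: twenty ordinary observations and one extreme value.

```python
import numpy as np

# Toy data (hypothetical): the values 1..20 plus one extreme observation
values = np.array(list(range(1, 21)) + [1000], dtype=float)

# z-score: distance from the mean in units of (population) standard deviation
z_scores = (values - values.mean()) / values.std()

# Keep only observations within 3 standard deviations of the mean
keep = np.abs(z_scores) < 3
filtered = values[keep]

print(f"Largest |z|: {np.abs(z_scores).max():.2f}")
print(f"Kept {filtered.size} of {values.size} values")
```

Because the mean and standard deviation are recomputed on whatever rows remain, applying the filter column by column (as in the next cell) makes the result depend on the order of the columns.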

In [10]:
outlier_columns = ['loan_amount', 'funded_amount', 'funded_amount_inv', 'int_rate', 'instalment', 'annual_inc', 'dti', 'open_accounts', 'total_accounts', 'total_payment', 'total_payment_inv', 'total_rec_prncp', 'total_rec_int', 'last_payment_amount']

print(f'Before: The DataFrame has {df.shape[0]} rows.') 

for column in outlier_columns: 
    df = drop_outlier_rows(df, column, 3) 
    
print(f'After: The DataFrame has {df.shape[0]} rows.') 

Before: The DataFrame has 54100 rows.
After: The DataFrame has 52903 rows.


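With the new features in place, the dataset has been re-checked for skewness and outliers: skewed numeric columns were transformed with Box-Cox or Yeo-Johnson where that helped, and rows lying three or more standard deviations from the mean in any of the listed columns were dropped (54,100 rows before, 52,903 after).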

<h3> Data Preparation for Modeling </h3>

Before we can train our models, we need to prepare the data. This involves several key steps:
- Defining the Target Variable: <br> We will define a binary target (1 for a good loan, 0 for a bad loan) by recoding the loan_status column. Only loans with a final outcome (fully paid or charged off) are kept. This is the variable our models will learn to predict.
- Removal of Leaky/Redundant Columns: <br> We will drop columns such as total_rec_prncp and total_rec_int. These features are considered "leaky" because they contain information about the loan's outcome that would not be available at the time of prediction, which can lead to artificially inflated and unrealistic model performance.

In [11]:
df = pd.read_csv('loan_payments_versions/loan_payments_transformed.csv')
df = df.drop(['id', 'member_id'], axis=1)

good_loan_statuses = [
    "Fully Paid",
    "Does not meet the credit policy. Status:Fully Paid",
]
bad_loan_statuses = [
    "Charged Off",
    "Does not meet the credit policy. Status:Charged Off",
]

historical_df = df[
    df["loan_status"].isin(good_loan_statuses + bad_loan_statuses)
].copy()
historical_df["loan_status"] = historical_df["loan_status"].apply(
    lambda x: 1 if x in good_loan_statuses else 0
)
leaky_columns = [
    'last_payment_date',
    'last_payment_amount',
    'last_credit_pull_date',
    'recoveries',
    'collection_recovery_fee',
    'total_payment',
    'total_rec_prncp',
    'total_rec_int',
    'total_rec_late_fee',
]

historical_df = historical_df.drop(columns=leaky_columns)
categorical_cols = historical_df.select_dtypes(include='object').columns.tolist()
print(categorical_cols)

['term', 'grade', 'sub_grade', 'employment_length', 'home_ownership', 'verification_status', 'issue_date', 'payment_plan', 'purpose', 'earliest_credit_line', 'application_type']


- One-hot Encoding: <br> Machine learning models require numerical input. We'll convert our categorical features (such as purpose and home_ownership) into a numerical format using one-hot encoding.


In [12]:
historical_df_encoded = pd.get_dummies(historical_df,columns=categorical_cols, drop_first=True)

- Data Splitting: <br> We will split our dataset into a training set and a testing set. The model will learn from the training set, and we will evaluate its performance on the unseen testing set to check that it generalizes well. The split itself is performed at the start of each modeling iteration.
- Feature Scaling: <br> We'll scale our numerical features so that they are on a similar scale. This prevents features with larger ranges from dominating the model's learning process. We will use StandardScaler for this.
- Dimensionality Reduction with PCA: <br> To reduce the complexity of our data and potentially improve model performance, we'll use Principal Component Analysis (PCA). PCA will transform our features into a smaller set of uncorrelated components; we keep the smallest number of components that explains at least 95% of the original data's variance.


In [13]:
X = historical_df_encoded.drop('loan_status', axis=1)
Y = historical_df_encoded['loan_status']
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

explained_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(explained_variance >= 0.95) + 1

print(f"Number of components that explain at least 95% of variance: {n_components}")

Number of components that explain at least 95% of variance: 633
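
Note that the cell above fits StandardScaler and PCA on the full dataset before the train/test split, so statistics from the test rows influence the transformation. One possible alternative, sketched here under assumptions, is to wrap the steps in a scikit-learn Pipeline so that scaling and PCA are fitted on the training split only. The data is synthetic (from make_classification) and the names X_demo, y_demo and pipeline are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the loan data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    weights=[0.2, 0.8], random_state=42,
)

# Split first, so the test rows never influence the scaler or the PCA
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps the fewest components explaining that share of variance
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("clf", LogisticRegression(max_iter=1000, random_state=42)),
])

pipeline.fit(X_tr, y_tr)  # scaler and PCA are fitted on the training split only

n_kept = pipeline.named_steps["pca"].n_components_
test_accuracy = pipeline.score(X_te, y_te)
print(f"Components kept: {n_kept}, test accuracy: {test_accuracy:.4f}")
```

Passing a float to n_components also replaces the manual cumulative-variance search with a single argument.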


<h3> Modeling Iteration 1: Baseline Models on Imbalanced Data </h3>

Our first step is to establish a performance baseline. We will train our standard classification models on the imbalanced training data, without any rebalancing.

In [14]:
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_scaled)

X_train, X_test, y_train, y_test = train_test_split(X_pca, Y, test_size=0.3, random_state=42, stratify=Y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42, n_jobs=-1),
    "LightGBM": LGBMClassifier(random_state=42)
}

for name, model in models.items():
    print(f"--- Training and Evaluating {name} ---")
    
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_test)
    
    print(f"\nAccuracy Score: {accuracy_score(y_test, predictions):.4f}")
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, predictions))
    print("\nClassification Report:")
    print(classification_report(y_test, predictions))
    print("====================================================\n")


--- Training and Evaluating Logistic Regression ---

Accuracy Score: 0.8846

Confusion Matrix:
[[ 619 1066]
 [  72 8101]]

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.37      0.52      1685
           1       0.88      0.99      0.93      8173

    accuracy                           0.88      9858
   macro avg       0.89      0.68      0.73      9858
weighted avg       0.89      0.88      0.86      9858


--- Training and Evaluating Random Forest ---

Accuracy Score: 0.8284

Confusion Matrix:
[[  29 1656]
 [  36 8137]]

Classification Report:
              precision    recall  f1-score   support

           0       0.45      0.02      0.03      1685
           1       0.83      1.00      0.91      8173

    accuracy                           0.83      9858
   macro avg       0.64      0.51      0.47      9858
weighted avg       0.77      0.83      0.76      9858


--- Training and Evaluating LightGBM ---
[LightGBM] [Info]

Loan datasets are typically imbalanced, and this one is no exception: the test set contains 8,173 good loans and only 1,685 bad ones (about 17%). A naive model that always predicts the majority class ("good loan") would already reach roughly 83% accuracy. The baseline results show exactly this problem: accuracy looks high, but recall for the minority class (bad loans) is poor, at 0.37 for Logistic Regression and 0.02 for Random Forest, so the models fail at their primary goal of identifying risky loans.
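
The gap between accuracy and minority-class recall can be checked directly from a confusion matrix. The sketch below reuses the Logistic Regression baseline matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Logistic Regression baseline confusion matrix from the output above
# (rows = actual class, columns = predicted class; class 0 = bad loan)
cm = np.array([[619, 1066],
               [72, 8101]])

support = cm.sum(axis=1)                       # actual count per class
accuracy = np.trace(cm) / cm.sum()             # correct predictions / all predictions
majority_baseline = support.max() / cm.sum()   # accuracy of always predicting "good loan"
bad_loan_recall = cm[0, 0] / support[0]        # share of bad loans actually caught

print(f"Model accuracy:          {accuracy:.4f}")
print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Recall for bad loans:    {bad_loan_recall:.4f}")
```

The model beats the always-"good" baseline by only a few percentage points of accuracy while missing most bad loans, which is why recall and F1 for class 0 are the metrics to watch in the iterations that follow.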

<h3> Modeling Iteration 2: Addressing Imbalance with Class Weights </h3>

Our next attempt to improve the model involves using a simple yet effective technique: class weighting.

Many scikit-learn models, as well as LightGBM's scikit-learn wrapper, accept a class_weight='balanced' parameter. It weights each class inversely to its frequency in the training data, so that misclassifying a minority-class sample (a bad loan) incurs a higher penalty in the loss function. In essence, it tells the model, "Pay more attention to getting the bad loans right, even if it means making a few more mistakes on the good loans."
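
For reference, scikit-learn documents the 'balanced' heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below evaluates that formula with the class counts from the test-set support shown in the reports above (1,685 bad, 8,173 good). Using these counts is an assumption for illustration; because the split is stratified, the training set has the same class proportions and therefore essentially the same weights.

```python
import numpy as np

# Class counts taken from the test-set support column (class 0 = bad, class 1 = good)
class_counts = np.array([1685, 8173])

n_samples = class_counts.sum()
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_each_class)
balanced_weights = n_samples / (n_classes * class_counts)

for label, weight in zip((0, 1), balanced_weights):
    print(f"class {label}: weight {weight:.3f}")
```

With these proportions, an error on a bad loan costs almost five times as much as an error on a good loan.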

In [15]:
# Address the class imbalance that caused the very low minority-class recall in the baseline

X_train, X_test, y_train, y_test = train_test_split(X_pca, Y, test_size=0.3, random_state=42, stratify=Y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced'),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, class_weight='balanced'),
    "LightGBM": LGBMClassifier(random_state=42, class_weight='balanced')
}

for name, model in models.items():
    print(f"--- Training and Evaluating {name} (Balanced) ---")
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"\nAccuracy Score: {accuracy_score(y_test, predictions):.4f}")
    print("\nConfusion Matrix:")
    print(confusion_matrix(y_test, predictions))
    print("\nClassification Report:")
    print(classification_report(y_test, predictions))
    print("====================================================\n")

--- Training and Evaluating Logistic Regression (Balanced) ---

Accuracy Score: 0.8842

Confusion Matrix:
[[1349  336]
 [ 806 7367]]

Classification Report:
              precision    recall  f1-score   support

           0       0.63      0.80      0.70      1685
           1       0.96      0.90      0.93      8173

    accuracy                           0.88      9858
   macro avg       0.79      0.85      0.82      9858
weighted avg       0.90      0.88      0.89      9858


--- Training and Evaluating Random Forest (Balanced) ---

Accuracy Score: 0.8295

Confusion Matrix:
[[  12 1673]
 [   8 8165]]

Classification Report:
              precision    recall  f1-score   support

           0       0.60      0.01      0.01      1685
           1       0.83      1.00      0.91      8173

    accuracy                           0.83      9858
   macro avg       0.71      0.50      0.46      9858
weighted avg       0.79      0.83      0.75      9858


--- Training and Evaluating LightGBM

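All the models were retrained with this parameter so that the new classification reports can be compared to our baseline. For Logistic Regression, recall for bad loans rose from 0.37 to 0.80 while overall accuracy barely changed (0.8846 to 0.8842), at the cost of lower bad-loan precision (0.90 to 0.63). Random Forest did not benefit: its bad-loan recall stayed near zero (0.01). Since the goal is to catch risky loans, giving up some precision for much higher recall is a worthwhile trade-off.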

<h3> Modeling Iteration 3: Comparative Analysis of Resampling Techniques Across Multiple Models </h3>

This section implements a systematic experiment to determine the most effective strategy for handling our imbalanced dataset. The goal is to compare how different over-sampling techniques perform when paired with various classification algorithms.

Resamplers Used: <br>
- SMOTE: Creates new minority class samples by interpolating between existing minority class neighbors.
- ADASYN: Generates more synthetic samples for minority class instances that are harder to learn, focusing on those near the decision boundary.
- SMOTE-Tomek: A hybrid method that first uses SMOTE to create synthetic minority samples and then removes Tomek links (pairs of nearest neighbors from opposite classes) to clean up noise.
- SMOTE-ENN: Another hybrid method that first uses SMOTE to oversample the minority class and then uses Edited Nearest Neighbours to remove samples whose class disagrees with the majority of their nearest neighbours.
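
The interpolation step that SMOTE and its variants share can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: it always uses the single nearest minority neighbour, whereas SMOTE picks at random among the k nearest. The toy points are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in two dimensions (hypothetical values)
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0],
                     [3.0, 2.5]])

i = 0  # index of the point we oversample around

# Find the nearest other minority point (simplification of the k-nearest search)
distances = np.linalg.norm(minority - minority[i], axis=1)
distances[i] = np.inf  # exclude the point itself
neighbour = minority[np.argmin(distances)]

# New sample = random point on the segment between the point and its neighbour
gap = rng.uniform(0.0, 1.0)
synthetic = minority[i] + gap * (neighbour - minority[i])

print(f"Base point: {minority[i]}, neighbour: {neighbour}")
print(f"Synthetic sample (gap={gap:.3f}): {synthetic}")
```

Because every synthetic sample lies between two existing minority points, SMOTE fills in the minority region rather than duplicating rows.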

Models Used: <br>
- Logistic Regression: A reliable and interpretable linear model.
- Random Forest: A powerful ensemble model based on decision trees.
- LightGBM: A highly efficient and often top-performing gradient-boosting model.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X_pca, Y, test_size=0.3, random_state=42, stratify=Y)

samplers = {
    'SMOTE':SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "SMOTE-Tomek": SMOTETomek(random_state=42),
    "SMOTE-ENN": SMOTEENN(random_state=42)
}

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    "LightGBM": LGBMClassifier(random_state=42),
}

for sampler_name, sampler in samplers.items():
    print(f"================== Evaluating with {sampler_name} ==================")
    
    print(f"Applying {sampler_name}...")
    X_train_resampled, y_train_resampled = sampler.fit_resample(X_train, y_train)
    print("Resampling complete.")
    print("-" * 30)
    
    for model_name, model in models.items():
        print(f"--- Training and Evaluating {model_name} (with {sampler_name}) ---")
        
        model.fit(X_train_resampled, y_train_resampled)
        
        predictions = model.predict(X_test)
        
        print(f"\nAccuracy Score: {accuracy_score(y_test, predictions):.4f}")
        print("\nConfusion Matrix:")
        print(confusion_matrix(y_test, predictions))
        print("\nClassification Report:")
        print(classification_report(y_test, predictions))
        print("----------------------------------------------------\n")

Applying SMOTE...
Resampling complete.
------------------------------
--- Training and Evaluating Logistic Regression (with SMOTE) ---

Accuracy Score: 0.9066

Confusion Matrix:
[[1337  348]
 [ 573 7600]]

Classification Report:
              precision    recall  f1-score   support

           0       0.70      0.79      0.74      1685
           1       0.96      0.93      0.94      8173

    accuracy                           0.91      9858
   macro avg       0.83      0.86      0.84      9858
weighted avg       0.91      0.91      0.91      9858

----------------------------------------------------

--- Training and Evaluating Random Forest (with SMOTE) ---

Accuracy Score: 0.8039

Confusion Matrix:
[[ 225 1460]
 [ 473 7700]]

Classification Report:
              precision    recall  f1-score   support

           0       0.32      0.13      0.19      1685
           1       0.84      0.94      0.89      8173

    accuracy                           0.80      9858
   macro avg       

Based on the output, we can draw some clear conclusions about which models and resampling techniques are most effective for our loan prediction task. The key to success here is not just overall accuracy but the model's ability to correctly identify the minority class (bad loans, labeled as 0).

The optimal strategy for predicting loan defaults is a Logistic Regression model trained on data balanced with the SMOTE-Tomek resampling technique. This combination achieved the best F1-score for bad loans (0.78) while maintaining the highest overall accuracy (92%). Among the resampling methods tested, SMOTE-Tomek's hybrid approach of creating synthetic samples and then cleaning noisy examples appears to have given the linear model the cleanest decision boundary. In contrast, the more complex tree-based models, Random Forest and LightGBM, consistently failed to identify the minority class effectively, and the aggressive resampling of SMOTE-ENN drastically hurt precision. This makes Logistic Regression with SMOTE-Tomek the most reliable and balanced solution.
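
A comparison like this is easier to audit when the minority-class metrics are collected into one table instead of being read off individual reports. The sketch below does this for the two SMOTE confusion matrices printed above; the dictionary keys and column names are illustrative, and other runs could be added in the same way.

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the SMOTE outputs above
# (rows = actual class, columns = predicted class; class 0 = bad loan)
confusion_matrices = {
    "Logistic Regression + SMOTE": np.array([[1337, 348], [573, 7600]]),
    "Random Forest + SMOTE": np.array([[225, 1460], [473, 7700]]),
}

rows = []
for run, cm in confusion_matrices.items():
    caught = cm[0, 0]        # bad loans correctly flagged
    missed = cm[0, 1]        # bad loans predicted as good
    false_alarm = cm[1, 0]   # good loans predicted as bad
    precision = caught / (caught + false_alarm)
    recall = caught / (caught + missed)
    rows.append({
        "run": run,
        "accuracy": np.trace(cm) / cm.sum(),
        "precision_bad": precision,
        "recall_bad": recall,
        "f1_bad": 2 * precision * recall / (precision + recall),
    })

# Sort so the best minority-class F1 appears first
summary = pd.DataFrame(rows).set_index("run").sort_values("f1_bad", ascending=False)
print(summary.round(2))
```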