PROJECT 1 : LOAN DEFAULT RISK (LENDING CLUB)

The objective of this project is to build a credit risk model that predicts the probability of loan default.
The model is designed to support credit approval decisions, risk segmentation, and portfolio risk management.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

1. Data Preparation:

In [3]:
df = pd.read_csv("lending_club_loan_two.csv")

df.head()



Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,...,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address
0,10000.0,36 months,11.44,329.48,B,B4,Marketing,10+ years,RENT,117000.0,...,16.0,0.0,36369.0,41.8,25.0,w,INDIVIDUAL,0.0,0.0,"0174 Michelle Gateway\r\nMendozaberg, OK 22690"
1,8000.0,36 months,11.99,265.68,B,B5,Credit analyst,4 years,MORTGAGE,65000.0,...,17.0,0.0,20131.0,53.3,27.0,f,INDIVIDUAL,3.0,0.0,"1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113"
2,15600.0,36 months,10.49,506.97,B,B3,Statistician,< 1 year,RENT,43057.0,...,13.0,0.0,11987.0,92.2,26.0,f,INDIVIDUAL,0.0,0.0,"87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113"
3,7200.0,36 months,6.49,220.65,A,A2,Client Advocate,6 years,RENT,54000.0,...,6.0,0.0,5472.0,21.5,13.0,f,INDIVIDUAL,0.0,0.0,"823 Reid Ford\r\nDelacruzside, MA 00813"
4,24375.0,60 months,17.27,609.33,C,C5,Destiny Management Inc.,9 years,MORTGAGE,55000.0,...,13.0,0.0,24584.0,69.8,43.0,f,INDIVIDUAL,1.0,0.0,"679 Luna Roads\r\nGreggshire, VA 11650"


We start by selecting a limited set of finance-relevant features such as loan amount, interest rate, income,
credit grade, debt-to-income ratio, and revolving credit utilization.
This step ensures the model remains interpretable and aligned with real credit risk practices.

In [4]:
selected_columns = [
    "loan_amnt",
    "term",
    "int_rate",
    "grade",
    "sub_grade",
    "annual_inc",
    "home_ownership",
    "purpose",
    "dti",
    "open_acc",
    "revol_util",
    "mort_acc",
    "loan_status"
]
# .copy() avoids a SettingWithCopyWarning when columns are added later
df = df[selected_columns].copy()
df.shape


(396030, 13)

2. Target Variable Definition:

The target variable is loan default.
Loans with status 'Charged Off' or 'Default' are labeled as default (1); all other loans are labeled as non-default (0).
In this dataset only 'Fully Paid' and 'Charged Off' occur, so the 'Default' check is a safeguard and every non-default loan is fully paid.
The observed default rate is approximately 19.6% (77,673 of 396,030 loans).

The next cells inspect "loan_status" and convert it into a binary default flag.

In [5]:
df["loan_status"].value_counts()

loan_status
Fully Paid     318357
Charged Off     77673
Name: count, dtype: int64

In [6]:
df["default"] = np.where(
    df["loan_status"].isin(["Charged Off", "Default"]),
    1,
    0
)

In [7]:
df["default"].value_counts(normalize=True)

default
0    0.803871
1    0.196129
Name: proportion, dtype: float64
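
The np.where cell above silently labels any status outside the two listed values as non-default. A minimal sketch of a stricter mapping that fails loudly on unexpected statuses is shown below. The helper name make_default_flag and the toy series are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Explicit mapping; "Default" is kept for completeness although this file
# only contains "Fully Paid" and "Charged Off".
STATUS_TO_DEFAULT = {"Fully Paid": 0, "Charged Off": 1, "Default": 1}


def make_default_flag(status: pd.Series) -> pd.Series:
    """Map loan_status to a 0/1 flag, raising on unmapped statuses."""
    flag = status.map(STATUS_TO_DEFAULT)
    if flag.isna().any():
        unknown = sorted(status[flag.isna()].astype(str).unique())
        raise ValueError(f"Unmapped loan_status values: {unknown}")
    return flag.astype(int)


# Toy data for illustration only (not taken from the Lending Club file).
toy_status = pd.Series(["Fully Paid", "Charged Off", "Fully Paid"])
toy_flag = make_default_flag(toy_status)
print(toy_flag.tolist())
```

In the notebook this would be called as make_default_flag(df["loan_status"]), before the "loan_status" column is dropped.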

3. Feature Engineering & Encoding:

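Missing values in mort_acc and revol_util are filled with the column median, and the loan term is converted to an integer number of months.
Categorical variables (credit grade, sub-grade, loan purpose, and home ownership) are encoded using one-hot encoding.
Features are standardized later, just before model fitting, so that coefficients are on a comparable scale.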

In [8]:
df = df.drop(columns=["loan_status"])

In [9]:
df.isna().mean().sort_values(ascending=False)

mort_acc          0.095435
revol_util        0.000697
int_rate          0.000000
term              0.000000
loan_amnt         0.000000
sub_grade         0.000000
grade             0.000000
annual_inc        0.000000
home_ownership    0.000000
dti               0.000000
purpose           0.000000
open_acc          0.000000
default           0.000000
dtype: float64

In [10]:
df["mort_acc"] = df["mort_acc"].fillna(df["mort_acc"].median())
df["revol_util"] = df["revol_util"].fillna(df["revol_util"].median())
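
The medians above are computed on the full dataset, before the train/test split in section 4, so test rows contribute to the fill values. With under 10% of mort_acc missing the effect is probably small, but a leakage-free variant is easy to sketch. The example below assumes scikit-learn's SimpleImputer and uses a made-up toy frame; it illustrates the idea and is not the notebook's actual pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the training and test feature frames (made-up values).
toy_train = pd.DataFrame({"mort_acc": [0.0, 1.0, np.nan, 3.0],
                          "revol_util": [40.0, np.nan, 60.0, 80.0]})
toy_test = pd.DataFrame({"mort_acc": [np.nan, 2.0],
                         "revol_util": [55.0, np.nan]})

imputer = SimpleImputer(strategy="median")

# Medians are learned from the training rows only ...
train_filled = pd.DataFrame(imputer.fit_transform(toy_train),
                            columns=toy_train.columns)
# ... and then reused, unchanged, on the test rows.
test_filled = pd.DataFrame(imputer.transform(toy_test),
                           columns=toy_test.columns)

print(imputer.statistics_)  # per-column training medians
```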


In [12]:
df["term"] = df["term"].str.replace(" months", "").astype(int)


In [13]:
categorical_cols = [
    "grade",
    "sub_grade",
    "home_ownership",
    "purpose"
]

df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)


In [14]:
df.shape

(396030, 67)
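
With drop_first=True, a categorical column with k levels becomes k - 1 indicator columns, and the dropped level acts as the reference category. A small sketch on made-up values:

```python
import pandas as pd

toy = pd.DataFrame({"grade": ["A", "B", "C", "A"],
                    "loan_amnt": [5000, 12000, 8000, 3000]})

# Three grade levels -> two indicator columns; grade "A" is the reference.
encoded = pd.get_dummies(toy, columns=["grade"], drop_first=True)
print(encoded.columns.tolist())
```

A row with both grade_B and grade_C equal to zero is therefore a grade A loan.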

4. Model Choice:

Goal: Predict probability of default, create a risk score, and simulate an approval policy.

In [15]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["default"])
y = df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

In [16]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
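
The cell above standardizes every column, including the one-hot indicators. A possible alternative, sketched here on a toy frame with scikit-learn's ColumnTransformer, scales only the continuous columns and passes the indicators through unchanged, which keeps each indicator coefficient readable as a shift relative to the reference category. The values and the choice of columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

toy_X = pd.DataFrame({
    "loan_amnt": [5000.0, 10000.0, 15000.0, 20000.0],
    "int_rate": [6.5, 11.4, 14.0, 17.3],
    "grade_B": [0, 1, 0, 0],  # one-hot indicator, left unscaled
})
numeric_cols = ["loan_amnt", "int_rate"]

preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # indicator columns pass through as-is
)

# Scaled numeric columns come first, passthrough columns follow.
toy_scaled = preprocess.fit_transform(toy_X)
print(toy_scaled.shape)
```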


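Logistic Regression is used as the baseline model due to its interpretability and widespread use
in credit risk and regulatory environments.
Class imbalance is handled with class_weight="balanced", which up-weights the minority default class during fitting.
A side effect is that predicted probabilities are shifted upward, so they are best read as relative risk scores rather than calibrated default probabilities.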

In [17]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(
    max_iter=1000,            # give the solver enough iterations to converge
    class_weight="balanced",  # up-weight the minority (default) class
)

log_reg.fit(X_train_scaled, y_train)
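
For reference, scikit-learn documents the "balanced" setting as weighting each class by n_samples / (n_classes * class_count). A quick check of that formula follows. It uses the full-dataset status counts from section 2 as an approximation of the stratified training split, so the exact weights used during fitting may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the status counts printed in section 2
# (full dataset; the stratified training split has roughly the same mix).
y_full = np.repeat([0, 1], [318357, 77673])

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_full)

# Same numbers by hand: n_samples / (n_classes * class_count).
manual = len(y_full) / (2 * np.bincount(y_full))

print(weights.round(3), manual.round(3))
```

Each default therefore counts roughly four times as much as a fully paid loan in the loss function.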


In [18]:
y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]


In [19]:
from sklearn.metrics import confusion_matrix, classification_report

confusion_matrix(y_test, y_pred)


array([[50603, 28987],
       [ 6329, 13089]])

In [20]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.89      0.64      0.74     79590
           1       0.31      0.67      0.43     19418

    accuracy                           0.64     99008
   macro avg       0.60      0.65      0.58     99008
weighted avg       0.78      0.64      0.68     99008
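
The report figures follow directly from the confusion matrix above, where rows are actual classes and columns are predicted classes. A short check of the default-class numbers:

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, columns = predicted.
cm = np.array([[50603, 28987],
               [6329, 13089]])
tn, fp, fn, tp = cm.ravel()

recall_default = tp / (tp + fn)      # share of actual defaults flagged
precision_default = tp / (tp + fp)   # share of flagged loans that default
accuracy = (tp + tn) / cm.sum()

print(round(recall_default, 2), round(precision_default, 2), round(accuracy, 2))
```

This reproduces the 0.67, 0.31, and 0.64 in the report: about two of every three defaults are caught, but roughly 69% of the loans flagged as risky were in fact repaid.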



5. Model Evaluation:

In [21]:
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_test, y_prob)
roc_auc


np.float64(0.7103861242674454)

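The model achieves a ROC-AUC score of 0.71, indicating moderate discrimination between defaulting and non-defaulting loans.
At the default 0.5 cut-off, recall for defaulted loans is 0.67, so about two thirds of risky loans are correctly identified.
This comes at a cost: precision for the default class is only 0.31, and recall for non-default loans drops to 0.64.
The trade-off follows from the class weighting, which prioritizes catching defaults over avoiding false alarms.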
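
Credit risk reports often quote the Gini coefficient (2 * AUC - 1) and the Kolmogorov-Smirnov (KS) statistic next to ROC-AUC. The sketch below derives the Gini from the AUC reported above and shows how KS could be computed from an ROC curve. The KS part runs on synthetic scores, since the notebook's y_test and y_prob are not reproduced here.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Gini implied by the AUC reported above.
reported_auc = 0.7103861242674454
gini = 2 * reported_auc - 1

# KS on synthetic data (illustrative only): defaults score higher on average.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=y_true * 0.8, scale=1.0)

fpr, tpr, _ = roc_curve(y_true, scores)
ks = np.max(tpr - fpr)  # largest gap between the two cumulative score curves
toy_auc = roc_auc_score(y_true, scores)

print(round(gini, 3), round(float(ks), 3), round(float(toy_auc), 3))
```

In the notebook, the same two lines (roc_curve followed by the maximum of tpr - fpr) would be applied to y_test and y_prob.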

6. Risk Score & Approval Policy:

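Predicted probabilities are multiplied by 100 to give a risk score ranging from 0 to 100.
A simulated approval threshold of 60 is applied: loans scoring below it are approved, and the rest are rejected.
On the test set, approved loans default at 14.0%, compared with 37.0% for rejected loans and 19.6% for the test portfolio overall, which illustrates the model’s business value.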

In [22]:
risk_score = (y_prob * 100).round(2)


In [23]:
results = X_test.copy()
results["default_actual"] = y_test.values
results["risk_score"] = risk_score


In [24]:
threshold = 60

results["approved"] = (results["risk_score"] < threshold).astype(int)


In [25]:
portfolio_summary = results.groupby("approved")["default_actual"].mean()
portfolio_summary


approved
0    0.370190
1    0.139672
Name: default_actual, dtype: float64
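
The 60-point cut-off is one choice among many. A hypothetical helper, policy_table, can tabulate the trade-off between the approval rate and the default rate of the approved book across several thresholds. The data below is synthetic; in the notebook the inputs would be results["default_actual"] and results["risk_score"].

```python
import numpy as np
import pandas as pd


def policy_table(default_actual, risk_score, thresholds):
    """Approval rate and approved-book default rate per threshold."""
    default_actual = np.asarray(default_actual)
    risk_score = np.asarray(risk_score)
    rows = []
    for t in thresholds:
        approved = risk_score < t  # same rule as the notebook's policy
        rows.append({
            "threshold": t,
            "approval_rate": approved.mean(),
            "default_rate_approved": (
                default_actual[approved].mean() if approved.any() else np.nan
            ),
        })
    return pd.DataFrame(rows)


# Synthetic scores for illustration: defaults tend to score higher.
rng = np.random.default_rng(42)
toy_default = rng.binomial(1, 0.2, size=5000)
toy_score = np.clip(rng.normal(45 + 15 * toy_default, 15), 0, 100)

table = policy_table(toy_default, toy_score, thresholds=[40, 50, 60, 70])
print(table)
```

A lower threshold gives a cleaner approved book but rejects more applicants, so the choice of cut-off depends on how lost lending volume is valued against avoided losses.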

7. Business Conclusion:

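This project demonstrates how a data-driven credit risk model can support loan approval decisions
and reduce portfolio risk: on held-out data, a simple score cut-off lowers the default rate of the approved book from 19.6% to 14.0%.
The logistic regression baseline is interpretable and gives finance and risk management teams a practical starting point.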

In [26]:
results.to_csv(
    "credit_risk_scored_loans.csv",
    index=False
)