# **Loan Default Prediction**

This notebook demonstrates training a **PyTorch neural network** to predict whether a loan will default.

## **Notebook Outline**
1. **Train-Test Split & Imbalance Handling**
2. **PyTorch Model Building & Training**
3. **Evaluation & Threshold Tuning**

We'll see both **Markdown explanations** (like this one) and **Code cells** demonstrating each step.

---

In [6]:
# (Cell 1) 1. Imports & Setup
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (confusion_matrix, precision_score, recall_score, 
                             f1_score, roc_auc_score, classification_report)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from imblearn.over_sampling import RandomOverSampler

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

sns.set_theme()  # optional aesthetics; the 'seaborn' matplotlib style name is deprecated
print("Setup complete.")

Setup complete.



## **Loading the Cleaned Data for Training**
We load `cleaned_train_data.csv`, the file we saved at the end of the EDA step, for training.

In [7]:
df = pd.read_csv("cleaned_train_data.csv")


## **6. Train-Test Split & Imbalance Handling**

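We use a stratified split so that the ~7% minority class (`bad_flag = 1`) keeps the same share in the training and validation sets. We then optionally oversample the minority class, in the training set only, using `RandomOverSampler` from `imblearn`. With `sampling_strategy=0.15`, the minority class is resampled up to 15% of the majority count, which is about 13% of all training rows.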
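As a rough check on what `sampling_strategy=0.15` does, the sketch below reproduces the post-oversampling row count by hand. The class counts (141063 negatives, 10502 positives) are not stated in the notebook; they are derived here from the training shape and positive rate that the next cell prints. Truncating the target count to an integer is an assumption about how `RandomOverSampler` rounds.

```python
# Class counts in the training split, derived from the printed shape
# (151565 rows) and positive rate (~6.93%). These are assumptions.
n_neg = 141063
n_pos = 10502

sampling_strategy = 0.15  # desired minority / majority ratio after resampling

# Assumed behaviour: the minority class is resampled (with replacement)
# up to int(ratio * majority_count) rows; the majority class is untouched.
n_pos_resampled = int(sampling_strategy * n_neg)
n_total = n_neg + n_pos_resampled
positive_share = n_pos_resampled / n_total

print("Rows after oversampling:", n_total)
print("Positive share:", round(positive_share, 5))
```

Note that a ratio of 0.15 is not a 15% positive share: the share is `0.15 / (1 + 0.15)`, or about 13%.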

In [8]:
# (Cell 5) Train-Val Split, Scale, and Oversample

y = df['bad_flag'].astype(int)
X = df.drop(columns=['bad_flag'])

# Stratified split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train shape:", X_train.shape, "Val shape:", X_val.shape)
print("Positive class in train:", (y_train==1).mean())
print("Positive class in val:  ", (y_val==1).mean())

# Scale numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled   = scaler.transform(X_val)

# Convert back to DataFrame (optional)
X_train = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_val   = pd.DataFrame(X_val_scaled,   columns=X_val.columns,   index=X_val.index)

# OverSampling
ros = RandomOverSampler(random_state=42, sampling_strategy=0.15)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print("After oversampling:", X_train_ros.shape)
print("Positive class proportion:", (y_train_ros==1).mean())

Train shape: (151565, 32) Val shape: (37892, 32)
Positive class in train: 0.06929040345726256
Positive class in val:   0.06930222738308878
After oversampling: (162222, 32)
Positive class proportion: 0.13043237045530198


## **7. PyTorch Model Building & Training**
We define **DeeperNet**, an MLP with two hidden layers, and pass a **pos_weight** to the BCE-with-logits loss so that errors on the positive class count more. We then train for a fixed number of epochs (50 here).
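To make the effect of `pos_weight` concrete, here is a minimal pure-Python sketch of the per-sample loss as documented for `nn.BCEWithLogitsLoss`: `-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]`. The helper name `weighted_bce_with_logits` is hypothetical. The weight 13.43 is an approximation of the negative-to-positive ratio implied by the ~6.9% positive rate.

```python
import math

def weighted_bce_with_logits(logit, target, pos_weight):
    """Per-sample BCE on a raw logit, with the positive term scaled by pos_weight."""
    # log(sigmoid(x)) = -log(1 + exp(-x)); log(1 - sigmoid(x)) = -log(1 + exp(x))
    log_p = -math.log1p(math.exp(-logit))
    log_not_p = -math.log1p(math.exp(logit))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)

w = 13.43  # approximate neg/pos ratio; an assumption for illustration

# At logit 0 the model predicts 0.5 for either class.
loss_on_positive = weighted_bce_with_logits(0.0, 1.0, w)
loss_on_negative = weighted_bce_with_logits(0.0, 0.0, w)
print(f"positive: {loss_on_positive:.4f}, negative: {loss_on_negative:.4f}")
```

For the same uncertain prediction, a missed positive costs `w` times as much as a missed negative. This pushes the model toward higher recall at the expense of precision.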

**Why this architecture?**
- An MLP with two hidden layers is more expressive than one with a single hidden layer and is still relatively fast to train.
- We use ReLU activations for simplicity.
- We use the Adam optimizer with a learning rate of 0.001. In practice, you would tune this and the other hyperparameters more extensively.
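To give a sense of the model's size, the following sketch counts the trainable parameters of the three linear layers by hand. It assumes 32 input features (the training shape printed earlier) and the default `hidden_dim=64`. The helper `linear_params` is a hypothetical name.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# (in_features, out_features) for fc1, fc2, fc3
layer_shapes = [(32, 64), (64, 64), (64, 1)]

per_layer = [linear_params(n_in, n_out) for n_in, n_out in layer_shapes]
total_params = sum(per_layer)

print("Parameters per layer:", per_layer)
print("Total trainable parameters:", total_params)
```

With only a few thousand parameters and roughly 160k training rows, the network is small relative to the data.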


In [9]:
# (Cell 6) Build PyTorch model & Dataloaders

# Prepare final train sets
use_oversample = True  # set to False if you want to skip oversampling

if use_oversample:
    X_train_final, y_train_final = X_train_ros, y_train_ros
else:
    X_train_final, y_train_final = X_train, y_train

X_train_t = torch.tensor(X_train_final.values, dtype=torch.float32)
y_train_t = torch.tensor(y_train_final.values, dtype=torch.float32)

X_val_t = torch.tensor(X_val.values, dtype=torch.float32)
y_val_t = torch.tensor(y_val.values, dtype=torch.float32)

train_dataset = TensorDataset(X_train_t, y_train_t)
val_dataset   = TensorDataset(X_val_t,   y_val_t)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=64, shuffle=False)

class DeeperNet(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # no sigmoid here for BCEWithLogitsLoss
        return x

# Initialize the network
model = DeeperNet(input_dim=X_train.shape[1], hidden_dim=64)
print(model)

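# pos_weight for imbalance: ratio of negatives to positives.
# Note: the counts come from the original y_train (ratio of roughly 13.4 here),
# not from the oversampled y_train_final, so when use_oversample is True this
# weighting stacks on top of the oversampling.
neg_count = (y_train == 0).sum()
pos_count = (y_train == 1).sum()
pos_weight_val = torch.tensor([neg_count / pos_count], dtype=torch.float32)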

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight_val)
optimizer = optim.Adam(model.parameters(), lr=0.001)

epochs = 50
for epoch in range(epochs):
    model.train()
    total_loss = 0.0
    correct, total = 0, 0

    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_X).squeeze(1)  # (batch, 1) -> (batch,); safe for a batch of size 1
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Compute train accuracy
        probs = torch.sigmoid(outputs)
        preds = (probs > 0.5).float()
        correct += (preds == batch_y).sum().item()
        total   += batch_y.size(0)

    train_acc = correct / total
    avg_train_loss = total_loss / len(train_loader)

    # Validation
    model.eval()
    val_loss, val_correct, val_total = 0.0, 0, 0
    with torch.no_grad():
        for val_X, val_y in val_loader:
            val_outputs = model(val_X).squeeze(1)  # (batch, 1) -> (batch,)
            v_loss = criterion(val_outputs, val_y)
            val_loss += v_loss.item()

            val_probs = torch.sigmoid(val_outputs)
            val_preds = (val_probs > 0.5).float()
            val_correct += (val_preds == val_y).sum().item()
            val_total   += val_y.size(0)

    avg_val_loss = val_loss / len(val_loader)
    val_acc = val_correct / val_total

    if (epoch+1) % 1 == 0:
        print(f"[Epoch {epoch+1}/{epochs}] "
              f"Train Loss: {avg_train_loss:.4f}, Train Acc: {train_acc:.4f} | "
              f"Val Loss: {avg_val_loss:.4f}, Val Acc: {val_acc:.4f}")


DeeperNet(
  (fc1): Linear(in_features=32, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=1, bias=True)
)
[Epoch 1/50] Train Loss: 1.5505, Train Acc: 0.3276 | Val Loss: 1.2859, Val Acc: 0.2803
[Epoch 2/50] Train Loss: 1.5289, Train Acc: 0.3550 | Val Loss: 1.2447, Val Acc: 0.3435
[Epoch 3/50] Train Loss: 1.5147, Train Acc: 0.3682 | Val Loss: 1.2955, Val Acc: 0.3393
[Epoch 4/50] Train Loss: 1.5028, Train Acc: 0.3763 | Val Loss: 1.2743, Val Acc: 0.3486
[Epoch 5/50] Train Loss: 1.4899, Train Acc: 0.3866 | Val Loss: 1.2356, Val Acc: 0.4032
[Epoch 6/50] Train Loss: 1.4812, Train Acc: 0.3957 | Val Loss: 1.2765, Val Acc: 0.3347
[Epoch 7/50] Train Loss: 1.4693, Train Acc: 0.4050 | Val Loss: 1.3105, Val Acc: 0.3283
[Epoch 8/50] Train Loss: 1.4586, Train Acc: 0.4136 | Val Loss: 1.2423, Val Acc: 0.4261
[Epoch 9/50] Train Loss: 1.4486, Train Acc: 0.4252 | Val Loss: 1.2817, Val Acc: 0.3738
[Epoch 10/50] Train Loss:

## **8. Evaluation & Threshold Tuning**
We'll compute **precision, recall, F1, and AUC** on the validation set. Then we can see if a threshold other than 0.5 yields better F1.

In [10]:
# (Cell 7) Final Evaluation
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for val_X, val_y in val_loader:
        logits = model(val_X).squeeze(1)  # (batch, 1) -> (batch,)
        probs = torch.sigmoid(logits)
        all_preds.extend(probs.cpu().numpy())
        all_labels.extend(val_y.cpu().numpy())

# Default threshold = 0.5
pred_labels = [1 if p >= 0.5 else 0 for p in all_preds]

prec = precision_score(all_labels, pred_labels)
rec  = recall_score(all_labels, pred_labels)
f1   = f1_score(all_labels, pred_labels)
auc  = roc_auc_score(all_labels, all_preds)

print(f"Precision: {prec:.4f}, Recall: {rec:.4f}, F1: {f1:.4f}, AUC: {auc:.4f}")

# Classification report
print("\nClassification Report:\n", classification_report(all_labels, pred_labels, target_names=["class 0", "class 1"]))

Precision: 0.0966, Recall: 0.6580, F1: 0.1685, AUC: 0.6346

Classification Report:
               precision    recall  f1-score   support

     class 0       0.96      0.54      0.69     35266
     class 1       0.10      0.66      0.17      2626

    accuracy                           0.55     37892
   macro avg       0.53      0.60      0.43     37892
weighted avg       0.90      0.55      0.66     37892
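As a quick sanity check on the printed metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two rounded values printed above.

```python
# Values copied from the printed validation metrics (rounded to 4 decimals).
precision = 0.0966
recall = 0.6580

# F1 is the harmonic mean of precision and recall.
f1_recomputed = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from precision and recall: {f1_recomputed:.4f}")
```

The result agrees with the reported F1 of 0.1685 to within rounding. Because the harmonic mean is dominated by the smaller value, the low precision keeps F1 low even though recall is about 0.66.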



### Threshold Tuning
To find the threshold that maximizes F1, we sweep thresholds from 0.0 to 1.0 in steps of 0.01. This is only a demonstration: in practice, pick a threshold that matches your business needs (the precision vs. recall trade-off).

In [11]:
# (Cell 8) Threshold Tuning
# numpy is already imported as np in Cell 1
best_t, best_f1 = 0.0, 0.0

for t in np.arange(0.0, 1.01, 0.01):
    temp_preds = [1 if p >= t else 0 for p in all_preds]
    current_f1 = f1_score(all_labels, temp_preds)
    if current_f1 > best_f1:
        best_f1 = current_f1
        best_t = t

print("Best threshold:", best_t)
print("Best F1 at that threshold:", best_f1)

Best threshold: 0.65
Best F1 at that threshold: 0.17880794701986755
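The sweep above can also be written as a small self-contained function, which makes it easier to reuse on a separate held-out set. This is a sketch, not the notebook's method: the names `f1_from_counts` and `best_f1_threshold` and the four-row toy data are hypothetical. It uses integer steps (`i / steps`) rather than `np.arange` to avoid floating-point drift in the threshold values.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from confusion counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(probs, labels, steps=100):
    """Return (threshold, f1) for the first threshold on the grid that maximizes F1."""
    best_t, best_f1 = 0.0, 0.0
    for i in range(steps + 1):
        t = i / steps
        tp = fp = fn = 0
        for p, y in zip(probs, labels):
            pred = 1 if p >= t else 0
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 1:
                fn += 1
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:  # strict ">" keeps the lowest threshold among ties
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Toy data for illustration only (not from the loan dataset).
toy_probs = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
toy_t, toy_f1 = best_f1_threshold(toy_probs, toy_labels)
print("Best threshold:", toy_t, "F1:", toy_f1)
```

A threshold picked and scored on the same validation set gives an optimistic F1. If a third split is available, one option is to choose the threshold on validation data and report the metric on that untouched split.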
