# Loan Amount Prediction & Classification Model  
### End-to-End Modeling with Preprocessed Lending Data

This notebook builds two machine learning models using a fully cleaned and preprocessed dataset (`df_model_cleaned.csv`):

1. **A Regression Model**  
   Predicts the exact loan amount requested by a borrower.  
2. **A Classification Model**  
   Categorizes each loan request into Low, Medium, or High amount buckets.

---

## 📌 Background

The dataset used in this notebook (`df_model_cleaned.csv`) was created in a separate preprocessing workflow by merging and transforming the following cleaned tables:

- `borrower_info_cleaned`  
- `credit_history_cleaned`  
- `delinquency_risk_cleaned`  
- `loan_base_cleaned`

All cleaning, merging, and feature preparation operations were performed outside this notebook.  
Here, we focus solely on modeling, evaluation, and interpretation.

---

## 📘 Modeling Workflow Overview

### **1. Regression Task**
- Select stable, high-signal features  
- Train a Random Forest Regressor  
- Apply log-transform to stabilize the target variable  
- Evaluate performance using:
  - MAE  
  - RMSE  
  - R²  
- Compare the model against a baseline predictor  
- Extract and visualize feature importances  

### **2. Classification Task**
- Convert the continuous loan amount into 3 buckets:
  - Low (≤10k), Medium (10k–20k), High (>20k)
- Build a classification dataset using proven regression features  
- Train a Random Forest Classifier  
- Evaluate with:
  - Accuracy  
  - Full classification report (precision, recall, F1-score)  
  - Confusion matrix  

---

## 🎯 Objective

The purpose of this notebook is to create a robust and interpretable pipeline for predicting loan amounts and categorizing loan requests into meaningful size segments.  
The workflow demonstrates:

- Proper handling of mixed data types  
- Preprocessing pipelines  
- Model training with scikit-learn  
- Evaluation against baselines  
- Feature importance analysis  

---

## 🚀 Let's Begin

We start by importing all required libraries and loading the combined modeling dataset.


## 1. Regression Modeling

### Import Required Libraries

In this section, we import all necessary Python libraries for data manipulation, visualization, preprocessing, model training, and evaluation.  
These include:
- **Pandas & NumPy** for data handling  
- **Matplotlib & Seaborn** for exploratory visualizations  
- **Scikit-learn** modules for preprocessing, encoding, modeling, and performance metrics


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


### Load Preprocessed Modeling Dataset

We load the `df_model_cleaned.csv` file, which serves as the final input dataset for model development.  
This dataset was previously created in a separate preprocessing notebook by merging and transforming the following cleaned tables:

- `borrower_info_cleaned`
- `credit_history_cleaned`
- `delinquency_risk_cleaned`
- `loan_base_cleaned`

Since those preparation steps were completed earlier, only the resulting dataset is loaded here.  
Below, we display the dataset's shape and the first few rows to verify its structure.


In [None]:
df = pd.read_csv("df_model_cleaned.csv")

print("Shape:", df.shape)
df.head(3)


### Select Relevant Features for Modeling

In this step, we define the feature set to be used for training the model.  
We exclude noisy or unstable variables (such as delinquency-related columns) and keep only the most reliable and predictive features.  
The selected feature list includes both numerical and categorical variables:

- Income and credit limit indicators  
- Account and credit activity metrics  
- Employment information  
- Categorical attributes such as loan purpose, home ownership, and application type  

After defining the feature set, we construct a clean regression dataset by selecting the chosen columns along with the target variable `loan_amnt`, and removing any rows containing missing values.  
Finally, we create the feature matrix **X** and target vector **y** for subsequent modeling steps.


In [None]:
# Remove noisy columns and keep only the strongest predictive features
feature_cols = [
    'annual_inc_capped',
    'term_clean',
    'total_bc_limit_capped',
    'total_il_high_credit_limit_capped',
    'total_acc',
    'num_rev_accts',
    'emp_length_clean',
    'purpose',          # Categorical
    'home_ownership',   # Categorical
    'application_type'  # Categorical
]

# We no longer include noisy variables such as delinquency-related columns
print("Selected Feature Columns:")
print(feature_cols)

# Redefine X and y
df_reg = df[feature_cols + ['loan_amnt']].dropna()
X = df_reg[feature_cols]
y = df_reg['loan_amnt']


### Prepare Regression Dataset

We subset the dataframe to include only the selected feature columns along with the target variable `loan_amnt`, and then remove any rows containing missing values.  
This results in a clean and consistent dataset for modeling.  

Afterward, we define:
- **X** → the feature matrix  
- **y** → the target variable representing the loan amount  

We also print the shape of the regression dataset to verify the final size.


In [None]:
df_reg = df[feature_cols + ['loan_amnt']].dropna()

print("Regression Dataset Shape:", df_reg.shape)

X = df_reg[feature_cols]
y = df_reg['loan_amnt']


### Separate Numerical and Categorical Features

To prepare the dataset for preprocessing, we identify which features are numerical and which are categorical.  
- Numerical columns are detected automatically based on their data types.  
- Categorical columns are determined by selecting the remaining features that are not numeric.

This separation is required because each feature type gets its own preprocessing: imputation for numerical columns, and imputation followed by one-hot encoding for categorical columns.
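
`select_dtypes` accepts either explicit dtype names or the generic `'number'` selector. As a minimal sketch, the snippet below compares the two on a hypothetical toy frame; the values and the `int32` dtype are assumptions for illustration and are not taken from `df_model_cleaned.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values and dtypes are illustrative only
toy = pd.DataFrame({
    "total_acc": np.array([12, 30, 7], dtype="int32"),   # numeric, but not int64
    "annual_inc_capped": [52000.0, 88000.0, 41000.0],     # float64
    "purpose": ["car", "wedding", "car"],                 # text
})

# Explicit dtype names only match those exact dtypes
explicit = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# The generic selector matches every numeric dtype
generic = toy.select_dtypes(include="number").columns.tolist()

# Everything that is not numeric is treated as categorical
categorical = [c for c in toy.columns if c not in generic]

print("Explicit int64/float64:", explicit)
print("Generic 'number':      ", generic)
print("Categorical:           ", categorical)
```

If a numeric column is stored with a narrower dtype, the explicit list would route it to the categorical branch, so it is worth checking the printed lists before building the preprocessor.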


In [None]:
# include='number' matches every numeric dtype (int32, float32, ...), not only int64/float64
numeric_features = X.select_dtypes(include='number').columns.tolist()

categorical_features = [c for c in feature_cols if c not in numeric_features]

print("Numeric:", numeric_features)
print("Categorical:", categorical_features)


### Build Preprocessing Pipelines

We define preprocessing steps for both numerical and categorical features:

- **Numerical Features:**  
  Missing values are imputed using the median, which is robust to outliers.

- **Categorical Features:**  
  A pipeline is created that first imputes missing values with the most frequent category,  
  and then applies One-Hot Encoding while ignoring unseen categories during inference.

These preprocessing components are combined into a `ColumnTransformer`,  
ensuring the correct transformations are applied to each feature group before model training.
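
To make the two behaviours described above concrete, here is a small sketch on hypothetical values (the incomes and loan purposes are invented for illustration). It shows median imputation ignoring an extreme outlier, and `handle_unknown='ignore'` encoding an unseen category as an all-zero row instead of raising an error.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical income column with one missing value and one extreme outlier
incomes = np.array([[40000.0], [50000.0], [60000.0], [np.nan], [1000000.0]])
imputed = SimpleImputer(strategy="median").fit_transform(incomes)

# Median of the observed values (40k, 50k, 60k, 1M) is 55k; the mean would be 287.5k
print("Imputed value:", imputed[3, 0])

# Encoder fitted on two hypothetical loan purposes
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["car"], ["home_improvement"]]))

# .toarray() converts the default sparse output to a dense array for printing
seen = encoder.transform(np.array([["car"]])).toarray()
unseen = encoder.transform(np.array([["wedding"]])).toarray()

print("Seen category  :", seen)    # one column set to 1
print("Unseen category:", unseen)  # all zeros, no error raised
```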


In [None]:
numeric_transformer = SimpleImputer(strategy='median')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)


### Split the Data and Apply Log Transformation to the Target Variable

We begin by splitting the dataset into training and testing sets using an 80/20 ratio.  
To stabilize variance, we apply a **logarithmic transformation** (`log1p`) to the training target (`loan_amnt`). The test target stays on the original scale, because evaluation is done on actual loan amounts after the predictions are inverted.  

This transformation helps:
- Make the distribution of loan amounts more symmetric  
- Reduce right skew  
- Limit the influence of very large loans on the fitted trees  

We then print the shapes of the resulting datasets and preview both the original and transformed target values.
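
As a quick sanity check of the transformation, the sketch below uses hypothetical loan amounts to show that `expm1` exactly undoes `log1p`, and that averaging on the log scale and inverting yields a value below the arithmetic mean. The second point is one reason predictions from a log-target model can sit below the average on the original scale.

```python
import numpy as np

# Hypothetical right-skewed loan amounts
amounts = np.array([1000.0, 5000.0, 10000.0, 15000.0, 40000.0])

log_amounts = np.log1p(amounts)    # log(1 + x)
restored = np.expm1(log_amounts)   # exp(x) - 1, the inverse of log1p

print("Round trip matches:", np.allclose(restored, amounts))

# Averaging on the log scale and inverting gives a geometric-style mean,
# which is never above the arithmetic mean
mean_original = amounts.mean()
mean_via_log = np.expm1(log_amounts.mean())

print(f"Arithmetic mean   : {mean_original:.0f}")
print(f"Mean via log scale: {mean_via_log:.0f}")
```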


In [None]:
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# IMPORTANT: Apply the log transformation to the training target only.
# This reduces right skew; y_test stays on the original scale for evaluation.
y_train_log = np.log1p(y_train)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("y_train (original) head:", y_train.head(3).values)
print("y_train (log-transformed) head:", y_train_log.head(3).values)


### Build and Train the Random Forest Regression Pipeline

We create a full modeling pipeline that includes both preprocessing and model training.  
The pipeline consists of:

- **Preprocessing Step:**  
  Applies numerical imputation, categorical imputation, and one-hot encoding using the previously defined `preprocessor`.

- **Modeling Step (Random Forest Regressor):**  
  A tree-based ensemble model configured with:
  - 200 estimators  
  - Maximum depth of 8 to reduce overfitting  
  - `min_samples_leaf=100` to enforce more general decision splits  
  - Parallel processing enabled via `n_jobs=-1`  

The model is trained using the **log-transformed target variable** (`y_train_log`) to improve stability and prediction accuracy.  
Once training is completed, we confirm that the model has been successfully fitted.
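
scikit-learn also provides `TransformedTargetRegressor`, which bundles the forward and inverse target transforms with the estimator. The sketch below is an alternative, not what this notebook uses: on synthetic data (all names and values are assumptions), it checks that the wrapper reproduces the manual `log1p` / `expm1` approach.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic, strictly positive target for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 5000 + 30000 * X_demo[:, 0] + rng.normal(0, 500, size=200)

# Manual approach: fit on log1p(y), invert predictions with expm1
manual = RandomForestRegressor(n_estimators=20, random_state=42)
manual.fit(X_demo, np.log1p(y_demo))
manual_pred = np.expm1(manual.predict(X_demo))

# Wrapped approach: the transforms are applied inside fit() and predict()
wrapped = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=20, random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)
wrapped.fit(X_demo, y_demo)
wrapped_pred = wrapped.predict(X_demo)

print("Same predictions:", np.allclose(manual_pred, wrapped_pred))
```

With the wrapper, predictions come back on the original scale, so there is no separate inverse step to forget.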


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Build the full modeling pipeline
rf_reg = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(
        n_estimators=200,
        max_depth=8,          # Reduced from 10 to 8 for less overfitting
        min_samples_leaf=100, # Increased from 50 to 100 for more general splits
        random_state=42,
        n_jobs=-1
    ))
])

# Train the model using the log-transformed target variable
rf_reg.fit(X_train, y_train_log)
print("Model trained using the log-transformed target variable.")


### Generate Predictions and Evaluate Model Performance

After training the model on the log-transformed target, we generate predictions for both the training and test sets.  
Since the model outputs values in logarithmic scale, we apply the inverse transformation (`expm1`) to bring predictions back to their original currency scale.

We then evaluate model performance using the following metrics calculated on the **actual loan amount values**:

- **MAE (Mean Absolute Error):** Measures average absolute prediction error  
- **RMSE (Root Mean Squared Error):** Penalizes larger errors more heavily  
- **R² Score:** Indicates how much variance in the target is explained by the model  

Finally, we print the training and testing results to assess model accuracy and generalization.
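
For reference, the three metrics can be computed by hand. The sketch below uses three hypothetical loans (values invented for illustration) and follows the standard definitions that `mean_absolute_error`, `mean_squared_error`, and `r2_score` implement.

```python
import numpy as np

y_true = np.array([10000.0, 20000.0, 30000.0])   # hypothetical actual amounts
y_pred = np.array([12000.0, 18000.0, 33000.0])   # hypothetical predictions

errors = y_true - y_pred

mae = np.mean(np.abs(errors))                    # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # square root of the mean squared error

ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE : {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R2  : {r2:.3f}")
```

RMSE exceeds MAE here because the single 3,000 error is weighted more heavily once squared.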


In [None]:
# 1. Generate predictions (log-scale outputs)
y_pred_log_train = rf_reg.predict(X_train)
y_pred_log_test = rf_reg.predict(X_test)

# 2. Convert predictions back from log scale (inverse transform)
y_pred_train = np.expm1(y_pred_log_train)
y_pred_test = np.expm1(y_pred_log_test)

# 3. Calculate performance metrics using actual values
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)

rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))

r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)

print("=== UPDATED RandomForest Results ===")
print(f"Train MAE : {mae_train:.2f}")
print(f"Test MAE  : {mae_test:.2f}")
print("-" * 30)
print(f"Train RMSE: {rmse_train:.2f}")
print(f"Test RMSE : {rmse_test:.2f}")
print("-" * 30)
print(f"Train R2  : {r2_train:.3f}")
print(f"Test R2   : {r2_test:.3f}")


### Establish a Baseline Model for Comparison

To evaluate whether our machine learning model provides meaningful improvements, we create a simple baseline model.  
This baseline predicts the **mean loan amount from the training set** for every sample in the test set.

We then compute MAE, RMSE, and R² for this baseline.  
Comparing these metrics with the Random Forest model helps determine how much predictive value the trained model adds beyond a trivial guess.
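
A constant prediction equal to the training mean scores an R² of exactly zero only when the test mean happens to equal the training mean; otherwise it is slightly negative. The sketch below illustrates this with hypothetical amounts.

```python
import numpy as np

y_train_demo = np.array([5000.0, 10000.0, 15000.0, 30000.0])  # hypothetical
y_test_demo = np.array([8000.0, 12000.0, 22000.0])            # hypothetical

# Constant prediction: the training mean (15000) for every test row
baseline = np.full(len(y_test_demo), y_train_demo.mean())

ss_res = np.sum((y_test_demo - baseline) ** 2)
ss_tot = np.sum((y_test_demo - y_test_demo.mean()) ** 2)
baseline_r2_demo = 1 - ss_res / ss_tot

print(f"Baseline R2 on held-out data: {baseline_r2_demo:.4f}")
```

Any trained model should therefore be judged by how far its test R² rises above roughly zero, and by how much it lowers MAE and RMSE relative to this constant guess.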


In [None]:
# Baseline model: predict the mean loan amount from the training set for all test samples
baseline_pred = np.full_like(y_test, y_train.mean(), dtype=float)

baseline_mae = mean_absolute_error(y_test, baseline_pred)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
baseline_r2 = r2_score(y_test, baseline_pred)

print("=== Baseline (mean prediction) ===")
print(f"Baseline MAE : {baseline_mae:.2f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Baseline R2  : {baseline_r2:.3f}")


### Extract and Display Feature Importances

To understand which variables contribute most to the model's predictions,  
we extract feature importance scores from the trained Random Forest model.

Because categorical features were one-hot encoded during preprocessing,  
we first retrieve the full expanded feature name list using `get_feature_names_out()`.  
We then pair these names with their corresponding importance values and sort them in descending order.

Finally, we display the top 20 most influential features.
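
Because one-hot encoding spreads a categorical variable across several columns, it can help to sum the importances back to the original feature. The sketch below is one possible way to do that; the names and importance values are hypothetical, written in the `num__` / `cat__` prefix style that `ColumnTransformer.get_feature_names_out()` produces by default, and `original_column` is an illustrative helper, not part of scikit-learn.

```python
from collections import defaultdict

# Hypothetical expanded feature names and importances (illustrative values)
expanded = {
    "num__annual_inc_capped": 0.40,
    "num__term_clean": 0.25,
    "cat__purpose_car": 0.05,
    "cat__purpose_debt_consolidation": 0.10,
    "cat__home_ownership_RENT": 0.12,
    "cat__home_ownership_OWN": 0.08,
}
categorical_cols = ["purpose", "home_ownership"]


def original_column(name):
    """Map an expanded feature name back to its source column (illustrative helper)."""
    base = name.split("__", 1)[1]        # drop the transformer prefix
    for col in categorical_cols:
        if base.startswith(col + "_"):   # one-hot columns look like '<col>_<category>'
            return col
    return base                          # numeric columns keep their own name


# Sum importances per original column
grouped = defaultdict(float)
for name, importance in expanded.items():
    grouped[original_column(name)] += importance

for col, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:20s} {total:.2f}")
```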


In [None]:
# Extract the expanded feature names after preprocessing
feature_names = rf_reg.named_steps['preprocess'].get_feature_names_out()

# Retrieve feature importance scores from the trained Random Forest model
importances = rf_reg.named_steps['model'].feature_importances_

# Create a DataFrame for easier inspection and sort by importance
fi = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

fi.head(20)


### Visualize the Top Feature Importances

To better interpret the most influential predictors in the model,  
we visualize the top 10 features based on their importance scores.

A horizontal bar plot is generated to highlight which features the Random Forest model relies on most when estimating loan amounts.


In [None]:
top_n = 10
plt.figure(figsize=(8,5))
sns.barplot(data=fi.head(top_n), x='importance', y='feature')
plt.title(f"Feature Importance (Top {top_n})")
plt.show()


### Summary Statistics of the Target Variable

Before interpreting prediction performance, it is useful to examine the distribution of the target variable (`loan_amnt`).  
The summary statistics below provide insights into the central tendency, spread, and overall scale of loan amounts in the dataset.


In [None]:
df_reg['loan_amnt'].describe()


### Recalculate Baseline Metrics for Comparison

Once again, we compute baseline performance metrics using a simple model that predicts the **mean loan amount from the training set** for every test instance.  
This provides a straightforward benchmark to evaluate how much the machine learning model improves over a naive prediction strategy.

The baseline metrics reported include:
- **MAE:** Mean Absolute Error  
- **RMSE:** Root Mean Squared Error  
- **R²:** Coefficient of Determination  


In [None]:
# Baseline model: predict the mean loan amount from the training set for all test samples
baseline_pred = np.full_like(y_test, y_train.mean(), dtype=float)

baseline_mae = mean_absolute_error(y_test, baseline_pred)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
baseline_r2 = r2_score(y_test, baseline_pred)

print("Baseline MAE :", baseline_mae)
print("Baseline RMSE:", baseline_rmse)
print("Baseline R2  :", baseline_r2)


## 2. Classification Modeling

### Convert the Target Variable into Loan Amount Buckets (Classification Setup)

To transition from regression to a classification problem, we transform the continuous loan amount (`loan_amnt`) into categorical buckets:

- **0:** Low (≤ 10,000)  
- **1:** Medium (10,001–20,000)  
- **2:** High (> 20,000)

A custom function (`create_loan_bucket`) assigns each loan to its corresponding class, creating the new target variable `loan_bucket`.

Next, we define the feature set using predictors that performed well in the regression model.  
Finally, we prepare **X** (features) and **y** (class labels) and print their shapes to confirm the dataset is ready for classification modeling.
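
The same thresholds can also be expressed with `pd.cut`. The sketch below is an alternative formulation on hypothetical amounts, not the method used in this notebook. One practical difference: a missing amount stays missing, instead of failing every comparison and falling through to the last branch of an `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts, including the boundary values and one missing entry
amounts = pd.Series([5000, 10000, 10001, 20000, 20001, np.nan])

# Right-closed intervals: (-inf, 10000], (10000, 20000], (20000, inf)
buckets = pd.cut(
    amounts,
    bins=[-np.inf, 10000, 20000, np.inf],
    labels=[0, 1, 2],   # 0 = Low, 1 = Medium, 2 = High
)

print(pd.DataFrame({"loan_amnt": amounts, "loan_bucket": buckets}))
```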


In [None]:
# 1. Convert the continuous target variable into categorical buckets (binning)
def create_loan_bucket(amount):
    if amount <= 10000:
        return 0  # Low
    elif amount <= 20000:
        return 1  # Medium
    else:
        return 2  # High

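# Create the new classification target column.
# A missing loan_amnt fails both comparisons above and would be labeled High (2),
# so rows without a target value are dropped first.
df = df.dropna(subset=['loan_amnt']).copy()
df['loan_bucket'] = df['loan_amnt'].apply(create_loan_bucket)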

print("Class Distribution:")
print(df['loan_bucket'].value_counts())

# 2. Select features (based on successful regression predictors)
feature_cols = [
    'annual_inc_capped',
    'term_clean',
    'total_bc_limit_capped',
    'total_il_high_credit_limit_capped',
    'total_acc',
    'purpose',          # Categorical
    'home_ownership'    # Categorical
]

# 3. Prepare X and y (y is now loan_bucket instead of loan_amnt)
X = df[feature_cols]
y = df['loan_bucket']

print("Dataset is ready for classification!")
print("X shape:", X.shape)
print("y shape:", y.shape)


### Build and Train the Classification Pipeline

In this section, we prepare the dataset for a multi-class classification task by defining preprocessing steps and training a **Random Forest Classifier**.

**1. Feature Type Separation**  
We specify which features are numerical and which are categorical so that appropriate preprocessing can be applied.

**2. Preprocessing Pipelines**  
- Numerical features: missing values are imputed using the median.  
- Categorical features: missing values are filled with the most frequent category, followed by One-Hot Encoding to convert text labels into numeric vectors.

These transformations are combined into a `ColumnTransformer` to ensure the correct preprocessing is applied automatically.

**3. Train/Test Split**  
The dataset is split into an 80% training set and a 20% test set.

**4. Model Definition**  
A `RandomForestClassifier` is used, leveraging:
- 100 decision trees  
- Full parallelization (`n_jobs=-1`)  
- Fixed randomness for reproducibility  

The model is wrapped inside a pipeline to ensure preprocessing and prediction flow seamlessly.

**5. Model Training and Evaluation**  
We train the model, generate predictions on the test set, and evaluate performance using:
- **Accuracy score**
- **Classification report** (precision, recall, F1-score)
- **Confusion matrix**

These outputs allow us to assess how well the model classifies loans into Low, Medium, and High buckets.
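
To show how the reported numbers relate to one another, the sketch below derives accuracy, precision, recall, and F1 from a hypothetical confusion matrix (the counts are invented for illustration). It assumes the layout used by `sklearn.metrics.confusion_matrix`: rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [50,  8,  2],   # true Low
    [10, 30, 10],   # true Medium
    [ 3,  7, 30],   # true High
])

correct = np.diag(cm)                  # correctly classified loans per class
accuracy = correct.sum() / cm.sum()
recall = correct / cm.sum(axis=1)      # share of each true class that was found
precision = correct / cm.sum(axis=0)   # share of each predicted class that was right
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(["Low", "Medium", "High"], precision, recall, f1):
    print(f"{name:6s} precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"Accuracy: {accuracy:.3f}")
```

In this toy matrix most errors fall in cells adjacent to the diagonal, which is the pattern to look for with ordered buckets: confusing Low with High is a larger miss than confusing Low with Medium.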


In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# 1. Define numerical and categorical feature groups
numeric_features = [
    'annual_inc_capped', 'term_clean', 'total_bc_limit_capped',
    'total_il_high_credit_limit_capped', 'total_acc'
]
categorical_features = ['purpose', 'home_ownership']

# 2. Preprocessing pipelines
numeric_transformer = SimpleImputer(strategy='median')

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # Convert categories to numeric
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# 3. Train/test split (20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 4. Model definition (Random Forest Classifier)
rf_clf = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(
        n_estimators=100,
        random_state=42,
        n_jobs=-1
    ))
])

# 5. Train the model
print("Training the model... (This may take 1–2 minutes depending on dataset size)")
rf_clf.fit(X_train, y_train)

# 6. Prediction and evaluation
print("Generating predictions on the test set...")
y_pred = rf_clf.predict(X_test)

print("-" * 30)
print(f"Model Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("-" * 30)

print("\nDetailed Classification Report:\n")
print(classification_report(
    y_test, y_pred,
    target_names=['Low (0–10k)', 'Medium (10–20k)', 'High (20k+)']
))

print("-" * 30)
print("\nConfusion Matrix:\n")
print(confusion_matrix(y_test, y_pred))
