## Case Study: Loan Approval Prediction with Random Forest, Compared with Logistic Regression and Decision Tree

This notebook demonstrates how to use a Random Forest classifier to predict loan approvals from features such as income, credit score, loan amount, employment status, and debt-to-income ratio, and compares it with Logistic Regression and Decision Tree baselines.


In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

## Step 1: Generate Synthetic Dataset

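We create a synthetic dataset of 10,000 applicants with financial features and label each row with a rule-based approval criterion: credit score above 650, income above 40,000, and debt-to-income ratio below 0.35.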


In [None]:
# Number of synthetic samples
ds = 10000
# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset
data = {
    'income': np.random.randint(20000, 100000, ds),
    'credit_score': np.random.randint(300, 850, ds),
    'loan_amount': np.random.randint(5000, 50000, ds),
    'employment_status': np.random.randint(0, 2, ds),  # 0 = Unemployed, 1 = Employed
    'debt_to_income_ratio': np.random.uniform(0.1, 0.5, ds),
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Define loan approval criteria (target variable)
df['loan_approved'] = np.where(
    (df['credit_score'] > 650) & (df['income'] > 40000) & (df['debt_to_income_ratio'] < 0.35),
    1, 0
)

# Display the first five rows
df.head()
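
Because the approval rule requires three conditions to hold at once, approved loans should be a minority class: roughly 0.36 × 0.75 × 0.625 ≈ 17% in expectation under the sampling ranges above. The sketch below checks this on a rebuilt frame so that it runs on its own; the `approval_rate` helper and the `demo` frame are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

def approval_rate(frame: pd.DataFrame, target: str = "loan_approved") -> float:
    """Return the fraction of rows whose target column equals 1."""
    return float(frame[target].mean())

# Rebuild a frame with the same sampling ranges and approval rule (assumed demo data)
rng = np.random.default_rng(42)
n = 10_000
demo = pd.DataFrame({
    "income": rng.integers(20000, 100000, n),
    "credit_score": rng.integers(300, 850, n),
    "debt_to_income_ratio": rng.uniform(0.1, 0.5, n),
})
demo["loan_approved"] = (
    (demo["credit_score"] > 650)
    & (demo["income"] > 40000)
    & (demo["debt_to_income_ratio"] < 0.35)
).astype(int)

rate = approval_rate(demo)
print(f"Approved share: {rate:.1%}")
print(demo["loan_approved"].value_counts(normalize=True))
```

In the notebook itself, `approval_rate(df)` would report the share for the generated data.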

## Step 2: Split Dataset

We split the dataset into training and testing sets for model evaluation.


In [None]:
# Split dataset into training and testing sets
X = df.drop(columns=['loan_approved'])
y = df['loan_approved']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dataset shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape
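
When one class is much rarer than the other, a plain random split can leave the training and test sets with slightly different class proportions. One option, shown here as a sketch on assumed demo arrays (`X_demo`, `y_demo`), is to pass `stratify` to `train_test_split` so that both sets keep the overall proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed demo data: 1000 rows, about 17% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.17).astype(int)

# stratify=y_demo preserves the positive share in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall: {y_demo.mean():.3f}, train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```

Applied to this notebook, that would mean adding `stratify=y` to the existing `train_test_split` call.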

## Data Normalization

In [None]:
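# Standardize features; fit the scaler on the training set only to avoid leaking test statistics.
# Scaling mainly helps Logistic Regression; tree-based models do not require it.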
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
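
The fit-on-train, transform-on-test pattern above can also be bundled with the estimator so the two steps cannot drift apart. The following is a minimal sketch using scikit-learn's `make_pipeline` on assumed demo data; the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed demo data: two income-like features, label depends on the first one
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=50000, scale=15000, size=(500, 2))
y_demo = (X_demo[:, 0] > 50000).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics whenever it predicts or scores
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_tr, y_tr)

test_accuracy = scaled_lr.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

With a pipeline, raw `X_test` can be passed straight to `predict` without a separate `transform` call.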

## Step 3: Configure the Logistic Regression, Decision Tree, and Random Forest Classifiers

We configure three models to learn patterns from the dataset; they are fitted in the next step.


In [None]:
logistic_regression = LogisticRegression()
decision_tree = DecisionTreeClassifier(max_depth=8, random_state=42)
random_forest = RandomForestClassifier(
    n_estimators=100,       # Number of trees in the forest
    max_depth=8,            # Restrict depth to reduce overfitting
    max_features='sqrt',    # Consider sqrt(p) features at each split
    random_state=42
)
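
To make the `max_features='sqrt'` and `max_depth` settings concrete: with five input features, scikit-learn considers `int(sqrt(5)) = 2` candidate features at each split. The sketch below inspects a fitted forest on assumed demo data (a smaller forest of 10 trees, chosen only to keep the example fast).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed demo data with five features, matching the feature count of this dataset
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=10, max_depth=8, max_features="sqrt", random_state=42
)
forest.fit(X_demo, y_demo)

# Each fitted tree records the resolved number of features tried per split
features_per_split = forest.estimators_[0].max_features_
# No tree may grow deeper than max_depth
deepest_tree = max(tree.get_depth() for tree in forest.estimators_)
print(f"features per split: {features_per_split}, deepest tree: {deepest_tree}")
```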

## Step 4: Fit Models and Make Predictions

We fit each model on the scaled training set and use it to predict loan approvals on the test set.


In [None]:
# Logistic Regression
logistic_regression.fit(X_train_scaled, y_train)
y_pred_lr = logistic_regression.predict(X_test_scaled)

# Decision Tree
decision_tree.fit(X_train_scaled, y_train)
y_pred_dt = decision_tree.predict(X_test_scaled)

# Random Forest
random_forest.fit(X_train_scaled, y_train)
y_pred_rf = random_forest.predict(X_test_scaled)
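
Since the approval label depends only on credit score, income, and debt-to-income ratio, a fitted forest would be expected to assign most of its impurity-based importance to those three features. The sketch below checks this on a rebuilt demo frame (assumed data following the same rule); in the notebook, the same idea applies to `random_forest.feature_importances_` paired with `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed demo frame with the same columns and approval rule as the notebook
rng = np.random.default_rng(3)
n = 2000
demo = pd.DataFrame({
    "income": rng.integers(20000, 100000, n),
    "credit_score": rng.integers(300, 850, n),
    "loan_amount": rng.integers(5000, 50000, n),
    "employment_status": rng.integers(0, 2, n),
    "debt_to_income_ratio": rng.uniform(0.1, 0.5, n),
})
target = (
    (demo["credit_score"] > 650)
    & (demo["income"] > 40000)
    & (demo["debt_to_income_ratio"] < 0.35)
).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=42)
forest.fit(demo, target)

# Impurity-based importances sum to 1 across features
importances = pd.Series(
    forest.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```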

## Step 5: Evaluate Models

We evaluate each model's performance on the test set using accuracy, precision, and recall.


In [None]:
def calculate_metrics(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    return accuracy, precision, recall

# Calculate metrics for each model
metrics_lr = calculate_metrics(y_test, y_pred_lr)
metrics_dt = calculate_metrics(y_test, y_pred_dt)
metrics_rf = calculate_metrics(y_test, y_pred_rf)

# Print results as percentages
print("Logistic Regression - Accuracy: {:.2f}%, Precision: {:.2f}%, Recall: {:.2f}%".format(metrics_lr[0]*100, metrics_lr[1]*100, metrics_lr[2]*100))
print("Decision Tree - Accuracy: {:.2f}%, Precision: {:.2f}%, Recall: {:.2f}%".format(metrics_dt[0]*100, metrics_dt[1]*100, metrics_dt[2]*100))
print("Random Forest - Accuracy: {:.2f}%, Precision: {:.2f}%, Recall: {:.2f}%".format(metrics_rf[0]*100, metrics_rf[1]*100, metrics_rf[2]*100))
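
The three metrics above all derive from the confusion matrix. When approvals are the minority class, accuracy alone can look high even for a model that rarely approves anyone, which is why precision and recall are reported alongside it. The following sketch uses small hand-made label lists (illustrative values, not results from this notebook) to show the relationship.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative labels, not taken from the notebook's models
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision_manual = tp / (tp + fp)                   # approved predictions that were correct
recall_manual = tp / (tp + fn)                      # actual approvals that were found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"accuracy={accuracy_manual:.3f}, precision={precision_manual:.3f}, recall={recall_manual:.3f}")
```

In the notebook, `confusion_matrix(y_test, y_pred_rf)` would give the same breakdown for the Random Forest predictions.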