Loan Default Prediction Project

# Defining the Problem and Data Collection (Exercise 1)

## Problem Statement: Using historical data, predict the likelihood that an incoming loan application, if approved, will result in default.

## Data Types: Collect data on the number of loan applications, each applicant's demographic and financial profile, the requested loan profile (date, amount, loan structure, purpose), the application result, and, for approved loans, a historical timeline of payments and closure.

## Data Collection: Data will be obtained via our internal systems, credit bureaus, and available government records.

# Feature Selection and Model Choice (Exercise 2)

## Projected Relevant Features and Justification

### Demographic Information (Gender, Married, Dependents, Education, Self-Employed, Property Area): Useful for identifying trends and correlations between an applicant's profile and their level of risk.

### Financial Information (Applicant Income, Coapplicant Income, Credit History): Useful for judging ability to repay from current income, and for assessing the applicant's financial history and past repayment behavior, including other debt.

### Loan Information (Loan Amount, Loan Amount Term, Loan Status): Useful in combination with the other two categories, for example to relate the size and term of the loan to the applicant's income.
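One way to make the "ability to pay" idea concrete is a derived ratio feature. The sketch below is only an illustration: the `loan_to_income` helper and the sample values are hypothetical and not part of the project data, and it assumes income and loan amount are expressed in the same currency unit.

```python
def loan_to_income(applicant_income, coapplicant_income, loan_amount):
    """Return loan amount divided by combined income, or None if there is no income."""
    total_income = applicant_income + coapplicant_income
    if total_income <= 0:
        return None  # the ratio is undefined without income
    return loan_amount / total_income

# Hypothetical applicants (all fields in the same currency unit)
applicants = [
    {"ApplicantIncome": 5000, "CoapplicantIncome": 1500, "LoanAmount": 13000},
    {"ApplicantIncome": 2000, "CoapplicantIncome": 0, "LoanAmount": 20000},
]

ratios = [
    loan_to_income(a["ApplicantIncome"], a["CoapplicantIncome"], a["LoanAmount"])
    for a in applicants
]
print(ratios)
```

A higher ratio means the loan is large relative to income, which is the kind of relationship the three feature groups are meant to expose when combined.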

# Training, Evaluating, and Optimizing the Model (Exercise 3)

## Model: I will implement a binary classification model to predict which applications are most likely to result in a loan default.

## Training: I will train the model with K-Fold Cross-Validation, so that each fold is used once for validation while the remaining folds are used for training. This gives a more reliable performance estimate than a single split and lets me keep optimizing the model until it reaches an acceptable threshold. I will evaluate with the F1 score because the classes may be imbalanced (defaults are likely to be much rarer than non-defaults), and accuracy alone can be misleading in that case. Finally, for the loss function, I’ll use binary cross-entropy to measure how closely the predicted probabilities match what actually happened in the historical data.
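To make the two ideas in the training plan concrete, here is a minimal pure-Python sketch of how K-Fold splitting partitions the data and how binary cross-entropy is computed. The helper names `k_fold_indices` and `binary_cross_entropy` are hypothetical, and the splitter is deliberately simplified: it neither shuffles nor stratifies, which a real run on imbalanced data would want (for example with scikit-learn's `StratifiedKFold`).

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def binary_cross_entropy(y_true, p_pred, eps=1e-15):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

folds = list(k_fold_indices(10, 5))
print("First fold (train, validation):", folds[0])

# Confident and correct predictions give a low loss; confident and wrong ones a high loss
good_loss = binary_cross_entropy([1, 0], [0.9, 0.2])
bad_loss = binary_cross_entropy([1, 0], [0.1, 0.8])
print(round(good_loss, 4), round(bad_loss, 4))
```

The loss penalizes confident mistakes heavily, which is why it is a useful companion to a threshold-based metric such as F1.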


# Designing Machine Learning Solutions (Exercise 4)

## Predicting future stock prices: Supervised Learning (regression) will enable us to use labeled historical data to predict a numerical value. Because the actual prices are known for past dates, the dataset gives the algorithm direct feedback.

## Organizing a library of books into genres or categories based on similarities: If the books already carry genre labels, Supervised Learning on features such as book type, author, and topic can classify them into those categories. If no labels exist and the groups must emerge from the similarities themselves, Unsupervised Learning (clustering) is the better fit. Either way, the result is an organized library.

## Program a robot to navigate and find the shortest path in a maze: I'd use Reinforcement Learning, because the robot learns by exploring the maze and receiving rewards or penalties depending on its actions. Since the robot doesn't have labeled data in advance, it learns through consequences to find the shortest path.
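As a rough illustration of learning through rewards and penalties, here is a minimal tabular Q-learning sketch on a tiny grid maze. Q-learning is one common reinforcement learning algorithm, chosen here only as an example. The maze layout, the reward of -1 per step, and the hyperparameters are all assumptions made for this sketch.

```python
import random

# Assumed toy maze: S = start, G = goal, # = wall, . = free cell
MAZE = ["S..#",
        ".#..",
        ".##.",
        "...G"]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def find(ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(MAZE):
        for c, cell in enumerate(row):
            if cell == ch:
                return (r, c)
    raise ValueError(ch)

START, GOAL = find("S"), find("G")

def step(state, action):
    """Move if the target cell is free; every step costs -1 (the penalty)."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < len(MAZE) and 0 <= c < len(MAZE[0]) and MAZE[r][c] != "#":
        state = (r, c)
    return state, -1.0, state == GOAL

def train(episodes=2000, alpha=0.5, gamma=1.0, eps=0.1, seed=0):
    """Epsilon-greedy Q-learning; unseen (state, action) pairs default to 0."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = START
        for _ in range(200):  # cap on episode length
            if rng.random() < eps:
                a = rng.randrange(len(ACTIONS))  # explore
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            best_next = 0.0 if done else max(q.get((nxt, i), 0.0) for i in range(len(ACTIONS)))
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * best_next - old)
            state = nxt
            if done:
                break
    return q

def greedy_path(q, limit=50):
    """Follow the highest-valued action from START until GOAL or the step limit."""
    state, path = START, [START]
    while state != GOAL and len(path) <= limit:
        a = max(range(len(ACTIONS)), key=lambda i: q.get((state, i), 0.0))
        state, _, _ = step(state, ACTIONS[a])
        path.append(state)
    return path

q_table = train()
path = greedy_path(q_table)
print("Steps taken by the learned policy:", len(path) - 1)
print("Path:", path)
```

Because every step is penalized, the policy that maximizes total reward is the one that reaches the goal in the fewest steps, so no labeled "correct path" is ever needed.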

# Designing an Evaluation Strategy (Exercise 5)

## Supervised Learning: My metrics will include accuracy, precision, recall, and the F1 score, so that I can train the model on historical data and compare its predictions against actual outcomes. I'd also use K-Fold Cross-Validation to check that performance holds across different subsets of the data.
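The four metrics can be computed directly from counts of true and false positives and negatives. The sketch below uses a hypothetical helper, `classification_metrics`, and made-up labels in which 1 means "default". It is meant to show why accuracy alone is not enough when defaults are rare.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted defaults, how many were real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real defaults, how many were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 10 loans, 2 of which defaulted
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

always_no_default = classification_metrics(y_true, [0] * 10)
some_detection = classification_metrics(y_true, [0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

print("Always predict 'no default':", always_no_default)
print("Catches one default:", some_detection)
```

Both predictors reach the same accuracy, but only the second has a non-zero recall and F1, which is the behavior the F1 score is intended to surface.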

## Unsupervised Learning: I’d look at the silhouette score or use the elbow method to see if the clusters make sense. I am not yet sure how I will compare results without a labeled historical dataset.
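The silhouette score needs no labels from a historical dataset: for each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), giving (b - a) / max(a, b). Below is a minimal sketch for 1-D points. The `silhouette` helper and the sample data are hypothetical, and it assumes at least two clusters, each with at least two points, and no cluster made only of identical points.

```python
def silhouette(points, labels):
    """Mean silhouette over 1-D points; labels must be a list of cluster ids."""
    scores = []
    for i, (x, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        same = [abs(x - y) for j, (y, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        a = sum(same) / len(same)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(abs(x - y) for y, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the two obvious groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters mix the two groups
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate tight, well-separated clusters, while scores near or below 0 suggest points are assigned to the wrong cluster, so different clusterings can be compared against each other even without ground truth.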

## Reinforcement Learning: I’d measure how much reward the agent earns over time and check whether it is improving or converging on a good strategy. I am not yet sure how I will select the right strategy.
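One possible way to check for improvement and convergence is to smooth the per-episode rewards with a moving average, then compare candidate strategies by their final smoothed reward. This is a sketch under stated assumptions: the helper names, the window size, the tolerance, and the reward sequences are all hypothetical.

```python
def moving_average(values, window):
    """Average of each consecutive window-sized slice of values."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def has_converged(rewards, window=3, tol=0.05):
    """True if the last two moving-average values differ by at most tol."""
    avg = moving_average(rewards, window)
    return len(avg) >= 2 and abs(avg[-1] - avg[-2]) <= tol

# Made-up per-episode rewards for two candidate strategies
candidates = {
    "policy_a": [0.1, 0.3, 0.5, 0.7, 0.8, 0.8, 0.8, 0.8],
    "policy_b": [0.2, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}

for name, rewards in candidates.items():
    print(name, "converged:", has_converged(rewards),
          "final average:", round(moving_average(rewards, 3)[-1], 3))

# Pick the strategy with the highest final smoothed reward
best = max(candidates, key=lambda name: moving_average(candidates[name], 3)[-1])
print("Selected:", best)
```

In practice the comparison would be repeated over several random seeds, since a single run can be lucky or unlucky.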


In [None]:
# ---- Exercise 1: problem statement, data types, and data sources ----
problem_statement = "Predict the likelihood that an incoming loan application, if approved, will result in default, using historical data."

data_types = [
    "number of loan applications", "applicant's demographic profile", "applicant's financial profile",
    "requested loan profile (date, amount, structure, purpose)", "application result",
    "for approved loans: timeline of payments and closure",
]

data_sources = [
    "our internal systems (applications, servicing, collections)", "credit bureaus",
    "available government records / public economic indicators",
]

print(problem_statement)
print("Data Types:", data_types)
print("Data Sources:", data_sources)

# ---- Exercise 2: feature scoring on synthetic or real loan data ----
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

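# Assumed configuration (these names were used but never defined in the original cell)
USE_SYNTHETIC = True    # set to False to load a real dataset instead
CSV_PATH = "loans.csv"  # placeholder path; point this at the real file

rng = np.random.default_rng(11)
if USE_SYNTHETIC: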
    n = 2500
    df = pd.DataFrame({
        'Gender': rng.choice(['Male','Female'], size=n),
        'Married': rng.choice(['Yes','No'], size=n, p=[0.6,0.4]),
        'Dependents': rng.integers(0,4, size=n),
        'Education': rng.choice(['Graduate','Not Graduate'], size=n, p=[0.7,0.3]),
        'Self_Employed': rng.choice(['Yes','No'], size=n, p=[0.15,0.85]),
        'Property_Area': rng.choice(['Urban','Semiurban','Rural'], size=n, p=[0.4,0.35,0.25]),
        'ApplicantIncome': np.clip(rng.normal(5000, 2000, size=n), 500, 20000).round(0),
        'CoapplicantIncome': np.clip(rng.normal(1500, 1000, size=n), 0, 15000).round(0),
        'Credit_History': rng.choice([1.0, 0.0, np.nan], size=n, p=[0.75, 0.2, 0.05]),
        'LoanAmount': np.clip(rng.normal(150, 60, size=n), 20, 600).round(1),
        'Loan_Amount_Term': rng.choice([180,240,300,360], size=n, p=[0.1,0.2,0.3,0.4]),
        'Loan_Status': rng.choice(['Y','N'], size=n, p=[0.85,0.15]),
        'Purpose': rng.choice(['Home','Car','Education','Personal'], size=n, p=[0.5,0.2,0.15,0.15])
    })
    logit = (
        -3.2 - 0.00022*df['ApplicantIncome'] - 0.0001*df['CoapplicantIncome']
        - 1.0*df['Credit_History'].fillna(0) + 0.004*df['LoanAmount']
        + 0.15*(df['Purpose'].isin(['Personal','Education']).astype(int))
    )
    p = 1/(1+np.exp(-logit))
    df['Default'] = rng.binomial(1, p)
else:
    df = pd.read_csv(CSV_PATH)
    # Expect a binary target column named 'Default' (0/1).

y = df['Default']
X = df.drop(columns=['Default'])
cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
num_cols = X.select_dtypes(include=['number','bool']).columns.tolist()

preprocess = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("oh", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]), cat_cols)
])

# Fit transform once for feature scoring
Xenc = preprocess.fit_transform(X)
mi = mutual_info_classif(Xenc, y, random_state=11)
ohe = preprocess.named_transformers_['cat']['oh'] if cat_cols else None
feat_names = num_cols + (ohe.get_feature_names_out(cat_cols).tolist() if ohe is not None else [])
mi_series = pd.Series(mi, index=feat_names).sort_values(ascending=False)
print("Top features by Mutual Information (filter):\n", mi_series.head(12))

rf = Pipeline([('prep', preprocess), ('rf', RandomForestClassifier(n_estimators=250, random_state=11, class_weight='balanced'))])
rf.fit(X, y)
rf_imp = pd.Series(rf.named_steps['rf'].feature_importances_, index=feat_names).sort_values(ascending=False)
print("\nTop features by RandomForest importance (embedded):\n", rf_imp.head(12))

# ---- Exercise 3: train, evaluate, and cross-validate two classifiers ----
# RandomForestClassifier, train_test_split, and Pipeline are already imported above
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score, log_loss)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=11)

def pipe(model):
    return Pipeline([('prep', preprocess), ('clf', model)])

models = {
    'LogReg': LogisticRegression(max_iter=300, class_weight='balanced'),
    'RF': RandomForestClassifier(n_estimators=300, random_state=11, class_weight='balanced')
}

# One stratified splitter shared by all models so the CV scores are comparable
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

for name, mdl in models.items():
    clf = pipe(mdl)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (default)
    print(f"\n=== {name} (Test) ===")
    print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
    # zero_division=0 avoids a warning when the model predicts no defaults at all
    print("Precision:", round(precision_score(y_test, y_pred, zero_division=0), 3))
    print("Recall:", round(recall_score(y_test, y_pred), 3))
    print("F1:", round(f1_score(y_test, y_pred, zero_division=0), 3))
    print("ROC-AUC:", round(roc_auc_score(y_test, y_proba), 3))
    print("PR-AUC:", round(average_precision_score(y_test, y_proba), 3))
    print("Log Loss:", round(log_loss(y_test, y_proba), 3))

    # 5-fold stratified CV on the full dataset; cross_validate refits a clone per fold
    cv = cross_validate(clf, X, y, cv=skf, scoring=['f1', 'roc_auc', 'average_precision'])
    print("CV F1 (mean±std): {:.3f}±{:.3f}".format(cv['test_f1'].mean(), cv['test_f1'].std()))
    print("CV ROC-AUC (mean±std): {:.3f}±{:.3f}".format(cv['test_roc_auc'].mean(), cv['test_roc_auc'].std()))
    print("CV PR-AUC (mean±std): {:.3f}±{:.3f}".format(cv['test_average_precision'].mean(), cv['test_average_precision'].std()))

# ---- Unsupervised demo: K-Means with silhouette + elbow ----
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

Xb, _ = make_blobs(n_samples=900, centers=4, cluster_std=0.7, random_state=21)
ks = list(range(2,9))
inertias, sils = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=21)
    km.fit(Xb)
    inertias.append(km.inertia_)
    sils.append(silhouette_score(Xb, km.labels_))

plt.figure()
plt.plot(ks, inertias, marker='o')
plt.xlabel('k'); plt.ylabel('Inertia'); plt.title('Elbow Method (K-Means)'); plt.show()

plt.figure()
plt.plot(ks, sils, marker='o')
plt.xlabel('k'); plt.ylabel('Silhouette'); plt.title('Silhouette vs k'); plt.show()

print('Best k by silhouette:', ks[int(np.argmax(sils))])

# ---- RL demo: simple epsilon-greedy bandit to visualize cumulative reward ----
rng = np.random.default_rng(21)
k = 5  # number of arms
true_means = rng.uniform(0.1, 0.9, size=k)
best_mean = true_means.max()

def run_bandit(eps=0.1, steps=1500):
    """Play a k-armed Gaussian bandit with epsilon-greedy action selection."""
    q = np.zeros(k)  # running estimate of each arm's mean reward
    n = np.zeros(k)  # number of pulls per arm
    rewards = []
    for _ in range(steps):
        # Explore a random arm with probability eps, otherwise exploit the best estimate
        a = rng.integers(0, k) if rng.random() < eps else int(np.argmax(q))
        r = rng.normal(true_means[a], 0.1)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental mean update
        rewards.append(r)
    return np.array(rewards)

r01 = run_bandit(0.1)
r00 = run_bandit(0.0)
# Expected cumulative reward of always pulling the best arm
cum_opt = np.arange(1, len(r01) + 1) * best_mean

plt.figure()
plt.plot(r01.cumsum(), label='eps=0.1')
plt.plot(r00.cumsum(), label='eps=0.0')
plt.plot(cum_opt, label='optimal')
plt.xlabel('Steps')
plt.ylabel('Cumulative Reward')
plt.title('Bandit: Cumulative Reward')
plt.legend()
plt.show()
print('True arm means:', np.round(true_means, 3))

