### Modeling

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb

In [2]:
X_train_resampled_controlled_df = pd.read_csv('X_train_preprocessed.csv')
y_train_resampled_controlled_df = pd.read_csv('y_train_preprocessed.csv')

X_test_transformed_df = pd.read_csv('X_test_transformed.csv')
y_test_df = pd.read_csv('y_test.csv')

The first model I am going to use is Logistic Regression.

In [3]:
lr_model = LogisticRegression()

# Flatten the single-column label DataFrame into the 1D array scikit-learn expects
y_train_flat = y_train_resampled_controlled_df.values.ravel()

lr_model.fit(X_train_resampled_controlled_df, y_train_flat)

In [4]:
y_pred_lr = lr_model.predict(X_test_transformed_df)

# Flatten the test labels into a 1D array as well
y_test_flat = y_test_df.values.ravel()

accuracy = accuracy_score(y_test_flat, y_pred_lr)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test_flat, y_pred_lr)
print("Confusion Matrix:")
print(conf_matrix)

class_report = classification_report(y_test_flat, y_pred_lr)
print("Classification Report:")
print(class_report)

Accuracy: 0.6820638339533973
Confusion Matrix:
[[30745 14425]
 [ 1812  4088]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.74     51070



Which metric should I use?

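Accuracy measures the proportion of correctly classified instances out of all instances. For the bank loan default problem, it is the share of loans whose default status the model predicts correctly. Because only 5,900 of the 51,070 test loans are defaults, accuracy is dominated by the non-default class and can look good even when many defaults are missed.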
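As a rough check on how informative accuracy is here, the sketch below recomputes it from the confusion matrix printed above and compares it with a hypothetical baseline that always predicts "no default". The baseline is only an illustration and is not a model fitted in this notebook.

```python
import numpy as np

# Confusion matrix printed above for the baseline logistic regression
# (rows = actual class, columns = predicted class).
conf_matrix = np.array([[30745, 14425],
                        [1812, 4088]])

# Accuracy = correct predictions (the diagonal) / all predictions.
accuracy = np.trace(conf_matrix) / conf_matrix.sum()
print(f"Model accuracy: {accuracy:.4f}")

# Hypothetical baseline: always predict class 0 ("no default").
# It is right for every actual non-default and wrong for every default.
n_non_default = conf_matrix[0].sum()
baseline_accuracy = n_non_default / conf_matrix.sum()
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4f}")
```

The always-predict-0 baseline scores higher on accuracy than the fitted model while catching no defaults at all, which is why accuracy alone is a weak guide on this dataset.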

A confusion matrix is a table that summarizes the performance of a classification algorithm. It compares the actual target values with the predicted values and shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the context of the bank loan default problem:

True Positives (TP): Instances where the model correctly predicts a loan default.
True Negatives (TN): Instances where the model correctly predicts no default.
False Positives (FP): Instances where the model incorrectly predicts a default when there is none (Type I error).
False Negatives (FN): Instances where the model incorrectly predicts no default when there is one (Type II error).
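The four counts can be read straight out of the matrix that `confusion_matrix` returns. The sketch below first uses a small made-up label set (`y_true_demo` and `y_pred_demo` are illustrative, not data from this notebook) to show the layout, then unpacks the matrix printed above for the logistic regression model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical) to show the layout for binary 0/1 labels:
# rows are actual classes, columns are predicted classes.
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 0]
tn_d, fp_d, fn_d, tp_d = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"Toy example: TN={tn_d}, FP={fp_d}, FN={fn_d}, TP={tp_d}")

# The matrix printed above for lr_model, unpacked the same way.
conf_matrix = np.array([[30745, 14425],
                        [1812, 4088]])
tn, fp, fn, tp = conf_matrix.ravel()
print(f"lr_model: TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```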

A classification report provides a summary of various evaluation metrics, including precision, recall, F1-score, and support, for each class (in binary classification, typically "positive" and "negative" classes). These metrics are calculated based on the concepts of true positives, true negatives, false positives, and false negatives:

Precision: The proportion of positive predictions that are correct, TP / (TP + FP). For the bank loan default problem, precision tells you how many of the predicted defaults are actual defaults.

Recall (or Sensitivity): The proportion of actual positives that the model identifies, TP / (TP + FN). For the bank loan default problem, recall tells you how many of the actual defaults the model correctly predicted.

F1-score: The harmonic mean of precision and recall, which balances the two. It is particularly useful when the classes are imbalanced.

Support: The number of actual occurrences of each class in the evaluated dataset.
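To make these definitions concrete, here is a minimal sketch that recomputes the class 1 (default) row of the classification report by hand from the confusion matrix counts printed above. The variable names are illustrative.

```python
# Counts for the default class, taken from the lr_model confusion matrix above.
tn, fp, fn, tp = 30745, 14425, 1812, 4088

# Precision: of all predicted defaults, how many were real defaults?
precision = tp / (tp + fp)

# Recall: of all real defaults, how many did the model catch?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Support: number of actual defaults in the test set.
support = tp + fn

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Rounded to two decimals, these values agree with the class 1 row of the report.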

I believe recall is the best metric for this problem. The error I most want to avoid is a false negative (Type II error), where the model predicts that a customer will not default when they actually do, and recall measures how many of the actual defaults the model catches.

A recall of 0.69 on the default class is not very good: the model still misses roughly 31% of actual defaults (1,812 of 5,900). Let's see if I can improve the model with hyperparameter tuning.

In [5]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2']}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='recall')

grid_search.fit(X_train_resampled_controlled_df, y_train_flat)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_lr_model = grid_search.best_estimator_

Best Hyperparameters: {'C': 1, 'penalty': 'l2'}


In [6]:
y_pred_best_lr = best_lr_model.predict(X_test_transformed_df)

accuracy_best_lr = accuracy_score(y_test_flat, y_pred_best_lr)
print("Accuracy:", accuracy_best_lr)

conf_matrix_best_lr = confusion_matrix(y_test_flat, y_pred_best_lr)
print("Confusion Matrix:")
print(conf_matrix_best_lr)

class_report_best_lr = classification_report(y_test_flat, y_pred_best_lr)
print("Classification Report:")
print(class_report_best_lr)

Accuracy: 0.6820638339533973
Confusion Matrix:
[[30745 14425]
 [ 1812  4088]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.74     51070



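There is no change in performance from hyperparameter tuning. The grid search selected C=1 with an L2 penalty, which are the same settings the baseline LogisticRegression() used, so the tuned model is identical to the first one. Next I will try Principal Component Analysis to see if reducing dimensionality will improve my model.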
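One way to judge whether the grid had anything to offer is to look at the cross-validated recall of every candidate, not only the winner. The sketch below does this on synthetic data from `make_classification`, since the notebook's CSV files are not available here. `X_demo`, `y_demo`, and `max_iter=1000` are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Same C values as the grid above.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=5, scoring="recall")
search.fit(X_demo, y_demo)

# cv_results_ holds the mean and spread of the score for every candidate.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If the mean scores differ by less than their standard deviations, the choice of C is unlikely to matter much, and a different kind of change is needed to move recall.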

In [7]:
pca = PCA(n_components=0.95)

X_train_pca = pca.fit_transform(X_train_resampled_controlled_df)
X_test_pca = pca.transform(X_test_transformed_df)

lr_model_pca = LogisticRegression()
lr_model_pca.fit(X_train_pca, y_train_flat)

In [8]:
y_pred_lr_pca = lr_model_pca.predict(X_test_pca)

accuracy_lr_pca = accuracy_score(y_test_flat, y_pred_lr_pca)
print("Accuracy with PCA:", accuracy_lr_pca)

conf_matrix_lr_pca = confusion_matrix(y_test_flat, y_pred_lr_pca)
print("Confusion Matrix with PCA:")
print(conf_matrix_lr_pca)

class_report_lr_pca = classification_report(y_test_flat, y_pred_lr_pca)
print("Classification Report with PCA:")
print(class_report_lr_pca)

Accuracy with PCA: 0.6777951830820442
Confusion Matrix with PCA:
[[30530 14640]
 [ 1815  4085]]
Classification Report with PCA:
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.68      0.56     51070
weighted avg       0.86      0.68      0.74     51070



With PCA, there is still no improvement: recall on the default class stays at 0.69 and accuracy slips slightly from 0.682 to 0.678. Next, I will look at a different ML model, the Random Forest Classifier.
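When `n_components` is a float between 0 and 1, PCA keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows how to inspect what was kept, using synthetic data as a stand-in for the preprocessed training set. `X_demo` and its shape are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the training features (illustrative only).
X_demo, _ = make_classification(n_samples=500, n_features=20,
                                n_informative=5, random_state=0)

# Keep enough components to explain at least 95% of the variance.
pca = PCA(n_components=0.95)
X_demo_pca = pca.fit_transform(X_demo)

# Cumulative share of variance explained by the retained components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_, "of", X_demo.shape[1])
print("Variance retained:", cumulative[-1])
```

Running the same inspection on the fitted `pca` object from the cell above would show how many features were actually dropped. If most components are kept, little change in the logistic regression results is to be expected.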

In [9]:
rf_model = RandomForestClassifier()

rf_model.fit(X_train_resampled_controlled_df, y_train_flat)

y_pred_rf = rf_model.predict(X_test_transformed_df)

In [10]:
accuracy_rf = accuracy_score(y_test_flat, y_pred_rf)
print("Accuracy with Random Forest Classifier:", accuracy_rf)

conf_matrix_rf = confusion_matrix(y_test_flat, y_pred_rf)
print("Confusion Matrix with Random Forest Classifier:")
print(conf_matrix_rf)

class_report_rf = classification_report(y_test_flat, y_pred_rf)
print("Classification Report with Random Forest Classifier:")
print(class_report_rf)

Accuracy with Random Forest Classifier: 0.8126688858429606
Confusion Matrix with Random Forest Classifier:
[[38935  6235]
 [ 3332  2568]]
Classification Report with Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.92      0.86      0.89     45170
           1       0.29      0.44      0.35      5900

    accuracy                           0.81     51070
   macro avg       0.61      0.65      0.62     51070
weighted avg       0.85      0.81      0.83     51070



Recall on the default class dropped sharply, from 0.69 (lr_model) to 0.44, even though accuracy improved from 0.68 to 0.81. For this problem, I care more about recall than accuracy.
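The trade-off is easier to see side by side. The sketch below takes the two confusion matrices printed above and reports accuracy, recall, and the number of missed defaults for each model. The `summary` dictionary is just an illustrative container.

```python
import numpy as np

# Confusion matrices printed above (rows = actual, columns = predicted).
conf_matrices = {
    "lr_model": np.array([[30745, 14425], [1812, 4088]]),
    "rf_model": np.array([[38935, 6235], [3332, 2568]]),
}

summary = {}
for name, cm in conf_matrices.items():
    tn, fp, fn, tp = cm.ravel()
    summary[name] = {
        "accuracy": (tp + tn) / cm.sum(),
        "recall": tp / (tp + fn),
        "missed_defaults": int(fn),
    }
    print(f"{name}: accuracy={summary[name]['accuracy']:.3f}, "
          f"recall={summary[name]['recall']:.3f}, "
          f"missed defaults={summary[name]['missed_defaults']}")
```

The random forest gains accuracy mainly by flagging far fewer non-defaulters, but it misses 3,332 actual defaults compared with 1,812 for the logistic regression.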

I will apply hyperparameter tuning to see if I can improve the model.

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='recall', n_jobs=-1, verbose=2)

grid_search.fit(X_train_resampled_controlled_df, y_train_flat)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_rf_model = grid_search.best_estimator_

Fitting 5 folds for each of 81 candidates, totalling 405 fits
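The fit count in the log follows directly from the grid: four hyperparameters with three values each give 3 × 3 × 3 × 3 = 81 combinations, and each is fitted once per fold. A small sketch using scikit-learn's `ParameterGrid` confirms the arithmetic without running any fits.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# ParameterGrid enumerates every combination without fitting anything.
n_candidates = len(ParameterGrid(param_grid))
cv_folds = 5
total_fits = n_candidates * cv_folds
print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

Counting the fits up front is a cheap way to estimate run time before launching a search over a training set of this size.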


In [None]:
y_pred_best_rf = best_rf_model.predict(X_test_transformed_df)

accuracy_best_rf = accuracy_score(y_test_flat, y_pred_best_rf)
print("Accuracy:", accuracy_best_rf)

conf_matrix_best_rf = confusion_matrix(y_test_flat, y_pred_best_rf)
print("Confusion Matrix:")
print(conf_matrix_best_rf)

class_report_best_rf = classification_report(y_test_flat, y_pred_best_rf)
print("Classification Report:")
print(class_report_best_rf)

After hyperparameter tuning, recall increased slightly, but it is still significantly lower than that of the Logistic Regression model. Rather than spend more time on rf_model, I am going to try a different ML model. The next one I will try is the Support Vector Classifier.

In [None]:
svc_model = SVC()

svc_model.fit(X_train_resampled_controlled_df, y_train_flat)

y_pred_svc = svc_model.predict(X_test_transformed_df)

accuracy_svc = accuracy_score(y_test_flat, y_pred_svc)
print("Accuracy with SVC:", accuracy_svc)

conf_matrix_svc = confusion_matrix(y_test_flat, y_pred_svc)
print("Confusion Matrix with SVC:")
print(conf_matrix_svc)

class_report_svc = classification_report(y_test_flat, y_pred_svc)
print("Classification Report with SVC:")
print(class_report_svc)