### Modeling

In [15]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

In [16]:
X_train_resampled_controlled_df = pd.read_csv('X_train_preprocessed.csv')
y_train_resampled_controlled_df = pd.read_csv('y_train_preprocessed.csv')

X_test_transformed_df = pd.read_csv('X_test_transformed.csv')
y_test_df = pd.read_csv('y_test.csv')

The first model I am going to use is Logistic Regression.

In [17]:
lr_model = LogisticRegression()

# Convert the single-column label DataFrame to a 1D array, as scikit-learn expects
y_train_flat = y_train_resampled_controlled_df.values.ravel()

lr_model.fit(X_train_resampled_controlled_df, y_train_flat)

In [18]:
y_pred_lr = lr_model.predict(X_test_transformed_df)

# Convert the single-column label DataFrame to a 1D array, as scikit-learn expects
y_test_flat = y_test_df.values.ravel()

accuracy = accuracy_score(y_test_flat, y_pred_lr)
print("Accuracy:", accuracy)

conf_matrix = confusion_matrix(y_test_flat, y_pred_lr)
print("Confusion Matrix:")
print(conf_matrix)

class_report = classification_report(y_test_flat, y_pred_lr)
print("Classification Report:")
print(class_report)

Accuracy: 0.6820638339533973
Confusion Matrix:
[[30745 14425]
 [ 1812  4088]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.74     51070


Which metric should I use?

Accuracy measures the proportion of correctly classified instances out of all instances. For a bank loan default problem it gives the overall share of correct predictions, but because defaults are the minority class (5,900 of the 51,070 test loans), a high accuracy can hide poor detection of defaults.
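
As a quick illustration of why accuracy alone can mislead here, the sketch below uses the support counts from the classification report above. The "always predict no default" baseline is a hypothetical model introduced only for comparison; it is not part of the notebook.

```python
# Support counts taken from the classification report above.
n_no_default = 45170
n_default = 5900
total = n_no_default + n_default

# Hypothetical baseline: always predict class 0 (no default).
baseline_accuracy = n_no_default / total
baseline_recall = 0 / n_default  # no default is ever flagged

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall:.4f}")
```

This baseline would beat the logistic regression's 0.68 accuracy while catching no defaults at all.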

A confusion matrix is a table that summarizes the performance of a classification algorithm. It compares the actual target values with the predicted values and shows the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In the context of the bank loan default problem:

True Positives (TP): Instances where the model correctly predicts a loan default.
True Negatives (TN): Instances where the model correctly predicts no default.
False Positives (FP): Instances where the model incorrectly predicts a default when there is none (Type I error).
False Negatives (FN): Instances where the model incorrectly predicts no default when there is one (Type II error).
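
To connect these four terms to the printed matrix, here is a minimal sketch that unpacks the logistic regression confusion matrix shown above. It relies on scikit-learn's convention that rows are actual classes and columns are predicted classes, so for labels 0 and 1 the layout is [[TN, FP], [FN, TP]].

```python
import numpy as np

# Confusion matrix printed above for the logistic regression model.
# Rows = actual class, columns = predicted class: [[TN, FP], [FN, TP]].
conf_matrix = np.array([[30745, 14425],
                        [1812, 4088]])

# ravel() flattens row by row, which yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = (int(v) for v in conf_matrix.ravel())
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```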

A classification report provides a summary of various evaluation metrics, including precision, recall, F1-score, and support, for each class (in binary classification, typically "positive" and "negative" classes). These metrics are calculated based on the concepts of true positives, true negatives, false positives, and false negatives:

Precision: Precision measures the proportion of true positive predictions out of all positive predictions made by the model. In the context of a bank loan default problem, precision tells you how many of the predicted defaults are actual defaults. 

Recall (or Sensitivity): Recall measures the proportion of true positive predictions out of all actual positive instances. In the context of a bank loan default problem, recall tells you how many of the actual defaults were correctly predicted by the model. 

F1-score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. It is particularly useful when the classes are imbalanced.

Support: Support is the number of actual occurrences of the class in the specified dataset.
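
The definitions above can be checked by hand against the report. This sketch recomputes the class 1 (default) metrics from the four counts in the logistic regression confusion matrix; the variable names are my own.

```python
# Counts from the logistic regression confusion matrix above.
tp, fp, fn, tn = 4088, 14425, 1812, 30745

precision = tp / (tp + fp)          # share of predicted defaults that are real
recall = tp / (tp + fn)             # share of real defaults that were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + tn + fp + fn)
support = tp + fn                   # number of actual defaults

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f} support={support}")
```

Rounded to two decimals, these agree with the class 1 row and the accuracy line of the report.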

I believe recall is the best metric for this problem because the goal is to minimize false negatives (Type II errors): we don't want the model to predict that a customer will not default when they actually do.

A recall of 0.69 for the default class means the model still misses 1,812 of the 5,900 actual defaults. Let's see if hyperparameter tuning can improve it.
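
Since false negatives are the main concern, one option worth noting (not something this notebook does) is lowering the decision threshold on predicted probabilities. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan data, and the 0.3 threshold is an arbitrary assumption chosen only to show the trade-off.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives), not the loan dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_default = model.predict_proba(X_test)[:, 1]  # P(class 1)

recall_by_threshold = {}
for threshold in (0.5, 0.3):
    # Flag a default whenever the predicted probability reaches the threshold.
    y_pred = (proba_default >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    print(f"threshold={threshold}: "
          f"recall={recall_by_threshold[threshold]:.2f}, "
          f"precision={precision:.2f}")
```

A lower threshold can only add positive predictions, so recall never goes down, but precision usually pays for it.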

In [19]:
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2']}

grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='recall')

grid_search.fit(X_train_resampled_controlled_df, y_train_flat)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_lr_model = grid_search.best_estimator_

Best Hyperparameters: {'C': 1, 'penalty': 'l2'}


In [20]:
y_pred_best_lr = best_lr_model.predict(X_test_transformed_df)

accuracy_best_lr = accuracy_score(y_test_flat, y_pred_best_lr)
print("Accuracy:", accuracy_best_lr)

conf_matrix_best_lr = confusion_matrix(y_test_flat, y_pred_best_lr)
print("Confusion Matrix:")
print(conf_matrix_best_lr)

class_report_best_lr = classification_report(y_test_flat, y_pred_best_lr)
print("Classification Report:")
print(class_report_best_lr)

Accuracy: 0.6820638339533973
Confusion Matrix:
[[30745 14425]
 [ 1812  4088]]
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.69      0.56     51070
weighted avg       0.86      0.68      0.74     51070


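There is no change in performance from hyperparameter tuning. The grid search selected C=1 with an l2 penalty, which matches scikit-learn's default LogisticRegression settings, so the tuned model is the same as the baseline. Next I will try Principal Component Analysis to see if reducing dimensionality will improve my model.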
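
To check how sensitive cross-validated recall is to C, rather than looking only at the winning value, the `cv_results_` attribute of the fitted search can be tabulated. This is a minimal sketch on synthetic data (an assumption, so the numbers will not match the loan data); `max_iter=1000` is added only to avoid convergence warnings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the loan dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                           cv=5, scoring='recall')
grid_search.fit(X, y)

# One row per candidate C, with mean and spread of recall across folds.
results = pd.DataFrame(grid_search.cv_results_)
summary = results[['param_C', 'mean_test_score', 'std_test_score']]
print(summary)

# The default C, for comparison with the selected value.
default_C = LogisticRegression().get_params()['C']
print("Default C:", default_C)
```

If the mean scores are nearly flat across C, regularization strength is unlikely to be what limits recall.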

In [21]:
desired_ratios = [0.6, 0.65, 0.50, 0.55]

# Load X_test_transformed_df and y_test_flat
X_test_transformed_df = pd.read_csv('X_test_transformed.csv')
y_test_flat = pd.read_csv('y_test.csv').values.ravel()

# Define hyperparameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l2']}

for desired_ratio in desired_ratios:
    # Load resampled and preprocessed data
    X_train_resampled_df = pd.read_csv(f'X_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_resampled_df = pd.read_csv(f'y_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_flat = y_train_resampled_df.values.ravel()

    # Fit logistic regression model
    lr_model = LogisticRegression()
    lr_model.fit(X_train_resampled_df, y_train_flat)

    # Predict using the fitted model
    y_pred_lr = lr_model.predict(X_test_transformed_df)

    # Evaluate the model
    accuracy = accuracy_score(y_test_flat, y_pred_lr)
    print(f"Accuracy (Desired Ratio {desired_ratio}): {accuracy}")

    conf_matrix = confusion_matrix(y_test_flat, y_pred_lr)
    print(f"Confusion Matrix (Desired Ratio {desired_ratio}):")
    print(conf_matrix)

    class_report = classification_report(y_test_flat, y_pred_lr)
    print(f"Classification Report (Desired Ratio {desired_ratio}):")
    print(class_report)

    # Hyperparameter tuning
    grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='recall')
    grid_search.fit(X_train_resampled_df, y_train_flat)

    # Best hyperparameters
    best_params = grid_search.best_params_
    print(f"Best Hyperparameters (Desired Ratio {desired_ratio}):", best_params)

    best_lr_model = grid_search.best_estimator_

    # Evaluate best model
    y_pred_best_lr = best_lr_model.predict(X_test_transformed_df)

    accuracy_best_lr = accuracy_score(y_test_flat, y_pred_best_lr)
    print(f"Accuracy (Best Model, Desired Ratio {desired_ratio}):", accuracy_best_lr)

    conf_matrix_best_lr = confusion_matrix(y_test_flat, y_pred_best_lr)
    print(f"Confusion Matrix (Best Model, Desired Ratio {desired_ratio}):")
    print(conf_matrix_best_lr)

    class_report_best_lr = classification_report(y_test_flat, y_pred_best_lr)
    print(f"Classification Report (Best Model, Desired Ratio {desired_ratio}):")
    print(class_report_best_lr)

Accuracy (Desired Ratio 0.6): 0.779185431760329
Confusion Matrix (Desired Ratio 0.6):
[[36629  8541]
 [ 2736  3164]]
Classification Report (Desired Ratio 0.6):
              precision    recall  f1-score   support

           0       0.93      0.81      0.87     45170
           1       0.27      0.54      0.36      5900

    accuracy                           0.78     51070
   macro avg       0.60      0.67      0.61     51070
weighted avg       0.85      0.78      0.81     51070
Best Hyperparameters (Desired Ratio 0.6): {'C': 10, 'penalty': 'l2'}
Accuracy (Best Model, Desired Ratio 0.6): 0.7792245936949286
Confusion Matrix (Best Model, Desired Ratio 0.6):
[[36631  8539]
 [ 2736  3164]]
Classification Report (Best Model, Desired Ratio 0.6):
              precision    recall  f1-score   support

           0       0.93      0.81      0.87     45170
           1       0.27      0.54      0.36      5900

    accuracy                           0.78     51070
   macro avg       0.60      0

Switching to markdown: I am not using this model, but I don't want to delete the code.

pca = PCA(n_components=0.95)

X_train_pca = pca.fit_transform(X_train_resampled_controlled_df)
X_test_pca = pca.transform(X_test_transformed_df)

lr_model_pca = LogisticRegression()
lr_model_pca.fit(X_train_pca, y_train_flat)

y_pred_lr_pca = lr_model_pca.predict(X_test_pca)

accuracy_lr_pca = accuracy_score(y_test_flat, y_pred_lr_pca)
print("Accuracy with PCA:", accuracy_lr_pca)

conf_matrix_lr_pca = confusion_matrix(y_test_flat, y_pred_lr_pca)
print("Confusion Matrix with PCA:")
print(conf_matrix_lr_pca)

class_report_lr_pca = classification_report(y_test_flat, y_pred_lr_pca)
print("Classification Report with PCA:")
print(class_report_lr_pca)

With PCA, there is still no improvement. Next, I will try a different ML model: the Random Forest Classifier.
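
For reference, passing a float such as 0.95 to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below shows this on synthetic data (an assumption); `svd_solver='full'` is set explicitly because that solver supports a fractional `n_components`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic features with built-in redundancy, not the loan dataset.
X, _ = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)

pca = PCA(n_components=0.95, svd_solver='full')
X_pca = pca.fit_transform(X)

# Cumulative share of variance explained by the kept components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Variance explained:", round(float(cumulative[-1]), 4))
```

Printing `pca.n_components_` on the real training data would show how much dimensionality PCA actually removed.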

In [22]:
desired_ratios = [0.6, 0.65, 0.50, 0.55]

X_test_transformed_df = pd.read_csv('X_test_transformed.csv')
y_test_flat = pd.read_csv('y_test.csv').values.ravel()

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

for desired_ratio in desired_ratios:
    X_train_resampled_df = pd.read_csv(f'X_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_resampled_df = pd.read_csv(f'y_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_flat = y_train_resampled_df.values.ravel()

    # Initialize RandomForestClassifier
    rf_model = RandomForestClassifier()

    # Fit the model
    rf_model.fit(X_train_resampled_df, y_train_flat)

    # Predict using the fitted model
    y_pred_rf = rf_model.predict(X_test_transformed_df)

    # Evaluate the model
    accuracy_rf = accuracy_score(y_test_flat, y_pred_rf)
    print(f"Accuracy (Desired Ratio {desired_ratio}) with RandomForestClassifier:", accuracy_rf)

    conf_matrix_rf = confusion_matrix(y_test_flat, y_pred_rf)
    print(f"Confusion Matrix (Desired Ratio {desired_ratio}) with RandomForestClassifier:")
    print(conf_matrix_rf)

    class_report_rf = classification_report(y_test_flat, y_pred_rf)
    print(f"Classification Report (Desired Ratio {desired_ratio}) with RandomForestClassifier:")
    print(class_report_rf)

    # Hyperparameter tuning
    grid_search = GridSearchCV(rf_model, param_grid, cv=5, scoring='recall', n_jobs=-1, verbose=2)
    grid_search.fit(X_train_resampled_df, y_train_flat)

    # Best hyperparameters
    best_params = grid_search.best_params_
    print(f"Best Hyperparameters (Desired Ratio {desired_ratio}):", best_params)

    best_rf_model = grid_search.best_estimator_

    # Evaluate best model
    y_pred_best_rf = best_rf_model.predict(X_test_transformed_df)

    accuracy_best_rf = accuracy_score(y_test_flat, y_pred_best_rf)
    print(f"Accuracy (Best Model, Desired Ratio {desired_ratio}):", accuracy_best_rf)

    conf_matrix_best_rf = confusion_matrix(y_test_flat, y_pred_best_rf)
    print(f"Confusion Matrix (Best Model, Desired Ratio {desired_ratio}):")
    print(conf_matrix_best_rf)

    class_report_best_rf = classification_report(y_test_flat, y_pred_best_rf)
    print(f"Classification Report (Best Model, Desired Ratio {desired_ratio}):")
    print(class_report_best_rf)


Accuracy (Desired Ratio 0.6) with RandomForestClassifier: 0.7911494027804974
Confusion Matrix (Desired Ratio 0.6) with RandomForestClassifier:
[[37492  7678]
 [ 2988  2912]]
Classification Report (Desired Ratio 0.6) with RandomForestClassifier:
              precision    recall  f1-score   support

           0       0.93      0.83      0.88     45170
           1       0.27      0.49      0.35      5900

    accuracy                           0.79     51070
   macro avg       0.60      0.66      0.61     51070
weighted avg       0.85      0.79      0.82     51070

Fitting 5 folds for each of 81 candidates, totalling 405 fits
Best Hyperparameters (Desired Ratio 0.6): {'max_depth': 20, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Accuracy (Best Model, Desired Ratio 0.6): 0.7854121793616604
Confusion Matrix (Best Model, Desired Ratio 0.6):
[[37158  8012]
 [ 2947  2953]]
Classification Report (Best Model, Desired Ratio 0.6):
              precision    recall  f1-scor



[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   4.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  17.0s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   8.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=   3.7s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   7.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   7.5s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=50; total time=   4.1s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   7.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=  15.7s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estim



Best Hyperparameters (Desired Ratio 0.5): {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 100}
Accuracy (Best Model, Desired Ratio 0.5): 0.6824946152339926
Confusion Matrix (Best Model, Desired Ratio 0.5):
[[30811 14359]
 [ 1856  4044]]
Classification Report (Best Model, Desired Ratio 0.5):
              precision    recall  f1-score   support

           0       0.94      0.68      0.79     45170
           1       0.22      0.69      0.33      5900

    accuracy                           0.68     51070
   macro avg       0.58      0.68      0.56     51070
weighted avg       0.86      0.68      0.74     51070
Accuracy (Desired Ratio 0.55) with RandomForestClassifier: 0.7447033483454083
Confusion Matrix (Desired Ratio 0.55) with RandomForestClassifier:
[[34545 10625]
 [ 2413  3487]]
Classification Report (Desired Ratio 0.55) with RandomForestClassifier:
              precision    recall  f1-score   support

           0       0.93      0.76      0.84  

Recall dropped from .69 (lr_model) to .43, even though accuracy improved. For this problem, recall matters more to me than accuracy.

Hyperparameter tuning increased recall slightly, but it is still well below the Logistic Regression model. Rather than spend more time on the rf_model, I am going to try a different ML model. The next one I will try is the Support Vector Classifier.
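
Because recall for the default class is the number being compared across models, it can be pulled out directly with `recall_score` instead of being read off each classification report. This is a small sketch on toy labels; `default_recall` and the `scores` dictionary are hypothetical helpers, not part of the notebook.

```python
from sklearn.metrics import recall_score


def default_recall(y_true, y_pred):
    """Recall for class 1 (default): TP / (TP + FN)."""
    return recall_score(y_true, y_pred, pos_label=1)


# Toy labels: three actual defaults, two of which are predicted.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Collect one entry per model so the runs are easy to compare side by side.
scores = {"toy_model": default_recall(y_true, y_pred)}
print(scores)
```

In the loops above, the same call could take `y_test_flat` and each model's predictions.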

NOT USING

svc_model = SVC()

svc_model.fit(X_train_resampled_controlled_df, y_train_flat)

y_pred_svc = svc_model.predict(X_test_transformed_df)

accuracy_svc = accuracy_score(y_test_flat, y_pred_svc)
print("Accuracy with SVC:", accuracy_svc)

conf_matrix_svc = confusion_matrix(y_test_flat, y_pred_svc)
print("Confusion Matrix with SVC:")
print(conf_matrix_svc)

class_report_svc = classification_report(y_test_flat, y_pred_svc)
print("Classification Report with SVC:")
print(class_report_svc)

param_grid = {
    'C': [0.1, 1, 10],                # regularization parameter
    'kernel': ['linear', 'rbf'],      # kernel type
    'gamma': ['scale', 'auto'],       # kernel coefficient (only for rbf kernel)
}

svc_model = SVC()

grid_search = GridSearchCV(svc_model, param_grid, cv=5, scoring='recall', n_jobs=-1)

grid_search.fit(X_train_resampled_controlled_df, y_train_flat)

best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

best_svc_model = grid_search.best_estimator_

y_pred_best_svc = best_svc_model.predict(X_test_transformed_df)

accuracy_best_svc = accuracy_score(y_test_flat, y_pred_best_svc)
print("Accuracy with Best SVC Model:", accuracy_best_svc)

conf_matrix_best_svc = confusion_matrix(y_test_flat, y_pred_best_svc)
print("Confusion Matrix with Best SVC Model:")
print(conf_matrix_best_svc)

class_report_best_svc = classification_report(y_test_flat, y_pred_best_svc)
print("Classification Report with Best SVC Model:")
print(class_report_best_svc)

In [24]:
desired_ratios = [0.6, 0.65, 0.50, 0.55]

# Load X_test_transformed_df and y_test_flat
X_test_transformed_df = pd.read_csv('X_test_transformed.csv')
y_test_flat = pd.read_csv('y_test.csv').values.ravel()

# Define hyperparameter grid for XGBoost
param_grid = {
    'max_depth': [3, 6, 9],
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

for desired_ratio in desired_ratios:
    X_train_resampled_df = pd.read_csv(f'X_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_resampled_df = pd.read_csv(f'y_train_preprocessed_ratio_{str(int(desired_ratio * 100)).zfill(2)}_.csv')
    y_train_flat = y_train_resampled_df.values.ravel()

    xgb_model = XGBClassifier()

    xgb_model.fit(X_train_resampled_df, y_train_flat)

    y_pred_xgb = xgb_model.predict(X_test_transformed_df)

    # Evaluate the model
    accuracy_xgb = accuracy_score(y_test_flat, y_pred_xgb)
    print(f"Accuracy (Desired Ratio {desired_ratio}) with XGBoost:", accuracy_xgb)

    conf_matrix_xgb = confusion_matrix(y_test_flat, y_pred_xgb)
    print(f"Confusion Matrix (Desired Ratio {desired_ratio}) with XGBoost:")
    print(conf_matrix_xgb)

    class_report_xgb = classification_report(y_test_flat, y_pred_xgb)
    print(f"Classification Report (Desired Ratio {desired_ratio}) with XGBoost:")
    print(class_report_xgb)

    # Hyperparameter tuning
    grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='recall', n_jobs=-1)
    grid_search.fit(X_train_resampled_df, y_train_flat)

    # Best hyperparameters
    best_params = grid_search.best_params_
    print(f"Best Hyperparameters (Desired Ratio {desired_ratio}) with XGBoost:", best_params)

    best_xgb_model = grid_search.best_estimator_

    # Evaluate best model
    y_pred_best_xgb = best_xgb_model.predict(X_test_transformed_df)

    accuracy_best_xgb = accuracy_score(y_test_flat, y_pred_best_xgb)
    print(f"Accuracy (Best Model, Desired Ratio {desired_ratio}) with XGBoost:", accuracy_best_xgb)

    conf_matrix_best_xgb = confusion_matrix(y_test_flat, y_pred_best_xgb)
    print(f"Confusion Matrix (Best Model, Desired Ratio {desired_ratio}) with XGBoost:")
    print(conf_matrix_best_xgb)

    class_report_best_xgb = classification_report(y_test_flat, y_pred_best_xgb)
    print(f"Classification Report (Best Model, Desired Ratio {desired_ratio}) with XGBoost:")
    print(class_report_best_xgb)

Best Hyperparameters (Desired Ratio 0.6) with XGBoost: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 300, 'subsample': 0.8}
Accuracy (Best Model, Desired Ratio 0.6) with XGBoost: 0.7831212061875856
Confusion Matrix (Best Model, Desired Ratio 0.6) with XGBoost:
[[36788  8382]
 [ 2694  3206]]
Classification Report (Best Model, Desired Ratio 0.6) with XGBoost:
              precision    recall  f1-score   support

           0       0.93      0.81      0.87     45170
           1       0.28      0.54      0.37      5900

    accuracy                           0.78     51070
   macro avg       0.60      0.68      0.62     51070
weighted avg       0.86      0.78      0.81     51070
Accuracy (Desired Ratio 0.65) with XGBoost: 0.8073232817701195
Confusion Matrix (Desired Ratio 0.65) with XGBoost:
[[38531  6639]
 [ 3201  2699]]
Classification Report (Desired Ratio 0.65) with XGBoost:
              precision    recall  f1-score   support

          



Best Hyperparameters (Desired Ratio 0.65) with XGBoost: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 300, 'subsample': 0.8}
Accuracy (Best Model, Desired Ratio 0.65) with XGBoost: 0.8120422948893675
Confusion Matrix (Best Model, Desired Ratio 0.65) with XGBoost:
[[38768  6402]
 [ 3197  2703]]
Classification Report (Best Model, Desired Ratio 0.65) with XGBoost:
              precision    recall  f1-score   support

           0       0.92      0.86      0.89     45170
           1       0.30      0.46      0.36      5900

    accuracy                           0.81     51070
   macro avg       0.61      0.66      0.63     51070
weighted avg       0.85      0.81      0.83     51070
Accuracy (Desired Ratio 0.5) with XGBoost: 0.6745251615429803
Confusion Matrix (Desired Ratio 0.5) with XGBoost:
[[30455 14715]
 [ 1907  3993]]
Classification Report (Desired Ratio 0.5) with XGBoost:
              precision    recall  f1-score   support

         



Best Hyperparameters (Desired Ratio 0.5) with XGBoost: {'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Accuracy (Best Model, Desired Ratio 0.5) with XGBoost: 0.6862933228901508
Confusion Matrix (Best Model, Desired Ratio 0.5) with XGBoost:
[[30941 14229]
 [ 1792  4108]]
Classification Report (Best Model, Desired Ratio 0.5) with XGBoost:
              precision    recall  f1-score   support

           0       0.95      0.68      0.79     45170
           1       0.22      0.70      0.34      5900

    accuracy                           0.69     51070
   macro avg       0.58      0.69      0.57     51070
weighted avg       0.86      0.69      0.74     51070
Accuracy (Desired Ratio 0.55) with XGBoost: 0.7249853142745252
Confusion Matrix (Desired Ratio 0.55) with XGBoost:
[[33471 11699]
 [ 2346  3554]]
Classification Report (Desired Ratio 0.55) with XGBoost:
              precision    recall  f1-score   support

          



Best Hyperparameters (Desired Ratio 0.55) with XGBoost: {'colsample_bytree': 1.0, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Accuracy (Best Model, Desired Ratio 0.55) with XGBoost: 0.7394948110436655
Confusion Matrix (Best Model, Desired Ratio 0.55) with XGBoost:
[[34087 11083]
 [ 2221  3679]]
Classification Report (Best Model, Desired Ratio 0.55) with XGBoost:
              precision    recall  f1-score   support

           0       0.94      0.75      0.84     45170
           1       0.25      0.62      0.36      5900

    accuracy                           0.74     51070
   macro avg       0.59      0.69      0.60     51070
weighted avg       0.86      0.74      0.78     51070


With a desired ratio of 0.6, tuned XGBoost recall improved to 0.54, but that is still lower than the 0.69 from the lr_model. Only the tuned run at a desired ratio of 0.5, with a recall of 0.70, is on par with it.

I believe the low recall scores are caused by the preprocessed data. I will go back and re-examine this data, then re-run the models. I will discuss this with my mentor during the weekly call.
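
Two quick checks that might help when re-examining the preprocessed data are the class balance of the training labels and whether the train and test features share the same columns in the same order. The sketch below uses tiny made-up frames; the column names `income`, `loan_amount`, and `default` are hypothetical placeholders, and the real files would be loaded with `pd.read_csv` as above.

```python
import pandas as pd

# Made-up stand-ins for the preprocessed train and test files.
X_train = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0, 5.0],
                        "loan_amount": [5.0, 4.0, 3.0, 2.0, 1.0]})
y_train = pd.Series([0, 0, 0, 1, 1], name="default")
X_test = pd.DataFrame({"income": [2.5], "loan_amount": [1.5]})

# Share of each class in the training labels after resampling.
class_share = y_train.value_counts(normalize=True)
print(class_share)

# The model expects test columns to match the training columns exactly.
columns_match = list(X_train.columns) == list(X_test.columns)
print("Train/test columns match:", columns_match)
```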