To mitigate risk, I built and evaluated several machine-learning models that predict credit risk, using free data from LendingClub. I used the imbalanced-learn and scikit-learn libraries to build and evaluate the models with two techniques: resampling and ensemble learning.
For the resampling approach, I used the imbalanced-learn library to resample the LendingClub data, then built and evaluated logistic regression classifiers on the resampled data.
Refer to: Resampling Notebook
- Which model had the best balanced accuracy score?
  SMOTEENN had the best balanced accuracy score, 0.7975462408998795, versus 0.7752245065690078, 0.7966770207605626, and 0.7856360112968401 for Cluster Centroids, SMOTE, and Random Oversampler, respectively.
- Which model had the best recall score?
  SMOTE had the best recall score: 0.88.
- Which model had the best geometric mean score?
  SMOTEENN had the best geometric mean score: 0.79.
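For readers unfamiliar with the three metrics, here is a small sketch of how they relate to a binary confusion matrix. The counts passed in are hypothetical, chosen only to make the arithmetic easy to follow, and the helper name `binary_scores` is mine. The notebook may average recall across classes differently.

```python
import math


def binary_scores(tp, fn, tn, fp):
    """Compute recall, balanced accuracy, and geometric mean for two classes."""
    recall = tp / (tp + fn)        # share of positives correctly flagged
    specificity = tn / (tn + fp)   # share of negatives correctly cleared
    return {
        "recall": recall,
        # Balanced accuracy is the mean of the per-class recalls.
        "balanced_accuracy": (recall + specificity) / 2,
        # Geometric mean penalizes a model that neglects either class.
        "geometric_mean": math.sqrt(recall * specificity),
    }


# Hypothetical confusion-matrix counts (not taken from the notebooks).
scores = binary_scores(tp=80, fn=20, tn=900, fp=100)
for metric, value in scores.items():
    print(f"{metric}: {value:.4f}")
```

Because both balanced accuracy and the geometric mean combine the per-class rates, a model can score well on overall accuracy yet poorly on these when one class dominates.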
For the ensemble method, I trained, evaluated, and compared two ensemble classifiers that predict loan risk: the Balanced Random Forest Classifier and the Easy Ensemble Classifier. Both models used 100 estimators (`n_estimators=100`).
Refer to: Ensemble Notebook
- Which model had the best balanced accuracy score?
  Easy Ensemble Classifier had the best balanced accuracy score: 0.931601605553446 versus 0.7855345052746622 for Balanced Random Forest Classifier.
- Which model had the best recall score?
  Easy Ensemble Classifier had the best recall score: 0.94 versus 0.90 for Balanced Random Forest Classifier.
- Which model had the best geometric mean score?
  Easy Ensemble Classifier had the best geometric mean score: 0.93 versus 0.78 for Balanced Random Forest Classifier.
- What are the top three features?
  The top three features, with their importances, are:
  (0.09175752102205247, 'total_rec_prncp'), (0.06410003199501778, 'total_pymnt_inv'), (0.05764917485461809, 'total_pymnt')
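A ranking of `(importance, name)` pairs like the one above can be produced by sorting importances alongside column names. The sketch below is one plausible way to do it, not necessarily the notebook's. The helper name `top_features` is hypothetical, and it assumes a fitted model that exposes `feature_importances_`, which in practice would be passed together with the training columns, e.g. `top_features(model.feature_importances_, X.columns)`.

```python
def top_features(importances, names, k=3):
    """Return the k largest (importance, name) pairs, highest first."""
    return sorted(zip(importances, names), reverse=True)[:k]


# The three values reported above, deliberately shuffled to show the sort.
importances = [0.05764917485461809, 0.09175752102205247, 0.06410003199501778]
names = ["total_pymnt", "total_rec_prncp", "total_pymnt_inv"]

ranked = top_features(importances, names)
for importance, name in ranked:
    print(f"{name}: {importance:.4f}")
```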