Loan default prediction with the Berka dataset using an XGBoost model.
The goal is to provide a mechanism for deciding which customers should receive loans, and so help banks increase profits.
We use the Berka dataset, also known as the PKDD'99 Financial dataset, which contains 606 successful and 76 unsuccessful loans along with the related client and transaction information. The data relationships are depicted in the diagram below.
- Imbalanced data (606 negative class, 76 positive class)
- Feature engineering (Creation, Extraction, Transformation)
- We used only information available before the loan was granted (because our goal is to decide whether to issue the loan)
- We tried several models, including LGBM, RandomForest, and XGBoost, which AutoML identified as the best candidates. In the end we used XGBoost with feature selection (based on feature importance) and grid search for hyperparameter tuning, because it gave the best results.
- We used SMOTE to handle imbalanced data
- Profit is calculated using the formula profit = revenue - cost, where revenue is the interest earned by the bank and cost is the money lost to defaults. More information is in `profit_analysis.ipynb`.
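As an illustration of the profit formula above, here is a minimal sketch in plain Python. It assumes that revenue is the loan amount times a flat interest rate for every repaid loan and that cost is the full amount of every defaulted loan. Both are simplifying assumptions, and the function name and sample loans are hypothetical; the actual calculation is in `profit_analysis.ipynb`.

```python
def portfolio_profit(loans, interest_rate):
    """Return revenue - cost for a list of (amount, defaulted) pairs.

    Assumption: a repaid loan earns amount * interest_rate,
    and a defaulted loan loses its full amount.
    """
    revenue = sum(amount * interest_rate
                  for amount, defaulted in loans if not defaulted)
    cost = sum(amount for amount, defaulted in loans if defaulted)
    return revenue - cost


# Hypothetical loans: (amount, defaulted)
example_loans = [(100_000, False), (50_000, False), (20_000, True)]
example_profit = portfolio_profit(example_loans, interest_rate=0.10)
print(example_profit)  # -5000.0
```

In this toy portfolio a single default wipes out the interest earned on two much larger loans, which is why avoiding defaults matters more to profit than raw accuracy.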
- First, preprocess the raw data by running `data_manipulation.ipynb`. The results will be saved in `transformed_data/final_transformed_data.csv`.
- Second, train the model using `model.ipynb`. The results will be in `report/report_xgb.csv`, which contains the true label and the predicted probability for each account.
- Then run `profit_analysis.ipynb` to create `report/ori_profit.csv`, which is the original profit, and the final result `report/report_xgb_threshold_profit.csv`, which is the profit after applying this model at each threshold and each interest rate.
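The last step, profit per threshold and interest rate, can be sketched roughly as follows. This is not the notebook's code. It assumes that the predicted probability is the probability of default, that a loan is rejected when that probability is at or above the threshold, and that a rejected loan contributes neither revenue nor cost. The row layout and all names are hypothetical.

```python
def profit_at(records, threshold, interest_rate):
    """Profit if only loans with P(default) below `threshold` are issued."""
    profit = 0.0
    for amount, defaulted, p_default in records:
        if p_default >= threshold:
            continue  # loan rejected: no revenue and no loss
        # Assumption: a default loses the full amount, a repaid loan earns interest
        profit += -amount if defaulted else amount * interest_rate
    return profit


def sweep(records, thresholds, interest_rates):
    """Map every (threshold, interest_rate) pair to its profit."""
    return {(t, r): profit_at(records, t, r)
            for t in thresholds for r in interest_rates}


# Hypothetical rows: (loan amount, true label with 1 = default, predicted P(default))
records = [
    (100_000, 0, 0.05),
    (50_000, 0, 0.30),
    (20_000, 1, 0.60),
    (80_000, 1, 0.20),
]
# A threshold above 1 approves every loan, which mimics having no model at all
grid = sweep(records, thresholds=[0.1, 0.5, 1.01], interest_rates=[0.05, 0.10])
best = max(grid, key=grid.get)
print(best, grid[best])
```

Comparing each cell of the grid with the approve-everything row gives the profit gained by using the model at that threshold.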
- Initial model performance
model | Acc | F1 | ROC_AUC |
---|---|---|---|
LGBMClassifier | 0.925 | 0.553 | 0.743 |
RandomForestClassifier | 0.924 | 0.544 | 0.764 |
XGBClassifier | 0.923 | 0.596 | 0.738 |
- Performance after applying the best hyperparameters, selected features, and SMOTE
model | Acc | F1 | ROC_AUC |
---|---|---|---|
LGBMClassifier | 0.919 | 0.572 | 0.731 |
RandomForestClassifier | 0.912 | 0.616 | 0.791 |
XGBClassifier | 0.927 | 0.645 | 0.784 |
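To put the accuracy column in context, the sketch below computes accuracy and F1 for a trivial baseline that predicts "no default" for every loan, using the class counts of the full dataset (606 negative, 76 positive). The evaluation split behind the tables is not stated here, so this is only an illustration of why F1 is the more informative metric on imbalanced data; the helper name is hypothetical.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Compute accuracy and F1 of the positive class from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Baseline that never predicts a default: all 76 positives are missed
baseline_acc, baseline_f1 = accuracy_and_f1(tp=0, fp=0, fn=76, tn=606)
print(round(baseline_acc, 3), baseline_f1)  # 0.889 0.0
```

A model that catches no defaults at all still reaches about 0.889 accuracy on these counts, so the gains in F1 and ROC_AUC say more about the models than the accuracy values do.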
We created an interactive dashboard with Power BI to visualize the profit we've made.