sorayutmild/loan-default-prediction

loan-default-prediction

Loan default prediction on the Berka dataset using an XGBoost model.

banker cat

Goal

To provide a mechanism for deciding which customers should receive loans, and thereby help banks increase their profit.

Dataset

We use the Berka dataset, also known as the PKDD'99 Financial dataset, which contains 606 successful and 76 unsuccessful loans along with the borrowers' personal and transaction information. The relationships between the tables are shown in the diagram below.

ER diagram of dataset

Challenges

  • Imbalanced data (606 negative class, 76 positive class)
  • Feature engineering (Creation, Extraction, Transformation)
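
To make the imbalance concrete, the following sketch computes the metrics of a trivial baseline that predicts "no default" for every loan, using the class counts above. It is only an illustration over the full dataset counts, not over whatever evaluation split the notebooks use; the function name `baseline_metrics` is hypothetical.

```python
# Class counts taken from the dataset description above.
n_negative = 606  # successful loans (negative class)
n_positive = 76   # unsuccessful loans (positive class)


def baseline_metrics(n_neg, n_pos):
    """Accuracy and F1 of a classifier that always predicts the negative class."""
    tp, fp = 0, 0          # nothing is ever predicted positive
    tn, fn = n_neg, n_pos  # every negative is right, every positive is missed
    accuracy = (tp + tn) / (n_neg + n_pos)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


accuracy, f1 = baseline_metrics(n_negative, n_positive)
print(f"accuracy={accuracy:.3f}, F1={f1:.3f}")  # accuracy=0.889, F1=0.000
```

A model that never flags a default already reaches about 89% accuracy, so accuracy alone says little here; F1 and ROC AUC are more informative for the positive class.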

Experiments

  • We used only information available before the loan was granted, because the goal is to decide whether to issue the loan.
  • We tried several models, including LGBM, RandomForest, and XGBoost, which AutoML identified as the best candidates. In the end we used XGBoost with feature selection (based on feature importance) and grid search for hyperparameter tuning, because it gave the best results.
  • We used SMOTE to handle the imbalanced data.
  • Profit is calculated as profit = revenue - cost, where revenue is the interest earned by the bank and cost is the money lost to defaults. More details are in profit_analysis.ipynb.
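
The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the segment between an existing minority sample and one of its nearest minority neighbours. The code below is a simplified, dependency-free illustration of that idea, not the implementation used in model.ipynb (which more likely relies on a library such as imbalanced-learn). The function name and the sample values are made up for the example.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Simplified illustration of the SMOTE idea, not a library implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-feature minority samples (not taken from the Berka data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_like_oversample(minority, n_new=5)
print(new_points)
```

Oversampling of this kind is normally applied to the training data only, so that the evaluation data keeps the original class ratio.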

How to run

  • First, preprocess the raw data by running data_manipulation.ipynb. The results are saved to transformed_data/final_transformed_data.csv.
  • Second, train the model using model.ipynb. The results are saved to report/report_xgb.csv, which contains the true label and the predicted probability for each account.
  • Finally, run profit_analysis.ipynb to create report/ori_profit.csv, which holds the original profit, and the final result report/report_xgb_threshold_profit.csv, which holds the profit obtained with the model at each threshold and each interest rate.
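
The threshold and interest-rate sweep done in the last step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in profit_analysis.ipynb: a loan is approved when its predicted default probability is below the threshold, a repaid loan earns `amount * interest_rate`, and a defaulted loan loses its full amount. The function `profit_at` and the loan records are hypothetical.

```python
def profit_at(loans, threshold, interest_rate):
    """profit = revenue - cost over the loans approved at this threshold."""
    revenue = 0.0
    cost = 0.0
    for amount, defaulted, p_default in loans:
        if p_default >= threshold:
            continue  # rejected: the loan contributes neither revenue nor cost
        if defaulted:
            cost += amount  # assumption: the whole amount is lost
        else:
            revenue += amount * interest_rate  # assumption: simple interest
    return revenue - cost


# Hypothetical records: (loan amount, true label 1 = default, predicted probability).
loans = [
    (10_000, 0, 0.05),
    (20_000, 0, 0.30),
    (15_000, 1, 0.60),
    (5_000, 1, 0.70),
]

# A threshold above 1 approves every loan, i.e. the no-model baseline.
results = {
    (rate, threshold): profit_at(loans, threshold, rate)
    for rate in (0.05, 0.10)
    for threshold in (0.25, 0.50, 1.01)
}
for (rate, threshold), profit in results.items():
    print(f"rate={rate:.2f} threshold={threshold:.2f} profit={profit:.0f}")
```

Comparing each row with the approve-everything baseline at the same interest rate shows how much profit a given threshold would add or remove under these assumptions.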

Results

Performance results

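  • Initial model performance

| Model                  | Acc   | F1    | ROC_AUC |
|------------------------|-------|-------|---------|
| LGBMClassifier         | 0.925 | 0.553 | 0.743   |
| RandomForestClassifier | 0.924 | 0.544 | 0.764   |
| XGBClassifier          | 0.923 | 0.596 | 0.738   |

  • Performance after using the best parameters, best features, and SMOTE

| Model                  | Acc   | F1    | ROC_AUC |
|------------------------|-------|-------|---------|
| LGBMClassifier         | 0.919 | 0.572 | 0.731   |
| RandomForestClassifier | 0.912 | 0.616 | 0.791   |
| XGBClassifier          | 0.927 | 0.645 | 0.784   |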

Power BI Visualization

We created an interactive dashboard with Power BI to visualize the profit we've made.

Dashboard

Links

Power BI dashboard
Slide presentation
