Using data collected at the start of the loan, I predict whether a loan will default. At 75% recall, my model cut the false positive rate almost in half compared with a strict credit score cutoff. This project is published in Towards Data Science on Medium: Link.
To predict whether a loan is good or bad using data collected when the loan is originated.
Single Family Loan-Level dataset downloaded from Freddie Mac's website
Year range: 1999-2003. Only completed/terminated loans are used. A good loan is one that has been fully paid off; a bad loan is one that was terminated for any other reason. The raw data is stored in a SQLite database.
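A minimal sketch of how the labeling rule above might be applied when reading from SQLite. The table name `loans`, the columns `loan_id`, `credit_score`, and `termination_reason`, and the values `'paid_off'` and `'foreclosure'` are all hypothetical placeholders; the real Freddie Mac field names and codes would need to be substituted.

```python
import sqlite3


def load_labeled_loans(conn):
    """Return (loan_id, credit_score, is_bad) rows for terminated loans only."""
    return conn.execute(
        """
        SELECT loan_id,
               credit_score,
               CASE WHEN termination_reason = 'paid_off' THEN 0 ELSE 1 END AS is_bad
        FROM loans
        WHERE termination_reason IS NOT NULL  -- skip loans that are still active
        ORDER BY loan_id
        """
    ).fetchall()


# Tiny in-memory database standing in for the real one (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans (loan_id TEXT, credit_score INTEGER, termination_reason TEXT)"
)
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [("A1", 720, "paid_off"), ("A2", 610, "foreclosure"), ("A3", 700, None)],
)

labeled = load_labeled_loans(conn)
print(labeled)
```

Loan `A3` is excluded because it has not terminated yet, so its outcome is unknown.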
- map letter values to numeric values in true/false columns
- clear NaN fields
- label encode categorical fields
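The steps above can be sketched with pandas as follows. The column names, the `"Y"`/`"N"` flag encoding, and the choice to drop rows with missing values (rather than impute them) are assumptions made for illustration, not necessarily what the project does.

```python
import pandas as pd


def preprocess(df, flag_cols, cat_cols):
    """Apply the three cleaning steps to a copy of df."""
    out = df.copy()
    # 1. Map letter flags to numbers (assumed encoding: "Y" -> 1, "N" -> 0).
    for col in flag_cols:
        out[col] = out[col].map({"Y": 1, "N": 0})
    # 2. Clear NaN fields; this sketch simply drops incomplete rows.
    out = out.dropna().reset_index(drop=True)
    out[flag_cols] = out[flag_cols].astype(int)
    # 3. Label-encode categorical fields as integer codes.
    for col in cat_cols:
        out[col] = out[col].astype("category").cat.codes
    return out


# Hypothetical sample rows.
raw = pd.DataFrame(
    {
        "first_time_buyer": ["Y", "N", "N", None],
        "property_state": ["CA", "TX", "CA", "NY"],
        "credit_score": [710, 655, None, 690],
    }
)
clean = preprocess(raw, flag_cols=["first_time_buyer"], cat_cols=["property_state"])
print(clean)
```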
Distribution of Credit Score by Loan Outcome
Subsample majority class (good loans) to match minority class (bad loans).
Resample minority class (bad loans) to match majority class (good loans).
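Both balancing strategies can be sketched with NumPy. This assumes the label convention `1` = bad loan (minority) and `0` = good loan (majority), and that resampling the minority class means drawing rows with replacement; the function names are made up for this example.

```python
import numpy as np


def undersample_majority(X, y, rng):
    """Randomly keep only as many good loans as there are bad loans."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    keep = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    idx = np.concatenate([minority_idx, keep])
    return X[idx], y[idx]


def oversample_minority(X, y, rng):
    """Draw bad loans with replacement until they match the good loans."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    drawn = rng.choice(minority_idx, size=len(majority_idx), replace=True)
    idx = np.concatenate([majority_idx, drawn])
    return X[idx], y[idx]


rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)   # 10 toy loans, 2 features each
y = np.array([0] * 8 + [1] * 2)    # 8 good loans, 2 bad loans

X_under, y_under = undersample_majority(X, y, rng)
X_over, y_over = oversample_minority(X, y, rng)
print(np.bincount(y_under), np.bincount(y_over))
```

Either function would normally be applied to the training split only, so that the evaluation data keeps the original class ratio.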
- Gradient Boosting Classifier
- Random Forest Classifier
- SGD Classifier (Logistic Classifier)
- Hard Voting Classifier of all of the above
- Light GBM
Can we solve this problem with an anomaly detection algorithm?
- Balanced Bagging Classifier
- Balanced Random Forest Classifier
- Easy Ensemble Classifier
- Voting Classifier
- Light GBM
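The first three models in this list share one idea: train each ensemble member on a class-balanced sample instead of balancing the whole dataset once. The sketch below hand-rolls that idea with decision trees to show the mechanism. It is a simplified approximation for illustration, not the imbalanced-learn implementation, and the function names and synthetic data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_balanced_bagging(X, y, n_estimators=10, seed=0):
    """Fit one tree per balanced bootstrap sample (equal rows per class)."""
    rng = np.random.default_rng(seed)
    bad = np.flatnonzero(y == 1)
    good = np.flatnonzero(y == 0)
    n = min(len(bad), len(good))
    models = []
    for _ in range(n_estimators):
        # Draw n rows from each class, with replacement.
        idx = np.concatenate(
            [rng.choice(bad, n, replace=True), rng.choice(good, n, replace=True)]
        )
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models


def predict_balanced_bagging(models, X):
    """Majority vote across the fitted trees."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)


# Synthetic imbalanced data: label 1 is the rare class.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 1.2).astype(int)

models = fit_balanced_bagging(X, y)
preds = predict_balanced_bagging(models, X)
print(preds.sum(), "of", len(preds), "rows flagged as bad")
```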
How does it fare compared to the status quo?
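One way to make this comparison concrete is to fix recall on bad loans (for example at 75%) and compare the false positive rate of the model against a plain credit score cutoff. The sketch below shows the calculation on made-up toy numbers; they are not the project's results, and `fpr_at_recall` is a helper name invented for this example.

```python
import numpy as np


def fpr_at_recall(y_true, risk_score, target_recall=0.75):
    """FPR at the highest threshold that flags at least target_recall of bad loans.

    Higher risk_score means "more likely bad". Returns (fpr, threshold).
    """
    y_true = np.asarray(y_true)
    risk_score = np.asarray(risk_score, dtype=float)
    bad_scores = np.sort(risk_score[y_true == 1])[::-1]  # descending
    k = int(np.ceil(target_recall * len(bad_scores)))    # bad loans to catch
    threshold = bad_scores[k - 1]
    flagged = risk_score >= threshold
    fpr = flagged[y_true == 0].mean()  # share of good loans wrongly flagged
    return fpr, threshold


# Toy data: 4 bad loans (1) followed by 8 good loans (0).
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
model_risk = np.array(
    [0.9, 0.8, 0.7, 0.2, 0.75, 0.3, 0.1, 0.1, 0.2, 0.05, 0.4, 0.15]
)
credit_score = np.array(
    [600, 620, 640, 720, 630, 635, 640, 700, 710, 690, 740, 760]
)

model_fpr, _ = fpr_at_recall(y, model_risk)
# A low credit score means high risk, so negate it to reuse the same helper.
baseline_fpr, _ = fpr_at_recall(y, -credit_score)
print(model_fpr, baseline_fpr)
```

Negating the credit score turns "reject everything below a cutoff" into the same thresholding rule used for the model, so both approaches are measured at the same recall.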