# Overview

## Main Purpose

We would like to build a model that predicts whether an applicant will default on a loan.

In other words, the model will automatically approve or reject a loan request.

## About the data set

- The data is taken from Kaggle.com
- About 6000 entries with 13 columns (the target plus 12 features)
    - bad: whether the applicant defaulted on the loan (target variable)
    - loan: amount of the loan
    - mortdue: amount due on the existing mortgage
    - value: value of the property (house)
    - reason: reason for the loan
    - job: applicant's profession
    - yoj: years at the current job
    - **derog: number of derogatory (negative) reports**
    - **delinq: number of delinquent credit lines**
    - clage: age of the oldest credit line, in months
    - ninq: number of recent credit inquiries
    - clno: total number of credit lines
    - debtinc: debt-to-income ratio

- Many missing values in important features
- The target variable is imbalanced
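
Both points can be checked by counting missing values per column and looking at the class shares of the target. The sketch below runs on a small made-up frame that only borrows the column names of the data set; the values are illustrative and not taken from hmeq.csv, and `summarize_missing` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def summarize_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "share_missing": counts / len(frame),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "BAD": [1, 0, 0, 0, 0, 1],
    "LOAN": [1100, 1300, 1500, 1700, 2000, 2100],
    "DEROG": [0.0, np.nan, 0.0, np.nan, 1.0, np.nan],
    "DELINQ": [0.0, 2.0, np.nan, 0.0, 0.0, 1.0],
})

missing = summarize_missing(toy)
class_share = toy["BAD"].value_counts(normalize=True)

print(missing)
print(class_share)
```

On the real data, the same two calls would be applied to `df` after it is loaded.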


## Key Takeaway

Some features have a lot of missing values.

The question is whether we should drop them or fill them in.

**DEROG** and **DELINQ** have many missing values, yet they seem very important to the decision.

If you were a banker, you would of course want to know your client's credit history. 

**We are going to build another (sub)model to predict and fill in the missing values of these features.**
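
One way the sub-model idea could look is sketched below. It is not necessarily the approach used later in this notebook: the choice of scikit-learn's `RandomForestClassifier`, treating the report count as a class label, the median fill for predictor gaps, and the helper name `impute_with_submodel` are all assumptions made for illustration. The toy data is random and only reuses column names from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def impute_with_submodel(frame, target, predictors, random_state=0):
    """Fill missing values of `target` with predictions from a sub-model.

    The sub-model is trained on rows where `target` is observed and then
    applied to rows where it is missing. Observed values are left untouched.
    """
    filled = frame.copy()
    known = filled[target].notna()
    if known.all() or not known.any():
        return filled  # nothing to fill, or nothing to learn from

    # Predictors may have gaps too; a median fill keeps the sketch simple.
    X = filled[predictors].fillna(filled[predictors].median())

    # Assumption: the small integer counts are treated as class labels.
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    model.fit(X[known], filled.loc[known, target].astype(int))
    filled.loc[~known, target] = model.predict(X[~known])
    return filled


# Random toy data, for illustration only.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "LOAN": rng.integers(1000, 90000, size=n).astype(float),
    "CLNO": rng.integers(0, 40, size=n).astype(float),
    "DEROG": rng.integers(0, 3, size=n).astype(float),
})
toy.loc[rng.choice(n, size=40, replace=False), "DEROG"] = np.nan

filled = impute_with_submodel(toy, target="DEROG", predictors=["LOAN", "CLNO"])
print(filled["DEROG"].isna().sum())
```

Because the sub-model only writes to rows where the target was missing, the observed credit history stays exactly as reported.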

## Steps

1. Information and description of the data
2. **Handle the missing values** 
3. Undersampling -- accounting for the imbalanced dependent variable
4. Dropping variables  
    - correlation
    - feature selection -- which method would work best? why, and why not the others?
    - feature extraction (?) -- would PCA be needed? why or why not?
5. Modeling
6. Evaluation
7. Conclusion
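
Steps 3 and 4 can be sketched with plain pandas, as below. This is a minimal illustration under stated assumptions, not the final implementation: the helper names `undersample` and `highly_correlated`, the random undersampling strategy, and the 0.9 correlation threshold are all choices made here for the example, and the toy data is random.

```python
import numpy as np
import pandas as pd


def undersample(frame, target, random_state=0):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = frame[target].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(target)
    ]
    # Shuffle so the classes are not stacked one after the other.
    return pd.concat(parts).sample(frac=1, random_state=random_state)


def highly_correlated(frame, threshold=0.9):
    """List numeric feature pairs whose absolute correlation exceeds threshold."""
    corr = frame.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold]


# Random toy data: 80 good loans, 20 bad loans, VALUE tied closely to MORTDUE.
rng = np.random.default_rng(0)
mortdue = rng.uniform(10_000, 200_000, size=100)
toy = pd.DataFrame({
    "BAD": [0] * 80 + [1] * 20,
    "MORTDUE": mortdue,
    "VALUE": mortdue * 1.3 + rng.normal(0, 1_000, size=100),
    "YOJ": rng.uniform(0, 30, size=100),
})

balanced = undersample(toy, target="BAD")
pairs = highly_correlated(toy.drop(columns="BAD"))

print(balanced["BAD"].value_counts())
print(pairs)
```

A pair flagged by `highly_correlated` is only a candidate for dropping; which of the two features to keep is still a judgment call.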

## Conclusion

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('hmeq.csv')

In [3]:
df

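| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | NaN |
| 3 | 1 | 1500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5955 | 0 | 88900 | 57264.0 | 90185.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 221.808718 | 0.0 | 16.0 | 36.112347 |
| 5956 | 0 | 89000 | 54576.0 | 92937.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | 208.692070 | 0.0 | 15.0 | 35.859971 |
| 5957 | 0 | 89200 | 54045.0 | 92924.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | 212.279697 | 0.0 | 15.0 | 35.556590 |
| 5958 | 0 | 89800 | 50370.0 | 91861.0 | DebtCon | Other | 14.0 | 0.0 | 0.0 | 213.892709 | 0.0 | 16.0 | 34.340882 |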
