# Customer Default Rate Analysis and Prediction: 
# Part 2, Build Default Payment Prediction Model

## Background
### Problem

Over the past year or so, Credit One has seen an increase in the number of customers who have defaulted on loans 
they have secured from various partners. As the credit scoring service for those partners, Credit One risks losing business 
if the problem is not solved right away. They have enlisted the help of our Data Science team to design and 
implement a creative, empirically sound solution. 

### Goals

* Examine current customer demographics.
* Understand what traits relate to customers’ current credit obligations.
* Identify which attributes relate significantly to customer default rates.
  * Here is the [Exploratory Data Analysis](https://github.com/snowlee26/Portfolio-/blob/master/Formal%20EDA%20.ipynb) for the first three goals. 
* Build a predictive model to better classify potential “at risk” customers.

### Dataset Information

* This dataset covers customers' default payments in Taiwan.
* It contains records for 30,000 customers. 
* Attributes include:
     * Amount of the given credit, Gender, Education
     * Marital Status, Age, History of past payments
     * Amount of bill statement, Amount of previous payment 



## Model Building and Evaluating Process

### Pre-processing and Feature Engineering
[Exploratory Data Analysis](https://github.com/snowlee26/Portfolio-/blob/master/Formal%20EDA%20.ipynb)


In [5]:
# import numpy, pandas, scipy, math, matplotlib
import numpy as np
import pandas as pd
import scipy 
from math import sqrt
import matplotlib.pyplot as plt

* The dataset loaded here was pre-processed in the previous EDA section.
* We used rawData.info() to confirm that all the attributes are numeric. 
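The numeric-type check described above can also be done programmatically. The following is a minimal sketch on a made-up toy frame; the helper name `non_numeric_columns` and the `NOTE` column are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd


def non_numeric_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is not numeric."""
    return df.select_dtypes(exclude="number").columns.tolist()


# Toy frame standing in for the credit data (values are made up).
toy = pd.DataFrame(
    {"LIMIT_BAL": [20000, 120000], "SEX": [2, 2], "NOTE": ["a", "b"]}
)
bad_columns = non_numeric_columns(toy)
print(bad_columns)  # ['NOTE']
```

An empty list would indicate that every column can be passed to the classifiers without further encoding.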

In [6]:
rawData = pd.read_csv('new_credit.csv')
rawData.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,0,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,0,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,0,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,1,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,2,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [7]:
rawData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
LIMIT_BAL                     30000 non-null int64
SEX                           30000 non-null int64
EDUCATION                     30000 non-null int64
MARRIAGE                      30000 non-null int64
AGE                           30000 non-null int64
PAY_0                         30000 non-null int64
PAY_2                         30000 non-null int64
PAY_3                         30000 non-null int64
PAY_4                         30000 non-null int64
PAY_5                         30000 non-null int64
PAY_6                         30000 non-null int64
BILL_AMT1                     30000 non-null int64
BILL_AMT2                     30000 non-null int64
BILL_AMT3                     30000 non-null int64
BILL_AMT4                     30000 non-null int64
BILL_AMT5                     30000 non-null int64
BILL_AMT6                     30000 non-null int64
PAY_AMT1                

### Building models 
**We were asked to predict whether a customer will default on next month's payment. The result is either 
Yes or No, so this is a classification problem, and we will use classification algorithms to build the models.**

* All the features besides "default payment next month" were selected as independent variables.
* "default payment next month" was selected as the dependent variable, the one we need to predict.
* We used the train_test_split() function to randomly split the data into training and testing sets.
* After splitting the data, we built four models with four different classifiers:
  * Logistic Regression
  * Random Forest
  * Gradient Boosting
  * Support Vector Machine (SVM)

In [8]:
# Specify features
features = rawData.iloc[:,0:23]
print('Summary of features')
features.head()

Summary of features


Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6
0,20000,2,2,1,0,2,2,-1,-1,-2,...,689,0,0,0,0,689,0,0,0,0
1,120000,2,2,2,0,-1,2,0,0,0,...,2682,3272,3455,3261,0,1000,1000,1000,0,2000
2,90000,2,2,2,0,0,0,0,0,0,...,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000
3,50000,2,2,1,1,0,0,0,0,0,...,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000
4,50000,1,2,1,2,-1,0,-1,0,0,...,35835,20940,19146,19131,2000,36681,10000,9000,689,679


In [9]:
# Specify dependent variable
depVar = rawData['default payment next month']

In [10]:
from sklearn.model_selection import train_test_split

# establish train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, depVar, test_size=0.25)

In [11]:
# use the shape function to double check that the split was made as needed:
X_train.shape, X_test.shape

((22500, 23), (7500, 23))
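Because the split above is random, the exact rows in each set change from run to run. One option, shown here as a sketch on synthetic data rather than as the notebook's actual method, is to fix `random_state` and pass `stratify` so both sets keep the same share of defaults. The 22% positive rate mirrors the test-set support reported later in this notebook; `X_demo` and `y_demo` are made-up stand-ins.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data: 1000 rows, about 22% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0] * 780 + [1] * 220)

# random_state makes the split repeatable; stratify keeps class shares equal.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape)  # (750, 5) (250, 5)
print(y_tr.mean(), y_te.mean())  # both close to 0.22
```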

#### Logistic Regression

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
# name the classifier
modelLR = LogisticRegression()

In [39]:
# fit the model and time it
import time

t0 = time.time()
modelLR.fit(X_train, y_train)
t1 = time.time()

total = t1-t0
total


0.16478800773620605

In [15]:
# check the accuracy on the training set
modelLR.score(X_train, y_train)

0.7783555555555556

#### Random Forest

In [16]:
from sklearn.ensemble import RandomForestClassifier

In [17]:
# name the classifier
modelRF = RandomForestClassifier(n_estimators=100)

In [18]:
# fit the model and time it
t0 = time.time()
modelRF.fit(X_train, y_train)
t1 = time.time()

total = t1-t0
total

6.912303924560547

In [19]:
# check the accuracy on the training set
modelRF.score(X_train, y_train)

0.9968444444444444

#### Gradient Boosting

In [33]:
from sklearn.ensemble import GradientBoostingClassifier

In [34]:
# name the classifier
GB = GradientBoostingClassifier()

In [35]:
# fit the model and time it
t0 = time.time()
modelGB = GB.fit(X_train, y_train)
t1 = time.time()

total = t1-t0
total

3.8301010131835938

In [107]:
# check the accuracy on the training set
modelGB.score(X_train, y_train)

0.8268888888888889

#### SVM

In [20]:
from sklearn.svm import SVC

In [21]:
# name the classifier
modelSVM = SVC()

In [40]:
# fit the model and time it
t0 = time.time()
modelSVM.fit(X_train, y_train)
t1 = time.time()

total = t1-t0
total

85.1027672290802

In [84]:
# check the accuracy on the training set
modelSVM.score(X_train, y_train)

0.9904444444444445

**Both the Random Forest and SVM models score above 0.99 on the training data, which strongly suggests that they are overfitting.
We are going to use cross-validation to get a more realistic estimate of their performance.**
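A common way to make this kind of overfitting visible is to compare the score on the training data with the score on data the model has not seen. The sketch below uses synthetic data from `make_classification` (an assumption, not the credit dataset), with some label noise added so that a perfect fit is not achievable on new data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with 20% randomly flipped labels (made up for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)  # accuracy on data seen during fitting
test_acc = forest.score(X_te, y_te)   # accuracy on held-out data
print(f"train={train_acc:.3f}  held-out={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

A large gap between the two numbers is the usual symptom of a model that has memorised the training set.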

### Use Cross Validation

Cross-validation is a resampling procedure. It splits the data into k groups of roughly equal size, called folds. 
In each round, one fold is held out for validation and the model is trained on the remaining k-1 folds, so 
every fold serves as the validation set exactly once. The k scores are then averaged to give the final result.
Because each score is computed on data the model did not see during training, cross-validation exposes overfitting 
that the training accuracy hides. 
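To make the procedure concrete, here is a minimal sketch of k-fold cross-validation written out by hand: in each round the model is trained on k-1 folds and scored on the held-out fold. The helper `manual_kfold_scores` and the synthetic data are assumptions for illustration; in practice `cross_val_score` does the same work in one call.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)


def manual_kfold_scores(model, X, y, k=3):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        fold_model = clone(model)  # fresh, unfitted copy for each round
        fold_model.fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[val_idx], y[val_idx]))
    return np.array(scores)


model = LogisticRegression(max_iter=1000)
manual_scores = manual_kfold_scores(model, X_demo, y_demo, k=3)
library_scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=3))
print(manual_scores, manual_scores.mean())
```

Note that passing an integer such as `cv=3` for a classifier makes scikit-learn use stratified folds, so the sketch passes an explicit `KFold` to keep the two computations comparable.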

In [1]:
from sklearn.model_selection import cross_val_score

In [41]:
# use 3-fold cross-validation
cross_val_score(LogisticRegression(random_state=0), X_train, y_train, cv=3)

array([0.7784    , 0.7784    , 0.77826667])

In [23]:
cross_val_score(RandomForestClassifier(n_estimators=25, random_state=0), X_train, y_train, cv=3)

array([0.81453333, 0.81186667, 0.81373333])

In [91]:
GB_cv = GradientBoostingClassifier(n_estimators=25, random_state=0)
scores = cross_val_score(GB_cv, X_train, y_train, cv=3)
scores

array([0.82442341, 0.82426667, 0.81490865])

In [42]:
cross_val_score(SVC(random_state=0), X_train, y_train, cv=3)

array([0.78053333, 0.77933333, 0.77813333])

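* The cross-validated accuracies for Random Forest and SVM drop to about 0.81 and 0.78, which confirms that their training scores above 0.99 were inflated by overfitting. 
* The model with the highest cross-validated accuracy is Gradient Boosting. 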

### Test and Evaluate the Models

* For Logistic Regression and Gradient Boosting, we will evaluate the models built from train_test_split() on the testing
dataset, since their accuracies are almost the same as the cross-validated ones, and at the same time they
took much less time to run. 
* For Random Forest and SVM, we will use cross-validated predictions, since the models from train_test_split()
are overfitting. 
* For evaluation, we will use the confusion matrix to calculate the accuracy, and we will also print a classification report
for each model that includes precision, recall, F1 score and support. 
  * Precision – Accuracy of positive predictions: among all the positive predictions, how many are truly positive.
  * Recall – Fraction of positives that were correctly identified: among all the actual positive instances, how many were predicted correctly. 
  * F1 score – The harmonic mean of precision and recall. The best score is 1 and the worst is 0. 
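As a worked example of these definitions, the sketch below recomputes precision, recall and F1 for the default class from the Gradient Boosting confusion matrix reported in this notebook (`[[5554, 296], [1048, 602]]`, rows are actual classes and columns are predicted classes). The helper name `precision_recall_f1` is an assumption for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts read off the Gradient Boosting confusion matrix in this notebook.
tn, fp, fn, tp = 5554, 296, 1048, 602

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.36 0.47
```

These values agree with the class-1 row of the Gradient Boosting classification report.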

**Logistic Regression**

In [26]:
from sklearn.metrics import confusion_matrix
predictions_LR = modelLR.predict(X_test)
confusion_matrix(y_test, predictions_LR)

array([[5850,    0],
       [1650,    0]])

accuracy = 0.780

In [44]:
from sklearn.metrics import classification_report
reportLR = classification_report(y_test, predictions_LR)
print(reportLR)

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      5850
           1       0.00      0.00      0.00      1650

   micro avg       0.78      0.78      0.78      7500
   macro avg       0.39      0.50      0.44      7500
weighted avg       0.61      0.78      0.68      7500
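The confusion matrix above shows that this model predicts class 0 for every test customer, so its accuracy equals the share of non-defaulters. A majority-class baseline makes that explicit. The sketch below rebuilds labels with the same class counts as the test set (5850 and 1650) and uses placeholder features, since `DummyClassifier` ignores them; it is an illustration, not a step from the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as the test set in this notebook.
y_counts = np.array([0] * 5850 + [1] * 1650)
# Placeholder features: the most_frequent strategy never looks at them.
X_placeholder = np.zeros((len(y_counts), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_counts)
baseline_acc = baseline.score(X_placeholder, y_counts)
print(round(baseline_acc, 2))  # 0.78
```

Any model whose accuracy does not clearly exceed this baseline adds little for detecting defaults, which is why the per-class recall and F1 score matter here.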



**Random Forest**

In [30]:
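from sklearn.model_selection import cross_val_predict
RF_cv = RandomForestClassifier(n_estimators=25, random_state=0)
modelRF_cv = RF_cv.fit(X_train, y_train)
# note: cross_val_predict clones the estimator and refits it on folds of the data it is given,
# so these predictions come from models trained on parts of the test split, not from the fit on X_train
predictions_RF_cv = cross_val_predict(modelRF_cv, X_test, y_test, cv=3)
confusion_matrix(y_test, predictions_RF_cv)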

array([[5451,  399],
       [1031,  619]])

accuracy = 0.809

In [46]:
reportRF = classification_report(y_test, predictions_RF_cv)
print(reportRF)

              precision    recall  f1-score   support

           0       0.84      0.93      0.88      5850
           1       0.61      0.38      0.46      1650

   micro avg       0.81      0.81      0.81      7500
   macro avg       0.72      0.65      0.67      7500
weighted avg       0.79      0.81      0.79      7500
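An alternative way to evaluate on the test set is to fit the model once on the training data and call `predict` on the held-out rows, so the test data plays no part in training. The following is a sketch on synthetic, imbalanced data (an assumption standing in for the credit dataset), not a rerun of the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 78% / 22% (made up for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, weights=[0.78, 0.22], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Fit on the training split only, then predict the untouched test split.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_tr, y_tr)
test_predictions = forest.predict(X_te)

cm = confusion_matrix(y_te, test_predictions)
print(cm)
```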



**Gradient Boosting**

In [36]:
predictions_GB = modelGB.predict(X_test)
confusion_matrix(y_test, predictions_GB)

array([[5554,  296],
       [1048,  602]])

accuracy = 0.820

In [47]:
reportGB = classification_report(y_test, predictions_GB)
print(reportGB)

              precision    recall  f1-score   support

           0       0.84      0.95      0.89      5850
           1       0.67      0.36      0.47      1650

   micro avg       0.82      0.82      0.82      7500
   macro avg       0.76      0.66      0.68      7500
weighted avg       0.80      0.82      0.80      7500



**SVM**

In [38]:
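SVM_cv = SVC()
modelSVM_cv = SVM_cv.fit(X_train, y_train)
# note: cross_val_predict clones the estimator and refits it on folds of the data it is given,
# so these predictions come from models trained on parts of the test split, not from the fit on X_train
predictions_SVM_cv = cross_val_predict(modelSVM_cv, X_test, y_test, cv=3)
confusion_matrix(y_test, predictions_SVM_cv)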

array([[5823,   27],
       [1607,   43]])

accuracy = 0.782

In [48]:
reportSVM = classification_report(y_test, predictions_SVM_cv)
print(reportSVM)

              precision    recall  f1-score   support

           0       0.78      1.00      0.88      5850
           1       0.61      0.03      0.05      1650

   micro avg       0.78      0.78      0.78      7500
   macro avg       0.70      0.51      0.46      7500
weighted avg       0.75      0.78      0.70      7500
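SVMs with the default RBF kernel are sensitive to feature scale, and this dataset mixes large amounts (credit limits, bill amounts) with small category codes. One commonly used option, which was not part of the original analysis, is to standardize the features inside a pipeline so the scaler is refit on each training fold. The sketch below uses synthetic data with one deliberately inflated feature; whether scaling would help on the credit data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_demo[:, 0] *= 10000  # mimic one column measured in much larger units

# The scaler is fit only on the training part of each fold, then SVC is trained.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_scores = cross_val_score(scaled_svm, X_demo, y_demo, cv=3)
print(scaled_scores)
```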



**Comparison of Model Performances by Graphs**

* Time taken for each model to run (using the train_test_split() function).
![](TimeTaken.png)

* Accuracy and F1 score of each model. 

|Accuracy|F1 Score|
|--------|--------|
|![](Accuracy.png)|![](F1.png)|

**Model Selection - Gradient Boosting**

After comparing all the models we have built (Logistic Regression, Random Forest, Gradient Boosting and SVM), the 
optimal model we have chosen is Gradient Boosting, based on the F1 score and time taken. 

Logistic Regression and SVM have relatively low accuracy and F1 scores. For Logistic Regression in particular, both precision
and recall for the default class are 0: the model predicts "no default" for every customer, so it cannot detect
any customers who will default next month. 

Random Forest and Gradient Boosting have very similar accuracy and F1 scores, but Gradient Boosting took roughly half 
the time to fit (about 3.8 s versus 6.9 s for Random Forest). 
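The fit-time comparison can be collected with a small helper instead of repeating the timing code for each model. This is a sketch on synthetic data; the helper name `time_fit` is an assumption, and the timings it prints depend on the machine and will not match the figures reported in this notebook.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier


def time_fit(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

fit_seconds = {
    name: time_fit(model, X_demo, y_demo) for name, model in candidates.items()
}
for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.2f} s")
```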


### Suggestions to Credit One

* Deploy the final model to different platforms: computer, smartphone, tablet, etc.  
* Use the deployed model as a reference to judge whether a client is likely to default. 
* Use the model to auto-approve loans below a certain amount to improve efficiency.  
* Maintain the model regularly to ensure its best performance. 
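As a rough illustration of the first three suggestions, the sketch below saves a fitted model with `joblib`, reloads it, and applies a simple auto-approval rule. Everything specific here is hypothetical: the function `auto_approve`, the file name, the 50,000 amount limit and the 0.2 probability cut-off are placeholders that Credit One would need to set, and the data is synthetic.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
model = GradientBoostingClassifier(n_estimators=25, random_state=0)
model.fit(X_demo, y_demo)


def auto_approve(model, applicant_features, loan_amount,
                 max_amount=50000, max_default_prob=0.2):
    """Approve only small loans whose predicted default probability is low.

    The thresholds are hypothetical placeholders, not recommendations.
    """
    default_prob = model.predict_proba([applicant_features])[0, 1]
    return bool(loan_amount <= max_amount and default_prob <= max_default_prob)


# Persist the fitted model and load it back, as a deployment would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "default_model.joblib")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

decision = auto_approve(reloaded, X_demo[0], loan_amount=20000)
print(decision)
```

Applications that fail the rule would go to manual review rather than being rejected outright.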