# Credit Risk: Predicting the Probability of Default
## Plan

Predict the probability of default with several machine learning models:
- Logistic regression
- Discriminant analysis
- Naive Bayes
- k-nearest neighbors (KNN)
- Random forest (RF)
- Classification trees
- Neural networks

## Data

### Data Source: UCI Machine Learning Repository.

- https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients#
- https://bradzzz.gitbooks.io/ga-seattle-dsi/content/dsi/dsi_05_classification_databases/2.1-lesson/assets/datasets/DefaultCreditCardClients_yeh_2009.pdf

### Data Set Information:

This research examined customers' default payments in Taiwan and compared the predictive accuracy of the probability of default across six data mining methods. From a risk management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible. Because the real probability of default is unknown, the study introduced the Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predicted probability of default as the independent variable (X), the simple linear regression Y = A + BX shows that the artificial neural network produces the highest coefficient of determination, with a regression intercept (A) close to zero and a regression coefficient (B) close to one. The study therefore concluded that, among the six data mining techniques, the artificial neural network is the only one that accurately estimates the real probability of default.
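
The evaluation idea described above can be sketched in a few lines. This is a minimal illustration on synthetic data, based on a reading of the paper linked under Data Source: clients are sorted by predicted probability, the "real" probability is approximated by a moving average of the observed 0/1 outcomes over neighboring clients, and Y = A + BX is then fitted. The function name `sorting_smoothing`, the half-width `n=50`, and the simulated predictions are assumptions for illustration, not the study's code.

```python
import numpy as np

def sorting_smoothing(y_true, p_pred, n=50):
    """Approximate the real default probability by smoothing sorted outcomes.

    Observations are sorted by predicted probability, then each outcome is
    replaced by the mean of the 2n + 1 outcomes centred on it. The first and
    last n observations are dropped because their window is incomplete.
    """
    order = np.argsort(p_pred, kind="stable")
    y_sorted = np.asarray(y_true, dtype=float)[order]
    p_sorted = np.asarray(p_pred, dtype=float)[order]

    window = 2 * n + 1
    kernel = np.ones(window) / window
    smoothed = np.convolve(y_sorted, kernel, mode="valid")  # length N - 2n
    return p_sorted[n:len(p_sorted) - n], smoothed

# Synthetic, well-calibrated predictions (hypothetical data, not the UCI set)
rng = np.random.default_rng(0)
p_pred = rng.uniform(0.0, 1.0, size=5000)
y_true = (rng.uniform(size=5000) < p_pred).astype(int)

x, y = sorting_smoothing(y_true, p_pred, n=50)

# Fit Y = A + BX; polyfit returns the slope first, then the intercept
B, A = np.polyfit(x, y, deg=1)
r_squared = np.corrcoef(x, y)[0, 1] ** 2
print(f"A = {A:.3f}, B = {B:.3f}, R^2 = {r_squared:.3f}")
```

For a model whose predicted probabilities are well calibrated, A should land near zero and B near one, which is the criterion the study applies to each of the six methods.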

### Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. 

This study reviewed the literature and used the following 23 variables as explanatory variables:

- X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
- X2: Gender (1 = male; 2 = female).
- X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- X4: Marital status (1 = married; 2 = single; 3 = others).
- X5: Age (year).
- X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
    - X6 = the repayment status in September, 2005;
    - X7 = the repayment status in August, 2005;
    - . . .;
    - X11 = the repayment status in April, 2005.

    The measurement scale for the repayment status is:
    - -1 = pay duly;
    - 1 = payment delay for one month;
    - 2 = payment delay for two months;
    - . . .;
    - 8 = payment delay for eight months;
    - 9 = payment delay for nine months and above.


- X12-X17: Amount of bill statement (NT dollar). 
    - X12 = amount of bill statement in September, 2005; 
    - X13 = amount of bill statement in August, 2005; 
    - . . .; 
    - X17 = amount of bill statement in April, 2005.

- X18-X23: Amount of previous payment (NT dollar). 
    - X18 = amount paid in September, 2005; 
    - X19 = amount paid in August, 2005; 
    - . . .;
    - X23 = amount paid in April, 2005.
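
The repayment-status scale documented for X6-X11 lists only -1 and 1 through 9, so it is worth checking whether the data stay within those codes. The sketch below does this for two repayment-status columns, using the five rows that `data.head()` prints later in this notebook; the helper `undocumented_values` and the constant `DOCUMENTED_STATUS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Codes defined by the attribute description: -1 and 1..9
DOCUMENTED_STATUS = {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9}

# First five rows of two repayment-status columns, copied from data.head()
sample = pd.DataFrame(
    {
        "X6:PAY_0": [2, -1, 0, 0, -1],
        "X10:PAY_5": [-2, 0, 0, 0, 0],
    },
    index=[1, 2, 3, 4, 5],
)

def undocumented_values(frame, documented):
    """Return the sorted distinct values in `frame` that are not in `documented`."""
    values = pd.unique(frame.to_numpy().ravel())
    return sorted(int(v) for v in values if v not in documented)

unexpected = undocumented_values(sample, DOCUMENTED_STATUS)
print(unexpected)
```

Even in these few rows the codes 0 and -2 appear, which the attribute description does not define. The same check can be run on all six repayment-status columns of `data` before modeling.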

## Environment Setup and EDA

In [15]:
# Data handling
import pandas as pd
import numpy as np

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Silence all library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [17]:
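# Use forward slashes: backslashes in a normal string would form invalid escapes such as "\d"
# The sheet has two header rows (X1..X23/Y codes and descriptive names), hence header=[0, 1]
data = pd.read_excel('Credit_Risk_Prediction/data_uci/default_of_credit_card_clients.xls', header=[0, 1], index_col=0)
# Flatten the two-level column index into single names such as "X1:LIMIT_BAL"
data.columns = data.columns.map(':'.join)
data.head()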

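| | X1:LIMIT_BAL | X2:SEX | X3:EDUCATION | X4:MARRIAGE | X5:AGE | X6:PAY_0 | X7:PAY_2 | X8:PAY_3 | X9:PAY_4 | X10:PAY_5 | ... | X15:BILL_AMT4 | X16:BILL_AMT5 | X17:BILL_AMT6 | X18:PAY_AMT1 | X19:PAY_AMT2 | X20:PAY_AMT3 | X21:PAY_AMT4 | X22:PAY_AMT5 | X23:PAY_AMT6 | Y:default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | -2 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |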
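
The column names above come from reading two header rows and joining each pair with a colon. The sketch below reproduces that step on a small hand-built frame so the mechanics are visible without the Excel file; the frame `toy` and its values are illustrative, with three of the column pairs taken from the output above.

```python
import pandas as pd

# Two header levels, as produced by read_excel(..., header=[0, 1])
columns = pd.MultiIndex.from_tuples(
    [
        ("X1", "LIMIT_BAL"),
        ("X2", "SEX"),
        ("Y", "default payment next month"),
    ]
)
toy = pd.DataFrame([[20000, 2, 1], [120000, 2, 1]], index=[1, 2], columns=columns)

# Each column label is a tuple such as ("X1", "LIMIT_BAL");
# ':'.join turns it into the single string "X1:LIMIT_BAL"
toy.columns = toy.columns.map(":".join)

flat_names = list(toy.columns)
print(flat_names)
```

Flat string names make column selection simpler, for example `toy["X1:LIMIT_BAL"]` instead of indexing with a tuple.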


In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 1 to 30000
Data columns (total 24 columns):
X1:LIMIT_BAL                    30000 non-null int64
X2:SEX                          30000 non-null int64
X3:EDUCATION                    30000 non-null int64
X4:MARRIAGE                     30000 non-null int64
X5:AGE                          30000 non-null int64
X6:PAY_0                        30000 non-null int64
X7:PAY_2                        30000 non-null int64
X8:PAY_3                        30000 non-null int64
X9:PAY_4                        30000 non-null int64
X10:PAY_5                       30000 non-null int64
X11:PAY_6                       30000 non-null int64
X12:BILL_AMT1                   30000 non-null int64
X13:BILL_AMT2                   30000 non-null int64
X14:BILL_AMT3                   30000 non-null int64
X15:BILL_AMT4                   30000 non-null int64
X16:BILL_AMT5                   30000 non-null int64
X17:BILL_AMT6                   30000 non-n

In [18]:
data.describe()

| | X1:LIMIT_BAL | X2:SEX | X3:EDUCATION | X4:MARRIAGE | X5:AGE | X6:PAY_0 | X7:PAY_2 | X8:PAY_3 | X9:PAY_4 | X10:PAY_5 | ... | X15:BILL_AMT4 | X16:BILL_AMT5 | X17:BILL_AMT6 | X18:PAY_AMT1 | X19:PAY_AMT2 | X20:PAY_AMT3 | X21:PAY_AMT4 | X22:PAY_AMT5 | X23:PAY_AMT6 | Y:default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | ... | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 | 30000.0 |
| mean | 167484.322667 | 1.603733 | 1.853133 | 1.551867 | 35.4855 | -0.0167 | -0.133767 | -0.1662 | -0.220667 | -0.2662 | ... | 43262.948967 | 40311.400967 | 38871.7604 | 5663.5805 | 5921.163 | 5225.6815 | 4826.076867 | 4799.387633 | 5215.502567 | 0.2212 |
| std | 129747.661567 | 0.489129 | 0.790349 | 0.52197 | 9.217904 | 1.123802 | 1.197186 | 1.196868 | 1.169139 | 1.133187 | ... | 64332.856134 | 60797.15577 | 59554.107537 | 16563.280354 | 23040.87 | 17606.96147 | 15666.159744 | 15278.305679 | 17777.465775 | 0.415062 |
| min | 10000.0 | 1.0 | 0.0 | 0.0 | 21.0 | -2.0 | -2.0 | -2.0 | -2.0 | -2.0 | ... | -170000.0 | -81334.0 | -339603.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 50000.0 | 1.0 | 1.0 | 1.0 | 28.0 | -1.0 | -1.0 | -1.0 | -1.0 | -1.0 | ... | 2326.75 | 1763.0 | 1256.0 | 1000.0 | 833.0 | 390.0 | 296.0 | 252.5 | 117.75 | 0.0 |
| 50% | 140000.0 | 2.0 | 2.0 | 2.0 | 34.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 19052.0 | 18104.5 | 17071.0 | 2100.0 | 2009.0 | 1800.0 | 1500.0 | 1500.0 | 1500.0 | 0.0 |
| 75% | 240000.0 | 2.0 | 2.0 | 2.0 | 41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 54506.0 | 50190.5 | 49198.25 | 5006.0 | 5000.0 | 4505.0 | 4013.25 | 4031.5 | 4000.0 | 0.0 |
| max | 1000000.0 | 2.0 | 6.0 | 3.0 | 79.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | ... | 891586.0 | 927171.0 | 961664.0 | 873552.0 | 1684259.0 | 896040.0 | 621000.0 | 426529.0 | 528666.0 | 1.0 |
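
Because the target is coded 0/1, its mean in the summary above (0.2212) is the share of clients who defaulted. The arithmetic below spells out what that implies for class balance. The variable names are illustrative, and the "always predict no default" baseline is an assumption added here for comparison rather than part of the notebook's plan.

```python
# Figures read from the describe() output: count and mean of the target column
n_clients = 30000
default_rate = 0.2212

# Number of clients in each class
n_defaults = round(n_clients * default_rate)
n_non_defaults = n_clients - n_defaults

# Accuracy of a trivial model that labels every client as "no default"
baseline_accuracy = n_non_defaults / n_clients

print(f"defaults: {n_defaults}, non-defaults: {n_non_defaults}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Since a model with no predictive power already scores about 78% accuracy, accuracy alone says little here, which is consistent with the study's focus on the quality of the estimated probabilities.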


In [14]:
data.isnull()

| | X1:LIMIT_BAL | X2:SEX | X3:EDUCATION | X4:MARRIAGE | X5:AGE | X6:PAY_0 | X7:PAY_2 | X8:PAY_3 | X9:PAY_4 | X10:PAY_5 | ... | X15:BILL_AMT4 | X16:BILL_AMT5 | X17:BILL_AMT6 | X18:PAY_AMT1 | X19:PAY_AMT2 | X20:PAY_AMT3 | X21:PAY_AMT4 | X22:PAY_AMT5 | X23:PAY_AMT6 | Y:default payment next month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29996 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 29997 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 29998 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 29999 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
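
A cell-by-cell boolean frame like the one above is hard to scan for 30,000 rows. A more compact check, sketched here on a small made-up frame, aggregates the mask into counts per column. The frame `toy` and its single missing value are hypothetical and only there to show what a non-zero count looks like.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one deliberately missing value
toy = pd.DataFrame(
    {
        "X1:LIMIT_BAL": [20000, np.nan, 90000],
        "X5:AGE": [24, 26, 34],
    }
)

# isnull() gives a boolean mask; summing it counts the True cells per column
missing_per_column = toy.isnull().sum()

# Single-number summaries for a quick yes/no answer
total_missing = int(missing_per_column.sum())
has_missing = bool(toy.isnull().values.any())

print(missing_per_column)
print(total_missing, has_missing)
```

Applied to the real frame, `data.isnull().sum()` would return one count per column, which can be compared against the non-null counts reported by `data.info()`.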
