<a href="https://colab.research.google.com/github/rgozun/Credit-Default-Prediction/blob/main/UCI_Credit_Default_Prediction_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Credit Payment Default
Ralph Gozun
Data Host: UCI Machine Learning Repository
Host URL: https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

Abstract: This notebook tests several predictive algorithms to determine whether a credit card customer will default on their next payment.

Metadata: This research employed a binary variable:

* default payment (Yes = 1, No = 0), as the response variable.

This study reviewed the literature and used the following 23 variables as explanatory variables:
* X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
* X2: Gender (1 = male; 2 = female).
* X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
* X4: Marital status (1 = married; 2 = single; 3 = others).
* X5: Age (year).
* X6-X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; ...; X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; ...; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
* X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; ...; X17 = amount of bill statement in April, 2005.
* X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; ...; X23 = amount paid in April, 2005.
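The spreadsheet labels these variables with descriptive column names rather than X1 to X23. A minimal sketch of the correspondence follows, assuming the X numbering matches the left-to-right column order after `ID`; the names `DESCRIPTIVE_NAMES` and `x_to_name` are illustrative and not part of the dataset.

```python
# Descriptive column names in spreadsheet order (ID and the target excluded).
# Note that the repayment-status columns jump from PAY_0 to PAY_2.
DESCRIPTIVE_NAMES = (
    ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE']
    + ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
    + [f'BILL_AMT{i}' for i in range(1, 7)]
    + [f'PAY_AMT{i}' for i in range(1, 7)]
)

# Map X1..X23 to the descriptive names, assuming positional correspondence.
x_to_name = {f'X{i}': name for i, name in enumerate(DESCRIPTIVE_NAMES, start=1)}

print(x_to_name['X1'], x_to_name['X6'], x_to_name['X12'], x_to_name['X23'])
```

Under that assumption, X6 (repayment status in September 2005) is `PAY_0`, X12 is `BILL_AMT1`, and X18 is `PAY_AMT1`.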



In [2]:
import os
import pandas as pd
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
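# Use the fully qualified option name; the bare 'precision' key is rejected by recent pandas versions.
pd.set_option('display.precision', 2)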

In [3]:
df_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls'
df = pd.read_excel(df_url,header=1)
# df = df.sample(frac=1,random_state=1)
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,-2,-2,3913,3102,689,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,0,2,2682,1725,2682,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,0,0,29239,14027,13559,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,0,0,46990,48233,49291,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,0,0,8617,5670,35835,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [11]:
# Cast the coded demographic columns to pandas categoricals.
df['SEX'] = df['SEX'].astype('category')
df['EDUCATION'] = df['EDUCATION'].astype('category')
df['MARRIAGE'] = df['MARRIAGE'].astype('category')
# TODO: four further casts were left without column names; df[''] raises
# KeyError, so name the columns before enabling lines like the one below.
# df[''] = df[''].astype('category')

In [10]:
display(df.describe(), df.dtypes)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,167484.32,1.6,1.85,1.55,35.49,-0.02,-0.13,-0.17,-0.22,-0.27,-0.29,51223.33,49179.08,47000.0,43262.95,40311.4,38871.76,5663.58,5920.0,5225.68,4826.08,4799.39,5215.5,0.22
std,8660.4,129747.66,0.49,0.79,0.52,9.22,1.12,1.2,1.2,1.17,1.13,1.15,73635.86,71173.77,69300.0,64332.86,60797.16,59554.11,16563.28,23000.0,17606.96,15666.16,15278.31,17777.47,0.42
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-165580.0,-69777.0,-157000.0,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3558.75,2984.75,2670.0,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,15000.5,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,22381.5,21200.0,20100.0,19052.0,18104.5,17071.0,2100.0,2010.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,22500.25,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,0.0,67091.0,64006.25,60200.0,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,30000.0,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,8.0,964511.0,983931.0,1660000.0,891586.0,927171.0,961664.0,873552.0,1680000.0,896040.0,621000.0,426529.0,528666.0,1.0


ID                            int64
LIMIT_BAL                     int64
SEX                           int64
EDUCATION                     int64
MARRIAGE                      int64
AGE                           int64
PAY_0                         int64
PAY_2                         int64
PAY_3                         int64
PAY_4                         int64
PAY_5                         int64
PAY_6                         int64
BILL_AMT1                     int64
BILL_AMT2                     int64
BILL_AMT3                     int64
BILL_AMT4                     int64
BILL_AMT5                     int64
BILL_AMT6                     int64
PAY_AMT1                      int64
PAY_AMT2                      int64
PAY_AMT3                      int64
PAY_AMT4                      int64
PAY_AMT5                      int64
PAY_AMT6                      int64
default payment next month    int64
dtype: object
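The summary above shows codes that the variable descriptions do not define: `EDUCATION` ranges from 0 to 6, `MARRIAGE` has a minimum of 0, and the `PAY_*` columns reach -2. A minimal sketch for counting such codes follows. It runs on a made-up toy frame rather than the real `df`, and the helper `undocumented_counts` and the `DOCUMENTED` table are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Codes defined in the variable descriptions; anything else is undocumented.
DOCUMENTED = {
    'EDUCATION': {1, 2, 3, 4},
    'MARRIAGE': {1, 2, 3},
    'PAY_0': {-1, 1, 2, 3, 4, 5, 6, 7, 8, 9},
}

def undocumented_counts(frame, documented=DOCUMENTED):
    """Return {column: counts} for the codes that fall outside the documented sets."""
    result = {}
    for col, allowed in documented.items():
        counts = frame[col].value_counts()
        extra = counts[~counts.index.isin(allowed)]
        if not extra.empty:
            result[col] = extra.sort_index()
    return result

# Toy stand-in for df (made-up rows, not taken from the dataset).
toy = pd.DataFrame({
    'EDUCATION': [1, 2, 0, 6, 2],
    'MARRIAGE': [1, 2, 0, 3, 1],
    'PAY_0': [-2, 0, -1, 2, 0],
})

for col, extra in undocumented_counts(toy).items():
    print(col, extra.to_dict())
```

Calling `undocumented_counts(df)` on the loaded data would show how many rows carry each undefined code, which may help when deciding whether to merge those codes into an existing category or keep them separate.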

In [None]:
corrMatrix = df.corr()
print(corrMatrix)

Empty DataFrame
Columns: []
Index: []


In [None]:
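# Features are columns 1-23 (LIMIT_BAL ... PAY_AMT6); the target is the default flag.
X, y = df.iloc[:, 1:24], df['default payment next month']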
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.10, random_state=1)
print(X_train.head(n=2),X_test.head(n=2),y_test.head(n=2))

           X1 X2 X3 X4  X5  X6  X7  ...  X17   X18   X19   X20    X21  X22   X23
10809  320000  1  1  2  33  -2  -2  ...  700  3609  4615  5256  22760  700  1992
7546    80000  2  3  2  34  -2  -2  ...    0   395  1846  8759      0    0     0

[2 rows x 23 columns]            X1 X2 X3 X4  X5 X6 X7  ...     X17    X18   X19   X20   X21    X22   X23
3261   170000  1  2  1  42  0  0  ...  134525   5300  5100  5000  4700   5000  4800
20927  340000  1  1  1  39  0  0  ...  206924  16100  9000  4000  7200  15200     0

[2 rows x 23 columns] 3261     0
20927    0
Name: Y, dtype: object


In [None]:
X.describe()

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
count,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001,30001
unique,82,3,8,5,57,12,12,12,12,11,11,22724,22347,22027,21549,21011,20605,7944,7900,7519,6938,6898,6940
top,50000,2,2,2,29,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
freq,3365,18112,14030,15964,1605,14737,15730,15764,16455,16947,16286,2008,2506,2870,3195,3506,4020,5249,5396,5968,6408,6703,7173


In [None]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)

ValueError: ignored
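The `X.describe()` output above reports 30001 rows with `unique`/`top`/`freq` statistics, which suggests `X` held object-dtype columns that included the spreadsheet's second header row as data. That would likely explain the `ValueError`, since `KMeans` needs numeric input. A minimal sketch of coercing such a frame to numbers follows, using a toy frame; `X_raw` and `X_num` are illustrative names.

```python
import pandas as pd

# Toy stand-in for X as read with the default header: the first data row
# holds the descriptive labels, so every column has object dtype.
X_raw = pd.DataFrame({
    'X1': ['LIMIT_BAL', 20000, 120000, 90000],
    'X5': ['AGE', 24, 26, 34],
})

# Convert each column to numbers; non-numeric cells (the label row) become NaN.
X_num = X_raw.apply(pd.to_numeric, errors='coerce')

# Drop the rows that failed to convert and restore integer dtypes.
X_num = X_num.dropna().astype('int64')

print(X_num.dtypes.to_dict())
print(len(X_raw), '->', len(X_num))
```

Reading the file with `header=1`, as the loading cell does, avoids the label row altogether. Because the columns span very different magnitudes (credit limits against ages), standardizing them before clustering is also worth considering.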

In [None]:
X.head(1000)

Unnamed: 0,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23
10747,20000,2,3,1,47,0,0,0,-1,-1,-2,8152,7268,3699,780,0,0,1125,1000,780,0,0,0
12573,50000,1,1,2,26,0,0,-1,-1,-1,-1,11550,11368,138,1022,316,450,1000,200,1200,600,600,600
29677,50000,1,2,1,28,-1,-1,-1,0,-1,-1,430,0,46257,45975,1300,43987,0,46257,2200,1300,43987,1386
8856,120000,1,2,2,26,0,0,0,0,0,0,114815,113360,116797,92346,88542,80225,6159,10008,3051,3100,3052,2908
21098,200000,1,1,1,42,2,2,2,2,2,2,168289,172001,175281,177895,180078,184048,8000,7500,7000,6600,7000,7100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23285,290000,2,3,1,49,0,0,-2,-1,0,-1,52894,0,0,44777,26201,2780,21560,0,44777,0,5560,6420
3782,430000,2,1,1,48,-1,-1,-2,-2,-2,-1,9900,0,0,0,0,2299,0,0,0,0,2299,37980
18013,110000,1,2,1,30,2,0,0,3,2,2,54404,57503,66128,64477,65413,66520,4000,9600,0,2600,2300,3000
20220,230000,2,2,2,53,0,0,0,0,0,0,9358,13482,17874,26743,26431,35153,5000,5000,10000,5000,10018,6850
