# Default of Credit Card Clients Dataset

### Default Payments of Credit Card Clients in Taiwan from 2005

Source: Kaggle, https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset

Dataset Information

This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

Content

There are 25 variables:

> - ID: ID of each client
> - LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
> - SEX: Gender (1=male, 2=female)
> - EDUCATION: (1=graduate school, 2=university, 3=high school, 0,4,5,6=others)
> - MARRIAGE: Marital status (1=married, 2=single, 3=divorce, 0=others)
> - AGE: Age in years
> - PAY_0: Repayment status in September, 2005 (-2=No consumption, -1=pay duly, 0=The use of revolving credit, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
> - PAY_2: Repayment status in August, 2005 (scale same as above)
> - PAY_3: Repayment status in July, 2005 (scale same as above)
> - PAY_4: Repayment status in June, 2005 (scale same as above)
> - PAY_5: Repayment status in May, 2005 (scale same as above)
> - PAY_6: Repayment status in April, 2005 (scale same as above)
> - BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
> - BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
> - BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
> - BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
> - BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
> - BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
> - PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
> - PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
> - PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
> - PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
> - PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
> - PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
> - default.payment.next.month: Default payment (1=yes, 0=no)
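
The integer codes in the list above are easier to read once they are mapped to labels. A minimal sketch, assuming only the codes listed in this section; the dictionary names and the `decode` helper are illustrative and are not part of the dataset or the notebook:

```python
# Code-to-label maps taken from the variable list above.
SEX_LABELS = {1: "male", 2: "female"}
EDUCATION_LABELS = {1: "graduate school", 2: "university", 3: "high school"}
MARRIAGE_LABELS = {1: "married", 2: "single", 3: "divorce"}


def decode(code, labels, default="others"):
    """Return the label for a code; unlisted codes (e.g. 0, 4, 5, 6) map to 'others'."""
    return labels.get(code, default)


print(decode(2, SEX_LABELS))        # female
print(decode(5, EDUCATION_LABELS))  # others
print(decode(0, MARRIAGE_LABELS))   # others
```

On a DataFrame, `df['SEX'].map(SEX_LABELS)` applies the same lookup to a whole column; codes missing from the dictionary become NaN there instead of 'others'.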

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

pd.set_option('display.max_columns', 30)
pd.set_option('display.max_colwidth', 120)

import warnings
warnings.filterwarnings(action='ignore')

import os
print(os.listdir("../input"))

#### Read in UCI_Credit_Card.csv to df dataframe

In [None]:
# df = pd.read_csv('drive/python/UCI_Credit_Card.csv')
df = pd.read_csv('../input/UCI_Credit_Card.csv')

#### Quick look into the dataframe

In [None]:
print(df.shape)
df.sample(10)
df.head()
df.dtypes

In [None]:
print(df.columns)
df.describe()

#### Rename the label from 'default.payment.next.month' to 'DEFAULT'

In [None]:
df.rename(inplace=True, columns={'default.payment.next.month': 'DEFAULT'}) 
df.columns

In [None]:
df.groupby('DEFAULT').size()

#### Perform visualization on all the features

In [None]:
df.groupby('DEFAULT').hist(figsize=(20,20))
plt.show()
plt.close()

#### Drop ID feature from the dataset

In [None]:
df.drop('ID', axis=1, inplace=True)
df.head()

#### Check for missing values

In [None]:
df.isnull().any()

#### Check for bad data: rows where the latest bill is zero or negative but the client is still marked as DEFAULT

In [None]:
df_bad = df[(df['BILL_AMT1'] <= 0) & (df['DEFAULT'] == 1)]
print(df_bad.shape)
df_bad[['BILL_AMT1','DEFAULT']].sample(10)

#### Remove bad data from the dataset

In [None]:
# Keep every row except those with a non-positive latest bill that are still marked as default
df = df[~((df['BILL_AMT1'] <= 0) & (df['DEFAULT'] == 1))]
df.shape

#### Remove rows where EDUCATION is categorized as 'others'

In [None]:
# Drop the 'others' codes (0, 4, 5, 6) with a boolean mask instead of a row-by-row loop
df = df[~((df['EDUCATION'] >= 4) | (df['EDUCATION'] == 0))]
df.shape

#### Remove rows where MARRIAGE is categorized as 'others'

In [None]:
# Drop the 'others' code (0) with a boolean mask instead of a row-by-row loop
df = df[df['MARRIAGE'] != 0]
df.shape

In [None]:
df.groupby('DEFAULT').size()

#### The classes split into 79.35% DEFAULT 0 and 20.65% DEFAULT 1, so the dataset is used as-is without re-balancing
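
With roughly four in five clients in class 0, accuracy scores are easiest to judge against a majority-class baseline, that is, the accuracy of always predicting the most common class. A minimal sketch on toy labels (hypothetical data that only mimics the split reported above, not the real dataset):

```python
from collections import Counter

# Toy labels with roughly the split reported above (hypothetical, not the real data).
labels = [0] * 79 + [1] * 21

counts = Counter(labels)
shares = {cls: n / len(labels) for cls, n in counts.items()}

# Accuracy of a "model" that always predicts the most common class.
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(shares)             # {0: 0.79, 1: 0.21}
print(baseline_accuracy)  # 0.79
```

Any model score reported later is informative only to the extent that it exceeds this baseline.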

#### Importing all sklearn libraries and creating a list for all the models

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC()))
models.append(('LR', LogisticRegression()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('GB', GradientBoostingClassifier()))

#### Getting the label and all features in place for feature selection and modeling

In [None]:
array = df.values
X_sel = array[:,0:23]
Y_sel = array[:,23]
features = df.columns[:-1]

In [None]:
main_df = pd.DataFrame(columns=['Num_Of_Feature','Features_Sel','KNN','SVM','LR','DT','GNB','RF','GB'])
model_feat = LogisticRegression()

s_highest = 0

for n in range(3,11):
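    print(f"Selecting {n} features with RFE and running all models...")
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    rfe = RFE(model_feat, n_features_to_select=n)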
    fit = rfe.fit(X_sel, Y_sel)
    
    # Keep the column names whose RFE support flag is set
    features_sel = [col for sel, col in zip(fit.support_, features) if sel]
    
    x = df[(features_sel)]
    y = df.DEFAULT

    x_train, x_test, y_train, y_test = train_test_split(
    x, y, stratify = df.DEFAULT, random_state=123)
    
    names = []
    scores = []
    for name, model in models:
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        score = accuracy_score(y_test, y_pred)
        scores.append(score)
        names.append(name)
        if score > s_highest:
            s_highest = score
            f_highest = features_sel
            n_highest = name
            m_highest = model
            
    # DataFrame.append was removed in pandas 2.0; build a one-row frame and concatenate instead
    row = {'Num_Of_Feature': n,
           'Features_Sel': ", ".join(features_sel),
           **dict(zip(names, scores))}
    # infer_objects() keeps the score columns numeric so they can be plotted and described
    main_df = pd.concat([main_df, pd.DataFrame([row])],
                        ignore_index=True).infer_objects()

print('The highest score is',s_highest,'with these features',f_highest,'on model',n_highest)
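
The feature-selection step in the loop above can be looked at in isolation. A minimal sketch of recursive feature elimination on synthetic data; the dataset from `make_classification`, the `f0`..`f7` names, and `max_iter=1000` are assumptions made for illustration and do not come from the notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the credit data (hypothetical; 8 features, 3 informative).
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

# Keep the 3 features that survive recursive elimination.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask aligned with the input columns.
selected = feature_names[rfe.support_].tolist()
print(selected)
```

Indexing the name array with `support_` gives the same result as the `zip` over `fit.support_` and `features` in the loop.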

In [None]:
main_df

#### Plot the accuracy of every model against the number of features

In [None]:
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(111)
main_df.plot(kind='line',x='Num_Of_Feature',y='KNN',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='SVM',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='LR',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='DT',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='GNB',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='RF',ax=ax)
main_df.plot(kind='line',x='Num_Of_Feature',y='GB',ax=ax)

ax.set_xticks(np.arange(3, 11, step=1.0))
ax.set_yticks(np.arange(0.73, 0.85, step=0.01))

plt.show()
plt.close()

In [None]:
main_df.describe()

#### The maximum score of each model can be read from the `max` row of the `describe()` output

#### Re-assign the training and testing dataset

In [None]:
x = df[(f_highest)]
y = df.DEFAULT

x_train, x_test, y_train, y_test = train_test_split(
x, y, stratify = df.DEFAULT, random_state=123)

#### Take a look at the model selected and the parameters

In [None]:
model = m_highest
model

#### Use RandomizedSearchCV or GridSearchCV to fine tune the model

In [None]:
# from sklearn.model_selection import GridSearchCV
# parameters = {
#     "loss":["deviance"],
#     "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
#     "min_samples_split": np.linspace(0.1, 0.5, 12),
#     "min_samples_leaf": np.linspace(0.1, 0.5, 12),
#     "max_depth":[3,5,8],
#     "max_features":["log2","sqrt"],
#     "criterion": ["friedman_mse",  "mae"],
#     "subsample":[0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
#     "n_estimators":[10]
#     }
#
# clf = GridSearchCV(GradientBoostingClassifier(), parameters, cv=10, n_jobs=-1)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

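parameters = {
    # "loss" is left at its default (binomial deviance): the string "deviance"
    # was renamed to "log_loss" in newer scikit-learn releases
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth":[3,5,8],
    "criterion": ["friedman_mse"],
    "subsample":[0.5, 0.8, 1.0],
    "n_estimators":[100]
    }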

clf = GridSearchCV(GradientBoostingClassifier(), parameters, cv=10, n_jobs=-1)
# clf = RandomizedSearchCV(GradientBoostingClassifier(), parameters, cv=10, n_iter=100, n_jobs=-1)
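
The cost of a grid search grows with the product of the list lengths in the grid. A small sketch that counts candidate settings and cross-validation fits for the two grids in this section; `count_fits` is a hypothetical helper, and single-valued entries are left out because they do not change the count:

```python
from math import prod


def count_fits(grid, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    candidates = prod(len(values) for values in grid.values())
    return candidates, candidates * cv


# The reduced grid used in the active cell.
small_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 8],
    "subsample": [0.5, 0.8, 1.0],
}

# The larger grid from the commented-out cell (12 evenly spaced values stand in for np.linspace).
twelve_values = [0.1 + i * 0.4 / 11 for i in range(12)]
large_grid = {
    "learning_rate": [0.01, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2],
    "min_samples_split": twelve_values,
    "min_samples_leaf": twelve_values,
    "max_depth": [3, 5, 8],
    "max_features": ["log2", "sqrt"],
    "criterion": ["friedman_mse", "mae"],
    "subsample": [0.5, 0.618, 0.8, 0.85, 0.9, 0.95, 1.0],
}

print(count_fits(small_grid, cv=10))  # (27, 270)
print(count_fits(large_grid, cv=10))  # (84672, 846720)
```

`GridSearchCV` also refits the best candidate once on the full data by default, so the real totals are one fit higher.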

#### Run the parameter tuning

In [None]:
import datetime
#Starting time
print("Start time is",datetime.datetime.now())

# Beware: this line can take hours to run, depending on the parameter grid above
clf.fit(x, y)

#Stop time
print("Stop time is",datetime.datetime.now())

In [None]:
print(clf.best_params_)

In [None]:
print(clf.best_estimator_)

#### Accuracy of the tuned model, estimated with 10-fold cross-validation

In [None]:
final_score = cross_val_score(clf.best_estimator_, x, y, 
                              cv=10, scoring='accuracy').mean()
print("Final accuracy : {} ".format(final_score))

#### Plot the shares of SEX, MARRIAGE, EDUCATION and AGE among defaulters

In [None]:
# Work on a copy so the AGE GRP column added later does not write into a slice of df
df1 = df[df['DEFAULT']==1].copy()
df1.shape

temp_list = [x for x in df1['EDUCATION'] if x == 1]
GS = len(temp_list)/len(df1)
temp_list = [x for x in df1['EDUCATION'] if x == 2]
UNI = len(temp_list)/len(df1)
temp_list = [x for x in df1['EDUCATION'] if x == 3]
HS = len(temp_list)/len(df1)

data = {'Graduate School': [GS], 'University': [UNI], 'High School':[HS]}
df2 = pd.DataFrame.from_dict(data)

df2.plot.bar(stacked=True, title ='EDUCATION %',figsize=(10,6))
plt.show()
plt.close()

df2.rename(index={0: 'EDUCATION'})
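
The per-category shares computed above with list comprehensions can also be obtained with `Series.value_counts(normalize=True)`. A minimal sketch on a handful of made-up EDUCATION codes (hypothetical values, not taken from the dataset):

```python
import pandas as pd

# Hypothetical EDUCATION codes for a handful of defaulters.
education = pd.Series([1, 2, 2, 3, 2, 1, 2, 3], name="EDUCATION")

# Share of each code as a fraction of all rows, ordered by code.
shares = education.value_counts(normalize=True).sort_index()

# Replace the numeric codes with readable labels.
shares.index = shares.index.map({1: "Graduate School", 2: "University", 3: "High School"})
print(shares)
```

The same pattern applies to MARRIAGE, SEX and the age groups, and `shares.to_frame().T.plot.bar(stacked=True)` produces a single stacked bar like the ones in this section.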

In [None]:
temp_list = [x for x in df1['MARRIAGE'] if x == 1]
MA = len(temp_list)/len(df1)
temp_list = [x for x in df1['MARRIAGE'] if x == 2]
SG = len(temp_list)/len(df1)
temp_list = [x for x in df1['MARRIAGE'] if x == 3]
DV = len(temp_list)/len(df1)

data = {'Married': [MA], 'Single': [SG], 'Divorce':[DV]}
df3 = pd.DataFrame.from_dict(data)

df3.plot.bar(stacked=True, title ='MARRIAGE %',figsize=(10,6))
plt.show()
plt.close()

df3.rename(index={0: 'MARRIAGE'})

In [None]:
temp_list = [x for x in df1['SEX'] if x == 1]
MA = len(temp_list)/len(df1)
temp_list = [x for x in df1['SEX'] if x == 2]
FE = len(temp_list)/len(df1)

data = {'Male': [MA], 'Female': [FE]}
df4 = pd.DataFrame.from_dict(data)

df4.plot.bar(stacked=True, title ='SEX %',figsize=(10,6))
plt.show()
plt.close()

df4.rename(index={0: 'SEX'})

In [None]:
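# right=False makes the bins left-closed, [0, 31), [31, 41), ..., so that they match the labels
df1['AGE GRP'] = pd.cut(df1['AGE'], [0, 31, 41, 51, 61, 101], right=False, labels=['Below 30', '31-40', '41-50', '51-60', 'Above 61'])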

temp_list = [x for x in df1['AGE GRP'] if x == 'Below 30']
GP1 = len(temp_list)/len(df1)
temp_list = [x for x in df1['AGE GRP'] if x == '31-40']
GP2 = len(temp_list)/len(df1)
temp_list = [x for x in df1['AGE GRP'] if x == '41-50']
GP3 = len(temp_list)/len(df1)
temp_list = [x for x in df1['AGE GRP'] if x == '51-60']
GP4 = len(temp_list)/len(df1)
temp_list = [x for x in df1['AGE GRP'] if x == 'Above 61']
GP5 = len(temp_list)/len(df1)

data = {'Below 30': [GP1], '31-40': [GP2], '41-50':[GP3], '51-60':[GP4], 'Above 61':[GP5]}
df4 = pd.DataFrame.from_dict(data)

df4.plot.bar(stacked=True, title ='AGE GROUP %',figsize=(10,6))
plt.show()
plt.close()

df4.rename(index={0: 'AGE GROUP'})
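
How `pd.cut` assigns ages that sit exactly on a bin edge depends on its `right` argument. A small sketch with the same bin edges and labels as the cell above, applied to a few boundary ages chosen for illustration:

```python
import pandas as pd

ages = pd.Series([30, 31, 40, 41, 60, 61])
bins = [0, 31, 41, 51, 61, 101]
labels = ['Below 30', '31-40', '41-50', '51-60', 'Above 61']

# Default right=True: intervals are (0, 31], (31, 41], ... so 31 lands in the first group.
right_closed = pd.cut(ages, bins, labels=labels)

# right=False: intervals are [0, 31), [31, 41), ... so 31 lands in '31-40'.
left_closed = pd.cut(ages, bins, labels=labels, right=False)

print(pd.DataFrame({'AGE': ages,
                    'right=True': right_closed,
                    'right=False': left_closed}))
```

With these edges, only the left-closed variant places ages 31, 41, 51 and 61 in the groups their labels suggest.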

## Conclusion 

#### As we can see, 'SEX', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_5' and 'PAY_6' are the important features in this dataset, and Gradient Boosting (GB) is the best of the models compared.

#### Clients who are below 30 years old, female, single, or university-educated make up the largest shares of defaulters. The previous repayment trend is also an indicator of payment default.
