# University of Western Ontario CS 9114A Introduction to Data Science Final Project
## Credit Card Fraud Detection

#### *By Xuanzhi Huang, Wanyue Xin, Tongchen Yi*

We chose the ["Default of Credit Card Clients"](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset from UCI for our final project. We use several classification algorithms to predict whether a client will default.

## 1. Preliminaries

In [None]:
!pip install scorecardpy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scorecardpy as sc
from sklearn.model_selection import train_test_split
import seaborn as sns

## 2. Data Exploration

* ### Load Dataset

In [None]:
df = pd.read_csv("../input/default-of-credit-card-clients-dataset/UCI_Credit_Card.csv")
df = df.drop('ID', axis = 1)
df.head()

* ### Attribute Information

Here is the attribute information from UCI.
* **ID:** ID of each client
* **LIMIT_BAL:** Amount of given credit in NT dollars (includes individual and family/supplementary credit)
* **SEX:** Gender (1 = male, 2 = female)
* **EDUCATION:** (1 = graduate school, 2 = university, 3 = high school, 4 = others)
* **MARRIAGE:** Marital status (1 = married, 2 = single, 3 = others)
* **AGE:** Age (year)
* **PAY_0:** Repayment status in September, 2005 (-1 = pay duly, 1 = payment delay for one month, 2 = payment delay for two months, ... 8 = payment delay for eight months, 9 = payment delay for nine months and above)
* **PAY_2:** Repayment status in August, 2005 (scale same as above)
* **PAY_3:** Repayment status in July, 2005 (scale same as above)
* **PAY_4:** Repayment status in June, 2005 (scale same as above)
* **PAY_5:** Repayment status in May, 2005 (scale same as above)
* **PAY_6:** Repayment status in April, 2005 (scale same as above)
* **BILL_AMT1:** Amount of bill statement in September, 2005 (NT dollar)
* **BILL_AMT2:** Amount of bill statement in August, 2005 (NT dollar)
* **BILL_AMT3:** Amount of bill statement in July, 2005 (NT dollar)
* **BILL_AMT4:** Amount of bill statement in June, 2005 (NT dollar)
* **BILL_AMT5:** Amount of bill statement in May, 2005 (NT dollar)
* **BILL_AMT6:** Amount of bill statement in April, 2005 (NT dollar)
* **PAY_AMT1:** Amount of previous payment in September, 2005 (NT dollar)
* **PAY_AMT2:** Amount of previous payment in August, 2005 (NT dollar)
* **PAY_AMT3:** Amount of previous payment in July, 2005 (NT dollar)
* **PAY_AMT4:** Amount of previous payment in June, 2005 (NT dollar)
* **PAY_AMT5:** Amount of previous payment in May, 2005 (NT dollar)
* **PAY_AMT6:** Amount of previous payment in April, 2005 (NT dollar)
* **default.payment.next.month:** Default payment (1 = yes, 0 = no)

* ### Data Distribution & Description

As the first step, find out if there are missing or anomalous data.

In [None]:
df.info()

Fortunately, there are no missing values in the dataset. Next, we look at the description and distribution of each variable to check for anomalous data. We divide the variables into categorical and numerical ones and examine each group in turn.

#### 1) Categorical Variables

There are three categorical variables in the dataset: SEX, MARRIAGE, and EDUCATION. We look at their description and distribution.

In [None]:
df[['SEX', 'EDUCATION', 'MARRIAGE']].describe()

In [None]:
plt.subplots_adjust(left = 0, bottom = 0, right = 1.5, top = 1.5, wspace = 0.3, hspace = 0.3)
font = {'family' : 'Times New Roman', 'weight' : 'normal', 'size' : 12}

plt.subplot(2, 2, 1)
df.SEX.value_counts().plot(kind = 'bar')
plt.xlabel("SEX", font)
plt.ylabel("Number", font)

plt.subplot(2, 2, 2)
df.MARRIAGE.value_counts().plot(kind = 'bar')
plt.xlabel("MARRIAGE", font)
plt.ylabel("Number", font)

plt.subplot(2, 2, 3)
df.EDUCATION.value_counts().plot(kind = 'bar')
plt.xlabel("EDUCATION", font)
plt.ylabel("Number", font)

According to the attribute information given by UCI, SEX = 1 means "male", while SEX = 2 indicates "female"; MARRIAGE = 1 means "married", 2 means "single", and 3 indicates "others", such as "divorced"; EDUCATION = 1 indicates "graduate school", 2 means "university", 3 means "high school", and 4 indicates "others", like "primary school". From the description tables and bar plots above, we can find some undocumented labels (MARRIAGE = 0, EDUCATION = 0) and some unknown labels (EDUCATION = 5, EDUCATION = 6). All of these can be safely categorized as "others": they are few in number, so even a wrong categorization will not affect our prediction much. Besides, building separate models to work out what these undocumented or unknown labels mean would be difficult and overly complicated. We will deal with these labels in the data cleaning part.
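To back the claim that the undocumented labels are rare, their share can be measured directly. The following is a minimal sketch on toy data; the helper `undocumented_share`, the `DOCUMENTED` mapping, and the toy values are hypothetical and only illustrate the check.

```python
import pandas as pd

# Documented codes taken from the UCI attribute list above.
DOCUMENTED = {"EDUCATION": [1, 2, 3, 4], "MARRIAGE": [1, 2, 3]}

def undocumented_share(frame, documented=DOCUMENTED):
    """Return, per column, the fraction of rows whose code is not documented."""
    return {
        col: float((~frame[col].isin(codes)).mean())
        for col, codes in documented.items()
    }

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "EDUCATION": [1, 2, 2, 3, 0, 5, 6, 4],
    "MARRIAGE": [1, 2, 2, 1, 0, 3, 2, 1],
})
shares = undocumented_share(toy)
print(shares)
```

Calling `undocumented_share(df)` on the real data would give the actual proportions.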

#### 2) Numerical Variables

We then observe the description tables and histograms of numerical variables.

In [None]:
df[['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']].describe()

In [None]:
def draw_histograms(df, variables, nrows, ncols, nbins):
    fig = plt.figure()
    for i, varname in enumerate(variables):
        ax = fig.add_subplot(nrows, ncols, i + 1)
        df[varname].hist(bins = nbins, ax = ax)
        ax.set_title(varname)
    fig.tight_layout()
    plt.show()

pay = df[['PAY_0','PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']]
draw_histograms(pay, pay.columns, 2, 3, 10)

Every variable from PAY_0 to PAY_6 contains two undocumented labels, -2 and 0. According to UCI, 1, 2, 3, ..., 8 are the months of delay, and -1 indicates 'pay duly'. It is reasonable to treat 0 and -2 as "pay duly" as well, since 0 means no delay and -2 may mean payment in advance. To make this easier to interpret, we will change PAY = -2 and -1 to PAY = 0, meaning no delay, in the data cleaning part.

In [None]:
df[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']].describe()

In [None]:
# Reuse draw_histograms defined above
bills = df[['BILL_AMT1','BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']]
draw_histograms(bills, bills.columns, 2, 3, 10)

In [None]:
df[['PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']].describe()

In [None]:
# Reuse draw_histograms defined above
pay_amt = df[['PAY_AMT1','PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']]
draw_histograms(pay_amt, pay_amt.columns, 2, 3, 10)

In [None]:
df.LIMIT_BAL.describe()

In [None]:
df.LIMIT_BAL.hist()

In [None]:
df.AGE.describe()

In [None]:
df.AGE.hist()

The histograms show that all the numerical variables are skewed, which may affect our predictions.
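Skewness can also be quantified rather than read off the histograms. The sketch below uses synthetic lognormal data as a stand-in for a payment-amount column (an assumption, not the real data) and compares the sample skewness before and after a `log1p` transform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed amounts, standing in for a column such as PAY_AMT1.
amounts = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))

skew_raw = amounts.skew()            # sample skewness of the raw values
skew_log = np.log1p(amounts).skew()  # skewness after a log(1 + x) transform

print(f"raw: {skew_raw:.2f}, log1p: {skew_log:.2f}")
```

On the real data, `df[cols].skew()` gives one value per column. A log transform would only suit non-negative columns, so it would not apply directly to any column containing negative values.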

## 3. Data Cleaning


As mentioned in the data exploration (categorical variables) part, we label undocumented/unlabeled education levels as "others", i.e. change their values to 4.

In [None]:
edu_ano = (df.EDUCATION == 5) | (df.EDUCATION == 6) | (df.EDUCATION == 0)
df.loc[edu_ano, 'EDUCATION'] = 4
df.EDUCATION.value_counts()

Then we do the same thing for MARRIAGE = 0.

In [None]:
df.loc[df.MARRIAGE == 0, 'MARRIAGE'] = 3
df.MARRIAGE.value_counts()

We take a look at their distribution again.

In [None]:
plt.subplots_adjust(left = 0, bottom = 0, right = 1.2, top = 0.6, wspace = 0.5, hspace = 0.5)

plt.subplot(1, 2, 1)
df.MARRIAGE.value_counts().plot(kind = 'bar')
plt.xlabel("MARRIAGE", font)
plt.ylabel("Number", font)

plt.subplot(1, 2, 2)
df.EDUCATION.value_counts().plot(kind = 'bar')
plt.xlabel("EDUCATION", font)
plt.ylabel("Number", font)

As mentioned in the data exploration (numerical variables) part, we change PAY = -1 and -2 to 0.

In [None]:
pay_cols = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
# Map the "no delay" codes (-2, -1, 0) to a single value 0 in every PAY column
for col in pay_cols:
    df.loc[df[col].isin([-2, -1, 0]), col] = 0
pay = df[pay_cols]
draw_histograms(pay, pay.columns, 2, 3, 10)

Two attribute names in the dataset are inconvenient or confusing (the target name is long, and PAY_0 does not match the numbering of the other monthly variables), so we rename them here.

In [None]:
df = df.rename(columns = {'default.payment.next.month': 'Default', 'PAY_0': 'PAY_1'})
df.head()

## 4. Feature Engineering

Here we use the function **woebin** in package **scorecardpy** to calculate WOE and IV of each feature, and then drop some useless variables as well as bin the rest in a proper way. The results are shown in the plots below.

In [None]:
bins = sc.woebin(df, y = 'Default', 
                 min_perc_fine_bin = 0.05,     # Minimum share of observations in each initial (fine) bin
                 min_perc_coarse_bin = 0.05,   # Minimum share of observations in each final bin
                 stop_limit = 0.1,             # Stop splitting when the IV gain ratio falls below this value
                 max_num_bin = 8,              # Maximum number of bins
                 method = 'tree')
sc.woebin_plot(bins)

We set IV = 0.1 as a threshold and drop variables with IV under it.

In [None]:
df_drop = df.drop(['PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 
                   'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'SEX', 'MARRIAGE', 'AGE', 'EDUCATION'], axis = 1)
df_drop.head()
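For reference, the information value behind this threshold can be computed by hand for a feature that is already binned. This is a minimal sketch on toy data, assuming the common convention WOE = ln(share of defaulters / share of non-defaulters) per bin; the helper `information_value` is hypothetical and is not part of scorecardpy.

```python
import numpy as np
import pandas as pd

def information_value(feature, target):
    """IV of an already-binned feature against a 0/1 target (1 = default).

    Assumes every bin contains at least one row of each class;
    otherwise the logarithm is infinite.
    """
    counts = pd.crosstab(feature, target)   # rows: bins, columns: 0 and 1
    good = counts[0] / counts[0].sum()      # share of non-defaulters per bin
    bad = counts[1] / counts[1].sum()       # share of defaulters per bin
    woe = np.log(bad / good)                # weight of evidence per bin
    return float(((bad - good) * woe).sum())

# Toy data (hypothetical): bin 0 is mostly non-default, bin 1 mostly default.
toy = pd.DataFrame({
    "bin":     [0, 0, 0, 0, 1, 1, 1, 1],
    "Default": [0, 0, 0, 1, 0, 1, 1, 1],
})
iv = information_value(toy["bin"], toy["Default"])
keep = iv >= 0.1  # same threshold as used above
print(round(iv, 4), keep)
```

Here both bins contribute 0.5 * ln(3), so the toy feature has IV = ln(3) and would be kept.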

Then we cut the remaining variables into the bins given by the graphs.

In [None]:
def PAY_1_bin(PAY_1):
    if PAY_1 < 1:
        return 0
    elif PAY_1 == 1:
        return 1
    elif PAY_1 >= 2:
        return 2

df_drop.PAY_1 = df_drop.PAY_1.apply(PAY_1_bin)

In [None]:
def PAY_2_bin(PAY_2):
    if PAY_2 < 2:
        return 0
    elif PAY_2 >= 2:
        return 1

df_drop.PAY_2 = df_drop.PAY_2.apply(PAY_2_bin)

In [None]:
def PAY_3_bin(PAY_3):
    if PAY_3 < 2:
        return 0
    elif PAY_3 >= 2:
        return 1

df_drop.PAY_3 = df_drop.PAY_3.apply(PAY_3_bin)

In [None]:
def PAY_4_bin(PAY_4):
    if PAY_4 < 1:
        return 0
    elif PAY_4 >= 1:
        return 1

df_drop.PAY_4 = df_drop.PAY_4.apply(PAY_4_bin)

In [None]:
def PAY_5_bin(PAY_5):
    if PAY_5 < 2:
        return 0
    elif PAY_5 >= 2:
        return 1

df_drop.PAY_5 = df_drop.PAY_5.apply(PAY_5_bin)

In [None]:
def PAY_6_bin(PAY_6):
    if PAY_6 < 2:
        return 0
    elif PAY_6 >= 2:
        return 1
df_drop.PAY_6 = df_drop.PAY_6.apply(PAY_6_bin)

In [None]:
def PAY_AMT1_bin(PAY_AMT1):
    if PAY_AMT1 < 1000:
        return 0
    elif (PAY_AMT1 >= 1000) & (PAY_AMT1 < 4000):
        return 1
    elif (PAY_AMT1 >= 4000) & (PAY_AMT1 < 18000):
        return 2
    elif PAY_AMT1 >= 18000:
        return 3

df_drop.PAY_AMT1 = df_drop.PAY_AMT1.apply(PAY_AMT1_bin)

In [None]:
def PAY_AMT2_bin(PAY_AMT2):
    if PAY_AMT2 < 1000:
        return 0
    elif (PAY_AMT2 >= 1000) & (PAY_AMT2 < 2000):
        return 1
    elif (PAY_AMT2 >= 2000) & (PAY_AMT2 < 5000):
        return 2
    elif (PAY_AMT2 >= 5000) & (PAY_AMT2 < 16000):
        return 3
    elif PAY_AMT2 >= 16000:
        return 4

df_drop.PAY_AMT2 = df_drop.PAY_AMT2.apply(PAY_AMT2_bin)

In [None]:
def PAY_AMT3_bin(PAY_AMT3):
    if PAY_AMT3 < 1000:
        return 0
    elif (PAY_AMT3 >= 1000) & (PAY_AMT3 < 3000):
        return 1
    elif (PAY_AMT3 >= 3000) & (PAY_AMT3 < 5000):
        return 2
    elif (PAY_AMT3 >= 5000) & (PAY_AMT3 < 17000):
        return 3
    elif PAY_AMT3 >= 17000:
        return 4

df_drop.PAY_AMT3 = df_drop.PAY_AMT3.apply(PAY_AMT3_bin)

In [None]:
def LIMIT_BAL_bin(LIMIT_BAL):
    if LIMIT_BAL < 50000:
        return 0
    elif (LIMIT_BAL >= 50000) & (LIMIT_BAL < 150000):
        return 1
    elif (LIMIT_BAL >= 150000) & (LIMIT_BAL < 250000):
        return 2
    elif LIMIT_BAL >= 250000:
        return 3
    
df_drop.LIMIT_BAL = df_drop.LIMIT_BAL.apply(LIMIT_BAL_bin)
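The same left-closed bins can also be expressed with `pd.cut`. The sketch below is an alternative formulation, not the method used above; it reuses the LIMIT_BAL cut points and a few hypothetical sample values to show that the edge cases land in the same bins.

```python
import numpy as np
import pandas as pd

# Edges copied from LIMIT_BAL_bin: [-inf, 50000), [50000, 150000), ...
edges = [-np.inf, 50000, 150000, 250000, np.inf]
limit = pd.Series([10000, 50000, 149999, 150000, 250000, 800000])

# right=False makes each interval closed on the left, matching the >= / < tests.
# labels=False returns the integer bin index instead of an interval label.
binned = pd.cut(limit, bins=edges, right=False, labels=False)
print(binned.tolist())
```

One list of edges per variable would replace each of the hand-written functions.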

## 5. Model Training

We split the dataset into a training set and a test set, stratified on the "Default" label so that both sets keep the same class distribution.

In [None]:
X = df_drop.drop('Default', axis = 1)
y = df_drop.Default 
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.3, random_state = 0, stratify = y)
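The effect of `stratify` can be checked by comparing the positive rate in both parts. A small sketch on synthetic labels with a 20% positive rate (hypothetical data, same split settings as above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 200 positives, 800 negatives.
y_toy = np.array([1] * 200 + [0] * 800)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Both rates should match the overall 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

The same check on the real split is `ytrain.mean()` versus `ytest.mean()`.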

* ## Logistic Regression

We choose the best hyperparameters by cross-validation, then fit the model and make predictions.

In [None]:
# Compute the correlation matrix
corr = Xtrain.corr()
corr = np.abs(corr)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype = bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize = (11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap = True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask = mask, cmap = cmap, vmax = 1, center = 0,
            square = True, linewidths = 0.5, cbar_kws = {"shrink": .5})

In [None]:
corr

In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve, auc

In [None]:
# Calculate performance measures from the confusion matrix
# TP: true positives
# TN: true negatives
# FP: false positives
# FN: false negatives
def compute_performance(yhat, y):
    # First, get tp, tn, fp, fn
    tn, fp, fn, tp = confusion_matrix(y,yhat).ravel()

    print(f"tp: {tp} tn: {tn} fp: {fp} fn: {fn}")
    
    # Accuracy
    acc = (tp + tn) / (tp + tn + fp + fn)
    
    # Precision
    # "Of the ones I labeled +, how many are actually +?"
    precision = tp / (tp + fp)
    
    # Recall
    # "Of all the + in the data, how many do I correctly label?"
    recall = tp / (tp + fn)    
    
    # Sensitivity
    # "Of all the + in the data, how many do I correctly label?"
    sensitivity = recall
    
    # Specificity
    # "Of all the - in the data, how many do I correctly label?"
    specificity = tn / (fp + tn)
    
    # Print results
    
    print("Accuracy:",round(acc,3),"Recall:",round(recall,3),"Precision:",round(precision,3),
          "Sensitivity:",round(sensitivity,3),"Specificity:",round(specificity,3))

In [None]:
from sklearn.linear_model import LogisticRegression
# Unpenalized baseline. On scikit-learn >= 1.2, pass penalty = None instead of 'none'.
LOGREG = LogisticRegression(solver = 'lbfgs',penalty = 'none',max_iter = 10000)
lr_all = LOGREG.fit(Xtrain, ytrain)
lr_all.coef_

In [None]:
ytest_hat_all = lr_all.predict(Xtest)
probs_test = lr_all.predict_proba(Xtest)
compute_performance(ytest_hat_all, ytest)
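The reported measures can be verified by hand on a small example. The sketch below uses ten hypothetical labels and predictions and recomputes the same quantities from the confusion matrix.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 8 of 10 correct
precision = tp / (tp + fp)                  # 3 of 4 predicted positives
recall = tp / (tp + fn)                     # 3 of 4 actual positives
specificity = tn / (tn + fp)                # 5 of 6 actual negatives

print(accuracy, precision, recall, specificity)
```

With imbalanced classes such as default versus non-default, accuracy alone can look good while recall stays low, which is why the function reports all of them.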

In [None]:
fpr, tpr, thresholds = roc_curve(ytest, 
                                 probs_test[:,1])
ax = sns.lineplot(x = fpr, y = tpr)
ax.set(xlabel = "FPR",ylabel = "TPR")
auc(fpr,tpr)
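The area computed from `roc_curve` and `auc` is the same number that `roc_auc_score` returns directly. A short sketch on four hypothetical scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical predicted probabilities

fpr, tpr, _ = roc_curve(y_true, scores)
area_trapezoid = auc(fpr, tpr)               # trapezoidal area under the ROC curve
area_direct = roc_auc_score(y_true, scores)  # same quantity in one call

print(area_trapezoid, area_direct)
```

Three of the four (negative, positive) pairs are ranked correctly here, which gives an AUC of 0.75.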

In [None]:
from sklearn.linear_model import LogisticRegressionCV

logregCV = LogisticRegressionCV(penalty = 'l1', # Type of penalization l1 = lasso, l2 = ridge
                                     Cs = 10,
                                     tol = 0.0001, # Tolerance for parameters
                                     cv = 3,
                                     fit_intercept = True, # Use constant?
                                     class_weight = 'balanced', # Weights, see below
                                     random_state = 0, # Random seed
                                     max_iter = 1000, # Maximum iterations
                                     verbose = 1, # Show process. 1 is yes.
                                     solver = 'liblinear',
                                     n_jobs = 8,
                                     # warm_start=False, # Train anew or start from previous weights. For repeated training.
                                     refit = True
                                    )

In [None]:
logregCV.fit(X = Xtrain, # Training features
           y = ytrain # The target
          )

The best value of C (the inverse of the regularization strength) found by cross-validation is 0.00077426.

In [None]:
logregCV.C_
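The value 0.00077426 can be traced back to the search grid. A minimal sketch, assuming scikit-learn's documented behaviour that an integer `Cs` produces that many values on a logarithmic scale between 1e-4 and 1e4:

```python
import numpy as np

# Cs = 10 corresponds to ten log-spaced candidates for C.
grid = np.logspace(-4, 4, 10)

print(grid[:3])
second_smallest = grid[1]
```

Under this assumption the selected C is the second-smallest candidate, i.e. a strong L1 penalty.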

In [None]:
logreg = LogisticRegression(penalty = 'l1', # Type of penalization l1 = lasso, l2 = ridge
                                     tol = 0.0001, # Tolerance for parameters
                                     C = 0.00077426, # Inverse of regularization strength, taken from logregCV.C_
                                     fit_intercept = True, 
                                     class_weight = 'balanced', # Weights, see below
                                     random_state = 0, # Random seed
                                     max_iter = 1000, # Maximum iterations
                                     verbose = 1, 
                                     solver = 'liblinear',
                                     warm_start = False 
                                    )

In [None]:
logreg.fit(X = Xtrain, # Training features
           y = ytrain # The target
          )

In [None]:
coef_df = pd.concat([pd.DataFrame({'column': Xtrain.columns}), 
                    pd.DataFrame(np.transpose(logreg.coef_))],
                    axis = 1
                   )

coef_df

In [None]:
logreg.intercept_

In [None]:
pred_class_test = logreg.predict(Xtest)
probs_test = logreg.predict_proba(Xtest)

In [None]:
compute_performance(pred_class_test, ytest)

In [None]:
confusion_matrix_log = confusion_matrix(ytest, pred_class_test)
 
# Turn matrix to percentages
confusion_matrix_log = confusion_matrix_log.astype('float') / confusion_matrix_log.sum(axis = 1)[:, np.newaxis]
 
# Turn to dataframe
df_cm = pd.DataFrame(
        confusion_matrix_log, index = ['Not Default', 'Default'], columns = ['Not Default', 'Default'], 
)
 
# Parameters of the image
figsize = (5,5)
fontsize = 10
 
# Create image
fig = plt.figure(figsize = figsize)
heatmap = sns.heatmap(df_cm, annot = True, fmt = '.2f',linecolor = "Darkblue", cmap = "Blues",
            yticklabels = ['Not Default', 'Default'],)
 
# Make it nicer
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation = 0, 
                             ha = 'right', fontsize = fontsize)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation = 45,
                             ha = 'right', fontsize = fontsize)
 
# Add labels
plt.ylabel('True label')
plt.xlabel('Predicted label')
 
# Plot!
plt.show()

In [None]:
from sklearn.metrics import confusion_matrix, roc_curve, precision_recall_curve, auc
fpr, tpr, thresholds = roc_curve(ytest, 
                                 probs_test[:,1])
ax = sns.lineplot(x = fpr, y = tpr)
ax.set(xlabel = "FPR",ylabel = "TPR")
auc(fpr,tpr)

* ## XGBoost

We choose the best hyperparameters by grid search, then fit the model and make predictions.

In [None]:
ytrain.value_counts()

In [None]:
16355/4645
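This ratio of negative to positive training samples is the value passed as `scale_pos_weight` below. It can be derived from the label counts instead of typed by hand; the sketch rebuilds a label series with the same counts (synthetic, for illustration only).

```python
import pandas as pd

# Synthetic labels with the class counts shown above.
y_toy = pd.Series([0] * 16355 + [1] * 4645)

counts = y_toy.value_counts()
# Negatives divided by positives: the usual heuristic for scale_pos_weight.
scale_pos_weight = counts.loc[0] / counts.loc[1]

print(round(scale_pos_weight, 5))
```

On the real data, `ytrain.value_counts()` would take the place of the synthetic series.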

In [None]:
from xgboost import XGBClassifier
#Define the classifier.
XGB = XGBClassifier(max_depth = 3,                 # Depth of each tree
                            learning_rate = 0.1,            # How much to shrink error in each subsequent training. Trade-off with no. estimators.
                            n_estimators = 100,             # How many trees to use, the more the better, but decrease learning rate if many used.
                            verbosity = 1,                  # If to show more errors or not.
                            objective = 'binary:logistic',  # Type of target variable.
                            booster = 'gbtree',             # What to boost. Trees in this case.
                            n_jobs = 8,                     # Parallel jobs to run. Set your processor number.
                            gamma = 0.001,                  # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)
                            subsample = 0.632,              # Subsample ratio. Can set lower
                            colsample_bytree = 1,           # Subsample ratio of columns when constructing each tree.
                            colsample_bylevel = 1,          # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.
                            colsample_bynode = 1,           # Subsample ratio of columns when constructing each split.
                            reg_alpha = 1,                  # L1 regularization term on weights. alpha = 1, lambda = 0 is LASSO.
                            reg_lambda = 0,                 # L2 regularization term on weights.
                            scale_pos_weight = 3.52099,           # Balancing of positive and negative weights.
                            base_score = 0.5,               # Global bias. Set to average of the target rate.
                            random_state = 0,        # Seed
                            missing = None                  # How are nulls encoded?
                            )

In [None]:
# Define the parameters. Play with this grid!
param_grid = dict({'n_estimators': [100, 150, 200],
                   'max_depth': [2, 3, 4],
                 'learning_rate' : [0.01, 0.05, 0.1, 0.15]
                  })

In [None]:
from sklearn.model_selection import GridSearchCV

# Define grid search object.
GridXGB = GridSearchCV(XGB,        # Original XGB. 
                       param_grid,          # Parameter grid
                       cv = 3,              # Number of cross-validation folds.  
                       scoring = 'recall', # How to rank outputs.
                       n_jobs = 8,          # Parallel jobs. -1 is "all you have"
                       refit = False,       # If refit at the end with the best. We'll do it manually.
                       verbose = 1          # If to show what it is doing.
                      )

In [None]:
GridXGB.fit(Xtrain,ytrain)
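The cost of this search follows from the grid size. A short sketch that counts the candidates with `ParameterGrid` and multiplies by the number of folds:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 150, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.01, 0.05, 0.1, 0.15],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 4 combinations
n_fits = n_candidates * 3                      # cv = 3 folds per candidate

print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the number of fits accordingly, so the grid is best kept small.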

In [None]:
GridXGB.best_params_.get('max_depth')

In [None]:
GridXGB.best_params_.get('learning_rate')

In [None]:
GridXGB.best_params_.get('n_estimators')

Best parameters: n_estimators = 200, max_depth = 3, learning_rate = 0.1

In [None]:
# Create XGB with best parameters.
XGB = XGBClassifier(max_depth = GridXGB.best_params_.get('max_depth'), # Depth of each tree
                            learning_rate = GridXGB.best_params_.get('learning_rate'), # How much to shrink error in each subsequent training. Trade-off with no. estimators.
                            n_estimators = GridXGB.best_params_.get('n_estimators'), # How many trees to use, the more the better, but decrease learning rate if many used.
                            verbosity = 1,                  # If to show more errors or not.
                            objective = 'binary:logistic',  # Type of target variable.
                            booster = 'gbtree',             # What to boost. Trees in this case.
                            n_jobs = 8,                     # Parallel jobs to run. Set your processor number.
                            gamma = 0.001,                  # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)
                            subsample = 0.632,              # Subsample ratio. Can set lower
                            colsample_bytree = 1,           # Subsample ratio of columns when constructing each tree.
                            colsample_bylevel = 1,          # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.
                            colsample_bynode = 1,           # Subsample ratio of columns when constructing each split.
                            reg_alpha = 1,                  # L1 regularization term on weights. alpha = 1, lambda = 0 is LASSO.
                            reg_lambda = 0,                 # L2 regularization term on weights.
                            scale_pos_weight = 3.52099,     # Balancing of positive and negative weights.
                            base_score = 0.5,               # Global bias. Set to average of the target rate.
                            random_state = 0,        # Seed
                            missing = None                  # How are nulls encoded?
                            )

In [None]:
XGB.fit(Xtrain, ytrain)

In [None]:
# Plot variable importance
importances = XGB.feature_importances_
indices = np.argsort(importances)[::-1] 

f, ax = plt.subplots(figsize = (3, 8))
plt.title("Variable Importance")
sns.set_color_codes("pastel")
sns.barplot(y = [Xtrain.columns[i] for i in indices], x = importances[indices], 
            label="Total", color = "b")
ax.set(ylabel = "Variable",
       xlabel = "Variable Importance")
sns.despine(left = True, bottom = True)

In [None]:
XGBClassTest = XGB.predict(Xtest)
xg_probs_test = XGB.predict_proba(Xtest)
compute_performance(XGBClassTest, ytest)

In [None]:
confusion_matrix_xg = confusion_matrix(ytest, XGBClassTest)
 
# Turn matrix to percentages
confusion_matrix_xg = confusion_matrix_xg.astype('float') / confusion_matrix_xg.sum(axis=1)[:, np.newaxis]
 
# Turn to dataframe
df_cm = pd.DataFrame(
        confusion_matrix_xg, index=['Not Default', 'Default'], columns=['Not Default', 'Default'], 
)
 
# Parameters of the image
figsize = (5,5)
fontsize=10
 
# Create image
fig = plt.figure(figsize=figsize)
heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f',linecolor="Darkblue", cmap="Blues",
            yticklabels=['Not Default', 'Default'],)
 
# Make it nicer
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, 
                             ha='right', fontsize=fontsize)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,
                             ha='right', fontsize=fontsize)
 
# Add labels
plt.ylabel('True label')
plt.xlabel('Predicted label')
 
# Plot!
plt.show()

In [None]:
fpr, tpr, thresholds = roc_curve(ytest, 
                                 xg_probs_test[:,1])
ax=sns.lineplot(x=fpr,y=tpr)
ax.set(xlabel="FPR",ylabel="TPR")
auc(fpr,tpr)

* ## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
RF_default = RandomForestClassifier(n_estimators=210, # Number of trees to train
                       criterion = 'entropy', # How to train the trees. Also supports gini.
                       max_depth = None, # Max depth of the trees. Not necessary to change.
                       min_samples_split = 2, # Minimum samples to create a split.
                       min_samples_leaf = 0.0001, # Minimum samples in a leaf. Accepts fractions for %. This is 0.01% of sample.
                       min_weight_fraction_leaf = 0.0, # Same as above, but uses the class weights.
                       max_features = 'sqrt', # Maximum number of features per split (not tree!). sqrt(vars); replaces the deprecated 'auto'.
                       max_leaf_nodes = None, # Maximum number of nodes.
                       min_impurity_decrease = 0.0001, # Minimum impurity decrease. This is 10^-4.
                       bootstrap = False, # If sample with repetition. For large samples (>100.000) set to false.
                       oob_score = False,  # If report accuracy with non-selected cases.
                       n_jobs = 8, # Parallel processing. Set to the number of cores you have. Watch your RAM!!
                       random_state = 250886749, # Seed
                       verbose = 1, # If to give info during training. Set to 0 for silent training.
                       warm_start = False, # If train over previously trained tree.
                       class_weight = 'balanced' # Balance the classes.
                    )
RF_fit = RF_default.fit(Xtrain, ytrain)
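A fractional `min_samples_leaf` is converted into a sample count. A minimal sketch, assuming scikit-learn's documented rule `ceil(min_samples_leaf * n_samples)` and a training set of 16355 + 4645 rows (the class counts shown in the XGBoost part):

```python
import math

n_train = 16355 + 4645        # training rows, assumed from the class counts above
min_samples_leaf = 0.0001     # fraction used in the forest above

# Fraction of the sample converted to a whole number of rows per leaf.
min_leaf_rows = math.ceil(min_samples_leaf * n_train)

print(min_leaf_rows)
```

Under these assumptions each leaf must hold at least 3 training rows, so the setting constrains tree growth only slightly.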

In [None]:
RF_predict = RF_fit.predict(Xtest)
RF_predict_prob = RF_fit.predict_proba(Xtest)

In [None]:
compute_performance(RF_predict, ytest)

In [None]:
confusion_matrix_rf = confusion_matrix(ytest, RF_predict)
 
# Turn matrix to percentages
confusion_matrix_rf = confusion_matrix_rf.astype('float') / confusion_matrix_rf.sum(axis=1)[:, np.newaxis]
 
# Turn to dataframe
df_cm = pd.DataFrame(
        confusion_matrix_rf, index=['Not Default', 'Default'], columns=['Not Default', 'Default'], 
)
 
# Parameters of the image
figsize = (5,5)
fontsize = 10
 
# Create image
fig = plt.figure(figsize = figsize)
heatmap = sns.heatmap(df_cm, annot = True, fmt = '.2f',linecolor = "Darkblue", cmap="Blues",
            yticklabels = ['Not Default', 'Default'],)
 
# Make it nicer
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation = 0, 
                             ha = 'right', fontsize = fontsize)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation = 45,
                             ha = 'right', fontsize = fontsize)
 
# Add labels
plt.ylabel('True label')
plt.xlabel('Predicted label')
 
# Plot!
plt.show()

In [None]:
fpr, tpr, thresholds = roc_curve(ytest, 
                                 RF_predict_prob[:,1])
ax = sns.lineplot(x = fpr, y = tpr)
ax.set(xlabel = "FPR",ylabel = "TPR")
auc(fpr,tpr)