# Rare Event Classification (Ensemble Modelling incorporating Random Undersampling)


**About the Dataset:**

The data consists of 10,500 credit applications, each classified as good or bad credit. However, there are only 500 bad credit applications. Since this is less than 5% of the data, classifying applicants as bad credit is referred to as a rare event problem. This is also known as anomaly dete ction in many applications.


**Approach:**

1. The best ratio is discovered by trying ratios between 50:50 to 85:15. 
2. Build an ensemble model based on the optimum ratio selected. 

This is done my creating ensemble of trees using the optimum ratio, fitting a model to each, making classification probability predictions for each and then averaging those to get predicted classification probabilities. From that we can calculate the loss totaled over all the trees.

The base model is a decision tree with a minimum leaf size is 5, and the minimum split size is 5. The optimum depth for this model is determined by optimizing the F1-score using 10-fold cross-validation.

In [1]:
#Importing Required Libraries

# Install using Conda:
# conda install -c glemaitre imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
import pandas as pd
import numpy as np
from AdvancedAnalytics import ReplaceImputeEncode
# classes for decision tree
from AdvancedAnalytics import DecisionTree, calculate
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import export_graphviz
from pydotplus.graphviz import graph_from_dot_data
import graphviz
import math
import warnings 
warnings.filterwarnings("ignore")

In [2]:
#Reading Data
df = pd.read_excel("CreditData_RareEvent.xlsx")
df.head()

Unnamed: 0,good_bad,age,amount,duration,checking,coapp,depends,employed,existcr,foreign,history,housing,installp,job,marital,other,property,resident,savings,telephon
0,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
1,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
2,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
3,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
4,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2


In [3]:
# Minority classes are 500
df['good_bad'].value_counts()

good    10000
bad       500
Name: good_bad, dtype: int64

**Loss Calculation Function**

The following function can be used to calculate loss and the confusion matrix for our models. This is useful since the loss calculations are a function of the Amount of the loan application. If the case is correctly classified by the model, the loss is zero. Otherwise the loss is a function of the loan amount, which is different for false positives and negatives.

Loss = Amount, if the case is a false positive, or
Loss = 0.15 x Amount, if the case is a false negative.

False positives are loans that were classified as good but the customer later defaults on the loan. In that case, the entire amount of the loan is treated as the loss. In practice, this amount might be adjusted by the actual loss for the load which would be the loan amount minus payments, plus some overhead costs. Here, we are just using the unadjusted loan amount.

False negatives are applications that were classified as bad but should have been classified as good. That is the customer would have paid off the loan in a timely fashion, but the model is saying they should be denied a loan.

In this function, the numpy array y is the actual classification for each case. It is encoded using zeros for the bad classifications and one for the good classifications because alphabetically bad occurs before good. If instead of bad and good, the data used yes and no, respectively, then the bad classifications would have been coded as ones.

In [4]:
# Function for calculating loss and confusion matrix
def loss_cal(y, y_predict, fp_cost, fn_cost, display=True):
    loss = [0, 0] #False Neg Cost, False Pos Cost
    conf_mat = [0, 0, 0, 0] #tn, fp, fn, tp
    for j in range(len(y)):
        if y[j]==0:
            if y_predict[j]==0:
                conf_mat[0] += 1 #True Negative
            else:
                conf_mat[1] += 1 #False Positive
                loss[1] += fp_cost[j]
        else:
            if y_predict[j]==1:
                conf_mat[3] += 1 #True Positive
            else:
                conf_mat[2] += 1 #False Negative
                loss[0] += fn_cost[j]
    if display:
        fn_loss = loss[0]
        fp_loss = loss[1]
        total_loss = fn_loss + fp_loss
        misc = conf_mat[1] + conf_mat[2]
        misc = misc/len(y)
        print("{:.<23s}{:10.4f}".format("Misclassification Rate", misc))
        print("{:.<23s}{:10.0f}".format("False Negative Cost", fn_loss))
        print("{:.<23s}{:10.0f}".format("False Positive Cost", fp_loss))
        print("{:.<23s}{:10.0f}".format("Total Loss", total_loss))
    return loss, conf_mat

In [5]:
# Attribute Map for CreditData_RareEvent.xlsx, N=10,500
attribute_map = { \
'age':['I',(19,120)], \
'amount': ['I',(0,20000)], \
'checking': ['N',(1,2,3,4)], \
'coapp': ['N',(1,2,3)], \
'depends': ['B',(1,2)], \
'duration': ['I',(1,72)], \
'employed': ['N',(1,2,3,4,5)], \
'existcr': ['N',(1,2,3,4)], \
'foreign': ['B',(1,2)], \
'good_bad': ['B',('bad','good')], \
'history': ['N',(0,1,2,3,4)], \
'housing':['N',(1,2,3)], \
'installp': ['N',(1,2,3,4)], \
'job': ['N',(1,2,3,4)], \
'marital': ['N',(1,2,3,4)], \
'other': ['N',(1,2,3)], \
'property': ['N',(1,2,3,4)], \
'resident': ['N',(1,2,3,4)], \
'savings': ['N',(1,2,3,4,5)], \
'telephon': ['B',(1,2)] \
}

We encode the categorical attributes using one-hot encoding. Since this is a decision tree model, the last one-hot column is not dropped. The interval attributes are scaled using z-score scaling. This is not required, but can improve the speed of the fitting the decision trees.

In these data, the target attribute good_bad is entered as good and bad rather than one and zero. The ReplaceImputeEncode method will encode the character sting version of good_bad into zeros and ones, but zero is used to encode bad and one is used to encode good. As a result, a false positive refers to classifying the target as 1, or equivalently as good.

In [6]:
#Data Preprocessing

# Encode for Logistic Regression, drop last one-hot column
rie = ReplaceImputeEncode(data_map=attribute_map, nominal_encoding='one-hot', interval_scale = 'std', \
                          drop=False, display=False)
encoded_df = rie.fit_transform(df)
# Create X and y, numpy arrays
# The target is not scaled or imputed, but
# the target coding is: bad=0 and good=1
X = np.asarray(encoded_df.drop('good_bad',axis=1))
y = np.asarray(encoded_df['good_bad'])

**Calculate Potential Loss for Each Case**

The potential loss is the false positive and false negative loss calculated assuming the case might be classifed as false positive or negative. In most cases the model will correctly classify the observation and the actual loss will be zero. The potential loss is used to evaluate models as they are developed. In this example the potential losses are calculated using the following formulas, one for false positive and another for false negatives:

- False Positive Cost = Amount, and False Negative = 0.15 x Amount.

In [7]:
# Best model is one that minimizes the loss
# Setup false positive and false negative costs for each transaction
fp_cost = np.array(df['amount'])
fn_cost = np.array(0.1*df['amount'])

**Calculate Total Loss without RUS**

As a benchmark, it is good to know the results from fitting the entire dataset without using RUS. The following code evaluates fitting the entire dataset using a Decision Tree built using 10-fold cross validation to determine the optimum depth. In this case depths between 2 and 20 are examined. The depth that maximizes the F1-score is selected as the optimum depth.

In this cross validation, the optimum depth is 2 with F1 = 97.6%. The misclassification rate if 4.7%, which is almost equal to the percent of bad credit cases in these data. The accuracy is high, 95%. However, notice that the bad credit applicants are being ignored. Most of the 500 applicants with bad credit are being classified as good. The model is classifying all but 5 applicants as good credit risks. From the perspective of the quality metrics this is an excellent model, but from the perspective of a banker who needs to reduce loss from bad loans, this is a terrible model. The estimated total loss from this model is $2,086,650, and it is all from classifying applicants with bad credit as good.

In [8]:
search_depths = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
best_d = 0
max_f = 0
for d in search_depths:
    dtc = DecisionTreeClassifier(max_depth=d, min_samples_leaf=5, \
    min_samples_split=5, criterion='gini')
    dtc_10 = cross_val_score(dtc, X, y, scoring='f1', cv=10)
    mean = dtc_10.mean()
    if mean > max_f:
        max_f = mean
        best_d = d
        best_dtc = dtc
        print("\nDecision Tree constructed using Depth = ",best_d, "and all data.")
best_dtc.fit(X, y)
loss,conf_mat = calculate.binary_loss(y,best_dtc.predict(X),\
fp_cost,fn_cost)
DecisionTree.display_binary_metrics(best_dtc, X, y)


Decision Tree constructed using Depth =  2 and all data.
Misclassification Rate.    0.0471
False Negative Loss....         0
False Positive Loss....   2086650
Total Loss.............   2086650

Model Metrics
Observations...............     10500
Features...................        58
Maximum Tree Depth.........         2
Minimum Leaf Size..........         5
Minimum split Size.........         5
Mean Absolute Error........    0.0860
Avg Squared Error..........    0.0430
Accuracy...................    0.9529
Precision..................    0.9528
Recall (Sensitivity).......    1.0000
F1-Score...................    0.9758
MISC (Misclassification)...      4.7%
     class 0...............     99.0%
     class 1...............      0.0%


     Confusion
       Matrix     Class 0   Class 1  
Class 0.....         5       495
Class 1.....         0     10000



In this cross validation, the optimum depth is 2 with F1 = 97.6%. The misclassification rate if 4.7%, which is almost equal to the percent of bad credit cases in these data. The accuracy is high, 95%. However, notice that the bad credit applicants are being ignored. Most of the 500 applicants with bad credit are being classified as good. The model is classifying all but 5 applicants as good credit risks.

From the perspective of the quality metrics this is an excellent model, but from the perspective of a banker who needs to reduce loss from bad loans, this is a terrible model. The estimated total loss from this model is $2,086,650, and it is all from classifying applicants with bad credit as good.

**Random Undersampling Technique**

Building a model that decreases this loss involves using RUS. The first step is to identify the best mixture of majority and minority event applicants. This example considers mixtures of 50:50, 60:40, 70:30, 75:25, 80:20 and 85:15.

In Python, instead of passing these ratios into the RUS routine, it is necessary to pass the actual number of observations represented by these mixtures. For example, the 50:50 ratio is a sample constructed using all 500 bad applicants and a random sample of an additional 500 good applications. The total sample is 1,000 applications evenly divided between bad and good.

A mixture of 80:20 would have 80% good and 20% bad. Since each RUS sample is constructed using all of the minority data, all 500 of the bad applicants, the number of randomly selected good applicants will need to be 4 times larger. That is, the number of randomly selected good applications to achieve an 80:20 ratio of good to bad cases is calculated by: (0.8/0.2) x 500 = 2,000.

After making these calculations for each ratio, we create list containing the random seeds we would like to use, rand_val, a list of ratios, ratio, and a tuple, rus_ratio, containing dictionaries that describe the number of observations for each ratio.

In [9]:
# Setup 10 random number seeds for use in creating random samples
np.random.seed(12345)
max_seed = 2**20-1
rand_val = np.random.randint(1, high=max_seed, size=10)

# Use majority:minority ratios of 50:50, 60:40, 70:30, 75:25, 80:20, 85:15
ratio = [ '50:50', '60:40', '70:30', '75:25', '80:20', '85:15' ]

# Dictionaries contains number of minority and majority
# n_majority = ratio x n_minority
rus_ratio = ({0:500, 1:500}, {0:500, 1:750}, {0:500, 1:1167}, {0:500, 1:1500}, {0:500, 1:2000}, {0:500, 1:2833})

In [10]:
# Use a decision tree as the base model. 
# Build upon the ‘gini’ split criterion and optimize the depth for values between 2 and 20.

depth_list = list(range(2,21))

min_loss = 1e64
best_ratio = 0
for k in range(len(rus_ratio)):
    print("\nDecision Tree Classifier Model using " + ratio[k] + " RUS")
    best_d = 0
    min_loss_c = 1e64
    for j in range(len(depth_list)):
        d = depth_list[j]
        fn_loss = np.zeros(len(rand_val))
        fp_loss = np.zeros(len(rand_val))
        misc = np.zeros(len(rand_val))
        for i in range(len(rand_val)):
            rus = RandomUnderSampler(ratio=rus_ratio[k], random_state=rand_val[i], \
                                     return_indices=False, replacement=False)
            X_rus, y_rus = rus.fit_sample(X, y)
            
            dtc = DecisionTreeClassifier(criterion='gini', max_depth=d, min_samples_leaf=5, min_samples_split=5)
            dtc = dtc.fit(X_rus, y_rus)
            
            loss, conf_mat = calculate.binary_loss(y, dtc.predict(X), fp_cost, fn_cost, display=False)
            
            fn_loss[i] = loss[0]
            fp_loss[i] = loss[1]
            misc[i] = (conf_mat[1] + conf_mat[2])/y.shape[0]
        avg_misc = np.average(misc)
        t_loss = fp_loss+fn_loss
        avg_loss = np.average(t_loss)
        if avg_loss < min_loss_c:
            min_loss_c = avg_loss
            se_loss_c = np.std(t_loss)/math.sqrt(len(rand_val))
            best_d = d
            misc_c = avg_misc
            fn_avg_loss = np.average(fn_loss)
            fp_avg_loss = np.average(fp_loss)
    if min_loss_c < min_loss:
        min_loss = min_loss_c
        se_loss = se_loss_c
        best_ratio = k
        best_reg = best_d
    print("{:.<23s}{:12.2E}".format("Best depth", best_d))
    print("{:.<23s}{:12.4f}".format("Misclassification Rate",misc_c))
    print("{:.<23s} ${:10,.0f}".format("False Negative Loss",fn_avg_loss))
    print("{:.<23s} ${:10,.0f}".format("False Positive Loss",fp_avg_loss))
    print("{:.<23s} ${:10,.0f}{:5s}${:<,.0f}".format("Total Loss", min_loss_c, " +/- ", se_loss_c))
print("")
print("{:.<23s}{:>12s}".format("Best RUS Ratio", ratio[best_ratio]))
print("{:.<23s}{:12.2E}".format("Best C", best_reg))
print("{:.<23s} ${:10,.0f}{:5s}${:<,.0f}".format("Lowest Loss", \
min_loss, " +/-", se_loss))



Decision Tree Classifier Model using 50:50 RUS
Best depth.............    1.50E+01
Misclassification Rate.      0.2097
False Negative Loss.... $   741,810
False Positive Loss.... $   121,370
Total Loss............. $   863,179 +/- $35,351

Decision Tree Classifier Model using 60:40 RUS
Best depth.............    1.80E+01
Misclassification Rate.      0.1567
False Negative Loss.... $   544,674
False Positive Loss.... $   153,779
Total Loss............. $   698,453 +/- $19,577

Decision Tree Classifier Model using 70:30 RUS
Best depth.............    1.80E+01
Misclassification Rate.      0.0970
False Negative Loss.... $   326,478
False Positive Loss.... $   182,219
Total Loss............. $   508,697 +/- $13,300

Decision Tree Classifier Model using 75:25 RUS
Best depth.............    2.00E+01
Misclassification Rate.      0.0790
False Negative Loss.... $   267,626
False Positive Loss.... $   179,723
Total Loss............. $   447,349 +/- $10,244

Decision Tree Classifier Model using 80

In this search, the best ratio is defined as the ratio with the lowest calculated loss, after optimizing the tree depth. The best ratio is 80:15 with a depth of 20 levels. With that configuration, the loss calculated over the entire dataset is estimated to be $368,145.

This is significantly lower than the loss calculated for an optimized tree fitted to the entire dataset. In that case the tree classified all by 5 cases as good credit. The quality metrics were high, but the calculated loss was a little over $2 million. The loss using the base model is over 4 times the loss projected using RUS.

Also of interest is the misclassification error. The error for the base model was 4.7%. The same error for the RUS model is projected to be 3.6%

**Ensemble Modelling-RUS** 

From the first step, it is clear that the best ratio identified is 85:15 with a tree with 19 levels. The final step in RUS modeling is to build an ensemble model using the best ratio, and to estimate of the total loss from the ensemble model.

An ensemble model, in this case, is the average of several models, each developed using the same best ratio, but with different random samples. Each 85:15 RUS sample uses all 500 cases from the bad applicants and then an additional 4,500 cases randomly selected from the remaining 10,000 good applicants. Since these are randomly selected, each sample produces different estimates of the Decision Tree.

In this example, 100 separate samples are used to create 100 trees. Each produces different estimates of the probability that the applicant is a bad credit risk. The ensemble model averages the 100 estimates for each of the 10,500 cases in the data. These average probabilities are used to classify the data and finally to evaluate the misclassification rate and total loss.

In [11]:
# Ensemble Modeling - Averaging Classification Probabilities
n_obs = len(y)
n_rand = 100
predicted_prob = np.zeros((n_obs,n_rand))
avg_prob = np.zeros(n_obs)
# Setup 100 random number seeds for use in creating random samples
np.random.seed(12345)
max_seed = 2**20-1
rand_value = np.random.randint(1, high=max_seed, size=n_rand)
# Model 100 random samples, each with a 70:30 ratio
for i in range(len(rand_value)):
    rus = RandomUnderSampler(ratio=rus_ratio[best_ratio], \
                             random_state=rand_value[i], return_indices=False, \
                             replacement=False)
    X_rus, y_rus = rus.fit_sample(X, y)
    
    dtc = DecisionTreeClassifier(criterion='gini', max_depth=d, min_samples_leaf=5, min_samples_split=5)
    dtc = dtc.fit(X_rus, y_rus)
    
    predicted_prob[0:n_obs, i] = dtc.predict_proba(X)[0:n_obs, 0]
for i in range(n_obs):
    avg_prob[i] = np.mean(predicted_prob[i,0:n_rand])
# Set y_pred equal to the predicted classification
y_pred = avg_prob[0:n_obs] < 0.5
y_pred.astype(np.int)
# Calculate loss from using the ensemble predictions
print("\nEnsemble Estimates based on averaging",len(rand_value), "Models")
loss, conf_mat = calculate.binary_loss(y, y_pred, fp_cost, fn_cost)


Ensemble Estimates based on averaging 100 Models
Misclassification Rate.    0.0020
False Negative Loss....         0
False Positive Loss....     63181
Total Loss.............     63181


The ensemble model is significantly better than the best RUS model. The misclassification error for the ensemble model is only 0.2%. The rate for the base model was 4.7% and the RUS model 3.6%.

The estimated loss for the ensemble model is $63,181

The estimated loss for the base model was over 2 million