# DECISION TREES FOR IDENTIFYING RISKY BANK LOANS

The data for this analysis comes from a dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg.

The dataset contains information on loans obtained from a credit agency in Germany.

If you want to continue from a previously saved session state:

In [None]:
sessionfile <- "04_decision_trees_01.RData"

if(file.exists(sessionfile)) load(sessionfile)

Load necessary libraries:

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better workflow and more tools to wrangle and visualize the data
library(listviewer) # for navigating nested/list objects
library(scales) # for formatting numbers
library(C50) # for C5.0 decision tree algorithm
library(gmodels) # for model evaluation
library(IRdisplay) # to help pretty print tables
options(warn = -1) # suppress warnings

# Explore and prepare data

Read the data into a data.table:

In [None]:
credit <- fread("../data/csv/03_01_credit.csv", stringsAsFactors = TRUE)

## Explore the data

Let's get some info on the data:

In [None]:
str(credit)

In [None]:
summary(credit)

The `summary()` output is hard to read. We split the columns into numeric and factor types and pretty-print the summaries:

In [None]:
credit_num <- credit %>% purrr::keep(is.numeric)

summaries <- credit_num %>%
    summary() %>% # get statistical summaries
    apply(1, function(x) stringr::str_extract(x, "(?<=:).+") %>% as.numeric) %>%
    magrittr::set_colnames(names(summary(1))) %>% # set column names
    magrittr::set_rownames(names(credit_num)) # set row names

summaries
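
The same kind of table can also be built without parsing the printed `summary()` text. The sketch below is one possible alternative in base R, shown on a small made-up data frame (`toy` and its column names are hypothetical, not the credit data):

```r
# Hypothetical toy data standing in for the credit table
toy <- data.frame(
    months  = c(6, 12, 24, 48),
    amount  = c(250, 1200, 3000, 18000),
    purpose = factor(c("car", "car", "education", "business"))
)

# Keep only the numeric columns
num_cols <- Filter(is.numeric, toy)

# summary() returns six named statistics per column;
# sapply() binds them column-wise, t() puts one variable per row
summ <- t(sapply(num_cols, summary))

summ
summ["months", c("Min.", "Max.")]
```

Indexing by row and column name, as in the last line, avoids depending on the column order of the data.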

In [None]:
sprintf("Loan duration ranged between %s and %s months", summaries[1,1], summaries[1,6])
sprintf("Loan amount ranged between %s DM and %s DM", summaries[2,1], summaries[2,6])

In [None]:
credit %>%
    purrr::discard(is.numeric) %>%
    lapply(unique) %>%
    listviewer::jsonedit(mode = "form")

We see the expected 1,000 observations and 17 features, which are a combination of factor and integer data types.

Let's take a look at the table() output for a couple of loan features that seem likely to predict a default.

The applicant's checking and savings account balances are recorded as categorical variables:

In [None]:
table(credit$checking_balance)

In [None]:
table(credit$savings_balance)

Raw counts are hard to compare at a glance.

A faceted bar chart is easier to read:

In [None]:
credit_hist <- credit[,.(checking_balance, savings_balance)] %>%
    tidyr::gather() %>%
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar() +
    coord_flip()
    
plotly::ggplotly(credit_hist)

However, the factor levels are not sorted correctly:

In [None]:
lapply(credit[,.(checking_balance, savings_balance)], levels)

We reorder the factor levels (do not run this cell more than once!):

In [None]:
credit[, (c("checking_balance", "savings_balance")) :=
       .(forcats::fct_relevel(checking_balance, "< 0 DM") %>% factor(ordered = TRUE),
         forcats::fct_relevel(savings_balance, "< 100 DM") %>% factor(ordered = TRUE))]

In [None]:
str(credit$checking_balance)
str(credit$savings_balance)
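
For comparison, level order can also be stated in full with base R's `factor()`. The sketch below uses a small made-up vector (`balance` is hypothetical) and assumes the intended order is the one listed in `wanted`; it is an illustration, not a replacement for the cell above.

```r
# Hypothetical toy vector, not the credit data
balance <- c("1 - 200 DM", "< 0 DM", "unknown", "> 200 DM", "< 0 DM")

# Without explicit levels, factor() sorts the levels (locale dependent)
levels(factor(balance))

# With explicit levels, we decide the order ourselves
wanted <- c("< 0 DM", "1 - 200 DM", "> 200 DM", "unknown")
balance_ord <- factor(balance, levels = wanted, ordered = TRUE)

levels(balance_ord)
table(balance_ord)
```

Because the full order is given explicitly, running this more than once yields the same result.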

And plot again, with the levels of both factors listed in the desired order:

In [None]:
credit_hist <- credit[,.(checking_balance, savings_balance)] %>%
    tidyr::gather() %>%
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar() +
    scale_x_discrete(limits = c("< 0 DM", "< 100 DM", "1 - 200 DM",
                                "100 - 500 DM", "> 200 DM", "500 - 1000 DM", "> 1000 DM", "unknown")) +
    coord_flip()
    
plotly::ggplotly(credit_hist)

Now we look at defaults:

In [None]:
tabdef <- table(credit$default)
tabdef

In [None]:
sprintf("%s of all loans defaulted", scales::percent(prop.table(tabdef)[2], accuracy = 0.1))

And better in a visual format:

In [None]:
plotly::plot_ly(credit, x = ~default,
        type = "histogram")

## Split data into train and test

Pick the train indices:

In [None]:
set.seed(123)

In [None]:
train_sample <- credit[,sample(.N, 0.9 * .N)]

In [None]:
credit_train <- credit[train_sample]
credit_test <- credit[-train_sample]

Let's compare the no/yes distribution of train and test:

In [None]:
# geom_bar() has no `height` parameter, so it is dropped;
# after_stat() replaces the deprecated ..count.. notation
p1 <- ggplot2::ggplot(credit_train[, .(default)]) +
    geom_bar(aes(x = default, y = after_stat(count / sum(count)))) +
    ggtitle("Train Labels") +
    labs(x = "default or not", y = "proportion")

p2 <- ggplot2::ggplot(credit_test[, .(default)]) +
    geom_bar(aes(x = default, y = after_stat(count / sum(count)))) +
    ggtitle("Test Labels") +
    labs(x = "default or not", y = "proportion")

gridExtra::grid.arrange(p1, p2, ncol = 2)

They are fairly similar.
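
The comparison can also be made numerically. The following is a minimal sketch on simulated labels (the `toy` data frame and its roughly 30% share of "yes" are assumptions for illustration, not the credit data), using the same 90/10 random split:

```r
set.seed(123)

# Simulated labels standing in for credit$default (assumed ~30% "yes")
toy <- data.frame(
    default = factor(sample(c("no", "yes"), 1000, replace = TRUE, prob = c(0.7, 0.3)))
)

# Draw 90% of the row indices for training; the rest is the test set
idx   <- sample(nrow(toy), 0.9 * nrow(toy))
train <- toy[idx, , drop = FALSE]
test  <- toy[-idx, , drop = FALSE]

# Class proportions side by side
props <- rbind(
    train = prop.table(table(train$default)),
    test  = prop.table(table(test$default))
)
round(props, 3)
```

If the two rows differ a lot, a stratified split may be worth considering.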

# Train a model

In [None]:
credit_model <- C50::C5.0(credit_train[,!"default"], credit_train$default)

In [None]:
credit_model

In [None]:
sprintf("The tree has %s leaves", credit_model$size)

View the structure of the model:

In [None]:
class(credit_model)
str(credit_model)

And get a summary of the decision tree and view it as text:

In [None]:
summary(credit_model)

Or print the raw model output directly:

In [None]:
credit_model$output %>% cat()

Plotting a large tree is too compute-intensive, so we skip that.

Let's interpret the first three levels.

The notation (xxx/yyy) means that xxx examples reach the leaf and yyy of them are classified incorrectly.

* If the checking balance is unknown or > 200 DM, classify 411 examples (56 incorrectly) as "no default" (this is a leaf)
* Otherwise (the checking balance is < 0 DM or 1 - 200 DM):
    * If the loan duration is > 30 months, then:
        * If unemployed, classify 6 examples (all correctly) as "no default" (this may be an anomaly)
        * If employed, then:
            * If age <= 25, classify 15 examples (all correctly) as "likely to default"
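
The leaf notation translates directly into an error rate per leaf. The helper below is a hypothetical sketch (`parse_leaf` is not part of C50); it assumes the counts appear as "(n/wrong)" and that a leaf without errors is printed with a single count, such as "(6)".

```r
# Hypothetical helper: extract counts from a leaf label and compute its error rate
parse_leaf <- function(txt) {
    # Pull out every number in the label (counts may be fractional)
    nums  <- as.numeric(regmatches(txt, gregexpr("[0-9.]+", txt))[[1]])
    n     <- nums[1]
    wrong <- if (length(nums) > 1) nums[2] else 0  # a single count means no errors
    c(n = n, wrong = wrong, error_rate = wrong / n)
}

parse_leaf("(411/56)")  # the first leaf described above
parse_leaf("(6)")       # a pure leaf
```

For the first leaf this gives 56 / 411, an error rate of about 13.6%.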

A confusion matrix at the end of the summary shows the classification accuracy of the model on the train set itself.

It is more flexible, however, to build the confusion matrix ourselves:

In [None]:
credit_result <- predict(credit_model, credit_train)

In [None]:
ct_dt1 <- gmodels::CrossTable(credit_train$default, credit_result, prop.chisq = F, prop.c = F, prop.r = F,
dnn = c('actual default', 'predicted default'))

ct_dt1

The structure of the cross table is as follows:

In [None]:
str(ct_dt1)

We can get the overall accuracy of the model and format it as a percentage:

In [None]:
ct_dt1$prop.tbl %>%
    diag() %>%
    sum() %>%
    scales::percent(accuracy = 0.01)
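
To make the arithmetic explicit: accuracy is the sum of the diagonal of the confusion matrix divided by the total count. The sketch below uses made-up counts (the matrix `cm` is hypothetical, not a result from the model) with actual classes in rows and predicted classes in columns:

```r
# Hypothetical confusion matrix: rows = actual, columns = predicted
cm <- matrix(c(59,  8,
               19, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"), predicted = c("no", "yes")))

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# False negatives: actual defaults predicted as "no"
false_negatives <- cm["yes", "no"]

# Share of actual defaults that were caught (recall for class "yes")
recall_yes <- cm["yes", "yes"] / sum(cm["yes", ])

accuracy
false_negatives
recall_yes
```

With these toy counts the accuracy is 73 / 100, while only 14 of the 33 actual defaults are caught, which is why accuracy alone can be misleading here.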

# Test the model and evaluate performance

Use the model to predict on the test data:

In [None]:
credit_pred <- predict(credit_model, credit_test)

And report the confusion matrix:

In [None]:
ct_dt2 <- gmodels::CrossTable(credit_test$default, credit_pred, prop.chisq = F, prop.c = F, prop.r = F,
dnn = c('actual default', 'predicted default'))

In [None]:
ct_dt2$prop.tbl %>%
    diag() %>%
    sum() %>%
    scales::percent(accuracy = 0.01)

In [None]:
sprintf("%s of %s actual defaults not predicted!", ct_dt2$t[2,1], sum(ct_dt2$t[2,]))

# Improve model performance

## Adaptive boosting

Adaptive boosting is a process in which many decision trees are built and the trees vote on the best class for each example.

Boosting is rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone. Each of the models has a unique set of strengths and weaknesses and they may be better or worse in solving certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier.
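
The intuition behind combining weak learners can be illustrated with a small simulation. This is not the boosting algorithm used by C5.0; it is a simplified sketch that assumes 11 independent learners, each correct 60% of the time, combined by an unweighted majority vote (all names and numbers are illustrative assumptions).

```r
set.seed(42)

n_cases    <- 5000  # number of simulated examples
n_learners <- 11    # odd number, so a majority always exists
p_correct  <- 0.6   # assumed accuracy of each weak learner

# TRUE means the learner (column) classified the case (row) correctly
correct <- matrix(runif(n_cases * n_learners) < p_correct, nrow = n_cases)

# Accuracy of one learner on its own
single_acc <- mean(correct[, 1])

# Accuracy of the majority vote: more than half of the learners are right
vote_acc <- mean(rowSums(correct) > n_learners / 2)

# Theoretical value for the vote under the independence assumption
vote_theory <- pbinom(floor(n_learners / 2), n_learners, p_correct, lower.tail = FALSE)

c(single = single_acc, vote = vote_acc, theory = vote_theory)
```

Real learners trained on the same data are correlated, so the gain in practice is usually smaller than under this independence assumption.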

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. 

In [None]:
credit_boost10 <- C5.0(credit_train[, !"default"],
                       credit_train$default,
                       trials = 10)

In [None]:
credit_boost10

In [None]:
summary(credit_boost10)

In [None]:
credit_result_boost10 <- predict(credit_boost10, credit_train)

In [None]:
ct_dt3 <- gmodels::CrossTable(credit_train$default, credit_result_boost10, prop.chisq = F, prop.c = F, prop.r = F,
dnn = c('actual default', 'predicted default'))

ct_dt3

In [None]:
ct_dt3$prop.tbl %>%
    diag() %>%
    sum() %>%
    scales::percent(accuracy = 0.01)

Accuracy on the training data improved.

### Test with boost

In [None]:
credit_pred_boost10 <- predict(credit_boost10, credit_test)

In [None]:
ct_dt4 <- gmodels::CrossTable(credit_test$default, credit_pred_boost10, prop.chisq = F, prop.c = F, prop.r = F,
dnn = c('actual default', 'predicted default'))



In [None]:
ct_dt4

Or nicely:

In [None]:
ct_dt4$t %>% knitr::kable(format = "html") %>% as.character() %>% IRdisplay::display_html()

In [None]:
ct_dt4$prop.tbl %>%
    diag() %>%
    sum() %>%
    scales::percent(accuracy = 0.01)

Boosting did not help on the test set. We could try more boosting iterations or use other methods to improve performance.

## Making mistakes costlier

The C5.0 algorithm allows us to assign a penalty to different types of errors, in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how much costlier each error is, relative to any other prediction.

In [None]:
dimnames <- rep(list(c("no", "yes")), 2)

In [None]:
names(dimnames) <- c("predicted", "actual")

In [None]:
dimnames

In [None]:
error_cost <- matrix(c(0,1,4,0), nrow = 2, dimnames = dimnames)

In [None]:
error_cost

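So a correct classification costs nothing, a false positive (predicting a default that does not happen) costs 1, and a false negative (missing an actual default) costs 4.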
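
Because `matrix()` fills its cells column by column, it is worth checking that each cost landed in the intended cell. A minimal check in base R, rebuilding the same matrix (the variable `lv` is introduced here only for brevity):

```r
lv <- c("no", "yes")

# Same values and layout as above: rows = predicted, columns = actual
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(predicted = lv, actual = lv))

# Predicted "no" but actually "yes": a missed default
error_cost["no", "yes"]

# Predicted "yes" but actually "no": a false alarm
error_cost["yes", "no"]
```

Indexing by name rather than by position makes the orientation of the matrix explicit.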

Now, let's apply it:

In [None]:
credit_cost <- C50::C5.0(credit_train[, !"default"], # on a data.table, [-17] would drop row 17, not column 17
                         credit_train$default,
                         costs = error_cost)

Make predictions on test set:

In [None]:
credit_cost_pred <- predict(credit_cost, credit_test)

And evaluate:

In [None]:
ct_dt5 <- gmodels::CrossTable(credit_test$default,
                    credit_cost_pred,
                    prop.chisq = FALSE,
                    prop.c = FALSE,
                    prop.r = FALSE,
                    dnn = c('actual default', 'predicted default'))

In [None]:
ct_dt5$prop.tbl %>%
    diag() %>%
    sum() %>%
    scales::percent(accuracy = 0.01)

Overall accuracy is lower. However:

In [None]:
sprintf("%s of %s actual defaults not predicted!", ct_dt5$t[2,1], sum(ct_dt5$t[2,]))

So false negatives are much lower with the cost matrix.
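
One way to weigh lower accuracy against fewer false negatives is to compare the total cost of each model's errors. The sketch below uses two made-up confusion matrices (`counts_a` and `counts_b` are hypothetical, not results from the models above) in the same orientation as the cost matrix:

```r
lv <- c("no", "yes")
dn <- list(predicted = lv, actual = lv)

# Cost matrix as defined earlier: rows = predicted, columns = actual
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2, dimnames = dn)

# Hypothetical model A: more accurate, but misses 19 defaults
counts_a <- matrix(c(59, 8, 19, 14), nrow = 2, dimnames = dn)

# Hypothetical model B: less accurate, but misses only 7 defaults
counts_b <- matrix(c(37, 30, 7, 26), nrow = 2, dimnames = dn)

# Total cost: multiply each cell count by its cost and add everything up
total_cost <- function(counts, costs) sum(counts * costs)

c(accuracy_a = sum(diag(counts_a)) / sum(counts_a),
  accuracy_b = sum(diag(counts_b)) / sum(counts_b),
  cost_a     = total_cost(counts_a, error_cost),
  cost_b     = total_cost(counts_b, error_cost))
```

In this toy case model B is less accurate (63% versus 73%) yet cheaper (58 versus 84 cost units), which is the kind of trade-off a cost matrix is meant to capture.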

In [None]:
save.image(sessionfile)