# DECISION TREES FOR IDENTIFYING COUNTERFEIT CURRENCY

The banknote data frame in the mclust package contains measurements made on genuine and counterfeit Swiss 1000-franc banknotes.

If you want to continue from a previously saved session state:

In [None]:
sessionfile <- "04_decision_trees_02.RData"

if(file.exists(sessionfile)) load(sessionfile)

Load necessary libraries:

In [None]:
library(data.table) # to handle the data in a more convenient manner
library(tidyverse) # for a better work flow and more tools to wrangle and visualize the data
#library(listviewer) # for navigating nested/list objects
#library(scales) # for formatting numbers
library(C50) # for C5.0 decision tree algorithm
library(gmodels) # for model evaluation
library(plotly)
library(tree) # for improved decision trees
options(warn = -1) # suppress warnings (messages are not affected)

# Collect and explore data

Let's load the data from the mclust package:

In [None]:
data("banknote", package = "mclust")

In [None]:
banknote_dt <- as.data.table(banknote)

In [None]:
banknote_dt

In [None]:
str(banknote_dt)

In [None]:
summary(banknote_dt)

**EXERCISE 1:** Pretty-print the summary statistics of the numeric variables, and plot a histogram of the Status variable

(3 minutes)

**SOLUTION 1:**

In [None]:
banknote_num <- banknote %>% purrr::keep(is.numeric)

summaries <- banknote_num %>%
    summary() %>% # get statistical summaries
    apply(1, function(x) stringr::str_extract(x, "(?<=:).+") %>% as.numeric) %>%
    magrittr::set_colnames(names(summary(1))) %>% # set column names
    magrittr::set_rownames(names(banknote_num)) # set row names

summaries

In [None]:
plotly::plot_ly(banknote_dt, x = ~Status, type = "histogram")

**EXERCISE 2:** First, draw density plots of the numeric variables. The variables are measured on different scales, so each plot needs its own axis scales

Then, draw the correlation plot to see correlated variables

Interpret both

(4 minutes)

**SOLUTION 2:**

In [None]:
banknote_dt[,!"Status"] %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key, scales = "free") + # divide into separate plots by key, each with its own axes
        geom_density(fill = "green")  # get density plots

Most variables look approximately normally distributed

In [None]:
cor(banknote_dt[, !"Status"]) %>% # correlation matrix of the numeric variables
    corrplot::corrplot.mixed(upper = "ellipse",
                             lower = "number",
                             tl.pos = "lt",
                             number.cex = .5,
                             lower.col = "black",
                             tl.cex = 0.7)

Left and Right are strongly positively correlated, while Bottom and Diagonal are strongly negatively correlated

# Split the data into train and test

Extract random train indices:

In [None]:
set.seed(2018)

In [None]:
train <- banknote_dt[,sample(.N, 150)]
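
The sampled indices define both sets: rows selected by the indices form the train set, and negative indexing with the same indices leaves the remaining rows as the test set. Below is a minimal base R sketch of that idea. The toy data frame and the names toy, idx, toy_train and toy_test are hypothetical stand-ins, not objects from this notebook:

```r
# Hypothetical toy data: 200 rows standing in for the banknote table
set.seed(2018)
toy <- data.frame(id = 1:200, x = rnorm(200))

# Draw 150 distinct row indices (sampling without replacement)
idx <- sample(nrow(toy), 150)

toy_train <- toy[idx, ]   # rows picked by the sampled indices
toy_test  <- toy[-idx, ]  # negative indexing keeps the complement

nrow(toy_train)                                # 150
nrow(toy_test)                                 # 50
length(intersect(toy_train$id, toy_test$id))   # 0, the two sets do not overlap
```

Because the split is defined by indices rather than by copies of the data, the same index vector can be reused for every model that follows.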

# Train the model

**EXERCISE 3:** Train the model using either of the two interfaces of the C5.0 function from the C50 package, and save it into the fitc1 object

(2 minutes)

**SOLUTION 3:**

In [None]:
fitc1 <- C50::C5.0(Status ~ ., data = banknote_dt[train])

In [None]:
fitc <- C50::C5.0(banknote_dt[train, !"Status"], banknote_dt[train,Status])

In [None]:
fitc

Plot the model:

In [None]:
plot(fitc1)

And summary of the model:

In [None]:
summary(fitc1)

## View rules

Since a decision tree is built as a series of test questions and conditions, we can view the actual rules as a series of “if then” statements.

With a large tree, this can improve readability. To do this, refit the model with the C5.0 function and the additional argument rules = TRUE:

In [None]:
fitc_rules <- C50::C5.0(Status ~ .,
                        data = banknote_dt[train], # same train rows as the tree models above
                        rules = TRUE)

Let's see the rules:

In [None]:
summary(fitc_rules)

The first rule says that if Bottom is greater than 8.6 and Diagonal is less than 140.6, then the banknote is counterfeit
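
A rule of this kind is just a logical condition on the measurements. The sketch below applies the rule as described above to three made-up notes. The data frame notes, its values, and the names rule_fires and rule_label are hypothetical, and only this single rule is encoded, not the full rule set:

```r
# Hypothetical measurements for three notes (values invented for illustration)
notes <- data.frame(Bottom   = c(10.2, 8.1, 9.0),
                    Diagonal = c(139.5, 141.6, 141.0))

# The rule as described in the text: Bottom > 8.6 and Diagonal < 140.6
rule_fires <- notes$Bottom > 8.6 & notes$Diagonal < 140.6

# Notes not covered by this one rule are left undecided (NA) here;
# a fitted rule set would fall back on its other rules or a default class
rule_label <- ifelse(rule_fires, "counterfeit", NA_character_)
rule_label
```

Only the first note satisfies both conditions, so it is the only one this rule labels as counterfeit.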

# Evaluate model

## On train set

**EXERCISE 4:** Evaluate the model's performance on the train set with the predict() function, returning class labels, and save the result into the predc_train object

Create a confusion matrix of actual and predicted values in the train set

(4 minutes)

**SOLUTION 4:**

In [None]:
predc_train <- predict(fitc,
                      newdata = banknote_dt[train],
                      type = "class")

In [None]:
dt_ct6 <- gmodels::CrossTable(banknote_dt[train, Status],
                    predc_train,
                    prop.chisq = FALSE,
                    prop.c = FALSE,
                    prop.r = FALSE,
                    dnn = c('actual status', 'predicted status')
                   )

Every note in the train set is classified correctly!

## Test predictive power

**EXERCISE 5:** Evaluate the model's predictive performance on the test set with the predict() function, returning class labels, and save the result into the predc object

Create a confusion matrix of actual and predicted values in the test set

(4 minutes)

**SOLUTION 5:**

In [None]:
predc <- predict(fitc,
                      newdata = banknote_dt[-train],
                      type = "class")

In [None]:
dt_ct7 <- gmodels::CrossTable(banknote_dt[-train, Status],
                    predc,
                    prop.chisq = FALSE,
                    prop.c = FALSE,
                    prop.r = FALSE,
                    dnn = c('actual status', 'predicted status')
                   )

3 of the 50 test notes are misclassified
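
The error count translates directly into an accuracy figure. The sketch below first uses only the two numbers stated above (3 errors among 50 test notes), then shows on hypothetical label vectors how the same quantity can be read off a confusion matrix built with the base R table() function. The helper accuracy() and the example labels are illustrative assumptions, not part of the notebook:

```r
# Accuracy from the counts reported above
n_test   <- 50
n_errors <- 3
test_accuracy <- 1 - n_errors / n_test
test_accuracy  # 0.94

# Hypothetical helper: accuracy from actual and predicted labels
accuracy <- function(actual, predicted) {
  cm <- table(actual, predicted)  # confusion matrix
  sum(diag(cm)) / sum(cm)         # correct predictions lie on the diagonal
}

# Made-up labels for illustration only; shared levels keep the table square
lv        <- c("counterfeit", "genuine")
actual    <- factor(c("genuine", "genuine", "counterfeit", "counterfeit"), levels = lv)
predicted <- factor(c("genuine", "counterfeit", "counterfeit", "counterfeit"), levels = lv)
toy_accuracy <- accuracy(actual, predicted)
toy_accuracy  # 0.75
```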

# Improve model performance

One approach that often works well to improve performance is to select an alternative splitting criterion.

Three impurity measures, or splitting criteria, commonly used in binary decision trees are Gini impurity, entropy and deviance.
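
To make the three measures concrete, here is a minimal base R sketch that computes them for a single node from its class counts. It uses the common textbook definitions (Gini as one minus the sum of squared proportions, entropy in bits, deviance as minus twice the multinomial log-likelihood); the function name node_impurity is hypothetical, and the packages used in this notebook may scale or combine these quantities differently:

```r
# Impurity of one node, given the number of observations per class
node_impurity <- function(counts) {
  p  <- counts / sum(counts)  # class proportions
  nz <- counts > 0            # skip empty classes to avoid 0 * log(0)
  c(gini     = 1 - sum(p^2),
    entropy  = -sum(p[nz] * log2(p[nz])),
    deviance = -2 * sum(counts[nz] * log(p[nz])))
}

mixed <- node_impurity(c(50, 50))  # evenly mixed node: gini 0.5, entropy 1
pure  <- node_impurity(c(100, 0))  # pure node: all three measures are 0
mixed
pure
```

All three measures are largest for an evenly mixed node and drop to zero for a pure node, which is why a split that produces purer children lowers the criterion.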

The tree package lets you use either the deviance or the Gini criterion.

In [None]:
fit <- tree::tree(Status ~.,
                  data = banknote_dt[train],
                  split = "deviance" )

View the decision tree:

In [None]:
plot(fit)
text(fit)

Get a summary of the model:

In [None]:
summary(fit)

## Test set performance

**EXERCISE 6:** Predict on the test set with the fit model, returning raw class probabilities (not labels)

Get the labels from the raw probabilities using the colnames and max.col functions (check the documentation)

Evaluate the performance with a confusion matrix

(6 minutes)

**SOLUTION 6:**

In [None]:
pred <- predict(fit,
                newdata = banknote_dt[-train])

In [None]:
pred

In [None]:
# pick, for each row, the name of the column with the highest probability
pred_class <- colnames(pred)[max.col(pred, ties.method = "random")]
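
The max.col function returns, for each row of a matrix, the index of the column holding the largest value; indexing the column names with that result turns probabilities into labels. A small sketch on a hypothetical probability matrix (probs, best_col and labels are illustrative names, and the values are made up):

```r
# Hypothetical class-probability matrix: one row per note, one column per class
probs <- matrix(c(0.9, 0.1,
                  0.2, 0.8,
                  0.4, 0.6),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("counterfeit", "genuine")))

# Column index of the row-wise maximum; "first" makes tie-breaking deterministic
best_col <- max.col(probs, ties.method = "first")
labels   <- colnames(probs)[best_col]
labels
```

With ties.method = "random", as used in the solution, rows with equal probabilities are assigned to one of the tied classes at random, so results can vary between runs unless a seed is set.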

In [None]:
dt_ct8 <- gmodels::CrossTable(banknote_dt[-train, Status],
                    pred_class,
                    prop.chisq = FALSE,
                    prop.c = FALSE,
                    prop.r = FALSE,
                    dnn = c('actual status', 'predicted status')
                   )

This time, prediction accuracy is one hundred percent!

In [None]:
save.image(sessionfile)