<font size="6"><b>RECURSIVE PARTITIONING TREES: EXAMPLE</b></font>

<font size="5"><b>Serhat Çevikel</b></font>

In [None]:
library(data.table)
library(tidyverse)
library(plotly)
library(modeldata) # for churn data
library(rpart) # for recursive partitioning trees
library(rpart.plot) # for plotting recursive partitioning trees
library(visNetwork) # for interactive plots of recursive partitioning trees
library(caret) # for a better confusion matrix
library(vip) # for variable importance plots

In [None]:
options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # limit the number of rows and columns shown when tables are printed

In [None]:
datapath <- "~/databa"

![xkcd](../imagesba/tree.png)

(https://xkcd.com/835/)

In this session we will explore the recursive partitioning tree method for classification, using a dataset on the churn of telecom customers.

Example adapted from Machine Learning with R Cookbook: Analyze data and build predictive models, by AshishSingh Bhatia and Yu-Wei Chiu (David Chiu), Chapter 7.

# Data

In [None]:
data(mlc_churn, package = "modeldata")

In [None]:
mlc_churn

The `churn` column holds the response variable.

More information on the data is available from its help page:

In [None]:
#?modeldata::mlc_churn

# Explore

In [None]:
churn <- mlc_churn %>% as.data.table()

First, let's get the unique levels of each factor column in a concise and simple way.

We use the purrr package to iterate over the columns:

`keep` selects only the columns that satisfy a condition, and `map` works like `lapply`, applying a function to each selected column:

In [None]:
churn %>% purrr::keep(is.factor) %>% purrr::map(levels)
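
The same idea can be sketched in base R without the purrr dependency. The `toy` data frame and its column names below are made up for illustration; `Filter()` plays the role of `keep()` and `lapply()` the role of `map()`.

```r
# Toy data frame (hypothetical columns, not the churn data)
toy <- data.frame(
  plan  = factor(c("yes", "no", "no")),
  area  = factor(c("a", "b", "a")),
  calls = c(1, 4, 2)
)

# Keep only the factor columns, then list the levels of each one
factor_levels <- lapply(Filter(is.factor, toy), levels)
factor_levels
```

The numeric column `calls` is dropped by the filter, so only `plan` and `area` appear in the result.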

Let's draw bar charts for the factor variables:

In [None]:
churn_factors <- churn %>% purrr::keep(is.factor) %>% # select factor columns
    tidyr::gather() %>% # convert into long format for faceting
    ggplot(aes(x = value)) + # plot value
    facet_wrap(~ key, scales = "free") + # divide into separate plots by key
    geom_bar()

plotly::ggplotly(churn_factors)

So:

- The most frequent area code is 415
- 707 out of 5000 customers churned
- 4527 customers do not have an international plan
- The data is spread nearly evenly across states
- 3677 customers do not have a voice mail plan

For numeric variables, five-number summaries are easy to get:

In [None]:
churn %>% purrr::keep(is.numeric) %>% sapply(quantile) %>% t()
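
The call above relies on `quantile()` returning the minimum, the three quartiles and the maximum by default. A minimal base R sketch on the built-in `mtcars` data (only a stand-in for the churn table, with an arbitrary choice of columns) shows the shape of the result: one row per variable and one column per quantile.

```r
# Built-in data as a stand-in; the column choice is arbitrary
numeric_cols <- mtcars[c("mpg", "hp", "wt")]

# sapply() puts each variable in a column, so transpose to get one row per variable
five_point <- t(sapply(numeric_cols, quantile))
five_point
```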

And we can have density plots for numeric variables:

In [None]:
churn %>% purrr::keep(is.numeric) %>% # select columns
    tidyr::gather() %>% # reshape into long format in columns "key" and "value"
    ggplot(aes(value)) + # plot value
        facet_wrap(~ key, scales = "free") + # divide into separate plots by key
        geom_density(fill = "green")  # get density plots

# Partition dataset

In [None]:
set.seed(1863)
train_ind <- churn[,sample(.I, 0.7 * .N)]

In [None]:
churn_train <- churn[train_ind]
churn_test <- churn[-train_ind]
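
The positive and negative indexing used above can be checked on a small example. This is a base R sketch with a made-up sample size of 10; it assumes, as in the cells above, that the index vector holds row positions.

```r
set.seed(1863)
n <- 10                                    # hypothetical number of rows
train_ind <- sample(seq_len(n), size = 7)  # 70% of the rows for training
test_ind  <- seq_len(n)[-train_ind]        # negative indexing keeps the rest

# The two sets never overlap and together cover every row
length(intersect(train_ind, test_ind))
sort(c(train_ind, test_ind))
```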

# Train the model

We can get information on the rpart function:

In [None]:
#?rpart

```
Recursive Partitioning and Regression Trees
Description
Fit a rpart model

Usage
rpart(formula, data, weights, subset, na.action = na.rpart, method,
      model = FALSE, x = FALSE, y = TRUE, parms, control, cost, ...)
Arguments
formula	
a formula, with a response but no interaction terms. If this a a data frame, that is taken as the model frame (see model.frame).

data	
an optional data frame in which to interpret the variables named in the formula.

...

cost	
a vector of non-negative costs, one for each variable in the model. Defaults to one for all variables. These are scalings to be applied when considering splits, so the improvement on splitting on a variable is divided by its cost in deciding which split to choose.
```

In [None]:
churn.rp <- rpart::rpart(churn ~ ., data = churn_train)

In [None]:
churn.rp

- `split` is the condition for the split
- `n` is the total number of cases at the node
- `loss` is the misclassification cost, here the number of misclassified cases at the node
- `yval` is the fitted value for the node (yes or no)
- `yprob` holds the class probabilities at the node, with "yes" on the left and "no" on the right

The reason for the "yes" and "no" order is the order of the levels of the response variable:

In [None]:
levels(churn_train$churn)

If we stop at the root without any split and predict all cases as "no", we misclassify 491 cases in total, which is the number of "yes" cases in the train sample.

After one partitioning step on whether total_day_minutes >= 265.75, the number of misclassified cases is down to 83 + 358 = 441.
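
The root-node figure can be reproduced with plain arithmetic. The sketch below rebuilds a response vector from the class counts quoted in this notebook (491 "yes" cases in a train sample of 3500) rather than reading the real data, so it only illustrates the counting rule: predicting the majority class everywhere misclassifies exactly the minority cases.

```r
# Class counts taken from the text: 491 "yes" out of 3500 training rows
actual <- factor(rep(c("yes", "no"), times = c(491, 3500 - 491)),
                 levels = c("yes", "no"))

# A tree with no splits predicts the most frequent class for every case
majority_class <- names(which.max(table(actual)))
root_errors <- sum(actual != majority_class)

majority_class
root_errors
root_errors / length(actual)  # root node error rate
```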

Now let's examine the complexity parameter.

The complexity parameter (CP) serves as a penalty that controls the size of the tree. The greater the CP value, the fewer splits the tree has:

In [None]:
printcp(churn.rp)

We see that only 9 of the 19 variables are used.
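
The effect of the penalty can be seen in isolation by fitting the same tree under two `cp` values. The sketch uses the `kyphosis` data shipped with rpart instead of the churn data, and the two `cp` values are arbitrary choices for illustration.

```r
library(rpart)

# Same formula and data, only the complexity parameter differs
fit_small_cp <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0.001)
fit_large_cp <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0.5)

# In a binary tree the number of splits is the number of leaves minus one
n_splits <- function(fit) sum(fit$frame$var == "<leaf>") - 1

n_splits(fit_small_cp)
n_splits(fit_large_cp)
```

A split is kept only if it improves the relative fit by at least `cp`, so the tree grown with the larger value can never have more splits than the one grown with the smaller value.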

And we can plot the cost complexity parameters:

In [None]:
plotcp(churn.rp)

And plot variable importance values:

In [None]:
vip(churn.rp)

The most important variables are total_day_minutes, total_day_charge and number_customer_service_calls.
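
The importance scores that rpart stores in the fitted object can also be read directly, without a plot. As a sketch on the `kyphosis` data shipped with rpart (a stand-in for the churn model), they sit in the `variable.importance` element:

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Named numeric vector of importance scores, sorted here for readability
importance <- sort(fit$variable.importance, decreasing = TRUE)
importance

# Rescale so that the scores sum to 100, which eases comparison
round(100 * importance / sum(importance), 1)
```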

# Visualize the tree

A simple way to visualize an rpart tree is the base plot function together with text:

In [None]:
plot(churn.rp, uniform = FALSE, branch = 0.6, margin = 0)
text(churn.rp, all = TRUE, use.n = TRUE)

This does not work well with larger trees

A better option is the rpart.plot function from the rpart.plot package:

In [None]:
rpart.plot(churn.rp)

An even better option is the interactive visTree function from the JavaScript-powered visNetwork package:

In [None]:
visTree(churn.rp)

# Evaluate the classification accuracy

Now let's see how well the model can fit the classes of the response variable:

In [None]:
fitted_train <- predict(churn.rp, churn_train, type = "class")

In [None]:
cmtrain1 <- caret::confusionMatrix(table(fitted = fitted_train, actual = churn_train$churn))
cmtrain1

The accuracy rate is 96%, with 142 of the 3500 cases misclassified.

A kappa value of 0.79 can also be considered good:

- Poor agreement = less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00

(Lantz 2015, Machine Learning with R, Ch 10, p.323)
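
Kappa compares the observed agreement `p_o` with the agreement `p_e` expected by chance: `kappa = (p_o - p_e) / (1 - p_e)`. The sketch below computes it by hand for a made-up 2x2 confusion table. The counts are hypothetical and are not the output of the model above.

```r
# Hypothetical confusion table: rows = fitted, columns = actual
cm <- matrix(c(40, 10, 5, 45), nrow = 2,
             dimnames = list(fitted = c("yes", "no"), actual = c("yes", "no")))

n   <- sum(cm)
p_o <- sum(diag(cm)) / n                     # observed agreement (accuracy)
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hand <- (p_o - p_e) / (1 - p_e)

p_o         # 0.85
p_e         # 0.5
kappa_hand  # 0.7
```

With an accuracy of 0.85 and a chance agreement of 0.5, kappa is 0.7, which falls in the "good agreement" band of the scale above.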


# Predictive power of the model

Now let's see whether our model can do as well on unseen data:

In [None]:
predictions_test <- predict(churn.rp, churn_test, type = "class")

In [None]:
cmtest1 <- caret::confusionMatrix(table(predicted = predictions_test, actual = churn_test$churn))
cmtest1

Predictive accuracy is 94.3%, quite good! Kappa is 0.73, still good.

# Pruning

We may remove branches that contribute little to classification, in order to avoid over-fitting and to improve predictive accuracy.

Let's recall the complexity parameter table of the model:

In [None]:
printcp(churn.rp)

Let's plot the (relative) cross validation error with the standard deviation of the errors:

In [None]:
plotcp(churn.rp)

First let's find the minimum cross-validation error:

In [None]:
min(churn.rp$cptable[,"xerror"])

And locate the row of that minimum value:

In [None]:
minrow <- which.min(churn.rp$cptable[,"xerror"])
minrow

Get the cost complexity parameter at that row:

In [None]:
churn.cp <- churn.rp$cptable[minrow, "CP"]
churn.cp

Let's prune the tree by setting the cp parameter to the CP value of the record with minimum cross-validation error:

In [None]:
prune.tree <- prune(churn.rp, cp = churn.cp)

And visualize:

In [None]:
rpart.plot(prune.tree)

or:

In [None]:
visTree(prune.tree)
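
The pruning steps above can be condensed into one helper. This is a minimal sketch: `prune_min_xerror` is a hypothetical name, not an rpart function, and the `kyphosis` data shipped with rpart stands in for the churn data.

```r
library(rpart)

# Hypothetical helper: prune a fitted rpart tree at the CP with minimum xerror
prune_min_xerror <- function(fit) {
  best_row <- which.min(fit$cptable[, "xerror"])
  prune(fit, cp = fit$cptable[best_row, "CP"])
}

set.seed(1863)  # cross-validation folds are random, so fix the seed
full_tree   <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, cp = 0.001)
pruned_tree <- prune_min_xerror(full_tree)

# The pruned tree can never have more nodes than the full tree
c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame))
```

Because `xerror` comes from random cross-validation folds, the selected row can change between runs unless the seed is fixed before fitting.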

## Classification performance of the pruned tree

Let's assess the classification performance on train data with the pruned tree:

In [None]:
predictions_train_pruned <- predict(prune.tree, churn_train, type = "class")

In [None]:
cmtrain2 <- caret::confusionMatrix(table(fitted = predictions_train_pruned, actual = churn_train$churn))
cmtrain2

Both accuracy and kappa are lower on the train set.

How about predictive power?

## Predictive power of the pruned tree

Now let's see the classification performance on the test set with pruned tree:

In [None]:
predictions_test_pruned <- predict(prune.tree, churn_test, type = "class")

In [None]:
cmtest2 <- caret::confusionMatrix(table(predicted = predictions_test_pruned, actual = churn_test$churn))
cmtest2

Predictive power is also slightly lower. However, we have a less complex tree, and some split conditions that may cause over-fitting are eliminated.

Before pruning, the kappa on the test set is about 0.06 lower than on the train set:

In [None]:
cmtest1$overall["Kappa"] - cmtrain1$overall["Kappa"]

After pruning, the kappa on the test set is only about 0.015 lower than on the train set:

In [None]:
cmtest2$overall["Kappa"] - cmtrain2$overall["Kappa"]

So while pruning increased the bias, it decreased the variance, so the performance generalizes better to unseen data.

# Object Generating Code

In [None]:
student_id <- 2025000000
library(tidyverse)
library(data.table)
library(PearsonDS) # for Pearson distribution
library(rethinking) # for LKJ distribution
library(caret) # for confusion matrix
library(rpart) # for recursive partitioning trees
library(rpart.plot) # for plotting recursive partitioning trees
library(vip) # for variable importance plots
set.seed(floor((student_id %% 1e8) * 1.1))
nvar <- 4
sampsize <- 1e3
etax <- 10
train_ratio <- 0.7
matx <- rlkjcorr(1, nvar, etax)
sampx <- rmvnorm(sampsize, sigma = matx)
sampx <- pnorm(sampx)
means <- rnorm(nvar)
vars <- rexp(nvar, 1)
kurts <- rexp(nvar, 1) + 3
skews <- (rbeta(nvar, 3, 3) - 0.5)*2
colnamesx <- paste(sample(words, nvar + 1), "1", sep = "")
sampx_dt <- as.data.table(sampx)
sampx_dt <- as.data.table(mapply(function(x, a, b, c, d) qpearson(x, moments = c(a, b, c, d)), sampx_dt,
                                 means, vars, skews, kurts))
mm <- model.matrix(as.formula(paste("V0", sprintf("(%s)^4", paste(colnames(sampx_dt), collapse = " + ")), sep = " ~ ")), cbind(V0 = 1, sampx_dt))
paramst <- as.matrix(runif(ncol(mm), -5, 5))
errx <- as.matrix(rnorm(sampsize, 0, sqrt(rexp(1, 0.02))))
responsex <- mm %*% paramst + errx
posrate <- runif(1, 0.2, 0.4)
cutp <- quantile(responsex, 1 - posrate)
responsex <- factor(ifelse(responsex > cutp, 1, 0))
sampx_dt <- cbind(responsex, sampx_dt)
setnames(sampx_dt, colnamesx)
train_indices <- sampx_dt[,sample(.I, .N * train_ratio), by = c(colnamesx[1])]$V1
train_data <- sampx_dt[train_indices]
test_data <- sampx_dt[-train_indices]
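
The `by = ...` clause near the end of the cell above draws the training rows separately within each response class, so both classes keep roughly the same share in the train and test parts. A base R sketch of such a stratified draw, on a made-up response vector with 70 zeros and 30 ones:

```r
set.seed(1)
y <- factor(rep(c(0, 1), times = c(70, 30)))  # hypothetical response
train_ratio <- 0.7

# Sample row positions within each class, then pool them
rows_by_class <- split(seq_along(y), y)
train_rows <- unlist(
  lapply(rows_by_class, function(rows) sample(rows, round(length(rows) * train_ratio))),
  use.names = FALSE
)

table(y[train_rows])   # class counts in the train part
table(y[-train_rows])  # class counts in the test part
```

Note that `sample(rows, ...)` treats a single number as `1:rows`, so this sketch assumes every class has more than one row.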

## Explore

Get variable names:

In [None]:
names(train_data)

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)
train_data %>%
  pivot_longer(-"encourage1") %>%
  ggplot(aes(y = value, x = "", group = encourage1)) +
  geom_boxplot() +
  facet_wrap(~ name, scales = "free_y")

Here we see that no single variable discriminates well between the response values: the interquartile ranges mostly overlap.

## Model

Let's run a model:

In [None]:
model1 <- rpart::rpart(encourage1 ~ level1 + bar1 + record1 + young1, data = train_data)

See the splits:

In [None]:
model1

Note that the probabilities are reported in the order of 0 and 1 since this is the order of levels:

In [None]:
levels(train_data$encourage1)

At the root node the predicted class is 0, since it is the majority class, and the misclassification cost of 168 is the count of 1 values:

In [None]:
table(train_data$encourage1)

Let's visualize the tree:

In [None]:
rpart.plot(model1)

It is hard to interpret every leaf node in the tree. However, we can interpret an arbitrarily selected leaf node, for example node 4:

- When record1 < -0.77 and young1 >= -0.94 the response variable is predicted as 0 (negative class) which makes up 90% of the values at that leaf node.
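
Reading a rule off the plot by eye is error-prone, and rpart can list the conditions that lead to a node with `path.rpart()`. The sketch uses the `kyphosis` data shipped with rpart as a stand-in, and picks the first leaf node rather than a fixed node number.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# Node numbers are stored as row names of the frame; leaves are marked "<leaf>"
leaf_ids <- as.integer(rownames(fit$frame)[fit$frame$var == "<leaf>"])

# Split conditions on the way from the root to the first leaf
leaf_path <- path.rpart(fit, nodes = leaf_ids[1], print.it = FALSE)
leaf_path
```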

And plot variable importances:

In [None]:
vip(model1)

level1 is the most important variable, followed by record1

Print the complexity parameters:

In [None]:
printcp(model1)

And plot the cross validation errors across cp values: 

In [None]:
plotcp(model1)

## Classification performance

Let's get fitted values:

In [None]:
fitted_train <- predict(model1, type = "class")

In [None]:
cmtrain1 <- confusionMatrix(table(fitted = fitted_train, actual = train_data$encourage1), positive = "1")
cmtrain1

The kappa value is 0.4629, which shows moderate agreement:

- Poor agreement = less than 0.20
- Fair agreement = 0.20 to 0.40
- Moderate agreement = 0.40 to 0.60
- Good agreement = 0.60 to 0.80
- Very good agreement = 0.80 to 1.00

(Lantz 2015, Machine Learning with R, Ch 10, p.323)


And let's get prediction performance:

In [None]:
predictions_test <- predict(model1, test_data, type = "class")

In [None]:
cmtest1 <- confusionMatrix(table(predicted = predictions_test, actual = test_data$encourage1), positive = "1")
cmtest1

The kappa value is lower now, at 0.3812, which indicates fair agreement.

## Pruning

Let's prune the tree to get a simpler structure.

Locate the row with the minimum xerror (relative cross-validation error) value:

In [None]:
minrow <- which.min(model1$cptable[,"xerror"])
minrow

Get the cost complexity parameter at that row:

In [None]:
model1.cp <- model1$cptable[minrow, "CP"]
model1.cp

Let's prune the tree by setting the cp parameter to the CP value of the record with minimum cross-validation error:

In [None]:
prune.tree <- prune(model1, cp = model1.cp)

And visualize:

In [None]:
rpart.plot(prune.tree)

### Classification performance of the pruned tree

Let's get the classification performance using the pruned tree:

In [None]:
fitted_train_pruned <- predict(prune.tree, type = "class")

In [None]:
cmtrain2 <- confusionMatrix(table(fitted = fitted_train_pruned, actual = train_data$encourage1), positive = "1")
cmtrain2

The kappa value of 0.2967 is considerably lower now, which indicates fair agreement.

How about predictive power?

### Predictive power of the pruned tree

Let's get the prediction performance on test set with pruned tree:

In [None]:
predictions_test_pruned <- predict(prune.tree, test_data, type = "class")

In [None]:
cmtest2 <- confusionMatrix(table(predicted = predictions_test_pruned, actual = test_data$encourage1), positive = "1")
cmtest2

Now the predictive power of the pruned tree on the test set is better than its performance on the train set! The kappa value is 0.3836, still fair.

This performance is also better than the prediction performance of the full tree.

So while pruning worsened the fit on the train set, as expected, the prediction performance is now better: the model generalizes well to unseen data.