### Tree-Based Methods

In this chapter, we describe tree-based methods for regression and classification.
These involve stratifying or segmenting the predictor space into a
number of simple regions. In order to make a prediction for a given observation,
we typically use the mean or the mode response value for the
training observations in the region to which it belongs. Since the set of
splitting rules used to segment the predictor space can be summarized in
a tree, these types of approaches are known as decision tree methods.
Tree-based methods are simple and useful for interpretation. However,
they typically are not competitive with the best supervised learning approaches,
such as those seen in Chapters 6 and 7, in terms of prediction
accuracy. Hence in this chapter we also introduce bagging, random forests,
boosting, and Bayesian additive regression trees. Each of these approaches
involves producing multiple trees which are then combined to yield a single
consensus prediction. We will see that combining a large number of trees
can often result in dramatic improvements in prediction accuracy, at the
expense of some loss in interpretability.

#### The Basics of Decision Trees

Decision trees can be applied to both regression and classification problems.
We first consider regression problems, and then move on to classification.

##### Regression Trees

In order to motivate regression trees, we begin with a simple example.

###### Predicting Baseball Players’ Salaries Using Regression Trees

We use the Hitters data set to predict a baseball player’s Salary based on
Years (the number of years that he has played in the major leagues) and
Hits (the number of hits that he made in the previous year). We first remove
observations that are missing Salary values, and log-transform Salary so
that its distribution has more of a typical bell-shape. (Recall that Salary
is measured in thousands of dollars.)

Figure 8.1 shows a regression tree fit to this data. It consists of a series
of splitting rules, starting at the top of the tree. The top split assigns
observations having Years<4.5 to the left branch. The predicted salary
for these players is given by the mean response value for the players in
the data set with Years<4.5. For such players, the mean log salary is 5.107,
and so we make a prediction of $ e^{5.107} $ thousands of dollars, i.e. \$165,174, for
these players. Players with Years>=4.5 are assigned to the right branch, and
then that group is further subdivided by Hits. Overall, the tree stratifies
or segments the players into three regions of predictor space: players who
have played for four or fewer years, players who have played for five or more
years and who made fewer than 118 hits last year, and players who have
played for five or more years and who made at least 118 hits last year. These
three regions can be written as $ R_1 = \{X | \text{Years} < 4.5\} $, $ R_2 = \{X | \text{Years} \geq 4.5, \text{Hits} < 117.5\} $, and $ R_3 = \{X | \text{Years} \geq 4.5, \text{Hits} \geq 117.5\} $. Figure 8.2 illustrates the regions as a function of Years and Hits. The predicted salaries for these
three groups are $ \$1{,}000 \times e^{5.107} = \$165{,}174 $, $ \$1{,}000 \times e^{5.999} = \$402{,}834 $, and
$ \$1{,}000 \times e^{6.740} = \$845{,}346 $, respectively.
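
The three-region tree just described can be written out as a small lookup. The sketch below is only an illustration: the function names are hypothetical, and the cutpoints and leaf means are the rounded values quoted in the text, not values refit from the Hitters data.

```python
import math

# Cutpoints and leaf means (mean log salary, salary in thousands of dollars)
# copied from the text; they are rounded, not refit from the data.
YEARS_CUT = 4.5
HITS_CUT = 117.5
LEAF_MEAN_LOG_SALARY = {"R1": 5.107, "R2": 5.999, "R3": 6.740}


def region(years: float, hits: float) -> str:
    """Return the leaf (R1, R2, or R3) that a player falls into."""
    if years < YEARS_CUT:
        return "R1"  # Hits is not consulted for less experienced players
    return "R2" if hits < HITS_CUT else "R3"


def predicted_salary_dollars(years: float, hits: float) -> float:
    """Back-transform the leaf's mean log salary to dollars."""
    return 1000.0 * math.exp(LEAF_MEAN_LOG_SALARY[region(years, hits)])


print(region(3, 150), round(predicted_salary_dollars(3, 150)))
```

Because the leaf means are quoted to three decimals, the back-transformed values for R2 and R3 can differ slightly from the dollar figures given in the text.
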

In keeping with the tree analogy, the regions $ R_1 $, $ R_2 $, and $ R_3 $ are known as
terminal nodes or leaves of the tree. As is the case for Figure 8.1, decision
trees are typically drawn upside down, in the sense that the leaves are at
the bottom of the tree. The points along the tree where the predictor space
is split are referred to as internal nodes. In Figure 8.1, the two internal
nodes are indicated by the text Years<4.5 and Hits<117.5. We refer to the
segments of the trees that connect the nodes as branches.

We might interpret the regression tree displayed in Figure 8.1 as follows:
Years is the most important factor in determining Salary, and players with
less experience earn lower salaries than more experienced players. Given
that a player is less experienced, the number of hits that he made in the
previous year seems to play little role in his salary. But among players who
have been in the major leagues for five or more years, the number of hits
made in the previous year does affect salary, and players who made more
hits last year tend to have higher salaries. The regression tree shown in
Figure 8.1 is likely an over-simplification of the true relationship between
Hits, Years, and Salary. However, it has advantages over other types of
regression models (such as those seen in Chapters 3 and 6): it is easier to
interpret, and has a nice graphical representation.

###### Prediction via Stratification of the Feature Space

We now discuss the process of building a regression tree. Roughly speaking,
there are two steps.

1. We divide the predictor space, that is, the set of possible values
   for $ X_1, X_2, \ldots, X_p $, into $ J $ distinct and non-overlapping regions,
   $ R_1, R_2, \ldots, R_J $.
2. For every observation that falls into the region $ R_j $, we make the same
   prediction, which is simply the mean of the response values for the
   training observations in $ R_j $.

For instance, suppose that in Step 1 we obtain two regions, $ R_1 $ and $ R_2 $, and that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20. Then for a given observation $ X = x $, if $ x \in R_1 $ we will predict a value of 10, and if $ x \in R_2 $ we will predict a value of 20.

We now elaborate on Step 1 above. How do we construct the regions $ R_1, \ldots, R_J $? In theory, the regions could have any shape. However, we choose to divide the predictor space into high-dimensional rectangles, or boxes, for simplicity and for ease of interpretation of the resulting predictive model. The goal is to find boxes $ R_1, \ldots, R_J $ that minimize the RSS, given by

$$
\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,
$$

where $ \hat{y}_{R_j} $ is the mean response for the training observations within the $ j $th box. Unfortunately, it is computationally infeasible to consider every possible partition of the feature space into $ J $ boxes. For this reason, we take a top-down, greedy approach that is known as recursive binary splitting. The approach is top-down because it begins at the top of the tree (at which point all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down on the tree. It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
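
To make the RSS criterion above concrete, here is a minimal sketch that computes the region means and the RSS for a partition that has already been chosen. The helper names and the toy data are made up for illustration; the toy responses are picked so that the two region means are 10 and 20, as in the earlier example.

```python
from collections import defaultdict


def region_means(labels, y):
    """Mean response of the training observations in each region."""
    groups = defaultdict(list)
    for label, yi in zip(labels, y):
        groups[label].append(yi)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def partition_rss(labels, y):
    """Sum over regions of squared deviations from that region's mean."""
    means = region_means(labels, y)
    return sum((yi - means[label]) ** 2 for label, yi in zip(labels, y))


# Toy data (hypothetical): labels[i] is the region containing observation i.
labels = ["R1", "R1", "R1", "R2", "R2"]
y = [9.0, 10.0, 11.0, 18.0, 22.0]

print(region_means(labels, y))   # {'R1': 10.0, 'R2': 20.0}
print(partition_rss(labels, y))  # 10.0
```

Evaluating this quantity for one partition is cheap; the difficulty described above lies in the number of candidate partitions, not in the RSS itself.
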

In order to perform recursive binary splitting, we first select the predictor $ X_j $ and the cutpoint $ s $ such that splitting the predictor space into the regions $ \{X | X_j < s\} $ and $ \{X | X_j \geq s\} $ leads to the greatest possible reduction in RSS. (The notation $ \{X | X_j < s\} $ means the region of predictor space in which $ X_j $ takes on a value less than $ s $.) That is, we consider all predictors $ X_1, \ldots, X_p $, and all possible values of the cutpoint $ s $ for each of the predictors, and then choose the predictor and cutpoint such that the resulting tree has the lowest RSS. In greater detail, for any $ j $ and $ s $, we define the pair of half-planes

$$
R_1(j,s) = \{X | X_j < s\} \quad \text{and} \quad R_2(j,s) = \{X | X_j \geq s\},
$$

and we seek the values of $ j $ and $ s $ that minimize the expression

$$
\sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,
$$

where $ \hat{y}_{R_1} $ is the mean response for the training observations in $ R_1(j,s) $, and $ \hat{y}_{R_2} $ is the mean response for the training observations in $ R_2(j,s) $. Finding the values of $ j $ and $ s $ that minimize the above expression can be done quite quickly, especially when the number of features $ p $ is not too large.
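
The search over $ j $ and $ s $ can be sketched as a brute-force loop. This is an illustrative sketch, not an efficient or canonical implementation: the function names are hypothetical, and taking candidate cutpoints at midpoints between consecutive distinct values is an assumption (one consistent with cutpoints such as 4.5 and 117.5 for integer-valued predictors).

```python
def rss(values):
    """Residual sum of squares around the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, total_rss) for the split X_j < s with the lowest total RSS.

    X is a list of rows, each a list of p numeric features; y holds the responses.
    Returns None when no predictor takes two distinct values.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Assumed convention: try midpoints between consecutive distinct values,
        # so both sides of every candidate split are non-empty.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]    # R_1(j, s)
            right = [yi for row, yi in zip(X, y) if row[j] >= s]  # R_2(j, s)
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, s, total)
    return best


# Toy data (hypothetical): the response jumps when feature 0 passes 2.5.
X = [[1, 5], [2, 3], [3, 8], [4, 1]]
y = [1.0, 1.0, 5.0, 5.0]
print(best_split(X, y))  # (0, 2.5, 0.0)
```

This version recomputes both sums from scratch for every candidate, which keeps it short; a faster search would sort each predictor once and update running sums as the cutpoint moves.
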

Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions. Again, we look to split one of these three regions further, so as to minimize the RSS. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five observations.

Once the regions $ R_1, \ldots, R_J $ have been created, we predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs. A five-region example of this approach is shown in Figure 8.3.
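
Putting the pieces together, recursive binary splitting with the example stopping rule (no region with more than five observations) might look like the sketch below. All names are hypothetical and the midpoint cutpoints are an assumption. The code splits each region recursively instead of choosing one region at a time; because this stopping rule depends only on each region's own size, the two orders should arrive at the same final tree.

```python
def mean(values):
    return sum(values) / len(values)


def rss(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y):
    """Best single split as (total_rss, j, s), or None if no predictor varies."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2  # assumed convention: midpoint cutpoints
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best


def grow(X, y, max_leaf_size=5):
    """Split greedily until no region holds more than max_leaf_size observations."""
    split = best_split(X, y) if len(y) > max_leaf_size else None
    if split is None:
        # Terminal node: small enough, or no predictor can separate the rows.
        return {"prediction": mean(y), "n": len(y)}
    _, j, s = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= s]
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow([r for r, _ in left], [v for _, v in left], max_leaf_size),
        "right": grow([r for r, _ in right], [v for _, v in right], max_leaf_size),
    }


def predict(tree, x):
    """Follow the splits down to a leaf and return that leaf's mean response."""
    while "prediction" not in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["cutpoint"] else tree["right"]
    return tree["prediction"]


# Toy data (hypothetical): one predictor, with a step in the response at 5.5.
X = [[i] for i in range(12)]
y = [0.0] * 6 + [10.0] * 6
tree = grow(X, y)
print(tree["feature"], tree["cutpoint"], predict(tree, [2]), predict(tree, [9]))
```

Note that the first split is chosen purely by its immediate RSS reduction, which is the greedy behavior described earlier; nothing in `grow` looks ahead to later splits.
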


###### Tree Pruning


The process described above may produce good predictions on the training
set, but is likely to overfit the data, leading to poor test set performance.
This is because the resulting tree might be too complex. A smaller tree
