Xiaomeng Ju and Matias Salibian-Barrera, 2020-08-21
This repository contains R code implementing the robust boosting
algorithm described in Ju, X. and Salibian-Barrera, M., "Robust Boosting
for Regression Problems" (to appear in Computational Statistics & Data
Analysis). This method provides an outlier-robust fit for non-parametric
regression models.
The package can be installed as follows:
devtools::install_github("xmengju/RRBoost")
Below we illustrate the use of the package with the airfoil
dataset.
It consists of n = 1503
observations with p = 5
explanatory
variables, and the goal is to predict the response variable y. We load
the data and look at the first few observations:
library(RRBoost)
data(airfoil)
head(airfoil)
##   frequency angle chord_length velocity  thickness       y
## 1       800     0       0.3048     71.3 0.00266337 126.201
## 2      1000     0       0.3048     71.3 0.00266337 125.201
## 3      1250     0       0.3048     71.3 0.00266337 125.951
## 4      1600     0       0.3048     71.3 0.00266337 127.591
## 5      2000     0       0.3048     71.3 0.00266337 127.461
## 6      2500     0       0.3048     71.3 0.00266337 125.571
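As a quick sanity check (our addition, not part of the original illustration), we can confirm the dimensions quoted above:

# 1503 rows: 5 explanatory variables plus the response y
dim(airfoil)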
In order to train our predictor, we split the data set into a training
set (with 60% of the available data), a validation
set and a test
set (both with 20% of the data). We first randomly select the
observations for each of these three sets:
n <- nrow(airfoil)
n0 <- floor( 0.2 * n )
set.seed(123)
idx_test <- sample(n, n0)
idx_train <- sample((1:n)[-idx_test], floor( 0.6 * n ) )
idx_val <- (1:n)[ -c(idx_test, idx_train) ]
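It may also be worth verifying that these index sets form a partition of the data; this small check is our addition, not part of the original workflow:

# the three index sets should be disjoint and together cover all n observations
length(intersect(idx_train, idx_val)) == 0
length(intersect(idx_train, idx_test)) == 0
all(sort(c(idx_train, idx_val, idx_test)) == 1:n)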
We now create the matrices of explanatory variables (x) and vectors of
responses (y) corresponding to this partition.
xx <- airfoil[, -6]
yy <- airfoil$y
xtrain <- xx[ idx_train, ]
ytrain <- yy[ idx_train ]
xval <- xx[ idx_val, ]
yval <- yy[ idx_val ]
xtest <- xx[ idx_test, ]
ytest <- yy[ idx_test ]
The Boost function implements the RRBoost estimator, as well as the
following previously proposed boosting algorithms: L2Boost, LADBoost,
MBoost (Friedman, 2001) and Robloss (Lutz et al., 2008).
There are two choices for the initial fit used in the RRBoost algorithm:
median (a constant initial fit equal to the median of the response
variable in the training set) and LADTree (an L1 tree as proposed in
Breiman et al. (1984)). To construct the L1 initial fit the user can
select the desired maximum depth and minimum leaf size, or use the
function Boost.validation to select these parameters based on the
predictive performance on the validation set.
We now train the RRBoost predictor using the following three initial
fits: median, an LADTree with parameters chosen a priori, and an
LADTree with parameters selected using the validation set. The
performance of the resulting three predictors on the test set will be
measured using the root mean squared (prediction) error (setting the
argument error = "rmse"). Other possible options for this argument
are: aad
(average absolute deviation), trmse
(trimmed root mean
squared (prediction) error) and rrmse
(a robust root mean squared
(prediction) error). For more information see the help page of the
function Boost.
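For reference, the two simplest of these criteria can be sketched directly in base R (illustrative definitions only, not the package's internal code; the exact forms of trmse and rrmse are described in the paper and the help page):

# illustrative sketches of the two simplest error criteria
rmse_fn <- function(y, yhat) sqrt(mean((y - yhat)^2))  # root mean squared prediction error
aad_fn  <- function(y, yhat) mean(abs(y - yhat))       # average absolute deviation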
The depth of the base learners in the boosting algorithm is set with the
argument max_depth (below we use decision stumps: max_depth = 1).
The argument niter
specifies the number of iterations (epochs) to be
used (we set niter = 1000). Predictions (on the supplied test set) and
variable importance scores can be computed by setting make_prediction = TRUE
and cal_imp = TRUE
in the argument control, as we do below.
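The control argument is built with Boost.control. Assuming all of its arguments have default values, calling it without arguments is a quick way to inspect those defaults:

# inspect the default control parameters (make_prediction, cal_imp, etc.)
str(Boost.control())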
For the fit computed with an initial LADTree
with parameters chosen a
priori, we set its maximum depth to max_depth_init = 2
and the minimum
number of observations per node at min_leaf_size_init = 20.
Finally, we also use as initial fit the best LADTree among those
constructed with all possible combinations of maximum depth between 1
and 5 (max_depth_init_set = 1:5) and minimum leaf size equal to 10, 20
or 30 (min_leaf_size_init_set = c(10,20,30)).
model_RRBoost_median = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "RRBoost", error = "rmse",
y_init = "median", max_depth = 1,
niter = 1000,
control = Boost.control(make_prediction = TRUE,
cal_imp = TRUE))
model_RRBoost_LADTree = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "RRBoost", error = "rmse",
y_init = "LADTree", max_depth = 1,
niter = 1000,
control = Boost.control(max_depth_init = 2,
min_leaf_size_init = 20,
make_prediction = TRUE,
cal_imp = TRUE))
model_RRBoost_cv_LADTree = Boost.validation(x_train = xtrain,
y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest,
y_test = ytest,
type = "RRBoost",
error = "rmse",
y_init = "LADTree",
max_depth = 1, niter = 1000,
max_depth_init_set = 1:5,
min_leaf_size_init_set = c(10,20,30),
control = Boost.control(
make_prediction = TRUE,
cal_imp = TRUE))
The parameters of the selected LADTree are stored in the params entry
of the returned object:
print(model_RRBoost_cv_LADTree$params)
##    min_leafs max_depths
## 14        20          5
The predictive performance of each of the fits on the test set is stored
in the value
entry of the returned objects. This is the test error
(using the criterion specified with the error
argument of the Boost
call) evaluated at the early stopping time, which is determined using
the validation set to avoid overfitting (see the paper for more details). In
this example the returned value
is the test rmse
evaluated at the
early stopping time. The user can evaluate more than one prediction
performance measure by passing a vector of strings to the argument error
(e.g.
Boost(..., error=c('rmse', 'rrmse'), ...)
). In that case the returned
value
will be a vector corresponding to those prediction error
measures.
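For instance, a call along the following lines (a sketch that mirrors the median-initialized fit above) would return a named vector with both measures in its value entry:

# request two prediction error measures at once; $value is then a named vector
model_two_errors <- Boost(x_train = xtrain, y_train = ytrain,
                          x_val = xval, y_val = yval,
                          x_test = xtest, y_test = ytest,
                          type = "RRBoost", error = c("rmse", "rrmse"),
                          y_init = "median", max_depth = 1, niter = 1000,
                          control = Boost.control(make_prediction = TRUE))
model_two_errors$value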
The best prediction performance on the test set was obtained by
selecting a deeper initial LADTree:
print(c(median = model_RRBoost_median$value,
LADTree = model_RRBoost_LADTree$value,
cv_LADTree = model_RRBoost_cv_LADTree$value))
##   median  LADTree cv_LADTree
## 4.622946 4.410887   3.443912
The variable importance scores are returned in the var_importance entry.
print(cbind(median = model_RRBoost_median$var_importance,
LADTree = model_RRBoost_LADTree$var_importance,
cv_LADTree = model_RRBoost_cv_LADTree$var_importance))
##                 median   LADTree cv_LADTree
## frequency    3.7844891 3.7814350 5.99593796
## angle        0.0651800 0.3201209 0.08082488
## chord_length 0.9628652 1.0416198 1.87940434
## velocity     0.5609014 0.5983525 0.65163300
## thickness    2.9582819 2.5745679 4.53945203
We note that the top two explanatory variables are consistently found
to be frequency and thickness.
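Since each var_importance entry appears to be a named numeric vector (as the cbind above suggests), the variables can also be ranked for any single fit, for example:

# variables ordered from most to least important for the validated-LADTree fit
sort(model_RRBoost_cv_LADTree$var_importance, decreasing = TRUE)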
We can also separate the process of training the predictor, evaluating
it on a test set, and calculating the variable importance scores. In
this way we have the flexibility to compute predictions on multiple test
sets and calculate variable importance when needed. In this case the
user needs to set save_tree = TRUE in the argument control. This
option includes the trained weak learners in the returned object, which
are needed for calculating predictions and variable importance scores.
The following code illustrates this procedure:
model = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
# note: no x_test / y_test supplied here; predictions are computed separately below
type = "RRBoost", error = "rmse",
y_init = "LADTree", max_depth = 1, niter = 1000,
control = Boost.control(max_depth_init = 2,
min_leaf_size_init = 20,
save_tree = TRUE,
make_prediction = FALSE,
cal_imp = FALSE))
prediction <- cal_predict(model, x_test = xtest, y_test = ytest)
var_importance <- cal_imp_func(model, x_val = xval, y_val= yval)
Sanity check: the results are the same as the ones obtained before:
print(all.equal(prediction$value, model_RRBoost_LADTree$value))
## [1] TRUE
print(all.equal(var_importance, model_RRBoost_LADTree$var_importance))
## [1] TRUE
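Because the trained weak learners are stored in the fitted object, cal_predict can be re-used on any other data set. Purely to illustrate the mechanics (this is not a proper out-of-sample error estimate), we can score the same model on the validation set:

# illustration only: re-use the saved trees to predict on a different data set
prediction_val <- cal_predict(model, x_test = xval, y_test = yval)
prediction_val$value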
Finally, we compare the predictive performance of RRBoost on the test
set with those of the other boosting algorithms implemented in the
package, namely: L2Boost, LADBoost, MBoost, and Robloss.
model_L2Boost = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "L2Boost", error = "rmse",
y_init = "median", max_depth = 1,
niter = 1000,
control = Boost.control(make_prediction = TRUE,
cal_imp = TRUE))
model_LADBoost = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "LADBoost", error = "rmse",
y_init = "median", max_depth = 1,
niter = 1000,
control = Boost.control(make_prediction = TRUE,
cal_imp = TRUE))
model_MBoost = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "MBoost", error = "rmse",
y_init = "median", max_depth = 1,
niter = 1000,
control = Boost.control(make_prediction = TRUE,
cal_imp = TRUE))
model_Robloss = Boost(x_train = xtrain, y_train = ytrain,
x_val = xval, y_val = yval,
x_test = xtest, y_test = ytest,
type = "Robloss", error = "rmse",
y_init = "median", max_depth = 1,
niter = 1000,
control = Boost.control(make_prediction = TRUE,
cal_imp = TRUE))
Note that RRBoost
yields the best predictions on the test set:
print(c( RRBoost = model_RRBoost_cv_LADTree$value,
L2Boost = model_L2Boost$value,
LADBoost = model_LADBoost$value,
MBoost = model_MBoost$value,
Robloss = model_Robloss$value))
##  RRBoost  L2Boost LADBoost   MBoost  Robloss
## 3.443912 4.521834 4.520258 4.483478 4.648868