Creating an ensemble of parsnip models that is itself a (tunable) parsnip model #269

@prockenschaub

The problem

I recently tried to implement an ensemble of parsnip models. This is easily done for one particular model (see the example of bagging linear regressions below), but things become more difficult as soon as I want a general model type that can take any parsnip model as its base model. This model should ideally:

  • have its own list of arguments/hyperparameters such as number of resamples or number of features tried at each iteration
  • expose the hyperparameters of its base model for hyperparameter tuning

Another example besides bagging could be to allow for model stacking of parsnip models.
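
To make the stacking idea concrete, here is a minimal sketch under stated assumptions: the two base learners are simply the same `lm` spec on different predictor sets (chosen only for illustration), the meta-learner is another linear regression, and `base_preds()` is a hypothetical helper, not a parsnip function.

```r
library(dplyr)
library(rsample)
library(parsnip)

set.seed(42)
tt    <- initial_split(mtcars)
train <- training(tt)
test  <- testing(tt)

spec <- set_engine(linear_reg(), "lm")

# Two illustrative base learners: the same spec on different predictor sets
base_formulas <- list(p1 = mpg ~ wt + hp, p2 = mpg ~ qsec + am)

# Hypothetical helper: fit every base learner on `fit_data` and return one
# column of predictions per learner for `new_data`
base_preds <- function(fit_data, new_data) {
  as_tibble(lapply(base_formulas, function(f) {
    predict(fit(spec, f, data = fit_data), new_data = new_data)$.pred
  }))
}

# Out-of-fold base predictions, so the meta-learner is not trained on
# predictions for rows the base learners have already seen
folds <- vfold_cv(train, v = 4)
oof <- bind_rows(lapply(folds$splits, function(s) {
  held_out <- assessment(s)
  mutate(base_preds(analysis(s), held_out), mpg = held_out$mpg)
}))

# Meta-learner: a linear regression on the base predictions
meta <- fit(spec, mpg ~ p1 + p2, data = oof)

# For new data, refit the base learners on all training rows and pass
# their predictions to the meta-learner
stacked_pred <- predict(meta, new_data = base_preds(train, test))
```

As with bagging, this is easy for fixed base learners; the open question is how to wrap it in a single model specification.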

Is there interest in adding such models, or an ensemble module, to parsnip/tidymodels?
Does this possibly exist already, or is it really easy to implement and I just missed it?
If yes to the former and no to the latter, do you have any suggestions as to how to best go about this?

Reproducible example

library(dplyr)
library(purrr)
library(rsample)
library(parsnip)

# Implementation of a simple lin-reg bagging routine  ---------------------

# Divide data into train and test and re-sample the training set for bagging
set.seed(42)
train_test <- initial_split(mtcars)
btstrp <- bootstraps(training(train_test), times = 5)

# Randomly choose a subset of variables
btstrp <- btstrp %>% 
  mutate(cols = rerun(n(), sample(2:ncol(mtcars), size = 4, replace = FALSE)))

# Fit a separate linear regression to each sample
lr <- linear_reg() %>% set_engine("lm")
btstrp <- btstrp %>% 
  mutate(fit = map2(splits, cols, ~ fit(lr, mpg ~ ., analysis(.x)[, c(1, .y)])))

btstrp           # Overall collection of models
#> # Bootstrap sampling 
#> # A tibble: 5 x 4
#>   splits          id         cols      fit     
#> * <list>          <chr>      <list>    <list>  
#> 1 <split [24/8]>  Bootstrap1 <int [4]> <fit[+]>
#> 2 <split [24/10]> Bootstrap2 <int [4]> <fit[+]>
#> 3 <split [24/4]>  Bootstrap3 <int [4]> <fit[+]>
#> 4 <split [24/8]>  Bootstrap4 <int [4]> <fit[+]>
#> 5 <split [24/8]>  Bootstrap5 <int [4]> <fit[+]>
btstrp$fit[[1]]  # Fit on one bootstrap sample
#> parsnip model object
#> 
#> Fit time:  0ms 
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)         qsec           am           vs           hp  
#>    38.51572     -0.77183      5.07365      3.20619     -0.05355

# Then make predictions by averaging the predictions of each submodel
btstrp <- btstrp %>% 
  mutate(pred = map(fit, predict, new_data = testing(train_test)))

overall_pred <- btstrp$pred %>% reduce(`+`) %>% `/`(nrow(btstrp))

btstrp$pred[[1]] # prediction from one fit
#> # A tibble: 8 x 1
#>   .pred
#>   <dbl>
#> 1  25.0
#> 2  20.8
#> 3  16.0
#> 4  21.0
#> 5  12.8
#> 6  28.2
#> 7  16.0
#> 8  26.6
overall_pred     # final (averaged) prediction made by the model
#>      .pred
#> 1 21.02806
#> 2 21.79681
#> 3 15.44021
#> 4 22.91096
#> 5 12.17279
#> 6 27.78687
#> 7 14.71942
#> 8 26.54504


# The parsnip model type would then be a wrapper around the above ---------
# For example, the interface could look like this

proposed_model <- 
  ensemble("regression", model = "linear_reg", resamples = 5, mtry = 4) %>% 
  set_engine("lm")

fitted <- fit(proposed_model, mpg ~ ., data = training(train_test)) # returns a model object that essentially contains btstrp$fit
pred   <- predict(fitted, new_data = testing(train_test))           # returns overall_pred

# If the base model has hyperparameters, these should also be exposed for
# tuning, for example via the `tune` package
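
Short of a real parsnip model type, the bagging routine above can at least be generalised to any model specification with two plain functions. This is only a sketch: `bag_fit()` and `bag_predict()` are hypothetical helpers, not part of parsnip, and they assume a regression model whose `predict()` returns a `.pred` column.

```r
library(purrr)
library(rsample)
library(parsnip)

# Hypothetical helper (not a parsnip function): fit `spec` on bootstrap
# resamples of `data`, each restricted to `mtry` randomly chosen predictors
bag_fit <- function(spec, outcome, data, resamples = 5, mtry = 4) {
  predictors <- setdiff(names(data), outcome)
  boots <- bootstraps(data, times = resamples)
  map(boots$splits, function(s) {
    cols <- sample(predictors, size = min(mtry, length(predictors)))
    fit(spec, as.formula(paste(outcome, "~ .")),
        data = analysis(s)[, c(outcome, cols), drop = FALSE])
  })
}

# Hypothetical helper: average the numeric predictions of all submodels
bag_predict <- function(fits, new_data) {
  preds <- map(fits, function(f) predict(f, new_data = new_data)$.pred)
  tibble::tibble(.pred = reduce(preds, `+`) / length(fits))
}

set.seed(42)
tt     <- initial_split(mtcars)
fits   <- bag_fit(set_engine(linear_reg(), "lm"), "mpg", training(tt))
bagged <- bag_predict(fits, testing(tt))
```

Any hyperparameters set on `spec` are passed through to each submodel, but nothing here registers `resamples`, `mtry`, or the base model's arguments as tunable parameters, which is the part that seems to need support inside parsnip.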
