NOTE: development of h2oparsnip is being integrated into tidymodels as the agua package. New issues and feature requests should be made in the agua repository.
h2oparsnip

h2oparsnip provides a set of wrappers that bind h2o algorithms to the 'parsnip' package.
This package is in early development. Currently, the following h2o algorithms are implemented (a brief specification sketch follows the list):

- `h2o.deeplearning` engine added to the `parsnip::mlp` model specification
- `h2o.gbm` engine added to the `parsnip::boost_tree` model specification
- `h2o.randomForest` engine added to the `parsnip::rand_forest` model specification
- `h2o.glm` engine added to the `multinom_reg`, `logistic_reg` and `linear_reg` model specifications
- `h2o.naiveBayes` engine added to the `naive_Bayes` model specification
- a new model, `automl`
- `h2o.rulefit` engine added to `parsnip::rule_fit`
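Each engine follows the usual parsnip pattern of creating a model specification and then selecting h2o via `set_engine`. A minimal sketch (the model arguments used here are arbitrary illustrations):

```r
library(tidymodels)
library(h2oparsnip)

# random forest specification backed by h2o.randomForest
rf_spec <- rand_forest(mode = "classification", trees = 100) %>%
  set_engine("h2o")

# multilayer perceptron specification backed by h2o.deeplearning
mlp_spec <- mlp(mode = "classification", hidden_units = 10, epochs = 50) %>%
  set_engine("h2o")
```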
The package is not yet on CRAN and can be installed with:
```r
devtools::install_github("stevenpawley/h2oparsnip")
```
The package is currently based on the concept of using h2o as a disposable backend, i.e. h2o acts as a drop-in replacement for the traditionally used 'engines' within the parsnip package. However, performing tasks such as hyperparameter tuning via the 'tune' package will be less efficient when working on a remote cluster than using h2o directly, because data is repeatedly sent back and forth.
h2oparsnip also does not provide any management of the h2o cluster. If many models are run, the available memory within the cluster may be exhausted. Currently this has to be managed using the commands in the h2o package.
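For example, the cluster can be inspected and cleared between runs using the standard h2o commands (a minimal sketch; the memory limit shown is arbitrary):

```r
library(h2o)

h2o.init(max_mem_size = "4G")   # start or connect to a cluster with a memory cap
h2o.clusterInfo()               # inspect cluster memory and node status

# ... fit models ...

h2o.removeAll()                 # remove all frames and models to free memory
h2o.shutdown(prompt = FALSE)    # stop the cluster when finished
```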
Below is an example of using the h2o engine for boosted trees. For boosted trees, hyperparameter tuning of the number of trees is performed using a `multi_predict` method, i.e. the algorithm fits the maximum number of trees in the parameter grid and can then be evaluated at fewer trees without retraining.
```r
library(tidymodels)
library(h2oparsnip)
library(h2o)

h2o.init()

# boosted tree specification with tunable parameters, using the h2o engine
gbm <- boost_tree(mode = "regression", trees = tune(), min_n = tune()) %>%
  set_engine("h2o")

rec <- iris %>%
  recipe(Petal.Length ~ .)

params <- expand.grid(trees = c(10, 20, 30), min_n = c(1, 5, 10))
resamples <- mc_cv(iris, times = 1)

tune_results <- tune_grid(
  object = gbm,
  preprocessor = rec,
  resamples = resamples,
  grid = params,
  metrics = metric_set(rmse)
)
```
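Continuing from the setup above, the `multi_predict` behaviour can be illustrated directly: a model fitted with the largest number of trees can be predicted at smaller tree counts without refitting. This is a sketch, assuming the h2o engine registers a `multi_predict` method for `boost_tree` as described above:

```r
# fit once at the largest number of trees in the grid
gbm_fit <- boost_tree(mode = "regression", trees = 30, min_n = 5) %>%
  set_engine("h2o") %>%
  fit(Petal.Length ~ ., data = iris)

# predict at smaller tree counts without retraining
multi_predict(gbm_fit, new_data = iris, trees = c(10, 20, 30))
```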
A problem with using `tune::tune_grid` is that performance is reduced because the data for every hyperparameter iteration and resample is moved from R to the h2o cluster. To minimize this, the `tune_grid_h2o` function can be used to tune model arguments as a near drop-in replacement:
```r
tune_results <- tune_grid_h2o(
  object = gbm,
  preprocessor = rec,
  resamples = resamples,
  grid = params,
  metrics = metric_set(rmse)
)
```
Currently, `tune_grid_h2o` can only tune model parameters and does not handle recipes with tunable parameters. `tune_grid_h2o` moves the data to the h2o cluster only once, i.e. the complete dataset specified by the `resamples` argument is moved to the cluster, the equivalent h2o frame is split based on the row indices in the resampling object, and the `h2o::h2o.grid` function is used for tuning on the resulting h2o frames. To avoid repeatedly moving predictions back from h2o to R, all metrics are also calculated on the cluster. This restricts the range of metrics to those available in h2o (`tune_grid_h2o` maps yardstick metrics to their h2o equivalents); the available metrics are listed in the `tune_grid_h2o` help documentation. As a result, hyperparameter tuning using `tune_grid_h2o` should perform similarly to using h2o directly.
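For instance, a metric set restricted to metrics with h2o equivalents might look like the following (a sketch; `rmse` is used in the examples above, while `mae` is assumed here to be among the mapped metrics, so check the `tune_grid_h2o` help page for the definitive list):

```r
# yardstick metrics assumed to map to h2o equivalents
reg_metrics <- metric_set(rmse, mae)
```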
Similar to `tune::control_grid`, the behaviour of `tune_grid_h2o` can be configured using `tune_grid_h2o(control = control_h2o())`. This allows the predictions and/or models to be saved (by default they are removed after tuning to avoid cluttering the cluster).
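For example (a sketch; see the `control_h2o` help page for the exact argument names that control whether predictions and models are retained after tuning):

```r
tune_results <- tune_grid_h2o(
  object = gbm,
  preprocessor = rec,
  resamples = resamples,
  grid = params,
  metrics = metric_set(rmse),
  control = control_h2o()  # adjust its arguments to keep predictions and/or models
)
```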