Following up on @jameslamb's comment here—thank you for being willing to discuss. :)
Some background, for the GitHub archeologists:
lightgbm allows passing many of its arguments with aliases. On the parsnip side, these include both main and engine arguments to `boost_tree()`, including the now-tunable engine argument `num_leaves`. On the lightgbm side, these include both "core" and "control" arguments.

As of now, any aliases supplied to `set_engine()` are passed in the dots of `bonsai::train_lightgbm()` to the dots of `lightgbm::lgb.train()`. lightgbm's machinery takes care of resolving aliases, with some rules that generally prevent silent failures while tuning:
- If a main argument is marked for tuning and its main translation (i.e. lightgbm's non-alias argument name) is supplied as an engine argument, parsnip machinery will throw warnings that the engine argument will be ignored, e.g. for `min_n` -> `min_data_in_leaf`:

  ```
  ! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: min_data_in_leaf
  ...
  ```

- If a main argument is marked for tuning and a lightgbm alias is supplied as an engine argument, we ignore the alias silently. (Note that `bonsai::train_lightgbm()` sets `lgb.train()`'s `verbose` argument to `1L` if one isn't supplied.)

- The scariest issue I'd anticipate is the user not touching the main argument (which will be translated to the main, non-alias `lgb.train()` argument), but setting the alias in `set_engine()`. In that case, the `bonsai::train_lightgbm()` default kicks in, and the user-supplied engine argument is silently ignored in favor of the default supplied as the non-alias lightgbm argument. 🫣
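In base R terms, that last failure mode looks roughly like the following sketch. This is not bonsai's actual code: `train_sketch()` and `resolve_min_data()` are made up for illustration, and the alias list is abbreviated. The point is that a wrapper which fills in a default under the canonical name shadows a user-supplied alias, because the canonical name wins when aliases are resolved:

```r
# Hypothetical sketch of the alias-shadowing behavior (not bonsai's code).
resolve_min_data <- function(params) {
  canonical <- "min_data_in_leaf"
  aliases <- c("min_data_per_leaf", "min_data", "min_child_samples")
  if (canonical %in% names(params)) {
    # canonical name present: any supplied aliases are dropped silently
    params[intersect(names(params), aliases)] <- NULL
  }
  params
}

train_sketch <- function(...) {
  params <- list(...)
  # the wrapper supplies a default under the canonical name
  if (!"min_data_in_leaf" %in% names(params)) {
    params$min_data_in_leaf <- 20
  }
  resolve_min_data(params)
}

train_sketch(min_data_per_leaf = 1)
#> $min_data_in_leaf
#> [1] 20
```

The user's `min_data_per_leaf = 1` is gone; only the default survives, with no message.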
Reprex below.
```r
library(tidymodels)
library(bonsai)
library(testthat)

data("penguins", package = "modeldata")
penguins <- penguins[complete.cases(penguins), ]
penguins_split <- initial_split(penguins)
set.seed(1)
boots <- bootstraps(training(penguins_split), 3)
base_wf <- workflow() %>% add_formula(bill_length_mm ~ .)
```

Marking a main argument for tuning, as usual:
```r
bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_correct <- tune_grid(bt_wf, boots, grid = 3, control = control_grid(save_pred = TRUE))
bt_res_correct
#> # Tuning results
#> # Bootstrap sampling
#> # A tibble: 3 × 5
#>   splits           id         .metrics         .notes           .predictions
#>   <list>           <chr>      <list>           <list>           <list>
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>
```
Marking a main argument for tuning, and supplying its non-alias translation as engine arg:
```r
bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_in_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_both <- tune_grid(bt_wf, boots, grid = 3)
#> ! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
bt_res_both
#> # Tuning results
#> # Bootstrap sampling
#> # A tibble: 3 × 4
#>   splits           id         .metrics         .notes
#>   <list>           <chr>      <list>           <list>
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [3 × 3]>
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [3 × 3]>
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [3 × 3]>
#>
#> There were issues with some computations:
#>
#>   - Warning(s) x9: The following arguments cannot be manually modified and were remo...
#>
#> Run `show_notes(.Last.tune.result)` for more information.
```

Marking a main argument for tuning, and supplying one of its aliases as an engine argument:
```r
set.seed(1)
bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_per_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

bt_res_alias <-
  tune_grid(
    bt_wf, boots, grid = 3,
    control = control_grid(extract = extract_fit_engine, save_pred = TRUE)
  )
```

Note that both parameters end up in the resulting object, though only one is referenced when making predictions:
```r
bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1)
#> # A tibble: 3 × 3
#>   min_n .extracts  .config
#>   <int> <list>     <chr>
#> 1    13 <lgb.Bstr> Preprocessor1_Model1
#> 2    33 <lgb.Bstr> Preprocessor1_Model2
#> 3    25 <lgb.Bstr> Preprocessor1_Model3

lgb_fit <- bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1) %>%
  pull(.extracts) %>%
  `[[`(1)

lgb_fit$params$min_data_in_leaf
#> [1] 13
lgb_fit$params$min_data_per_leaf
#> [1] 1

# all good
expect_equal(
  collect_predictions(bt_res_correct),
  collect_predictions(bt_res_alias)
)

bt_mets_correct <-
  bt_res_correct %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

bt_mets_alias <-
  bt_res_alias %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

# all good
expect_equal(
  bt_mets_correct$.metrics,
  bt_mets_alias$.metrics
)
```

Created on 2022-11-04 with reprex v2.0.2
I think the best approach here would be to raise a warning or error whenever an alias that maps to a main `boost_tree()` argument is supplied, noting that it can be resolved by passing the main argument to `boost_tree()`. Otherwise, passing aliases as engine arguments (i.e. aliases that don't map to main arguments) seems unproblematic to me. Another option is to set `verbose` to a level that allows lightgbm to propagate its own warnings about duplicated aliases whenever any alias is supplied, though this feels like it might obscure `train_lightgbm()`'s role in passing a non-aliased argument. Either way, this requires being able to detect when an alias is supplied.
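For concreteness, here's a sketch of the kind of check I have in mind. `check_engine_args()` and `main_arg_aliases` are hypothetical names, and the alias lists below are abbreviated; a real implementation would need the full mapping:

```r
# Hypothetical mapping: lightgbm's canonical (non-alias) name for each
# boost_tree() main argument -> known lightgbm aliases (abbreviated here).
main_arg_aliases <- list(
  min_data_in_leaf = c("min_data_per_leaf", "min_data", "min_child_samples"),
  learning_rate    = c("shrinkage_rate", "eta")
)

# Warn if any supplied engine argument is an alias of a main argument.
check_engine_args <- function(engine_args, alias_map = main_arg_aliases) {
  for (main in names(alias_map)) {
    hits <- intersect(names(engine_args), alias_map[[main]])
    if (length(hits) > 0) {
      warning(
        "Engine argument(s) ", paste0("`", hits, "`", collapse = ", "),
        " are aliases of `", main, "`. Please pass the corresponding ",
        "main argument to `boost_tree()` instead.",
        call. = FALSE
      )
    }
  }
  invisible(engine_args)
}

check_engine_args(list(min_data_per_leaf = 1))
```

The call above would warn and point the user at `min_data_in_leaf`'s main-argument translation, while aliases that don't map to main arguments pass through untouched.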
A question for you, James, if you're up for it—is there any sort of dictionary we could reference that contains these mappings? A list like the one currently output by `lightgbm:::.PARAMETER_ALIASES()` would be perfect, though that also contains the parameters listed under "Learning Control Parameters".
We could also put that together ourselves—we'd just need the mappings for 8 of them:
```r
library(tidymodels)
library(bonsai)

get_from_env("boost_tree_args") %>%
  filter(engine == "lightgbm")
#> # A tibble: 8 × 5
#>   engine   parsnip        original                func             has_submodel
#>   <chr>    <chr>          <chr>                   <list>           <lgl>
#> 1 lightgbm tree_depth     max_depth               <named list [2]> FALSE
#> 2 lightgbm trees          num_iterations          <named list [2]> TRUE
#> 3 lightgbm learn_rate     learning_rate           <named list [2]> FALSE
#> 4 lightgbm mtry           feature_fraction_bynode <named list [2]> FALSE
#> 5 lightgbm min_n          min_data_in_leaf        <named list [2]> FALSE
#> 6 lightgbm loss_reduction min_gain_to_split       <named list [2]> FALSE
#> 7 lightgbm sample_size    bagging_fraction        <named list [2]> FALSE
#> 8 lightgbm stop_iter      early_stopping_rounds   <named list [2]> FALSE
```

Created on 2022-11-04 with reprex v2.0.2
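If `lightgbm:::.PARAMETER_ALIASES()` keeps its current shape (as I understand it, a named list mapping each main parameter name to a character vector of aliases—that's an internal, so this is an assumption), subsetting it to the eight `original` names above would give us the dictionary. A self-contained sketch, where `alias_listing` stands in for that internal's output with only a few hand-written entries:

```r
# Stand-in for lightgbm:::.PARAMETER_ALIASES(): main name -> aliases.
# Only three illustrative entries shown; alias lists abbreviated.
alias_listing <- list(
  min_data_in_leaf = c("min_data_per_leaf", "min_data", "min_child_samples"),
  bagging_fraction = c("sub_row", "subsample", "bagging"),
  force_row_wise   = character(0)  # a control parameter we don't need
)

# the eight lightgbm names from get_from_env("boost_tree_args") above
main_originals <- c(
  "max_depth", "num_iterations", "learning_rate",
  "feature_fraction_bynode", "min_data_in_leaf", "min_gain_to_split",
  "bagging_fraction", "early_stopping_rounds"
)

# keep only aliases of parameters that map to boost_tree() main arguments
alias_map <- alias_listing[intersect(names(alias_listing), main_originals)]
names(alias_map)
#> [1] "min_data_in_leaf" "bagging_fraction"
```

That filtered `alias_map` is exactly the shape the warning check would need, and it would stay current with lightgbm if built from the real listing rather than hand-written.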