
resolving aliased argument names with lightgbm #53

@simonpcouch


Following up on @jameslamb's comment here—thank you for being willing to discuss. :)


Some background, for the GitHub archeologists:

lightgbm allows passing many of its arguments with aliases. On the parsnip side, these include both main and engine arguments to boost_tree(), including the now-tunable engine argument num_leaves. On the lightgbm side, these include both "core" and "control" arguments.

As of now, any aliases supplied to set_engine are passed in the dots of bonsai::train_lightgbm() to the dots of lightgbm::lgb.train(). lightgbm's machinery takes care of resolving aliases, with some rules that generally prevent silent failures while tuning:

https://github.com/microsoft/LightGBM/blob/e45fc48405e9877138ffb5f7e1fd4c449752d323/R-package/R/utils.R#L176-L181

  • If a main argument is marked for tuning and its main translation (i.e. lightgbm's non-alias argument name) is supplied as an engine arg, parsnip machinery will throw warnings that the engine argument will be ignored. e.g. for min_n -> min_data_in_leaf:
! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: min_data_in_leaf
...
  • If a main argument is marked for tuning and a lightgbm alias is supplied as an engine arg, we ignore the alias silently. (Note that bonsai::train_lightgbm() sets lgb.train()'s verbose argument to 1L if one isn't supplied.)

  • The scariest issue I'd anticipate is the user not touching the main argument (that will be translated to the main, non-alias lgb.train argument), but setting the alias in set_engine(). In that case, the bonsai::train_lightgbm() default kicks in, and the user-supplied engine argument is silently ignored in favor of the default supplied as the non-alias lightgbm argument.🫣
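That last scenario can be sketched in a few lines of base R. This is a hypothetical stand-in for the real merging logic, not bonsai's actual code, and the default value of 20 is illustrative: the point is just that when defaults are filled in under the canonical lightgbm name, a user-supplied alias doesn't block them.

```r
# Hypothetical sketch (not bonsai's code): defaults are merged under the
# canonical lightgbm argument name, so an alias supplied by the user
# doesn't prevent the default from being set.
supply_defaults <- function(user_args) {
  if (is.null(user_args$min_data_in_leaf)) {
    # default kicks in even though the user set the alias
    user_args$min_data_in_leaf <- 20
  }
  user_args
}

args <- supply_defaults(list(min_data_per_leaf = 1))

# lightgbm resolves aliases in favor of the canonical name, so the
# user's alias value (1) is silently ignored in favor of the default:
args$min_data_in_leaf
#> [1] 20
```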

Reprex:
library(tidymodels)
library(bonsai)
library(testthat)

data("penguins", package = "modeldata")

penguins <- penguins[complete.cases(penguins),]
penguins_split <- initial_split(penguins)
set.seed(1)
boots <- bootstraps(training(penguins_split), 3)
base_wf <- workflow() %>% add_formula(bill_length_mm ~ .)

Marking a main argument for tuning, as usual:

bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm") %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_correct <- tune_grid(bt_wf, boots, grid = 3, control = control_grid(save_pred = TRUE))

bt_res_correct
#> # Tuning results
#> # Bootstrap sampling 
#> # A tibble: 3 × 5
#>   splits           id         .metrics         .notes           .predictions
#>   <list>           <chr>      <list>           <list>           <list>      
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>    
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>    
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [0 × 3]> <tibble>

Marking a main argument for tuning, and supplying its non-alias translation as engine arg:

bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_in_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

set.seed(1)
bt_res_both <- tune_grid(bt_wf, boots, grid = 3)
#> ! Bootstrap1: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap1: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap2: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 1/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 2/3: The following arguments cannot be manually modified and were removed: mi...
#> ! Bootstrap3: preprocessor 1/1, model 3/3: The following arguments cannot be manually modified and were removed: mi...

bt_res_both
#> # Tuning results
#> # Bootstrap sampling 
#> # A tibble: 3 × 4
#>   splits           id         .metrics         .notes          
#>   <list>           <chr>      <list>           <list>          
#> 1 <split [249/93]> Bootstrap1 <tibble [6 × 5]> <tibble [3 × 3]>
#> 2 <split [249/93]> Bootstrap2 <tibble [6 × 5]> <tibble [3 × 3]>
#> 3 <split [249/97]> Bootstrap3 <tibble [6 × 5]> <tibble [3 × 3]>
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x9: The following arguments cannot be manually modified and were remo...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

Marking a main argument for tuning, and supplying an alias to tune as engine arg:

set.seed(1)
bt_spec <-
  boost_tree(min_n = tune()) %>%
  set_engine("lightgbm", min_data_per_leaf = 1) %>%
  set_mode("regression")

bt_wf <-
  base_wf %>%
  add_model(bt_spec)

bt_res_alias <- 
  tune_grid(
    bt_wf, boots, grid = 3, 
    control = control_grid(extract = extract_fit_engine, save_pred = TRUE)
  )

Note that both parameters end up in the resulting object, though only one is referenced when making predictions.

bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1) 
#> # A tibble: 3 × 3
#>   min_n .extracts  .config             
#>   <int> <list>     <chr>               
#> 1    13 <lgb.Bstr> Preprocessor1_Model1
#> 2    33 <lgb.Bstr> Preprocessor1_Model2
#> 3    25 <lgb.Bstr> Preprocessor1_Model3

lgb_fit <- bt_res_alias %>%
  pull(.extracts) %>%
  `[[`(1) %>%
  pull(.extracts) %>%
  `[[`(1)

lgb_fit$params$min_data_in_leaf
#> [1] 13
lgb_fit$params$min_data_per_leaf
#> [1] 1

# all good
expect_equal(
  collect_predictions(bt_res_correct),
  collect_predictions(bt_res_alias)
)

bt_mets_correct <- 
  bt_res_correct %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

bt_mets_alias <- 
  bt_res_alias %>%
  select_best("rmse") %>%
  finalize_workflow(bt_wf, parameters = .) %>%
  last_fit(penguins_split)

# all good
expect_equal(
  bt_mets_correct$.metrics,
  bt_mets_alias$.metrics
)

Created on 2022-11-04 with reprex v2.0.2


I think the best approach here would be to raise a warning or error whenever an alias that maps to a main boost_tree() argument is supplied, and note that it can be resolved by passing the value as a main argument to boost_tree(). Otherwise, passing aliases as engine arguments (i.e. ones that don't map to main arguments) seems unproblematic to me. Another option is to set verbose to a level that allows lightgbm to propagate its own prompts about duplicated aliases whenever any alias is supplied, though this feels like it might obscure train_lightgbm()'s role in passing a non-aliased argument. Either way, this requires being able to detect when an alias is supplied.
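The proposed check could look something like the following. check_lightgbm_aliases() is a hypothetical helper (not in bonsai), and the alias vector is illustrative rather than exhaustive, drawn from LightGBM's parameter docs:

```r
# Hypothetical dictionary mapping a main lightgbm argument (one that a
# boost_tree() main argument translates to) to its lightgbm aliases.
# Entries here are illustrative, not exhaustive.
main_arg_aliases <- list(
  min_data_in_leaf = c("min_data_per_leaf", "min_data", "min_child_samples")
)

# Warn when a supplied engine argument is an alias of a main argument's
# canonical lightgbm name.
check_lightgbm_aliases <- function(engine_args) {
  for (main in names(main_arg_aliases)) {
    hits <- intersect(names(engine_args), main_arg_aliases[[main]])
    if (length(hits) > 0) {
      warning(
        hits[1], " is an alias of ", main, "; please pass the ",
        "corresponding main argument to boost_tree() instead.",
        call. = FALSE
      )
    }
  }
  invisible(engine_args)
}

check_lightgbm_aliases(list(min_data_per_leaf = 1))
#> Warning: min_data_per_leaf is an alias of min_data_in_leaf; ...
```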

A question for you, James, if you're up for it—is there any sort of dictionary that we could reference that would contain these mappings? A list like the one currently output by lightgbm:::.PARAMETER_ALIASES() would be perfect, though that one also contains the parameters listed under "Learning Control Parameters".

We could also put that together ourselves—we'd just need the mappings for 8 of them:

library(tidymodels)
library(bonsai)

get_from_env("boost_tree_args") %>%
  filter(engine == "lightgbm")
#> # A tibble: 8 × 5
#>   engine   parsnip        original                func             has_submodel
#>   <chr>    <chr>          <chr>                   <list>           <lgl>       
#> 1 lightgbm tree_depth     max_depth               <named list [2]> FALSE       
#> 2 lightgbm trees          num_iterations          <named list [2]> TRUE        
#> 3 lightgbm learn_rate     learning_rate           <named list [2]> FALSE       
#> 4 lightgbm mtry           feature_fraction_bynode <named list [2]> FALSE       
#> 5 lightgbm min_n          min_data_in_leaf        <named list [2]> FALSE       
#> 6 lightgbm loss_reduction min_gain_to_split       <named list [2]> FALSE       
#> 7 lightgbm sample_size    bagging_fraction        <named list [2]> FALSE       
#> 8 lightgbm stop_iter      early_stopping_rounds   <named list [2]> FALSE

Created on 2022-11-04 with reprex v2.0.2
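A hand-rolled dictionary keyed by the `original` column above might look like this. The alias vectors shown are examples taken from LightGBM's parameter docs; they'd need to be completed and verified (e.g. against lightgbm:::.PARAMETER_ALIASES()) for all 8 entries before relying on them:

```r
# Sketch of a hand-maintained alias dictionary for the main arguments'
# lightgbm translations; alias vectors are illustrative, not exhaustive.
lightgbm_main_aliases <- list(
  min_data_in_leaf = c("min_data_per_leaf", "min_data", "min_child_samples"),
  learning_rate    = c("shrinkage_rate", "eta")
  # ... and so on for the remaining main arguments
)

# Detecting a supplied alias then reduces to a name lookup:
is_alias_of_main <- function(nm) {
  any(vapply(lightgbm_main_aliases, function(a) nm %in% a, logical(1)))
}

is_alias_of_main("min_data_per_leaf")
#> [1] TRUE
is_alias_of_main("max_depth")
#> [1] FALSE
```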
