grid_max_entropy need better error message for unfinalized parameters #99

Closed
SewerynGrodny opened this issue Feb 21, 2020 · 4 comments

@SewerynGrodny

Hi,
Thanks for the great tidymodels packages. (Great job!)
While training random forest models, I encountered an issue with tune grid and parameters. It seems that mtry() is not supported (case 2 in the code below).
There is also a minor problem with the show_best function, which throws an error if there are NA values in .metric.

Best
Sewe

Reproducible example

library(tidymodels)
cars_split = initial_split(mtcars)

car_recipe = recipe(mpg ~., data = training(cars_split)) %>% 
  step_center(all_numeric()) %>% 
  prep()

cars_cv_folds <- training(cars_split) %>% 
  bake(car_recipe, new_data = .) %>%
  vfold_cv(v = 5)


#case 1
# model
rf_model_cars = rand_forest(
  mode = "regression",
  min_n = tune()
) %>% 
  set_engine("ranger")

#params
rf_params_cars = parameters(min_n())
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 20)

# tune
rf_stage_1_cv_results_tbl_oto = tune_grid(
  formula = mpg ~.,
  model = rf_model_cars,
  resamples = cars_cv_folds,
  grid = rf_grid_cars,
  metrics = metric_set(mae, mape, rmse, rsq),
  control = control_grid(verbose = TRUE)
)
# error because of NA
rf_stage_1_cv_results_tbl_oto %>% show_best()

rf_stage_1_cv_results_tbl_oto %>% unnest(.metrics) %>% 
  filter(.metric == "rsq")

# case 2
rf_model_cars = rand_forest(
  mode = "regression",
  min_n = tune(),
  mtry = tune()
) %>% 
  set_engine("ranger")

#params
rf_params_cars = parameters(mtry(), min_n())
#error 
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 20)
 
@topepo
Member

topepo commented Feb 24, 2020

mtry depends on the number of predictor columns, so the upper bound of its range cannot be set in advance. The finalize() method can set it if you pass in the predictors:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────── tidymodels 0.0.4 ──
#> ✓ broom     0.5.4     ✓ recipes   0.1.9
#> ✓ dials     0.0.4     ✓ rsample   0.0.5
#> ✓ dplyr     0.8.4     ✓ tibble    2.1.3
#> ✓ ggplot2   3.2.1     ✓ tune      0.0.1
#> ✓ infer     0.5.1     ✓ workflows 0.1.0
#> ✓ parsnip   0.0.5     ✓ yardstick 0.0.5
#> ✓ purrr     0.3.3
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard()    masks scales::discard()
#> x dplyr::filter()     masks stats::filter()
#> x dplyr::lag()        masks stats::lag()
#> x ggplot2::margin()   masks dials::margin()
#> x recipes::step()     masks stats::step()
#> x recipes::yj_trans() masks scales::yj_trans()

rf_params_cars = parameters(mtry(), min_n())
rf_params_cars
#> Collection of 2 parameters for tuning
#> 
#>     id parameter type object class
#>   mtry           mtry    nparam[?]
#>  min_n          min_n    nparam[+]
#> 
#> Parameters needing finalization:
#>    # Randomly Selected Predictors ('mtry')
#> 
#> See `?dials::finalize` or `?dials::update.parameters` for more information.

rf_params_cars <- 
  rf_params_cars %>% 
  update(mtry = finalize(mtry(), mtcars %>% select(-mpg)))
rf_params_cars
#> Collection of 2 parameters for tuning
#> 
#>     id parameter type object class
#>   mtry           mtry    nparam[+]
#>  min_n          min_n    nparam[+]

set.seed(131)
rf_grid_cars = grid_max_entropy(rf_params_cars, size = 3)
rf_grid_cars
#> # A tibble: 3 x 2
#>    mtry min_n
#>   <int> <int>
#> 1     4    34
#> 2     9    21
#> 3     2    16

Created on 2020-02-24 by the reprex package (v0.3.0)

We need a better error message though.

(edit - hit wrong key)
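
As an illustration of what a clearer failure could look like, here is a minimal sketch of a guard that runs before the grid is built. The helper name check_finalized is hypothetical and not part of dials. The sketch assumes that dials exports has_unknowns() for individual parameter objects and that finalize() accepts a whole parameter set, as in current releases.

```r
library(dials)

# Hypothetical helper (not part of dials): stop early and name the
# parameters whose ranges still contain unknown() values.
check_finalized <- function(param_set) {
  needs_finalizing <- vapply(param_set$object, has_unknowns, logical(1))
  if (any(needs_finalizing)) {
    stop(
      "These parameters need finalization before a grid can be made: ",
      paste(param_set$id[needs_finalizing], collapse = ", "),
      ". See ?dials::finalize.",
      call. = FALSE
    )
  }
  invisible(param_set)
}

rf_params_cars <- parameters(mtry(), min_n())

# Fails with the message above, because mtry() has an unknown upper bound.
try(check_finalized(rf_params_cars))

# finalize() can also take the whole parameter set plus the predictors
# (here: mtcars without its first column, mpg).
rf_params_final <- finalize(rf_params_cars, mtcars[, -1])
check_finalized(rf_params_final)
```

Calling the guard right before grid_max_entropy() would surface the parameter name instead of a low-level error.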

@topepo
Member

topepo commented Feb 24, 2020

I'm going to move this to dials and update the title.

@topepo transferred this issue from tidymodels/tune Feb 24, 2020
@topepo changed the title from "mtry() parameters not supported , show best NA" to "grid_max_entropy need better error message for unfinalized parameters" Feb 24, 2020
@topepo
Member

topepo commented Feb 24, 2020

"There is also a minor problem with the show_best function, which throws an error if there are NA values in .metric."

The NA values are explained by the message (and by the entries in the .notes column):

> rf_stage_1_cv_results_tbl_oto$.notes[[5]]$.notes
[1] "internal: A correlation computation is required, but `estimate` is constant 
and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be 
returned."

This happens when a model predicts the same value for all samples.
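
The effect can be reproduced in base R without fitting a model. This is a minimal sketch with made-up numbers, relying on the fact that the rsq metric in yardstick is the squared correlation between truth and estimate:

```r
# Made-up outcome values and a "model" that predicts the same value everywhere.
truth <- c(21.0, 22.8, 18.7, 14.3)
constant_pred <- rep(mean(truth), length(truth))

# The standard deviation of the predictions is zero, so the correlation
# (and therefore its square) is undefined; cor() warns and returns NA.
r <- suppressWarnings(cor(truth, constant_pred))
is.na(r)
```

With a very small training set and a large min_n, a random forest can collapse to a single prediction like this, which would produce the NA entries for rsq.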

The main error in the code was the missing metric argument:

> rf_stage_1_cv_results_tbl_oto %>% show_best()
Error in check_metric_choice(metric, maximize) : 
  argument "metric" is missing, with no default
> rf_stage_1_cv_results_tbl_oto %>% show_best(metric = "rmse", maximize = FALSE)
# A tibble: 5 x 6
  min_n .metric .estimator  mean     n std_err
  <int> <chr>   <chr>      <dbl> <int>   <dbl>
1     2 rmse    standard    2.06     5   0.292
2     5 rmse    standard    2.24     5   0.286
3     7 rmse    standard    2.49     5   0.271
4     9 rmse    standard    2.65     5   0.247
5    11 rmse    standard    2.94     5   0.204

@topepo closed this as completed in de2bcfd Feb 24, 2020
@github-actions

github-actions bot commented Mar 6, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions bot locked and limited conversation to collaborators Mar 6, 2021