Although mtry finalized tune_bayes does not work (tune_grid does). #432

Closed
jacekkotowski opened this issue Nov 18, 2021 · 12 comments

Comments

@jacekkotowski

The problem

I'm having trouble with tune_bayes() tuning xgboost parameters. Without tuning mtry the function works. After mtry is added to the parameter list and then finalized I can tune with tune_grid and random parameter selection without problems. tune_bayes throws an error.

Reproducible example

doParallel::registerDoParallel()

xgboost_set <-
  parameters(bike_rf_wkfl) %>%
  update(mtry = finalize(mtry(), bike_training)) 
#> Error in parameters(bike_rf_wkfl) %>% update(mtry = finalize(mtry(), bike_training)): could not find function "%>%"
##  or entered by hand 
# xgboost_set <-
#   parameters(bike_rf_wkfl) %>%
#   update(mtry = mtry(c(2,8)))


# this will work
bike_rf_initial <-
  bike_rf_wkfl %>%
  tune_grid(
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = bike_metrics,
    grid = 9
  )
#> Error in bike_rf_wkfl %>% tune_grid(resamples = bike_folds, param_info = xgboost_set, : could not find function "%>%"

# this will not work
bike_rf_rs <-
  bike_rf_wkfl %>%
    tune_bayes( initial = 9,                      
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = metric_set(mape, rsq),
  )
#> Error in bike_rf_wkfl %>% tune_bayes(initial = 9, resamples = bike_folds, : could not find function "%>%"

Created on 2021-11-18 by the reprex package (v2.0.1)

Error message for tune bayes

x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in eval(expr, p) : no loop for break/next, jumping to top level
x Optimization stopped prematurely; returning current results.

My workflow looks like

bike_all<-
  read_csv("dane/train.csv", col_types = cols()) %>% 
  select(- casual, - registered)

# Create data split object
bike_split <- 
  initial_time_split(bike_all, 
                prop = .80)

# Create the training data
bike_training <- bike_split %>% 
  training()

# Create the test data
bike_testing <- bike_split %>% 
  testing()

bike_recipe <- recipe(count ~ ., data = bike_training) %>%
  step_mutate(datetime_hr = as.factor(lubridate::hour(datetime))) %>% 
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  step_log(windspeed, base = 10, offset = 1) %>% 
  update_role("datetime", new_role = "id_variable") %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  {.}

bike_folds <- 
  timetk::time_series_cv(
    bike_training,
    assess = "1 months",
    initial = "11 months",
    skip = "1 months",
    slice_limit = 5, 
    cumulative = TRUE)


rf_model <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  mtry = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  learn_rate = tune(),
    ) %>%
  set_engine('xgboost',
             objective = 'count:poisson') %>%
  set_mode('regression')

# Create workflow
bike_rf_wkfl <- 
  workflow() %>% 
  # Add model
  add_model(rf_model) %>% 
  # Add recipe
  add_recipe(bike_recipe) 


# Create custom metrics function
bike_metrics <- metric_set(mape, rsq)

Created on 2021-11-18 by the reprex package (v2.0.1)

I am using the dataset about bike rental (attached)
train.csv

@juliasilge
Member

Are you able to run this simpler example with tune_bayes() without problems?

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(cells)
set.seed(2369)
tr_te_split <- initial_split(cells %>% select(-case), prop = 3/4)
cell_train <- training(tr_te_split)
cell_test  <- testing(tr_te_split)

set.seed(1697)
folds <- vfold_cv(cell_train, v = 5)

xgb_spec <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  mtry = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  learn_rate = tune(),
) %>%
  set_engine('xgboost') %>%
  set_mode('classification')

cells_wf <- workflow(class ~ ., xgb_spec)

xgb_params <-
  parameters(cells_wf) %>%
  update(mtry = finalize(mtry(), cell_train)) 

doParallel::registerDoParallel()
set.seed(12)
tune_bayes(
    cells_wf,
    resamples = folds,
    param_info = xgb_params,
    initial = 10,
    control = control_bayes(verbose = TRUE)
  )
#> 
#> >  Generating a set of 10 initial parameter results
#> ✓ Initialization complete
#> 
#> Optimizing roc_auc using the expected improvement
#> 
#> ── Iteration 1 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9009 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=7, tree_depth=6, learn_rate=3.67e-07,
#>   loss_reduction=1.18e-10, sample_size=0.982
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results:    roc_auc=0.9062 (+/-0.00703)
#> 
#> ── Iteration 2 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=12, min_n=4, tree_depth=1, learn_rate=0.0856, loss_reduction=8.21e-08,
#>   sample_size=0.673
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.9025 (+/-0.00895)
#> 
#> ── Iteration 3 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=37, min_n=9, tree_depth=2, learn_rate=0.000106, loss_reduction=0.0043,
#>   sample_size=0.967
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.8809 (+/-0.00481)
#> 
#> ── Iteration 4 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=11, min_n=5, tree_depth=7, learn_rate=2.95e-07, loss_reduction=2.8,
#>   sample_size=0.28
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.8971 (+/-0.00756)
#> 
#> ── Iteration 5 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=16, tree_depth=5, learn_rate=3.61e-05, loss_reduction=27.8,
#>   sample_size=0.199
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.8774 (+/-0.00904)
#> 
#> ── Iteration 6 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=7, tree_depth=14, learn_rate=2.11e-06,
#>   loss_reduction=4.26e-09, sample_size=0.991
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.9057 (+/-0.00747)
#> 
#> ── Iteration 7 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=24, min_n=6, tree_depth=15, learn_rate=0.00218,
#>   loss_reduction=6.07e-09, sample_size=0.995
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results:    roc_auc=0.9079 (+/-0.00761)
#> 
#> ── Iteration 8 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=52, min_n=2, tree_depth=1, learn_rate=1e-10, loss_reduction=1.72e-05,
#>   sample_size=0.971
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.5
#> 
#> ── Iteration 9 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=2, min_n=38, tree_depth=6, learn_rate=5.34e-08,
#>   loss_reduction=2.22e-07, sample_size=0.51
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.8711 (+/-0.00751)
#> 
#> ── Iteration 10 ────────────────────────────────────────────────────────────────
#> 
#> i Current best:      roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=3, min_n=12, tree_depth=14, learn_rate=1.62e-07, loss_reduction=20.4,
#>   sample_size=0.741
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results:    roc_auc=0.8862 (+/-0.00756)
#> # Tuning results
#> # 5-fold cross-validation 
#> # A tibble: 55 × 5
#>    splits             id    .metrics           .notes           .iter
#>    <list>             <chr> <list>             <list>           <int>
#>  1 <split [1211/303]> Fold1 <tibble [20 × 10]> <tibble [0 × 1]>     0
#>  2 <split [1211/303]> Fold2 <tibble [20 × 10]> <tibble [0 × 1]>     0
#>  3 <split [1211/303]> Fold3 <tibble [20 × 10]> <tibble [0 × 1]>     0
#>  4 <split [1211/303]> Fold4 <tibble [20 × 10]> <tibble [0 × 1]>     0
#>  5 <split [1212/302]> Fold5 <tibble [20 × 10]> <tibble [0 × 1]>     0
#>  6 <split [1211/303]> Fold1 <tibble [2 × 10]>  <tibble [0 × 1]>     1
#>  7 <split [1211/303]> Fold2 <tibble [2 × 10]>  <tibble [0 × 1]>     1
#>  8 <split [1211/303]> Fold3 <tibble [2 × 10]>  <tibble [0 × 1]>     1
#>  9 <split [1211/303]> Fold4 <tibble [2 × 10]>  <tibble [0 × 1]>     1
#> 10 <split [1212/302]> Fold5 <tibble [2 × 10]>  <tibble [0 × 1]>     1
#> # … with 45 more rows

Created on 2021-11-18 by the reprex package (v2.0.1)

If you are able to run this one OK, can you work on creating a simpler reprex that demonstrates your problem? The reprex you shared has quite a lot going on, with "in the wild" data, several preprocessing steps, specialized CV, etc. It looks to me like the basics you reported (finalized mtry + tune_bayes()) do work as expected, so if you can help us narrow down what is actually causing your problem, that would be great. 👍

@jacekkotowski
Author

jacekkotowski commented Nov 19, 2021

Hello Julia, thanks for the response. I will do my best to help. Your code works without issues.

  1. The example below also runs without problems if tune_grid() is used.
  2. The preprocessing in my example does not cause any problems when I do not tune mtry.
    The preprocessing generates new columns (datetime turned into features, plus one-hot encoding), so the range of mtry is not known before preprocessing takes place. I am aware that if I tune mtry I need to finalize this parameter, i.e. precalculate its range. Once I do that, only tune_grid() works; tune_bayes() does not.
    The specialised preprocessing is Matt Dancho's timetk and modeltime. It produces new columns (features derived from datetime, such as day of week, day of year, etc.) and also produces CV folds for time series. I do not have any feeling this can be the culprit, but who knows. The folds object looks like the folds created for random, non-time-series CV, and it works if I do not tune mtry.
  3. The data is the popular bike rental dataset from Kaggle. In the task I was given, the last 10 days of each month are removed, and the goal is to build a model that can fill in the missing data.
  4. Random grid tuning works. Bayesian tuning throws an error. I have a gut feeling tune_bayes() expects mtry as a standard parameter and not as param_info = ....

Below I threw out all the unnecessary code. train.csv is actually all the data, prior to dividing into training and testing.
If you run the code with the attached data (train.csv, attached at the bottom), it should work until tune_bayes() is reached. Tuning the grid with random search works.

library(tidymodels)
library(tidyverse)
library(modeltime)
library(timetk)

bike_all<-
  read_csv("train.csv", col_types = cols()) %>% 
  select(- casual, - registered)

# Create data split object
bike_split <- 
  initial_time_split(bike_all, 
                     prop = .80)

# Create the training data
bike_training <- bike_split %>% 
  training()

bike_testing <- bike_split %>% 
  testing()

# Create folds

bike_folds <- 
  timetk::time_series_cv(
    bike_training,
    assess = "1 months",
    initial = "11 months",
    skip = "1 months",
    slice_limit = 5, 
    cumulative = TRUE)


# The recipe

bike_recipe <- recipe(count ~ ., data = bike_training) %>%
  step_mutate(datetime_hr = as.factor(lubridate::hour(datetime))) %>% 
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  step_log(windspeed, base = 10, offset = 1) %>% 
  update_role("datetime", new_role = "id_variable") %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  {.}

# Define a model
rf_model <- boost_tree(
  trees = 500,
  tree_depth = tune(),
  min_n = tune(),
  mtry = tune(),      # depends on n of columns!
  loss_reduction = tune(),
  sample_size = tune(),
  learn_rate = tune(),
) %>%
  set_engine('xgboost',
             objective = 'count:poisson') %>%
  set_mode('regression')

# Create workflow
bike_rf_wkfl <- 
  workflow() %>% 
  # Add model
  add_model(rf_model) %>% 
  # Add recipe
  add_recipe(bike_recipe) 

# Create custom metrics function
bike_metrics <- metric_set(mape, rsq)

doParallel::registerDoParallel()


# Update xgboost parameter mtry after preprocessing data

xgboost_set <-
  parameters(bike_rf_wkfl) %>%
  update(mtry = finalize(mtry(), bike_training)) 

##  If entered by hand 
# xgboost_set <-
#   parameters(bike_rf_wkfl) %>%
#   update(mtry = mtry(c(2,8)))

####
# this will work
bike_rf_tune_grid <-
  bike_rf_wkfl %>%
  tune_grid(
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = bike_metrics,
    grid = 9)
#####
    # this will not work (bug reported )
    bike_rf_tune_bayes <-
      bike_rf_wkfl %>%
        tune_bayes( initial = 9,
        resamples = bike_folds,
        param_info = xgboost_set,
        metrics = metric_set(mape, rsq),
      )

#  x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with...
#  Error in eval(expr, p) : no loop for break/next, jumping to top level
#  x Optimization stopped prematurely; returning current results.


######

Created on 2021-11-19 by the reprex package (v2.0.1)
train.csv

@hfrick
Member

hfrick commented Nov 19, 2021

There is still a lot going on in your reprex - could you take another stab at making your example more minimal?

One strategy is to start with the bare minimum and, if that does not produce the error, add elements of your current example back in until it fails. (And then start taking away previously added elements to isolate the element that's causing the error.) Another is to take the current example and remove elements step by step. Julia's reprex, for example, does not include any preprocessing in order to simplify the example.

To make it reproducible to the point that others can just copy+paste and run the code, please either use data provided in a package or point functions like read_csv() to a URL with the data rather than a local file path.

@hfrick hfrick added the reprex needs a minimal reproducible example label Nov 19, 2021
@jacekkotowski
Author

jacekkotowski commented Nov 19, 2021

Hello Julia, Hello Hannah

I have now eliminated timetk and modeltime from the reprex. I took out the time series index and no longer generate features from it, with the exception of weekday, which I turn into a factor and one-hot encode to illustrate the change in the number of columns that makes finalizing mtry necessary.
Folds for CV are selected at random. I tune only the suspicious mtry hyperparameter. There is only a one-hot encoding step (so the number of columns will increase and mtry needs to be "finalized"), which I am doing.

The error persists:

# here is a dropbox link to the data, I hope it should run just by copy paste code and download one file now.
# https://www.dropbox.com/s/myg6fefdfa32yc7/train.csv?dl=0


library(tidymodels)
library(tidyverse)


bike_all<-
  read_csv("train.csv", col_types = cols()) %>% 
  mutate(weekday = lubridate::wday(datetime) %>% as.factor()) %>% 
  select(- casual, - registered, - datetime, - workingday, - season, - holiday)

str(bike_all)

# tibble [10,886 x 7] (S3: tbl_df/tbl/data.frame)
# $ weather  : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
# $ temp     : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
# $ atemp    : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
# $ humidity : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
# $ windspeed: num [1:10886] 0 0 0 0 0 ...
# $ count    : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
# $ weekday  : Factor w/ 7 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...


# Create data split object
bike_split <- 
  initial_split(bike_all, prop = .80)


# Create the training data
bike_training <- bike_split %>% training()


# Create folds

bike_folds <- vfold_cv( bike_training, v = 5)


# The recipe

bike_recipe <- recipe(count ~ . , data = bike_training) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) 


# Define a model
rf_model <- boost_tree(
  mtry = tune()      # depends on n of columns!
  ) %>%
  set_engine('xgboost',
             objective = 'count:poisson') %>%
  set_mode('regression')


# Create workflow
bike_rf_wkfl <- 
  workflow() %>% 
  add_model(rf_model) %>% 
  add_recipe(bike_recipe) 

doParallel::registerDoParallel()


# Update xgboost parameter mtry after preprocessing data

xgboost_set <-
  parameters(bike_rf_wkfl) %>%
  update(mtry = finalize(mtry(), bike_training)) 


# this will work 
bike_rf_tune_grid <-
  bike_rf_wkfl %>%
  tune_grid(
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = metric_set(rsq),
    grid = 9)

    
# this will not work 
bike_rf_tune_bayes <-
      bike_rf_wkfl %>%
        tune_bayes( initial = 9,
        resamples = bike_folds,
        param_info = xgboost_set,
        metrics = metric_set(rsq)
      )

#  x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with...
#  Error in eval(expr, p) : no loop for break/next, jumping to top level
#  x Optimization stopped prematurely; returning current results.


Created on 2021-11-19 by the reprex package (v2.0.1)
train.csv

@jacekkotowski jacekkotowski changed the title Although mtry finalized tune_bayes does not work (tune_grid does). Although mtry finalized tune_bayes does not work (tune_grid does). Reprex simplified and data provided. Nov 21, 2021
@jacekkotowski jacekkotowski changed the title Although mtry finalized tune_bayes does not work (tune_grid does). Reprex simplified and data provided. Although mtry finalized tune_bayes does not work (tune_grid does). Nov 21, 2021
@juliasilge
Member

  • What happens if you remove step_dummy() and the nominal predictor, and make a formula preprocessor like count ~ temp + humidity + windspeed?
  • Can you check out what Hannah suggested so that others can reproducibly use your reprex? When you make a reprex pointing to your own computer as you did here, then we can't run your code to see what is going wrong.

To make it reproducible to the point that others can just copy+paste and run the code, please either use data provided in a package or point functions like read_csv() to a URL with the data rather than a local file path.

@jacekkotowski
Author

Hello Julia

The link is not my computer. It is on dropbox cloud service, available to all who click the link.

The code works „copy paste” with the file in one folder.

The error will not appear if the number of columns does not change in the preprocessing and mtry is not being tuned.

I will correct the link to the data and check the second point tomorrow.

Have a good day

@juliasilge
Member

Just to clarify a bit, what I mean by "your computer" is that I can't run the code as you have shared it. If instead you write out code similar to this or using data provided in a package, folks who are trying to reproduce your problem will be able to run your example. I can't run read_csv("train.csv") on my computer. You might check out this article on "reprex do's and don'ts".

@hfrick
Member

hfrick commented Nov 22, 2021

A reprex without a recipe could look like this. I've also set the range for mtry() manually, so the problem is not finalize() per se.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)

bike_all<-
  read_csv("https://github.com/tidymodels/tune/files/7570089/train.csv", 
           col_types = cols()) %>% 
  mutate(weekday = lubridate::wday(datetime) %>% as.factor()) %>% 
  select(- casual, - registered, - datetime, - workingday, - season, - holiday)

bike_split <- initial_split(bike_all, prop = .80)
bike_training <- bike_split %>% training()
bike_folds <- vfold_cv( bike_training, v = 5)

# Define a model
rf_model <- boost_tree(mtry = tune()) %>%
  set_engine('xgboost', objective = 'count:poisson') %>%
  set_mode('regression')

# Create workflow without a recipe
bike_rf_wkfl <- workflow(count ~ ., rf_model)

doParallel::registerDoParallel()

# Update xgboost parameter mtry 
bike_training # has 6 predictors for count ~ .
#> # A tibble: 8,708 × 7
#>    weather  temp atemp humidity windspeed count weekday
#>      <dbl> <dbl> <dbl>    <dbl>     <dbl> <dbl> <fct>  
#>  1       3  23.0 26.5        83     17.0    202 4      
#>  2       1  18.9 22.7        63      7.00   485 4      
#>  3       1  27.9 31.8        44     24.0    134 1      
#>  4       2  13.1 17.4        70      0      128 2      
#>  5       1  16.4 20.5        35     24.0     51 1      
#>  6       1   8.2  9.85       40     17.0    106 5      
#>  7       1   4.1  5.30       36     15.0     68 3      
#>  8       1  16.4 20.5        40     11.0    206 2      
#>  9       1  29.5 33.3        51     11.0    201 7      
#> 10       1  15.6 19.7        40     13.0    271 2      
#> # … with 8,698 more rows
xgboost_set <- parameters(mtry(range = c(1, 6)))

# this will work 
bike_rf_tune_grid <- 
  bike_rf_wkfl %>%
  tune_grid(
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = metric_set(rsq),
    grid = 9
  )

# this will not work 
bike_rf_tune_bayes <-
  bike_rf_wkfl %>%
  tune_bayes(
    initial = 9,
    resamples = bike_folds,
    param_info = xgboost_set,
    metrics = metric_set(rsq),
    control = control_bayes(verbose = TRUE)
  )
#> 
#> >  Generating a set of 6 initial parameter results
#> ✓ Initialization complete
#> 
#> Optimizing rsq using the expected improvement
#> 
#> ── Iteration 1 ─────────────────────────────────────────────────────────────────
#> 
#> i Current best:      rsq=0.001153 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> ! No remaining candidate models
#> x Halting search
#> Error in eval(expr, p): no loop for break/next, jumping to top level
#> x Optimization stopped prematurely; returning current results.

Created on 2021-11-22 by the reprex package (v2.0.1)

@hfrick
Member

hfrick commented Nov 22, 2021

@juliasilge when I run it interactively I get a bit more output than in the reprex which makes me wonder if this has anything to do with mtry directly

> bike_rf_tune_bayes <-
+   bike_rf_wkfl %>%
+   tune_bayes(
+     initial = 9,
+     resamples = bike_folds,
+     param_info = xgboost_set,
+     metrics = metric_set(rsq),
+     control = control_bayes(verbose = TRUE)
+   )

>  Generating a set of 6 initial parameter results
! Fold2: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold3: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold4: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold5: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
✓ Initialization complete

Optimizing rsq using the expected improvement

── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

i Current best:		rsq=0.0006524 (@iter 0)
i Gaussian process model
✓ Gaussian process model
! No remaining candidate models
x Halting search
Error in eval(expr, p) : no loop for break/next, jumping to top level

@hfrick hfrick removed the reprex needs a minimal reproducible example label Nov 22, 2021
@topepo
Member

topepo commented Nov 23, 2021

Some models, especially tree-based models, can produce a constant prediction across all of the samples. This isn't a bug per se; it just shows that there is a configuration of the model that is very poor.

The consequence of a constant prediction is that the R2 value cannot be computed. This is the reason for the "correlation computation is required, but..." warning.
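To make the point above concrete, here is a minimal base R sketch (not code from tune or yardstick) that treats R2 as a squared correlation. The `truth` values are the first few `count` values printed earlier in the thread; `constant_pred` is a hypothetical model output used only for illustration.

```r
# Observed counts (first few values of `count` shown earlier in the thread)
truth <- c(16, 40, 32, 13, 1)

# Hypothetical model output: the same prediction for every row
constant_pred <- rep(mean(truth), length(truth))

# The correlation is undefined because sd(constant_pred) is 0;
# base R returns NA and warns that the standard deviation is zero
r <- suppressWarnings(cor(truth, constant_pred))
r_squared <- r^2

is.na(r_squared)
```

With a constant prediction there is no variation to correlate with, so the metric for that resample is missing rather than merely low.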

Since the GP is optimizing on R2, the data given to the GP has a missing value:

# A tibble: 5 × 7
   mtry .metric .estimator       mean     n   std_err .config             
  <int> <chr>   <chr>           <dbl> <int>     <dbl> <chr>               
1     2 rsq     standard     0.000960     2  0.000341 Preprocessor1_Model1
2     6 rsq     standard     0.000760     3  0.000280 Preprocessor1_Model2
3     4 rsq     standard     0.000960     2  0.000341 Preprocessor1_Model3
4     3 rsq     standard     0.000960     2  0.000341 Preprocessor1_Model4
5     1 rsq     standard   NaN            0 NA        Iter1    

We need to remove these missing values before the model fit (and add a warning), since the missing value is what causes the error. I'll add an issue.

What will happen when they are removed: another GP is fit to the same data after mtry = 1 is sampled and it should be the same GP as the previous one. The remaining candidates are sampled (which does not include mtry = 1). If there are no additional issues computing the next R2 value, some other mtry value will be sampled and the process will move on.
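The removal step described above could look roughly like the following base R sketch. This is only an illustration under stated assumptions, not the change that was made in tune: `drop_missing_results()` is a hypothetical helper, and the data frame copies the `mtry`, `mean`, and `n` columns from the table above.

```r
# Mean R2 per candidate, copied from the table above
mean_stats <- data.frame(
  mtry = c(2L, 6L, 4L, 3L, 1L),
  mean = c(0.000960, 0.000760, 0.000960, 0.000960, NaN),
  n    = c(2L, 3L, 2L, 2L, 0L)
)

# Hypothetical helper: drop candidates whose metric could not be computed
drop_missing_results <- function(stats) {
  missing <- is.na(stats$mean)  # is.na() is TRUE for both NA and NaN
  if (any(missing)) {
    warning(sum(missing), " candidate(s) had a missing metric and were removed.")
  }
  stats[!missing, , drop = FALSE]
}

# Warning suppressed here only to keep the example output quiet
gp_data <- suppressWarnings(drop_missing_results(mean_stats))
gp_data$mtry
```

Only the rows with a computable mean would then be passed on to the Gaussian process fit, so mtry = 1 no longer appears in its training data.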

Two notes:

  • Some of the reprex results above stop with ! No remaining candidate models/x Halting search: these did not generate missing R2 values. However, since there are only 7 possible mtry values and we have tested them all, there is nothing left in the search space.
  • But wait... why didn't their mtry = 1 values produce missing R2 values? Different random numbers were being used since the seed was not set before calling tune_bayes().

tl;dr

It is a problem with the results when mtry = 1 so, in the meantime, don't test that value for these data. Also, set the seed before calling tune_bayes().
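A sketch of that interim workaround, assuming the no-recipe reprex above with its six predictors: the lower bound of the mtry range is moved to 2 so that mtry = 1 is never proposed. The object name `mtry_no_one`, the seed value, and `initial = 3` are arbitrary choices made for illustration; the commented call relies on `bike_rf_wkfl` and `bike_folds` from the reprex.

```r
library(dials)

# Start the range at 2 so that mtry = 1 is never proposed.
# The upper bound of 6 matches the six predictors in the reprex above.
mtry_no_one <- mtry(range = c(2L, 6L))
xgboost_set <- parameters(mtry_no_one)

# Inspect the bounds that the search will use
range_get(mtry_no_one)

# Then, with the workflow and folds from the reprex (requires tidymodels):
# set.seed(432)  # arbitrary value; any fixed seed makes the run repeatable
# bike_rf_wkfl %>%
#   tune_bayes(
#     resamples = bike_folds,
#     param_info = xgboost_set,
#     metrics = metric_set(rsq),
#     initial = 3
#   )
```

Because only five mtry values remain in this range, the search space is still small and can be exhausted after a few iterations.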

@juliasilge
Member

If you would like to install from GitHub via devtools::install_github("tidymodels/tune") and try again @jacekkotowski we believe we have solved the problem here. If you run into further problems, please do check back in! 🙌

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 15, 2021