Although mtry is finalized, tune_bayes does not work (tune_grid does) #432
Are you able to run this simpler example?

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
data(cells)
set.seed(2369)
tr_te_split <- initial_split(cells %>% select(-case), prop = 3/4)
cell_train <- training(tr_te_split)
cell_test <- testing(tr_te_split)
set.seed(1697)
folds <- vfold_cv(cell_train, v = 5)
xgb_spec <- boost_tree(
trees = 500,
tree_depth = tune(),
min_n = tune(),
mtry = tune(),
loss_reduction = tune(),
sample_size = tune(),
learn_rate = tune(),
) %>%
set_engine('xgboost') %>%
set_mode('classification')
cells_wf <- workflow(class ~ ., xgb_spec)
xgb_params <-
parameters(cells_wf) %>%
update(mtry = finalize(mtry(), cell_train))
doParallel::registerDoParallel()
set.seed(12)
tune_bayes(
cells_wf,
resamples = folds,
param_info = xgb_params,
initial = 10,
control = control_bayes(verbose = TRUE)
)
#>
#> > Generating a set of 10 initial parameter results
#> ✓ Initialization complete
#>
#> Optimizing roc_auc using the expected improvement
#>
#> ── Iteration 1 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9009 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=7, tree_depth=6, learn_rate=3.67e-07,
#> loss_reduction=1.18e-10, sample_size=0.982
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: roc_auc=0.9062 (+/-0.00703)
#>
#> ── Iteration 2 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=12, min_n=4, tree_depth=1, learn_rate=0.0856, loss_reduction=8.21e-08,
#> sample_size=0.673
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.9025 (+/-0.00895)
#>
#> ── Iteration 3 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=37, min_n=9, tree_depth=2, learn_rate=0.000106, loss_reduction=0.0043,
#> sample_size=0.967
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.8809 (+/-0.00481)
#>
#> ── Iteration 4 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=11, min_n=5, tree_depth=7, learn_rate=2.95e-07, loss_reduction=2.8,
#> sample_size=0.28
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.8971 (+/-0.00756)
#>
#> ── Iteration 5 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=16, tree_depth=5, learn_rate=3.61e-05, loss_reduction=27.8,
#> sample_size=0.199
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.8774 (+/-0.00904)
#>
#> ── Iteration 6 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=9, min_n=7, tree_depth=14, learn_rate=2.11e-06,
#> loss_reduction=4.26e-09, sample_size=0.991
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.9057 (+/-0.00747)
#>
#> ── Iteration 7 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9062 (@iter 1)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=24, min_n=6, tree_depth=15, learn_rate=0.00218,
#> loss_reduction=6.07e-09, sample_size=0.995
#> i Estimating performance
#> ✓ Estimating performance
#> ♥ Newest results: roc_auc=0.9079 (+/-0.00761)
#>
#> ── Iteration 8 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=52, min_n=2, tree_depth=1, learn_rate=1e-10, loss_reduction=1.72e-05,
#> sample_size=0.971
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.5
#>
#> ── Iteration 9 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=2, min_n=38, tree_depth=6, learn_rate=5.34e-08,
#> loss_reduction=2.22e-07, sample_size=0.51
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.8711 (+/-0.00751)
#>
#> ── Iteration 10 ────────────────────────────────────────────────────────────────
#>
#> i Current best: roc_auc=0.9079 (@iter 7)
#> i Gaussian process model
#> ✓ Gaussian process model
#> i Generating 5000 candidates
#> i Predicted candidates
#> i mtry=3, min_n=12, tree_depth=14, learn_rate=1.62e-07, loss_reduction=20.4,
#> sample_size=0.741
#> i Estimating performance
#> ✓ Estimating performance
#> ⓧ Newest results: roc_auc=0.8862 (+/-0.00756)
#> # Tuning results
#> # 5-fold cross-validation
#> # A tibble: 55 × 5
#> splits id .metrics .notes .iter
#> <list> <chr> <list> <list> <int>
#> 1 <split [1211/303]> Fold1 <tibble [20 × 10]> <tibble [0 × 1]> 0
#> 2 <split [1211/303]> Fold2 <tibble [20 × 10]> <tibble [0 × 1]> 0
#> 3 <split [1211/303]> Fold3 <tibble [20 × 10]> <tibble [0 × 1]> 0
#> 4 <split [1211/303]> Fold4 <tibble [20 × 10]> <tibble [0 × 1]> 0
#> 5 <split [1212/302]> Fold5 <tibble [20 × 10]> <tibble [0 × 1]> 0
#> 6 <split [1211/303]> Fold1 <tibble [2 × 10]> <tibble [0 × 1]> 1
#> 7 <split [1211/303]> Fold2 <tibble [2 × 10]> <tibble [0 × 1]> 1
#> 8 <split [1211/303]> Fold3 <tibble [2 × 10]> <tibble [0 × 1]> 1
#> 9 <split [1211/303]> Fold4 <tibble [2 × 10]> <tibble [0 × 1]> 1
#> 10 <split [1212/302]> Fold5 <tibble [2 × 10]> <tibble [0 × 1]> 1
#> # … with 45 more rows
Created on 2021-11-18 by the reprex package (v2.0.1)

If you are able to run this one OK, can you work on creating a simpler reprex that demonstrates your problem? The reprex you shared has quite a lot going on: "in the wild" data, a lot of preprocessing, specialized CV, etc. It looks to me like the basics you reported (a finalized mtry with tune_bayes()) are working here. |
Hello Julia, thanks for the response; I will do my best to help. Your code works without issues.
Below I threw out all the unnecessary code. train.csv is actually all of the data prior to dividing into training and testing.

library(tidymodels)
library(tidyverse)
library(modeltime)
library(timetk)
bike_all<-
read_csv("train.csv", col_types = cols()) %>%
select(- casual, - registered)
# Create data split object
bike_split <-
initial_time_split(bike_all,
prop = .80)
# Create the training data
bike_training <- bike_split %>%
training()
bike_testing <- bike_split %>%
testing()
# Create folds
bike_folds <-
timetk::time_series_cv(
bike_training,
assess = "1 months",
initial = "11 months",
skip = "1 months",
slice_limit = 5,
cumulative = TRUE)
# The recipe
bike_recipe <- recipe(count ~ ., data = bike_training) %>%
step_mutate(datetime_hr = as.factor(lubridate::hour(datetime))) %>%
step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
step_log(windspeed, base = 10, offset = 1) %>%
update_role("datetime", new_role = "id_variable") %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
{.}
# Define a model
rf_model <- boost_tree(
trees = 500,
tree_depth = tune(),
min_n = tune(),
mtry = tune(), # depends on n of columns!
loss_reduction = tune(),
sample_size = tune(),
learn_rate = tune(),
) %>%
set_engine('xgboost',
objective = 'count:poisson') %>%
set_mode('regression')
# Create workflow
bike_rf_wkfl <-
workflow() %>%
# Add model
add_model(rf_model) %>%
# Add recipe
add_recipe(bike_recipe)
# Create custom metrics function
bike_metrics <- metric_set(mape, rsq)
doParallel::registerDoParallel()
# Update xgboost parameter mtry after preprocessing data
xgboost_set <-
parameters(bike_rf_wkfl) %>%
update(mtry = finalize(mtry(), bike_training))
## If entered by hand
# xgboost_set <-
# parameters(bike_rf_wkfl) %>%
# update(mtry = mtry(c(2,8)))
####
# this will work
bike_rf_tune_grid <-
bike_rf_wkfl %>%
tune_grid(
resamples = bike_folds,
param_info = xgboost_set,
metrics = bike_metrics,
grid = 9)
#####
# this will not work (bug reported )
bike_rf_tune_bayes <-
bike_rf_wkfl %>%
tune_bayes( initial = 9,
resamples = bike_folds,
param_info = xgboost_set,
metrics = metric_set(mape, rsq),
)
# x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with...
# Error in eval(expr, p) : no loop for break/next, jumping to top level
# x Optimization stopped prematurely; returning current results.
###### Created on 2021-11-19 by the reprex package (v2.0.1) |
There is still a lot going on in your reprex - could you take another stab at making your example more minimal? One strategy is to start with the bare minimum and, if that does not produce the error, add elements of your current example back in until it fails. (And then start taking away previously added elements to isolate the element that's causing the error.) Another is to take the current example and step-by-step take out elements. Julia's reprex, for example, does not include any preprocessing in order to simplify the example. To make it reproducible to the point that others can just copy+paste and run the code, please either use data provided in a package or point functions like read_csv() to a location that anyone can access. |
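For example, something along these lines would let anyone run the data step of a reprex (a rough sketch; the URL below is only a placeholder, and the cells data is just one option from modeldata):

library(tidymodels)
library(readr)

# Option 1: use data that ships with a package, e.g. the cells data
# from modeldata, so everyone already has it locally
data(cells, package = "modeldata")

# Option 2: point read_csv() at a publicly reachable URL instead of a
# local file (placeholder URL shown here)
bike_all <- read_csv("https://example.com/train.csv", col_types = cols())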
Hello Julia, hello Hannah. I have eliminated the timetk and modeltime usage from the reprex now. I took out the time-series index and do not generate features from it, with the exception of weekday, which I turn into a factor and one-hot encode to illustrate the change in the number of columns, so that I still need to finalize mtry. The error persists.

# here is a dropbox link to the data; I hope it runs just by copy-pasting the code and downloading one file now.
# https://www.dropbox.com/s/myg6fefdfa32yc7/train.csv?dl=0
library(tidymodels)
library(tidyverse)
bike_all<-
read_csv("train.csv", col_types = cols()) %>%
mutate(weekday = lubridate::wday(datetime) %>% as.factor()) %>%
select(- casual, - registered, - datetime, - workingday, - season, - holiday)
str(bike_all)
# tibble [10,886 x 7] (S3: tbl_df/tbl/data.frame)
# $ weather : num [1:10886] 1 1 1 1 1 2 1 1 1 1 ...
# $ temp : num [1:10886] 9.84 9.02 9.02 9.84 9.84 ...
# $ atemp : num [1:10886] 14.4 13.6 13.6 14.4 14.4 ...
# $ humidity : num [1:10886] 81 80 80 75 75 75 80 86 75 76 ...
# $ windspeed: num [1:10886] 0 0 0 0 0 ...
# $ count : num [1:10886] 16 40 32 13 1 1 2 3 8 14 ...
# $ weekday : Factor w/ 7 levels "1","2","3","4",..: 7 7 7 7 7 7 7 7 7 7 ...
# Create data split object
bike_split <-
initial_split(bike_all, prop = .80)
# Create the training data
bike_training <- bike_split %>% training()
# Create folds
bike_folds <- vfold_cv( bike_training, v = 5)
# The recipe
bike_recipe <- recipe(count ~ . , data = bike_training) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
# Define a model
rf_model <- boost_tree(
mtry = tune() # depends on n of columns!
) %>%
set_engine('xgboost',
objective = 'count:poisson') %>%
set_mode('regression')
# Create workflow
bike_rf_wkfl <-
workflow() %>%
add_model(rf_model) %>%
add_recipe(bike_recipe)
doParallel::registerDoParallel()
# Update xgboost parameter mtry after preprocessing data
xgboost_set <-
parameters(bike_rf_wkfl) %>%
update(mtry = finalize(mtry(), bike_training))
# this will work
bike_rf_tune_grid <-
bike_rf_wkfl %>%
tune_grid(
resamples = bike_folds,
param_info = xgboost_set,
metrics = metric_set(rsq),
grid = 9)
# this will not work
bike_rf_tune_bayes <-
bike_rf_wkfl %>%
tune_bayes( initial = 9,
resamples = bike_folds,
param_info = xgboost_set,
metrics = metric_set(rsq)
)
# x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with...
# Error in eval(expr, p) : no loop for break/next, jumping to top level
# x Optimization stopped prematurely; returning current results.
Created on 2021-11-19 by the reprex package (v2.0.1) |
|
Hello Julia. The link is not to my computer; it is on the Dropbox cloud service, available to anyone who clicks it. The code works copy-paste with the file in the same folder. The error will not appear if the number of columns does not change in the preprocessing and mtry is not being tuned. I will correct the link to the data and check the second point tomorrow. Have a good day |
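For what it's worth, a sketch of that second point, reusing bike_recipe and bike_folds from the reprex above (this is only an illustration of the claim, not code taken from the thread):

# Variant without mtry: only min_n is tuned, so nothing needs to be finalized
rf_model_no_mtry <- boost_tree(min_n = tune()) %>%
  set_engine('xgboost', objective = 'count:poisson') %>%
  set_mode('regression')

bike_wkfl_no_mtry <- workflow() %>%
  add_model(rf_model_no_mtry) %>%
  add_recipe(bike_recipe)

# Per the comment above, this call runs without the fit_gp() error
bike_no_mtry_bayes <- tune_bayes(
  bike_wkfl_no_mtry,
  initial = 9,
  resamples = bike_folds,
  metrics = metric_set(rsq)
)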
Just to clarify a bit, what I mean by "your computer" is that I can't run the code as you have shared it. If instead you write out code similar to this, or use data provided in a package, folks who are trying to reproduce your problem will be able to run your example. I can't run read_csv("train.csv") myself because I do not have that file. |
A reprex without a recipe could look like this. I've also set the range for mtry directly, since the number of predictors is known here.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(tidyverse)
bike_all<-
read_csv("https://github.com/tidymodels/tune/files/7570089/train.csv",
col_types = cols()) %>%
mutate(weekday = lubridate::wday(datetime) %>% as.factor()) %>%
select(- casual, - registered, - datetime, - workingday, - season, - holiday)
bike_split <- initial_split(bike_all, prop = .80)
bike_training <- bike_split %>% training()
bike_folds <- vfold_cv( bike_training, v = 5)
# Define a model
rf_model <- boost_tree(mtry = tune()) %>%
set_engine('xgboost', objective = 'count:poisson') %>%
set_mode('regression')
# Create workflow without a recipe
bike_rf_wkfl <- workflow(count ~ ., rf_model)
doParallel::registerDoParallel()
# Update xgboost parameter mtry
bike_training # has 6 predictors for count ~ .
#> # A tibble: 8,708 × 7
#> weather temp atemp humidity windspeed count weekday
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 3 23.0 26.5 83 17.0 202 4
#> 2 1 18.9 22.7 63 7.00 485 4
#> 3 1 27.9 31.8 44 24.0 134 1
#> 4 2 13.1 17.4 70 0 128 2
#> 5 1 16.4 20.5 35 24.0 51 1
#> 6 1 8.2 9.85 40 17.0 106 5
#> 7 1 4.1 5.30 36 15.0 68 3
#> 8 1 16.4 20.5 40 11.0 206 2
#> 9 1 29.5 33.3 51 11.0 201 7
#> 10 1 15.6 19.7 40 13.0 271 2
#> # … with 8,698 more rows
xgboost_set <- parameters(mtry(range = c(1, 6)))
# this will work
bike_rf_tune_grid <-
bike_rf_wkfl %>%
tune_grid(
resamples = bike_folds,
param_info = xgboost_set,
metrics = metric_set(rsq),
grid = 9
)
# this will not work
bike_rf_tune_bayes <-
bike_rf_wkfl %>%
tune_bayes(
initial = 9,
resamples = bike_folds,
param_info = xgboost_set,
metrics = metric_set(rsq),
control = control_bayes(verbose = TRUE)
)
#>
#> > Generating a set of 6 initial parameter results
#> ✓ Initialization complete
#>
#> Optimizing rsq using the expected improvement
#>
#> ── Iteration 1 ─────────────────────────────────────────────────────────────────
#>
#> i Current best: rsq=0.001153 (@iter 0)
#> i Gaussian process model
#> ✓ Gaussian process model
#> ! No remaining candidate models
#> x Halting search
#> Error in eval(expr, p): no loop for break/next, jumping to top level
#> x Optimization stopped prematurely; returning current results.
Created on 2021-11-22 by the reprex package (v2.0.1) |
@juliasilge when I run it interactively I get a bit more output than in the reprex, which makes me wonder whether this has anything to do with mtry directly:

> bike_rf_tune_bayes <-
+ bike_rf_wkfl %>%
+ tune_bayes(
+ initial = 9,
+ resamples = bike_folds,
+ param_info = xgboost_set,
+ metrics = metric_set(rsq),
+ control = control_bayes(verbose = TRUE)
+ )
> Generating a set of 6 initial parameter results
! Fold2: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold3: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold4: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
! Fold5: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 erro...
✓ Initialization complete
Optimizing rsq using the expected improvement
── Iteration 1 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
i Current best: rsq=0.0006524 (@iter 0)
i Gaussian process model
✓ Gaussian process model
! No remaining candidate models
x Halting search
Error in eval(expr, p) : no loop for break/next, jumping to top level |
Some models, especially tree-based models, can produce a constant prediction across all of the samples. This isn't a bug per se; it just shows that there is a configuration of the model that is very poor. The consequence of a constant prediction is that the R2 value cannot be computed. This is the reason for the "correlation computation is required, but..." warning. Since the GP is optimizing on R2, the data given to the GP has a missing value.
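As a small illustration of how the missing value arises (a sketch using yardstick directly, not the actual GP input from the run above):

library(yardstick)
library(tibble)

# A model configuration that predicts the same value for every sample
constant_preds <- tibble(
  truth    = c(10, 20, 30, 40),
  estimate = rep(25, 4)
)

# rsq() is correlation-based; with a zero-variance estimate the
# correlation is undefined, so yardstick warns and returns NA
rsq(constant_preds, truth, estimate)
#> Warning: A correlation computation is required, but `estimate` is constant
#> and has 0 standard deviation, resulting in a divide by 0 error. `NA` will
#> be returned.
#> # A tibble: 1 × 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rsq     standard          NA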
We need to remove these before the model fit (and add a warning) since that causes the error. I'll add an issue. What will happen when they are removed: another GP is fit to the same data after the rows with missing metric values are dropped. Two notes:
tl;dr It is a problem with the results when the performance metric cannot be computed (here, R2 with constant predictions), not with mtry itself. |
If you would like to install from GitHub via |
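Presumably something along the lines of the following, assuming the remotes package (the exact command is an assumption, not quoted from the comment above):

# Install the development version of tune, which contains the fix
remotes::install_github("tidymodels/tune")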
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
The problem
I'm having trouble with tune_bayes() tuning xgboost parameters. Without tuning mtry, the function works. After mtry is added to the parameter list and then finalized, I can tune with tune_grid() and random parameter selection without problems, but tune_bayes() throws an error.
Reproducible example
Created on 2021-11-18 by the reprex package (v2.0.1)
Error message for tune_bayes
x Gaussian process model: Error in fit_gp(mean_stats %>% dplyr::select(-.iter), pset = param_info, : argument is missing, with no default
Error in eval(expr, p) : no loop for break/next, jumping to top level
x Optimization stopped prematurely; returning current results.
My workflow looks like
Created on 2021-11-18 by the reprex package (v2.0.1)
I am using the dataset about bike rental (attached)
train.csv