when mtry = 1 all predictors are used in xgboost engine #461

Closed
joeycouse opened this issue Apr 6, 2021 · 4 comments · Fixed by #499
Labels
bug an unexpected problem or unintended behavior

Comments

@joeycouse

Whenever mtry = 1 in the model spec, the corresponding value of colsample_bytree is not converted to a proportion correctly.
For example, with 8 predictors and mtry = 1, colsample_bytree ends up as 1 when it should be 1/8 = 0.125.

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(mlbench)
library(xgboost)
#> 
#> Attaching package: 'xgboost'
#> The following object is masked from 'package:dplyr':
#> 
#>     slice
library(reprex)


data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes %>%
  mutate(diabetes = fct_relevel(diabetes, 'pos'))

# Model with mtry = 1, colsample_bytree should be 1/8 = 0.125

tidy_model <- 
  boost_tree(trees = 10,
             mtry = 1,
             tree_depth = 3) %>%
  set_engine('xgboost', 
             eval_metric = 'auc',
             event_level = "first",
             verbose = 1) %>%
  set_mode('classification')

set.seed(24)

tidy_model_fitted <- tidy_model %>% 
  fit(diabetes ~ . , data = df)
#> [1]  training-auc:0.834787 
#> [2]  training-auc:0.853881 
#> [3]  training-auc:0.873048 
#> [4]  training-auc:0.875000 
#> [5]  training-auc:0.882873 
#> [6]  training-auc:0.891716 
#> [7]  training-auc:0.896067 
#> [8]  training-auc:0.900414 
#> [9]  training-auc:0.904313 
#> [10] training-auc:0.906937

# Model with mtry = 2, colsample_bytree should be 2/8 = 0.25
tidy_model_2 <- 
  boost_tree(trees = 10,
             mtry = 2,
             tree_depth = 3) %>%
  set_engine('xgboost', 
             eval_metric = 'auc',
             event_level = "first",
             verbose = 1) %>%
  set_mode('classification')

set.seed(24)
tidy_model_fitted_2 <- tidy_model_2 %>% 
  fit(diabetes ~ . , data = df)
#> [1]  training-auc:0.684049 
#> [2]  training-auc:0.776993 
#> [3]  training-auc:0.802187 
#> [4]  training-auc:0.808041 
#> [5]  training-auc:0.830478 
#> [6]  training-auc:0.837851 
#> [7]  training-auc:0.847817 
#> [8]  training-auc:0.847254 
#> [9]  training-auc:0.850034 
#> [10] training-auc:0.856455

# colsample_bytree should be 1/8 = 0.125; instead the model is using all predictors
tidy_model_fitted$fit$call$params$colsample_bytree
#> [1] 1

# colsample_bytree is correct: 2/8 = 0.25
tidy_model_fitted_2$fit$call$params$colsample_bytree
#> [1] 0.25

Created on 2021-04-06 by the reprex package (v1.0.0)

@juliasilge
Member

I think the problem is here:

parsnip/R/boost_tree.R

Lines 366 to 371 in 30dc026

if (colsample_bytree > 1) {
  colsample_bytree <- colsample_bytree/p
}
if (colsample_bytree > 1) {
  colsample_bytree <- 1
}

The conversion skips the colsample_bytree == 1 case; it assumes a value of 1 means "use all the predictors".

  • How would colsample_bytree get to this function as a proportion and not a count? (I can't find a way.)
  • If it is possible, is there a way to disambiguate this?
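
For illustration only, here is a minimal sketch of one way the count-to-proportion conversion could be made explicit. This is a hypothetical helper, not the parsnip code and not the eventual fix in #499; the function name convert_colsample() and the counts flag are assumptions.

# Hypothetical helper (not parsnip source): convert an mtry-style value to the
# proportion that xgboost expects for colsample_bytree / colsample_bynode.
convert_colsample <- function(value, p, counts = TRUE) {
  if (counts) {
    # `value` is a number of predictors, so mtry = 1 with p = 8 becomes 0.125
    value <- value / p
  }
  # clamp to the (0, 1] range that xgboost accepts
  min(max(value, 1 / p), 1)
}

convert_colsample(1, p = 8)
#> [1] 0.125
convert_colsample(2, p = 8)
#> [1] 0.25
convert_colsample(0.5, p = 8, counts = FALSE)
#> [1] 0.5

With an explicit counts flag, a value of 1 is unambiguous: as a count it means "one predictor" and is divided by p, while as a proportion it means "use all the predictors".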

@juliasilge juliasilge added the bug an unexpected problem or unintended behavior label May 6, 2021
@juliasilge
Member

This problem will now apply to both colsample_bytree and colsample_bynode.
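
As a rough illustration of that scope (the engine-argument usage here is an assumption, not taken from the fix): mtry appears to end up driving colsample_bynode (as in the follow-up reprex below), while colsample_bytree can be supplied as an engine argument, so a spec like this one would need the count-to-proportion conversion in both places.

library(tidymodels)

# Hypothetical spec, for illustration only: both column-sampling values are
# given as predictor counts, so both would need converting to proportions.
spec_both <- boost_tree(trees = 10, mtry = 1, tree_depth = 3) %>%
  set_engine("xgboost", colsample_bytree = 2) %>%
  set_mode("classification")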

topepo added a commit that referenced this issue May 20, 2021
@juliasilge
Member

Results now:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(mlbench)

data("PimaIndiansDiabetes")
df <- PimaIndiansDiabetes %>%
  mutate(diabetes = forcats::fct_relevel(diabetes, 'pos'))


tidy_model <- 
  boost_tree(trees = 10,
             mtry = 1,
             tree_depth = 3) %>%
  set_engine('xgboost', 
             eval_metric = 'auc',
             event_level = "first",
             verbose = 1) %>%
  set_mode('classification')

set.seed(24)

tidy_model_fitted <- tidy_model %>% 
  fit(diabetes ~ . , data = df)
#> [1]  training-auc:0.731097 
#> [2]  training-auc:0.821153 
#> [3]  training-auc:0.836787 
#> [4]  training-auc:0.854131 
#> [5]  training-auc:0.866493 
#> [6]  training-auc:0.876743 
#> [7]  training-auc:0.878963 
#> [8]  training-auc:0.879940 
#> [9]  training-auc:0.884306 
#> [10] training-auc:0.887933

# colsample_bynode should be 1/8 = 0.125
tidy_model_fitted$fit$call$params$colsample_bynode
#> [1] 0.125

Created on 2021-05-21 by the reprex package (v2.0.0)

@github-actions

github-actions bot commented Jun 5, 2021

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 5, 2021