ranger "classification" mode still looks like "probability" #546

Closed
rmflight opened this issue Aug 25, 2021 · 3 comments

@rmflight

The problem

When using a parsnip model specification and "ranger" classification, it looks like the "ranger" arguments are still "probability" instead of "classification".

Reproducible example

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(modeldata)
library(skimr)
data(cells, package = "modeldata")
library(ranger)
tidymodels_prefer()

cells$case = NULL
set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE)
#> Ranger result
#> 
#> Call:
#>  ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE) 
#> 
#> Type:                             Classification 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error:             17.29 %


# OK, so we have an OOB error of about 0.17, which isn't too shabby!
# Now let's run it via tidymodels.

rf_spec = rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_recipe = recipe(class ~ ., data = cells) %>%
  step_dummy(class, -class)

set.seed(1234)
workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec) %>%
  fit(data = cells)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1198456

# This is important! The fit object from tidymodels is a "Probability estimation" forest, even though we specifically asked for "classification"!
# This leads to a lower reported OOB error.
# It isn't a *lot* lower here, but in other datasets it can make a big difference (e.g. 20% vs 45%).

# Now let's run again using ranger directly with "probability = TRUE".

set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE)
#> Ranger result
#> 
#> Call:
#>  ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.119976

# Now the OOB error is down at the same level as the tidymodels result.
# Also, ranger's default for a classification forest is "min.node.size = 1", whereas the tidymodels fit shows "10".

Created on 2021-08-25 by the reprex package (v2.0.0)
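
The two "OOB prediction error" figures above are reported on different scales: the classification forest prints a misclassification percentage, while the probability forest prints a Brier score. The following is a minimal sketch, not part of the original report, of one way to get an error rate from a probability forest that is roughly comparable to the classification output. It assumes that `predictions` on a ranger probability fit holds the OOB class probabilities with one named column per class level; the variable names `prob_fit`, `oob_prob`, `oob_class`, and `oob_error` are illustrative only.

```r
library(ranger)

data(cells, package = "modeldata")
cells$case <- NULL

set.seed(1234)
prob_fit <- ranger(class ~ ., data = cells, probability = TRUE)

# OOB class probabilities (assumed: one column per class level).
oob_prob <- prob_fit$predictions

# Drop any rows that were never out-of-bag (their probabilities would be missing).
ok <- complete.cases(oob_prob)

# Pick the most probable class for each row.
oob_class <- colnames(oob_prob)[max.col(oob_prob[ok, , drop = FALSE],
                                        ties.method = "first")]

# Proportion of rows whose most probable OOB class is wrong.
oob_error <- mean(oob_class != as.character(cells$class[ok]))
oob_error

# The value ranger reports for this fit is a Brier score, a different metric.
prob_fit$prediction.error
```

Taking the most probable class from averaged probabilities is not the same as the majority vote used by a classification forest, so this figure should be close to, but not identical to, the percentage a classification forest reports.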

@juliasilge
Member

juliasilge commented Aug 26, 2021

That is correct! We default to probability forests because that is more aligned with how the rest of tidymodels makes predictions.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(cells)
library(ranger)
tidymodels_prefer()

cells$case = NULL

rf_spec <- rand_forest(mode = "classification")
rf_form <- class ~ .
workflow(rf_form, rf_spec) %>% fit(data = cells)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> class ~ .
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1202317


ranger(x = cells[2:57], y = cells$class, probability = TRUE)
#> Ranger result
#> 
#> Call:
#>  ranger(x = cells[2:57], y = cells$class, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.120404

Created on 2021-08-26 by the reprex package (v2.0.1)
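
As a rough illustration of why the probability default fits the tidymodels prediction interface, the sketch below fits the default specification and asks for both hard class predictions and class probabilities. It is an illustrative example rather than part of the original reply; `rf_fit`, `class_pred`, and `prob_pred` are made-up names.

```r
library(tidymodels)

data(cells, package = "modeldata")
cells$case <- NULL

set.seed(1234)
rf_fit <- rand_forest(mode = "classification") %>%
  set_engine("ranger") %>%
  fit(class ~ ., data = cells)

new_cells <- head(cells)

# Hard class predictions: a tibble with a single `.pred_class` column.
class_pred <- predict(rf_fit, new_data = new_cells, type = "class")
class_pred

# Class probabilities: one `.pred_<level>` column per class level.
prob_pred <- predict(rf_fit, new_data = new_cells, type = "prob")
prob_pred
```

Both prediction types are available from the same fit because the underlying forest estimates probabilities.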

If you do not want to fit a probability tree and do not want to be able to make class probability predictions, you can set that as an engine argument:

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
data(cells)
library(ranger)
tidymodels_prefer()

cells$case = NULL

rf_spec <- rand_forest() %>%
   set_engine("ranger", probability = FALSE) %>%
   set_mode("classification")
rf_form <- class ~ .
workflow(rf_form, rf_spec) %>% fit(data = cells)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Formula
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> class ~ .
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, probability = ~FALSE,      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
#> 
#> Type:                             Classification 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 1 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error:             16.84 %


ranger(x = cells[2:57], y = cells$class, probability = FALSE)
#> Ranger result
#> 
#> Call:
#>  ranger(x = cells[2:57], y = cells$class, probability = FALSE) 
#> 
#> Type:                             Classification 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 1 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error:             16.99 %

Created on 2021-08-26 by the reprex package (v2.0.1)
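
To get closer to the original `ranger(..., min.node.size = 10, classification = TRUE)` call from within tidymodels, a sketch along these lines may work. It assumes that parsnip's `min_n` argument is passed to ranger as `min.node.size`, and that the underlying ranger object is available as `$fit` on the parsnip fit; `rf_spec_class` and `rf_fit_class` are illustrative names.

```r
library(tidymodels)

data(cells, package = "modeldata")
cells$case <- NULL

# Classification forest (not a probability forest) with a target node size of 10.
rf_spec_class <- rand_forest(min_n = 10) %>%
  set_engine("ranger", probability = FALSE) %>%
  set_mode("classification")

set.seed(1234)
rf_fit_class <- fit(rf_spec_class, class ~ ., data = cells)

# Inspect the underlying ranger object.
rf_fit_class$fit$treetype
rf_fit_class$fit$min.node.size
rf_fit_class$fit$prediction.error  # OOB misclassification rate as a proportion
```

Because parsnip draws its own seed for ranger, the OOB error will not match a direct `ranger()` call exactly even with the same `set.seed()` value.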

@rmflight
Author

Great, thanks @juliasilge !! I guess that makes sense.

For a first-pass random forest classification model, I find that the OOB error often gives me a good idea of the expected AUC, but I'd like to use the tidymodels workflow everywhere for consistency and to help with muscle memory.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 12, 2021