Closed
Description
The problem
When a parsnip model specification uses the "ranger" engine in "classification" mode, the underlying ranger call is still made with "probability = TRUE" instead of "classification = TRUE".
Reproducible example
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(modeldata)
library(skimr)
data(cells, package = "modeldata")
library(ranger)
tidymodels_prefer()
cells$case = NULL
set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE)
#> Ranger result
#>
#> Call:
#> ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE)
#>
#> Type: Classification
#> Number of trees: 500
#> Sample size: 2019
#> Number of independent variables: 56
#> Mtry: 7
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error: 17.29 %
# OK, so we have an OOB error of 0.17, which isn't too shabby!
# Now let's run the same model via tidymodels.
rf_spec = rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
rf_recipe = recipe(class ~ ., data = cells) %>%
step_dummy(class, -class)
set.seed(1234)
workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_spec) %>%
fit(data = cells)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#>
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#>
#> • step_dummy()
#>
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#>
#> Call:
#> ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 2019
#> Number of independent variables: 56
#> Mtry: 7
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.1198456
# This is important! The fit object from tidymodels is a "probability" forest, even though we specifically asked for "classification"!
# The reported OOB error is lower as a result (note that it is now labelled as a Brier score rather than a percentage).
# The gap isn't large here, but on other datasets it can be big (e.g. 20% vs 45%).
# Now let's run ranger directly again, this time with "probability".
set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE)
#> Ranger result
#>
#> Call:
#> ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE)
#>
#> Type: Probability estimation
#> Number of trees: 500
#> Sample size: 2019
#> Number of independent variables: 56
#> Mtry: 7
#> Target node size: 10
#> Variable importance mode: none
#> Splitrule: gini
#> OOB prediction error (Brier s.): 0.119976
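# Now the OOB error is down at the same level as the tidymodels result.
# Also, ranger's default for a classification forest is "min.node.size = 1", whereas the tidymodels fit shows a target node size of 10 (10 is ranger's documented default for probability forests, which is consistent with the above).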
Created on 2021-08-25 by the reprex package (v2.0.0)
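One thing that may help when comparing the numbers above: the classification forest reports a misclassification percentage, while the probability forest reports a Brier score, so the two OOB values are on different scales. Below is a minimal base-R sketch of the difference. The labels and probabilities are made up for illustration (they are not taken from `cells`), and the binary Brier formula used here is the textbook one; the exact form ranger uses internally may differ.

```r
# Toy data (hypothetical, for illustration only): true classes and the
# predicted probability of class "PS" for six observations.
truth   <- factor(c("PS", "PS", "WS", "WS", "PS", "WS"), levels = c("PS", "WS"))
prob_ps <- c(0.9, 0.6, 0.4, 0.2, 0.3, 0.7)

# Misclassification rate: turn probabilities into hard labels at 0.5,
# then take the share of observations that were labelled wrongly.
pred <- factor(ifelse(prob_ps >= 0.5, "PS", "WS"), levels = levels(truth))
misclass_rate <- mean(pred != truth)

# Brier score (binary form): mean squared difference between the
# predicted probability and the 0/1 outcome.
is_ps <- as.numeric(truth == "PS")
brier <- mean((prob_ps - is_ps)^2)

print(misclass_rate)  # 2 of 6 wrong
print(brier)
```

On this toy data the same set of predictions gives a misclassification rate of about 0.33 and a Brier score of 0.225, so a smaller Brier number does not by itself mean the forest classifies better.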