ranger "classification" mode still looks like "probability" #546

@rmflight

The problem

When a parsnip model specification uses the "ranger" engine in "classification" mode, the fitted model is still a probability forest: parsnip calls ranger with "probability = TRUE" rather than "classification = TRUE".
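A quick way to see which arguments parsnip hands to ranger, without fitting anything, is `translate()`. The sketch below also assumes that `probability = FALSE` can be passed as an engine argument to override the default; that override is an assumption rather than something this report confirms, and the names `rf_default` and `rf_class` are only illustrative.

```r
library(parsnip)

# Spec as used in this report: no engine arguments are given.
rf_default <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Assumption: an explicit engine argument is forwarded to ranger::ranger()
# and takes precedence over whatever parsnip would add on its own.
rf_class <- rand_forest() %>%
  set_engine("ranger", probability = FALSE) %>%
  set_mode("classification")

# translate() prints the model fit template (the call parsnip would build)
# without fitting a model.
translate(rf_default)
translate(rf_class)
```

If the override is honored, the second template should show `probability = FALSE`. A fit made that way would presumably no longer support class-probability predictions.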

Reproducible example

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(modeldata)
library(skimr)
data(cells, package = "modeldata")
library(ranger)
tidymodels_prefer()

cells$case = NULL
set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE)
#> Ranger result
#> 
#> Call:
#>  ranger(class ~ ., data = cells, min.node.size = 10, classification = TRUE) 
#> 
#> Type:                             Classification 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error:             17.29 %


# OK, so we have an OOB error of 0.17, which isn't too shabby!
# Now let's run via tidymodels.

rf_spec = rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_recipe = recipe(class ~ ., data = cells) %>%
  step_dummy(class, -class)

set.seed(1234)
workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_spec) %>%
  fit(data = cells)
#> ══ Workflow [trained] ══════════════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: rand_forest()
#> 
#> ── Preprocessor ────────────────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_dummy()
#> 
#> ── Model ───────────────────────────────────────────────────────────────────────
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.1198456

# This is important! The fit object from tidymodels is a "Probability estimation" forest, even though we specifically asked for "classification".
# The reported OOB error is lower as well (note that it is a Brier score here, not a misclassification rate).
# It isn't a *lot* lower here, but in other datasets it can make a big difference (e.g. 20% vs 45%).

# Now let's run again using ranger and "probability".

set.seed(1234)
ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE)
#> Ranger result
#> 
#> Call:
#>  ranger(class ~ ., data = cells, min.node.size = 10, probability = TRUE) 
#> 
#> Type:                             Probability estimation 
#> Number of trees:                  500 
#> Sample size:                      2019 
#> Number of independent variables:  56 
#> Mtry:                             7 
#> Target node size:                 10 
#> Variable importance mode:         none 
#> Splitrule:                        gini 
#> OOB prediction error (Brier s.):  0.119976

# Now the OOB error is down at the same level as the tidymodels result.
# Also, a "ranger" classification model defaults to "min.node.size = 1", whereas the tidymodels fit shows a target node size of 10.

Created on 2021-08-25 by the reprex package (v2.0.0)
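One caveat when comparing the numbers above: the classification run reports a misclassification percentage, while the two probability runs report a Brier score, so the values are on different scales. The base R sketch below uses made-up toy data and the common binary form of the Brier score to show how the two can differ for the same predictions. It does not reproduce ranger's exact OOB computation.

```r
# Toy data, invented for illustration: true 0/1 classes and predicted
# probabilities of the positive class.
truth <- c(1, 1, 1, 0, 0, 0, 1, 0)
p_hat <- c(0.9, 0.6, 0.4, 0.2, 0.55, 0.1, 0.8, 0.3)

# Misclassification rate: threshold the probabilities at 0.5 and count
# how often the resulting label disagrees with the truth.
pred_class <- as.integer(p_hat > 0.5)
misclass <- mean(pred_class != truth)

# Brier score (binary form): mean squared difference between the
# predicted probability and the observed 0/1 outcome.
brier <- mean((p_hat - truth)^2)

misclass
brier
```

On this toy data the Brier score comes out smaller than the misclassification rate even though both describe the same predictions, so a lower number in the probability output does not by itself mean a more accurate model.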
