
step_dummy_multi_choice() ignores levels of the input factor #916

@zenggyu


The problem

step_dummy_multi_choice() ignores the levels of the input factor and only considers the levels that actually occur in the new data. This causes a problem if the downstream model uses a dummy variable for a level that does not occur in the new data. This behavior differs from that of step_dummy().
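
To make the distinction concrete, here is a minimal base R sketch. It does not use recipes and is not meant to describe how either step is implemented; the variable names are only for illustration. A factor keeps its declared levels even when only some of them occur in the data:

```r
# A factor where only "c", "d", "e" occur, but seven levels are declared.
te_x <- factor(c("c", "d", "e"), levels = c("a", "b", "c", "d", "e", "f", "g"))

# Declared levels: what I would expect the dummy columns to be based on.
declared <- levels(te_x)

# Observed values: what the output columns appear to be based on instead.
observed <- unique(as.character(te_x))

# Levels that are declared but never observed.
missing_levels <- setdiff(declared, observed)
print(missing_levels)
#> [1] "a" "b" "f" "g"
```

These are the same four levels that are missing from the step_dummy_multi_choice() output in the example below.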

I also noticed that both functions have a parameter called levels, which looks like it is intended to specify explicitly which levels to include in the output. However, the documentation has no examples, and I could not work out how to use it (I tried levels = list(x = c("a", "b", "c", "d", "e", "f", "g")), but it failed). Could you explain how to use this parameter and/or add an example?

Reproducible example

I would expect the following cases of step_dummy() and step_dummy_multi_choice() to produce the same number of columns. However, step_dummy_multi_choice() does not create columns for levels "a", "b", "f", "g".

By the way, why do step_dummy() and step_dummy_multi_choice() produce different types of output (double vs integer)?

suppressPackageStartupMessages(library(recipes))

# old data
tr <- data.frame(x = factor(c("a", "b", "c"), levels = c("a", "b", "c", "d", "e", "f", "g")))

# new data
te <- data.frame(x = factor(c("c", "d", "e"), levels = c("a", "b", "c", "d", "e", "f", "g")))

tr |>
  recipe() |>
  step_dummy(x, one_hot = TRUE) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 7
#>     x_a   x_b   x_c   x_d   x_e   x_f   x_g
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     0     0     1     0     0     0     0
#> 2     0     0     0     1     0     0     0
#> 3     0     0     0     0     1     0     0

tr |>
  recipe() |>
  step_dummy_multi_choice(x) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 3
#>     x_c   x_d   x_e
#>   <int> <int> <int>
#> 1     1     0     0
#> 2     0     1     0
#> 3     0     0     1

Created on 2022-02-24 by the reprex package (v2.0.0)


Labels: bug (an unexpected problem or unintended behavior)
