The problem
step_dummy_multi_choice() ignores the levels of the input factor and only considers the levels that occur in the new data. This causes problems if the downstream model uses a dummy variable for a level that does not occur in the new data. This behavior differs from that of step_dummy().
I also notice that the two functions share a parameter called levels, which looks like it is intended to explicitly specify the levels to include in the output. However, there are no examples in the documentation and I don't know how to specify it (I tried levels = list(x = c("a", "b", "c", "d", "e", "f", "g")) but it failed). Can you add more explanation of how to use the parameter and/or add an example?
Reproducible example
I would expect the following cases of step_dummy() and step_dummy_multi_choice() to produce the same number of columns. However, step_dummy_multi_choice() does not create columns for levels "a", "b", "f", "g".
By the way, why do step_dummy() and step_dummy_multi_choice() produce different types of output (double vs integer)?
suppressPackageStartupMessages(library(recipes))
# old data
tr <- data.frame(x = factor(c("a", "b", "c"), levels = c("a", "b", "c", "d", "e", "f", "g")))
# new data
te <- data.frame(x = factor(c("c", "d", "e"), levels = c("a", "b", "c", "d", "e", "f", "g")))
tr |>
  recipe() |>
  step_dummy(x, one_hot = TRUE) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 7
#>     x_a   x_b   x_c   x_d   x_e   x_f   x_g
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     0     0     1     0     0     0     0
#> 2     0     0     0     1     0     0     0
#> 3     0     0     0     0     1     0     0
tr |>
  recipe() |>
  step_dummy_multi_choice(x) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 3
#>     x_c   x_d   x_e
#>   <int> <int> <int>
#> 1     1     0     0
#> 2     0     1     0
#> 3     0     0     1
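For reference, the column layout I would expect can be sketched in base R without recipes. This is only an illustration of the desired shape, not how either step is implemented; using model.matrix() and renaming the columns with an x_ prefix are my own choices for the sketch.

```r
# Same new data as in the reprex: only "c", "d", "e" occur,
# but the factor declares all seven levels.
te <- data.frame(
  x = factor(c("c", "d", "e"), levels = c("a", "b", "c", "d", "e", "f", "g"))
)

# One indicator column per declared level (no intercept), so levels
# that are absent from the data still get an all-zero column.
one_hot <- model.matrix(~ x - 1, data = te)

# Rename the columns to mimic the x_<level> naming seen in the recipes output.
colnames(one_hot) <- paste0("x_", levels(te$x))

ncol(one_hot)  # 7: one column per factor level, including "a", "b", "f", "g"
```

The point is that the number of columns is determined by the factor's declared levels rather than by the values present in the data being baked.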
Created on 2022-02-24 by the reprex package (v2.0.0)