step_dummy_multi_choice() ignores levels of the input factor #916

Closed
zenggyu opened this issue Feb 24, 2022 · 3 comments
Labels
bug an unexpected problem or unintended behavior

Comments

zenggyu commented Feb 24, 2022

The problem

step_dummy_multi_choice() ignores the levels of the input factor and only considers the levels that occur in the new data. This causes problems if the downstream model uses a dummy variable for a level that does not occur in the new data. This behavior contrasts with that of step_dummy().

I also noticed that the two functions share a parameter called levels, which looks like it is intended to explicitly specify the levels to include in the output. However, there are no examples in the documentation and I don't know how to specify it (I tried levels = list(x = c("a", "b", "c", "d", "e", "f", "g")), but it failed). Could you explain how to use the parameter and/or add an example?

Reproducible example

I would expect the following cases of step_dummy() and step_dummy_multi_choice() to produce the same number of columns. However, step_dummy_multi_choice() does not create columns for levels "a", "b", "f", "g".

By the way, why do step_dummy() and step_dummy_multi_choice() produce different types of output (double vs integer)?

suppressPackageStartupMessages(library(recipes))

# old data
tr <- data.frame(x = factor(c("a", "b", "c"), levels = c("a", "b", "c", "d", "e", "f", "g")))

# new data
te <- data.frame(x = factor(c("c", "d", "e"), levels = c("a", "b", "c", "d", "e", "f", "g")))

tr |>
  recipe() |>
  step_dummy(x, one_hot = TRUE) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 7
#>     x_a   x_b   x_c   x_d   x_e   x_f   x_g
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     0     0     1     0     0     0     0
#> 2     0     0     0     1     0     0     0
#> 3     0     0     0     0     1     0     0

tr |>
  recipe() |>
  step_dummy_multi_choice(x) |>
  prep() |>
  bake(new_data = te)
#> # A tibble: 3 × 3
#>     x_c   x_d   x_e
#>   <int> <int> <int>
#> 1     1     0     0
#> 2     0     1     0
#> 3     0     0     1

Created on 2022-02-24 by the reprex package (v2.0.0)
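
To make the distinction behind this report concrete, here is a minimal base R sketch (not the recipes implementation, just an illustration using the same `te` data as above). It contrasts the levels declared on the factor with the values actually observed in the data, and shows that base R's model.matrix() builds one indicator column per declared level, which is the behavior I expected from step_dummy_multi_choice(). The variable names `lv`, `declared`, `observed` and `mm` are only for illustration.

```r
# Base R only: declared factor levels vs. values actually present in the data
lv <- c("a", "b", "c", "d", "e", "f", "g")
te <- data.frame(x = factor(c("c", "d", "e"), levels = lv))

declared <- levels(te$x)                # all seven declared levels
observed <- unique(as.character(te$x))  # only the values that occur in te

# Without an intercept, model.matrix() creates one indicator column per
# declared level, including levels that never occur in the data
mm <- model.matrix(~ x - 1, data = te)

length(declared)
observed
colnames(mm)
```

The columns for the unused levels are all zeros, so a downstream model that expects them still receives a matrix of the right shape.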

@EmilHvitfeldt added the bug (an unexpected problem or unintended behavior) label Feb 24, 2022

@EmilHvitfeldt (Member)

Hello @zenggyu! Thanks for filing this issue. You are correct in assuming that step_dummy_multi_choice() should respect the factor levels.

@EmilHvitfeldt (Member)

This was closed in #930.

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

The github-actions bot locked and limited conversation to collaborators Apr 29, 2022