step_dummy returns less binary variables if a character is provided with less categories #83

Closed
LluisRamon opened this issue Aug 4, 2017 · 3 comments

Comments

@LluisRamon

Thank you for the package, I find it very useful!

I found unexpected behaviour when applying bake with a step_dummy recipe to a test dataset that has fewer categories in a character variable. In this case, bake doesn't return a binary variable for the missing category.

I attach a reproducible example.

library("recipes")
library("dplyr")

data(okc)
okc <- okc[complete.cases(okc),]

rec <- recipe(~ diet, data = okc)

dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)

dummy_data <- bake(dummies, newdata = okc)

names(dummy_data)

# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian"   

# Remove one category in diet
dummy_data <- bake(dummies, newdata = okc %>% filter(diet != 'halal'))

names(dummy_data)
# [1] "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"        "diet_mostly.kosher"      
# [5] "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"   "diet_other"              
# [9] "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"     "diet_strictly.other"     
# [13] "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"               "diet_vegetarian" 

A workaround is to convert all character variables to factors.

library("recipes")
library("dplyr")

data(okc)
okc <- okc[complete.cases(okc),]
okc$diet <- as.factor(okc$diet)

rec <- recipe(~ diet, data = okc)

dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)

dummy_data <- bake(dummies, newdata = okc)

names(dummy_data)

# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian"   

# Remove one category in diet
dummy_data <- bake(dummies, newdata = okc %>% filter(diet != 'halal'))

names(dummy_data)
# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian" 
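
The workaround above converts only diet. A minimal base R sketch for converting every character column at once is shown below. The toy data frame and its column names are illustrative stand-ins for okc, not part of the original report.

```r
# Toy data standing in for okc; the names and values are illustrative only.
train <- data.frame(
  diet = c("vegan", "halal", "kosher", "vegan"),
  age  = c(31, 25, 40, 28),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns alone.
chr_cols <- vapply(train, is.character, logical(1))
train[chr_cols] <- lapply(train[chr_cols], factor)

# Row-subsetting keeps the full set of factor levels, which is why the
# workaround still yields a dummy column for the dropped category.
test <- train[train$diet != "halal", ]
levels(test$diet)
```

The key point is that a factor carries its levels with it even when a level no longer occurs in the rows, whereas a character column only knows the values it currently holds.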

My sessionInfo in case you need it.

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2  recipes_0.1.0 dplyr_0.7.2  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12      ddalpha_1.2.1     DEoptimR_1.0-8    gower_0.1.2       bindr_0.1         class_7.3-14      tools_3.3.2       rpart_4.1-10      ipred_0.9-6       lubridate_1.6.0  
[11] tibble_1.3.3      lattice_0.20-34   pkgconfig_2.0.1   rlang_0.1.1       Matrix_1.2-7.1    RcppRoll_0.2.2    prodlim_1.6.1     stringr_1.2.0     CVST_0.2-1        grid_3.3.2       
[21] nnet_7.3-12       tidyselect_0.1.1  robustbase_0.92-7 glue_1.1.1.9000   R6_2.2.2          survival_2.39-5   lava_1.5          kernlab_0.9-25    purrr_0.2.2.2     DRR_0.0.2        
[31] magrittr_1.5      MASS_7.3-45       splines_3.3.2     assertthat_0.2.0  dimRed_0.1.0      timeDate_3012.100 stringi_1.1.5    
topepo added a commit that referenced this issue Sep 13, 2017
The previous commit was mis-attributed to another issue and should be
directed towards #83
@topepo
Member

topepo commented Sep 13, 2017

This long explanation is for posterity and future reference (tl;dr at end).

When prep is called, there is a stringsAsFactors option that defaults to TRUE (this is contrary to the rest of the tidyverse, but we need to make dummy variables so...).

When prep is called with the default, any factor levels are recorded before and after the steps are prepped and stored in an element called levels. Here's an example with a step that only affects the numeric data:

> library("recipes")
> library("dplyr")
> 
> data(okc)
> 
> rec1 <- recipe(~ ., data = okc) %>%
+   step_center(all_numeric()) %>%
+   prep(training = okc)
step 1 center training 
> names(rec1$levels)
[1] "age"      "diet"     "height"   "location" "date" 

The elements for numeric columns contain only missing values. These values are consistent with the state of the data after prep is done with the computations.

When bake is called, it applies the step computations in order and then resets the factor levels to be consistent with the values in the levels slot.
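
The record-then-reset idea can be sketched in base R as follows. This is not the recipes implementation: record_levels and reset_levels are hypothetical helpers written only to illustrate the mechanism, and the data are made up.

```r
# Hypothetical helper: store the levels of each column (NA for non-factors).
record_levels <- function(df) {
  lapply(df, function(col) if (is.factor(col)) levels(col) else NA_character_)
}

# Hypothetical helper: re-apply the stored levels to matching columns.
reset_levels <- function(df, lvls) {
  for (nm in intersect(names(df), names(lvls))) {
    if (!all(is.na(lvls[[nm]]))) {
      df[[nm]] <- factor(as.character(df[[nm]]), levels = lvls[[nm]])
    }
  }
  df
}

train <- data.frame(diet = factor(c("vegan", "halal", "kosher")),
                    age  = c(31, 25, 40))
lvls <- record_levels(train)

# New data arrive as plain strings and lack the "halal" category.
new_data <- data.frame(diet = c("vegan", "kosher"), age = c(22, 35),
                       stringsAsFactors = FALSE)
new_data <- reset_levels(new_data, lvls)
levels(new_data$diet)
```

Storing NA for non-factor columns mirrors the description above, where the entries for numeric columns hold only missing values.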

The bug here is that step_dummy removes the original factor once the dummy variables are created. By the time prep finishes and records the levels, the possible values of the original factor are gone. In your example, the columns are not produced because the information needed to restore them doesn't exist.

> dummies <- rec %>% step_dummy(diet)
> dummies <- prep(dummies, training = okc, retain = TRUE)
step 1 dummy training 
> # no "levels" element since there are no nominal data left at the end
> names(dummies)
[1] "var_info"  "term_info" "steps"     "template"  "retained"  "tr_info" 

To fix this, I have step_dummy recording the levels, saving them, and resetting them when the step is baked (just prior to making dummy variables). TBH I don't know why model.matrix doesn't do this anyway.
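
The effect of resetting the levels just before building dummy variables can be seen with model.matrix directly. This is a small base R sketch with made-up data, not the code inside step_dummy.

```r
train <- data.frame(diet = c("halal", "kosher", "vegan", "vegan"),
                    stringsAsFactors = FALSE)
# New data are missing the "kosher" category.
new_data <- data.frame(diet = c("halal", "vegan"), stringsAsFactors = FALSE)

# With a plain character column, model.matrix() only sees the values present,
# so no column is produced for "kosher".
cols_before <- colnames(model.matrix(~ diet, data = new_data))

# Restoring the levels saved from the training data brings the column back
# (it is all zeros, since "kosher" never occurs in new_data).
saved_levels <- levels(factor(train$diet))
new_data$diet <- factor(new_data$diet, levels = saved_levels)
mm_after <- model.matrix(~ diet, data = new_data)
cols_after <- colnames(mm_after)

cols_before
cols_after
```

Note that model.matrix drops the first level as the reference, so a missing category only shows up as a missing column when it is not the reference level.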

You will have to refit the step so that the factor levels are present (otherwise it throws an error).

tl;dr

The sequence of steps prevents the right factor levels from being used. It is fixed, but you'll need to recreate the step with the devel version.

One other note: this new version requires variables selected in step_dummy to be factors (strings won't work).

@LluisRamon
Author

Thank you very much! I appreciate your long explanation :).

@github-actions

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 25, 2021