step_dummy returns fewer binary variables if a character is provided with fewer categories #83

Closed
LluisRamon opened this issue Aug 4, 2017 · 2 comments

@LluisRamon LluisRamon commented Aug 4, 2017

Thank you for the package, I find it very useful!

I found unexpected behaviour when applying bake with a step_dummy recipe to a test dataset that has fewer categories in a character variable. In this case, bake doesn't return a binary variable for the missing category.

I attach a reproducible example.

library("recipes")
library("dplyr")

data(okc)
okc <- okc[complete.cases(okc),]

rec <- recipe(~ diet, data = okc)

dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)

dummy_data <- bake(dummies, newdata = okc)

names(dummy_data)

# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian"   

# Remove one category in diet
dummy_data <- bake(dummies, newdata = okc %>% filter(diet != 'halal'))

names(dummy_data)
# [1] "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"        "diet_mostly.kosher"      
# [5] "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"   "diet_other"              
# [9] "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"     "diet_strictly.other"     
# [13] "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"               "diet_vegetarian" 

A workaround is to convert all character variables to factor.

library("recipes")
library("dplyr")

data(okc)
okc <- okc[complete.cases(okc),]
okc$diet <- as.factor(okc$diet)

rec <- recipe(~ diet, data = okc)

dummies <- rec %>% step_dummy(diet)
dummies <- prep(dummies, training = okc)

dummy_data <- bake(dummies, newdata = okc)

names(dummy_data)

# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian"   

# Remove one category in diet
dummy_data <- bake(dummies, newdata = okc %>% filter(diet != 'halal'))

names(dummy_data)
# [1] "diet_halal"               "diet_kosher"              "diet_mostly.anything"     "diet_mostly.halal"       
# [5] "diet_mostly.kosher"       "diet_mostly.other"        "diet_mostly.vegan"        "diet_mostly.vegetarian"  
# [9] "diet_other"               "diet_strictly.anything"   "diet_strictly.halal"      "diet_strictly.kosher"    
# [13] "diet_strictly.other"      "diet_strictly.vegan"      "diet_strictly.vegetarian" "diet_vegan"              
# [17] "diet_vegetarian" 
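
When a dataset has several character columns, the same workaround can be applied to all of them at once. A minimal base R sketch, assuming a made-up toy data frame `df` (the data and column names are hypothetical, not taken from okc):

```r
# Hypothetical toy data; column names are illustrative only.
df <- data.frame(
  diet = c("halal", "vegan", "kosher"),
  city = c("a", "b", "a"),
  age  = c(30, 41, 25),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor,
# leaving all other columns untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

str(df)
```

The conversion should be done on the training data before calling prep, so that the full set of levels is known when the recipe is estimated.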

My sessionInfo in case you need it.

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2  recipes_0.1.0 dplyr_0.7.2  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12      ddalpha_1.2.1     DEoptimR_1.0-8    gower_0.1.2       bindr_0.1         class_7.3-14      tools_3.3.2       rpart_4.1-10      ipred_0.9-6       lubridate_1.6.0  
[11] tibble_1.3.3      lattice_0.20-34   pkgconfig_2.0.1   rlang_0.1.1       Matrix_1.2-7.1    RcppRoll_0.2.2    prodlim_1.6.1     stringr_1.2.0     CVST_0.2-1        grid_3.3.2       
[21] nnet_7.3-12       tidyselect_0.1.1  robustbase_0.92-7 glue_1.1.1.9000   R6_2.2.2          survival_2.39-5   lava_1.5          kernlab_0.9-25    purrr_0.2.2.2     DRR_0.0.2        
[31] magrittr_1.5      MASS_7.3-45       splines_3.3.2     assertthat_0.2.0  dimRed_0.1.0      timeDate_3012.100 stringi_1.1.5    

topepo added a commit that referenced this issue Sep 13, 2017:
"The previous commit was mis-attributed to another issue and should be directed towards #83"

@topepo topepo (Collaborator) commented Sep 13, 2017

This long explanation is for posterity and future reference (tl;dr at end).

When prep is called, there is a stringsAsFactors option that defaults to TRUE (this is contrary to the rest of the tidyverse, but we need to make dummy variables so...).

When prep is called with the default, any factor levels are recorded before and after the steps are prepped and stored in an element called levels. Here's an example with a step that only affects the numeric data:

> library("recipes")
> library("dplyr")
> 
> data(okc)
> 
> rec1 <- recipe(~ ., data = okc) %>%
+   step_center(all_numeric()) %>%
+   prep(training = okc)
step 1 center training 
> names(rec1$levels)
[1] "age"      "diet"     "height"   "location" "date" 

The elements for numeric columns contain only missing values. These values are consistent with the state of the data after prep is done with the computations.

When bake is called, it applies the step computations in order and then resets the factor levels to be consistent with the values in the levels slot.

The bug here is that step_dummy removes the original factor once the dummy variables are created. By the time prep is done and records the levels, the possible values of the original factor are gone. In your example, the missing columns are not produced because the information needed to restore them doesn't exist.

> dummies <- rec %>% step_dummy(diet)
> dummies <- prep(dummies, training = okc, retain = TRUE)
step 1 dummy training 
> # no "levels" element since there are no nominal data left at the end
> names(dummies)
[1] "var_info"  "term_info" "steps"     "template"  "retained"  "tr_info" 

To fix this, I have step_dummy recording the levels, saving them, and resetting them when the step is baked (just prior to making dummy variables). TBH I don't know why model.matrix doesn't do this anyway.
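
The idea of resetting levels just before making dummy variables can be illustrated in base R, outside of recipes. This is only a sketch with made-up toy data (the data frames `train` and `new` are hypothetical), not the package's actual implementation:

```r
# Toy data (hypothetical): "halal" appears in training but not in the new data.
train <- data.frame(diet = c("anything", "halal", "vegan"), stringsAsFactors = FALSE)
new   <- data.frame(diet = c("anything", "vegan"), stringsAsFactors = FALSE)

# Record the levels seen at training time.
train_levels <- levels(factor(train$diet))

# Without resetting, model.matrix only sees the levels present in `new`.
mm_before <- model.matrix(~ diet, data = new)

# Reset the levels to the training levels, then build the dummy variables.
new$diet <- factor(new$diet, levels = train_levels)
mm_after <- model.matrix(~ diet, data = new)

colnames(mm_before)
colnames(mm_after)
```

After the reset, the column for the level that is absent from the new data is present and filled with zeros, which is the behaviour requested in this issue.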

You will have to refit the step so that the factor levels are present (otherwise it throws an error).

tl;dr

The sequence of steps prevents the right factor levels from being used. It is fixed, but you'll need to recreate the step with the development version.

One other note: this new version requires variables selected in step_dummy to be factors (strings won't work).

@LluisRamon LluisRamon commented Sep 13, 2017

Thank you very much! I appreciate your long explanation :).

@LluisRamon LluisRamon closed this Sep 13, 2017