step_dummy returns less binary variables if a character is provided with less categories #83
The previous commit was mis-attributed to another issue and should be directed towards #83
This long explanation is for posterity and future reference (tl;dr at end).
When When
> library("recipes")
> library("dplyr")
>
> data(okc)
>
> rec1 <- recipe(~ ., data = okc) %>%
+ step_center(all_numeric()) %>%
+ prep(training = okc)
step 1 center training
> names(rec1$levels)
[1] "age" "diet" "height" "location" "date"
The elements for numeric columns have all missing values. These values are consistent with the state of the data after
When
The bug here is that, once dummy variables are created, the original factor is removed and, after
> dummies <- rec %>% step_dummy(diet)
> dummies <- prep(dummies, training = okc, retain = TRUE)
step 1 dummy training
> # no "levels" element since there are no nominal data left at the end
> names(dummies)
[1] "var_info" "term_info" "steps" "template" "retained" "tr_info"
To fix this, I have
You will have to refit the step so that the factor levels are present (otherwise it throws an error).
tl;dr The sequence of steps prevents the right factor levels from being used. It is fixed, but you'll need to recreate the step with the devel version.
One other note: this new version requires variables selected in
Thank you very much! I appreciate your long explanation :).
Thank you for the package, I find it very useful!
I found an unexpected behaviour when applying bake with a step_dummy recipe to a test dataset that has fewer categories in a character variable. In this case, bake doesn't return a binary variable for the missing category. I attach a reproducible example.
A workaround is to convert all character variables to factor.
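A minimal sketch of that workaround, using base R only. The toy data frames train and test are hypothetical, and model.matrix stands in for the dummy-encoding step purely to show that the column sets line up; this is not the recipes implementation. The sketch assumes the training levels are reused when converting the test set, so categories absent from the test data still get a column.

```r
# Hypothetical toy data: the test set lacks the "vegan" category.
train <- data.frame(diet = c("vegan", "omnivore", "vegetarian"),
                    stringsAsFactors = FALSE)
test  <- data.frame(diet = c("omnivore", "vegetarian"),
                    stringsAsFactors = FALSE)

# Convert every character column to a factor, reusing the training levels
# so the test set keeps categories it does not actually contain.
chr_cols <- names(train)[vapply(train, is.character, logical(1))]
for (col in chr_cols) {
  train[[col]] <- factor(train[[col]])
  test[[col]]  <- factor(test[[col]], levels = levels(train[[col]]))
}

# Base-R dummy encoding (a stand-in for step_dummy, for illustration only).
train_dummies <- model.matrix(~ diet, data = train)
test_dummies  <- model.matrix(~ diet, data = test)

# Both encodings expose the same columns; the missing category is all zeros.
identical(colnames(train_dummies), colnames(test_dummies))
```

With the levels fixed up front, the encoded test set has an all-zero column for the category it is missing, rather than dropping that column.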
My sessionInfo in case you need it.