add switch to make second level the reference in step_bin2factor #142
Suggest making the default of the new argument FALSE. Binomial glms, e.g., assume the first level is the negative class, which leads to this backwardness with the current default.

    suppressPackageStartupMessages({
      library(tidyverse)
      library(recipes)
    })
    d <-
      tibble(y = rbinom(50, 1, .5),
             x1 = rnorm(50, mean = y),
             x2 = rnorm(50, mean = y))
    d <-
      recipe(d) %>%
      step_bin2factor(y) %>%
      prep(d) %>%
      bake(d)
    m <- glm(y ~ ., d, family = "binomial")
    d$predicted_prob <- predict(m, d, type = "response")
    ggplot(d, aes(x = y, y = predicted_prob)) +
      geom_boxplot()

Created on 2018-04-06 by the reprex package (v0.2.0).
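The glm convention mentioned above can be checked with base R alone. The following is a minimal sketch, not code from this PR: the simulated data, the seed, and the names `labels`, `yes_first`, and `yes_last` are assumptions made for illustration. It fits the same binary outcome twice, changing only the order of the factor levels. For a factor response, `glm()` with a binomial family treats the first level as failure, so putting "yes" first makes the model estimate the probability of "no".

```r
# Hypothetical simulated data, not taken from the PR.
set.seed(1)
y <- rbinom(200, 1, 0.5)
x <- rnorm(200, mean = y)          # x tends to be larger when y == 1
labels <- ifelse(y == 1, "yes", "no")

# Same outcome, two level orderings.
yes_first <- factor(labels, levels = c("yes", "no"))
yes_last  <- factor(labels, levels = c("no", "yes"))

# glm() models the probability of NOT being in the first level.
m_first <- glm(yes_first ~ x, family = binomial)  # models P("no")
m_last  <- glm(yes_last ~ x, family = binomial)   # models P("yes")

# The slopes should have opposite signs: negative when "yes" is first,
# positive when "yes" is last.
coef(m_first)["x"]
coef(m_last)["x"]
```

The two fits describe the same relationship with mirrored coefficients, which is why predicted probabilities look "backward" when the level of interest comes first.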
No, I'm sticking to my guns on this one. Sorry about the rant. This convention is old and misguided since it is based on encoding binary categorical data as 0/1. If people forgot about this and were asked "which factor level is generally more important?", nobody would say "the last one". So in my quixotic efforts to change established human behavior, I'm going to say that we should treat categorical data as qualitative and forget about how it is eventually encoded (i.e., please make the default TRUE).
Quixotic ranting is good. I'll save you my counter-rant because I see that this is baked into caret and so would be a massive change. The default is TRUE.
Can you add a unit test for the option and make a note in the NEWS.md file under "Other Changes"? Feel free to link to the PR or issue and use "(contributed by...)" or similar.
Sorry, I missed that last ask. Thanks for letting me add this!

First stab at addressing #141. I wrote the test differently than the others in the file because
table() orders by factor levels and I wanted to be explicit about what's being tested here.
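The point about `table()` can be illustrated with a small base R sketch. This is not the test added in the PR, which is not shown here. The input vector and the names `yes_first`, `yes_last`, `counts_first`, and `counts_last` are hypothetical. It shows that `table()` lists counts in factor-level order, and that checking `levels()` directly states what is being tested.

```r
# Hypothetical 0/1 input, not data from the PR.
x <- c(1, 0, 0, 1, 1)
labels <- ifelse(x == 1, "yes", "no")

yes_first <- factor(labels, levels = c("yes", "no"))
yes_last  <- factor(labels, levels = c("no", "yes"))

# table() reports counts in level order, so the same data is laid out
# differently depending on which level comes first.
counts_first <- table(yes_first)
counts_last  <- table(yes_last)
names(counts_first)
names(counts_last)

# Looking counts up by name gives the same answer for both orderings,
# so it says nothing about which level is the reference.
stopifnot(counts_first[["yes"]] == counts_last[["yes"]])

# Checking the level order directly makes the intent explicit.
stopifnot(identical(levels(yes_first), c("yes", "no")))
stopifnot(identical(levels(yes_last), c("no", "yes")))
```

A test that only compares tabulated counts can therefore pass or fail for reasons unrelated to level order, depending on whether it matches by name or by position.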