Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

step_other reports that all the input columns have been othered #338

Closed
jmiahjones opened this issue Jun 3, 2019 · 2 comments
Closed

step_other reports that all the input columns have been othered #338

jmiahjones opened this issue Jun 3, 2019 · 2 comments

Comments

@jmiahjones
Copy link

Minimal, reproducible example:

data("credit_data")
rec <- recipe(credit_data, Age~Status + Marital + Home)
othered <- rec %>% step_other(all_predictors(), threshold=0.05, other="my_other") %>% 
    prep(training=credit_data)
othered

This will yield

Operations:
Collapsing factor levels for Status, Marital, Home [trained]

However, the variable Status was not actually binned according to other, as it has only two values well above the threshold. Upon inspection of the steps, you will find that the variable Status has the attribute collapse = TRUE.

othered$steps[[1]]$objects$Status

$keep
[1] "bad" "good"

$collapse
[1] TRUE

$other
[1] "my_other"

Upon tracing through the code, I discovered the collapse flag is always set to true, when it should be dynamically evaluated.

recipes:::keep_levels
function (x, prop = 0.1, other = "other") 
{
    if (!is.factor(x)) 
        x <- factor(x)
    xtab <- sort(table(x, useNA = "no"), decreasing = TRUE)/sum(!is.na(x))
    dropped <- which(xtab < prop)
    orig <- levels(x)
    if (length(dropped) > 0) 
        keepers <- names(xtab[-dropped])
    else keepers <- orig
    if (length(keepers) == 0) 
        keepers <- names(xtab)[which.max(xtab)]
    if (other %in% keepers) 
        stop("The level ", other, " is already a factor level that will be retained. ", 
            "Please choose a different value.", call. = FALSE)
    list(keep = orig[orig %in% keepers], collapse = TRUE, other = other)
}

The issue is in the last line of the function. Can we add in a collapse=all(orig %in% keepers) instead, or something similar?

topepo added a commit that referenced this issue Jun 28, 2019
@topepo
Copy link
Member

topepo commented Jun 28, 2019

The changes are:

  • Previously, if step_other() did not collapse any levels, it would still add an "other" level to the factor. This would lump new factor levels into "other" when data were baked (as step_novel() does). This no longer occurs since it was inconsistent with ?step_other, which said that

  • step_other()'s print method only reports the variables with collapsed levels (as opposed to any column that was tested to see if it needed collapsing).

topepo added a commit that referenced this issue Jun 29, 2019
@topepo topepo closed this as completed Jun 29, 2019
@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 23, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

2 participants