Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

step_other reports that all the input columns have been othered #338

Closed
jmiahjones opened this issue Jun 3, 2019 · 1 comment
Closed

step_other reports that all the input columns have been othered #338

jmiahjones opened this issue Jun 3, 2019 · 1 comment

Comments

@jmiahjones
Copy link

@jmiahjones jmiahjones commented Jun 3, 2019

Minimal, reproducible example:

data("credit_data")
rec <- recipe(credit_data, Age~Status + Marital + Home)
othered <- rec %>% step_other(all_predictors(), threshold=0.05, other="my_other") %>% 
    prep(training=credit_data)
othered

This will yield

Operations:
Collapsing factor levels for Status, Marital, Home [trained]

However, the variable Status was not actually binned according to other, as it has only two values well above the threshold. Upon inspection of the steps, you will find that the variable Status has the attribute collapse = TRUE.

othered$steps[[1]]$objects$Status

$keep
[1] "bad" "good"

$collapse
[1] TRUE

$other
[1] "my_other"

Upon tracing through the code, I discovered the collapse flag is always set to true, when it should be dynamically evaluated.

recipes:::keep_levels
function (x, prop = 0.1, other = "other") 
{
    if (!is.factor(x)) 
        x <- factor(x)
    xtab <- sort(table(x, useNA = "no"), decreasing = TRUE)/sum(!is.na(x))
    dropped <- which(xtab < prop)
    orig <- levels(x)
    if (length(dropped) > 0) 
        keepers <- names(xtab[-dropped])
    else keepers <- orig
    if (length(keepers) == 0) 
        keepers <- names(xtab)[which.max(xtab)]
    if (other %in% keepers) 
        stop("The level ", other, " is already a factor level that will be retained. ", 
            "Please choose a different value.", call. = FALSE)
    list(keep = orig[orig %in% keepers], collapse = TRUE, other = other)
}

The issue is in the last line of the function. Can we add in a collapse=all(orig %in% keepers) instead, or something similar?

topepo added a commit that referenced this issue Jun 28, 2019
@topepo
Copy link
Collaborator

@topepo topepo commented Jun 28, 2019

The changes are:

  • Previously, if step_other() did not collapse any levels, it would still add an "other" level to the factor. This would lump new factor levels into "other" when data were baked (as step_novel() does). This no longer occurs since it was inconsistent with ?step_other, which said that

  • step_other()'s print method only reports the variables with collapsed levels (as opposed to any column that was tested to see if it needed collapsing).

topepo added a commit that referenced this issue Jun 29, 2019
changes for #338 and #290
@topepo topepo closed this Jun 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.