Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fct_lump_prop() has undocumented feature for one other level #274

Closed
caneff opened this issue Oct 1, 2020 · 4 comments · Fixed by #336
Closed

fct_lump_prop() has undocumented feature for one other level #274

caneff opened this issue Oct 1, 2020 · 4 comments · Fixed by #336
Labels
Milestone

Comments

@caneff
Copy link

caneff commented Oct 1, 2020

There is nothing in documentation that says lumping into other does nothing if that is only one other category. I don't mind the behavior but I spent some time debugging my code because of it.

What happens:

library(forcats)
f <- factor(c(1,2,2,2))
fct_lump_prop(f, prop=0.5)
#> [1] 1 2 2 2
#> Levels: 1 2

Created on 2020-10-01 by the reprex package (v0.3.0)

However I expect it to look like

#> [1] Other 2 2 2 2
#> Levels: 2 Other

Again I'm fine with the current behavior it is just unexpected for how the function reads. 1 is less than the 50% of the total so I expected it to become other. So either update documentation or behavior.

@thackl
Copy link

thackl commented Jan 30, 2021

Just ran into the same issue, also with fct_lump_n(). To me, this feels quite counterintuitive even if it were documented...

library(forcats)
fct_lump_n(c("a","a","c"), n=1)
#> [1] a a c
#> Levels: a c
fct_lump_n(c("a","a","b","c"), n=1)
#> [1] a     a     Other Other
#> Levels: a Other

Created on 2021-01-30 by the reprex package (v0.3.0)

@hadley hadley added the help wanted ❤️ we'd love your help! label Mar 2, 2022
@hadley
Copy link
Member

hadley commented Mar 2, 2022

I think the behaviour makes sense because what's the point in having an "other" level that contains only one level?

@huftis
Copy link

huftis commented Mar 3, 2022

I think the behaviour makes sense because what's the point in having an "other" level that contains only one level?

If for nothing else, then for 1) consistency among the fct_lump_*() functions, 2) consistency with the documentation, and 3) the principle of least surprise. Regarding the first reason, fct_lump_min() does label the ‘other’ level "other" even if this doesn’t involve any merging:

> fct_lump_min(c("a","a","c"), min=2)
[1] a     a     Other
Levels: a Other

Any in case you want to change the fct_lump_min() to be consistent with fct_lump_n() instead of the other way around: The current fct_lump_min() behaviour is actually very useful. For example, I sometimes have to ‘anonymise’ levels, e.g., institution names, that have less than five observations (and then lump them into one level). Then I can write:

fct_lump_min(x, min = 5, other = "[NAME REDACTED]")

And this will work correctly even if there is only one such institution.

@hadley hadley added this to the v1.0.0 milestone Jan 3, 2023
@hadley hadley changed the title fct_lump_prop has undocumented feature for one other level fct_lump_prop() has undocumented feature for one other level Jan 3, 2023
@hadley
Copy link
Member

hadley commented Jan 4, 2023

Ok; let's just hope this doesn't affect too much code in the wild.

hadley added a commit that referenced this issue Jan 4, 2023
hadley added a commit that referenced this issue Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants