Proposal: fct_match with validation #126

jonocarroll opened this issue Apr 20, 2018 · 0 comments


@jonocarroll jonocarroll commented Apr 20, 2018

(Redesigned from tidyverse/dplyr#3514)

An issue I've faced many times is filtering a data.frame/tibble based on "levels" (unique values or factor levels) of a column. A common (?) pattern for this is to build a logical vector marking which elements match a vector of acceptable levels, and to pass that vector to filter:

filter(iris, Species %in% c("setosa"))

The problem is that while filter and %in% are both doing their jobs correctly, together they can silently give an unexpected result when a requested level does not exist in the data:

filter(iris, Species %in% c("virginica", "samosa"))

One might reasonably expect an attempt to filter to levels not present in the data to raise an error, in the same way that select fails loudly if a column is not present:

select(iris, Species, Spaceship)
#> Error in FUN(X[[i]], ...) : object 'Spaceship' not found

The easiest way to get into this situation is to misspell a level, in which case the intended level is (silently) missing from the filtered result:

filter(iris, Species %in% c("selosa", "virginica")) %>% {unique(.$Species)}
#> [1] virginica
#> Levels: setosa versicolor virginica

Neither filter nor %in% is at fault here, but the intent of the code is lost because the implementation is not specific enough. filter does not perform "filter to the rows which contain X" but rather "filter to the rows for which some condition is TRUE", and it delegates the responsibility of identifying those rows to %in%. %in% knows nothing of the intent, so it responds faithfully with "X is not in this vector".
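To make the silent failure concrete, here is a small base R check using the built-in iris data. The variable names `wanted` and `matched` are only illustrative:

```r
# Requested levels, one of which is misspelled
wanted <- c("selosa", "virginica")

# %in% answers element-wise membership, so the typo never surfaces as an error
matched <- iris$Species %in% wanted
sum(matched)
#> [1] 50

# A manual check shows which requested values are not levels of the factor
setdiff(wanted, levels(iris$Species))
#> [1] "selosa"
```

Only the 50 virginica rows match, and nothing signals that "selosa" was never a level.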

I propose a new function fct_match (and its counterpart fct_exclude) which validates that the requested levels are indeed present in the vector before generating the logical vector indicating which elements correspond to these levels

fct_match(iris$Species, "selosa", "virginica")
#>  Error: Level(s) not present in factor: "selosa" 

and otherwise generates the logical vector, e.g. to be passed to filter

fct_match(iris$Species, "setosa", "virginica") %>% table()
# FALSE  TRUE 
#    50   100 

filter(iris, fct_match(Species, "virginica")) %>% head()
#  Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#1          6.3         3.3          6.0         2.5 virginica
#2          5.8         2.7          5.1         1.9 virginica
#3          7.1         3.0          5.9         2.1 virginica
#4          6.3         2.9          5.6         1.8 virginica
#5          6.5         3.0          5.8         2.2 virginica
#6          7.6         3.0          6.6         2.1 virginica
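For illustration, the behaviour described above could be sketched in base R roughly as follows. This is not the PR prototype or any forcats implementation: the names `fct_match_sketch` and `fct_exclude_sketch` are hypothetical, and the assumption that fct_exclude validates the levels and then negates the match is mine.

```r
# Hypothetical sketch of the proposed behaviour; not the forcats implementation.
fct_match_sketch <- function(f, ...) {
  lvls <- c(...)          # requested levels, passed as separate arguments
  f <- as.factor(f)

  # Validate first: every requested level must exist in the factor
  missing_lvls <- setdiff(lvls, levels(f))
  if (length(missing_lvls) > 0) {
    stop(
      "Level(s) not present in factor: ",
      paste0('"', missing_lvls, '"', collapse = ", "),
      call. = FALSE
    )
  }

  # Only then build the logical vector
  f %in% lvls
}

# Assumed counterpart: same validation, negated result
fct_exclude_sketch <- function(f, ...) {
  !fct_match_sketch(f, ...)
}
```

With this sketch, `fct_match_sketch(iris$Species, "selosa", "virginica")` stops with an error naming "selosa", while `fct_match_sketch(iris$Species, "setosa", "virginica")` returns a logical vector with 100 TRUE values that can be handed to filter.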

I will submit a PR prototype (with testing) to accompany this Issue.
