Implement check_keys #1619
Comments
I wonder how to achieve this for SQL sources. |
Or if keys are missing (#1590) |
For now, let's make this an explicit |
I have been using a similar functionality in my scripts. I think it suffices if the first few collisions are reported. Should check_keys() call compute() for SQL sources? |
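A minimal sketch of reporting only the first few collisions, assuming a local data frame. The helper name `first_collisions()` and its `n_max` argument are hypothetical, not part of dplyr. It only uses `count()`, `filter()` and `head()`, which should also be translatable for lazy SQL tables, though whether an explicit `compute()` is needed there is left open.

```r
library(dplyr)

# Hypothetical helper: return at most `n_max` key combinations
# that occur more than once, together with their row counts.
first_collisions <- function(.data, ..., n_max = 5) {
  .data %>%
    ungroup() %>%
    count(..., name = "n") %>%  # one row per key combination
    filter(n > 1) %>%           # keep only the collisions
    head(n_max)                 # report just the first few
}

# Made-up example data, for illustration only
orders <- tibble(
  customer = c("a", "a", "b", "c", "c", "c"),
  item     = c(1, 1, 2, 3, 3, 4)
)

first_collisions(orders, customer, item)
```

Capping the output with `head()` keeps the report small even when a key is badly violated.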
@hadley I agree it is rarely desirable to require that both When left joining in particular, being able to require a non-duplicative property in the foreign key |
A natural implementation would be something like this:
check_keys <- function(.data, ...) {
  keys <- select(.data, ...)
  keys <- ungroup(keys)
  key_names <- names(keys)
  # Row ids locate the offending rows; they are not part of the key
  keys <- mutate(keys, `__id` = row_number())
  # Check that keys are unique
  dups <- group_by_at(keys, key_names)
  dups <- summarise(dups, n = n())
  dups <- filter(dups, n > 1)
  dups_keys <- semi_join(keys, dups, by = key_names)
  # Check that keys are not missing
  miss_keys <- filter_at(keys, key_names, any_vars(is.na(.)))
  list(duplicated = dups_keys, missing = miss_keys)
}
But this requires both |
I think we'll probably need two functions: one that throws an error if anything is wrong, the other should return a tibble with one row per problem. |
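The two-function split could be sketched roughly as follows. Both names, `key_problems()` and `assert_keys()`, are hypothetical and only illustrate the idea; neither exists in dplyr. The sketch assumes a local data frame and bare column names as keys.

```r
library(dplyr)

# Hypothetical helper: one row per problem (a duplicated or a missing key).
key_problems <- function(.data, ...) {
  counts <- .data %>%
    ungroup() %>%
    count(..., name = "n")
  key_names <- setdiff(names(counts), "n")
  # TRUE for key combinations that contain at least one NA
  has_na <- rowSums(is.na(counts[key_names])) > 0

  bind_rows(
    counts %>% filter(n > 1) %>% mutate(problem = "duplicated"),
    counts[has_na, ] %>% mutate(problem = "missing")
  )
}

# Hypothetical helper: error if anything is wrong, else return the input.
assert_keys <- function(.data, ...) {
  problems <- key_problems(.data, ...)
  if (nrow(problems) > 0) {
    stop(nrow(problems), " key problem(s) found; see key_problems().",
         call. = FALSE)
  }
  invisible(.data)
}

# Made-up example data, for illustration only
people <- tibble(id = c(1, 2, 2, NA), name = c("Ann", "Bob", "Bo", "Cy"))
key_problems(people, id)
```

Building the asserting function on top of the reporting one keeps the two in agreement about what counts as a problem.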
Should this be part of |
We have |
I think this is about a utility function that verifies if a selection of "key columns" defines a key on the tibble, i.e. that for each unique combination of values in the "key columns" there is at most one row. |
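That definition can be illustrated with base R alone. The helper name `is_key()` is hypothetical; `anyDuplicated()` returns 0 when no row of the selected columns is repeated.

```r
# Hypothetical helper: TRUE if every combination of values in `cols`
# appears in at most one row of `df`.
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

# Made-up example data, for illustration only
flights <- data.frame(
  carrier = c("AA", "AA", "UA"),
  flight  = c(1, 2, 1),
  day     = c(1, 1, 1)
)

is_key(flights, c("carrier", "flight"))  # TRUE: each pair occurs once
is_key(flights, "carrier")               # FALSE: "AA" occurs twice
```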
Something like this then, conceptually, using some of the new tools from #3574:
check_keys <- function(.data, ...) {
  group_by(.data, ...) %>%
    group_rows() %>%
    lengths() %>%
    # Braces stop the pipe from also passing the lengths as the first argument of all()
    { all(. == 1L) }
}
mtcars %>%
  check_keys(mpg, cyl, qsec)
Internalizing this by reusing some of the code from group_by would be quicker. |
Maybe we can just look at |
What do you mean? Instead of group_rows? |
Just leave this one for me - it's more of a user interface issue. |
I now think this is out of scope for dplyr, and better belongs in its own package, e.g. https://github.com/krlmlr/dm. |
This is very rarely desirable (however we will need an option to turn the warning off in case you do actually want it).
cc @jennybc