Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement check_keys #1619

Closed
hadley opened this issue Jan 12, 2016 · 15 comments
Closed

Implement check_keys #1619

hadley opened this issue Jan 12, 2016 · 15 comments
Assignees
Labels
feature a feature request or enhancement verbs 🏃‍♀️
Milestone

Comments

@hadley
Copy link
Member

hadley commented Jan 12, 2016

This is very rarely desirable (however we will need an option to turn the warning off in case you do actually want it).

cc @jennybc

@krlmlr
Copy link
Member

krlmlr commented Jan 25, 2016

I wonder how to achieve this for SQL sources.

@hadley hadley added the feature a feature request or enhancement label Mar 1, 2016
@hadley hadley added this to the 0.5 milestone Mar 1, 2016
@hadley
Copy link
Member Author

hadley commented Mar 1, 2016

Or if keys are missing (#1590)

@hadley
Copy link
Member Author

hadley commented May 22, 2016

For now, let's make this an explicit check_keys() function. This could be a simple wrapper around count - the main challenge will be a nice output, particularly if there are many problems.

@krlmlr
Copy link
Member

krlmlr commented May 22, 2016

I have been using a similar functionality in my scripts. I think it suffices if the first few collisions are reported.

Should check_keys() call compute() for SQL sources?

@rtaph
Copy link

rtaph commented Dec 1, 2016

@hadley I agree it is rarely desirable to require that both x and y not have duplicates, but would argue it is common not to want the keys in y to have duplicates.

When left joining in particular, being able to require a non-duplicative property in the foreign key y ensures that conceptual meaning of a row is preserved through the join operation.

@hadley hadley changed the title In joins, warn if both x and y have duplicate keys Implement check_join function Feb 22, 2017
@hadley hadley changed the title Implement check_join function Implement check_keys Feb 22, 2017
@hadley
Copy link
Member Author

hadley commented Feb 23, 2017

A natural implementation would be something like this:

check_keys <- function(.data, ...) {
  keys <- select(.data, ...)
  keys <- ungroup(keys)
  keys <- mutate(keys, `__id` = 1:row_number())

  # Check that keys are unique
  dups <- group_by_all(keys)
  dups <- summarise(dups, n = n())
  dups <- filter(dups, n > 1)
  dups_keys <- semi_join(keys, dups, by = names(keys))
  
  # Check that keys are not missing
  miss <- filter_all(keys, is.na)
  miss_keys <- semi_join(keys, dups, by = names(keys))
}

But this requires both group_by_all() and filter_all()

@hadley
Copy link
Member Author

hadley commented Feb 23, 2017

I think we'll probably need two functions: one that throws an error if anything is wrong, the other should return a tibble with one row per problem.

@krlmlr
Copy link
Member

krlmlr commented Feb 23, 2017

Should this be part of tidyr, because we're only using exported verbs here? How does this interact with set_key() (#1792)?

@romainfrancois
Copy link
Member

We have group_by_all and filter_all now. I don't really understand what this is about though.

@krlmlr
Copy link
Member

krlmlr commented May 30, 2018

I think this is about a utility function that verifies if a selection of "key columns" defines a key on the tibble, i.e. that for each unique combination of values in the "key columns" there is at most one row.

@romainfrancois
Copy link
Member

romainfrancois commented May 30, 2018

Something like this then conceptually using some of the new tools from #3574

check_keys <- function(.data, ...){
  group_by(.data, ...) %>% 
    group_rows() %>% 
    lengths() %>% 
    all(. == 1L)
}

mtcars %>% 
  check_keys(mpg, cyl, qsec)

Internalizing this by reusing some of the code from group_by would be quicker.

@krlmlr
Copy link
Member

krlmlr commented May 30, 2018

Maybe we can just look at group_data() ?

@romainfrancois
Copy link
Member

what do you mean ? instead of group_rows ?

@hadley
Copy link
Member Author

hadley commented May 30, 2018

Just leave this one for me - it's more of a user interface issue.

@hadley
Copy link
Member Author

hadley commented Dec 11, 2019

I now think this is out of scope for dplyr, and better belongs in its own package, e.g. https://github.com/krlmlr/dm.

@hadley hadley closed this as completed Dec 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement verbs 🏃‍♀️
Projects
None yet
Development

No branches or pull requests

4 participants