set_key #1792

Closed
hadley opened this issue Apr 28, 2016 · 4 comments
Labels
feature a feature request or enhancement

Comments

hadley commented Apr 28, 2016

library(dplyr)  # provides %>%
library(nycflights13)
# set_key() is the function proposed here; it does not exist in dplyr
weather <- weather %>% set_key(year, month, day, hour, origin)
planes <- planes %>% set_key(tailnum)
airlines <- airlines %>% set_key(carrier)
airports <- airports %>% set_key(faa)

This would check that the combination of variables is a valid key (i.e. no duplicates and no missing values), and would store the keys as an attribute. Then joins would use the key attribute (if present) for natural joins, rather than the complete set of variables.
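A minimal base R sketch of that check, under some loud assumptions: `set_key()` is only proposed in this issue and is not a dplyr function, the attribute name `"key"` is made up for illustration, and column names are passed as strings rather than bare names to keep the sketch free of dependencies. The `planes` data frame below is toy data, not the nycflights13 table.

```r
# Hypothetical helper: validates a candidate key and stores it as an attribute.
set_key <- function(data, ...) {
  key <- c(...)
  stopifnot(is.character(key), length(key) > 0)

  unknown <- setdiff(key, names(data))
  if (length(unknown) > 0) {
    stop("Unknown key column(s): ", paste(unknown, collapse = ", "))
  }

  key_cols <- data[key]
  # A valid key has no missing values ...
  if (any(vapply(key_cols, anyNA, logical(1)))) {
    stop("Key columns contain missing values")
  }
  # ... and no duplicated combinations.
  if (anyDuplicated(key_cols) > 0) {
    stop("Key columns do not uniquely identify rows")
  }

  attr(data, "key") <- key  # assumed attribute name
  data
}

# Toy data for illustration only
planes <- data.frame(tailnum = c("N101", "N102", "N103"), seats = c(2L, 4L, 6L))
planes <- set_key(planes, "tailnum")
attr(planes, "key")
```

A real implementation would also have to decide which verbs preserve the attribute, since many data frame operations drop custom attributes.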

cc @jennybc @krlmlr

krlmlr commented Apr 28, 2016

  • What happens if two tbls have different keys set?
    • "No key set" is a special case here, but a subset or superset should be permissible, too
    • The result could inherit the key property in some cases
    • Examples:
      • A + B x ? -> Joining over A + B
      • A + B x A -> Joining over A + B, A + B is a key in the result
      • A + B x A + B + C -> Joining over A + B, A + B + C is a key in the result
      • A + B + C x A + B + D -> Error, please specify "by"
  • Is there going to be a similar interface for secondary indexes?
    • A key can become a secondary index after the join

hadley commented Apr 28, 2016

If two tables have different keys, the natural join uses the intersection.
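The intersection rule could be sketched roughly like this in base R. `natural_join_by()` and the `"key"` attribute are hypothetical names; the fallback to all shared columns when a table has no key is an assumption, not something settled in this thread. The data frames are toy data.

```r
# Hypothetical helper: pick the join columns for a natural join.
natural_join_by <- function(x, y) {
  key_x <- attr(x, "key")
  key_y <- attr(y, "key")
  if (is.null(key_x) || is.null(key_y)) {
    # Assumption: without keys on both sides, use every shared column
    return(intersect(names(x), names(y)))
  }
  intersect(key_x, key_y)
}

x <- data.frame(A = c(1, 1, 2), B = c("a", "b", "a"), value = c(10, 20, 30))
y <- data.frame(A = c(1, 1, 2), B = c("a", "a", "a"),
                C = c("p", "q", "p"), value = c(1, 2, 3))
attr(x, "key") <- c("A", "B")
attr(y, "key") <- c("A", "B", "C")

by <- natural_join_by(x, y)  # the shared non-key column `value` is not used
joined <- merge(x, y, by = by)
names(joined)
```

The point of the example is the `value` column: a natural join over all shared variables would also match on it, while the key-based version joins on `A` and `B` only and keeps both `value` columns.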

What's a secondary index?

krlmlr commented Apr 28, 2016

I guess we don't want to use the intersection if it's empty. Also, if it's not a strict subset or superset (A + B x A + C), no key can be defined, and the result is more like a cross join within strata. What are the applications of such a join? (The key of the result is A + B + C.)
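To make the A + B x A + C case concrete, here is a small base R illustration with toy data. Joining over the intersection (A alone) pairs every B with every C inside each value of A, and only the combination A + B + C identifies rows in the result.

```r
# x is keyed by A + B, y is keyed by A + C (toy data)
x <- data.frame(A = c(1, 1, 2), B = c("b1", "b2", "b1"))
y <- data.frame(A = c(1, 1, 2), C = c("c1", "c2", "c1"))

# Join over the intersection of the keys, which is just A
joined <- merge(x, y, by = "A")

# Stratum A = 1 contributes 2 x 2 rows, stratum A = 2 contributes 1 x 1
nrow(joined)                               # 5

# A + B no longer identifies rows, but A + B + C does
anyDuplicated(joined[c("A", "B")]) > 0     # TRUE
anyDuplicated(joined[c("A", "B", "C")])    # 0
```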

Secondary index: One that's used only for optimizing query execution but doesn't have the uniqueness property. Don't know if this is the "official" terminology, though.

hadley commented Dec 11, 2019

I now think this is out of scope for dplyr, because it's complex enough that it needs its own package, e.g. https://github.com/krlmlr/dm

hadley closed this as completed Dec 11, 2019