Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join_by(): Syntax for generic joins #2240

Closed
krlmlr opened this issue Nov 8, 2016 · 13 comments · Fixed by #5910
Closed

join_by(): Syntax for generic joins #2240

krlmlr opened this issue Nov 8, 2016 · 13 comments · Fixed by #5910
Labels
feature a feature request or enhancement tables 🧮 joins and set operations

Comments

@krlmlr
Copy link
Member

krlmlr commented Nov 8, 2016

#557 (comment) and #378 (comment) propose a syntax for generic and rolling joins:

left_join( 
  FundMonths, Returns, 
  join_by(FundID == FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15)
)

left_join( 
  events, days,
  join_by(collector_id == collector_id, event_timestamp >= largest(day))
)

As usual, this should be powered by an SE version join_by_().

We can pass this to the SQL engine (and perhaps to data tables) with relatively little work, the main challenge will be to implement this for data frames.

@ywhcuhk

This comment has been minimized.

@001ben

This comment has been minimized.

@dholstius
Copy link

Would this cover wildcard joins as well?

@krlmlr
Copy link
Member Author

krlmlr commented Feb 28, 2018

Need to keep in mind that dbplyr now translates full joins with by = c() as cross joins (tidyverse/dbplyr@3316463).

@krlmlr

This comment has been minimized.

@krlmlr
Copy link
Member Author

krlmlr commented Mar 2, 2018

We also need to review what fuzzyjoin already offers.

@jtelleriar

This comment has been minimized.

@kyletuft

This comment has been minimized.

@roblisy

This comment has been minimized.

@njdcornites

This comment has been minimized.

@hadley
Copy link
Member

hadley commented Dec 10, 2019

Me too comments are not very useful for us; please just 👍 the issue.

@hadley hadley added tables 🧮 joins and set operations and removed verbs 🏃‍♀️ labels Dec 11, 2019
@DavisVaughan
Copy link
Member

A thought on the NSE join_by() implementation. Evaluate the ... in an env that masks the comparison ops, ==, >=, etc., to separate the LHS/RHS for further evaluation in the context of the appropriate data frame

library(rlang)
#> Warning: package 'rlang' was built under R version 4.0.2

capture_equal_equal <- function(e1, e2) {
  list(op = "==", lhs = enexpr(e1), rhs = enexpr(e2))
}
capture_greater <- function(e1, e2) {
  list(op = ">", lhs = enexpr(e1), rhs = enexpr(e2))
}
capture_greater_equal <- function(e1, e2) {
  list(op = ">=", lhs = enexpr(e1), rhs = enexpr(e2))
}
capture_less <- function(e1, e2) {
  list(op = "<", lhs = enexpr(e1), rhs = enexpr(e2))
}
capture_less_equal <- function(e1, e2) {
  list(op = "<=", lhs = enexpr(e1), rhs = enexpr(e2))
}

bindings <- list(
  `==` = capture_equal_equal, 
  `>` = capture_greater,
  `>=` = capture_greater_equal,
  `<` = capture_less,
  `<=` = capture_less_equal
)

env_join_by <- new_environment(bindings)

eval_join_by <- function(quo) {
  expr <- quo_get_expr(quo)
  
  # Bare symbol of `x` translates to `x == x`
  if (is_symbol(expr)) {
    expr <- expr(`==`(!!expr, !!expr))
  }
  
  # Evaluate expression in limited env
  out <- eval_bare(expr, env_join_by)
  
  # Retain quosure env for later evaluation
  env <- list(env = quo_get_env(quo))
  out <- c(out, env)
  
  out
}

join_by <- function(...) {
  quos <- enquos(...)
  
  if (any(names(quos) != "")) {
    abort("`=` used in `join_by()`, did you mean `==`?")
  }
  
  lapply(quos, eval_join_by)
}

process_join_by_op <- function(op, data_lhs, data_rhs) {
  # Evaluate with lhs/rhs data mask
  list(
    op = op$op,
    lhs = eval_tidy(op$lhs, data = data_lhs, env = op$env),
    rhs = eval_tidy(op$rhs, data = data_rhs, env = op$env)
  )
}
process_join_by_ops <- function(ops, data_lhs, data_rhs) {
  lapply(ops, process_join_by_op, data_lhs = data_lhs, data_rhs = data_rhs)
}

lhs <- data.frame(x = "foo", yearmonth = 2)
rhs <- data.frame(x = c("bar", "baz"), gmonth = 5:6)

vec <- 5

ops <- join_by(x, yearmonth - vec >= gmonth + 3)
str(ops)
#> List of 2
#>  $ :List of 4
#>   ..$ op : chr "=="
#>   ..$ lhs: symbol x
#>   ..$ rhs: symbol x
#>   ..$ env:<environment: R_GlobalEnv> 
#>  $ :List of 4
#>   ..$ op : chr ">="
#>   ..$ lhs: language yearmonth - vec
#>   ..$ rhs: language gmonth + 3
#>   ..$ env:<environment: R_GlobalEnv>

ops2 <- process_join_by_ops(ops, lhs, rhs)
str(ops2)
#> List of 2
#>  $ :List of 3
#>   ..$ op : chr "=="
#>   ..$ lhs: chr "foo"
#>   ..$ rhs: chr [1:2] "bar" "baz"
#>  $ :List of 3
#>   ..$ op : chr ">="
#>   ..$ lhs: num -3
#>   ..$ rhs: num [1:2] 8 9

@DavisVaughan
Copy link
Member

I've implemented a rough POC for rolling joins here https://github.com/DavisVaughan/slidejoin, which uses slider as the backend to generate the sliding indices that you join on. Many examples are in the readme. It may serve as inspiration for the rolling backend here, but is otherwise untested and not fit for production use.

  • There is a "simple" slide_join() function, for when you only need a rolling join with no extra by variables.
  • Then there are slide_left_join(), slide_right_join(), slide_inner_join() and slide_full_join() for when you need to perform a rolling join while also having extra by variables to join on.

slide_join() itself is some form of a left join, which feels wrong somehow.

Sliding left/right/inner/full joins are implemented by first performing the corresponding mutating join only on the by columns, while nesting everything else, followed up by slide_join()-ing the resulting joined nested columns. This is probably not that performant, and might have the wrong semantics, but I imagine getting this exactly right will only be possible by applying the rolling bit within the *_join() functions themselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tables 🧮 joins and set operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants