
Flexible joins #5910

Merged May 9, 2022 · 96 commits
Conversation

@DavisVaughan (Member) commented Jun 8, 2021

Closes #5914
Closes #5661
Closes #5413
Closes #2240 (a whopping 5 years old! with 138 thumbs up!)

Reading order

  • NEWS

  • join-by.R

    • To read join_by() docs / examples and to get a general idea of how it works
    • I've tried to error intelligently for many bad user inputs in the parsing code. This probably doesn't need quite as deep of a review.
  • join.R

    • To read updated docs
    • Note that all joins go through join_rows() now
  • join-rows.R

    • This basically remaps dplyr arguments to vec_matches() arguments and calls vec_matches()
    • Many errors are rethrown from vctrs with more applicable error messages
  • join-cols.R

    • To see how keep = NULL is handled
    • standardise_join_by() was replaced by as_join_by() and join_by_common() in join-by.R
  • Tests

    • I've added quite a few for the join_by() parsing checks
    • I've also added snapshot tests for the rethrown errors from dplyr_matches()
    • It's worth noting that a number of tests needed an explicit multiple = "all" now to silence the new warning

Summary

This PR has two main purposes:

  1. It adds join_by() for creating a join specification that can be passed as the by argument to any join. This allows:

    • Equi join specification with unnamed left and right hand sides, i.e. join_by(date1 == date2)

    • Specification of non-equi joins like join_by(id, date1 >= date2)

    • Specification of rolling joins with join_by(id, preceding(date1, date2)) and join_by(id, following(date1, date2))

    • Shortcuts for some complex non-equi joins: between(), overlaps(), within().

    • There is a restriction that the left and right hand sides of the join conditions have to be symbols or strings, i.e. you can't do join_by(x + 1 > y). It is unclear what the resulting column name should be if you did this, and you can run into operator precedence surprises, e.g. !x > y parses as !(x > y) because > binds more tightly than !.

    • To support non-equi joins, keep has gained a completely new default value of NULL. In non-equi joins like join_by(sale_date >= commercial_date), since the information in the two columns isn't exactly the same, you almost always want to keep both columns. So keep = NULL implies keep = FALSE for equi conditions and keep = TRUE for non-equi conditions. This should be fully backwards compatible.

  2. It adds two new "quality control" arguments, requested over the years, to most of the join functions.

    • multiple is a new argument for controlling what happens when a row in x matches multiple rows in y. It allows c("all", "first", "last", "warning", "error"). The default is NULL, which uses "warning" for equi and rolling joins, where multiple matches are surprising, and "all" for cross joins and when there is at least 1 non-equi join condition, where multiple matches are expected. This is a change from the current CRAN dplyr behavior, which never warns.

    • unmatched is a new argument for controlling what happens when a row would be dropped because it doesn't have a match. It allows c("drop", "error").

    • Combined with the proposed enforce(), this allows for a number of useful checks to ensure that you are doing a 1:1, 1:m, m:1, m:m style join.
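As a sketch of how these checks might combine (using the argument semantics described above; this development API may differ from what is eventually released), a 1:1-style join could be enforced like this:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 3), b = c("y1", "y2", "y3"))

# Enforce a 1:1 join: error if a row of `x` matches multiple rows of `y`
# (`multiple = "error"`), or if any row would be silently dropped because
# it has no match (`unmatched = "error"`)
inner_join(x, y, by = join_by(id), multiple = "error", unmatched = "error")
```

With the toy data above both checks pass; change one id value on either side and the join should fail loudly instead of dropping or duplicating rows.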

Not implemented

Future enhancements for other PRs (as suggested from comments):

  • join_any() that would accept arbitrary predicates and would perform a much slower Cartesian join + filter using the predicates. It could do it in batches to reduce peak memory. This would require a completely separate engine path for each type of join.

  • join_at() a tidy select interface for equi joins if you have many columns of the same name to join by. This would have an as_join_by() method that translated the result to a dplyr_join_by object.
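Neither helper exists yet. As a hedged illustration of the join_at() idea only, the same effect can be had today by computing the key names up front and passing them as by (the grep() pattern here is invented for the example):

```r
library(dplyr)

x <- tibble(key_a = 1:2, key_b = c("u", "v"), val_x = c(10, 20))
y <- tibble(key_a = 1:2, key_b = c("u", "v"), val_y = c(1, 2))

# Workaround for the (hypothetical) join_at(): select the join columns
# by name pattern, then pass the resulting character vector as `by`
keys <- grep("^key_", names(x), value = TRUE)
left_join(x, y, by = keys)
```

A real join_at() would presumably wrap exactly this kind of tidyselect-driven name matching behind an as_join_by() method, as described above.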

Examples

library(dplyr)

set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))

set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

needles
#> # A tibble: 5 x 2
#>   dates          x
#>   <date>     <int>
#> 1 2019-01-01     3
#> 2 2019-01-02     2
#> 3 2019-01-03     5
#> 4 2019-01-04     4
#> 5 2019-01-05     1
haystack
#> # A tibble: 5 x 3
#>   lower      upper          y
#>   <date>     <date>     <int>
#> 1 2019-01-04 2019-01-06     1
#> 2 2019-01-07 2019-01-08     2
#> 3 2019-01-04 2019-01-05     3
#> 4 2019-01-03 2019-01-05     4
#> 5 2019-01-03 2019-01-05     5

# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-04 2019-01-06     1
#>  6 2019-01-04     4 2019-01-04 2019-01-05     3
#>  7 2019-01-04     4 2019-01-03 2019-01-05     4
#>  8 2019-01-04     4 2019-01-03 2019-01-05     5
#>  9 2019-01-05     1 2019-01-04 2019-01-06     1
#> 10 2019-01-05     1 2019-01-04 2019-01-05     3
#> 11 2019-01-05     1 2019-01-03 2019-01-05     4
#> 12 2019-01-05     1 2019-01-03 2019-01-05     5

# If we really cared about including row 2 from `haystack` that wasn't matched by `needles`...
full_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 13 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-03 2019-01-05     4
#>  6 2019-01-04     4 2019-01-03 2019-01-05     5
#>  7 2019-01-04     4 2019-01-04 2019-01-05     3
#>  8 2019-01-04     4 2019-01-04 2019-01-06     1
#>  9 2019-01-05     1 2019-01-03 2019-01-05     4
#> 10 2019-01-05     1 2019-01-03 2019-01-05     5
#> 11 2019-01-05     1 2019-01-04 2019-01-05     3
#> 12 2019-01-05     1 2019-01-04 2019-01-06     1
#> 13 NA            NA 2019-01-07 2019-01-08     2


sales <- tibble(
  sale_id = c("S0", "S1", "S2", "S3", "S4", "S5"),
  sale_date = as.Date(c("2013-2-20", "2014-5-1", "2014-5-1", "2014-6-15", "2014-7-1", "2014-12-31"))
)
commercials <- tibble(
  comm_id = c("C1", "C2", "C3", "C4", "C5"),
  comm_date = as.Date(c("2014-1-1", "2014-4-1", "2014-7-1", "2014-9-15", "2014-9-15"))
)

sales
#> # A tibble: 6 x 2
#>   sale_id sale_date 
#>   <chr>   <date>    
#> 1 S0      2013-02-20
#> 2 S1      2014-05-01
#> 3 S2      2014-05-01
#> 4 S3      2014-06-15
#> 5 S4      2014-07-01
#> 6 S5      2014-12-31
commercials
#> # A tibble: 5 x 2
#>   comm_id comm_date 
#>   <chr>   <date>    
#> 1 C1      2014-01-01
#> 2 C2      2014-04-01
#> 3 C3      2014-07-01
#> 4 C4      2014-09-15
#> 5 C5      2014-09-15

# Rolling join:
# "Find the most recent commercial that aired just before this sale"
left_join(sales, commercials, by = join_by(max(sale_date >= comm_date)))
#> # A tibble: 7 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S0      2013-02-20 <NA>    NA        
#> 2 S1      2014-05-01 C2      2014-04-01
#> 3 S2      2014-05-01 C2      2014-04-01
#> 4 S3      2014-06-15 C2      2014-04-01
#> 5 S4      2014-07-01 C3      2014-07-01
#> 6 S5      2014-12-31 C4      2014-09-15
#> 7 S5      2014-12-31 C5      2014-09-15

# Notice that rows 6 and 7 have two commercials that aired on the same date.
# Limit that with `multiple`.
left_join(sales[6,], commercials, by = join_by(max(sale_date >= comm_date)), multiple = "first")
#> # A tibble: 1 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S5      2014-12-31 C4      2014-09-15
left_join(sales[6,], commercials, by = join_by(max(sale_date >= comm_date)), multiple = "last")
#> # A tibble: 1 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S5      2014-12-31 C5      2014-09-15

# "Find the most recent commercial that aired just before this sale,
# but limit it to within 40 days of the sale"
sales <- mutate(sales, sale_date_lower = sale_date - 40)

left_join(sales, commercials, by = join_by(max(sale_date >= comm_date), sale_date_lower <= comm_date))
#> # A tibble: 6 x 5
#>   sale_id sale_date  sale_date_lower comm_id comm_date 
#>   <chr>   <date>     <date>          <chr>   <date>    
#> 1 S0      2013-02-20 2013-01-11      <NA>    NA        
#> 2 S1      2014-05-01 2014-03-22      C2      2014-04-01
#> 3 S2      2014-05-01 2014-03-22      C2      2014-04-01
#> 4 S3      2014-06-15 2014-05-06      <NA>    NA        
#> 5 S4      2014-07-01 2014-05-22      C3      2014-07-01
#> 6 S5      2014-12-31 2014-11-21      <NA>    NA
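The between() shortcut listed in the summary is described as covering exactly this kind of paired condition. A sketch, reusing the needles/haystack data from the first example above (untested against this development branch, so the exact spelling may differ from the final API):

```r
# Shortcut form of the earlier pair of conditions:
#   join_by(dates >= lower, dates <= upper)
left_join(needles, haystack, by = join_by(between(dates, lower, upper)))
```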


@hadley (Member) left a comment
This looks like exactly what I was imagining for extending the by specification 😄

With this interface in place there are a couple of other join helpers I can imagine:

  • join_any() which would accept any list of predicates, by first performing a Cartesian join (possibly in batches) and then applying the predicates. The performance wouldn't be any better than a Cartesian join + a filter, but since we could do it iteratively, we could reduce peak memory.
  • join_at() which would use tidyselect to make it easier to join by many common variables

I'm not sure we would actually implement these, but I like having the option.

@DavisVaughan (Member, Author) commented Jun 26, 2021

Note: Go back and answer a few popular community questions.

@hadley (Member) left a comment
Made it up to the end of join-cols.R. I'll tackle the rest later

@hadley (Member) left a comment
Overall, looks great and I think this is good to merge once you've read through my smaller suggestions, and we're ready to start on dplyr 1.1.

Merge branch 'main' into feature/vec-matches

# Conflicts:
#	R/join-cols.R
#	R/join-rows.R
#	R/join.r
#	tests/testthat/_snaps/bind.md
#	tests/testthat/_snaps/join-cols.md
#	tests/testthat/_snaps/join-rows.md
#	tests/testthat/test-join-cols.R
I don't think these can error, as the common type is handled earlier, but it doesn't hurt and is good to be consistent
@DavisVaughan DavisVaughan marked this pull request as ready for review May 3, 2022 19:33
@DavisVaughan DavisVaughan changed the title WIP: Flexible joins Flexible joins May 3, 2022
@DavisVaughan (Member, Author) commented May 4, 2022

  • comperes: Produced warning in test
  • dodgr: Produced warnings in test
  • exuber: Produced warnings in test
  • lans2r: Produced warnings in test
  • MBNMAtime: Produced warnings in test
  • modeldb: Produced warnings in test
  • multicolor: Produced warnings in test
  • parsnip: Produced warnings in Rd file generation and examples
  • sfnetworks: Error due to sfc not being orderable (tackling in vctrs) Rewrite vec_joint_proxy_order() with a more conservative heuristic r-lib/vctrs#1558
  • stars: Expected a different warning in test
  • Tplyr: Produced warnings in test
