
Flexible joins #5910

Merged May 9, 2022 · 96 commits
Conversation

@DavisVaughan (Member) commented Jun 8, 2021

Closes #5914
Closes #5661
Closes #5413
Closes #2240 (a whopping 5 years old! with 138 thumbs up!)

Reading order

  • NEWS

  • join-by.R

    • To read join_by() docs / examples and to get a general idea of how it works
    • I've tried to error intelligently for many bad user inputs in the parsing code. This probably doesn't need quite as deep of a review.
  • join.R

    • To read updated docs
    • Note that all joins go through join_rows() now
  • join-rows.R

    • This basically remaps dplyr arguments to vec_matches() arguments and calls vec_matches()
    • Many errors are rethrown from vctrs with more applicable error messages
  • join-cols.R

    • To see how keep = NULL is handled
    • standardise_join_by() was replaced by as_join_by() and join_by_common() in join-by.R
  • Tests

    • I've added quite a few for the join_by() parsing checks
    • I've also added snapshot tests for the rethrown errors from dplyr_matches()
    • It's worth noting that a number of tests needed an explicit multiple = "all" now to silence the new warning

Summary

This PR has two main purposes:

  1. It adds join_by() for creating a join specification that can be passed as the by argument to any join. This allows:

    • Equi join specification with unnamed left and right hand sides, i.e. join_by(date1 == date2)

    • Specification of non-equi joins like join_by(id, date1 >= date2)

    • Specification of rolling joins with join_by(id, preceding(date1, date2)) and join_by(id, following(date1, date2))

    • Shortcuts for some complex non-equi joins: between(), overlaps(), within().

    • There is a restriction that the left and right hand sides of the join conditions have to be symbols or strings, i.e. you can't do join_by(x + 1 > y). It is unclear what the resulting column name should be if you did this, and you can run into operator precedence surprises, e.g. !x > y parses as !(x > y) because > binds more tightly than !.

    • To support non-equi joins, keep has gained a completely new default value of NULL. In non-equi joins like join_by(sale_date >= commercial_date), since the information in the two columns isn't exactly the same, you almost always want to keep both columns. So keep = NULL implies keep = FALSE for equi conditions and keep = TRUE for non-equi conditions. This should be fully backwards compatible.

  2. It adds two new "quality control" arguments, requested over the years, to most of the join functions.

    • multiple is a new argument for controlling what happens when a row in x matches multiple rows in y. It allows c("all", "first", "last", "warning", "error"). The default is NULL, which uses "warning" for equi and rolling joins, where multiple matches are surprising, and "all" for cross joins and when there is at least 1 non-equi join condition, where multiple matches are expected. This is a change from the current CRAN dplyr behavior, which never warns.

    • unmatched is a new argument for controlling what happens when a row would be dropped because it doesn't have a match. It allows c("drop", "error").

    • Combined with the proposed enforce(), this allows for a number of useful checks to ensure that you are doing a 1:1, 1:m, m:1, m:m style join.
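As a sketch of how these checks might combine (using the argument semantics described above; this development API may differ from what is eventually released), a 1:1-style join could be enforced like this:

```r
library(dplyr)

x <- tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 2, 3), b = c("y1", "y2", "y3"))

# Enforce a 1:1 join: error if a row of `x` matches multiple rows of `y`
# (`multiple = "error"`), or if any row would be silently dropped because
# it has no match (`unmatched = "error"`)
inner_join(x, y, by = join_by(id), multiple = "error", unmatched = "error")
```

With the toy data above both checks pass; change one id value on either side and the join should fail loudly instead of dropping or duplicating rows.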

Not implemented

Future enhancements for other PRs (as suggested from comments):

  • join_any() that would accept arbitrary predicates and would perform a much slower Cartesian join + filter using the predicates. It could do it in batches to reduce peak memory. This would require a completely separate engine path for each type of join.

  • join_at() a tidy select interface for equi joins if you have many columns of the same name to join by. This would have an as_join_by() method that translated the result to a dplyr_join_by object.
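Neither helper exists yet. As a hedged illustration of the join_at() idea only, the same effect can be had today by computing the key names up front and passing them as by (the grep() pattern here is invented for the example):

```r
library(dplyr)

x <- tibble(key_a = 1:2, key_b = c("u", "v"), val_x = c(10, 20))
y <- tibble(key_a = 1:2, key_b = c("u", "v"), val_y = c(1, 2))

# Workaround for the (hypothetical) join_at(): select the join columns
# by name pattern, then pass the resulting character vector as `by`
keys <- grep("^key_", names(x), value = TRUE)
left_join(x, y, by = keys)
```

A real join_at() would presumably wrap exactly this kind of tidyselect-driven name matching behind an as_join_by() method, as described above.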

Examples

library(dplyr)

set.seed(123)
dates <- as.Date("2019-01-01") + 0:4
needles <- tibble(dates = dates, x = sample(length(dates)))

set.seed(123)
lower <- as.Date("2019-01-01") + sample(6, 5, replace = TRUE)
upper <- lower + sample(2, 5, replace = TRUE)
haystack <- tibble(lower = lower, upper = upper, y = sample(length(lower)))

needles
#> # A tibble: 5 x 2
#>   dates          x
#>   <date>     <int>
#> 1 2019-01-01     3
#> 2 2019-01-02     2
#> 3 2019-01-03     5
#> 4 2019-01-04     4
#> 5 2019-01-05     1
haystack
#> # A tibble: 5 x 3
#>   lower      upper          y
#>   <date>     <date>     <int>
#> 1 2019-01-04 2019-01-06     1
#> 2 2019-01-07 2019-01-08     2
#> 3 2019-01-04 2019-01-05     3
#> 4 2019-01-03 2019-01-05     4
#> 5 2019-01-03 2019-01-05     5

# Non-equi join
# For each row in `needles`, find locations in `haystack` matching the condition
left_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 12 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-04 2019-01-06     1
#>  6 2019-01-04     4 2019-01-04 2019-01-05     3
#>  7 2019-01-04     4 2019-01-03 2019-01-05     4
#>  8 2019-01-04     4 2019-01-03 2019-01-05     5
#>  9 2019-01-05     1 2019-01-04 2019-01-06     1
#> 10 2019-01-05     1 2019-01-04 2019-01-05     3
#> 11 2019-01-05     1 2019-01-03 2019-01-05     4
#> 12 2019-01-05     1 2019-01-03 2019-01-05     5

# If we really cared about including row 2 from `haystack` that wasn't matched by `needles`...
full_join(needles, haystack, by = join_by(dates >= lower, dates <= upper))
#> # A tibble: 13 x 5
#>    dates          x lower      upper          y
#>    <date>     <int> <date>     <date>     <int>
#>  1 2019-01-01     3 NA         NA            NA
#>  2 2019-01-02     2 NA         NA            NA
#>  3 2019-01-03     5 2019-01-03 2019-01-05     4
#>  4 2019-01-03     5 2019-01-03 2019-01-05     5
#>  5 2019-01-04     4 2019-01-03 2019-01-05     4
#>  6 2019-01-04     4 2019-01-03 2019-01-05     5
#>  7 2019-01-04     4 2019-01-04 2019-01-05     3
#>  8 2019-01-04     4 2019-01-04 2019-01-06     1
#>  9 2019-01-05     1 2019-01-03 2019-01-05     4
#> 10 2019-01-05     1 2019-01-03 2019-01-05     5
#> 11 2019-01-05     1 2019-01-04 2019-01-05     3
#> 12 2019-01-05     1 2019-01-04 2019-01-06     1
#> 13 NA            NA 2019-01-07 2019-01-08     2


sales <- tibble(
  sale_id = c("S0", "S1", "S2", "S3", "S4", "S5"),
  sale_date = as.Date(c("2013-2-20", "2014-5-1", "2014-5-1", "2014-6-15", "2014-7-1", "2014-12-31"))
)
commercials <- tibble(
  comm_id = c("C1", "C2", "C3", "C4", "C5"),
  comm_date = as.Date(c("2014-1-1", "2014-4-1", "2014-7-1", "2014-9-15", "2014-9-15"))
)

sales
#> # A tibble: 6 x 2
#>   sale_id sale_date 
#>   <chr>   <date>    
#> 1 S0      2013-02-20
#> 2 S1      2014-05-01
#> 3 S2      2014-05-01
#> 4 S3      2014-06-15
#> 5 S4      2014-07-01
#> 6 S5      2014-12-31
commercials
#> # A tibble: 5 x 2
#>   comm_id comm_date 
#>   <chr>   <date>    
#> 1 C1      2014-01-01
#> 2 C2      2014-04-01
#> 3 C3      2014-07-01
#> 4 C4      2014-09-15
#> 5 C5      2014-09-15

# Rolling join:
# "Find the most recent commercial that aired just before this sale"
left_join(sales, commercials, by = join_by(max(sale_date >= comm_date)))
#> # A tibble: 7 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S0      2013-02-20 <NA>    NA        
#> 2 S1      2014-05-01 C2      2014-04-01
#> 3 S2      2014-05-01 C2      2014-04-01
#> 4 S3      2014-06-15 C2      2014-04-01
#> 5 S4      2014-07-01 C3      2014-07-01
#> 6 S5      2014-12-31 C4      2014-09-15
#> 7 S5      2014-12-31 C5      2014-09-15

# Notice that rows 6 and 7 have two commercials that aired on the same date.
# Limit that with `multiple`.
left_join(sales[6,], commercials, by = join_by(max(sale_date >= comm_date)), multiple = "first")
#> # A tibble: 1 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S5      2014-12-31 C4      2014-09-15
left_join(sales[6,], commercials, by = join_by(max(sale_date >= comm_date)), multiple = "last")
#> # A tibble: 1 x 4
#>   sale_id sale_date  comm_id comm_date 
#>   <chr>   <date>     <chr>   <date>    
#> 1 S5      2014-12-31 C5      2014-09-15

# "Find the most recent commercial that aired just before this sale,
# but limit it to within 40 days of the sale"
sales <- mutate(sales, sale_date_lower = sale_date - 40)

left_join(sales, commercials, by = join_by(max(sale_date >= comm_date), sale_date_lower <= comm_date))
#> # A tibble: 6 x 5
#>   sale_id sale_date  sale_date_lower comm_id comm_date 
#>   <chr>   <date>     <date>          <chr>   <date>    
#> 1 S0      2013-02-20 2013-01-11      <NA>    NA        
#> 2 S1      2014-05-01 2014-03-22      C2      2014-04-01
#> 3 S2      2014-05-01 2014-03-22      C2      2014-04-01
#> 4 S3      2014-06-15 2014-05-06      <NA>    NA        
#> 5 S4      2014-07-01 2014-05-22      C3      2014-07-01
#> 6 S5      2014-12-31 2014-11-21      <NA>    NA
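The between() shortcut listed in the summary is described as covering exactly this kind of paired condition. A sketch, reusing the needles/haystack data from the first example above (untested against this development branch, so the exact spelling may differ from the final API):

```r
# Shortcut form of the earlier pair of conditions:
#   join_by(dates >= lower, dates <= upper)
left_join(needles, haystack, by = join_by(between(dates, lower, upper)))
```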


@hadley (Member) left a comment
This looks like exactly what I was imagining for extending the by specification 😄

With this interface in place there are a couple of other join helpers I can imagine:

  • join_any() which would accept any list of predicates, by first performing a Cartesian join (possibly in batches) and then applying the predicates. The performance wouldn't be any better than a Cartesian join + a filter, but since we could do it iteratively, we could reduce peak memory.
  • join_at() which would use tidyselect to make it easier to join by many common variables

I'm not sure we would actually implement these, but I like having the option.

@DavisVaughan (Member, Author) commented Jun 26, 2021

Note: Go back and answer a few popular community questions.

@hadley (Member) left a comment
Made it up to the end of join-cols.R. I'll tackle the rest later

@hadley (Member) left a comment
Overall, looks great and I think this is good to merge once you've read through my smaller suggestions, and we're ready to start on dplyr 1.1.

Merge branch 'main' into feature/vec-matches

# Conflicts:
#	R/join-cols.R
#	R/join-rows.R
#	R/join.r
#	tests/testthat/_snaps/bind.md
#	tests/testthat/_snaps/join-cols.md
#	tests/testthat/_snaps/join-rows.md
#	tests/testthat/test-join-cols.R
I don't think these can error, as the common type is handled earlier, but it doesn't hurt and is good to be consistent
@DavisVaughan DavisVaughan marked this pull request as ready for review May 3, 2022 19:33
@DavisVaughan DavisVaughan changed the title WIP: Flexible joins Flexible joins May 3, 2022
@DavisVaughan (Member, Author) commented May 4, 2022

  • comperes: Produced warning in test
  • dodgr: Produced warnings in test
  • exuber: Produced warnings in test
  • lans2r: Produced warnings in test
  • MBNMAtime: Produced warnings in test
  • modeldb: Produced warnings in test
  • multicolor: Produced warnings in test
  • parsnip: Produced warnings in Rd file generation and examples
  • sfnetworks: Error due to sfc not being orderable (tackling in vctrs) Rewrite vec_joint_proxy_order() with a more conservative heuristic r-lib/vctrs#1558
  • stars: Expected a different warning in test
  • Tplyr: Produced warnings in test
