Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outer joins don't keep join columns from both sides #4589

Closed
skinner opened this issue Oct 10, 2019 · 16 comments
Closed

outer joins don't keep join columns from both sides #4589

skinner opened this issue Oct 10, 2019 · 16 comments
Labels
feature a feature request or enhancement tables 🧮 joins and set operations

Comments

@skinner
Copy link

skinner commented Oct 10, 2019

The docs say

full_join() return all rows and all columns from both x and y

but it doesn't return all columns; I expect the result of this join to have four columns, but it has three:

> ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
> tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
> dplyr::full_join(ta, tx, by=c(a="x"), na_matches = "never")
# A tibble: 7 x 3
      a     b     y
  <dbl> <dbl> <dbl>
1    NA     1    NA
2     2     2    NA
3     3     3     3
4     3     4     3
5     4    NA     4
6     5    NA     5
7    NA    NA     6

I'm interested in questions like "how many unique a have a matching x?" (and vice versa). To answer that, I'd need both a and x to exist in the output.

Presumably, dplyr collapses join columns down to one because in some joins (e.g. inner join) they'd be the same. But with outer joins they're different.

Left, right, and full joins all appear to have this behavior. This is with dplyr_0.8.1

@romainfrancois
Copy link
Member

Maybe nest_join() can help you ?

library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))

nest_join(ta, tx, by=c(a="x"))
#> # A tibble: 4 x 3
#>       a     b tx              
#>   <dbl> <dbl> <list>          
#> 1    NA     1 <tibble [1 × 1]>
#> 2     2     2 <tibble [0 × 1]>
#> 3     3     3 <tibble [1 × 1]>
#> 4     3     4 <tibble [1 × 1]>
nest_join(tx, ta, by=c(x="a"))
#> # A tibble: 4 x 3
#>       x     y ta              
#>   <dbl> <dbl> <list>          
#> 1     3     3 <tibble [2 × 1]>
#> 2     4     4 <tibble [0 × 1]>
#> 3     5     5 <tibble [0 × 1]>
#> 4    NA     6 <tibble [1 × 1]>

Created on 2019-11-18 by the reprex package (v0.3.0.9000)

@skinner
Copy link
Author

skinner commented Nov 18, 2019

Hm, thanks for the pointer (to nest_join and to reprex!), and for looking at the issue. I ended up making copies of both join columns and then renaming them back after the join:

library(dplyr, warn.conflicts=FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))

dplyr::full_join(
  ta %>% mutate(a_copy = a),
  tx %>% mutate(x_copy = x),
  by = c(a="x"),
  na_matches = "never"
) %>%
  dplyr::select(-a) %>%
  dplyr::rename(a = a_copy, x = x_copy)
#> # A tibble: 7 x 4
#>       b     a     y     x
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    NA    NA    NA
#> 2     2     2    NA    NA
#> 3     3     3     3     3
#> 4     4     3     3     3
#> 5    NA    NA     4     4
#> 6    NA    NA     5     5
#> 7    NA    NA     6    NA

Created on 2019-11-18 by the reprex package (v0.3.0)

I still think that full_join should keep the original columns, because there's information there that otherwise gets lost in a full join.

Maybe there are backward-compatibility concerns at this point, though. If you do choose to keep the current behavior, I think it makes sense to document that the original columns aren't all kept.

@hadley hadley added feature a feature request or enhancement tables 🧮 joins and set operations labels Dec 10, 2019
@hadley
Copy link
Member

hadley commented Jan 11, 2020

I think full_join() will gain a keep argument that works similarly to nest_join(). It'll default to FALSE for backward compatibility, but you'll be able to set it to TRUE if needed.

@skinner
Copy link
Author

skinner commented Jan 11, 2020

Adding an argument (with the previous behavior as a default) makes sense for back-compat. I'm not sure exactly what "works similarly to nest_join()" means; I'd expect full_join to return a result that's the same as the one from the column-copying version in my last comment (maybe with a different ordering of columns). That's what a database would give you; I guess that's the perspective I'm coming from. Here's the postgres result, for example:

mitch=> select * from ta;
 a | b 
---+---
   | 1
 2 | 2
 3 | 3
 3 | 4
(4 rows)

mitch=> select * from tx;
 x | y 
---+---
 3 | 3
 4 | 4
 5 | 5
   | 6
(4 rows)

mitch=> select * from ta full outer join tx on ta.a = tx.x;
 a | b | x | y 
---+---+---+---
 2 | 2 |   |  
 3 | 3 | 3 | 3
 3 | 4 | 3 | 3
   | 1 |   |  
   |   | 4 | 4
   |   | 5 | 5
   |   |   | 6
(7 rows)

I'm not sure if you have a general story for how closely to match database semantics. I personally think about joins in a very database-y way; I typically avoid nested tables. So my vote (FWIW) is for the result above to at least be an option.

@hadley
Copy link
Member

hadley commented Jan 12, 2020

@skinner yes that's exactly what I was meaning.

@hadley
Copy link
Member

hadley commented Jan 12, 2020

@skinner does this look as you'd expect?

library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
full_join(ta, tx, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 6 x 4
#>       a     b     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    NA     1    NA     6
#> 2     2     2    NA    NA
#> 3     3     3     3     3
#> 4     3     4     3     3
#> 5     4    NA     4     4
#> 6     5    NA     5     5

Created on 2020-01-12 by the reprex package (v0.3.0)

(I think the difference from the SQL results is NA matches NA in R, but NULL doesn't match NULL in SQL)

@skinner
Copy link
Author

skinner commented Jan 12, 2020

Hm, in that output, the a column has a 4 and a 5, which it didn't have in the input. That's not what I expected.

Given that there's an na_matches argument, I assume that passing both na_matches = "never" and keep = TRUE would give a result that matches the SQL result?

@hadley
Copy link
Member

hadley commented Jan 12, 2020

@skinner good catch, I'll have to think about that more. na_matches is currently unavailable in the dev version due to some changes in the underlying implementation (we're planning on bringing it back but it'll probably be a few weeks yet)

@hadley
Copy link
Member

hadley commented Jan 13, 2020

Ok, finally got it 😄

library(dplyr, warn.conflicts = FALSE)

df1 <- tibble(a = c(2, 3), b = c(1, 2))
df2 <- tibble(x = c(3, 4), y = c(3, 4))
full_join(df1, df2, by = c("a" = "x"))
#> # A tibble: 3 x 3
#>       a     b     y
#>   <dbl> <dbl> <dbl>
#> 1     2     1    NA
#> 2     3     2     3
#> 3     4    NA     4

full_join(df1, df2, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 3 x 4
#>       a     b     x     y
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     2     1    NA    NA
#> 2     3     2     3     3
#> 3    NA    NA     4     4

Created on 2020-01-13 by the reprex package (v0.3.0)

hadley added a commit that referenced this issue Jan 13, 2020
@hadley hadley closed this as completed in 1e76e8a Jan 13, 2020
@skinner
Copy link
Author

skinner commented Jan 14, 2020

Awesome, thank you! 🎉

@ianmcook
Copy link
Contributor

@hadley do you think you'll implement the keep argument for left_join() and right_join() as well? In databases, left and right outer joins return the join key columns from both the left and right tables. (It's not just full outer joins that do this.) Thanks!

@ianmcook
Copy link
Contributor

@hadley dbplyr includes code to make databases return results that are consistent with keep = FALSE. See https://github.com/tidyverse/dbplyr/blob/master/R/query-join.R#L90-L106; I originally contributed this in tidyverse/dbplyr#3. Time permitting, I'll open a PR to make dbplyr act differently when keep = TRUE.

@hadley
Copy link
Member

hadley commented Jan 16, 2020

@ianmcook Oh yeah, that makes sense. Would you be interested in doing a PR?

@ianmcook
Copy link
Contributor

@hadley sure, I'll give it a try, following the way you implemented it in full_join() in 1e76e8a

@ianmcook
Copy link
Contributor

@hadley currently in full_join() when keep = TRUE, only the join column(s) from the RHS table get the suffix (.y) appended. I think the join column(s) from the LHS table should also get the suffix (.x) appended—because they're not coalesced, so they're the values from the LHS table. Ok if I make this change too?

@ianmcook
Copy link
Contributor

@hadley please take a look at #4762. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement tables 🧮 joins and set operations
Projects
None yet
Development

No branches or pull requests

4 participants