outer joins don't keep join columns from both sides #4589
Comments
Maybe
library(dplyr, warn.conflicts = FALSE)
ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
nest_join(ta, tx, by=c(a="x"))
#> # A tibble: 4 x 3
#> a b tx
#> <dbl> <dbl> <list>
#> 1 NA 1 <tibble [1 × 1]>
#> 2 2 2 <tibble [0 × 1]>
#> 3 3 3 <tibble [1 × 1]>
#> 4 3 4 <tibble [1 × 1]>
nest_join(tx, ta, by=c(x="a"))
#> # A tibble: 4 x 3
#> x y ta
#> <dbl> <dbl> <list>
#> 1 3 3 <tibble [2 × 1]>
#> 2 4 4 <tibble [0 × 1]>
#> 3 5 5 <tibble [0 × 1]>
#> 4 NA 6 <tibble [1 × 1]>
Created on 2019-11-18 by the reprex package (v0.3.0.9000) |
Hm, thanks for the pointer (to nest_join and to reprex!), and for looking at the issue. I ended up making copies of both join columns and then renaming them back after the join:
library(dplyr, warn.conflicts=FALSE)
ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
dplyr::full_join(
  ta %>% mutate(a_copy = a),
  tx %>% mutate(x_copy = x),
  by = c(a="x"),
  na_matches = "never"
) %>%
  dplyr::select(-a) %>%
  dplyr::rename(a = a_copy, x = x_copy)
#> # A tibble: 7 x 4
#> b a y x
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 NA NA NA
#> 2 2 2 NA NA
#> 3 3 3 3 3
#> 4 4 3 3 3
#> 5 NA NA 4 4
#> 6 NA NA 5 5
#> 7 NA NA 6 NA
Created on 2019-11-18 by the reprex package (v0.3.0)
I still think that Maybe there are backward-compatibility concerns at this point, though. If you do choose to keep the current behavior, I think it makes sense to document that the original columns aren't all kept. |
I think |
Adding an argument (with the previous behavior as a default) makes sense for back-compat. I'm not sure exactly what "works similarly to
mitch=> select * from ta;
 a | b
---+---
   | 1
 2 | 2
 3 | 3
 3 | 4
(4 rows)
mitch=> select * from tx;
 x | y
---+---
 3 | 3
 4 | 4
 5 | 5
   | 6
(4 rows)
mitch=> select * from ta full outer join tx on ta.a = tx.x;
 a | b | x | y
---+---+---+---
 2 | 2 |   |
 3 | 3 | 3 | 3
 3 | 4 | 3 | 3
   | 1 |   |
   |   | 4 | 4
   |   | 5 | 5
   |   |   | 6
(7 rows)
I'm not sure if you have a general story for how closely to match database semantics. I personally think about joins in a very database-y way; I typically avoid nested tables. So my vote (FWIW) is for the result above to at least be an option. |
@skinner yes that's exactly what I was meaning. |
@skinner does this look as you'd expect?
library(dplyr, warn.conflicts = FALSE)
ta <- tibble(a=c(NA, 2, 3, 3), b=c(1, 2, 3, 4))
tx <- tibble(x=c(3, 4, 5, NA), y=c(3, 4, 5, 6))
full_join(ta, tx, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 6 x 4
#> a b x y
#> <dbl> <dbl> <dbl> <dbl>
#> 1 NA 1 NA 6
#> 2 2 2 NA NA
#> 3 3 3 3 3
#> 4 3 4 3 3
#> 5 4 NA 4 4
#> 6 5 NA 5 5
Created on 2020-01-12 by the reprex package (v0.3.0)
(I think the difference from the SQL results is |
Hm, in that output, the Given that there's an |
@skinner good catch, I'll have to think about that more. |
Ok, finally got it 😄
library(dplyr, warn.conflicts = FALSE)
df1 <- tibble(a = c(2, 3), b = c(1, 2))
df2 <- tibble(x = c(3, 4), y = c(3, 4))
full_join(df1, df2, by = c("a" = "x"))
#> # A tibble: 3 x 3
#> a b y
#> <dbl> <dbl> <dbl>
#> 1 2 1 NA
#> 2 3 2 3
#> 3 4 NA 4
full_join(df1, df2, by = c("a" = "x"), keep = TRUE)
#> # A tibble: 3 x 4
#> a b x y
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2 1 NA NA
#> 2 3 2 3 3
#> 3 NA NA 4 4
Created on 2020-01-13 by the reprex package (v0.3.0) |
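For comparison with the SQL session earlier in the thread, here is a minimal sketch that combines `keep = TRUE` with the `na_matches = "never"` argument used in the workaround above. It assumes a dplyr version in which `full_join()` accepts both arguments for data frames; the name `sql_like` is only illustrative. Under that assumption, NA values in `a` and `x` should not match each other, and both join columns should survive in the result.

```r
library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a = c(NA, 2, 3, 3), b = c(1, 2, 3, 4))
tx <- tibble(x = c(3, 4, 5, NA), y = c(3, 4, 5, 6))

# keep = TRUE retains both join columns (`a` and `x`);
# na_matches = "never" stops NA in `a` from matching NA in `x`,
# which is how the SQL full outer join above treats NULLs.
sql_like <- full_join(
  ta, tx,
  by = c("a" = "x"),
  keep = TRUE,
  na_matches = "never"
)

print(sql_like)
```

Row order may differ from the database output, since SQL does not guarantee an order without `ORDER BY`.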
Awesome, thank you! 🎉 |
@hadley do you think you'll implement the |
@hadley dbplyr includes code to make databases return results that are consistent with |
@ianmcook Oh yeah, that makes sense. Would you be interested in doing a PR? |
@hadley currently in |
The docs say
but it doesn't return all columns; I expect the result of this join to have four columns, but it has three:
I'm interested in questions like "how many unique a have a matching x?" (and vice versa). To answer that, I'd need both a and x to exist in the output.
Presumably, dplyr collapses join columns down to one because in some joins (e.g. inner join) they'd be the same. But with outer joins they're different.
Left, right, and full joins all appear to have this behavior. This is with dplyr_0.8.1.
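A minimal sketch of how the "how many unique a have a matching x?" question could be answered once both join columns are available. It assumes a dplyr version that supports `keep = TRUE` in `full_join()` (not the 0.8.1 release mentioned above); the names `joined` and `n_matched_a` are hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

ta <- tibble(a = c(NA, 2, 3, 3), b = c(1, 2, 3, 4))
tx <- tibble(x = c(3, 4, 5, NA), y = c(3, 4, 5, 6))

# Assumption: keep = TRUE preserves both `a` and `x` in the output.
joined <- full_join(ta, tx, by = c("a" = "x"), keep = TRUE)

# A row counts as a real match only when both key columns are non-missing,
# so NA keys are excluded regardless of how the join treated them.
n_matched_a <- joined %>%
  filter(!is.na(a), !is.na(x)) %>%
  distinct(a) %>%
  nrow()

print(n_matched_a)
```

The reverse question (unique x with a matching a) follows by swapping `distinct(a)` for `distinct(x)`.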