Join verbs applied to data frames have na_matches = "na" as the default, with the option to change it to na_matches = "never".
However, join verbs applied to data tables (left_join and inner_join) behave as if na_matches = "never" were both the default and the only option.
See this Stack Exchange post.
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = TRUE)
copy_to(con, df_2, overwrite = TRUE)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")
df_1
#> # A tibble: 2 x 3
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
df_2
#> # A tibble: 2 x 3
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
dt_1
#> # Source: table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
dt_2
#> # Source: table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
Here df_1 and df_2 are the "data frames" and dt_1 and dt_2 are the "data tables".
Now with left_join (the same problem happens with inner_join):
left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
left_join(dt_1, dt_2)
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
We can see that the last column (C) of the second row has the expected value cc in the case of data frames (where the default is na_matches = "na" according to the doc), but not in the case of tbl, even with the explicit option na_matches = "na". This is unexpected.
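A likely reason for the difference is standard SQL NULL semantics: an equality comparison involving NULL evaluates to NULL rather than TRUE, so an equality join condition never pairs two NULL keys. The sketch below checks this directly against SQLite; the column aliases equals_null and is_null are names chosen here for illustration only.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# `NULL = NULL` is NULL (unknown) in SQL, whereas `NULL IS NULL` is true.
# A join written as `ON x.D = y.D` therefore drops rows whose key is NULL.
res <- dbGetQuery(con, "
  SELECT NULL = NULL  AS equals_null,
         NULL IS NULL AS is_null
")
res

dbDisconnect(con)
```

The first column comes back as NA in R and the second as 1, which is consistent with the database join leaving C empty for the row whose D is NA.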
Debug mode
Stepping through with the debugger gives a little more insight into what happens internally.
For data tables
debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#> debugging in: left_join(dt_1, dt_2, na_matches = "na")
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> debug: {
#> add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> }
Browse[3]>
#> debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#> debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> debug: {
#> if (!is.null(sql_on)) {
#> by <- list(x = character(0), y = character(0), on = sql(sql_on))
#> }
#> else if (identical(type, "full") && identical(by, character())) {
#> type <- "cross"
#> by <- list(x = character(0), y = character(0))
#> }
#> else {
#> by <- common_by(by, x, y)
#> }
#> y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#> list(by$y))
#> vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#> suffix = suffix)
#> x$ops <- op_double("join", x, y, args = list(vars = vars,
#> type = type, by = by, suffix = suffix))
#> x
#> }
Browse[4]> where
#> where 1: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> where 2: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> where 3: left_join(dt_1, dt_2, na_matches = "na")
Browse[4]> f
#> Joining, by = c("A", "D")
#> exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> exiting from: left_join(dt_1, dt_2, na_matches = "na")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA NA
We see here that left_join calls left_join.tbl_lazy on data tables with the na_matches = "na" option. However, this is followed by a call to add_op_join, whose definition makes no mention of na_matches, so the argument is silently swallowed by ... and never used.
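As an alternative to stepping through the debugger, the generated SQL can be inspected directly. This is a minimal sketch that rebuilds the tables from the reprex above; joined and query_text are names introduced here, and the exact SQL text printed will vary with the dbplyr version.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
dt_1 <- copy_to(con, df_1, overwrite = TRUE)
dt_2 <- copy_to(con, df_2, overwrite = TRUE)

# Build the lazy join, then print the SQL that would be sent to the database.
joined <- left_join(dt_1, dt_2, by = c("A", "D"))
show_query(joined)

# The same SQL as a character string, for programmatic checks.
query_text <- as.character(sql_render(joined))

DBI::dbDisconnect(con)
```

If the ON clause compares the key columns with plain equality, rows whose D is NULL cannot match, whatever value was passed for na_matches.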
For data frames
left_join(df_1, df_2)
#> debugging in: left_join(df_1, df_2)
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_df(df_1, df_2)
#> debug: {
#> check_valid_names(tbl_vars(x))
#> check_valid_names(tbl_vars(y))
#> by <- common_by(by, x, y)
#> suffix <- check_suffix(suffix)
#> na_matches <- check_na_matches(na_matches)
#> y <- auto_copy(x, y, copy = copy)
#> vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#> by_x <- vars$idx$x$by
#> by_y <- vars$idx$y$by
#> aux_x <- vars$idx$x$aux
#> aux_y <- vars$idx$y$aux
#> out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#> environment())
#> names(out) <- vars$alias
#> reconstruct_join(out, x, vars)
#> }
Browse[3]>
#> debug: check_valid_names(tbl_vars(x))
Browse[3]>
#> debug: check_valid_names(tbl_vars(y))
Browse[3]>
#> debug: by <- common_by(by, x, y)
Browse[3]>
#> Joining, by = c("A", "D")
#> debug: suffix <- check_suffix(suffix)
Browse[3]>
#> debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#> debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#> [1] TRUE
Browse[3]> f
#> Joining, by = c("A", "D")
#> exiting from: left_join.tbl_df(df_1, df_2)
#> exiting from: left_join(df_1, df_2)
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA cc
Here we see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being passed as an argument to left_join_impl. All this makes sense.
Also, a Stack Exchange commenter mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames".
So after that much research it is now clear that the default for data frames is na_matches = "na", but for data tables it is effectively na_matches = "never" with no other option. This is not at all clear in the join doc, which, for na_matches, refers to the join.tbl_df doc.
It's always confusing for the user when a default value for the same function is different for different types, so it would be nice if someone could implement na_matches = "na" for data tables and set it as default, as it is for data frames.
I'm not sure how much work this is, and I understand there might be other priorities, but in the meantime I think the doc should clearly state: "The default is na_matches = 'na' for data frames and na_matches = 'never' (with no other option) for data tables".
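In the meantime, one possible workaround (a sketch only, and realistic only when both tables are small enough to fit in memory) is to collect() the remote tables and do the join locally, where na_matches = "na" is honoured. The name local_join is introduced here for illustration.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
dt_1 <- copy_to(con, df_1, overwrite = TRUE)
dt_2 <- copy_to(con, df_2, overwrite = TRUE)

# Pull both tables into R, then join as ordinary data frames so that
# NA keys are allowed to match each other.
local_join <- left_join(
  collect(dt_1),
  collect(dt_2),
  by = c("A", "D"),
  na_matches = "na"
)
local_join

DBI::dbDisconnect(con)
```

With this approach the second row gets C = "cc", the same result as the data frame join, at the cost of transferring both tables out of the database.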
Moved to dbplyr since this is a documentation issue.
Please note in R data table means http://r-datatable.com; you're talking about databases.
hadley changed the title from "na_matches default in join verbs is different for data tables and for data frames" to "Clarify that na_matches argument doesn't apply" on Dec 10, 2019.