Join verbs applied to data frames have na_matches = "na" as default, with the option to change to na_matches = "never".
However, join verbs applied to data tables (left_join and inner_join) have na_matches = "never" as default and unique option.
See this Stack Exchange post.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")
df_1
#> # A tibble: 2 x 3
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
df_2
#> # A tibble: 2 x 3
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
dt_1
#> # Source: table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D
#> <chr> <chr> <chr>
#> 1 a b d
#> 2 aa bb <NA>
dt_2
#> # Source: table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#> A C D
#> <chr> <chr> <chr>
#> 1 a c d
#> 2 aa cc <NA>
df_1 and df_2 are the "data frames" and dt_1 and dt_2 the "data tables":
Now with left_join (the same problem happens with inner_join):
left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> cc
left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
left_join(dt_1, dt_2)
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb <NA> <NA>
We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na" according to the doc) but in the case of tbl even with the explicit option na_matches = "na". This is unexpected.
Debug mode
For a little more "inside insight".
For data tables
debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#> debugging in: left_join(dt_1, dt_2, na_matches = "na")
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> debug: {
#> add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> }
Browse[3]>
#> debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#> debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> debug: {
#> if (!is.null(sql_on)) {
#> by <- list(x = character(0), y = character(0), on = sql(sql_on))
#> }
#> else if (identical(type, "full") && identical(by, character())) {
#> type <- "cross"
#> by <- list(x = character(0), y = character(0))
#> }
#> else {
#> by <- common_by(by, x, y)
#> }
#> y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#> list(by$y))
#> vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#> suffix = suffix)
#> x$ops <- op_double("join", x, y, args = list(vars = vars,
#> type = type, by = by, suffix = suffix))
#> x
#> }
Browse[4]> where
#> where 1: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> where 2: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> where 3: left_join(dt_1, dt_2, na_matches = "na")
Browse[4]> f
#> Joining, by = c("A", "D")
#> exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#> suffix = suffix, auto_index = auto_index, ...)
#> exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> exiting from: left_join(dt_1, dt_2, na_matches = "na")
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA NA
We see here that left_join calls left_join.tbl_lazy on data tables with the na_matches = “na” option. However this is followed by a call to add_op_join the definition of which does not have any mention of na_matches.
For data frames
left_join(df_1, df_2)
#> debugging in: left_join(df_1, df_2)
#> debug: {
#> UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_df(df_1, df_2)
#> debug: {
#> check_valid_names(tbl_vars(x))
#> check_valid_names(tbl_vars(y))
#> by <- common_by(by, x, y)
#> suffix <- check_suffix(suffix)
#> na_matches <- check_na_matches(na_matches)
#> y <- auto_copy(x, y, copy = copy)
#> vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#> by_x <- vars$idx$x$by
#> by_y <- vars$idx$y$by
#> aux_x <- vars$idx$x$aux
#> aux_y <- vars$idx$y$aux
#> out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#> environment())
#> names(out) <- vars$alias
#> reconstruct_join(out, x, vars)
#> }
Browse[3]>
#> debug: check_valid_names(tbl_vars(x))
Browse[3]>
#> debug: check_valid_names(tbl_vars(y))
Browse[3]>
#> debug: by <- common_by(by, x, y)
Browse[3]>
#> Joining, by = c("A", "D")
#> debug: suffix <- check_suffix(suffix)
Browse[3]>
#> debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#> debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#> [1] TRUE
Browse[3]> f
#> Joining, by = c("A", "D")
#> exiting from: left_join.tbl_df(df_1, df_2)
#> exiting from: left_join(df_1, df_2)
#> # A tibble: 2 x 4
#> A B D C
#> <chr> <chr> <chr> <chr>
#> 1 a b d c
#> 2 aa bb NA cc
Here we see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being used as argument in left_join_impl. All this makes sense.
Also, a Stack Exchange commenter mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames".
So after that much research it is clear now that the default for data frames is na_matches = "na" but for data tables it is na_matches = "never" with no other option. But this is not at all clear in the join doc, which, for na_matches refers to the join.tbl_df doc.
It's always confusing for the user when a default value for the same function is different for different types, so it would be nice if someone could implement na_matches = "na" for data tables and set it as default, as it is for data frames.
Not sure how much work this is and I understand there might be other priorities, but in the meantime I think the doc should clearly state: "The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other option) for data tables".
The text was updated successfully, but these errors were encountered:
Moved to dbplyr since this is a documentation issue.
Please note in R data table means http://r-datatable.com; you're talking about databases.
hadley
changed the title
na_matches default in join verbs is different for data tables and for data frames
Clarify that na_matches argument doesn't apply
Dec 10, 2019
Join verbs applied to data frames have
na_matches = "na"
as default, with the option to change tona_matches = "never"
.However, join verbs applied to data tables (
left_join
andinner_join
) havena_matches = "never"
as default and unique option.See this Stack Exchange post.
Setting
The data
df_1
anddf_2
are the "data frames" anddt_1
anddt_2
the "data tables":The problem (left_join)
Now with
left_join
(the same problem happens withinner_join
):We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na" according to the doc) but in the case of tbl even with the explicit option na_matches = "na". This is unexpected.
Debug mode
For a little more "inside insight".
For data tables
We see here that
left_join
callsleft_join.tbl_lazy
on data tables with thena_matches = “na”
option. However this is followed by a call toadd_op_join
the definition of which does not have any mention ofna_matches
.For data frames
Here we see that
left_join
callsleft_join.tbl_df
on data frames. Further down we see thatna_matches
is set toTRUE
before being used as argument inleft_join_impl
. All this makes sense.Also, a Stack Exchange commenter mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames".
So after that much research it is clear now that the default for data frames is
na_matches = "na"
but for data tables it isna_matches = "never"
with no other option. But this is not at all clear in the join doc, which, forna_matches
refers to the join.tbl_df doc.It's always confusing for the user when a default value for the same function is different for different types, so it would be nice if someone could implement
na_matches = "na"
for data tables and set it as default, as it is for data frames.Not sure how much work this is and I understand there might be other priorities, but in the meantime I think the doc should clearly state: "The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other option) for data tables".
The text was updated successfully, but these errors were encountered: