Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that na_matches argument doesn't apply #382

Closed
bertrandh opened this issue Sep 10, 2019 · 2 comments
Closed

Clarify that na_matches argument doesn't apply #382

bertrandh opened this issue Sep 10, 2019 · 2 comments
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL

Comments

@bertrandh
Copy link

Join verbs applied to data frames have na_matches = "na" as default, with the option to change to na_matches = "never".
However, join verbs applied to data tables (left_join and inner_join) have na_matches = "never" as default and unique option.
See this Stack Exchange post.

Setting

library(tidyverse)
library(reprex)
sessionInfo()
#> R version 3.6.1 (2019-07-05)
#> Platform: x86_64-apple-darwin18.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> ...
#> other attached packages:
#>  [1] reprex_0.3.0    forcats_0.4.0   stringr_1.4.0   dplyr_0.8.1    
#>  [5] purrr_0.3.2     readr_1.3.1     tidyr_0.8.3     tibble_2.1.3   
#>  [9] ggplot2_3.2.0   tidyverse_1.2.1
#> ...

The data

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_1 <- tibble(A = c("a", "aa"), B = c("b", "bb"), D = c("d", NA))
df_2 <- tibble(A = c("a", "aa"), C = c("c", "cc"), D = c("d", NA))
copy_to(con, df_1, overwrite = T)
copy_to(con, df_2, overwrite = T)
dt_1 <- tbl(con, "df_1")
dt_2 <- tbl(con, "df_2")

df_1
#> # A tibble: 2 x 3
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

df_2
#> # A tibble: 2 x 3
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

dt_1
#> # Source:   table<df_1> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D    
#>   <chr> <chr> <chr>
#> 1 a     b     d    
#> 2 aa    bb    <NA>

dt_2
#> # Source:   table<df_2> [?? x 3]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     C     D    
#>   <chr> <chr> <chr>
#> 1 a     c     d    
#> 2 aa    cc    <NA>

df_1 and df_2 are the "data frames" and dt_1 and dt_2 the "data tables":

class(df_1)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(dt_1)
#> [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
#> [4] "tbl_lazy"             "tbl"

The problem (left_join)

Now with left_join (the same problem happens with inner_join):

left_join(df_1, df_2)
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  cc

left_join(df_1, df_2, na_matches = "never")
#> Joining, by = c("A", "D")
#> # A tibble: 2 x 4
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

left_join(dt_1, dt_2)
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

left_join(dt_1, dt_2, na_matches = "na")
#> Joining, by = c("A", "D")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C    
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c    
#> 2 aa    bb    <NA>  <NA>

We can see that the second row last column C has the expected cc in the case of data frames (by default na_matches = "na" according to the doc) but in the case of tbl even with the explicit option na_matches = "na". This is unexpected.

Debug mode

For a little more "inside insight".

For data tables

debug(left_join)
left_join(dt_1, dt_2, na_matches = "na")
#> debugging in: left_join(dt_1, dt_2, na_matches = "na")
#> debug: {
#>     UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]> 
#> debugging in: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> debug: {
#>     add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, 
#>         suffix = suffix, auto_index = auto_index, ...)
#> }
Browse[3]>
#> debug: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy, 
#>     suffix = suffix, auto_index = auto_index, ...)
Browse[3]> s
#> debugging in: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>     suffix = suffix, auto_index = auto_index, ...)
#> debug: {
#>     if (!is.null(sql_on)) {
#>         by <- list(x = character(0), y = character(0), on = sql(sql_on))
#>     }
#>     else if (identical(type, "full") && identical(by, character())) {
#>         type <- "cross"
#>         by <- list(x = character(0), y = character(0))
#>    }
#>    else {
#>        by <- common_by(by, x, y)
#>     }
#>    y <- auto_copy(x, y, copy = copy, indexes = if (auto_index)
#>        list(by$y))
#>    vars <- join_vars(op_vars(x), op_vars(y), type = type, by = by,
#>         suffix = suffix)
#>     x$ops <- op_double("join", x, y, args = list(vars = vars,
#>         type = type, by = by, suffix = suffix))
#>     x
#> }
Browse[4]> where
#> where 1: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>     suffix = suffix, auto_index = auto_index, ...)
#> where 2: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> where 3: left_join(dt_1, dt_2, na_matches = "na")
Browse[4]> f
#> Joining, by = c("A", "D")
#> exiting from: add_op_join(x, y, "left", by = by, sql_on = sql_on, copy = copy,
#>     suffix = suffix, auto_index = auto_index, ...)
#> exiting from: left_join.tbl_lazy(dt_1, dt_2, na_matches = "na")
#> exiting from: left_join(dt_1, dt_2, na_matches = "na")
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.29.0 [:memory:]
#>   A     B     D     C
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c
#> 2 aa    bb    NA    NA

We see here that left_join calls left_join.tbl_lazy on data tables with the na_matches = “na” option. However this is followed by a call to add_op_join the definition of which does not have any mention of na_matches.

For data frames

left_join(df_1, df_2)
#> debugging in: left_join(df_1, df_2)
#> debug: {
#>    UseMethod("left_join")
#> }
Browse[2]> n
#> debug: UseMethod("left_join")
Browse[2]>
#> debugging in: left_join.tbl_df(df_1, df_2)
#> debug: {
#>     check_valid_names(tbl_vars(x))
#>     check_valid_names(tbl_vars(y))
#>     by <- common_by(by, x, y)
#>     suffix <- check_suffix(suffix)
#>     na_matches <- check_na_matches(na_matches)
#>     y <- auto_copy(x, y, copy = copy)
#>     vars <- join_vars(tbl_vars(x), tbl_vars(y), by, suffix)
#>     by_x <- vars$idx$x$by
#>     by_y <- vars$idx$y$by
#>     aux_x <- vars$idx$x$aux
#>     aux_y <- vars$idx$y$aux
#>     out <- left_join_impl(x, y, by_x, by_y, aux_x, aux_y, na_matches,
#>         environment())
#>     names(out) <- vars$alias
#>     reconstruct_join(out, x, vars)
#> }
Browse[3]>
#>  debug: check_valid_names(tbl_vars(x))
Browse[3]>
#>  debug: check_valid_names(tbl_vars(y))
Browse[3]>
#>  debug: by <- common_by(by, x, y)
Browse[3]>
#>  Joining, by = c("A", "D")
#>  debug: suffix <- check_suffix(suffix)
Browse[3]>
#>  debug: na_matches <- check_na_matches(na_matches)
Browse[3]>
#>  debug: y <- auto_copy(x, y, copy = copy)
Browse[3]> na_matches
#>  [1] TRUE
Browse[3]> f
#> Joining, by = c("A", "D")
#> exiting from: left_join.tbl_df(df_1, df_2)
#> exiting from: left_join(df_1, df_2)
#> # A tibble: 2 x 4
#>   A     B     D     C
#>   <chr> <chr> <chr> <chr>
#> 1 a     b     d     c
#> 2 aa    bb    NA    cc

Here we see that left_join calls left_join.tbl_df on data frames. Further down we see that na_matches is set to TRUE before being used as argument in left_join_impl. All this makes sense.

Also, a Stack Exchange commenter mentioned this news link where it is stated "To match NA values, pass na_matches = 'na' to the join verbs; this is only supported for data frames".

So after that much research it is clear now that the default for data frames is na_matches = "na" but for data tables it is na_matches = "never" with no other option. But this is not at all clear in the join doc, which, for na_matches refers to the join.tbl_df doc.

It's always confusing for the user when a default value for the same function is different for different types, so it would be nice if someone could implement na_matches = "na" for data tables and set it as default, as it is for data frames.

Not sure how much work this is and I understand there might be other priorities, but in the meantime I think the doc should clearly state: "The defaults is na_matches = 'na' for data frames and na_matches = 'never' (with no other option) for data tables".

@hadley hadley transferred this issue from tidyverse/dplyr Dec 10, 2019
@hadley
Copy link
Member

hadley commented Dec 10, 2019

Moved to dbplyr since this is a documentation issue.

Please note in R data table means http://r-datatable.com; you're talking about databases.

@hadley hadley changed the title na_matches default in join verbs is different for data tables and for data frames Clarify that na_matches argument doesn't apply Dec 10, 2019
@hadley hadley added feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL labels Dec 13, 2019
hadley added a commit that referenced this issue Sep 22, 2020
@hadley hadley closed this as completed in 3756175 Sep 22, 2020
@hadley
Copy link
Member

hadley commented Sep 28, 2020

BTW I figured out how to make na_matches work in #180, but the default behaviour remains the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL
Projects
None yet
Development

No branches or pull requests

2 participants