Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Database semi_join() doesn't match R's NA semantics #180

Closed
mkirzon opened this issue Nov 2, 2018 · 4 comments
Closed

Database semi_join() doesn't match R's NA semantics #180

mkirzon opened this issue Nov 2, 2018 · 4 comments
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL

Comments

@mkirzon
Copy link

mkirzon commented Nov 2, 2018

I was attempting a dplyr semi-join against SQL Server tables. In regular dplyr with local dataframes, semi-joins can succesfully join on NULL values between the left and right table. However, this is not the default behavior in SQL Server. For example, dbplyr implements semi-joins with the WHERE EXISTS method, where the WHERE clause will specify the join conditions. By default, dbplyr generates the following SQL code:

SELECT *   
FROM dbo.Test AS "TBL_LEFT"
WHERE EXISTS (
  SELECT 1
  FROM (
    ... -- skipping code here
  ) "TBL_RIGHT"
  WHERE (
    ("TBL_LEFT"."id" = "TBL_RIGHT"."id")
  )
)

To make the semi-join on SQL Server allow joins on NULLs, we'd modify the final WHERE clause to:

WHERE (
    ("TBL_LEFT"."id" = "TBL_RIGHT"."id" OR 
    ("TBL_LEFT"."id" IS NULL AND "TBL_RIGHT"."id" IS NULL))
)

I so far don't see a way that we'd be able to create the 2nd version from dplyr. Perhaps, we need a new parameter in the semi_join()

@hadley hadley added feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL labels Jan 2, 2019
@hadley
Copy link
Member

hadley commented Jan 3, 2019

Or use one of the alternatives discussed in https://modern-sql.com/feature/is-distinct-from

  • SQLite: IS
  • MySQL: <=>
  • PostgresSQL: IS DISTINCT FROM
  • Oracle: DECODE()
  • MS SQL (and all others): (CASE WHEN (a = b) or (a IS NULL AND b IS NULL) THEN 0 ELSE 1) = 0

@hadley hadley added this to the v1.4.0 milestone Jan 10, 2019
@hadley
Copy link
Member

hadley commented Jan 10, 2019

i.e. we need to use the na_matches argument

@hadley
Copy link
Member

hadley commented Feb 6, 2019

Reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

mf1 <- memdb_frame(x = c(1, NA, 3))
mf2 <- memdb_frame(x = c(1, NA))

anti_join(mf1, mf2)
#> Joining, by = "x"
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.25.3 [:memory:]
#>       x
#>   <dbl>
#> 1    NA
#> 2     3
anti_join(collect(mf1), collect(mf2))
#> Joining, by = "x"
#> # A tibble: 1 x 1
#>       x
#>   <dbl>
#> 1     3

Created on 2019-02-06 by the reprex package (v0.2.1.9000)

@hadley hadley changed the title Joins fail when intending to join on NULL values Database semi_join() doesn't match R NA semantics Feb 7, 2019
@hadley hadley changed the title Database semi_join() doesn't match R NA semantics Database semi_join() doesn't match R's NA semantics Feb 7, 2019
@hadley hadley removed this from the v1.4.0 milestone Mar 14, 2019
@hadley
Copy link
Member

hadley commented Sep 25, 2020

Could provide sql_is_distinct(con, x, y) generic and then use that in sql_join_tbls() when na_matches is TRUE. I think we'd want to preserve the existing behaviour (clearly documented) and then allow people to choose the R behaviour if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL
Projects
None yet
Development

No branches or pull requests

2 participants