I was attempting a dplyr semi-join against SQL Server tables. In regular dplyr with local data frames, a semi-join can successfully match NULL (NA) values between the left and right tables. However, this is not the default behavior on SQL Server. dbplyr implements semi-joins with the WHERE EXISTS method, where the WHERE clause specifies the join conditions, and in SQL the comparison NULL = NULL does not evaluate to true, so rows with NULL keys never match. By default, dbplyr generates the following SQL code:
FROM dbo.Test AS "TBL_LEFT"
WHERE EXISTS (
... -- skipping code here
("TBL_LEFT"."id" = "TBL_RIGHT"."id")
To make the semi-join on SQL Server allow joins on NULLs, we'd modify the final WHERE clause to:
("TBL_LEFT"."id" = "TBL_RIGHT"."id" OR
("TBL_LEFT"."id" IS NULL AND "TBL_RIGHT"."id" IS NULL))
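The difference between the two WHERE clauses can be reproduced outside SQL Server. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in (it treats NULL = NULL the same way for this purpose). The table names, aliases, and sample rows are made up for illustration and are not taken from the original setup.

```python
import sqlite3

# SQLite is used only as a stand-in for SQL Server; in both, NULL = NULL
# evaluates to unknown rather than true.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tbl_left  (id INTEGER);
    CREATE TABLE tbl_right (id INTEGER);
    INSERT INTO tbl_left  VALUES (1), (2), (NULL);
    INSERT INTO tbl_right VALUES (1), (NULL);
""")

# Semi-join written with WHERE EXISTS; the join condition is filled in below.
SEMI_JOIN = """
    SELECT l.id FROM tbl_left AS l
    WHERE EXISTS (
        SELECT 1 FROM tbl_right AS r
        WHERE {condition}
    )
"""

# Default condition versus the NULL-matching condition proposed above.
plain = "l.id = r.id"
null_safe = "(l.id = r.id OR (l.id IS NULL AND r.id IS NULL))"


def semi_join_ids(condition):
    """Run the semi-join with the given condition and return the left ids."""
    return [row[0] for row in con.execute(SEMI_JOIN.format(condition=condition))]


default_rows = semi_join_ids(plain)        # the NULL row in tbl_left is dropped
null_safe_rows = semi_join_ids(null_safe)  # the NULL row in tbl_left is kept

print(default_rows)
print(null_safe_rows)
```

With the plain equality only the row with id 1 survives; with the extended condition the NULL row on the left is kept as well, which is what a local dplyr semi-join returns.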
So far I don't see a way to generate the second version from dplyr. Perhaps we need a new parameter in semi_join().
We could provide a sql_is_distinct(con, x, y) generic and then use it in sql_join_tbls() when na_matches is TRUE. I think we'd want to preserve the existing behaviour (clearly documented) and then allow people to choose the R behaviour if desired.
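To make the idea concrete, here is a rough sketch of how the join condition could switch on such an option. It is written in Python rather than R purely for illustration; `join_condition` and its `na_matches` values ("never", "na") are hypothetical names and do not correspond to any existing dbplyr function.

```python
def join_condition(left, right, column, na_matches="never"):
    """Hypothetical helper: build the WHERE condition for one join column.

    na_matches="never" keeps the existing SQL behaviour (NULLs never match);
    na_matches="na" adds the extra IS NULL branch so that NULLs match each other.
    """
    lhs = f'"{left}"."{column}"'
    rhs = f'"{right}"."{column}"'
    equal = f"{lhs} = {rhs}"
    if na_matches == "never":
        return f"({equal})"
    if na_matches == "na":
        return f"({equal} OR ({lhs} IS NULL AND {rhs} IS NULL))"
    raise ValueError(f"unknown na_matches value: {na_matches!r}")


# Existing behaviour stays the default; the R-like behaviour is opt-in.
print(join_condition("TBL_LEFT", "TBL_RIGHT", "id"))
print(join_condition("TBL_LEFT", "TBL_RIGHT", "id", na_matches="na"))
```

Keeping "never" as the default matches the suggestion to preserve the current output unless the user asks for the R behaviour.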