I was attempting a dplyr semi-join against SQL Server tables. In regular dplyr with local data frames, semi-joins successfully match NULL values between the left and right tables. However, this is not the default behavior in SQL Server, because a comparison such as NULL = NULL does not evaluate to true there. dbplyr implements semi-joins with the WHERE EXISTS method, where the WHERE clause specifies the join conditions. By default, dbplyr generates the following SQL code:
SELECT *
FROM dbo.Test AS "TBL_LEFT"
WHERE EXISTS (
  SELECT 1
  FROM (
    ... -- skipping code here
  ) "TBL_RIGHT"
  WHERE (
    ("TBL_LEFT"."id" = "TBL_RIGHT"."id")
  )
)
To make the semi-join on SQL Server allow joins on NULLs, we'd modify the final WHERE clause to:
WHERE (
  ("TBL_LEFT"."id" = "TBL_RIGHT"."id" OR
    ("TBL_LEFT"."id" IS NULL AND "TBL_RIGHT"."id" IS NULL))
)
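To make the difference between the two WHERE clauses concrete, here is a minimal sketch that runs both forms against an in-memory SQLite database. SQLite is only a stand-in for SQL Server here (both treat NULL = NULL as not true in a WHERE clause), and the table names tbl_left and tbl_right and the sample rows are hypothetical, not taken from my actual tables.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_left  (id INTEGER);
    CREATE TABLE tbl_right (id INTEGER);
    INSERT INTO tbl_left  VALUES (1), (2), (NULL);
    INSERT INTO tbl_right VALUES (1), (NULL);
""")

# Same WHERE EXISTS shape as the generated SQL; only the condition varies.
SEMI_JOIN = """
    SELECT "TBL_LEFT"."id"
    FROM tbl_left AS "TBL_LEFT"
    WHERE EXISTS (
        SELECT 1
        FROM tbl_right AS "TBL_RIGHT"
        WHERE ({condition})
    )
    ORDER BY "TBL_LEFT"."id"
"""

# Condition as generated by default: plain equality.
equality_only = '"TBL_LEFT"."id" = "TBL_RIGHT"."id"'

# Proposed condition: equality, or both sides NULL.
null_aware = (
    '"TBL_LEFT"."id" = "TBL_RIGHT"."id" OR '
    '("TBL_LEFT"."id" IS NULL AND "TBL_RIGHT"."id" IS NULL)'
)


def semi_join_ids(condition):
    """Run the semi-join with the given condition and return the left ids."""
    rows = conn.execute(SEMI_JOIN.format(condition=condition)).fetchall()
    return [row[0] for row in rows]


default_ids = semi_join_ids(equality_only)   # NULL row on the left is dropped
null_aware_ids = semi_join_ids(null_aware)   # NULL row on the left is kept

print(default_ids)
print(null_aware_ids)
```

With plain equality, only the row with id 1 survives; with the extra IS NULL branch, the left row whose id is NULL is also kept, which matches what a local dplyr semi-join does.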
So far, I don't see a way to generate the second version from dplyr. Perhaps we need a new parameter in semi_join().