Database semi_join() doesn't match R's NA semantics #180

@mkirzon

I was attempting a dplyr semi-join against SQL Server tables. With local data frames, dplyr's semi_join() successfully matches NA values in the left table to NA values in the right table. SQL Server does not behave this way by default, because NULL = NULL does not evaluate to true. dbplyr implements semi-joins with WHERE EXISTS, where the inner WHERE clause specifies the join conditions. By default, dbplyr generates the following SQL code:

SELECT *
FROM dbo.Test AS "TBL_LEFT"
WHERE EXISTS (
  SELECT 1
  FROM (
    ... -- skipping code here
  ) "TBL_RIGHT"
  WHERE (
    ("TBL_LEFT"."id" = "TBL_RIGHT"."id")
  )
)

To make the semi-join on SQL Server allow joins on NULLs, we'd modify the final WHERE clause to:

WHERE (
    ("TBL_LEFT"."id" = "TBL_RIGHT"."id" OR 
    ("TBL_LEFT"."id" IS NULL AND "TBL_RIGHT"."id" IS NULL))
)
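
To check the difference between the two WHERE clauses above, here is a minimal sketch that runs both against an in-memory database. It assumes SQLite as a stand-in for SQL Server (both treat NULL = NULL as unknown in a comparison), and the table names, columns, and rows are made up for illustration; this is not the SQL that dbplyr itself emits.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_left  (id INTEGER, val TEXT);
    CREATE TABLE tbl_right (id INTEGER);
    INSERT INTO tbl_left  VALUES (1, 'a'), (2, 'b'), (NULL, 'c');
    INSERT INTO tbl_right VALUES (1), (NULL);
""")

# Semi-join template in the WHERE EXISTS style shown in the issue.
SEMI_JOIN = """
    SELECT l.id, l.val
    FROM tbl_left AS l
    WHERE EXISTS (
        SELECT 1
        FROM tbl_right AS r
        WHERE ({condition})
    )
    ORDER BY l.val
"""

# Join condition as generated by default: plain equality.
default_condition = "l.id = r.id"
# Modified condition: equality, or both sides NULL.
na_condition = "l.id = r.id OR (l.id IS NULL AND r.id IS NULL)"

default_rows = conn.execute(SEMI_JOIN.format(condition=default_condition)).fetchall()
na_rows = conn.execute(SEMI_JOIN.format(condition=na_condition)).fetchall()

print(default_rows)  # [(1, 'a')]               -> the NULL row is dropped
print(na_rows)       # [(1, 'a'), (None, 'c')]  -> the NULL row is kept
```

With plain equality the left row whose id is NULL never satisfies the inner WHERE, so EXISTS is false for it; the extra IS NULL branch is what lets it through.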

So far I don't see a way to generate the second version from dplyr. Perhaps we need a new parameter in semi_join().

Labels: dplyr verbs 🤖 (translation of dplyr verbs to SQL), feature (a feature request or enhancement)
