Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preventing Addition of Rows in left_join #2278

Closed
rtaph opened this issue Dec 1, 2016 · 3 comments
Closed

Preventing Addition of Rows in left_join #2278

rtaph opened this issue Dec 1, 2016 · 3 comments

Comments

@rtaph
Copy link

rtaph commented Dec 1, 2016

Hi,

Thanks for the great package.

In many cases when I perform an outer left join, I would like the operation to fail in scenarios where it currently adds rows to the original (LHS) table. In other words, to fail fast if there there are duplicates in the (potentially composite) foreign key.

I have a wrapper function that achieves this:

strict_left_join <- function(x, y, by = NULL, ...){
  by <- common_by(by, x, y)
  if(any(duplicated(y[by$y]))) {
    stop("Duplicate values in foreign key")
  } else left_join(x, y, by = by, ...)
}

The benefit of this is that the resulting table is guaranteed to have the exact same number of entries as the original LHS table, and doesn't require the user to pre- or post-diagnose the join. Instead of adding rows, the wrapper throws an error:

# an example
df1 <- data.frame(day = c(1, 2, 1, 1), 
                  month = c("Jan", "Jan",  "Jan", "Feb"))
df2 <- data.frame(df1[-1, ], year = 2016)

df1
#>   day month
#> 1   1   Jan
#> 2   2   Jan
#> 3   1   Jan
#> 4   1   Feb

df2
#>   day month year
#> 2   2   Jan 2016
#> 3   1   Jan 2016
#> 4   1   Feb 2016

strict_left_join(df1, df2) # will work
#> Joining, by = c("day", "month")
#>   day month year
#> 1   1   Jan 2016
#> 2   2   Jan 2016
#> 3   1   Jan 2016
#> 4   1   Feb 2016

strict_left_join(df1, df2, by = "day") # will throw an error
#> Error in strict_left_join(df1, df2, by = "day"): Duplicate values in foreign key

strict_left_join(df1, rbind(df2, df2)) # will throw an error
#> Joining, by = c("day", "month")
#> Error in strict_left_join(df1, rbind(df2, df2)): Duplicate values in foreign key

My questions:

  1. Has anyone else been tripped up by the addition of rows? I searched the github issues and was surprised I could not find similar cases. There seems to be some discussion about a feature to create join diagnostics ([Feature] Would be nice to have a simple diagnosis after a join #2202), but nothing about preventing new rows from being added during the join operation itself.

  2. Could/should this be considered as a feature for a future release, either as a new function or added argument?

@krlmlr
Copy link
Member

krlmlr commented Dec 1, 2016

This looks very much related to #1619 and #2183.

@rtaph
Copy link
Author

rtaph commented Dec 1, 2016

Thanks for pointing those out, @krlmlr !

They are similar issues but no one is discussing that you cannot easily assert the expectation from the join-- you have to know the content of the foreign key beforehand, else play a game of roll-the-dice. With left joins in particular, I expect users will often want the unit of analysis (generally the row) on the LHS to stay fixed when joining-in other variables. While this can be done through a check_keys function, for a common operation as left_join, it would be nice to be able to integrate this straight into a pipe-friendly argument (or call separate function, like I do above).

@krlmlr
Copy link
Member

krlmlr commented Dec 2, 2016

There are plans to specify explicit keys, see #1792.

@hadley hadley closed this as completed Jan 31, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants