In many cases when I perform an outer left join, I would like the operation to fail in scenarios where it currently adds rows to the original (LHS) table. In other words, it should fail fast if there are duplicates in the (potentially composite) foreign key.
I have a wrapper function that achieves this:
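    strict_left_join <- function(x, y, by = NULL, ...) {
      # Resolve the join columns (all shared column names when `by` is NULL)
      by <- common_by(by, x, y)
      # Fail fast if the key columns of `y` contain duplicated combinations
      if (any(duplicated(y[by$y]))) {
        stop("Duplicate values in foreign key")
      }
      left_join(x, y, by = by, ...)
    }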
The benefit of this is that the resulting table is guaranteed to have the exact same number of entries as the original LHS table, and doesn't require the user to pre- or post-diagnose the join. Instead of adding rows, the wrapper throws an error:
    # an example
    df1 <- data.frame(day = c(1, 2, 1, 1),
                      month = c("Jan", "Jan", "Jan", "Feb"))
    df2 <- data.frame(df1[-1, ], year = 2016)

    df1
    #>   day month
    #> 1   1   Jan
    #> 2   2   Jan
    #> 3   1   Jan
    #> 4   1   Feb

    df2
    #>   day month year
    #> 2   2   Jan 2016
    #> 3   1   Jan 2016
    #> 4   1   Feb 2016

    strict_left_join(df1, df2) # will work
    #> Joining, by = c("day", "month")
    #>   day month year
    #> 1   1   Jan 2016
    #> 2   2   Jan 2016
    #> 3   1   Jan 2016
    #> 4   1   Feb 2016

    strict_left_join(df1, df2, by = "day") # will throw an error
    #> Error in strict_left_join(df1, df2, by = "day"): Duplicate values in foreign key

    strict_left_join(df1, rbind(df2, df2)) # will throw an error
    #> Joining, by = c("day", "month")
    #> Error in strict_left_join(df1, rbind(df2, df2)): Duplicate values in foreign key
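For comparison, the same guarantee could probably also be enforced after the fact, by joining first and then comparing row counts. The sketch below is only an illustration: `left_join_same_rows` is a hypothetical name, not part of dplyr, and it assumes that `left_join` never drops rows from `x`, so any change in the row count means rows were added.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): join first, then check the row count.
left_join_same_rows <- function(x, y, ...) {
  out <- left_join(x, y, ...)
  added <- nrow(out) - nrow(x)
  if (added != 0) {
    stop("left_join added ", added, " row(s); check `y` for duplicate keys")
  }
  out
}

# Same example data as above
df1 <- data.frame(day = c(1, 2, 1, 1),
                  month = c("Jan", "Jan", "Jan", "Feb"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(df1[-1, ], year = 2016)

# Keys are unique in df2, so the row count is unchanged and the join is returned
res <- left_join_same_rows(df1, df2, by = c("day", "month"))
```

Note that this is weaker than the pre-check in `strict_left_join`: it only fails when a duplicated key actually matches a row in `x`, and it pays for the full join before failing.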
My questions:
Has anyone else been tripped up by the addition of rows? I searched the GitHub issues and was surprised that I could not find similar cases. There seems to be some discussion about a feature to create join diagnostics ("[Feature] Would be nice to have a simple diagnosis after a join", #2202), but nothing about preventing new rows from being added during the join operation itself.
Could/should this be considered as a feature for a future release, either as a new function or added argument?
Those are similar issues, but none of them discusses the fact that you cannot easily assert your expectation of the join: you have to know the content of the foreign key beforehand, or else roll the dice. With left joins in particular, I expect users will often want the unit of analysis (generally the row) on the LHS to stay fixed when joining in other variables. While this can be done through a check_keys function, for an operation as common as left_join it would be nice to integrate this directly as a pipe-friendly argument (or to call a separate function, as I do above).
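To make the "separate function" option concrete, here is a rough sketch of a key check that returns its input unchanged, so it can sit inside a join call or a pipe. `assert_unique_key` is a hypothetical name used only for illustration, and the example uses base R's `merge` so that it runs without any packages.

```r
# Hypothetical helper: error on duplicated key combinations, otherwise
# return the data unchanged so the call can be nested or piped.
assert_unique_key <- function(data, cols) {
  dup <- duplicated(data[cols])
  if (any(dup)) {
    stop("Duplicate values in key column(s): ", paste(cols, collapse = ", "))
  }
  data
}

df1 <- data.frame(day = c(1, 2, 1, 1),
                  month = c("Jan", "Jan", "Jan", "Feb"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(df1[-1, ], year = 2016)

# Base R left join on the shared columns; the check runs before the join
joined <- merge(df1, assert_unique_key(df2, c("day", "month")), all.x = TRUE)
```

With dplyr loaded, the same helper should slot into a pipe in the same way, for example `df1 %>% left_join(assert_unique_key(df2, c("day", "month")))`.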