Join on inequality constraints #557
Several issues mention the need to join on inequality constraints, rather than only equality constraints (as well as the ability to join on equality constraints with differently named variables).
Also for your reference,
A join on inequality constraints (of which a rolling join is a special case) is an extremely common and useful procedure, and one for which many users turn to sqldf().
It has been suggested elsewhere that a lazy cross join would be one way to approach this problem. However, this method results in an inner join only. In addition, it is less parsimonious and transparent than some users may be expecting. For reference, cross joins are discussed in #197.
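To illustrate why a cross join followed by a filter behaves like an inner join, here is a minimal base R sketch. The toy values are hypothetical; only the column names (FundID, yearmonth, gmonth) are taken from the example in this thread, and the cross join is built eagerly with expand.grid() rather than lazily.

```r
# Hypothetical toy data; column names follow the example used in this thread.
FundMonths <- data.frame(FundID = c(1, 2), yearmonth = c(10, 10))
Returns <- data.frame(FundID = c(1, 2), gmonth = c(5, 20))

# Cross join: pair every row of FundMonths with every row of Returns.
pairs <- expand.grid(i = seq_len(nrow(FundMonths)), j = seq_len(nrow(Returns)))
crossed <- data.frame(
  FundID = FundMonths$FundID[pairs$i],
  yearmonth = FundMonths$yearmonth[pairs$i],
  FundID.y = Returns$FundID[pairs$j],
  gmonth = Returns$gmonth[pairs$j]
)

# Filter on one equality and one inequality condition.
keep <- crossed$FundID == crossed$FundID.y & crossed$yearmonth > crossed$gmonth + 3
inner <- crossed[keep, ]
```

In this sketch FundID 2 has no qualifying row in Returns, so it disappears from `inner` entirely. A left join would have kept it with missing values for the Returns columns.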
Consider instead adding equality/inequality signs to the specification of join conditions. For example,
could be written in dplyr as
The variable(s) on the left-hand side of the equality or inequality come from the first input table and those from the second are on the right-hand side. In addition to permitting left joins and inequality constraints, this would permit users to join on equalities that have different names without the hassle of specifying them in
For backward compatibility and ease of use, specifying simply a variable name would mean an equality condition with the specified variable appearing in both datasets.
This interface would also permit users to conveniently join a table to itself, a common operation that I am sure would be appreciated. If this method is too difficult to parse, the conditions could be passed as a vector of strings in the
In short, the suggestion is this: rather than relying on the relatively inflexible lazy cross-join approach to achieve what users will ultimately want, why not go straight to an interface that you will be glad of in the long run?
Thanks for your consideration.
I don't think that syntax could work as is without breaking existing code. But you could do:
left_join( FundMonths, Returns, join_by(FundID == FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15) )
The challenge is to implement this efficiently for local data frames. I guess you probably need a nested loop that checks the filter condition on each iteration. That would be considerably faster than generating the Cartesian product and then filtering (it would also obviate the need for a lazy Cartesian join class).
What do you think of the last idea (passing each condition as a string to
I also think your suggested join_by syntax is good. It's just a little more verbose. Would this
Fair enough. As an alternative idea, you might consider keeping the
@romainfrancois How hard would this be to implement? It basically needs a double loop over all the rows in the input and output. For each combination, you evaluate the provided expression.
I think it's simplest for an inner join, but for a left join (etc.) you have to make sure that rows in x that don't match any rows in y are still preserved.
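The double loop described here could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: `nested_loop_left_join` and its `cond` argument are hypothetical names, not part of dplyr, the toy data is invented, and the loop is the naive O(nrow(x) * nrow(y)) version with no attempt at efficiency.

```r
# Hypothetical helper, not part of dplyr: a naive nested-loop left join.
# `cond` is a function(xrow, yrow) that returns TRUE when the pair matches.
nested_loop_left_join <- function(x, y, cond) {
  ix <- integer(0)  # row indices into x
  iy <- integer(0)  # row indices into y (NA for unmatched x rows)
  for (i in seq_len(nrow(x))) {
    matched <- FALSE
    for (j in seq_len(nrow(y))) {
      if (isTRUE(cond(x[i, , drop = FALSE], y[j, , drop = FALSE]))) {
        ix <- c(ix, i)
        iy <- c(iy, j)
        matched <- TRUE
      }
    }
    if (!matched) {
      # Left join: keep the x row even though nothing in y matched it.
      ix <- c(ix, i)
      iy <- c(iy, NA_integer_)
    }
  }
  # Only append columns of y that x does not already have.
  new_cols <- setdiff(names(y), names(x))
  result <- cbind(x[ix, , drop = FALSE], y[iy, new_cols, drop = FALSE])
  rownames(result) <- NULL
  result
}

# Hypothetical toy data; column names follow the example in this thread.
FundMonths <- data.frame(FundID = c(1, 2), yearmonth = c(10, 10))
Returns <- data.frame(FundID = c(1, 1, 2), gmonth = c(5, 6, 20),
                      ret = c(0.01, 0.02, 0.03))

joined <- nested_loop_left_join(FundMonths, Returns, function(xr, yr) {
  xr$FundID == yr$FundID &&
    xr$yearmonth > yr$gmonth + 3 &&
    xr$yearmonth <= yr$gmonth + 15
})
```

Dropping the `if (!matched)` branch would turn this into an inner join, which is why the inner case is the simpler one. Indexing `y` with `NA` is what pads the unmatched row with missing values.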
Does dplyr::left_join have this feature now when working with two data frames? I am looking to join by an inequality. As pointed out earlier, I think this would be straightforward in SQL. Here is an MWE that captures a dataset similar to mine.
I get the following error.