# tidyverse/dplyr

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

# Join on inequality constraints #557

Closed
opened this issue Aug 25, 2014 · 18 comments

Projects
None yet
10 participants

### gvfarns commented Aug 25, 2014

 Several issues mention the need to join on inequality constraints, rather than only equality constraints (as well as the ability to join on equality constraints with differently named variables). Also for your reference, https://groups.google.com/forum/#!topic/manipulatr/gviW8lwCCzI A join on inequality constraints (of which a rolling join in a special case) is extremely common and useful procedure, and one for which many users use sqldf(). It has been suggested elsewhere that a lazy cross join would be one way to approach this problem. However, this method results in an inner join only. In addition, it is less parsimonious and transparent than some users may be expecting. For reference, cross joins are discussed in #197. Consider instead adding equality/inequality signs to the specification of join conditions. For example, `````` FundsWithReturns <- sqldf('select d1.*, d2.ReturnRf from FundMonths d1 left join Returns d2 on d1.FundID = d2.FundID and d1.yearmonth - 3 > d2.gmonth and d1.yearmonth - 15 <= d2.gmonth') `````` could be written in dplyr as ``````FundsWithReturns <- left_join(FundMonths, Returns, FundID == FundID, yearmonth > gmonth +3, yearmonth <= gmonth + 15) `````` The variable(s) on the right hand side of the equality or inequality come from the first input table and those from the second are on the right hand side. In addition to permitting left joins and inequality constraints, this would permit users to join on equalities that have different names without the hassle of specifying them in `by.x` and `by.y` style conditions like those in #515. For backward compatability and ease of use, specifying simply a variable name would mean an equality condition with the specified variable appearing in both datasets. ``````FundsWithReturns <- left_join(FundMonths, Returns, "FundID", yearmonth > gmonth +3, yearmonth <= gmonth + 15) `````` This interface would also permit users to conveniently join a table to itself, a common operation that I am sure would be appreciated. If this method is too difficult to parse, the conditions could be passed as a vector of strings in the `by` parameter. That would be really backward compatible. `````` FundsWithReturns <- left_join(FundMonths, Returns, by=c("FundID", "yearmonth > gmonth +3","yearmonth <= gmonth + 15")) `````` This enhancement suggestion is that instead of relying on the relatively inflexible lazy cross-join methodology for achieving what users ultimately will want, why not jump straight to an interface that you will be glad of in the long run. Thanks for your consideration. Stackoverflow references:

Member

### hadley commented Aug 26, 2014

 I don't think syntax could work as is without breaking existing code. But you could do: ```left_join( FundMonths, Returns, join_by(FundID == FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15) )``` The challenge is to implement this efficiently for local data frames. I guess you probably need a nested loop that the checks the filter condition on each iteration. That would be considerably faster than generating the Cartesian product and then filtering (it would also obviate the need for a lazy cartesian join class)
Author

### gvfarns commented Aug 26, 2014

 What do you think of the last idea (passing each condition as a string to `by`)? I would think that would be compatible, no? It's not compatible with the named vector idea from #177, but as I understand it that is not yet being used by the general public. It should be compatible with `by` as it stands in 0.2. ``````FundsWithReturns <- left_join(FundMonths, Returns, by=c("FundID", "yearmonth > gmonth +3","yearmonth <= gmonth + 15")) `````` I also think your suggested join_by syntax is good. It's just a little more verbose. Would this `join_by()` output be an input to the `by` parameter, or would there be a separate parameter?
Member

### hadley commented Aug 26, 2014

 @gvfarns I'm really not keen on using strings to represent expressions. `join_key()` would be a function that did NSE to capture the expressions and would return a "join_key" object. Then it's easy to distinguish from character vectors representing equi-joins.
Author

### gvfarns commented Aug 26, 2014

 Fair enough. As an alternative idea, you might consider keeping the `*_join` functions as they are and creating `*_join_by` functions that use NSE themselves (at least `left` and `inner`). Then people doing simple equijoins can stick to the basic functions (which can be highly optimized for their one task) and those with more complex conditions can have functions that are flexible but of necessity less tuned.
Member

### hadley commented Aug 27, 2014

 @gvfarns the solution I prosed allows exactly that, and avoids creating additional functions that only differ in the type of their arguments.
Author

### gvfarns commented Aug 27, 2014

 Works for me. I'll look forward to the feature. Thanks

Closed

Closed

Member

### hadley commented Oct 8, 2014

 @romainfrancois How hard with this be to implement? It basically needs a double loop over all the rows in the input and output. For each combination, you evaluate the provided expression. `inner_join(x, y, join_by(TRUE)` would return the cartesian product of the tables. I think it's simplest for an inner join, but for a left join (etc) you have to make sure that rows in x that dont' match any rows in y are still preserved.

### whyusc commented Dec 11, 2014

 Hi there, just want to check on the status of this issue. I am really looking forward to have this feature and then I can move my whole workflow into R from current SAS+R... Thank you for all of your effort !

### shntnu commented Aug 4, 2015

 This is probably obvious, but when implementing the suggestion in #557 (comment), `!=` should also be permitted: ``````left_join( FundMonths, Returns, join_by(FundID != FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15) ) ``````

Closed

### hxrts commented Feb 22, 2016

 The most natural route I've come up with (at least in my case) is grouping, left joining, slicing on a conditional that collapses to 1 row, and ungrouping. However, this gets a bit unwieldy with massive tables so I'd love a more general solution.

Closed

### cwbishop commented Apr 11, 2016

 +1 for this feature. I've been leveraging sqldf in the meantime.

### whyusc commented Apr 12, 2016

 This feature is recently made available in data.table. see data.table implementation

Closed

### vijayrajesh98 commented May 18, 2016

 Does dplyr:: left_join has this feature now when working with two dataframes? I am looking to join by an inequality. As pointed earlier, I think this would be straight forward in SQL. Here is a MWE to capture my similar dataset. `````` id <- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,5,5,5,5,8,8,8,8,13,13,13) fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003, 1998,1999,2000,2001,2002,2003,1998,1999,2000,2001, 1998,1999,2000,2001,1998,1999,2000) byear <- c(1990,1995,2000,2005) eyear <- c(1995,2000,2005,2010) val <- c(3,1,5,6) sdata <- tbl_df(data.frame(byear,eyear,val)) test1 <- left_join(fdata,sdata, by = c("fyear" >= "byear","fyear" < "eyear")) `````` I get the following error. `````` Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds ``````

Closed

### iangow commented May 26, 2016

 SQL version should be easy to implement (see closed #1841).

Merged

### whyusc commented Jul 17, 2016

 Will this ever be implemented?

### krlmlr added a commit to krlmlr/dplyr that referenced this issue Sep 22, 2016

``` remove unused (tidyverse#557) ```
``` a24ecf1 ```

### krlmlr added a commit that referenced this issue Sep 29, 2016

``` remove unused (#557) ```
``` 5e3f327 ```

Closed

Closed

Closed

Closed

Open

### dan-svd commented Dec 7, 2016

 Oh, inequality joins! Is this a feature that will be implemented ? (thanks for the sqldf and data.table references above!)
Contributor

### benmarwick commented Dec 14, 2016

 These inequality joins are available in @dgrtwo's fuzzyjoin, here's a nice example: http://stackoverflow.com/q/37289405/1036500

Open

Member

### hadley commented Feb 2, 2017

 Moved into #2240

### lockbot locked as resolved and limited conversation to collaborators Jun 8, 2018

to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.