Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join on inequality constraints #557

Closed
gvfarns opened this issue Aug 25, 2014 · 18 comments

Comments

Projects
None yet
10 participants
@gvfarns
Copy link

commented Aug 25, 2014

Several issues mention the need to join on inequality constraints, rather than only equality constraints (as well as the ability to join on equality constraints with differently named variables).

#515
#177
#378

Also for your reference,

https://groups.google.com/forum/#!topic/manipulatr/gviW8lwCCzI

A join on inequality constraints (of which a rolling join in a special case) is extremely common and useful procedure, and one for which many users use sqldf().

It has been suggested elsewhere that a lazy cross join would be one way to approach this problem. However, this method results in an inner join only. In addition, it is less parsimonious and transparent than some users may be expecting. For reference, cross joins are discussed in #197.

Consider instead adding equality/inequality signs to the specification of join conditions. For example,

  FundsWithReturns <- sqldf('select d1.*, d2.ReturnRf
                         from FundMonths d1 left join Returns d2
                         on d1.FundID = d2.FundID
                         and d1.yearmonth - 3 > d2.gmonth
                         and d1.yearmonth - 15 <= d2.gmonth')

could be written in dplyr as

FundsWithReturns <- left_join(FundMonths, Returns, FundID == FundID, 
                                   yearmonth > gmonth +3, yearmonth <= gmonth + 15)

The variable(s) on the right hand side of the equality or inequality come from the first input table and those from the second are on the right hand side. In addition to permitting left joins and inequality constraints, this would permit users to join on equalities that have different names without the hassle of specifying them in by.x and by.y style conditions like those in #515.

For backward compatability and ease of use, specifying simply a variable name would mean an equality condition with the specified variable appearing in both datasets.

FundsWithReturns <- left_join(FundMonths, Returns, "FundID", 
                                   yearmonth > gmonth +3, yearmonth <= gmonth + 15)

This interface would also permit users to conveniently join a table to itself, a common operation that I am sure would be appreciated. If this method is too difficult to parse, the conditions could be passed as a vector of strings in the by parameter. That would be really backward compatible.

    FundsWithReturns <- left_join(FundMonths, Returns, by=c("FundID", 
                                   "yearmonth > gmonth +3","yearmonth <= gmonth + 15"))

This enhancement suggestion is that instead of relying on the relatively inflexible lazy cross-join methodology for achieving what users ultimately will want, why not jump straight to an interface that you will be glad of in the long run.

Thanks for your consideration.

Stackoverflow references:

http://stackoverflow.com/questions/21888910/how-to-specify-names-of-columns-for-x-and-y-when-joining-in-dplyr
http://stackoverflow.com/questions/25045630/how-specify-join-variables-with-different-names-in-different-mysql-tables

@hadley hadley self-assigned this Aug 26, 2014

@hadley hadley added this to the 0.4 milestone Aug 26, 2014

@hadley

This comment has been minimized.

Copy link
Member

commented Aug 26, 2014

I don't think syntax could work as is without breaking existing code. But you could do:

left_join( 
  FundMonths, Returns, 
  join_by(FundID == FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15)
)

The challenge is to implement this efficiently for local data frames. I guess you probably need a nested loop that the checks the filter condition on each iteration. That would be considerably faster than generating the Cartesian product and then filtering (it would also obviate the need for a lazy cartesian join class)

@gvfarns

This comment has been minimized.

Copy link
Author

commented Aug 26, 2014

What do you think of the last idea (passing each condition as a string to by)? I would think that would be compatible, no? It's not compatible with the named vector idea from #177, but as I understand it that is not yet being used by the general public. It should be compatible with by as it stands in 0.2.

FundsWithReturns <- left_join(FundMonths, Returns, by=c("FundID", 
                               "yearmonth > gmonth +3","yearmonth <= gmonth + 15"))

I also think your suggested join_by syntax is good. It's just a little more verbose. Would this join_by() output be an input to the by parameter, or would there be a separate parameter?

@hadley

This comment has been minimized.

Copy link
Member

commented Aug 26, 2014

@gvfarns I'm really not keen on using strings to represent expressions. join_key() would be a function that did NSE to capture the expressions and would return a "join_key" object. Then it's easy to distinguish from character vectors representing equi-joins.

@gvfarns

This comment has been minimized.

Copy link
Author

commented Aug 26, 2014

Fair enough. As an alternative idea, you might consider keeping the *_join functions as they are and creating *_join_by functions that use NSE themselves (at least left and inner). Then people doing simple equijoins can stick to the basic functions (which can be highly optimized for their one task) and those with more complex conditions can have functions that are flexible but of necessity less tuned.

@hadley

This comment has been minimized.

Copy link
Member

commented Aug 27, 2014

@gvfarns the solution I prosed allows exactly that, and avoids creating additional functions that only differ in the type of their arguments.

@gvfarns

This comment has been minimized.

Copy link
Author

commented Aug 27, 2014

Works for me. I'll look forward to the feature. Thanks

@hadley

This comment has been minimized.

Copy link
Member

commented Oct 8, 2014

@romainfrancois How hard with this be to implement? It basically needs a double loop over all the rows in the input and output. For each combination, you evaluate the provided expression. inner_join(x, y, join_by(TRUE) would return the cartesian product of the tables.

I think it's simplest for an inner join, but for a left join (etc) you have to make sure that rows in x that dont' match any rows in y are still preserved.

@whyusc

This comment has been minimized.

Copy link

commented Dec 11, 2014

Hi there, just want to check on the status of this issue. I am really looking forward to have this feature and then I can move my whole workflow into R from current SAS+R...

Thank you for all of your effort !

@shntnu

This comment has been minimized.

Copy link

commented Aug 4, 2015

This is probably obvious, but when implementing the suggestion in #557 (comment), != should also be permitted:

left_join( 
  FundMonths, Returns, 
  join_by(FundID != FundID, yearmonth > gmonth + 3, yearmonth <= gmonth + 15)
)

@hadley hadley referenced this issue Aug 24, 2015

Closed

nse for joins #1181

@hadley hadley added the feature label Oct 22, 2015

@hxrts

This comment has been minimized.

Copy link

commented Feb 22, 2016

The most natural route I've come up with (at least in my case) is grouping, left joining, slicing on a conditional that collapses to 1 row, and ungrouping. However, this gets a bit unwieldy with massive tables so I'd love a more general solution.

@hadley hadley added the data frame label Mar 1, 2016

@hadley hadley modified the milestones: future, 0.5 Mar 10, 2016

@cwbishop

This comment has been minimized.

Copy link

commented Apr 11, 2016

+1 for this feature. I've been leveraging sqldf in the meantime.

@whyusc

This comment has been minimized.

Copy link

commented Apr 12, 2016

This feature is recently made available in data.table.
see data.table implementation

@krlmlr krlmlr referenced this issue Apr 20, 2016

Closed

Rolling join #378

@vijayrajesh98

This comment has been minimized.

Copy link

commented May 18, 2016

Does dplyr:: left_join has this feature now when working with two dataframes? I am looking to join by an inequality. As pointed earlier, I think this would be straight forward in SQL. Here is a MWE to capture my similar dataset.

   id <- c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,5,5,5,5,8,8,8,8,13,13,13)

    fyear <- c(1998,1999,2000,2001,1998,1999,2000,2001,2002,2003,
       1998,1999,2000,2001,2002,2003,1998,1999,2000,2001,
       1998,1999,2000,2001,1998,1999,2000)

    byear <- c(1990,1995,2000,2005)
    eyear <- c(1995,2000,2005,2010)
    val <- c(3,1,5,6)

    sdata <- tbl_df(data.frame(byear,eyear,val))

    test1 <- left_join(fdata,sdata, by = c("fyear" >= "byear","fyear" < "eyear")) 

I get the following error.

    Error: cannot join on columns 'TRUE' x 'TRUE': index out of bounds 
@iangow

This comment has been minimized.

Copy link

commented May 26, 2016

SQL version should be easy to implement (see closed #1841).

@whyusc

This comment has been minimized.

Copy link

commented Jul 17, 2016

Will this ever be implemented?

krlmlr added a commit to krlmlr/dplyr that referenced this issue Sep 22, 2016

krlmlr added a commit that referenced this issue Sep 29, 2016

@dan-svd

This comment has been minimized.

Copy link

commented Dec 7, 2016

Oh, inequality joins! Is this a feature that will be implemented ? (thanks for the sqldf and data.table references above!)

@benmarwick

This comment has been minimized.

Copy link
Contributor

commented Dec 14, 2016

These inequality joins are available in @dgrtwo's fuzzyjoin, here's a nice example: http://stackoverflow.com/q/37289405/1036500

@hadley

This comment has been minimized.

Copy link
Member

commented Feb 2, 2017

Moved into #2240

@hadley hadley closed this Feb 2, 2017

@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.