Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rolling join #378

Closed
nirski opened this issue Apr 10, 2014 · 5 comments
Closed

Rolling join #378

nirski opened this issue Apr 10, 2014 · 5 comments
Labels
feature a feature request or enhancement
Milestone

Comments

@nirski
Copy link

nirski commented Apr 10, 2014

Would it be possible to add the rolling join feature to dplyr?

Rolling join, known also as last observation carried forward (LOCF), is an inequality join of two tables.

The example below contains two data frames:

  • Ds (sales log), with date of transaction, item id, customer id, and quantity sold
  • Dp (price list), with date of publicaton, item id, and price published
library(data.table)

Ds <- read.csv(text = "
date, cust, item, sold
2010-01-19, 101, 2, 11
2010-01-22, 102, 1, 7
2010-01-24, 102, 2, 9
2010-01-25, 101, 2, 9
2010-01-26, 101, 1, 10
")
Ds <- data.table(Ds)
Ds <- Ds[, date := as.Date(date, format = ("%Y-%m-%d"))]
setkey(Ds, item, date)

Dp <- read.csv(text = "
date, item, price
2010-01-20, 1, 18.3
2010-01-20, 2, 38.3
2010-01-23, 1, 18.9
2010-01-23, 2, 48.9
2010-01-26, 1, 19.1
2010-01-26, 2, 59.1
")
Dp <- data.table(Dp)
Dp <- Dp[, date := as.Date(date, format = ("%Y-%m-%d"))]
setkey(Dp, item, date)

What we need is match the most recent price to every record of sales log Ds. This is where data.table comes to rescue with a neat one-liner.

(Dt <- Dp[Ds, roll = TRUE])
##    item       date price cust sold
## 1:    1 2010-01-22  18.3  102    7
## 2:    1 2010-01-26  19.1  101   10
## 3:    2 2010-01-19    NA  101   11
## 4:    2 2010-01-24  48.9  102    9
## 5:    2 2010-01-25  48.9  101    9

Session info:

## R version 3.0.3 (2014-03-06)
## Platform: i386-w64-mingw32/i386 (32-bit)
## 
## locale:
## [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
## [3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
## [5] LC_TIME=Polish_Poland.1250    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.2
## 
## loaded via a namespace (and not attached):
##  [1] evaluate_0.5.3   formatR_0.10     knitr_1.5.25     plyr_1.8.1      
##  [5] Rcpp_0.11.1      reshape2_1.2.2   rmarkdown_0.1.72 stringr_0.6.2   
##  [9] tools_3.0.3      yaml_2.1.11
@hadley hadley added this to the bluesky milestone Aug 1, 2014
@hadley
Copy link
Member

hadley commented Aug 1, 2014

This will require considerable work as we currently only support equi-joins. Might be best to do on top of lazy cross joins.

@krlmlr
Copy link
Member

krlmlr commented Apr 20, 2016

I like the syntax proposed in #557 (comment). How about something similar for rolling joins:

left_join( 
  events, days,
  join_by(collector_id == collector_id, rolling = event_timestamp >= day)
)

This would select, for each "event" record, just that one "day" record right before the event. This seems to be implemented efficiently for data.table, and my feeling is that a data.frame implementation is easier for this operation than for generic inequality constraints. I'm not sure which SQL engines implement this, but I can imagine a generic solution that creates a temporary column so that a "normal" inequality join can be used.

@krlmlr
Copy link
Member

krlmlr commented Apr 20, 2016

LOCF (last observation carried forward) is also used.

The result of the operation is different if you allow fuzz in the "left" or in the "right" table. How about:

left_join( 
  events, days,
  join_by(collector_id == collector_id, event_timestamp >= largest(day))
)

(This is different from smallest(event_timestamp) >= day.)

@hadley
Copy link
Member

hadley commented Feb 21, 2017

Moving all future discussion to #2240

@hadley hadley closed this as completed Feb 21, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
@hadley hadley added reprex needs a minimal reproducible example and removed reprex needs a minimal reproducible example labels Dec 13, 2019
@hadley
Copy link
Member

hadley commented Dec 13, 2019

Ooops, sorry, script ran mildy amoc

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants