Make efficient C++ versions of window functions #133

Closed
4 of 5 tasks
hadley opened this issue Nov 25, 2013 · 4 comments
Labels
feature a feature request or enhancement
@hadley
Member

hadley commented Nov 25, 2013

Loosely grouped below

  • lag and lead
  • nth_value, first_value, last_value
  • row_number, rank, dense_rank
  • percent_rank, cume_dist, ntile
  • cumsum, cummin, cummax
@romainfrancois
Member

I've put some code for first (last will follow). I don't have the full argument matching at my disposal, so I'm handling the arguments this way:

  • the first is always considered to be the variable. I just check that the argument is either unnamed or called 'x'
  • all other arguments must be named. I don't do positional matching for them, but partial matching of the names is allowed.
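
To make the two rules above concrete, here is a minimal standalone sketch of that kind of simplified argument matching. It is not the code from dplyr: the `Arg` struct, `is_prefix` and `match_args` are hypothetical names made up for illustration, and rejecting ambiguous prefixes is an assumption on my part.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation of one call argument: `name` is empty when the
// argument was passed unnamed, `value` is the text of the expression.
struct Arg {
    std::string name;
    std::string value;
};

// True when `given` is a non-empty prefix of `formal` (R-style partial matching).
bool is_prefix(const std::string& given, const std::string& formal) {
    return !given.empty() && given.size() <= formal.size() &&
           formal.compare(0, given.size(), given) == 0;
}

// Maps each argument to the index of a formal parameter; formals[0] is the
// variable (e.g. "x"). Returns false when the call breaks the rules.
bool match_args(const std::vector<Arg>& args,
                const std::vector<std::string>& formals,
                std::vector<int>& matched) {
    matched.clear();
    if (args.empty() || formals.empty()) return false;

    // Rule 1: the first argument is the variable; it is unnamed or named formals[0].
    if (!args[0].name.empty() && args[0].name != formals[0]) return false;
    matched.push_back(0);

    // Rule 2: every other argument must be named; no positional matching.
    for (std::size_t i = 1; i < args.size(); ++i) {
        if (args[i].name.empty()) return false;
        int hit = -1;
        for (std::size_t j = 1; j < formals.size(); ++j) {
            if (is_prefix(args[i].name, formals[j])) {
                if (hit != -1) return false;  // assumption: ambiguous prefix is rejected
                hit = static_cast<int>(j);
            }
        }
        if (hit == -1) return false;  // name matches no formal
        matched.push_back(hit);
    }
    return true;
}
```

With formals `x`, `order_by` and `default`, a call like `first(x, order = y)` matches under these rules, while `first(x, y)` is rejected because the second argument is unnamed.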

That's a bit more code than I anticipated for the implementation of something conceptually simple. The advantage over calling the R version of first is that there is no materialisation of the data. For example:

> df <- data.frame( x = 1:16, g = rep(1:4, each = 4), y = 16:1 )
> df
    x g  y
1   1 1 16
2   2 1 15
3   3 1 14
4   4 1 13
5   5 2 12
6   6 2 11
7   7 2 10
8   8 2  9
9   9 3  8
10 10 3  7
11 11 3  6
12 12 3  5
13 13 4  4
14 14 4  3
15 15 4  2
16 16 4  1
> df %.% group_by(g) %.% summarise( first_x = first(x) )
Source: local data frame [4 x 2]

  g first_x
1 1       1
2 2       5
3 3       9
4 4      13
> df %.% group_by(g) %.% summarise( first_x = first(x, order_by = y) )
Source: local data frame [4 x 2]

  g first_x
1 1       4
2 2       8
3 3      12
4 4      16

In each case, I don't have to materialise the 4 vectors (one per group) to pass them to the R function first. In the easy case first(x) I just pick the first one from the data virtually indexed by the indices.

For the case first(x, order_by = y) I loop over the y variable to find the smallest value, but at no point am I materialising either x or y.
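
As a rough illustration of the idea (not the actual dplyr code), the two cases can be sketched in standalone C++ like this. The function names and the use of `std::vector` are assumptions for the example; `indices` stands for the 0-based row positions of one group in the full column, and the group is assumed to be non-empty.

```cpp
#include <cstddef>
#include <vector>

// first(x) for one group: read the value at the group's first row position
// directly from the full column, without copying the group's values out.
template <typename T>
T first_in_group(const std::vector<T>& x,
                 const std::vector<std::size_t>& indices) {
    return x[indices.front()];
}

// first(x, order_by = y) for one group: scan the group's rows for the smallest
// y and return the x value on that row. Neither x nor y is subset or copied.
template <typename T, typename U>
T first_ordered(const std::vector<T>& x, const std::vector<U>& y,
                const std::vector<std::size_t>& indices) {
    std::size_t best = indices.front();
    for (std::size_t i : indices) {
        if (y[i] < y[best]) best = i;  // strict '<' keeps the earliest row on ties
    }
    return x[best];
}
```

With the data frame above (x = 1:16, y = 16:1), the first group covers row positions 0 to 3, so `first_in_group` gives 1 and `first_ordered` gives 4, in line with the printed results.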

last should be straightforward, and nth should not be too hard.

@hadley
Member Author

hadley commented Apr 7, 2014

Nice, thanks Romain.

@hadley hadley modified the milestones: 0.3.1, 0.3 Sep 11, 2014
@romainfrancois
Member

The code for lead and lag was already there but not enabled. I've fixed it a bit and enabled it, although for now it does not support the full set of arguments of the R counterparts. They only handle the 2-argument form, i.e. they don't handle default or order_by. If the call has more than 2 arguments, it just falls back to R evaluation.

I think I can handle default easily.

Then for order_by I can borrow some code from first ...

I'll handle the case order_by = symbol hybridly, as I guess it's the main use case; anything else will fall back to R, as it's more complicated to handle.
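
For reference, a grouped lag and lead with a default value could look roughly like the standalone sketch below. This is only an illustration under assumptions, not the code in dplyr: `lag_group`, `lead_group` and the `fill` parameter (standing in for default) are hypothetical names, `indices` holds the 0-based row positions of one group, and `out` must already have the same length as `x`.

```cpp
#include <cstddef>
#include <vector>

// lag(x, n) within one group: the value n rows earlier in the same group,
// or `fill` when there is no such row.
template <typename T>
void lag_group(const std::vector<T>& x,
               const std::vector<std::size_t>& indices,
               std::size_t n, const T& fill, std::vector<T>& out) {
    for (std::size_t i = 0; i < indices.size(); ++i) {
        out[indices[i]] = (i >= n) ? x[indices[i - n]] : fill;
    }
}

// lead(x, n) within one group: the value n rows later in the same group,
// or `fill` when the group ends before that row.
template <typename T>
void lead_group(const std::vector<T>& x,
                const std::vector<std::size_t>& indices,
                std::size_t n, const T& fill, std::vector<T>& out) {
    for (std::size_t i = 0; i < indices.size(); ++i) {
        out[indices[i]] = (i + n < indices.size()) ? x[indices[i + n]] : fill;
    }
}
```

Each function is called once per group and writes straight into the result column, so shifted values never cross a group boundary.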

@hadley hadley modified the milestones: 0.4, 0.3.1 Oct 30, 2014
@hadley
Copy link
Member Author

hadley commented Oct 22, 2015

Is this finished?

@hadley hadley closed this as completed Mar 1, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018
2 participants