Originally posted this as a dplyr issue (since it's technically a different way to do a dplyr calculation: a way to integrate data.table rather than having dplyr support data.table), but migrating it here:
I'm an avid user of data.table, but dplyr has a much more accessible syntax, and when I was first learning R, it was dplyr that made that possible. With larger tables, however, it became harder to justify dplyr's speed, so I switched and started using data.table. If you have two technologies that accomplish the same things, and one is faster while the other is more readable, you should be able to wrap the fast one to produce the readable one.
So I attempted to make dplyr verbs construct a data.table call, and added one additional verb, calculate(), which evaluates the current state of the call (mirroring data.table's ability to do several things in a single call). It's still extremely rough (it only supports the basics), but the actual construction of the call isn't all that messy; dplyr and data.table syntax are really quite close to one another.
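To illustrate that closeness, here's the same aggregation written both ways (a minimal sketch on mtcars; the dplyr version is shown as a comment so the snippet only depends on data.table):

```r
library(data.table)

DT <- as.data.table(mtcars)

# dplyr:  mtcars %>% filter(cyl > 5) %>% group_by(cyl) %>% summarise(m = mean(mpg))
# data.table maps the same verbs onto one call: i = filter, j = summarise, by = group_by
res <- DT[cyl > 5, .(m = mean(mpg)), by = .(cyl)]
```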
I'm wondering if something like this could work (fully aware that I'm overwriting the verb_ functions; that needs to change, this was just intended as a proof of concept):
Function definitions:
library(magrittr)
library(data.table)

# Calculate the current version of the call
calculate <- function(.call) {
  .call <- check_call(.call)
  out <- do.call(function(data, ...) `[`(data, ...), .call)
  out
}
# If it's a data object, turn it into a list containing a copy of that object
check_call <- function(.call) {
  if (is.data.table(.call) || is.data.frame(.call)) {
    .call <- list(data = setDT(copy(.call)))
  }
  .call
}
# Extract the data portion of the call, and pass the arguments down to group_by_
group_by <- function(.call, ...) {
  .call <- check_call(.call)
  call <- sys.call()[-2]
  call[1] <- expression(group_by_)
  c(.call, eval(call))
}

# Take the group variables, use data.table `.()` syntax, assign to `by`
group_by_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`.`)
  list(by = call)
}

# Extract the data portion of the call, and pass the arguments down to mutate_
mutate <- function(.call, ...) {
  .call <- check_call(.call)
  call <- sys.call()[-2]
  call[1] <- expression(mutate_)
  c(.call, eval(call))
}

# Take the assignment columns, use `:=` syntax, assign to `j`
mutate_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`:=`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to summarise_
summarise <- function(.call, ...) {
  .call <- check_call(.call)
  call <- sys.call()[-2]
  call[1] <- expression(summarise_)
  c(.call, eval(call))
}

# Take the summary columns, use `.()` syntax, assign to `j`
summarise_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`.`)
  list(j = call)
}

# Extract the data portion of the call, and pass the arguments down to filter_
filter <- function(.call, ...) {
  .call <- check_call(.call)
  call <- sys.call()[-2]
  call[1] <- expression(filter_)
  c(.call, eval(call))
}

# Take the filter conditions, wrap in `()`, assign to `i`
filter_ <- function(...) {
  call <- sys.call()
  call[1] <- expression(`(`)
  list(i = call)
}

# Evaluate the expression up to this point, then select columns
select <- function(.call, ...) {
  call <- sys.call()[-2]
  call[1] <- expression(`.`)
  do.call(function(...) `[`(calculate(.call), ...), list(j = call))
}

# Evaluate the expression up to this point, then arrange rows
arrange <- function(.call, ...) {
  call <- sys.call()[-2]
  call[1] <- expression(`order`)
  do.call(function(...) `[`(calculate(.call), ...), list(i = call))
}
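The trick every verb_ helper leans on is plain call surgery: sys.call() captures the user's call as a language object, and assigning to its first element swaps the function head for a data.table construct. A standalone sketch of just that step, not reusing the definitions above:

```r
# Capture a call as a language object, the way sys.call() does inside group_by_()
cl <- quote(group_by_(cyl, gear))

# Swap the head: group_by_(cyl, gear) becomes the data.table grouping call .(cyl, gear)
cl[1] <- expression(`.`)

deparse(cl)  # the unevaluated grouping expression, ready to be assigned to `by`
```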
Example:
mtcars %>%
  filter(cyl > 5) %>%
  group_by(cyl, gear) %>%
  summarise(avgMPG = mean(mpg)) %>%
  calculate
Output:
cyl gear avgMPG
1: 6 4 21.000
2: 4 4 26.925
3: 6 3 21.400
4: 4 3 21.500
5: 4 5 28.200
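For context on what calculate() is doing: the verbs accumulate unevaluated pieces that slot into data.table's `DT[i, j, by]`. Here is one way that assembly can be done with base metaprogramming (a sketch of the mechanism under those assumptions; the names `state` and `cl` are illustrative, not the objects the functions above produce):

```r
library(data.table)

DT <- as.data.table(mtcars)

# The accumulated "state": named, unevaluated pieces matching `[.data.table`'s arguments
state <- list(
  i  = quote(cyl > 5),
  j  = quote(.(avgMPG = mean(mpg))),
  by = quote(.(cyl, gear))
)

# Assemble DT[i, j, by] as a single call and evaluate it;
# data.table then evaluates the quoted expressions in the data's scope
cl <- as.call(c(quote(`[`), quote(DT), state))
result <- eval(cl)
```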