implement one-stage group-by for data.table #239

xiaodaigh · 2019-12-18T22:39:07Z

No description provided.

ThoDuyNguyen · 2019-12-19T00:43:12Z

In my case, the data is already aggregated in each chunk. So, with the first-stage group-by done by data.table, will all the data be loaded into RAM?
(This question is for understanding the logic)

xiaodaigh · 2019-12-22T03:09:10Z

There seems to be some weird bug when overloading [[.disk.frame, hence this can't be easily progressed. Submit a bug report to future.apply; hopefully the two are related.

Unpinning this issue for now as there is no clear way forward.

ColeMiller1 · 2021-02-25T05:03:36Z

Could you expand on this? My understanding is that you need a hardby = ... argument within the [.disk.frame call. Related, you could use NSE to capture the by argument, do a hardby and shard based on the grouping variables, and then do the future.apply::future_lapply call.
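To make the shard-then-aggregate idea concrete, here is a minimal sketch using plain data.table. It makes several assumptions: a list of data.tables stands in for disk.frame chunks, `shard_of` is a hypothetical sharding rule, and `lapply` stands in for `future.apply::future_lapply`. It is not how disk.frame implements hard group-by. Once every group lives in exactly one shard, a single per-shard group-by is valid for any function, including non-decomposable ones such as `median`.

```r
library(data.table)

# Stand-in for disk.frame chunks: each chunk holds rows from every species.
dt = as.data.table(iris)
chunks = lapply(1:3, function(k) dt[seq(k, nrow(dt), by = 3L)])

# Hypothetical sharding rule: map each group to a shard via its factor code.
n_shards = 2L
shard_of = function(chunk) as.integer(chunk$Species) %% n_shards

# Re-shard: collect, from every chunk, the rows belonging to each shard.
shards = lapply(0:(n_shards - 1L), function(s) {
    rbindlist(lapply(chunks, function(chunk) chunk[shard_of(chunk) == s]))
})

# One-stage group-by per shard; no group is split across shards.
hard_by_res = rbindlist(lapply(shards, function(shard) {
    shard[, .(med = median(Sepal.Length)), by = Species]
}))

# Reference result computed on the full data.
reference = dt[, .(med = median(Sepal.Length)), by = Species]
```

The cost of this approach is the re-sharding step, which moves every row once before any aggregation happens.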

After looking into your approach for dplyr verbs, it would also be possible to use NSE to combine the per-chunk results, although it would not be particularly fun. But if you are game for PRs, I'd be happy to assist. Here's a very rough sketch, although I believe we could develop a more general framework.

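library(disk.frame)  # provides as.disk.frame() and the `[` method for disk.frames

my_NSE = function(df, ...) {
    # Stage 1: run the data.table-style call on each chunk and bind the results.
    res = df[...]

    # Capture the unevaluated arguments so that j and by can be inspected.
    dots = match.call(expand.dots = FALSE)$...
    dot_names = names(dots)

    # Locate by: either a named argument or the third positional argument.
    do_one_stage = TRUE
    if (any(dot_names == 'by')) {
        by_sub = dots$by
    } else if (length(dots) >= 3L) {
        by_sub = dots[[3L]]
    } else {
        do_one_stage = FALSE
    }

    # No grouping found: return the chunked result as-is.
    if (!do_one_stage) return(res)

    sub_j = if (any(dot_names == 'j')) dots$j else dots[[2L]]
    # Only .N is handled in this sketch: per-chunk counts are combined by summing.
    if (is.name(sub_j) && sub_j == quote(.N)) {
        second_j = quote(.(N = sum(N)))
    } else {
        return(res) ## no second-stage rule for this j; return the chunked aggregation
    }

    # Stage 2: re-aggregate the combined per-chunk results.
    eval(call('[', res, j = second_j, by = by_sub))
}
iris.df = as.disk.frame(iris)
my_NSE(iris.df, , j = .N, by = Species)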

##      Species     N
##       <fctr> <int>
##1:     setosa    50
##2: versicolor    50
##3:  virginica    50

Note, this takes about 40 ms on my computer.
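One way the sketch might generalize beyond `.N` is a lookup from the first-stage function to the function that recombines per-chunk results. The following is a hedged illustration using plain data.table, with a list of data.tables standing in for disk.frame chunks. The names `combiner_for` and `two_stage` are hypothetical and not part of disk.frame, and only a few decomposable aggregates are covered.

```r
library(data.table)

# Hypothetical lookup: given an unevaluated j such as quote(sum(x)) or quote(.N),
# return the name of the function that recombines per-chunk results,
# or NULL when no rule is known.
combiner_for = function(j) {
    key = if (is.name(j)) {
        as.character(j)
    } else if (is.call(j) && is.name(j[[1L]])) {
        as.character(j[[1L]])
    } else {
        "unsupported"
    }
    # Counts and sums add up; min and max recombine with themselves.
    switch(key, .N = , sum = "sum", min = "min", max = "max", NULL)
}

two_stage = function(chunks, j, by_cols) {
    # Stage 1: evaluate j within each chunk, storing the result as V1.
    stage1_call = substitute(chunk[, list(V1 = J), by = BY],
                             list(J = j, BY = by_cols))
    stage1 = rbindlist(lapply(chunks, function(chunk) eval(stage1_call)))

    comb = combiner_for(j)
    if (is.null(comb)) return(stage1)  # no rule: return the chunked aggregation

    # Stage 2: recombine the per-chunk results.
    combine = match.fun(comb)
    stage1[, list(V1 = combine(V1)), by = by_cols]
}

# Usage on chunks that each contain rows from every species.
dt = as.data.table(iris)
chunks = lapply(1:3, function(k) dt[seq(k, nrow(dt), by = 3L)])
counts = two_stage(chunks, quote(.N), "Species")
maxima = two_stage(chunks, quote(max(Sepal.Length)), "Species")
chunked_only = two_stage(chunks, quote(mean(Sepal.Length)), "Species")
```

Functions such as `mean` are deliberately left out of the lookup, because a mean of per-chunk means is not the overall mean unless every chunk contributes the same number of rows per group.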

xiaodaigh · 2021-02-25T11:27:23Z

> do a hardby and shard based on the grouping variables, and then do the future.apply::future_lapply call.

The issue is that hard_by is slow. For many operations, like mean and sum, you only need to perform a two-stage group-by: the first stage is performed per chunk, and a second stage then collects and combines the results from each chunk. That would be the ideal approach and would work for many functions.
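As an illustration of the two-stage idea for `sum` and `mean`, here is a minimal sketch with plain data.table, assuming a list of data.tables as a stand-in for disk.frame chunks. The column names `s` and `n` are illustrative. Note that the first stage for a mean keeps the per-group sum and count rather than the per-chunk mean, so that the second stage can combine them exactly.

```r
library(data.table)

# Stand-in for disk.frame chunks: a plain list of data.tables.
dt = as.data.table(iris)
chunks = lapply(1:3, function(k) dt[seq(k, nrow(dt), by = 3L)])

# Stage 1: aggregate within each chunk, keeping what stage 2 needs.
stage1 = rbindlist(lapply(chunks, function(chunk) {
    chunk[, .(s = sum(Sepal.Length), n = .N), by = Species]
}))

# Stage 2: combine the per-chunk partial sums and counts.
two_stage = stage1[, .(total = sum(s), avg = sum(s) / sum(n)), by = Species]

# Reference: the same aggregation computed on the full data in one pass.
one_shot = dt[, .(total = sum(Sepal.Length), avg = mean(Sepal.Length)), by = Species]
```

Only the small `stage1` table has to be held in memory for the second stage, which is why this is cheaper than re-sharding all rows.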

Actually, instead of a PR, are you able to create a new package, like disk.frame.dt, and store all of the data.table implementation there? I want to break disk.frame into smaller, more independent pieces, and I don't think supporting both dplyr and data.table syntax in one package is a good approach going forward. I would be happy to review the package if you do create one.

ColeMiller1 · 2021-02-25T12:15:43Z

I might be able to create a package. Would you still have a [ method within disk.frame?

xiaodaigh · 2021-02-26T02:14:01Z

I think it would not if an independent package exists. It will be migrated out.

ColeMiller1 · 2021-02-26T22:23:13Z

Ok. I will create a repo this weekend and start. For now, the goal is an NSE equivalent of what you've implemented for dplyr verbs.

xiaodaigh mentioned this issue Dec 18, 2019

Using group by with data.table syntax #238

Closed

xiaodaigh pinned this issue Dec 18, 2019

xiaodaigh unpinned this issue Dec 22, 2019

xiaodaigh mentioned this issue Sep 17, 2020

group_by having different results between dplyr and data.table syntax #310

Closed

matthewgson mentioned this issue Dec 30, 2020

hard group-by with data.table syntax #321

Closed
