`slice()` unnecessarily evaluates `...` twice for each group #377

eutwt · 2022-07-19T22:52:15Z

When the arguments in ... are expensive to compute, slice() could be made faster by evaluating ... only once per group. In the example below, the argument to slice() takes one second to compute and there are two groups, so the code takes ~4 seconds to run instead of ~2.

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2
library(data.table, warn.conflicts = FALSE)

dt <- data.table(id = c(1, 2))

return_1 <- function() {Sys.sleep(1); 1}

library(tictoc)
tic()
dt %>% 
  lazy_dt() %>% 
  group_by(id) %>% 
  slice(return_1())
#> Source: local data table [2 x 1]
#> Groups: id
#> Call:   `_DT1`[`_DT1`[, .I[return_1()[between(return_1(), -.N, .N)]], 
#>     by = .(id)]$V1]
#> 
#>      id
#>   <dbl>
#> 1     1
#> 2     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
toc()
#> 4.074 sec elapsed

tic()
dt[dt[,
  {
    slice_ind <- return_1()
    .I[slice_ind[between(slice_ind, -.N, .N)]]
  },
  by = .(id)
]$V1]
#>    id
#> 1:  1
#> 2:  2
toc()
#> 2.014 sec elapsed

Created on 2022-07-19 by the reprex package (v2.0.1)
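
The same effect can be checked without waiting on Sys.sleep() by counting calls instead of timing them. The following is a minimal sketch using only data.table; `counted_1()` and `n_calls` are hypothetical names introduced for illustration, and the first query simply mirrors the shape of the call dtplyr printed above rather than going through dtplyr itself.

```r
library(data.table)

dt <- data.table(id = c(1, 2))

# Hypothetical helper: returns 1 and records how often it was called
n_calls <- 0L
counted_1 <- function() {
  n_calls <<- n_calls + 1L
  1
}

# Same shape as the generated call: the index expression appears twice in j
n_calls <- 0L
res_twice <- dt[dt[, .I[counted_1()[between(counted_1(), -.N, .N)]], by = .(id)]$V1]
calls_twice <- n_calls

# Intermediate variable: the index expression appears once in j
n_calls <- 0L
res_once <- dt[dt[, {
  slice_ind <- counted_1()
  .I[slice_ind[between(slice_ind, -.N, .N)]]
}, by = .(id)]$V1]
calls_once <- n_calls

# Compare the number of evaluations for the two forms
c(twice = calls_twice, once = calls_once)
```

Given the timings above (two groups, ~4 seconds versus ~2), the first form would be expected to record two calls per group and the second one call per group, with both returning the same rows.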


eutwt · 2022-07-20T01:33:44Z

Oops. I missed that, thanks to Rdatatable/data.table#4353, this can be fixed more easily by using nomatch = NULL once data.table releases and we bump the required version. Probably not worth fixing before then.

markfairbanks · 2022-07-20T14:34:20Z

I sort of still like the idea of doing this one 🤷‍♂️

data.table seems to be pretty slow getting this release out. If we release before they do it'd be nice to have.

eutwt added the performance 🚀 label Jul 19, 2022

eutwt changed the title ~~performance: slice() unnecessarily evaluates ... twice for each group~~ slice() unnecessarily evaluates ... twice for each group Jul 19, 2022

eutwt closed this as completed Jul 20, 2022

markfairbanks mentioned this issue Jul 21, 2022

Use intermediate variable in slice() #378

Merged

markfairbanks mentioned this issue Feb 9, 2023

dtplyr release? #407

Closed
