Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slice() unnecessarily evaluates ... twice for each group #377

Closed
eutwt opened this issue Jul 19, 2022 · 2 comments
Closed

slice() unnecessarily evaluates ... twice for each group #377

eutwt opened this issue Jul 19, 2022 · 2 comments

Comments

@eutwt
Copy link
Collaborator

eutwt commented Jul 19, 2022

slice() could be made faster when the arguments in ... are expensive to compute by evaluating ... only once per group. In the example below where the argument to slice takes one second to compute and there are two groups, the code takes ~4 seconds to run instead of ~2.

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
#> Warning: package 'dplyr' was built under R version 4.1.2
library(data.table, warn.conflicts = FALSE)

dt <- data.table(id = c(1, 2))

return_1 <- function() {Sys.sleep(1); 1}

library(tictoc)
tic()
dt %>% 
  lazy_dt() %>% 
  group_by(id) %>% 
  slice(return_1())
#> Source: local data table [2 x 1]
#> Groups: id
#> Call:   `_DT1`[`_DT1`[, .I[return_1()[between(return_1(), -.N, .N)]], 
#>     by = .(id)]$V1]
#> 
#>      id
#>   <dbl>
#> 1     1
#> 2     2
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
toc()
#> 4.074 sec elapsed

tic()
dt[dt[,
  {
    slice_ind <- return_1()
    .I[slice_ind[between(slice_ind, -.N, .N)]]
  },
  by = .(id)
]$V1]
#>    id
#> 1:  1
#> 2:  2
toc()
#> 2.014 sec elapsed

Created on 2022-07-19 by the reprex package (v2.0.1)

@eutwt eutwt changed the title performance: slice() unnecessarily evaluates ... twice for each group slice() unnecessarily evaluates ... twice for each group Jul 19, 2022
@eutwt
Copy link
Collaborator Author

eutwt commented Jul 20, 2022

Oops. I missed that, thanks to Rdatatable/data.table#4353, this can be fixed more easily by using nomatch = NULL once data.table releases and we bump the required version. Probably not worth fixing before then

@eutwt eutwt closed this as completed Jul 20, 2022
@markfairbanks
Copy link
Collaborator

I sort of still like the idea of doing this one 🤷‍♂️

data.table seems to be pretty slow getting this release out. If we release before they do it'd be nice to have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants