Fast select() #367

Closed
markfairbanks opened this issue Jun 23, 2022 · 1 comment · Fixed by #375


Currently, select() always takes a deep copy of the selected columns. If an implicit or explicit copy has already occurred earlier in the pipe chain, we can instead drop the unwanted columns by reference.

Waiting on #366

library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

df <- lazy_dt(tibble(x = 1, y = 2))

# Old
df %>%
  mutate(z = 3) %>%
  select(x, z)
#> Source: local data table [1 x 2]
#> Call:   copy(`_DT1`)[, `:=`(z = 3)][, .(x, z)]
#> 
#>       x     z
#>   <dbl> <dbl>
#> 1     1     3
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

# New
remove_vars <- dtplyr:::remove_vars

df %>%
  mutate(z = 3) %>%
  remove_vars("y")
#> Source: local data table [1 x 2]
#> Call:   copy(`_DT1`)[, `:=`(z = 3)][, `:=`("y", NULL)]
#> 
#>       x     z
#>   <dbl> <dbl>
#> 1     1     3
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results
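
To see why dropping by reference is safe once a copy exists, the two generated calls above can be run directly in data.table. This is a minimal sketch using only data.table; the names `dt`, `old`, and `new` are illustrative and not part of dtplyr.

```r
library(data.table)

dt <- data.table(x = 1, y = 2)

# Old translation: `[, .(x, z)]` builds a second table
# on top of the copy() already made for mutate()
old <- copy(dt)[, z := 3][, .(x, z)]

# New translation: drop `y` in place on the copy that mutate() already made
new <- copy(dt)[, z := 3][, y := NULL]

names(dt)                          # the original still has both x and y
identical(names(old), names(new))  # both results keep x and z
```

Because `:=` only ever touches the copied table here, the source table behind `lazy_dt()` stays unmodified in both versions.
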

markfairbanks commented Jun 23, 2022

Thinking about this more, select() can also reorder columns, so this would need to be taken into account. The remove_vars() case might also need a call to setcolorder().

Example: df %>% mutate(z = 3) %>% select(z, x)

It would also have to account for columns being renamed inside select().
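
One way the reordering and renaming cases might be handled is sketched below in plain data.table: drop the unselected columns by reference, then fix the order with setcolorder() and the names with setnames(). The helper `select_by_ref()` and its interface are hypothetical, made up for illustration; this is not how dtplyr implements it.

```r
library(data.table)

# Hypothetical helper. `vars` is a character vector of columns to keep, in the
# desired order; names on `vars` (if any) are the new column names.
# It modifies `dt` in place, so it is only safe on a table that was already copied.
select_by_ref <- function(dt, vars) {
  keep <- unname(vars)
  drop <- setdiff(names(dt), keep)
  if (length(drop) > 0) dt[, (drop) := NULL]  # drop by reference
  setcolorder(dt, keep)                       # reorder by reference
  new_names <- names(vars)
  if (!is.null(new_names)) {
    rename <- new_names != ""
    setnames(dt, keep[rename], new_names[rename])  # rename by reference
  }
  dt
}

dt <- data.table(x = 1, y = 2)

# Roughly select(new_z = z, x) after mutate(z = 3)
res <- select_by_ref(copy(dt)[, z := 3], c(new_z = "z", "x"))
names(res)
```

All three steps operate by reference, so none of them would add another copy on top of the one mutate() already made.
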

And some benchmarks:

pacman::p_load(dplyr, dtplyr, stringi, data.table)

data_size <- 10000000
df <- tibble(a = sample(stri_rand_strings(100, 4), data_size, TRUE),
             b = sample(stri_rand_strings(100, 4), data_size, TRUE),
             c = sample(1:100, data_size, TRUE)) %>%
  lazy_dt()

remove_vars <- dtplyr:::remove_vars

bench::mark(
  old = df %>%
    mutate(d = 1) %>%
    select(a, b, d) %>%
    as.data.table(),
  new = df %>%
    mutate(d = 1) %>%
    remove_vars("c") %>%
    as.data.table(),
  check = FALSE, iterations = 30
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 old         406.1ms  461.7ms      2.12     728MB     2.19
#> 2 new          51.3ms   58.9ms     10.2      267MB     4.09
