Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add helper to get vec_unique and vec_group_id in one go. #1857

Closed
orgadish opened this issue Jul 9, 2023 · 5 comments
Closed

Add helper to get vec_unique and vec_group_id in one go. #1857

orgadish opened this issue Jul 9, 2023 · 5 comments

Comments

@orgadish
Copy link

orgadish commented Jul 9, 2023

I discovered that fs::path_file and fs::path_dir run very slowly on windows (see r-lib/fs#424), and since most of my use case of these functions is after using readr::read_csv(files, .id="file_path"), most of the vector is duplication. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This isn't just limited to fs::path_ functions, however.

The most straightforward approach is:

with_deduplication <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
} 

However, calculating both unique(x) and match(x, ux) is duplicated effort since it could be done in one go by combining the implementations of vctrs::vec_unique_loc and vctrs::vec_duplicate_id. This makes the deduplicated faster, but also means that the overhead of running this on unique vectors is significantly reduced.

In the particular case below, using this implementation is ~2x faster than the naive implementation above. This is critical in the case of the entirely unique vector, where this approach essentially entirely removes the overhead. When the vector is smaller, the benefit is smaller, but still improved.

set.seed(0)

TOTAL_N <- 1e6
unique_vector <- purrr::map_chr(1:(TOTAL_N*2), \(...) sample(LETTERS, 5) |> paste(collapse="")) |> 
  unique() |> 
  vctrs::vec_slice(1:TOTAL_N)

repeated_vector <- sample(unique_vector, 10) |> 
  rep(TOTAL_N / 10) |> 
  sample()

call_with_deduplication <- function(x, f) {
  ux <- unique(x)
  f(ux)[match(x, ux)]
}

# Simulating what the implementation would look like by precomputing match(x, ux).
rep_match_locs <- match(repeated_vector, unique(repeated_vector))
uni_match_locs <- match(unique_vector, unique(unique_vector))  # Actually same as 1:length(unique_vector)
vec_call_with_deduplication <- function(x, f, match_locs) {
  ux <- vctrs::vec_slice(x, vctrs::vec_unique_loc(x))
  f(ux)[match_locs]
}

dplyr::bind_rows(
  bench::mark(
    rep_std = repeated_vector |> stringr::str_to_lower(),
    rep_dedup = repeated_vector |> call_with_deduplication(stringr::str_to_lower),
    rep_vec_dedup = repeated_vector |> vec_call_with_deduplication(stringr::str_to_lower, rep_match_locs),
    iterations = 100
  ),
  
  bench::mark(
    uni_std = unique_vector |> stringr::str_to_lower(),
    uni_dedup = unique_vector |> call_with_deduplication(stringr::str_to_lower),
    uni_vec_dedup = unique_vector |> vec_call_with_deduplication(stringr::str_to_lower, uni_match_locs),
    iterations = 100
  )
)
#> # A tibble: 6 × 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rep_std        110.5ms  132.4ms      7.17    7.66MB    1.07 
#> 2 rep_dedup       45.8ms   49.1ms     20.2    30.89MB   60.7  
#> 3 rep_vec_dedup   19.7ms   20.6ms     48.2    19.45MB   27.1  
#> 4 uni_std        374.6ms  558.5ms      1.69   23.63MB    0.209
#> 5 uni_dedup        749ms  843.8ms      1.08   61.78MB    2.40 
#> 6 uni_vec_dedup    469ms    498ms      1.92   46.52MB    2.35

Created on 2023-07-08 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2023-07-08
#>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench         1.1.2   2021-11-30 [1] CRAN (R 4.2.0)
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr         1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate      0.19    2022-12-13 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr         0.10    2022-12-22 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  knitr         1.41    2022-11-18 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.2.0)
#>  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.0)
#>  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler        1.8.1   2022-11-07 [1] CRAN (R 4.2.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.36    2022-12-21 [1] CRAN (R 4.2.0)
#>  yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@DavisVaughan
Copy link
Member

I think you may want something like #1851

So I think it would look like

info <- vec_group_loc(x)
vec_slice(f(info$key), list_ungroup(info$loc))

Not 100% sure as I didn't work an example out, but I think that's right

@orgadish
Copy link
Author

orgadish commented Jul 9, 2023

I had assumed that vec_group_loc had to do with dataframe "groups", so I didn't even consider it here, but I see now that it's doing sort of what I'm looking for (and that vec_group_id is in fact equivalent to match(x, unique(x))). Thanks for pointing that out!

What you've suggested (list_ungroup(info$loc)) is certainly much faster than vec_group_id(x) for vectors with significant repetition — O(length(unique(x)) vs O(length(x)) — but is still slower than getting both bits of data in the first go: getting the match is essentially O(1) since we just have to access the second element in the result.

I don't know if the added complexity is worth the time savings, though. One case in which I think it might be is when using with_deduplication without knowing the amount of repetition ahead of time. If there was no repetition, the match would be O(length(x)) and would more than double the time to run the original function. I guess I could check whether length(info$keys) is close to length(x) and skip the match call as a result.

@orgadish orgadish changed the title Add helper to get vec_unique_loc and match(x, unique(x)) in one go. Add helper to get vec_unique_loc and vec_group_id in one go. Jul 27, 2023
@orgadish orgadish changed the title Add helper to get vec_unique_loc and vec_group_id in one go. Add helper to get vec_unique and vec_group_id in one go. Oct 9, 2023
@orgadish
Copy link
Author

orgadish commented Oct 9, 2023

@DavisVaughan What do you think of #1882? I've added vec_unique_loc into the computation of vec_group_id, though I could definitely separate that out into a new helper (perhaps even one that is not exported and only used for vec_deduplicate).

@orgadish
Copy link
Author

orgadish commented Oct 9, 2023

@DavisVaughan After testing the performance of #1882, I realized that the gain isn't sufficiently better than using the separate functions. And if performance is desired it's better to use collapse::funique and data.table::chmatch, for example.

@orgadish orgadish closed this as not planned Won't fix, can't repro, duplicate, stale Oct 9, 2023
@orgadish
Copy link
Author

I ended up implementing this in a separate package deduped, since I use it often (though using fastmatch::fmatch instead of data.table::chmatch):

# if(!requireNamespace("deduped")) install.packages("deduped")

library(deduped)

N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |> 
  rep(N_TOTAL/10) |> 
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0   
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02

Created on 2023-10-26 with reprex v2.0.2

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants