Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient vec_group_loc() vector access #911

Merged
merged 2 commits into from Mar 13, 2020

Conversation

DavisVaughan
Copy link
Member

This PR further optimizes vec_group_loc(), which should directly result in performance improvements for group_by().

In Instruments I noticed that the biggest killer in vec_group_loc() is the double deference in a tight loop of INTEGER(VECTOR_ELT()). That can be replaced with an array of integer pointers that index directly into those integer vectors. That really helped.

Side note: It is insane how fast this is if you order the data ahead of time. As pointed out by @brodieG

library(dplyr, warn.conflicts = FALSE)
library(vctrs)

set.seed(42)

g <- sample(1e6, 1e7, replace = TRUE)
tbl <- tibble(g = g)

o <- order(g)
ordered_tbl <- tbl[o,]
bench::mark(
  vec_group_loc(tbl),
  vec_group_loc(ordered_tbl),
  check = FALSE,
  iterations = 20
)

# Master
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_group_loc(tbl)            2.86s    2.99s     0.333     163MB    0.416
#> 2 vec_group_loc(ordered_tbl) 277.95ms 304.13ms     3.18      163MB    3.33

# This PR
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 vec_group_loc(tbl)            1.63s    1.71s     0.578     171MB    0.722
#> 2 vec_group_loc(ordered_tbl) 258.03ms 276.63ms     3.46      171MB    3.63

@brodieG
Copy link

brodieG commented Mar 12, 2020

Awesome! Obviously the ratios are not quite as good once you include the ordering time, but at least for this 1e7/1e6 case it seems to be worth it.

If you do go ahead and end up using the order thing, please considering giving a nod to the data table folk that contributed the radix sort to R. This trick is out of the question without that.

Also, as is probably obvious to you already, ordering might allow simplifying the group computation logic as well (maybe, I haven't looked at how you do it).

Thanks for keeping me in the loop, appreciate it.

Copy link
Member

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool improvement.

@DavisVaughan DavisVaughan merged commit e41c91b into r-lib:master Mar 13, 2020
@DavisVaughan DavisVaughan deleted the vec-group-loc-speed branch March 13, 2020 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants