Lazy ordering of character vectors #2204
So far the benchmarks cannot detect a difference; I will wait until I have a benchmark for this code path.
Interestingly, the performance difference depends very much on the size of the alphabet; in some cases, this PR is slower. I'll work on adding examples with alphabet sizes of 4, 10, 26 and ~80 to the benchmark code. Benchmark draft for alphabet size 4, where this PR performs much better than master:
set.seed(123)
create_ids <- function(N) {
  # Random text over a 4-letter alphabet, split into ids at each "|"
  s <- paste(sample(c(letters[1:4], "|"), N, replace = TRUE), collapse = "")
  ss <- strsplit(s, "|", fixed = TRUE)[[1]]
  ss <- unique(ss)
  ss <- ss[nchar(ss) > 3]
  ss
}
N <- 1e7
ids <- create_ids(N)
benchmark <- function(ids, summarize) {
  force(ids)
  df <- data_frame(ids, n = 0)
  gc()
  if (summarize) {
    system.time(group_by(df, ids) %>% summarize(n = mean(n)))
  } else {
    system.time(group_by(df, ids))
  }
}
devtools::load_all()
NN <- 3e5
benchmark(ids, TRUE)
benchmark(sample(ids, NN, replace = FALSE), TRUE)
benchmark(sample(ids, NN, replace = TRUE), TRUE)
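The other alphabet sizes mentioned above (10, 26 and ~80) could be covered by parameterizing the id generator. The following is only a sketch of one way to do that, not code from this PR: `make_alphabet()` and `create_ids_alphabet()` are hypothetical names, and the character pool used for the larger alphabets is an assumption.

```r
# Hypothetical helper, not part of the PR: returns `size` distinct single
# characters. "|" is deliberately left out because it is the id separator.
make_alphabet <- function(size) {
  pool <- c(
    letters, LETTERS, as.character(0:9),
    strsplit("!#$%&()*+,-./:;<=>?@_~^", "", fixed = TRUE)[[1]]
  )
  stopifnot(size >= 1, size <= length(pool))
  pool[seq_len(size)]
}

# Hypothetical generalization of create_ids() to an arbitrary alphabet size.
# min_chars = 4 matches the original filter nchar(ss) > 3.
create_ids_alphabet <- function(N, alphabet_size, min_chars = 4) {
  alphabet <- make_alphabet(alphabet_size)
  s <- paste(sample(c(alphabet, "|"), N, replace = TRUE), collapse = "")
  ss <- unique(strsplit(s, "|", fixed = TRUE)[[1]])
  ss[nchar(ss) >= min_chars]
}

set.seed(123)
for (k in c(4, 10, 26, 80)) {
  ids_k <- create_ids_alphabet(1e5, k)
  cat("alphabet size", k, ":", length(ids_k), "unique ids\n")
}
```

Note that the separator is drawn with probability 1 / (alphabet_size + 1), so larger alphabets also produce longer and fewer ids for the same N; a fair comparison may need N adjusted per alphabet size.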
instead of order, to avoid computing order (expensive!) unless necessary
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/
because this is a rather expensive operation. Reduces run time for grouping a data frame with 1e5 unique strings from ~1.3 s to ~0.8 s. Also includes slight improvements for the CharacterVectorOrderer.
Fixes #2198.
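To illustrate the general idea of deferring an expensive ordering until it is actually requested, here is a minimal R sketch. It is not the C++ code in this PR and does not reflect how CharacterVectorOrderer is implemented; `lazy_orderer()` and its accessors are hypothetical names chosen for the example.

```r
# Hypothetical illustration: wrap a character vector and compute its
# ordering only on first request, then reuse the cached result.
lazy_orderer <- function(x) {
  cached <- NULL
  n_computed <- 0L
  list(
    get_order = function() {
      if (is.null(cached)) {
        cached <<- order(x)            # the expensive step, done at most once
        n_computed <<- n_computed + 1L
      }
      cached
    },
    times_computed = function() n_computed
  )
}

keys <- c("delta", "alpha", "charlie", "bravo")
ord <- lazy_orderer(keys)
ord$times_computed()     # 0: nothing has been sorted yet
keys[ord$get_order()]    # first request triggers the sort
invisible(ord$get_order())  # second request is served from the cache
ord$times_computed()     # 1
```

Callers that never ask for the ordering pay nothing for it, which is the kind of saving the run time figures above describe.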