Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy ordering of character vectors #2204

Merged
merged 14 commits into from Jul 28, 2017
Merged

Conversation

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Oct 26, 2016

because this is a rather expensive operation. Reduces run time for grouping a data frame with 1e5 unique strings from ~1.3 s to ~0.8 s. Also includes slight improvements for the CharacterVectorOrderer.

Fixes #2198.

@krlmlr krlmlr force-pushed the f-2198-lazy-order branch 3 times, most recently from c554442 to 3b20350 Oct 26, 2016
@krlmlr krlmlr force-pushed the f-2198-lazy-order branch from 710d55a to 0d110fd Nov 11, 2016
@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Nov 11, 2016

So far the benchmarks cannot detect a difference, will wait until I have a benchmark for this code path.

@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Nov 27, 2016

Interestingly, the performance difference depends very much on the size of the alphabet; in some cases, this PR is slower. I'll work on adding examples with alphabet sizes of 4, 10, 26 and ~80 to the benchmark code.

Benchmark draft for alphabet size 4, where this PR performs much better than master:

set.seed(123)

create_ids <- function(N) {
  s <- paste(sample(c(letters[1:4], "|"), N, replace = TRUE), collapse = "")
  ss <- strsplit(s, "|", fixed = TRUE)[[1]]
  ss <- unique(ss)
  ss <- ss[nchar(ss) > 3]
  ss
}

N <- 1e7
ids <- create_ids(N)

benchmark <- function(ids, summarize) {
  force(ids)
  df <- data_frame(ids, n = 0)
  gc()
  if (summarize) {
    system.time(group_by(df, ids) %>% summarize(n = mean(n)))
  } else {
    system.time(group_by(df, ids))
  }
}

devtools::load_all()

NN <- 3e5
benchmark(ids, TRUE)
benchmark(sample(ids, NN, replace = FALSE), TRUE)
benchmark(sample(ids, NN, replace = TRUE), TRUE)

@krlmlr krlmlr changed the title Lazy ordering of character vectors WIP: Lazy ordering of character vectors Nov 29, 2016
@krlmlr
Copy link
Member Author

@krlmlr krlmlr commented Nov 29, 2016

Alphabet size changes the expected length of common identifier parts. With fce0a9f a run time change can be observed with the existing tests, but this change assumes string equality if and only if the SEXP is equal; this will require that we use only UTF-8 for column data (#1885).

1
2
3
4

@krlmlr krlmlr force-pushed the f-2198-lazy-order branch from b1b18e8 to d462a13 Feb 10, 2017
@krlmlr krlmlr force-pushed the f-2198-lazy-order branch from 06b7a32 to 593dab8 Jul 28, 2017
@krlmlr krlmlr changed the title WIP: Lazy ordering of character vectors Lazy ordering of character vectors Jul 28, 2017
@krlmlr krlmlr requested a review from hadley Jul 28, 2017
hadley
hadley approved these changes Jul 28, 2017
@krlmlr krlmlr merged commit 9f1ff34 into tidyverse:master Jul 28, 2017
0 of 2 checks passed
@krlmlr krlmlr deleted the f-2198-lazy-order branch Jul 28, 2017
@lock
Copy link

@lock lock bot commented Jan 18, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jan 18, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants