Lazy ordering of character vectors #2204

krlmlr · 2016-10-26T23:37:40Z

Orders character vectors lazily, because ordering is a rather expensive operation. This reduces the run time for grouping a data frame with 1e5 unique strings from ~1.3 s to ~0.8 s. Also includes slight improvements to the CharacterVectorOrderer.

Fixes #2198.
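
To illustrate the idea of lazy ordering described above, here is a minimal C++ sketch. It is not dplyr's actual CharacterVectorOrderer: the class `LazyStringOrder` and its methods are hypothetical names chosen for this example, and plain `std::string` values stand in for R strings. The point is only that equality checks can run without any sorting, while ranks are computed once, on the first comparison that needs them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration, not dplyr's CharacterVectorOrderer:
// the rank of each string is computed on first use and then cached.
class LazyStringOrder {
public:
  explicit LazyStringOrder(std::vector<std::string> values)
    : values_(std::move(values)), computed_(false) {}

  // Equality needs no ordering, so it never triggers the sort.
  bool equal(std::size_t i, std::size_t j) const {
    return values_[i] == values_[j];
  }

  // "Less than" needs ranks; they are computed once, on demand.
  bool less(std::size_t i, std::size_t j) {
    ensure_ranks();
    return ranks_[i] < ranks_[j];
  }

  bool order_computed() const { return computed_; }

private:
  void ensure_ranks() {
    if (computed_) return;
    // Sort indices by the strings they refer to.
    std::vector<std::size_t> idx(values_.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));
    std::sort(idx.begin(), idx.end(), [this](std::size_t a, std::size_t b) {
      return values_[a] < values_[b];
    });
    // Assign ranks in sorted order; equal strings share a rank.
    ranks_.assign(values_.size(), 0);
    std::size_t rank = 0;
    for (std::size_t k = 0; k < idx.size(); ++k) {
      if (k > 0 && values_[idx[k]] != values_[idx[k - 1]]) ++rank;
      ranks_[idx[k]] = rank;
    }
    assert(ranks_.size() == values_.size());
    computed_ = true;
  }

  std::vector<std::string> values_;
  std::vector<std::size_t> ranks_;
  bool computed_;
};
```

In this sketch, an operation that only tests equality (as hash-based grouping might) never pays for the sort; the cost is deferred until `less()` is first called.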

krlmlr · 2016-11-11T21:18:34Z

So far the benchmarks cannot detect a difference; I will wait until I have a benchmark for this code path.

krlmlr · 2016-11-27T22:18:27Z

Interestingly, the performance difference depends very much on the size of the alphabet; in some cases, this PR is slower. I'll work on adding examples with alphabet sizes of 4, 10, 26 and ~80 to the benchmark code.
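
One plausible reason for the dependence on alphabet size is that random strings over a small alphabet tend to share longer common prefixes, so each string comparison has to inspect more characters. The C++ sketch below measures this under simplifying assumptions (fixed-length, uniformly random strings); the function names `common_prefix_length` and `mean_neighbour_prefix` are hypothetical and not part of dplyr or of the benchmark code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Length of the longest common prefix of two strings.
std::size_t common_prefix_length(const std::string& a, const std::string& b) {
  std::size_t n = std::min(a.size(), b.size());
  std::size_t i = 0;
  while (i < n && a[i] == b[i]) ++i;
  return i;
}

// Mean common-prefix length of neighbours after sorting `count` random
// strings of length `len` drawn from an alphabet of `alphabet_size` symbols.
double mean_neighbour_prefix(std::size_t alphabet_size, std::size_t count,
                             std::size_t len, unsigned seed) {
  // Symbols are taken from the 94 printable ASCII characters '!'..'~'.
  assert(alphabet_size >= 1 && alphabet_size <= 94 && count >= 2);
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> pick(0, static_cast<int>(alphabet_size) - 1);
  std::vector<std::string> strings(count, std::string(len, ' '));
  for (auto& s : strings)
    for (auto& ch : s) ch = static_cast<char>('!' + pick(gen));
  std::sort(strings.begin(), strings.end());
  double total = 0;
  for (std::size_t k = 1; k < count; ++k)
    total += static_cast<double>(common_prefix_length(strings[k - 1], strings[k]));
  return total / static_cast<double>(count - 1);
}
```

Calling `mean_neighbour_prefix` with alphabet sizes 4, 10, 26 and 80 for the same `count` and `len` gives a rough picture of how much longer the shared prefixes become as the alphabet shrinks.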

Benchmark draft for alphabet size 4, where this PR performs much better than master:

set.seed(123)

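# Build a vector of unique random identifiers over a 4-letter alphabet.
create_ids <- function(N) {
  # One long random string; "|" acts as the separator between identifiers.
  s <- paste(sample(c(letters[1:4], "|"), N, replace = TRUE), collapse = "")
  ss <- strsplit(s, "|", fixed = TRUE)[[1]]
  ss <- unique(ss)
  # Keep only identifiers longer than 3 characters.
  ss <- ss[nchar(ss) > 3]
  ss
}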

N <- 1e7
ids <- create_ids(N)

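# Time grouping (and optionally summarizing) a data frame keyed by `ids`.
benchmark <- function(ids, summarize) {
  force(ids)
  df <- data_frame(ids, n = 0)
  gc()  # collect garbage first so it is less likely to distort the timing
  if (summarize) {
    system.time(group_by(df, ids) %>% summarize(n = mean(n)))
  } else {
    system.time(group_by(df, ids))
  }
}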

devtools::load_all()

NN <- 3e5
benchmark(ids, TRUE)
benchmark(sample(ids, NN, replace = FALSE), TRUE)
benchmark(sample(ids, NN, replace = TRUE), TRUE)

krlmlr · 2016-11-29T09:19:22Z

Alphabet size changes the expected length of common identifier parts. With fce0a9f a run time change can be observed with the existing tests, but this change assumes that two strings are equal if and only if their SEXPs are equal; this will require that we use only UTF-8 for column data (#1885).
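
The assumption that strings are equal exactly when their SEXPs are equal can be sketched outside of R with an interning pool. This is a hypothetical C++ illustration, not dplyr code: `StringPool`, `same_string` and `address_hash` are names invented here, and `std::string` pointers stand in for SEXP addresses. Once every distinct string is stored only once, equality and hashing can work on the address alone; the pool only guarantees this for strings with identical bytes, which is why the comment above calls for a single encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a string cache: each distinct string is stored
// once, so equal strings always map to the same address.
class StringPool {
public:
  const std::string* intern(const std::string& s) {
    // insert() yields the existing element if `s` is already present.
    auto it = pool_.insert(s).first;
    assert(*it == s);
    return &*it;
  }

  std::size_t size() const { return pool_.size(); }

private:
  // Addresses of elements in an unordered_set stay valid across rehashing.
  std::unordered_set<std::string> pool_;
};

// With interned strings, equality only needs to compare addresses ...
inline bool same_string(const std::string* a, const std::string* b) {
  return a == b;
}

// ... and hashing only needs the address, not the characters.
inline std::size_t address_hash(const std::string* p) {
  return std::hash<const std::string*>()(p);
}
```

The trade-off in this sketch is that address-based comparison is cheap and independent of string length, but it says nothing about ordering, which still requires looking at the characters.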


lock · 2019-01-18T12:32:39Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

krlmlr force-pushed the f-2198-lazy-order branch 3 times, most recently from c554442 to 3b20350 Compare October 26, 2016 23:57

krlmlr force-pushed the f-2198-lazy-order branch from 710d55a to 0d110fd Compare November 11, 2016 20:56

krlmlr mentioned this pull request Nov 11, 2016

Missing benchmarks krlmlr/dplyr.benchmark#1

Open

30 tasks

krlmlr changed the title ~~Lazy ordering of character vectors~~ WIP: Lazy ordering of character vectors Nov 29, 2016

krlmlr added 5 commits February 10, 2017 13:14

VectorVisitorImpl<STRSXP> uses SEXP address for hashing

a37b777

instead of order, to avoid computing order (expensive!) unless necessary

remove unused members

e974424

VectorVisitorImpl<STRSXP> computes order lazily

84d9cbb

CharacterVectorOrderer uses hash map of right size

f7e024e

logging

75dcd73

krlmlr force-pushed the f-2198-lazy-order branch from b1b18e8 to d462a13 Compare February 10, 2017 12:14

krlmlr added 5 commits July 15, 2017 15:06

don't need to establish order for equality comparison

36728ac

Merge master~250 into f-2198-lazy-order-prev (using imerge)

01a134a

Merge master~200 into f-2198-lazy-order-prev (using imerge)

2542db2

Merge master into f-2198-lazy-order-prev (using imerge)

6d7137e

Merge remote-tracking branch 'origin/master' into f-2198-lazy-order-prev

953209b

krlmlr mentioned this pull request Jul 15, 2017

Extract code related to encoding from join.cpp #2975

Merged

Merge branch 'master' into f-2198-lazy-order

593dab8

krlmlr force-pushed the f-2198-lazy-order branch from 06b7a32 to 593dab8 Compare July 28, 2017 13:24

krlmlr added 2 commits July 28, 2017 15:30

Merge branch 'master' into f-2198-lazy-order

0cfb60d

VectorVisitorImpl reencodes to UTF-8

ec814e0

krlmlr changed the title ~~WIP: Lazy ordering of character vectors~~ Lazy ordering of character vectors Jul 28, 2017

krlmlr requested a review from hadley July 28, 2017 14:43

hadley approved these changes Jul 28, 2017


NEWS

4044486

krlmlr merged commit 9f1ff34 into tidyverse:master Jul 28, 2017

krlmlr deleted the f-2198-lazy-order branch July 28, 2017 17:17

lock bot locked and limited conversation to collaborators Jan 18, 2019
