arrange() can utilize setorder() to be faster.
This would have even more benefits in a longer pipe chain with an implicit copy where we wouldn't need to call copy() outright.
Note: In the case of a transmute expression inside arrange() we would need to still use order().
Ex: arrange(df, x - 1)
library(stringi)
library(data.table)
rand_strings <- stri_rand_strings(100, 4)
bench::press(
data_size = c(10000, 1000000, 10000000),
{
df <- data.table(a = sample(rand_strings, data_size, TRUE),
b = round(runif(data_size), 4))
bench::mark(
order = df[order(a, b)],
setorder = setorder(copy(df), a, b),
check = FALSE, iterations = 30
)
}
)
#> Running with:
#> data_size
#> 1 10000
#> 2 1000000
#> 3 10000000
#> # A tibble: 6 × 7
#> expression data_size min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 order 10000 1.06ms 1.38ms 703. 2.01MB 0
#> 2 setorder 10000 1.01ms 1.23ms 800. 371.95KB 0
#> 3 order 1000000 56.75ms 66.49ms 15.0 19.09MB 9.97
#> 4 setorder 1000000 45.15ms 51.5ms 19.0 27.67MB 34.9
#> 5 order 10000000 676.79ms 690.23ms 1.43 190.75MB 4.91
#> 6 setorder 10000000 560.98ms 648.71ms 1.55 276.58MB 3.88
# With implicit copy (from filter/slice/select/summarize)
bench::press(
data_size = c(10000, 1000000, 10000000),
{
df <- data.table(a = sample(rand_strings, data_size, TRUE),
b = round(runif(data_size), 4))
bench::mark(
order = df[, .(a, b)][order(a, b)],
setorder = setorder(df[, .(a, b)], a, b),
check = FALSE, iterations = 30
)
}
)
#> Running with:
#> data_size
#> 1 10000
#> 2 1000000
#> 3 10000000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 6 × 7
#> expression data_size min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 order 10000 1.65ms 1.97ms 488. 756.3KB 0
#> 2 setorder 10000 1.35ms 1.58ms 622. 488.4KB 21.4
#> 3 order 1000000 73.8ms 80.88ms 11.7 49.7MB 1.30
#> 4 setorder 1000000 59.8ms 67.33ms 14.7 43MB 2.26
#> 5 order 10000000 766.43ms 914.88ms 1.09 496MB 1.71
#> 6 setorder 10000000 628.64ms 733.18ms 1.35 429.2MB 1.67
Created on 2022-06-22 by the reprex package (v2.0.1)
arrange()can utilizesetorder()to be faster.This would have even more benefits in a longer pipe chain with an implicit copy where we wouldn't need to call
copy()outright.Note: In the case of a transmute expression inside
arrange()we would need to still useorder().Ex:
arrange(df, x - 1)Created on 2022-06-22 by the reprex package (v2.0.1)