Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance drop-off for arrange() #4962

Closed
markfairbanks opened this issue Mar 10, 2020 · 7 comments · Fixed by #6263
Closed

Performance drop-off for arrange() #4962

markfairbanks opened this issue Mar 10, 2020 · 7 comments · Fixed by #6263
Labels
performance 🚀 rows ↕️ Operations on rows: filter(), slice(), arrange()
Milestone

Comments

@markfairbanks
Copy link
Contributor

The dev version of dplyr has a performance drop-off when using arrange().

library(dplyr, warn.conflicts = FALSE)
library(bench)

data_size <- 1000000
test_df <- tibble(a = sample(c("a","a","b","c","d"), data_size, TRUE),
                  b = sample(1:20, data_size, TRUE))

bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5)

# dplyr 0.8.5
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse     369ms    372ms      2.69    19.2MB     4.04

# dplyr 1.0.0 dev
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse     2.18s    2.24s     0.446    48.3MB    0.804

Created on 2020-03-10 by the reprex package (v0.3.0)

@hadley
Copy link
Member

hadley commented Mar 10, 2020

@DavisVaughan can you please take a look? I see the roughly the same performance with latest vectrs.

@hadley hadley added this to the 1.0.0 milestone Mar 10, 2020
@DavisVaughan
Copy link
Member

DavisVaughan commented Mar 10, 2020

It seems to be due entirely to the order() call in arrange_rows(). It isn't choosing radix sort, it is choosing shell sort because of the character column (radix is only automatically applied when the columns are numeric, factor, or logical).

Should we be more aggressive than order() by manually setting the method to "radix", even for character vectors? It is much faster

library(dplyr, warn.conflicts = FALSE)
library(bench)

data_size <- 1000000
test_df <- tibble(a = sample(c("a","a","b","c","d"), data_size, TRUE),
                  b = sample(1:20, data_size, TRUE))

bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse    42.1ms   65.2ms      13.8    56.4MB     16.6

Created on 2020-03-10 by the reprex package (v0.3.0)

Caveats:

  • We can't just blindly set to method = "radix". It fails for complex types, so we'd have to check if the type is integer/numeric/logical/character, otherwise fallback.

  • A test fails because it will no longer respect the locale. Radix sorting follows the C locale no matter what.

Exact notes from ?sort:

However, there are some caveats with the radix sort:
If x is a character vector, all elements must share the same encoding. Only UTF-8 (including ASCII) and Latin-1 encodings are supported. Collation always follows the "C" locale.
Long vectors (with more than 2^32 elements) and complex vectors are not supported yet.

A quick test seems to suggest that long vectors automatically fall back to a different method even if method = "radix" is supplied, so I don't think we would have to check that.

@DavisVaughan
Copy link
Member

DavisVaughan commented Mar 10, 2020

It is possible we should have arrange(.data, ..., .by_group = FALSE, .method = "auto"). It would choose "radix" for character vectors by default, unlike order(), but otherwise would work the same as order(method = "auto"). This would allow users who rely on the locale behavior to set .method = "shell", although they would probably see performance drop off?

@krlmlr
Copy link
Member

krlmlr commented Mar 10, 2020

In the longer run, can we use radix sort with sort keys, as outlined in #3044 or #2201?

@hadley
Copy link
Member

hadley commented Mar 26, 2020

I think we'll need to leave this for 1.1.0. If anyone gets a chance to work on it in the near future, it would still be useful, but I don't think it's a blocker for release.

@romainfrancois
Copy link
Member

With PR #5808 we get:

library(dplyr, warn.conflicts = FALSE)

data_size <- 1000000
test_df <- tibble(a = sample(c("a","a","b","c","d"), data_size, TRUE),
                  b = sample(1:20, data_size, TRUE))

bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse      30ms   35.6ms      26.1    30.4MB     41.8

Created on 2021-03-10 by the reprex package (v0.3.0)

vs this on master:

library(dplyr, warn.conflicts = FALSE)

data_size <- 1000000
test_df <- tibble(a = sample(c("a","a","b","c","d"), data_size, TRUE),
                  b = sample(1:20, data_size, TRUE))

bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5
)
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse     1.33s    1.33s     0.753    17.3MB     1.13

Created on 2021-03-10 by the reprex package (v0.3.0)

@DavisVaughan
Copy link
Member

With #5868 we are a little faster than 0.8.5 (for this particular example) even with generation of the sort key

library(dplyr, warn.conflicts = FALSE)
library(bench)

data_size <- 1000000

test_df <- tibble(
  a = sample(c("a","a","b","c","d"), data_size, TRUE),
  b = sample(1:20, data_size, TRUE)
)

bench::mark(
  american_english = arrange(test_df, a, b),
  c_locale = arrange(test_df, a, b, .locale = "C"),
  check = FALSE,
  iterations = 10
)
#> # A tibble: 2 x 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 american_english  212.6ms  212.6ms      4.70      38MB     70.6
#> 2 c_locale           29.7ms   30.1ms     33.0     28.3MB     88.0

Created on 2021-07-01 by the reprex package (v2.0.0)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment