Performance drop-off for arrange()
#4962
Comments
@DavisVaughan can you please take a look? I see roughly the same performance with the latest vctrs.
It seems to be due entirely to the Should we be more aggressive than
library(dplyr, warn.conflicts = FALSE)
library(bench)
data_size <- 1000000
test_df <- tibble(
  a = sample(c("a", "a", "b", "c", "d"), data_size, TRUE),
  b = sample(1:20, data_size, TRUE)
)
bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse    42.1ms   65.2ms      13.8    56.4MB     16.6
Created on 2020-03-10 by the reprex package (v0.3.0)
Caveats:
Exact notes from
A quick test seems to suggest that long vectors automatically fall back to a different method even if
It is possible we should have
I think we'll need to leave this for 1.1.0. If anyone gets a chance to work on it in the near future, it would still be useful, but I don't think it's a blocker for release.
With PR #5808 we get:
library(dplyr, warn.conflicts = FALSE)
data_size <- 1000000
test_df <- tibble(
  a = sample(c("a", "a", "b", "c", "d"), data_size, TRUE),
  b = sample(1:20, data_size, TRUE)
)
bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse      30ms   35.6ms      26.1    30.4MB     41.8
Created on 2021-03-10 by the reprex package (v0.3.0)
vs this on master:
library(dplyr, warn.conflicts = FALSE)
data_size <- 1000000
test_df <- tibble(
  a = sample(c("a", "a", "b", "c", "d"), data_size, TRUE),
  b = sample(1:20, data_size, TRUE)
)
bench::mark(
  tidyverse = arrange(test_df, a, b),
  check = FALSE,
  iterations = 5
)
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tidyverse     1.33s    1.33s     0.753    17.3MB     1.13
Created on 2021-03-10 by the reprex package (v0.3.0)
With #5868 we are a little faster than 0.8.5 (for this particular example), even with generation of the sort key:
library(dplyr, warn.conflicts = FALSE)
library(bench)
data_size <- 1000000
test_df <- tibble(
  a = sample(c("a", "a", "b", "c", "d"), data_size, TRUE),
  b = sample(1:20, data_size, TRUE)
)
bench::mark(
  american_english = arrange(test_df, a, b),
  c_locale = arrange(test_df, a, b, .locale = "C"),
  check = FALSE,
  iterations = 10
)
#> # A tibble: 2 x 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 american_english  212.6ms  212.6ms      4.70      38MB     70.6
#> 2 c_locale           29.7ms   30.1ms     33.0     28.3MB     88.0
Created on 2021-07-01 by the reprex package (v2.0.0)
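For anyone comparing the two rows above: the C locale can differ from a locale-aware sort in result as well as speed, since it compares strings by code point, so uppercase letters sort ahead of lowercase ones. The sketch below is only an illustration in base R, not what dplyr does internally. It relies on the documented behaviour that `sort(method = "radix")` always sorts character vectors in the C locale; the vector `x` is made up for the example.

```r
# Illustrative input mixing upper- and lowercase (not from the benchmark above)
x <- c("b", "A", "a", "B")

# The radix method sorts character vectors in the C locale (by code point),
# so all uppercase letters come before all lowercase letters.
c_sorted <- sort(x, method = "radix")
print(c_sorted)
#> [1] "A" "B" "a" "b"

# The default method collates using the session locale, so its result depends
# on the machine; no expected output is shown for that reason.
locale_sorted <- sort(x)
print(locale_sorted)
```

If the data is all lowercase ASCII, as in the benchmark, both orderings should agree, in which case `.locale = "C"` would change only the timing.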
The dev version of dplyr has a performance drop-off when using arrange().
Created on 2020-03-10 by the reprex package (v0.3.0)