-
Notifications
You must be signed in to change notification settings - Fork 413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nest() / unnest() in 1.0.0 significantly slower #751
Comments
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Most of the problems here have been resolved in the development version of vctrs. The issue was mainly with With your exact example: # devtools::install_github("r-lib/vctrs")
library(tidyr)
num_rows <- 100000
x <- dplyr::tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
before <- Sys.time()
y <- dplyr::tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
) %>%
tidyr::nest(second_and_third = c(second, third)) %>%
tidyr::unnest(second_and_third)
after <- Sys.time()
if(length(which(x != y)) != 0){
stop("nest() and unnest() procedure results in corrupted data!")
}
cat(paste("Execution Time:",difftime(after,before,units="secs"),"seconds"))
#> Execution Time: 9.04665207862854 seconds Created on 2019-09-24 by the reprex package (v0.2.1) I think we could get 1-2 seconds faster with r-lib/vctrs#592. And probably a little bit more from a native C implementation of |
Reprex comparing new and legacy functions directly: library(tidyr)
num_rows <- 10000
df <- tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
bench::mark(
new = df %>%
nest(second_and_third = c(second, third)) %>%
unnest(second_and_third),
old = df %>%
nest_legacy(second, third) %>%
unnest_legacy(data)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 906ms 906ms 1.10 3.6MB 27.6
#> 2 old 384ms 404ms 2.48 2.98MB 24.8 Created on 2019-11-13 by the reprex package (v0.3.0) So the worst of the performance gap is now resolved, although obviously it would be nicer to do better than the previous version (although the new version is much more general so it's not too surprising that it's a bit slower currently). Profiling shows ~15% of run time from |
In theory here are the benefits from r-lib/vctrs#592 library(tidyr)
num_rows <- 10000
tbl <- tibble(
first = 1:num_rows,
second = 5:(num_rows+5-1),
third = 7:(num_rows+7-1)
)
tbl_nest <- tbl %>%
nest(second_and_third = c(second, third))
df_nest <- as.data.frame(tbl_nest)
df_nest$second_and_third <- lapply(df_nest$second_and_third, as.data.frame)
bench::mark(
tibble_new = unnest(tbl_nest, second_and_third),
tibble_old = unnest_legacy(tbl_nest, second_and_third),
dataframe_new = unnest(df_nest, second_and_third),
dataframe_old = unnest_legacy(df_nest, second_and_third),
iterations = 30
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tibble_new 622.2ms 698ms 1.37 1.43MB 22.7
#> 2 tibble_old 92.4ms 98.9ms 10.1 808.02KB 23.2
#> 3 dataframe_new 345.6ms 431.9ms 2.25 680.05KB 19.4
#> 4 dataframe_old 62.3ms 67.2ms 14.7 575.67KB 28.3 Created on 2019-11-13 by the reprex package (v0.3.0.9000) |
Slightly simpler reprex: library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 488.4ms 512.8ms 1.95 2.82MB 22.4
#> 2 unnest_legacy(df, y) 88.6ms 91.9ms 10.7 1.44MB 24.9 Created on 2019-11-28 by the reprex package (v0.3.0) |
Incremental update. With vctrs master branch after inclusion of big benefits from r-lib/vctrs#825 and small benefits from r-lib/vctrs#824 library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y),
iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 204ms 297.5ms 3.39 3.38MB 14.0
#> 2 unnest_legacy(df, y) 48ms 73.6ms 13.4 1.25MB 11.6 Created on 2020-02-17 by the reprex package (v0.3.0) I've got another idea to further cut this down by moving some expensive |
With dev dplyr, I'm now seeing: library(tidyr)
n <- 10000
df <- tibble(
g = 1:n,
y = rep(list(tibble(x = 1:5)), n)
)
bench::mark(
unnest(df, y),
unnest_legacy(df, y)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 unnest(df, y) 87.2ms 90.8ms 11.0 3.62MB 20.1
#> 2 unnest_legacy(df, y) 123.7ms 129.6ms 7.73 4.11MB 23.2 Created on 2020-04-22 by the reprex package (v0.3.0) So I think this is a good place to leave it. We can certainly come back to improve performance in the future, but I think the original pressing motivation is now resolved. |
It appears the new implementation of
nest()
andunnest()
has resulted in a dramatic slowdown, compared to the previous version oftidyR
Perhaps the problem is related to size preallocation?
In that case,
num_rows
needs to be large to observe the slowdown. See code snippet below.On my system:
The text was updated successfully, but these errors were encountered: