Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vec_rbind() performance regression on R 4.0 with df-cols #1122

Closed
DavisVaughan opened this issue May 27, 2020 · 2 comments · Fixed by #1151
Closed

vec_rbind() performance regression on R 4.0 with df-cols #1122

DavisVaughan opened this issue May 27, 2020 · 2 comments · Fixed by #1151

Comments

@DavisVaughan
Copy link
Member

DavisVaughan commented May 27, 2020

Performance regression when rbinding lots of data frames that have df-cols. I'm sure this has to do with making extra copies, but I'm not sure where yet. I'll take a look.

This is with dev vctrs.

R 3.6

library(vctrs)

df_col <- new_data_frame(list(x = 1:2))
df <- new_data_frame(list(y = df_col))

x <- rep_len(list(df), 10000)
y <- rep_len(list(df_col), 10000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(x))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(x)   37.5ms   37.9ms      26.2     277KB     52.4

bench::mark(lst_rbind(y))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(y)   16.5ms   18.2ms      55.1     274KB     16.5

R 4.0

library(vctrs)

df_col <- new_data_frame(list(x = 1:2))
df <- new_data_frame(list(y = df_col))

x <- rep_len(list(df), 10000)
y <- rep_len(list(df_col), 10000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(x)    316ms    352ms      2.84     764MB     48.2

bench::mark(lst_rbind(y))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(y)   15.8ms   17.6ms      55.6     274KB     23.4
@DavisVaughan
Copy link
Member Author

DavisVaughan commented May 27, 2020

It seems like when the output df-col is restored with vec_restore() before assigning it into out, that somehow increments the refcnt on each individual column of the df-col from 1 to 2, so then the columns of the output df-col are needlessly copied at the next assignment iteration

assigned = vec_restore(assigned, out_elt, R_NilValue);

@DavisVaughan
Copy link
Member Author

Hmm okay, so vec_restore() on a bare data frame / tibble passes through to bare_df_restore_impl() which calls r_clone_referenced() here

vctrs/src/proxy-restore.c

Lines 116 to 118 in 843c03c

static SEXP bare_df_restore_impl(SEXP x, SEXP to, R_len_t size) {
x = PROTECT(r_clone_referenced(x));
x = PROTECT(vec_restore_default(x, to));

This is a problem, because this is a df-col that we are restoring, meaning that it has already been set inside a data frame and the df-col itself has a refcnt of 1 already. So it is referenced, and a shallow duplication does happen here. This triggers a refcnt bump of all of the columns of that df-col, which bumps them from 1 up to 2.

It is possible we need to pass the ownership parameter down to vec_restore()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants