Performance regression when rbinding lots of data frames that have df-cols. I'm sure this has to do with making extra copies, but I'm not sure where yet. I'll take a look.
This is with dev vctrs.
R 3.6
library(vctrs)

df_col <- new_data_frame(list(x = 1:2))
df <- new_data_frame(list(y = df_col))

x <- rep_len(list(df), 10000)
y <- rep_len(list(df_col), 10000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(x))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(x)   37.5ms   37.9ms      26.2     277KB     52.4

bench::mark(lst_rbind(y))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(y)   16.5ms   18.2ms      55.1     274KB     16.5
R 4.0
library(vctrs)

df_col <- new_data_frame(list(x = 1:2))
df <- new_data_frame(list(y = df_col))

x <- rep_len(list(df), 10000)
y <- rep_len(list(df_col), 10000)

lst_rbind <- function(x) {
  vec_rbind(!!!x)
}

bench::mark(lst_rbind(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(x)    316ms    352ms      2.84     764MB     48.2

bench::mark(lst_rbind(y))
#> # A tibble: 1 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 lst_rbind(y)   15.8ms   17.6ms      55.6     274KB     23.4
It seems that when the output df-col is restored with vec_restore() before being assigned into out, the refcnt of each individual column of the df-col is incremented from 1 to 2. As a result, the columns of the output df-col are needlessly copied at the next assignment iteration.
static SEXP bare_df_restore_impl(SEXP x, SEXP to, R_len_t size) {
  x = PROTECT(r_clone_referenced(x));
  x = PROTECT(vec_restore_default(x, to));
This is a problem because the object being restored is a df-col: it has already been set inside a data frame, so the df-col itself already has a refcnt of 1. It therefore counts as referenced, and r_clone_referenced() performs a shallow duplication here. That duplication bumps the refcnt of every column of the df-col from 1 to 2.
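To make the mechanism concrete, here is a rough toy model in plain C. It is not vctrs or R internals code: `Column`, `Frame`, `frame_clone_referenced()` and `count_column_copies()` are hypothetical names invented for this sketch, and it assumes that the shell discarded by the shallow duplication still holds its references to the columns when the next assignment happens.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins (hypothetical, not R's SEXP): a column and a df-col,
   each carrying its own reference count. */
typedef struct { int refcnt; int n; int *data; } Column;
typedef struct { int refcnt; int ncol; Column **cols; } Frame;

static int n_deep_copies = 0;

static Column *column_new(int n) {
    Column *c = malloc(sizeof *c);
    assert(c != NULL);
    c->refcnt = 0;
    c->n = n;
    c->data = calloc((size_t)n, sizeof *c->data);
    assert(c->data != NULL);
    return c;
}

static Frame *frame_new(int ncol, int n) {
    Frame *f = malloc(sizeof *f);
    assert(f != NULL);
    f->refcnt = 0;
    f->ncol = ncol;
    f->cols = malloc((size_t)ncol * sizeof *f->cols);
    assert(f->cols != NULL);
    for (int j = 0; j < ncol; j++) {
        f->cols[j] = column_new(n);
        f->cols[j]->refcnt = 1;            /* referenced by this frame */
    }
    return f;
}

/* Shallow duplicate: new shell, same columns, each column gains a reference. */
static Frame *frame_shallow_clone(const Frame *f) {
    Frame *g = malloc(sizeof *g);
    assert(g != NULL);
    g->refcnt = 0;
    g->ncol = f->ncol;
    g->cols = malloc((size_t)f->ncol * sizeof *g->cols);
    assert(g->cols != NULL);
    for (int j = 0; j < f->ncol; j++) {
        g->cols[j] = f->cols[j];
        g->cols[j]->refcnt++;              /* the 1 -> 2 bump */
    }
    return g;
}

/* Loose analogue of r_clone_referenced(): clone only if referenced. */
static Frame *frame_clone_referenced(Frame *f) {
    return f->refcnt > 0 ? frame_shallow_clone(f) : f;
}

/* Drop a shell and release its references to the columns. */
static void frame_release(Frame *f) {
    for (int j = 0; j < f->ncol; j++) {
        Column *c = f->cols[j];
        if (--c->refcnt == 0) { free(c->data); free(c); }
    }
    free(f->cols);
    free(f);
}

/* Copy-on-write assignment: a shared column is deep-copied before the write. */
static void frame_assign(Frame *f, int j, int i, int value) {
    Column *c = f->cols[j];
    if (c->refcnt > 1) {
        Column *copy = column_new(c->n);
        memcpy(copy->data, c->data, (size_t)c->n * sizeof *c->data);
        copy->refcnt = 1;
        c->refcnt--;
        f->cols[j] = copy;
        n_deep_copies++;
        c = copy;
    }
    c->data[i] = value;
}

/* Run n_iter assignments into a one-column df-col that already lives inside
   a parent (refcnt 1) and return how many deep column copies were made. */
int count_column_copies(int n_iter, int restore_each_iter) {
    n_deep_copies = 0;
    Frame *out = frame_new(1, n_iter);
    out->refcnt = 1;
    for (int i = 0; i < n_iter; i++) {
        Frame *garbage = NULL;
        if (restore_each_iter) {
            Frame *restored = frame_clone_referenced(out);
            if (restored != out) {
                garbage = out;             /* old shell, not yet collected */
                restored->refcnt = 1;      /* stored back into the parent */
                out = restored;
            }
        }
        frame_assign(out, 0, i, i);
        if (garbage != NULL) frame_release(garbage);  /* assumed: after the write */
    }
    int copies = n_deep_copies;
    frame_release(out);
    return copies;
}
```

In this model, `count_column_copies(n, 0)` performs no deep copies, while `count_column_copies(n, 1)` performs one full column copy per iteration. That quadratic growth would be consistent with the jump in `mem_alloc` seen above, though the model does not prove that this is what R 4.0 does.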
We may need to pass the ownership parameter down to vec_restore().