New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

performance issue of inds_combine with many columns. Could be improve using std::string #56

Open
ginolhac opened this Issue Mar 21, 2018 · 26 comments

Comments

Projects
None yet
5 participants
@ginolhac

ginolhac commented Mar 21, 2018

Hello,

We noticed that dplyr::select() is slow when many columns are supplied. After investigations, @koncina and me, we found that this line Rcpp line might be the one to change:
https://github.com/tidyverse/tidyselect/blob/master/src/combine_variables.cpp#L28

The declaration of out_name as <String> and not std::string is slowing down the process.

Historically, it comes from a fix of an encoding issue by @romainfrancois here. Later on, @romainfrancois did a second fix but String remains. I must stress out that I know nothing about encoding, so keeping the String type could be mandatory.

After replacing L28 by std::vector<std::string> out_name; creating a function inds_combine_std, we obtained:

fake_columns <- paste0("col_", 1:50000)
system.time(v1 <- tidyselect:::inds_combine(fake_columns, list(-1)))
system.time(v2 <- inds_combine_std(fake_columns, list(-1)))
identical(v1, v2)

  user   system elapsed
11.548    0.003  11.556
  user   system elapsed
 0.007    0.000   0.007
 [1] TRUE

Of note, this line is also present in the current dplyr code:
https://github.com/tidyverse/dplyr/blob/master/src/combine_variables.cpp#L29

We replaced in dplyr v0.7.4 L29 with std::string and did some benchmark, see how it also improves:

image

Is there any (encoding) reason to keep the String type in both tidyselect and dplyr?

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 21, 2018

Interesting, thanks! cc @krlmlr

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 21, 2018

🏊 back, did you check that your fix handles this issue tidyverse/dplyr#1023

Weird that std::string would beat Rcpp::String as String is designed to avoid making copies, etc ... or maybe it's a case when it's better to make a copy. I used to live under the assumptions that avoiding copies is best, but then perhaps it's a cache thing, and when you make a std::vector<std::string> the strings are close to each other. I don't know.

Thanks for reporting this.

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 21, 2018

It may be caused by SEXP preservation, you have to allocate one SEXP node each time because the precious list is a linked list.

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 21, 2018

well strings are pooled and implicitly protected so maybe not (haven't checked the Rcpp code)

Edit: Also std::string() allocates as well and even makes copies as Romain mentioned.

@ginolhac

This comment has been minimized.

ginolhac commented Mar 21, 2018

I have a mac and @koncina ubuntu, so we thought about testing it but we don't have the machine for testing

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 21, 2018

Thanks. On Ubuntu, we could try profiling the code with gprofiler and see where the time is burnt.

@ginolhac

This comment has been minimized.

ginolhac commented Mar 22, 2018

a small update, while @koncina is trying to use gprofiler. It turns out that the compiler is acting, but this is largely beyond my skills, so here are the facts.

combine_vars

using this rcpp example of the combine_vars() function using either std::string or String:

  • on ubuntu or windows (WM), the std::string is performing better
  • on mac (or VM with bsd): no gain!!
> system.time(a <- inds_combine(paste0("col_", 1:50000), list(-1)))
   user  system elapsed 
  0.061   0.003   0.065 

> system.time(b <- inds_combine_fix(paste0("col_", 1:50000), list(-1)))
   user  system elapsed 
  0.071   0.003   0.077 

> system.time(c <- inds_combine(paste0("col_", 1:50000), list(1:50000, -1)))
   user  system elapsed 
 11.306   0.124  11.822 

> system.time(d <- inds_combine_fix(paste0("col_", 1:50000), list(1:50000, -1)))
   user  system elapsed 
 11.184   0.099  11.470 

> identical(a, b, c, d)
[1] TRUE

simpler example

using this simpler cpp example

  • the 3 OS give similar results, std::string is faster
> system.time(a <- with_string(paste0("col_", 1:50000)))
   user  system elapsed 
 17.087   0.401  21.884 

> system.time(b <- with_stdstr(paste0("col_", 1:50000)))
   user  system elapsed 
  0.038   0.001   0.040 

> identical(a, b)
[1] TRUE

sorry to not be more helpful, but we cannot figure out the inconsistency of the OS used

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 22, 2018

This is very helpful, especially with the simpler example outside of Rcpp. might be worth saving the results of 1:50000 outside of the system.time

I believe in both cases out_vec would be empty since when it is created tmp is empty.

CharacterVector out_vec(tmp.begin(), tmp.end());

Probably an Rcpp thing. Perhaps as I guessed the other day about cache locality. I don't know enough about the allocators used in std::vector<std::string> but if they do store their data contiguously, maybe that's the reason.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 23, 2018

A run time difference of that magnitude suggests that some heavy copying happens, or nested loops.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 23, 2018

Probably in this constructor.
https://github.com/RcppCore/Rcpp/blob/43c24bc7dd7c771195b830e54528a3236ac7dccf/inst/include/Rcpp/String.h#L89

       /** from string proxy */
        String(const StringProxy& proxy): data(proxy.get()), valid(true), buffer_ready(false), enc(Rf_getCharCE(proxy.get())) {
            Rcpp_PreserveObject(data);
            RCPP_STRING_DEBUG("String(const StringProxy&)");
        }
  • with the global string cache, do individual strings CHARSXP have to be protected ? the preserve/release is not free.
  • is Rf_getCharCE costly ?
@lionel-

This comment has been minimized.

Member

lionel- commented Mar 23, 2018

with the global string cache, do individual strings CHARSXP have to be protected ? the preserve/release is not free.

I don't think so, once it's in the string pool it's there forever. But as long as it's not guaranteed by the R language "spec" we probably can't rely on it though.

Is the CHARSXP released right after being assigned to the vector or only at the end of the vector filling? I think the former should be more efficient.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 23, 2018

potentially to go into vector::push_back( const String& ) the String( StringProxy ) ctor is used and then the String copy constructor is used right after it, there's no move operations in Rcpp as far as I can see.

The copy might be elided with RVO by the compiler, but I'm not sure. If not then that means another preserver and release.

then all the Strings are held by the std::vector<String> and its destructor's job is to destruct all the strings, I'm not sure about the order of destruction though, but I guess we're looking at a loop calling R_ReleaseObject. That also means more opportunities for the gc to want to play.

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 23, 2018

I'm not sure about the order of destruction though

Oh for R_ReleaseObject that's extremely important because there's a linear search through the precious linked list if the object you are releasing is not on top of the stack.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Mar 23, 2018

in the meantime, we can either:

  • use std::string as suggested
  • develop some lighter String that does not do preserve/release. I don't entirely trust the string pool though.

but yeah paying for preserve/release on individual strings feels wrong

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 23, 2018

  • Rewrite tidyselect in C ;)
@koncina

This comment has been minimized.

koncina commented Mar 23, 2018

To complete the simpler example posted earlier I created a function using two subfunctions to use either Rcpp::string or std::string.
The functions measures the time to perform the for loop at two different steps: just before the end of the subfunction and after returning from the subfunction (and you can allocate or not memory for the vector).

Here is a figure showing that the time loss might indeed be linked to a destructor (time is lost after exiting the function). Don't know if this observation is helping you.

image

@lionel-

This comment has been minimized.

Member

lionel- commented Mar 23, 2018

Thanks that's very helpful! That does suggest it's about the string preservation/release.

@koncina

This comment has been minimized.

koncina commented Mar 23, 2018

A small update to illustrate a previous remark that FreeBSD (or Mac OS) behaves differently. The following plot can be created using a FreeBSD (11.1) VM (and I presume on Mac OS):

image

I added the code to generate the plot in the gist.

Remark: The FreeBSD VM is running R 3.4.3 with Rcpp 0.12.16 while my Ubuntu setup is running R 3.4.4 with Rcpp 0.12.16. But using rocker to test R 3.4.3 on ubuntu does not change the observation.

@ginolhac

This comment has been minimized.

ginolhac commented Mar 23, 2018

thanks @koncina, I can confirm that on native macOS it behaves the same way

image

R version 3.4.4, system x86_64, darwin17.3.0, Rcpp 0.12.16

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 25, 2018

@koncina: Were you able to use gprofiler?

Destruction order of containers seems to be unspecified (https://stackoverflow.com/a/2083629/946850), this might explain different observations. We could add our own destructor to VarList to release in reverse order, how does that compare with using std::vector<String>?

@koncina

This comment has been minimized.

koncina commented Mar 29, 2018

@krlmlr: gprofiler identifies Rf_pmatch() as the limiting step (testing dplyr's select() with a large tibble or dplyr:::combine_vars() / tidyselect:::inds_combine() with a large character vector).

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(gprofiler)

# Create a large tibble to test select:
large_tibble <- matrix(runif(5e6), nrow = 100) %>%
  as_tibble()

# Use gprofiler
out_file <- tempfile("gprofiler", fileext = ".out")
start_profiler(out_file)
## Temporary files: /tmp/RtmprlMAfr/gprofiler47921b65c136.prof, /tmp/RtmprlMAfr/gprofiler47924d1a0bde.out/gprofiler4ca329b84657.out
x <- select(large_tibble, -1)
stop_profiler()

summaryRprof(out_file)
## $by.self
##                                                                                          self.time self.pct total.time total.pct
## "Rf_pmatch"                                                                                  12.24    95.77      12.24     95.77
## "_Z12combine_varsN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE"      0.46     3.60       0.48      3.76
## "FUN"                                                                                         0.06     0.47       0.08      0.63
## "<Anonymous>"                                                                                 0.02     0.16       0.02      0.16
## 
## $by.total
##                                                                                          total.time total.pct self.time self.pct
## "select_vars"                                                                                 12.78    100.00      0.00     0.00
## "select.data.frame"                                                                           12.78    100.00      0.00     0.00
## "select"                                                                                      12.78    100.00      0.00     0.00
## "combine_vars"                                                                                12.70     99.37      0.00     0.00
## "Rf_pmatch"                                                                                   12.24     95.77     12.24    95.77
## "_Z12combine_varsN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE"       0.48      3.76      0.46     3.60
## "_dplyr_combine_vars"                                                                          0.48      3.76      0.00     0.00
## "Rf_NewFrameConfirm"                                                                           0.48      3.76      0.00     0.00
## "FUN"                                                                                          0.08      0.63      0.06     0.47
## "lapply"                                                                                       0.06      0.47      0.00     0.00
## "map_if"                                                                                       0.06      0.47      0.00     0.00
## "map"                                                                                          0.06      0.47      0.00     0.00
## "<Anonymous>"                                                                                  0.02      0.16      0.02     0.16
## "map_lgl"                                                                                      0.02      0.16      0.00     0.00
## "map_mold"                                                                                     0.02      0.16      0.00     0.00
## "R_ReleaseObject"                                                                              0.02      0.16      0.00     0.00
## "vapply"                                                                                       0.02      0.16      0.00     0.00
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 12.78
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.10
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8   
##  [6] LC_MESSAGES=fr_FR.UTF-8    LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gprofiler_0.0-1 forcats_0.3.0   stringr_1.3.0   dplyr_0.7.4     purrr_0.2.4     readr_1.1.1     tidyr_0.8.0     tibble_1.4.2    ggplot2_2.2.1  
## [10] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.3   haven_1.1.1      lattice_0.20-35  colorspace_1.3-2 htmltools_0.3.6  yaml_2.1.18      rlang_0.2.0      pillar_1.2.1    
##  [9] withr_2.1.2      foreign_0.8-69   glue_1.2.0       modelr_0.1.1     readxl_1.0.0     bindrcpp_0.2.2   bindr_0.1.1      plyr_1.8.4      
## [17] RProtoBuf_0.4.11 munsell_0.4.3    gtable_0.2.0     cellranger_1.1.0 rvest_0.3.2      psych_1.7.8      evaluate_0.10.1  knitr_1.20      
## [25] parallel_3.4.4   broom_0.4.3      Rcpp_0.12.16     scales_0.5.0     backports_1.1.2  jsonlite_1.5     mnormt_1.5-5     hms_0.4.2       
## [33] digest_0.6.15    pprof_1.0        stringi_1.1.7    grid_3.4.4       rprojroot_1.3-2  cli_1.0.0        tools_3.4.4      bitops_1.0-6    
## [41] magrittr_1.5     lazyeval_0.2.1   RCurl_1.95-4.10  crayon_1.3.4     pkgconfig_2.0.1  xml2_1.2.0       profile_1.0      lubridate_1.7.3 
## [49] assertthat_0.2.0 rmarkdown_1.9    httr_1.3.1       rstudioapi_0.7   R6_2.2.2         nlme_3.1-131.1   compiler_3.4.4

I had some issues when I started to use gprofiler as stop_profiler() randomly failed to combine the profiles ("sum(ds_pprof$samples$value) == sum(ds_rprof$samples$value) is not TRUE" error). Making a long story short: I ended up in testing gprofiler with a self-compiled R 3.4.4 (same sessionInfo and capabilities) but gprofiler shows up the RecursiveRelease() function instead of Rf_pmatch() (no Rf_ function is visible)...

@romainfrancois: I checked the code in the https://github.com/tidyverse/dplyr/issues/1023 issue on a windows 10 installation. After replacing String with std::string on line 29 in dplyr's combine_variables.cpp, the performances are improved and I'm not reproducing the encoding issue.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

set.seed(123)
n <- 1e5
dat <- data_frame(
    Group = sample(c("m", "f"), size = n, replace=TRUE),
    Process = sample(c("å", "ä", "ö", "Å", "Ä", "Ö"), size = n, replace=TRUE)
)

dat %>% 
    group_by(Process) %>% 
    summarise(Number = n()) %>%
    spread(Process, Number, fill=0) %>% 
summarise_all(funs(sum))
## # A tibble: 1 x 6
##        ä      Ä      å      Å      ö      Ö
##    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 16685. 16713. 16692. 16586. 16621. 16703.
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252   
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
## [5] LC_TIME=French_France.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.8.0 dplyr_0.7.4
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16     knitr_1.20       bindr_0.1.1      magrittr_1.5    
##  [5] tidyselect_0.2.4 R6_2.2.2         rlang_0.2.0      stringr_1.3.0   
##  [9] tools_3.4.4      utf8_1.1.3       cli_1.0.0        htmltools_0.3.6 
## [13] yaml_2.1.18      assertthat_0.2.0 rprojroot_1.3-2  digest_0.6.15   
## [17] tibble_1.4.2     crayon_1.3.4     bindrcpp_0.2     purrr_0.2.4     
## [21] glue_1.2.0       evaluate_0.10.1  rmarkdown_1.9    stringi_1.1.7   
## [25] compiler_3.4.4   pillar_1.2.1     backports_1.1.2  pkgconfig_2.0.1
@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 29, 2018

Thanks for the heads up, it would be interesting to see which test cases caused these random failures to combine profiles. Would you mind filing an issue at the gprofiler repo?

Maybe the home-compiled R version is confusing gprofiler. Again, that would be a problem we'd need to look at eventually.

In the session info I see that dplyr 0.7.4 is being measured. But tidyselect is used only in the development version of dplyr, not yet in 0.7.4. Maybe it's useful to look for weak performance using tidyselect functions only?

@ginolhac

This comment has been minimized.

ginolhac commented Mar 30, 2018

@krlmlr what @koncina changed in dplyr 0.7.4 is this line https://github.com/tidyverse/dplyr/blob/master/src/combine_variables.cpp#L29. Most of the tidyselect code is duplicated in the current version of dplyr.

@krlmlr

This comment has been minimized.

Member

krlmlr commented Mar 30, 2018

This is good to know, but will this also lead to better performance when applied to tidyselect in the upcoming dplyr 0.7.5?

When using vars_select(), it seems that the size of vars is more crucial than the number of unquote-spliced (bang-bang-banged) arguments:

library(tidyselect)

test_vars <- function(m, n) {
  vars <- paste0("v", seq_len(n))
  sub_vars <- rlang::syms(sample(vars, m))
  system.time(vars_select(vars, !!!sub_vars))
}

test_vars(800, 16000)
#>    user  system elapsed 
#>   1.212   0.010   1.223
test_vars(1600, 16000)
#>    user  system elapsed 
#>   1.142   0.000   1.142
test_vars(3200, 16000)
#>    user  system elapsed 
#>    1.32    0.00    1.32
test_vars(6400, 16000)
#>    user  system elapsed 
#>   1.491   0.000   1.491
test_vars(1600, 32000)
#>    user  system elapsed 
#>   4.213   0.000   4.213
test_vars(3200, 32000)
#>    user  system elapsed 
#>   4.487   0.007   4.494
test_vars(6400, 32000)
#>    user  system elapsed 
#>   4.868   0.000   4.870
test_vars(12800, 32000)
#>    user  system elapsed 
#>   6.253   0.012   6.266

Created on 2018-03-30 by the reprex package (v0.2.0).

@koncina: Would you mind to gprofile a variant of this code?

@koncina

This comment has been minimized.

koncina commented Mar 31, 2018

I agree that we should switch to the development version and focus on tidyselect. The only drawback for now is that gprofiler fails to reveal the calls below the inds_combine() function. I was able to fix this for testing purposes by disabling the ByteCompile option in the DESCRIPTION file.

Another interesting observation is that performances in the development version are worse than in 0.7.4: In addition to the identification of inds_combine() / Rf_pmatch() as the limiting step, base::assign() is also showing up (using dplyr 0.7.4.9003 commit eafffad and tidyselect 0.2.4.9000 commit 72bedbb).

library(tidyverse)
library(gprofiler)
large_tibble <- matrix(runif(5e6), nrow = 100) %>%
  as_tibble()

out_file <- tempfile("gprofiler", fileext = ".out")
start_profiler(out_file)
x <- select(large_tibble, -1)
stop_profiler()

summaryRprof(out_file)[c("by.self", "sampling.time")]
## $by.self
##                                                                                          self.time self.pct total.time total.pct
## "Rf_pmatch"                                                                                  24.18    71.62      24.18     71.62
## "base::assign"                                                                                8.74    25.89       8.74     25.89
## "_Z12inds_combineN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE"      0.24     0.71       0.52      1.54
## "_ZN7VarList6updateEiN4Rcpp6StringE"                                                          0.24     0.71       0.24      0.71
## "env_bind_impl"                                                                               0.08     0.24       9.08     26.90
## "get"                                                                                         0.08     0.24       0.10      0.30
## "<GC>"                                                                                        0.08     0.24       0.08      0.24
## "getExportedValue"                                                                            0.04     0.12       0.22      0.65
## "asNamespace"                                                                                 0.04     0.12       0.04      0.12
## "::"                                                                                          0.02     0.06       0.24      0.71
## "setdiff2"                                                                                    0.02     0.06       0.02      0.06
## 
## $sampling.time
## [1] 33.76

Fixing the code in inds_combine() (replacing String by std::string on line 28 in combine_variables.cpp) improves also the performances in the development version but base::assign() remains a limiting step as shown in this plot (as I disabled ByteCompile in tidyselect I also added a condition with ByteCompile enabled to show that it doesn't explain the drop in performance):

image

@krlmlr: The random gprofiler error I mentioned also occurs with the code used in my posts and using the CRAN R binary... I'm still trying to find more informations before opening an issue: sometimes the error occurs more frequently suggesting that something is interfering...
I will look into the code of your last post by next week.

@krlmlr

This comment has been minimized.

Member

krlmlr commented May 26, 2018

I'm working on jointprof again and will take a look myself too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment