-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
performance issue of inds_combine with many columns. Could be improve using std::string #56
Comments
Interesting, thanks! cc @krlmlr |
🏊 back, did you check that your fix handles this issue tidyverse/dplyr#1023 Weird that Thanks for reporting this. |
It may be caused by SEXP preservation, you have to allocate one SEXP node each time because the precious list is a linked list. |
well strings are pooled and implicitly protected so maybe not (haven't checked the Rcpp code) Edit: Also |
I have a mac and @koncina ubuntu, so we thought about testing it but we don't have the machine for testing |
Thanks. On Ubuntu, we could try profiling the code with gprofiler and see where the time is burnt. |
a small update, while @koncina is trying to use combine_varsusing this
simpler exampleusing this
sorry to not be more helpful, but we cannot figure out the inconsistency of the OS used |
This is very helpful, especially with the simpler example outside of Rcpp. might be worth saving the results of I believe in both cases out_vec would be empty since when it is created tmp is empty.
Probably an Rcpp thing. Perhaps as I guessed the other day about cache locality. I don't know enough about the allocators used in |
A run time difference of that magnitude suggests that some heavy copying happens, or nested loops. |
Probably in this constructor. /** from string proxy */
String(const StringProxy& proxy): data(proxy.get()), valid(true), buffer_ready(false), enc(Rf_getCharCE(proxy.get())) {
Rcpp_PreserveObject(data);
RCPP_STRING_DEBUG("String(const StringProxy&)");
}
|
I don't think so, once it's in the string pool it's there forever. But as long as it's not guaranteed by the R language "spec" we probably can't rely on it though. Is the CHARSXP released right after being assigned to the vector or only at the end of the vector filling? I think the former should be more efficient. |
potentially to go into The copy might be elided with RVO by the compiler, but I'm not sure. If not then that means another preserver and release. then all the |
Oh for R_ReleaseObject that's extremely important because there's a linear search through the precious linked list if the object you are releasing is not on top of the stack. |
in the meantime, we can either:
but yeah paying for preserve/release on individual strings feels wrong |
|
To complete the simpler example posted earlier I created a function using two subfunctions to use either Rcpp::string or std::string. Here is a figure showing that the time loss might indeed be linked to a destructor (time is lost after exiting the function). Don't know if this observation is helping you. |
Thanks that's very helpful! That does suggest it's about the string preservation/release. |
A small update to illustrate a previous remark that FreeBSD (or Mac OS) behaves differently. The following plot can be created using a FreeBSD (11.1) VM (and I presume on Mac OS): I added the code to generate the plot in the gist. Remark: The FreeBSD VM is running R 3.4.3 with Rcpp 0.12.16 while my Ubuntu setup is running R 3.4.4 with Rcpp 0.12.16. But using rocker to test R 3.4.3 on ubuntu does not change the observation. |
thanks @koncina, I can confirm that on native macOS it behaves the same way R version 3.4.4, system x86_64, darwin17.3.0, Rcpp 0.12.16 |
@koncina: Were you able to use gprofiler? Destruction order of containers seems to be unspecified (https://stackoverflow.com/a/2083629/946850), this might explain different observations. We could add our own destructor to |
@krlmlr: gprofiler identifies library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.3.0
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gprofiler)
# Create a large tibble to test select:
large_tibble <- matrix(runif(5e6), nrow = 100) %>%
as_tibble()
# Use gprofiler
out_file <- tempfile("gprofiler", fileext = ".out")
start_profiler(out_file)
## Temporary files: /tmp/RtmprlMAfr/gprofiler47921b65c136.prof, /tmp/RtmprlMAfr/gprofiler47924d1a0bde.out/gprofiler4ca329b84657.out
x <- select(large_tibble, -1)
stop_profiler()
summaryRprof(out_file)
## $by.self
## self.time self.pct total.time total.pct
## "Rf_pmatch" 12.24 95.77 12.24 95.77
## "_Z12combine_varsN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE" 0.46 3.60 0.48 3.76
## "FUN" 0.06 0.47 0.08 0.63
## "<Anonymous>" 0.02 0.16 0.02 0.16
##
## $by.total
## total.time total.pct self.time self.pct
## "select_vars" 12.78 100.00 0.00 0.00
## "select.data.frame" 12.78 100.00 0.00 0.00
## "select" 12.78 100.00 0.00 0.00
## "combine_vars" 12.70 99.37 0.00 0.00
## "Rf_pmatch" 12.24 95.77 12.24 95.77
## "_Z12combine_varsN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE" 0.48 3.76 0.46 3.60
## "_dplyr_combine_vars" 0.48 3.76 0.00 0.00
## "Rf_NewFrameConfirm" 0.48 3.76 0.00 0.00
## "FUN" 0.08 0.63 0.06 0.47
## "lapply" 0.06 0.47 0.00 0.00
## "map_if" 0.06 0.47 0.00 0.00
## "map" 0.06 0.47 0.00 0.00
## "<Anonymous>" 0.02 0.16 0.02 0.16
## "map_lgl" 0.02 0.16 0.00 0.00
## "map_mold" 0.02 0.16 0.00 0.00
## "R_ReleaseObject" 0.02 0.16 0.00 0.00
## "vapply" 0.02 0.16 0.00 0.00
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 12.78 ## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 17.10
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8
## [6] LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] gprofiler_0.0-1 forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4 readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1
## [10] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.3 haven_1.1.1 lattice_0.20-35 colorspace_1.3-2 htmltools_0.3.6 yaml_2.1.18 rlang_0.2.0 pillar_1.2.1
## [9] withr_2.1.2 foreign_0.8-69 glue_1.2.0 modelr_0.1.1 readxl_1.0.0 bindrcpp_0.2.2 bindr_0.1.1 plyr_1.8.4
## [17] RProtoBuf_0.4.11 munsell_0.4.3 gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2 psych_1.7.8 evaluate_0.10.1 knitr_1.20
## [25] parallel_3.4.4 broom_0.4.3 Rcpp_0.12.16 scales_0.5.0 backports_1.1.2 jsonlite_1.5 mnormt_1.5-5 hms_0.4.2
## [33] digest_0.6.15 pprof_1.0 stringi_1.1.7 grid_3.4.4 rprojroot_1.3-2 cli_1.0.0 tools_3.4.4 bitops_1.0-6
## [41] magrittr_1.5 lazyeval_0.2.1 RCurl_1.95-4.10 crayon_1.3.4 pkgconfig_2.0.1 xml2_1.2.0 profile_1.0 lubridate_1.7.3
## [49] assertthat_0.2.0 rmarkdown_1.9 httr_1.3.1 rstudioapi_0.7 R6_2.2.2 nlme_3.1-131.1 compiler_3.4.4 I had some issues when I started to use gprofiler as @romainfrancois: I checked the code in the https://github.com/tidyverse/dplyr/issues/1023 issue on a windows 10 installation. After replacing library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
set.seed(123)
n <- 1e5
dat <- data_frame(
Group = sample(c("m", "f"), size = n, replace=TRUE),
Process = sample(c("å", "ä", "ö", "Å", "Ä", "Ö"), size = n, replace=TRUE)
)
dat %>%
group_by(Process) %>%
summarise(Number = n()) %>%
spread(Process, Number, fill=0) %>%
summarise_all(funs(sum))
## # A tibble: 1 x 6
## ä Ä å Å ö Ö
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16685. 16713. 16692. 16586. 16621. 16703. sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.8.0 dplyr_0.7.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 knitr_1.20 bindr_0.1.1 magrittr_1.5
## [5] tidyselect_0.2.4 R6_2.2.2 rlang_0.2.0 stringr_1.3.0
## [9] tools_3.4.4 utf8_1.1.3 cli_1.0.0 htmltools_0.3.6
## [13] yaml_2.1.18 assertthat_0.2.0 rprojroot_1.3-2 digest_0.6.15
## [17] tibble_1.4.2 crayon_1.3.4 bindrcpp_0.2 purrr_0.2.4
## [21] glue_1.2.0 evaluate_0.10.1 rmarkdown_1.9 stringi_1.1.7
## [25] compiler_3.4.4 pillar_1.2.1 backports_1.1.2 pkgconfig_2.0.1 |
Thanks for the heads up, it would be interesting to see which test cases caused these random failures to combine profiles. Would you mind filing an issue at the gprofiler repo? Maybe the home-compiled R version is confusing gprofiler. Again, that would be a problem we'd need to look at eventually. In the session info I see that dplyr 0.7.4 is being measured. But tidyselect is used only in the development version of dplyr, not yet in 0.7.4. Maybe it's useful to look for weak performance using tidyselect functions only? |
@krlmlr what @koncina changed in |
This is good to know, but will this also lead to better performance when applied to tidyselect in the upcoming dplyr 0.7.5? When using library(tidyselect)
test_vars <- function(m, n) {
vars <- paste0("v", seq_len(n))
sub_vars <- rlang::syms(sample(vars, m))
system.time(vars_select(vars, !!!sub_vars))
}
test_vars(800, 16000)
#> user system elapsed
#> 1.212 0.010 1.223
test_vars(1600, 16000)
#> user system elapsed
#> 1.142 0.000 1.142
test_vars(3200, 16000)
#> user system elapsed
#> 1.32 0.00 1.32
test_vars(6400, 16000)
#> user system elapsed
#> 1.491 0.000 1.491
test_vars(1600, 32000)
#> user system elapsed
#> 4.213 0.000 4.213
test_vars(3200, 32000)
#> user system elapsed
#> 4.487 0.007 4.494
test_vars(6400, 32000)
#> user system elapsed
#> 4.868 0.000 4.870
test_vars(12800, 32000)
#> user system elapsed
#> 6.253 0.012 6.266
@koncina: Would you mind to gprofile a variant of this code? |
I agree that we should switch to the development version and focus on tidyselect. The only drawback for now is that gprofiler fails to reveal the calls below the Another interesting observation is that performances in the development version are worse than in 0.7.4: In addition to the identification of library(tidyverse)
library(gprofiler)
large_tibble <- matrix(runif(5e6), nrow = 100) %>%
as_tibble()
out_file <- tempfile("gprofiler", fileext = ".out")
start_profiler(out_file)
x <- select(large_tibble, -1)
stop_profiler()
summaryRprof(out_file)[c("by.self", "sampling.time")]
## $by.self
## self.time self.pct total.time total.pct
## "Rf_pmatch" 24.18 71.62 24.18 71.62
## "base::assign" 8.74 25.89 8.74 25.89
## "_Z12inds_combineN4Rcpp6VectorILi16ENS_15PreserveStorageEEENS_6ListOfINS0_ILi13ES1_EEEE" 0.24 0.71 0.52 1.54
## "_ZN7VarList6updateEiN4Rcpp6StringE" 0.24 0.71 0.24 0.71
## "env_bind_impl" 0.08 0.24 9.08 26.90
## "get" 0.08 0.24 0.10 0.30
## "<GC>" 0.08 0.24 0.08 0.24
## "getExportedValue" 0.04 0.12 0.22 0.65
## "asNamespace" 0.04 0.12 0.04 0.12
## "::" 0.02 0.06 0.24 0.71
## "setdiff2" 0.02 0.06 0.02 0.06
##
## $sampling.time
## [1] 33.76 Fixing the code in @krlmlr: The random gprofiler error I mentioned also occurs with the code used in my posts and using the CRAN R binary... I'm still trying to find more informations before opening an issue: sometimes the error occurs more frequently suggesting that something is interfering... |
I'm working on jointprof again and will take a look myself too. |
It looks like the old R implementation is faster than the C++ one. Small variable set:
Large variable set: fake_columns <- paste0("col_", 1:50000)
bench::mark(
tidysel = vars_select(fake_columns, -1),
base = fake_columns[-1],
check = FALSE
)[1:8]
# R
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#> 1 tidysel 1.04s 1.04s 0.958 20.7MB 8.62 1 9
#> 2 base 931.82µs 1.31ms 597. 976.8KB 13.7 304 7
# C++
#> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#> 1 tidysel 1.74s 1.74s 0.574 12.8MB 5.16 1 9
#> 2 base 944.32µs 1.37ms 496. 976.8KB 12.0 248 6 |
(This is with r-lib/rlang#828) |
Hello,
We noticed that
dplyr::select()
is slow when many columns are supplied. After investigations, @koncina and me, we found that this line Rcpp line might be the one to change:https://github.com/tidyverse/tidyselect/blob/master/src/combine_variables.cpp#L28
The declaration of
out_name
as<String>
and notstd::string
is slowing down the process.Historically, it comes from a fix of an encoding issue by @romainfrancois here. Later on, @romainfrancois did a second fix but
String
remains. I must stress out that I know nothing about encoding, so keeping theString
type could be mandatory.After replacing L28 by
std::vector<std::string> out_name;
creating a functioninds_combine_std
, we obtained:Of note, this line is also present in the current
dplyr
code:https://github.com/tidyverse/dplyr/blob/master/src/combine_variables.cpp#L29
We replaced in
dplyr
v0.7.4 L29 withstd::string
and did some benchmark, see how it also improves:Is there any (encoding) reason to keep the
String
type in bothtidyselect
anddplyr
?The text was updated successfully, but these errors were encountered: