Execution time of bind_rows() #1396

Closed
mschubert opened this issue Sep 9, 2015 · 30 comments

@mschubert

I am trying to bind together a large number of rows, for which I would assume that the time used should be approximately O(n).

With bind_rows and N individual (1-row, 6-column) data.frame objects, this is what I see:

14,500 rows: 1 second
75,000 rows: 23 seconds
145,000 rows: 90 seconds
750,000 rows: 6000 seconds
1,450,000 rows: not complete after 24 hours

Measurements are taken loading a data.frame row list into memory, and then using system.time(bind_rows(data)).

Is there any reason performance decreases that much with a higher number of rows?

I guess that Rcpp should be able to bind 1M rows in a couple of minutes.

@hadley
Member

hadley commented Sep 9, 2015

Please provide a reproducible example. It matters what's in your columns.

Are you using dplyr 0.4.3?

@mschubert
Author

Very simple example:

library(dplyr)
for (n in c(1e4, 1e5, 1e6)) {
    ll = lapply(1:n, function(x) as.list(setNames(runif(5), letters[1:5])))
    print(n)
    print(system.time(bind_rows(ll)))
}

[1] 10000
user system elapsed
0.823 0.004 0.831
[1] 1e+05
user system elapsed
101.075 0.046 101.560
[1] 1e+06
[still running, will update]

Between 1e4 and 1e5 rows, the time difference is about 100 x.

_edit_ 0.4.3 (could not compile the git HEAD, for whatever reason):

[1] 10000
user system elapsed
0.932 0.011 0.962
[1] 1e+05
user system elapsed
131.910 0.510 134.963

@hadley
Member

hadley commented Sep 9, 2015

Thanks, no need to also supply the 1e6 time.

@romainfrancois romainfrancois self-assigned this Sep 10, 2015
@romainfrancois
Member

I get this with the devel version:

> for (n in c(1e4, 1e5, 1e6)) {
+     ll = lapply(1:n, function(x) as.list(setNames(runif(5), letters[1:5])))
+     print(n)
+     print(system.time(bind_rows(ll)))
+ }
[1] 10000
   user  system elapsed
  0.171   0.004   0.175
[1] 1e+05
   user  system elapsed
  1.663   0.002   1.665
[1] 1e+06
   user  system elapsed
 17.847   0.069  17.915

which looks perfectly linear. Can somebody else double-check, @hadley @kevinushey?

@romainfrancois
Member

Perhaps @mschubert's computer struggles with memory and does a lot of GCs. Might be worthwhile running it through valgrind to make sure we have not forgotten to release objects to the GC.

@mschubert
Author

It's not memory, that was running on a cluster with 10G requested just for the test.

Now I tested it on the following systems with the same symptoms:

Gentoo Prefix, R 3.2.0 & dplyr_0.4.3.9000

R version 3.2.0 (2015-04-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3.9000 modules_0.8.1   

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.1       assertthat_0.1 parallel_3.2.0 tools_3.2.0   
[6] DBI_0.3.1      Rcpp_0.12.1

Archlinux, R 3.2.1 & dplyr_0.4.3.9000

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Arch Linux

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3.9000

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.1       assertthat_0.1 parallel_3.2.1 tools_3.2.1   
[6] DBI_0.3.1      Rcpp_0.12.1

And for those it worked fine:

OS-X, R 3.2.1 & dplyr_0.4.3

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.3

loaded via a namespace (and not attached):
[1] magrittr_1.5   R6_2.1.1       assertthat_0.1 parallel_3.2.1 DBI_0.3.1      tools_3.2.1
    Rcpp_0.12.0   

@kevinushey
Contributor

Everything looks fine on my machine:

library(dplyr)
sizes <- seq(1E3, 1E5, by = 1E3)
times <- lapply(sizes, function(n) {
  ll = lapply(1:n, function(x) as.list(setNames(runif(5), letters[1:5])))
  print(n)
  (system.time(bind_rows(ll)))
})

plot(unlist(lapply(times, "[[", 1)), type = "b")

giving

[image: plot of the user times from the code above, one point per size]

which is linear aside from gc spikes.

> devtools::session_info()
Session info -----------------------------------------------------------------------------------------------------------------
 setting  value                                      
 version  R version 3.2.2 Patched (2015-09-09 r69342)
 system   x86_64, darwin13.4.0                       
 ui       RStudio (0.99.484)                         
 language (EN)                                       
 collate  en_US.UTF-8                                
 tz       America/Los_Angeles                        

Packages ---------------------------------------------------------------------------------------------------------------------
 package       * version    date       source                            
 assertthat      0.1        2013-12-06 CRAN (R 3.2.0)                    
 BiocInstaller * 1.18.4     2015-09-03 Bioconductor                      
 crayon          1.3.1      2015-07-13 CRAN (R 3.2.0)                    
 curl            0.9.3      2015-08-25 CRAN (R 3.2.1)                    
 DBI             0.3.1      2014-09-24 CRAN (R 3.2.0)                    
 devtools      * 1.8.0      2015-05-09 CRAN (R 3.2.0)                    
 digest          0.6.8      2014-12-31 CRAN (R 3.2.0)                    
 dplyr         * 0.4.3.9000 2015-09-15 Github (hadley/dplyr@c0f64cb)     
 git2r           0.11.0     2015-08-12 CRAN (R 3.2.0)                    
 knitr         * 1.11       2015-08-14 CRAN (R 3.2.1)                    
 magrittr        1.5        2014-11-22 CRAN (R 3.2.0)                    
 memoise         0.2.1      2014-04-22 CRAN (R 3.2.0)                    
 R6              2.1.1      2015-08-19 CRAN (R 3.2.0)                    
 Rcpp            0.12.1     2015-09-10 CRAN (R 3.2.2)                    
 rversions       1.0.2      2015-07-13 CRAN (R 3.2.0)                    
 testthat      * 0.10.0     2015-05-22 CRAN (R 3.2.0)                    
 xml2            0.1.2      2015-09-01 CRAN (R 3.2.0)    

@mschubert
Author

I got a further confirmation on an independent machine running Arch (@deeenes) with the same issues and two Macs (@luzgaral, @akalamara) without.

@kevinushey
Contributor

I'll try it out on my Ubuntu VM to see if I can reproduce. What compilers + flags are you using in each case, just to check? If you have time, does it make a difference whether you compile with clang versus gcc?

@mschubert
Author

I'll check it tonight, will update here and ping you.

@kevinushey
Contributor

I can repro on my Ubuntu VM (Ubuntu 14.04 64bit, with gcc 4.8.2); also with clang-3.6.

Maybe some weird performance edge-case when using libstdc++ versus libc++? It seems surprising that performance would be linear on OS X but quadratic on linux.

@romainfrancois
Member

One potential confounding factor is the version of Rcpp, as things appear to be fine with 0.12.0 but problematic with 0.12.1.

I know for example that String in Rcpp now gets systematic protection. Not sure this is relevant.

@mschubert
Author

@kevinushey had 0.12.1 on his Mac and it worked fine.

@romainfrancois
Member

Right. So glad we can rule that out. That's a weird one.

@mschubert
Author

My compile args are (R the same, with addition of -fopenmp; all gcc=5.2.0):

CPPFLAGS="-D_FORTIFY_SOURCE=2"
CFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
CXXFLAGS="-march=x86-64 -mtune=generic -O2 -pipe -fstack-protector --param=ssp-buffer-size=4"
LDFLAGS="-Wl,-O1,--sort-common,--as-needed,-z,relro"

The package dplyr.so links the following libraries (ldd dplyr.so):

        linux-vdso.so.1 (0x00007ffd6e5d7000)
        libR.so => /usr/lib/R/lib/libR.so (0x00007f8435239000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f8434eb7000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f8434bb9000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f84349a2000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f84345fe000)
        libblas.so => /usr/lib/libblas.so (0x00007f8433884000)
        libreadline.so.6 => /usr/lib/libreadline.so.6 (0x00007f8433638000)
        libpcre.so.1 => /usr/lib/libpcre.so.1 (0x00007f84333c8000)
        liblzma.so.5 => /usr/lib/liblzma.so.5 (0x00007f84331a2000)
        libbz2.so.1.0 => /usr/lib/libbz2.so.1.0 (0x00007f8432f91000)
        libz.so.1 => /usr/lib/libz.so.1 (0x00007f8432d7b000)
        librt.so.1 => /usr/lib/librt.so.1 (0x00007f8432b73000)
        libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f843296e000)
        libgomp.so.1 => /usr/lib/libgomp.so.1 (0x00007f843274c000)
        libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f843252f000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x0000560900506000)
        libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007f8432204000)
        libncursesw.so.5 => /usr/lib/libncursesw.so.5 (0x00007f8431f9f000)
        libquadmath.so.0 => /usr/lib/../lib/libquadmath.so.0 (0x00007f8431d5f000)

Tested also with R/dplyr compiled using clang=3.6, same issues.

@kevinushey
Contributor

I think we'll need to see output from a sampling profiler to see where most of the time is being spent when running bind_rows(). My only hypothesis thus far is that it's related to the use of libstdc++ versus libc++, but that's also somewhat unlikely.

@mschubert
Author

Ok, next set of tests.

Using gperftools, linking dplyr with -lprofiler, and running:

CPUPROFILE=samples.log R --no-save --no-restore < test.r
pprof --text /usr/lib/R/lib/libR.so samples.log

where test.r uses 1e3 and 5e3 rows, gives the following output:

Using local file /usr/lib/R/lib/libR.so.
Using local file samples.log.
Total: 35 samples
       8  22.9%  22.9%        8  22.9% R_BadLongVector
       8  22.9%  45.7%        8  22.9% Rf_pmatch
       3   8.6%  54.3%        3   8.6% CONS_NR
       2   5.7%  60.0%        2   5.7% Rf_allocVector3
       2   5.7%  65.7%        2   5.7% __strcmp_sse2_unaligned
       2   5.7%  71.4%        2   5.7% do_Rprof
       1   2.9%  74.3%        1   2.9% R_do_slot_assign
       1   2.9%  77.1%        1   2.9% R_init_stats

If I try with 1e4, 1e5 rows I get the error:

Check failed: depth > 0: ProfileData::Add depth <= 0
Aborted (core dumped)

@mschubert
Author

After gperftools fixed gperftools/gperftools#721, I was able to run 1e5 rows (114 seconds total) and the profiler says the following:

Update: recompiled R with debugging symbols and -lprofiler, output below. The columns are

  • number of frames where function on top of stack
  • % of frames where function on top stack
  • total % of time for all functions listed (not relevant)
  • number of frames where function somewhere in stack
  • % of the above
   26636  96.0%  96.0%    26636  96.0% RecursiveRelease
       0   0.0%  96.0%     2249   8.1% do_dotcall
       0   0.0%  96.0%     2249   8.1% dplyr_rbind_all
       0   0.0%  96.0%     2249   8.1% rbind_all
       4   0.0%  96.0%     2234   8.1% Rcpp::Vector rbind__impl
       3   0.0%  96.0%     2156   7.8% boost::detail::shared_count::~shared_count (inline)
       0   0.0%  96.0%     2156   7.8% boost::shared_ptr::~shared_ptr (inline)
       0   0.0%  96.0%     2156   7.8% dplyr::DataFrameAble::~DataFrameAble (inline)
       0   0.0%  96.0%     2156   7.8% std::_Destroy (inline)
       0   0.0%  96.0%     2156   7.8% std::_Destroy_aux::__destroy (inline)
       0   0.0%  96.0%     2156   7.8% std::vector::~vector
       2   0.0%  96.0%     2153   7.8% boost::detail::sp_counted_base::release (inline)
       0   0.0%  96.0%      190   0.7% R_gc_internal

The main problem seems to be a call to RecursiveRelease (for which the profiler may miss stack frames because of the high recursion depth) that is not attributed to dplyr_rbind_all(); that is, dplyr_rbind_all() is not in the call stack for the majority of the time spent in that function. On the other hand, within dplyr_rbind_all() itself, the call that takes almost all of the time is also RecursiveRelease.

Since at least a part of the delay includes boost::~shared_ptr, can you tell me which version of boost you were running on @kevinushey?

@kevinushey
Contributor

I'm pretty sure it was Boost 1.58, but doesn't dplyr just use the version of BH found on the system?

It definitely seems like some weirdness with garbage collection is going on; using the 'stop it randomly' tactic, I see stack frames of size ~10000 (!!) so perhaps a huge number of temporary R objects are being created, protected, and unprotected in the call to bind_rows?


For posterity, the stack trace of one sample stopped.

< ... repeats up to first stack frame ... >
#4285 0x00007ffff7960218 in ?? () from /usr/lib/R/lib/libR.so
#4286 0x00007ffff7960218 in ?? () from /usr/lib/R/lib/libR.so
#4287 0x00007ffff7960218 in ?? () from /usr/lib/R/lib/libR.so
#4288 0x00007ffff7960bb0 in R_ReleaseObject () from /usr/lib/R/lib/libR.so
#4289 0x00007fffebd7d14d in Rcpp_ReleaseObject (x=<optimized out>)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:93
#4290 ~PreserveStorage (this=0x47d28f8, __in_chrg=<optimized out>)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/storage/PreserveStorage.h:13
#4291 ~Vector (this=0x47d28f8, __in_chrg=<optimized out>)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/vector/Vector.h:30
#4292 ~DataFrameAble_List (this=0x47d28f0, __in_chrg=<optimized out>)
    at ../inst/include/dplyr/DataFrameAble.h:55
#4293 dplyr::DataFrameAble_List::~DataFrameAble_List (this=0x47d28f0, 
    __in_chrg=<optimized out>) at ../inst/include/dplyr/DataFrameAble.h:55
#4294 0x00007fffebd80646 in release (this=0x47d2920)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/BH/include/boost/smart_ptr/detail/sp_counted_base_gcc_x86.hpp:146
#4295 ~shared_count (this=0x99a7f78, __in_chrg=<optimized out>)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/BH/include/boost/smart_ptr/detail/shared_count.hpp:443
#4296 ~shared_ptr (this=0x99a7f70, __in_chrg=<optimized out>)
    at /home/kevin/R/x86_64-pc-linux-gnu-library/3.2/BH/include/boost/smart_ptr/shared_ptr.hpp:323
#4297 ~DataFrameAble (this=0x99a7f70, __in_chrg=<optimized out>)
    at ../inst/include/dplyr/DataFrameAble.h:89
#4298 _Destroy<dplyr::DataFrameAble> (__pointer=0x99a7f70)
    at /usr/include/c++/4.8/bits/stl_construct.h:93
#4299 __destroy<dplyr::DataFrameAble*> (__last=<optimized out>, 
    __first=0x99a7f70) at /usr/include/c++/4.8/bits/stl_construct.h:103
#4300 _Destroy<dplyr::DataFrameAble*> (__last=<optimized out>, 
    __first=<optimized out>) at /usr/include/c++/4.8/bits/stl_construct.h:126
#4301 _Destroy<dplyr::DataFrameAble*, dplyr::DataFrameAble> (__last=0x99dc5e0, 
    __first=0x9999f60) at /usr/include/c++/4.8/bits/stl_construct.h:151
#4302 std::vector<dplyr::DataFrameAble, std::allocator<dplyr::DataFrameAble> >::~vector (this=0x7fffffff8ed0, __in_chrg=<optimized out>)
    at /usr/include/c++/4.8/bits/stl_vector.h:415
#4303 0x00007fffebd852cd in rbind__impl<Rcpp::Vector<19, Rcpp::PreserveStorage> > (dots=..., id=id@entry=0x604cf8) at bind.cpp:118
#4304 0x00007fffebd7c385 in rbind_all (dots=..., id=id@entry=0x604cf8)
    at bind.cpp:124
#4305 0x00007fffebd4cb51 in dplyr_rbind_all (dotsSEXP=0x8794bd0, 
    idSEXP=0x604cf8) at RcppExports.cpp:86
#4306 0x00007ffff78f7568 in ?? () from /usr/lib/R/lib/libR.so
#4307 0x00007ffff79369db in Rf_eval () from /usr/lib/R/lib/libR.so
#4308 0x00007ffff7938b60 in ?? () from /usr/lib/R/lib/libR.so
#4309 0x00007ffff79367e3 in Rf_eval () from /usr/lib/R/lib/libR.so
#4310 0x00007ffff7937b6f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#4311 0x00007ffff79365bf in Rf_eval () from /usr/lib/R/lib/libR.so
#4312 0x00007ffff7938b60 in ?? () from /usr/lib/R/lib/libR.so
#4313 0x00007ffff79367e3 in Rf_eval () from /usr/lib/R/lib/libR.so
#4314 0x00007ffff7937b6f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#4315 0x00007ffff79365bf in Rf_eval () from /usr/lib/R/lib/libR.so
#4316 0x00007ffff7936cb8 in ?? () from /usr/lib/R/lib/libR.so
#4317 0x00007ffff7934c57 in ?? () from /usr/lib/R/lib/libR.so
#4318 0x00007ffff7936440 in Rf_eval () from /usr/lib/R/lib/libR.so
#4319 0x00007ffff7937b6f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#4320 0x00007ffff79365bf in Rf_eval () from /usr/lib/R/lib/libR.so
#4321 0x00007ffff7939dc2 in ?? () from /usr/lib/R/lib/libR.so
#4322 0x00007ffff79368ac in Rf_eval () from /usr/lib/R/lib/libR.so
#4323 0x00007ffff7938b60 in ?? () from /usr/lib/R/lib/libR.so
#4324 0x00007ffff79367e3 in Rf_eval () from /usr/lib/R/lib/libR.so
#4325 0x00007ffff7937b6f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#4326 0x00007ffff793a11c in R_forceAndCall () from /usr/lib/R/lib/libR.so
#4327 0x00007ffff78952ea in ?? () from /usr/lib/R/lib/libR.so
#4328 0x00007ffff7968a59 in ?? () from /usr/lib/R/lib/libR.so
#4329 0x00007ffff7929978 in ?? () from /usr/lib/R/lib/libR.so
#4330 0x00007ffff7936440 in Rf_eval () from /usr/lib/R/lib/libR.so
#4331 0x00007ffff7937b6f in Rf_applyClosure () from /usr/lib/R/lib/libR.so
#4332 0x00007ffff79365bf in Rf_eval () from /usr/lib/R/lib/libR.so
#4333 0x00007ffff7939c3e in ?? () from /usr/lib/R/lib/libR.so
#4334 0x00007ffff79367e3 in Rf_eval () from /usr/lib/R/lib/libR.so
#4335 0x00007ffff795da32 in Rf_ReplIteration () from /usr/lib/R/lib/libR.so
#4336 0x00007ffff795dd81 in ?? () from /usr/lib/R/lib/libR.so
#4337 0x00007ffff795de34 in run_Rmainloop () from /usr/lib/R/lib/libR.so
#4338 0x00000000004007eb in main ()
#4339 0x00007ffff7261ec5 in __libc_start_main (main=0x4007d0 <main>, argc=4, 
    argv=0x7fffffffe238, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffe228) at libc-start.c:287
#4340 0x000000000040081b in _start ()

@mschubert
Author

After some more manual debugging, I can confirm that almost all of the time is spent destructing DataFrameAble objects in the chunks vector in bind_rows().

Taken together with the profiling above, something is going seriously wrong with the boost::shared_ptrs.

@kevinushey
Contributor

Perhaps the issue is that objects are getting released with a bad ordering?

If I understand correctly, we have a large vector of R objects. When a new object is added, it's protected with R_PreserveObject (pushed to top of R_PreciousList); later, when the parent vector is destructed, the objects are released in the same order (with R_ReleaseObject). This performs a recursive search into the R_PreciousList each time and that implies each release has to delve through the whole list to find and release that object. This could be avoided if the objects were destructed in reverse order.

@romainfrancois, does that sound plausible? Would it be possible to explicitly release the objects in reverse order (so that we just pop off the top of the PreciousList 'stack' each time, rather than recursing within)? Alternatively, one could presumably just use PROTECT / UNPROTECT, which would then avoid this 'eager' collection of objects.

@kevinushey
Contributor

FWIW, it looks like destruction order of elements in a vector is unspecified so it is in fact likely that behaviour might differ between libc++ and libstdc++ here.

@romainfrancois
Member

ah ok. I actually had something before to control order of destruction, but I ran a few tests and it seemed destruction happened in the reverse order. But I guess not on all implementations. I'll look into it.
Another way would be perhaps not to protect these data frames since they come from the R side anyway, so they are already protected.

@kevinushey
Contributor

Indeed, explicitly popping items off the chunks list with something like:

while (chunks.size()) chunks.pop_back();

fixes this (tested on my Ubuntu VM). There's probably a better / more explicit way of handling this though.

@mschubert
Author

Confirmed, 1e5 rows down from 95 to 1.8 seconds.

@hadley hadley added the bug an unexpected problem or unintended behavior label Oct 21, 2015
@hadley hadley added this to the 0.5 milestone Oct 21, 2015
@romainfrancois
Member

I pushed some code based on @kevinushey's hint, but embedding the logic in the destructor of the DataFrameAbleVector class so that this order of destruction is also used in case of exceptions etc ...

@mschubert can you let me know if this gives you the right performance please.

@mschubert
Author

1e5 rows in 2.8 seconds, 1e6 in 29s - yes, looks linear (but is a bit slower than the solution above, or more load on the node right now)

@romainfrancois
Member

Slower than which solution? If you mean @kevinushey's, that's what I'm doing; the code is just in a destructor rather than at the end of the function.

@mschubert
Author

Yes, that one. That was probably just more load on the node I tested, feel free to ignore if what you changed can't have caused it.

@romainfrancois
Member

Fine. I'll consider this closed then.
