Implement performance critical parts with Rcpp #41
Comments
That would be great! Let me do some profiling first to get an idea of how much this might help. |
I went ahead. See this commit to the fork (based on the master branch): https://github.com/ecoRoland2/ALFAM2/commit/d94d9653e3b22d45f92f25d2ab97b76c3cad1c96 For now, I've done no "structural changes". A good performance boost should be possible if |
Great! I am just getting to profiling now (sorry) so can compare. I'll let you know what I see. A matrix should be OK (I think all columns are or could be numeric). We can just go back to a data frame in R. |
I see a doubling in speed for 965 plots and 8133 observations. Here are Original (v2.1.1, eb15e93 on dev) (although I see you forked master, but the differences shouldn't affect speed):
Your fork (d94d965):
Very nice! |
There are also some simple issues I can fix. I'll work on these before coming back to your C++ addition. Or, maybe it makes the most sense to merge in your work with dev first, and then get back to the inefficient R code. Here is one that also doubles the speed.
Dropping a bunch of columns before
There seems to be an additional improvement from switching from a data frame to a matrix here. I'll take a closer look. |
All operations on data.frames are slow. You only get good performance with them if you can use list operations (like Splitting a data.frame (by rows) is always slow and pretty much never necessary. I'd recommend using package data.table and its group-by operation (looks like A combination of package data.table and Rcpp should enable a dramatic performance improvement and package parallel wouldn't be needed anymore. |
Of course, implementing the "group-by" in Rcpp would be even better performance-wise because we would strongly reduce the number of expensive R-function calls. It wouldn't be difficult either. |
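To make the group-by idea concrete, here is a minimal sketch in plain C++ of a single pass over rows that resets per-group state whenever the key changes. It assumes the rows are already sorted so that each group (for example, each plot) is contiguous. The names `grouped_cumsum`, `group`, and `value` are hypothetical and the cumulative sum is only a stand-in for the real per-plot arithmetic; this is not ALFAM2 code. An Rcpp version would presumably take `Rcpp::IntegerVector` and `Rcpp::NumericVector` arguments and carry a `// [[Rcpp::export]]` attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel: cumulative sum of `value` within each group.
// Assumption: rows are sorted so all rows of a group are contiguous.
// `group` and `value` stand in for columns of a numeric matrix from R.
std::vector<double> grouped_cumsum(const std::vector<int>& group,
                                   const std::vector<double>& value) {
    assert(group.size() == value.size());  // one key per row

    std::vector<double> out(value.size());
    double running = 0.0;

    for (std::size_t i = 0; i < value.size(); ++i) {
        // A change in key marks the start of a new group: reset the state.
        if (i == 0 || group[i] != group[i - 1]) {
            running = 0.0;
        }
        running += value[i];
        out[i] = running;
    }
    return out;
}
```

The point of this design is that the whole grouped computation happens in one compiled loop, so no R function is called once per group and no data frame has to be split.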
@ChHaeni has also encouraged me to use data.table. I'm really impressed with it but have a lot of code (including within functions and packages) that uses a symbolic variable (character vector) for column names within an indexing operation. This doesn't work with data.tables of course. I know I can add But I think you two are probably right. |
That would be nice. Let me see what I can do with data.table first. Ideally that will give us a cleaner and already more efficient function to work with. I think I will merge that with the Rcpp work you have already done and then reassess. |
The development version of data.table has a new interface for programming on the language: https://rdatatable.gitlab.io/data.table/articles/datatable-programming.html I expect release to CRAN soonish. |
I've finally learned a bit of C++ and with some trial-and-error managed to move all the grouped stuff into C++. All in d269b90 in Rcpp-dev branch. Few issues to work on:
and there is still a lot of R code that is used to get things ready. |
Data processing for incorporation takes a lot of time.
The 800 ms used from t12 to t20 is for this chunk:
|
287f071 moved some operations out of the loop and includes some other improvements to the incorporation code.
179 ms still for the loop, which could probably be eliminated, but only with a |
So far I have not had any problems from Rcpp stuff. |
Functions that are basically a for loop with some basic arithmetic (like calcEmis) would be very easy to translate to compiled code. Would you like me to start doing that? There would be some development overhead: