Skip to content

Cpp optimisation - passing by reference, using const, avoiding Rcpp functions in cpp code #44

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

AlbanSagouis
Copy link
Collaborator

Description

This is a general attempt at speeding up parzer by passing objects by reference and now preparing work with @mpadge, decreasing Rcpp functions use in cpp code and moving toward replacing Rcpp by cpp11

@AlbanSagouis AlbanSagouis requested a review from mpadge October 18, 2022 09:00
@AlbanSagouis AlbanSagouis marked this pull request as ready for review October 18, 2022 09:01
tests were not updated and fail, not the good time
This reverts commit e013812.
In addition to Rcpp::List, calls to Rcpp::toString and to the is_negative function using regular expressions were deleted.
The new version of split_decimal_degree is a little faster:
> microbenchmark::microbenchmark(
+   times = 10^6, setup = {x <- 45.235687},
+   new = parzer:::split_decimal_degree(x),
+   old = parzer:::split_decimal_degree_old(x)
+ )
Unit: microseconds
 expr   min    lq     mean median    uq      max neval
  new 4.469 4.797 5.423814  4.920 5.084 56692.83 1e+06
  old 5.125 5.494 6.093964  5.617 5.781 10816.62 1e+06
A Rcpp::as<std::string> in a loop was also deleted.

No speed increase for pz_parse_parts_lon() or pz_parse_parts_lat().

More replacements needed such as in extract_floats_from_string

Tests and checks passed
execution of pz_parse_lat and pz_parse_lon faster by a couple %!

Rcpp objects are still used to build data.frames and to handle warnings and NAs
closes ropensci#43
@@ -55,12 +55,12 @@ BEGIN_RCPP
END_RCPP
}
// pz_parse_lat
Rcpp::NumericVector pz_parse_lat(const Rcpp::CharacterVector& x);
std::vector<float> pz_parse_lat(std::vector<std::string>& x);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlbanSagouis I generally default to <double> rather than <float>, although that may not be necessary here, because the main advantage is the numeric limits of <float> can often accidentally be reached, and cause undefined behaviour. The limits of <double> are much higher (see here and the bottom of this page). The limits in this case are well defined, so that's not likely to be issue, but there is no performance effect anyway, so maybe worth thinking about?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with <float> and used it because we mentioned it orally :s
Sticking to <double> sounds good, I'll make the modifications.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried replacing <float> by <double> but got several test errors, for example:

Failure (test-parse_lon.R:46): parse_lon works: run through test_lons
round(parse_lon(test_lons[i]), 5) (`actual`) not equal to -74.64111 (`expected`).

  `actual`: -2330764414375590775143610959527936
`expected`:                                 -75

So this will be for later on a dedicated branch

@AlbanSagouis
Copy link
Collaborator Author

Ready to merge @mpadge

Issues to solve when switching to cpp11:

  • Rcpp::DataFrame in pz_split_llstr, pz_parse_parts_lat & pz_parse_parts_lon
  • Returning NA values in split_decimal_degree instead of zeros

@AlbanSagouis
Copy link
Collaborator Author

In this branch I kept old and new versions of many functions side by side in scripts (eg pz_parse_lat and pz_parse_lat_old). I used this to benchmark the effects of the changes made to the code.
@mpadge Do we agree that I should delete all the old code before merging?

@mpadge
Copy link
Member

mpadge commented Dec 8, 2022

In this branch I kept old and new versions of many functions side by side in scripts (eg pz_parse_lat and pz_parse_lat_old). I used this to benchmark the effects of the changes made to the code. @mpadge Do we agree that I should delete all the old code before merging?

Not necessarily. More importantly: It'd be good to document the benchmarking somewhere, so the old and new versions should be kept until that has been done. I suggest documenting in an issue, rather than a PR, because it's then easier to find at a later stage. The documentation should also be a full reprex starting with a gert::git_clone() call, and dumping gert::git_commit_id() (there is currently no way to use gert to checkout a specific commit, unfortunately).

The old code should then only be deleted once that's been done. And to make it easier to reproduce the results in the future, I'd suggest merging first to ensure the "old" versions are retained in the protected ("master") branch, rather than in a branch which we might later inadvertendly delete. Does that make sense?

@AlbanSagouis
Copy link
Collaborator Author

In this branch I kept old and new versions of many functions side by side in scripts (eg pz_parse_lat and pz_parse_lat_old). I used this to benchmark the effects of the changes made to the code. @mpadge Do we agree that I should delete all the old code before merging?

Not necessarily. More importantly: It'd be good to document the benchmarking somewhere, so the old and new versions should be kept until that has been done. I suggest documenting in an issue, rather than a PR, because it's then easier to find at a later stage. The documentation should also be a full reprex starting with a gert::git_clone() call, and dumping gert::git_commit_id() (there is currently no way to use gert to checkout a specific commit, unfortunately).

I create the issue #46 where I pasted reproducible code and results documenting the benchmarking. I did not use gert but directly built the package from the commit we are interested in:

remotes::install_github("albansagouis/parzer@2f8ede4c909929aacc491c1d7e9d63e94c7e5312")

microbenchmark::microbenchmark(...

@mpadge do you also think that this is reproducible?

@mpadge
Copy link
Member

mpadge commented Dec 12, 2022

Yep, i was able to reproduce it. Thanks!

AlbanSagouis and others added 4 commits March 11, 2024 19:56
Reformatted the R/parzer-package.R script
Added missing ORCIDs in the DESCRIPTION
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants