
aorsf; accelerated oblique random survival forests #532

Closed · 12 of 13 tasks
bcjaeger opened this issue Apr 28, 2022 · 95 comments

@bcjaeger commented Apr 28, 2022

Date accepted: 2022-09-22

Submitting Author Name: Byron C Jaeger
Submitting Author Github Handle: @bcjaeger
Other Package Authors Github handles: @nmpieyeskey, @sawyerWeld
Repository: https://github.com/bcjaeger/aorsf
Version submitted: 0.0.0.9000
Submission type: Stats
Badge grade: gold
Editor: @tdhock
Reviewers: @chjackson, @mnwright, @jemus42

Due date for @chjackson: 2022-07-29

Due date for @mnwright: 2022-09-21
Due date for @jemus42: 2022-09-21
Archive: TBD
Version accepted: TBD
Language: en

  • Paste the full DESCRIPTION file inside a code block below:
Package: aorsf
Title: Accelerated Oblique Random Survival Forests
Version: 0.0.0.9000
Authors@R: c(
    person(given = "Byron",
           family = "Jaeger",
           role = c("aut", "cre"),
           email = "bjaeger@wakehealth.edu",
           comment = c(ORCID = "0000-0001-7399-2299")),
    person(given = "Nicholas",  family = "Pajewski", role = "ctb"),
    person(given = "Sawyer", family = "Welden", role = "ctb", email = "swelden@wakehealth.edu")
    )
Description: Fit, interpret, and make predictions with oblique random
    survival forests. Oblique decision trees are notoriously slow compared
    to their axis based counterparts, but 'aorsf' runs as fast or faster than 
    axis-based decision tree algorithms for right-censored time-to-event 
    outcomes.
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE, roclets = c ("namespace", "rd", "srr::srr_stats_roclet"))
RoxygenNote: 7.1.2
LinkingTo: 
    Rcpp,
    RcppArmadillo
Imports: 
    table.glue,
    Rcpp,
    data.table
URL: https://github.com/bcjaeger/aorsf,
    https://bcjaeger.github.io/aorsf
BugReports: https://github.com/bcjaeger/aorsf/issues
Depends: 
    R (>= 3.6)
Suggests: 
    survival,
    survivalROC,
    ggplot2,
    testthat (>= 3.0.0),
    knitr,
    rmarkdown,
    glmnet,
    covr,
    units
Config/testthat/edition: 3
VignetteBuilder: knitr

Pre-submission Inquiry

  • A pre-submission inquiry has been approved in issue #525

General Information

  • Who is the target audience and what are scientific applications of this package?

Target audience: people who want to fit and interpret a risk prediction model, i.e., a prediction model for right-censored time-to-event outcomes.

Applications: fit an oblique random survival forest, compute predicted risk at a given time, estimate the importance of individual variables, and compute partial dependence to depict relationships between specific predictors and predicted risk.
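To illustrate, a minimal sketch of this workflow using the package's bundled pbc_orsf data (the pred_horizon value here is arbitrary; the calls mirror the reprexes later in this thread):

library(aorsf)

# fit an oblique random survival forest
fit <- orsf(pbc_orsf, formula = Surv(time, status) ~ . - id)

# predicted risk for each observation at a given prediction horizon
risk <- predict(fit, new_data = pbc_orsf, pred_horizon = 3500)

head(risk)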

  • Paste your responses to our General Standard G1.1 here, describing whether your software is:

    • An improvement on other implementations of similar algorithms in R. Please include hyperlinked references to all other relevant software.

    The obliqueRSF package was the original R package for oblique random survival forests. I wrote it, and it is very slow. I wrote aorsf because I had several ideas for making obliqueRSF faster, chiefly using a partial Newton-Raphson algorithm instead of penalized regression to derive linear combinations of variables in decision nodes. It would have been possible to rewrite obliqueRSF, but it would have been difficult to make the rewrite backwards compatible with the version of obliqueRSF on CRAN.

  • (If applicable) Does your package comply with our guidance around Ethics, Data Privacy and Human Subjects Research?

Not applicable

Badging

Gold

  1. Compliance with a good number of standards beyond those identified as minimally necessary. aorsf complies with over 100 combined standards in the general and ML categories.
  2. Demonstrating excellence in compliance with multiple standards from at least two broad sub-categories. See 1. above
  3. Internal aspects of package structure and design. aorsf uses an optimized routine for partial Newton-Raphson scoring of the Cox proportional hazards model and another optimized routine to compute likelihood ratio tests. Both routines are used heavily when fitting oblique random survival forests, and both produce exactly the same answers as the corresponding functions in the survival package (see tests in aorsf) while running at least twice as fast (thanks to RcppArmadillo).
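For context, "partial" Newton-Raphson scoring means running a small, fixed number of Newton-Raphson iterations rather than iterating to convergence. The idea can be mimicked (this is an analogy, not aorsf's internal routine) by capping iter.max in survival::coxph:

library(survival)

# fully converged Newton-Raphson fit
fit_full <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung)

# 'partial' fit: stop after a single Newton-Raphson iteration
fit_partial <- coxph(Surv(time, status) ~ age + ph.ecog, data = lung,
                     control = coxph.control(iter.max = 1))

# one iteration gets close to, but not exactly, the converged estimates
rbind(full = coef(fit_full), partial = coef(fit_partial))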

Technical checks

I think aorsf is passing autotest and srr_stats_pre_submit(). I am having some issues running these on R 4.2: autotest currently returns NULL, which I understand to be a good thing, and srr_stats_pre_submit() is not able to run (I am not sure why; it was fine before I updated to R 4.2).

Publication options

  • Do you intend for this package to go on CRAN?
  • Do you intend for this package to go on Bioconductor?

Code of conduct

@ropensci-review-bot (Collaborator)

Thanks for submitting to rOpenSci, our editors and @ropensci-review-bot will reply soon. Type @ropensci-review-bot help for help.

@ropensci-review-bot (Collaborator)

🚀

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

@bcjaeger (Author)

🚀

The following problem was found in your submission template:

  • 'statsgrade' variable must be one of [bronze, silver, gold]
    Editors: Please ensure these problems with the submission template are rectified. Package checks have been started regardless.

👋

Just updated to fix this. I'm not sure if reviewers will think that 'gold' is the right statsgrade, but might as well aim high.

@ropensci-review-bot (Collaborator)

Checks for aorsf (v0.0.0.9000)

git hash: c73bb98c

  • ✔️ Package name is available
  • ✔️ has a 'codemeta.json' file.
  • ✔️ has a 'contributing' file.
  • ✔️ uses 'roxygen2'.
  • ✔️ 'DESCRIPTION' has a URL field.
  • ✔️ 'DESCRIPTION' has a BugReports field.
  • ✔️ Package has at least one HTML vignette
  • ✔️ All functions have examples.
  • ✔️ Package has continuous integration checks.
  • ✔️ Package coverage is 97.1%.
  • ✔️ R CMD check found no errors.
  • ✔️ R CMD check found no warnings.

Package License: MIT + file LICENSE


1. rOpenSci Statistical Standards (srr package)

This package is in the following category:

  • Machine Learning

✔️ All applicable standards [v0.1.0] have been documented in this package (102 complied with; 56 N/A standards)

The report of author-reported standards compliance of the package, with links to associated lines of code, can be re-generated locally by running the srr_report() function from within a local clone of the repository.


2. Package Dependencies

Details of Package Dependency Usage

The table below tallies all function calls to all packages ('ncalls'), both internal (r-base + recommended, along with the package itself), and external (imported and suggested packages). 'NA' values indicate packages to which no identified calls to R functions could be found. Note that these results are generated by an automated code-tagging system which may not be entirely accurate.

type package ncalls
internal base 341
internal aorsf 152
internal utils 13
internal stats 11
internal methods 1
imports table.glue 3
imports Rcpp NA
imports data.table NA
suggests glmnet 1
suggests survival NA
suggests survivalROC NA
suggests ggplot2 NA
suggests testthat NA
suggests knitr NA
suggests rmarkdown NA
suggests covr NA
suggests units NA
linking_to Rcpp NA
linking_to RcppArmadillo NA

Tallies of the functions used in each package are given below. Locations of each call within this package may be generated locally by running 's <- pkgstats::pkgstats(<path/to/repo>)' and examining the 'external_calls' table.

base

attr (40), names (35), for (23), length (21), c (20), paste (16), list (14), paste0 (12), mode (10), seq_along (10), vector (10), which (9), as.matrix (8), rep (7), as.integer (6), drop (6), seq (6), switch (6), order (5), match (4), min (4), setdiff (4), colnames (3), inherits (3), ncol (3), Sys.time (3), all (2), any (2), as.factor (2), cbind (2), data.frame (2), grepl (2), lapply (2), levels (2), matrix (2), nchar (2), nrow (2), rev (2), rle (2), row.names (2), sum (2), suppressWarnings (2), try (2), all.vars (1), as.data.frame (1), class (1), deparse (1), formals (1), grep (1), if (1), ifelse (1), intersect (1), is.na (1), max (1), mean (1), print (1), return (1), rownames (1), sapply (1), t (1), typeof (1), unique (1)

aorsf

paste_collapse (8), fctr_info (5), get_fctr_info (5), get_names_x (5), f_oobag_eval (3), get_n_obs (3), get_n_tree (3), get_names_y (3), get_numeric_bounds (3), last_value (3), orsf_fit (3), ref_code (3), unit_info (3), check_var_types (2), get_f_oobag_eval (2), get_importance (2), get_leaf_min_events (2), get_leaf_min_obs (2), get_max_time (2), get_mtry (2), get_n_events (2), get_n_leaves_mean (2), get_oobag_eval_every (2), get_type_oobag_eval (2), get_unit_info (2), is_empty (2), list_init (2), orsf_control_net (2), orsf_pd_summary (2), orsf_train_ (2), select_cols (2), check_arg_bound (1), check_arg_gt (1), check_arg_gteq (1), check_arg_is (1), check_arg_is_integer (1), check_arg_is_valid (1), check_arg_length (1), check_arg_lt (1), check_arg_lteq (1), check_arg_type (1), check_arg_uni (1), check_control_cph (1), check_control_net (1), check_new_data_fctrs (1), check_new_data_names (1), check_new_data_types (1), check_oobag_fun (1), check_orsf_inputs (1), check_pd_inputs (1), check_predict (1), check_units (1), contains_oobag (1), contains_vi (1), f_beta (1), fctr_check (1), fctr_check_levels (1), fctr_id_check (1), get_cph_do_scale (1), get_cph_eps (1), get_cph_iter_max (1), get_cph_method (1), get_cph_pval_max (1), get_f_beta (1), get_n_retry (1), get_n_split (1), get_net_alpha (1), get_net_df_target (1), get_oobag_pred (1), get_oobag_time (1), get_orsf_type (1), get_split_min_events (1), get_split_min_obs (1), get_tree_seeds (1), get_types_x (1), insert_vals (1), is_aorsf (1), is_error (1), is_trained (1), leaf_kaplan_testthat (1), lrt_multi_testthat (1), newtraph_cph_testthat (1), oobag_c_harrell_testthat (1), orsf (1), orsf_control_cph (1), orsf_oob_vi (1), orsf_pd_ (1), orsf_pd_ice (1), orsf_pred_multi (1), orsf_pred_uni (1), orsf_scale_cph (1), orsf_summarize_uni (1), orsf_time_to_train (1), orsf_train (1), orsf_unscale_cph (1), orsf_vi_ (1), x_node_scale_exported (1)

utils

data (13)

stats

formula (4), dt (2), terms (2), family (1), time (1), weights (1)

table.glue

round_spec (1), round_using_magnitude (1), table_value (1)

glmnet

glmnet (1)

methods

new (1)


3. Statistical Properties

This package features some noteworthy statistical properties which may need to be clarified by a handling editor prior to progressing.

Details of statistical properties

The package has:

  • code in C++ (48% in 2 files) and R (52% in 21 files)
  • 1 author
  • 3 vignettes
  • 1 internal data file
  • 3 imported packages
  • 15 exported functions (median 15 lines of code)
  • 216 non-exported functions in R (median 3 lines of code)
  • 48 C++ functions (median 38 lines of code)

Statistical properties of package structure as distributional percentiles in relation to all current CRAN packages
The following terminology is used:

  • loc = "Lines of Code"
  • fn = "function"
  • exp/not_exp = exported / not exported

All parameters are explained as tooltips in the locally-rendered HTML version of this report generated by the checks_to_markdown() function

The final measure (fn_call_network_size) is the total number of calls between functions (in R), or more abstract relationships between code objects in other languages. Values are flagged as "noteworthy" when they lie in the upper or lower 5th percentile.

measure value percentile noteworthy
files_R 21 82.3
files_src 2 79.1
files_vignettes 3 92.4
files_tests 18 95.7
loc_R 2139 85.4
loc_src 1982 75.6
loc_vignettes 342 67.9
loc_tests 1532 91.7
num_vignettes 3 94.2
data_size_total 9034 70.4
data_size_median 9034 78.5
n_fns_r 231 91.6
n_fns_r_exported 15 58.5
n_fns_r_not_exported 216 94.0
n_fns_src 48 66.0
n_fns_per_file_r 7 77.6
n_fns_per_file_src 24 94.8
num_params_per_fn 3 33.6
loc_per_fn_r 3 1.1 TRUE
loc_per_fn_r_exp 15 35.6
loc_per_fn_r_not_exp 3 1.5 TRUE
loc_per_fn_src 38 88.9
rel_whitespace_R 50 96.1 TRUE
rel_whitespace_src 58 89.6
rel_whitespace_vignettes 56 83.5
rel_whitespace_tests 45 96.8 TRUE
doclines_per_fn_exp 46 58.1
doclines_per_fn_not_exp 0 0.0 TRUE
fn_call_network_size 364 93.5

3a. Network visualisation

An interactive network visualisation of calls between objects in the package accompanies the full check report.


4. goodpractice and other checks

Details of goodpractice and other checks

4a. Continuous Integration Badges

R-CMD-check
pkgcheck

GitHub Workflow Results

name conclusion sha date
Commands skipped 069021 2022-04-22
pages build and deployment success c2efe9 2022-04-28
pkgcheck success c73bb9 2022-04-28
pkgdown success c73bb9 2022-04-28
R-CMD-check success c73bb9 2022-04-28
test-coverage success c73bb9 2022-04-28

4b. goodpractice results

R CMD check with rcmdcheck

R CMD check generated the following note:

  1. checking installed package size ... NOTE
    installed size is 6.9Mb
    sub-directories of 1Mb or more:
    libs 6.0Mb

R CMD check generated the following check_fails:

  1. no_import_package_as_a_whole
  2. rcmdcheck_reasonable_installed_size

Test coverage with covr

Package coverage: 97.13

Cyclocomplexity with cyclocomp

The following functions have cyclocomplexity >= 15:

function cyclocomplexity
orsf 33
check_orsf_inputs 28
orsf_pd_ 22
ref_code 17
check_new_data_names 15
check_predict 15

Static code analyses with lintr

lintr found the following 223 potential issues:

message number of times
Avoid using sapply, consider vapply instead, that's type safe 10
Lines should not be more than 80 characters. 182
Use <-, not =, for assignment. 31


Package Versions

package version
pkgstats 0.0.4.30
pkgcheck 0.0.3.11
srr 0.0.1.149


Editor-in-Chief Instructions:

This package is in top shape and may be passed on to a handling editor

@jooolia (Member) commented May 4, 2022

@ropensci-review-bot assign @tdhock as editor

@ropensci-review-bot (Collaborator)

Assigned! @tdhock is now the editor

@tdhock commented May 9, 2022

@ropensci-review-bot seeking reviewers

@ropensci-review-bot (Collaborator)

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

@bcjaeger (Author)

Please add this badge to the README of your package repository:

[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/532_status.svg)](https://github.com/ropensci/software-review/issues/532)

Furthermore, if your package does not have a NEWS.md file yet, please create one to capture the changes made during the review process. See https://devguide.ropensci.org/releasing.html#news

Done! Looking forward to the review.

@bcjaeger (Author)

Hi @tdhock,

Thank you for acting as the editor of this submission. I have recently added a paper.md file to the repository in hopes that aorsf can be eligible for an expedited review at the Journal of Open Source Software after the ropensci review. Can you let me know if any other files are needed or if other actions are needed from me to qualify for an expedited review at JOSS? Thank you!

@tdhock commented May 16, 2022

I'm not sure about the expedited JOSS review. You may ask @noamross or @mpadge

@noamross (Contributor)

@bcjaeger The paper.md file should be sufficient. For JOSS submission, after RO review is complete you should submit to JOSS and link to this review thread in the pre-review JOSS thread. JOSS editors will review the paper.md and can opt to use RO's software reviews rather than finding additional reviewers.

@bcjaeger (Author)

Thanks, @noamross! That makes sense.

@bcjaeger (Author) commented Jun 7, 2022

Hi @tdhock, I see we are still seeking reviewers. Is there anything I can do to help find reviewers for this submission?

@tdhock commented Jun 14, 2022

I have asked a few people to review, but no one has agreed yet. Do you have any ideas for potential reviewers to ask?

@bcjaeger (Author) commented Jun 16, 2022

Thanks! Assuming folks in the rOpenSci circle have been asked already, I'll offer some names (with GitHub usernames) that may not have been asked yet but would be good reviewers for aorsf.

Terry Therneau (therneau) would be a good reviewer - a lot of the C code in aorsf is based on his survival package coxph routine.

Torsten Hothorn (thothorn) would also be a good reviewer - Torsten is the author of the party package and I'd like aorsf to look like the party package in 10 years.

Hannah Frick (hfrick), Emil Hvitfeldt (EmilHvitfeldt), Max Kuhn (topepo), Davis Vaughan (DavisVaughan), and Julia Silge (juliasilge) would all be good reviewers - they are all developers/contributors to the censored package, and I'd like aorsf to contribute to that package.

Raphael Sonabend (RaphaelS1), Andreas Bender (adibender), Michel Lang (mllg), and Patrick Schratz (pat-s) would all be good reviewers - they are developers/contributors to the mlr3proba package, and I'd like aorsf to contribute to that package.

@bcjaeger (Author) commented Jul 2, 2022

Hi @tdhock, I am thinking about reaching out to the folks I listed in my last comment. Before I try to reach them, I was wondering if you had already contacted them? If you have, I will not bother them again with a review request.

@tdhock commented Jul 6, 2022

Hi @bcjaeger, actually I have not asked any of those reviewers yet, sorry! I just got back to the office from traveling. Yes, it would be helpful if you could ask them to review (although it is suggested to use no more than one reviewer suggested by the author; see https://devguide.ropensci.org/editorguide.html?q=reviewers#where-to-look-for-reviewers).

@bcjaeger (Author) commented Jul 6, 2022

Hi @tdhock, thanks! I will reach out to the folks in my earlier post and let you know if anyone is willing to review aorsf. Hope you had a great trip!

@tdhock commented Jul 8, 2022

@ropensci-review-bot add @chjackson to reviewers

@bcjaeger (Author) commented Sep 2, 2022

@mnwright, thank you so much for your review! The suggestions you've made to improve aorsf are incredibly helpful. I've reviewed each of your comments and written up a short description of the actions I will take to incorporate your suggestions. If you think any of the proposed actions should be modified, just let me know; I am happy to coordinate a better approach.

Once I have finished all of the updates, I will post and tag you.

?orsf contains startup messages from randomForestSRC and riskRegression, maybe suppress them?

Whoops, sorry I missed that. I will update with suppressPackageStartupMessages.
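For reference, a minimal sketch of the intended fix in the ?orsf examples (package names taken from the comment above):

suppressPackageStartupMessages({
  library(randomForestSRC)
  library(riskRegression)
})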

The function oobag_c_harrell() is repeated several times in the tests (presumably the copies are identical). Maybe reduce redundancy here?

I like this idea. I think making oobag_c_harrell() an unexported R function in aorsf will resolve this.

In the OOB vignette, the example for the user-supplied OOB function is incorrect (as explained in the text). In my experience, some people copy&paste such things anyway, so I would try to avoid incorrect examples.

Agreed - I think using SurvMetrics::Brier instead of writing a simple (incorrect) Brier score function is a good fix for this.

The changelog (https://bcjaeger.github.io/aorsf/news/index.html) starts with the oldest changes. Isn't that usually the other way around?

Whoops - yes, that needs to be reversed. Thank you! 😅

I think orsf can only be called with the formula parameter. Does this work for high dimensional data, e.g. p>10,000 or is it slow/failing?

That is a great question - I think I need to include a high-dimensional case in the benchmark above to answer it. I'd also like to experiment with adding an x and y input for orsf, similar to ranger, to see how much using x and y rather than formula impacts aorsf's efficiency in high-dimensional settings.

The C++ code is lacking documentation, i.e., it is unclear what the functions in the C++ code are doing. I think all exported Rcpp functions are internal (not directly called by the user), so I won't expect documentation on the same levels as for exported R functions. Still, some documentation would help contributors work with the C++ code.

Agreed, this code needs better documentation. I will go through each function and add comments describing the function's purpose and each of its inputs, including any global variables the function depends on.

All C++ code is in one large file with many global objects. It is common practice in C++ to avoid global objects where possible. Would it make sense to use some OOP instead?

Yes, using OOP would be a much better design and would make generalizations to oblique forests for regression and classification much cleaner. The long term plan for aorsf absolutely includes a more modular re-write of the C++ file. The only reason aorsf didn't have this design from the beginning is that I am pretty inexperienced when it comes to performant C++ code. I am slowly getting there.

What about oblique RF for classification and regression? Do other packages implement the speedups used here? If not, why not have a fast oblique RF package for all outcomes instead of a specific one for survival?

Agreed! I focused on survival initially because I wasn't aware of any R packages for oblique survival trees (the RLT package fits oblique trees for regression and classification, and obliqueRF can fit oblique trees for classification). Developing aorsf into a more general package that fits all three types of trees will be a big undertaking, but should be much more feasible after updating the C++ code as described above.

ranger does not reduce the number of unique time points, and that's what makes it slow and increases the memory usage. The bottleneck is not the sample size but that the CHF is estimated at each unique time point. rfsrc, for example, uses a grid (ntime parameter). See also imbs-hl/ranger#410. I re-ran that experiment with a grid in ranger and then it was fastest (and didn't show the strange behavior shown in the plots above).

Ahh, I understand. Sorry I did not control for this in my initial benchmark. I'd like to update my benchmark code so that ranger can use a grid of timepoints for the CHF as you described. Would you be willing to share your code for this? I can also manually restrict the number of unique event times in the benchmark data if you think that would be a better approach.
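For instance, one way to manually coarsen the benchmark data to at most 100 unique event times (a sketch assuming a data frame named data with a numeric time column; not the exact benchmark code):

# snap each observed time down to a 100-point quantile grid
grid <- unique(quantile(data$time, probs = seq(0, 1, length.out = 100)))
data$time <- grid[findInterval(data$time, grid)]

# randomForestSRC exposes this directly via the ntime argument, e.g.,
# rfsrc(Surv(time, status) ~ ., data = data, ntime = 100)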

Growing only one tree might give misleading results because of computation overhead, e.g. with Rcpp.

Agreed - I think I can revert to using 500 trees assuming the memory issue is fixed by restricting the number of unique event times.

Formula interfaces are quite slow for many features (I guess for all packages?). That might be a problem for p=1000 (see comment above).

Good point - if orsf() crashes due to its formula input with p = 1000, I will prioritize incorporating the x and y inputs into orsf() and reassess.

In summary, this is a great submission, ready for publication. I think most of the issues above are quite easy to fix.

Thank you! I sincerely appreciate the time you invested in reviewing aorsf and your thoughtful feedback. It will be my honor to include you as a package reviewer in the DESCRIPTION file.

@bcjaeger (Author) commented Sep 6, 2022

I am happy to share an updated response to the excellent review from @mnwright:

  1. Examples in orsf() now suppress package startup messages (see here)
  2. oobag_c_harrell() is now an unexported function in aorsf to avoid repeatedly defining it in testing scripts.
  3. SurvMetrics::Brier is now used in place of the incorrect Brier score in the out-of-bag vignette (see here)
  4. NEWS.md now shows updates in the usual order (see here).
  5. C++ functions in src/orsf.cpp have documentation outlining the purpose of the function and its inputs (see here).

I am committed to re-writing the orsf.cpp functions with modular, object oriented design, using the ranger files as a guideline. This will, among other things, make it far more feasible to efficiently fit oblique regression and classification forests with this package. I'll be submitting a grant in the Spring and will hopefully get funding to work through this.

The benchmark of aorsf, ranger, and randomForestSRC has also undergone some updates:

  • 500 trees are now used instead of 1.
  • the number of unique event times is now fixed at 100, 1000, or 10000 by modifying the time values in the data.
  • the number of predictors is now fixed at 100, 1000, or 10000 (it was originally 10, 100, or 1000)
  • To avoid high memory usage from having many leaves with many unique event times, I increase the minimal terminal node size as a function of the sample size, i.e., node_size = n_obs / 10.

With these updates, it is clear that aorsf runs slowly compared to ranger and randomForestSRC in cases where the number of unique event times is 100 or 1000. It is also clear that aorsf does not scale as well to high-dimensional settings - it does best relative to the other packages when the number of predictors is 100 and the number of unique event times is 10000 (i.e., each observation has a unique event time). In my prediction benchmark, the number of predictors is often < 100 and there are few instances of non-unique event times, so I can understand why benchmark results seem to favor aorsf in that comparison.

If you aggregate over all 9 panels (see second figure), ranger has the fastest expected time and all three methods appear to have a linear asymptotic complexity.

[Figures: run time of aorsf, ranger, and randomForestSRC across the nine benchmark panels, and run time aggregated over all nine panels.]

@jemus42 commented Sep 16, 2022

Package Review

  • Briefly describe any working relationship you have (had) with the package authors.
    None.
  • As the reviewer I confirm that there are no conflicts of interest for me to review this work

Documentation

The package includes all the following forms of documentation:

  • A statement of need: clearly stating problems the software is designed to solve and its target audience in README
    • From my understanding the "gist" is "obliqueRSF, but fast", so it inherits its usefulness/audience from obliqueRSF.
    • It clearly has a place between existing methods and implementations.
  • Installation instructions: for the development version of package and any non-standard dependencies in README
  • Vignette(s): demonstrating major functionality that runs successfully locally
  • Function Documentation: for all exported functions
    • The documentation is comprehensive and useful for both novice and experienced users.
  • Examples: (that run successfully locally) for all exported functions
  • Community guidelines: including contribution guidelines in the README or CONTRIBUTING, and DESCRIPTION with URL, BugReports and Maintainer (which may be autogenerated via Authors@R).
    • I found contribution guidelines at .github/CONTRIBUTING.md.

Functionality

  • Installation: Installation succeeds as documented.
  • Functionality: Any functional claims of the software have been confirmed.
  • Performance: Any performance claims of the software have been confirmed.
  • Automated tests: Unit tests cover essential functions of the package and a reasonable range of inputs and conditions. All tests pass on the local machine.
  • Packaging guidelines: The package conforms to the rOpenSci packaging guidelines.

Estimated hours spent reviewing: 4

  • Should the author(s) deem it appropriate, I agree to be acknowledged as a package reviewer ("rev" role) in the package DESCRIPTION file.

Review Comments

The majority of my review is based on the state of the package as of commit 2251066.

  • Overall the package features comprehensive documentation and examples.
  • Function names are consistent and discoverable as well as tab-completable.
  • User-supplied arguments are checked consistently via internal functions defined in R/check.R.
    I would like to suggest the checkmate package for argument checks, which could greatly reduce the complexity of internal functions/checks (see the sketch after this list). I do concede that, now that the work is already done, it's likely not worth re-implementing the checks. Also, the author wants to keep dependencies to a minimum, so I understand the manual implementation of argument checks.
  • The check_* functions mentioned above and internal roxy_* functions for documentation snippets both reduce code duplication and ensure documentation is easier to keep consistent and up to date.
    • Some of the roxy_* documentation helpers may be exchangeable for roxygen2 @importFrom tags or templates etc., but that's merely a gut feeling about idiomatic use of roxygen2.
  • I have not found any instances of @srr tags/comments I would challenge.
  • I could not find any reference to parallel processing, i.e. using multiple CPU cores/threads to speed up training. That's not a direct issue I think, but maybe something to look into in the future? Especially given the application on large datasets.
  • The test suite seems comprehensive and checks for correctness against well-established methods, but I'm wondering if it would be useful to add an automated regression test against obliqueRSF (at another location, to avoid the Suggests:-dependency on it).
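As a minimal sketch of the checkmate suggestion above (hypothetical helper name; not aorsf's actual check code):

library(checkmate)

# one assertion can replace a hand-written type/length/range check
check_n_tree <- function(n_tree) {
  assert_int(n_tree, lower = 1)
}

check_n_tree(500) # passes silently
# check_n_tree(0) # would signal an informative assertion error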

Overall my impression of the package is very positive!

Performance

I have not conducted my own exhaustive experiments and have no additional notes regarding the public benchmark at aorsf-bench.

The author also kindly contributed to mlr3extralearners, which simplifies the algorithm's inclusion in other benchmark experiments.
The initiative to connect to existing frameworks like mlr3 and tidymodels is very much appreciated and allowed me to run a quick (naive) benchmark against ranger and rfsrc, showing comparable performance for C-index and Brier score, as well as significantly shorter run time for low-dimensional tasks (caveat: no tuning, and default parameters were used for all algorithms; it was merely a quick test run of the mlr3 learner).

A very minor nitpicky comment regarding code efficiency:
fctr.R#17 creates an empty vector (c()) and grows it with each character variable. For small datasets this is likely irrelevant, but imagining a scenario with many character variables supplied intentionally or accidentally by the user, it may be more appropriate to use a vapply construct like in this example:

# Example data
data <- as.data.frame(matrix(rnorm(100), ncol = 10))
data$V4 <- as.character(data$V4)
data$V8 <- as.character(data$V8)
data$V10 <- as.character(data$V10)

# Recreating .names argument
.names <- names(data)

# character detection without a for loop and a growing vector
which_char <- vapply(data, is.character, logical(1), USE.NAMES = FALSE)
chrs <- .names[which(which_char)]
chrs
#> [1] "V4"  "V8"  "V10"

Also note that the output will likely be excessive if a user were to accidentally supply a very large dataset with thousands of character variables, but that's a small edge case with little impact I think.

Methods

The inclusion of multiple easy to discover/use variable importance methods is very nice to see!

Partial dependence plots: Due to their known limitations/assumptions, it may be appropriate to
a) note those limitations in the documentation to ensure users unfamiliar with the technique can make an informed choice
b) consider additionally supporting accumulated local effects, a "faster and unbiased alternative" to PDPs, if possible.

Permutation feature importance: This method has drawbacks as well, see Hooker et al., so the previous comment applies, with the exception that I cannot suggest a direct alternative.

I was not familiar with the ANOVA- and negation-based importance measures, and I suspect many users may feel the same way. In that regard I think it would be useful to elaborate on these methods slightly, e.g. in vignettes/aorsf.Rmd where they are mentioned next to the aforementioned methods, or in the help page at ?orsf_vi in the Details section which already outlines the methods.
Reading the paper/preprint in aorsf-bench was helpful, but I think some rough guidance regarding pros/cons in the documentation would be beneficial for users, if one can even be provided.
Choosing interpretability methods is of course not trivial, and I don't expect the author to provide a conclusive heuristic here.

@tdhock commented Sep 16, 2022

Thanks for your review @jemus42! You are right that it is not good practice for large data to grow vectors by over-writing the variable in a for loop. vapply is one option; another is to keep the for loop but write to a list element in each iteration (a constant-time operation) instead of over-writing the variable (a linear-time operation). See https://tdhock.github.io/blog/2017/rbind-inside-outside/ for an example involving data frames, where the difference is noticeable even when there are only a few dozen things to combine.
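A minimal sketch of the two patterns (hypothetical function names; doubling each element purely for illustration):

# growing a vector: each c() call copies the entire result so far,
# so the loop takes quadratic time overall
grow <- function(x) {
  out <- c()
  for (xi in x) out <- c(out, xi * 2)
  out
}

# pre-allocating a list and writing one element per iteration:
# each write is constant time, so the loop is linear overall
fill <- function(x) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) out[[i]] <- x[[i]] * 2
  unlist(out)
}

stopifnot(identical(grow(1:100), fill(1:100)))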

@bcjaeger (Author)

@jemus42, thank you! I will write a more complete response to your feedback and a list of proposed actions in the next few days, but for now I just wanted to say how much I appreciate your review. You found a very nice solution to speed up fctr_check, and I've just pushed a commit to aorsf with your suggested use of vapply. As expected, it's much faster. =]

n <- 500
p <- 1000

data_check <- list()

characters <- letters[1:5]

for( i in seq(p) ) {
 data_check[[i]] <- sample(characters, size = n, replace = T)
 names(data_check)[i] <- paste('x', i, sep = '_')
}

data_check <- as.data.frame(data_check)


c_loop <- function(data, .names){
 
 chrs <- c()
 
 for( .name in .names ){
  
  if(is.character(data[[.name]])){
   chrs <- c(chrs, .name)
  }
  
 }
 
 chrs
 
}

v_apply <- function(data, .names){
 
 chr_index <- which(
  vapply(data[, .names], is.character, logical(1), USE.NAMES = FALSE)
 )
 
 .names[chr_index]
 
}


testthat::expect_equal(
 c_loop(data_check, names(data_check)),
 v_apply(data_check, names(data_check))
)

microbenchmark::microbenchmark(
 c_loop = c_loop(data_check, names(data_check)),
 v_apply = v_apply(data_check, names(data_check))
)
#> Unit: microseconds
#>     expr      min       lq    mean    median       uq       max neval cld
#>   c_loop 8948.500 9234.001 9747.49 9380.5010 9530.401 13457.202   100   b
#>  v_apply  328.001  339.701  484.56  346.6005  385.701  5247.201   100  a

Created on 2022-09-16 by the reprex package (v2.0.1)

@bcjaeger (Author)

@jemus42, here is the more complete response to your comments, with planned action items. If you feel like any of the proposed actions could be improved, or if I didn't address all of your comments in my response, just let me know. Otherwise, I should have updates in aorsf based on this response completed in the next couple of days.

Overall the package features comprehensive documentation and examples. Function names are consistent and discoverable as well as tab-completable.

Thank you! I will continue to prioritize clear documentation and tab-friendly names in future updates.

I would like to suggest the checkmate package for argument checks, which could greatly reduce complexity of internal functions/checks. I do concede that now that the work is already done it's likely not worth re-implementing checks again. Also, the author wants to keep dependencies to a minimum, so I understand the manual implementation of argument checks.

I agree regarding checkmate; it would have made the checks more routine (I found out checkmate existed shortly after I had already written a substantial amount of code in check.R). For now, I'd like to keep the checks as they are, with the acknowledgment that if the checks fail or need a substantial update in the future, I will incorporate checkmate.

The check_* functions mentioned above and internal roxy_* functions for documentation snippets both reduce code duplication and ensure documentation is easier to keep consistent and up to date. Some of the roxy_* documentation helpers may be exchangeable for roxygen2 @importFrom tags or templates etc., but that's merely a gut feeling about idiomatic use of roxygen2.

There probably is something I could do here but I also am not sure how to use roxygen2 optimally for the things that roxy_ functions currently do. I have learned through working with the mlr3extralearners package how to manage bibliographies better, so I think I could remove the roxy_cite code if I went that route. I am not sure if this is a high priority update though, since the roxy_cite functions appear to be working fine.

I have not found any instances of @srr tags/comments I would challenge.

Excellent! Thank you for checking these.

I could not find any reference to parallel processing, i.e. using multiple CPU cores/threads to speed up training. That's not a direct issue I think, but maybe something to look into in the future? Especially given the application on large datasets.

Good point - aorsf doesn't currently support parallel processing. I have initiated a branch (see https://github.com/bcjaeger/aorsf/tree/oop) where I am attempting to re-write aorsf's C++ in an object-oriented framework (thanks to Marvin's suggestion above), and I see parallel processing becoming far more feasible with this framework. I've designated parallel processing as a long-term objective tied to this branch (i.e., I don't think it will be done while this review is open, but it will be done).

The test suite seems comprehensive and checks for correctness against well-established methods, but I'm wondering if it would be useful to add an automated regression test against obliqueRSF (at another location, to avoid the Suggests:-dependency on it).

I like this idea. Would the test be something like graphing the predictions from aorsf in the x-axis against the predictions from obliqueRSF on the y-axis and expecting their correlation to be high? I'd propose to add a section to the aorsf::orsf() examples called "What about obliqueRSF?" and show there that aorsf is faster and provides very similar predictions using the pbc_orsf data. This shouldn't require adding obliqueRSF to Suggests: thanks to roxygen2's .Rmd options for examples.

The author also kindly contributed to mlr-org/mlr3extralearners#233, which simplifies the algorithm's inclusion in other benchmark experiments. The initiative to connect to existing frameworks like mlr3 and tidymodels is very much appreciated and allowed me to run a quick (naive) benchmark against ranger and rfrsc, showing comparable performance regarding C-index and brier score, as well as significantly shorter run-time for low-dimensional tasks (caveat: no tuning and default parameters were used for all algorithms, it was merely a quick test-run of the mlr3 learner).

I'm glad to hear the mlr3 learner ran as expected in the experiment! Would you be willing to share your code? I would like to add a section into the examples of aorsf::orsf() that demonstrates a simple use case of aorsf in mlr3. A pull request would of course be welcome, but that seems like a bigger ask.

A very minor nitpicky comment regarding code efficiency:
fctr.R#17 creates an empty vector (c()) and grows it with each character variable. For small datasets this is likely irrelevant, but imagining a scenario with many character variables supplied intentionally or accidentally by the user, it may be more appropriate to use a vapply construct.

Thank you! I found this very helpful and have implemented your suggestion.

Partial dependence plots: Due to their known limitations/assumptions, it may be appropriate to
a) note those limitations in the documentation to ensure users unfamiliar with the technique can make an informed choice
b) consider additionally supporting accumulated local effects, a "faster and unbiased alternative" to PDPs, if possible.

This is a good point. I will add notes in the documentation about the limitations of PD and will recommend users engage with the iml package to create ALE curves using aorsf. I was happy to write PD functions for aorsf because PD is relatively straightforward to compute, but I think ALE may be a little more complex so I'd prefer to lean on the iml package for ALE curves.
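As a rough sketch of what leaning on iml for ALE curves could look like (assuming iml's Predictor/FeatureEffect interface with a custom predict wrapper; the pred_horizon value is arbitrary):

library(aorsf)
library(iml)

fit <- orsf(pbc_orsf, formula = Surv(time, status) ~ . - id)

# wrap aorsf's predict method so iml receives a plain numeric risk vector
pred_fun <- function(model, newdata) {
  as.numeric(predict(model, new_data = newdata, pred_horizon = 3500))
}

predictor <- Predictor$new(
  model = fit,
  data = pbc_orsf[, setdiff(names(pbc_orsf), c("id", "time", "status"))],
  predict.function = pred_fun
)

# accumulated local effects for serum bilirubin
ale_bili <- FeatureEffect$new(predictor, feature = "bili", method = "ale")
plot(ale_bili)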

Permutation feature importance: This method has drawbacks as well, see Hooker et. al., so the previous comment applies with the exception that I cannot suggest a direct alternative.

I agree - I will add text to my documentation highlighting this and refer to the paper you linked as well as to the disadvantages of PD section in the iml book

I was not familiar with the ANOVA- and negation-based importance measures, and I suspect many users may feel the same way. In that regard I think it would be useful to elaborate on these methods slightly, e.g. in vignettes/aorsf.Rmd where they are mentioned next to the aforementioned methods, or in the help page at ?orsf_vi in the Details section which already outlines the methods. Reading the paper/preprint in aorsf-bench was helpful, but I think some rough guidance regarding pros/cons in the documentation would be beneficial for users, if one can even be provided. Choosing interpretability methods is of course not trivial, and I don't expect the author to provide a conclusive heuristic here.

That is a good point. ANOVA importance was developed and evaluated by Menze et al for oblique classification random forests. Negation importance is developed and evaluated in the ArXiv paper you mentioned. According to the simulation study in the ArXiv paper, all three methods (negation, ANOVA, and permutation) appear to be valid. It seems as if negation has an edge in the general comparison based on our simulation study, but that isn't quite enough evidence for me to recommend negation as a default approach. I will add some notes to the documentation about the origin of these three methods and will add pros/cons for each one.
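For readers unfamiliar with these options, all three methods are available through the package's variable importance interface (a brief sketch; see ?orsf_vi, and note that function names may differ across package versions):

library(aorsf)

fit <- orsf(pbc_orsf, formula = Surv(time, status) ~ . - id)

# negation importance: negate each predictor's coefficients and
# measure the drop in out-of-bag accuracy
vi_negate <- orsf_vi_negate(fit)

# ANOVA importance: proportion of significant p-values for each
# predictor across decision nodes (Menze et al.)
vi_anova <- orsf_vi_anova(fit)

# permutation importance: permute each predictor's values and
# measure the drop in out-of-bag accuracy
vi_permute <- orsf_vi_permute(fit)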

Overall my impression of the package is very positive!

I am thrilled to hear that. =] Thank you for reviewing closely and providing so many useful ideas to improve aorsf!

@jemus42 commented Sep 19, 2022

  • roxy_*-functions: Sure, they serve their purpose and I don't see any reason to forcibly substitute all of them :) Might as well be consistent.
  • Parallel processing: Neat! Always nice to have some future enhancements already in the works.
  • Regression test: Yes, a scatterplot against obliqueRSF predictions would be a good way to check consistency. Wouldn't obliqueRSF then still be a build-time dependency for the docs though? I haven't had a chance to experiment with roxygen2's .Rmd option you're referring to, but I mean it has to be installed at some point to render it? Wouldn't CRAN require it as a dependency then?
  • Interpretability methods: All good, some documentation notes and leaning on existing implementations is fine :)

Regarding the mlr3extralearners test-run I did, my code is fairly bare-bones, and unfortunately due to mlr3proba being in a somewhat ill-defined state after its CRAN removal I'm not sure how to communicate that to users. Technically it works fine, just with an asterisk or two.

library(mlr3verse)
#> Loading required package: mlr3
library(mlr3proba)
library(mlr3extralearners)
library(mlr3viz)
library(aorsf)
library(ggplot2)

# Less logging during training
lgr::get_logger("mlr3")$set_threshold("warn")

# Some example tasks without missing values
tasks <- tsks(c("actg", "rats"))

# Learners with default parameters
learners <- lrns(
  c("surv.ranger", "surv.rfsrc", "surv.aorsf"),
  predict_sets = c("train", "test")
)

# Brier (Graf) score, c-index and training time as measures
measures <- msrs(c("surv.graf", "surv.cindex", "time_train"))

# Benchmark with 10-fold CV
design <- benchmark_grid(
  tasks = tasks,
  learners = learners,
  resamplings = rsmps("cv", folds = 10)
)

# Parallelization (optional)
future::plan("multisession", workers = 5)
benchmark_result <- benchmark(design)

bm_scores <- benchmark_result$score(measures, predict_sets = "test")

bm_scores |>
  dplyr::select(task_id, learner_id, surv.graf, surv.cindex, time_train) |>
  tidyr::pivot_longer(cols = surv.graf:time_train, names_to = "measure") |>
  ggplot(aes(x = learner_id, y = value)) +
  facet_wrap(vars(measure), nrow = 1, scales = "free_y") +
  geom_violin(draw_quantiles = c(.25, .5, .75)) +
  theme_minimal()

bm_summary <- benchmark_result$aggregate(measures)
bm_summary[, c("task_id", "learner_id", "surv.graf", "surv.cindex")]
#>    task_id  learner_id  surv.graf surv.cindex
#> 1:    actg surv.ranger 0.05933293   0.7265903
#> 2:    actg  surv.rfsrc 0.05776890   0.7373370
#> 3:    actg  surv.aorsf 0.05803357   0.7282324
#> 4:    rats surv.ranger 0.07233919   0.7992290
#> 5:    rats  surv.rfsrc 0.08087658   0.7636673
#> 6:    rats  surv.aorsf 0.08338740   0.7874255

Created on 2022-09-19 with reprex v2.0.2

@ropensci-review-bot (Collaborator)

📆 @mnwright you have 2 days left before the due date for your review (2022-09-21).

@ropensci-review-bot (Collaborator)

📆 @jemus42 you have 2 days left before the due date for your review (2022-09-21).

@tdhock commented Sep 19, 2022

@ropensci-review-bot submit review #532 (comment) time 4

@ropensci-review-bot (Collaborator)

Logged review for jemus42 (hours: 4)

@tdhock commented Sep 19, 2022

@ropensci-review-bot submit review #532 (comment) time 3

@ropensci-review-bot (Collaborator)

Logged review for mnwright (hours: 3)

@bcjaeger (Author)

Regression test: Yes, a scatterplot against obliqueRSF predictions would be a good way to check consistency. Wouldn't obliqueRSF then still be a build-time dependency for the docs though? I haven't had a chance to experiment with roxygen2's .Rmd option you're referring to, but I mean it has to be installed at some point to render it? Wouldn't CRAN require it as a dependency then?

The @includeRmd option in roxygen2 seems to be a valid backdoor to include packages in your examples without making them dependencies or suggested packages. I think this is because the .Rmd files are knitted prior to sending the package out, so the code is only run locally.
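A sketch of the mechanism (hypothetical file path; the .Rmd is knitted locally when roxygen2 builds the documentation, so packages used inside it never enter DESCRIPTION):

#' Oblique random survival forest
#'
#' @includeRmd man/rmd/orsf-examples.Rmd
#'
#' @export
orsf <- function(data, formula, ...) {
  # function body
}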

I am happy to share some updates!

  • I've added the mlr3 example to my examples in orsf.R (see here).

  • I've added documentation on limitations of partial dependence (see here and here).

  • I have placed the regression test for obliqueRSF into a testing file, and plan to run the test locally to prevent adding obliqueRSF as a suggested package. I am a little uncertain regarding the minimal correlation that would be acceptable, but I feel comfortable with 0.97 (i.e., the current correlation from the reprex below).

# Similar to obliqueRSF?
suppressPackageStartupMessages({
 library(aorsf)
 library(obliqueRSF)
})

set.seed(50)

fit_aorsf <- orsf(pbc_orsf, 
                  formula = Surv(time, status) ~ . - id,
                  n_tree = 100)
fit_obliqueRSF <- ORSF(pbc_orsf, ntree = 100, verbose = FALSE)


risk_aorsf <- predict(fit_aorsf, new_data = pbc_orsf, pred_horizon = 3500)
risk_obliqueRSF <- 1-predict(fit_obliqueRSF, newdata = pbc_orsf, times = 3500)

cor(risk_obliqueRSF, risk_aorsf)
#>           [,1]
#> [1,] 0.9747177
plot(risk_obliqueRSF, risk_aorsf)

Created on 2022-09-20 by the reprex package (v2.0.1)

@jemus42, do you feel these updates (in addition to the earlier update using vapply instead of c()) have successfully incorporated the feedback from your review into aorsf?

@jemus42 commented Sep 21, 2022

Yes, I have nothing more to add I think :)

Regarding the regression test, I don't think there's a meaningful cutoff one could define, but if, for example, you were to introduce a bug in the aorsf C++ code that resulted in nonsense predictions, you would most likely immediately recognize them in that plot, and that's the main appeal :)

@bcjaeger (Author)

Excellent! Totally agree with you on the regression test. Thanks again for the suggestion!

@tdhock commented Sep 22, 2022

Since you have sufficiently addressed the reviewer comments, this package can now be accepted.

@tdhock commented Sep 22, 2022

@ropensci-review-bot approve aorsf

@ropensci-review-bot (Collaborator)

Approved! Thanks @bcjaeger for submitting and @chjackson, @mnwright, @jemus42 for your reviews! 😁

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You will need to enable two-factor authentication for your GitHub account.
    This invitation will expire after one week. If that happens, write a comment @ropensci-review-bot invite me to ropensci/<package-name>, which will re-send an invitation.
  • After transfer write a comment @ropensci-review-bot finalize transfer of <package-name> where <package-name> is the repo/package name. This will give you admin access back.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • Delete your current code of conduct file if you had one since rOpenSci's default one will apply, see https://devguide.ropensci.org/collaboration.html#coc-file
  • If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar, https://github.com/ropensci/foobar
  • Fix any links in badges for CI and coverage to point to the new repository URL.
  • Increment the package version to reflect the changes you made during review. In NEWS.md, add a heading for the new version and one bullet for each user-facing change, and each developer-facing change that you think is relevant.
  • We're starting to roll out software metadata files to all rOpenSci packages via the Codemeta initiative; see https://docs.ropensci.org/codemetar/ for how to include it in your package. After installing the package, it should be as easy as running codemetar::write_codemeta() in the root of your package.
  • You can add this installation method to your package README install.packages("<package-name>", repos = "https://ropensci.r-universe.dev") thanks to R-universe.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent).

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @ropensci/blog-editors in your reply. They will get in touch about timing and can answer any questions.

We maintain an online book with our best practices and tips. This chapter starts the third section, which gives guidance for after onboarding (with advice on releases, package marketing, GitHub grooming); the guide also features CRAN gotchas. Please tell us what could be improved.

Last but not least, you can volunteer as a reviewer via filling a short form.

@bcjaeger (Author)

@ropensci-review-bot finalize transfer of bcjaeger/aorsf

@ropensci-review-bot (Collaborator)

Can't find repository ropensci/bcjaeger/aorsf, have you forgotten to transfer it first?

@bcjaeger (Author)

@ropensci-review-bot finalize transfer of aorsf

@ropensci-review-bot (Collaborator)

Transfer completed.
The aorsf team is now owner of the repository and the author has been invited to the team

@bcjaeger (Author)

Thanks everyone for your help improving aorsf! =]
