-
Notifications
You must be signed in to change notification settings - Fork 12
aorsf - accelerated oblique random survival forest #211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Just wanted to share an update on the rOpenSci review. I did think it was about to be closed but yesterday I was assigned two more reviewers. My bad - it looks like the rOpenSci review will continue for at least another few weeks. I am very excited to see what the new reviewers think and I will do everything I can to incorporate their feedback in a way that is compatible with the version of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! It looks great but there are a few major points atm:
The prediction type "hazard"
in censored refers to the instantaneous hazard, while aorsf calculates the cumulative hazard. We currently don't have a prediction type for that so let's remove that aspect from this PR for now. (We can consider adding this in the future.)
The tidymodels principle is to return NA
if predictions are not possible so all engines in censored need to adhere to that. This is relevant for both predictions with missing values and out-of-range predictions.
There are several aspects where I think the changes required to this PR would be of interest to all users of aorsf so would you consider the following changes to aorsf?
-
An option for how missing values are handled for predictions: For example,
predict.lm()
has an argumentna.action
which determines what happens with missing values. aorsf could still default to an error but in censored we could request that it pads the results withNA
. See also?na.fail
. -
An option to allow
NA
values for out-of-range predictions instead of an error. -
Allowing any order of time points for prediction, and sorting those internally to
orsf()
. -
Allowing the standard flexibility of
Surv()
wrt the values of the event indicator.
I think all of these options would greatly enhance the user-friendliness of aorsf! Again, thanks for the PR already and good luck with the rOpenSci review.
Thanks for the great review! The changes you described will definitely make |
Happy to share updates in
Were you referring to the case where Thank you again for the thoughts you shared - the changes definitely make the API friendlier. I'm happy to make more modifications in suppressPackageStartupMessages({
library(aorsf)
library(tidyverse)
})
# Allowing the standard flexibility of Surv() wrt the values of the
# event indicator.
pbc_lgl <- mutate(pbc_orsf, status = as.logical(status))
pbc_12 <- mutate(pbc_orsf, status = status + 1)
set.seed(329)
fit_standard <- orsf(pbc_orsf, Surv(time, status) ~ . - id)
set.seed(329)
fit_lgl <- orsf(pbc_lgl, formula = Surv(time, status) ~ . - id)
set.seed(329)
fit_12 <- orsf(pbc_12, formula = Surv(time, status) ~ . - id)
all.equal(fit_standard$forest, fit_lgl$forest)
#> [1] TRUE
all.equal(fit_standard$forest, fit_12$forest)
#> [1] TRUE
# An option for how missing values are handled for predictions: For example,
# predict.lm() has an argument na.action which determines what happens with
# missing values. aorsf could still default to an error but in censored we
# could request that it pads the results with NA. See also ?na.fail.
pbc_first10 <- pbc_na <- slice(pbc_orsf, 1:10)
pbc_na$age[c(1,5,10)] <- NA
# consistent w/tidymodels
predict(fit_standard, pbc_na, na_action = 'pass')
#> [,1]
#> [1,] NA
#> [2,] 0.10615858
#> [3,] 0.77370799
#> [4,] 0.43724275
#> [5,] NA
#> [6,] 0.08677136
#> [7,] 0.06302206
#> [8,] 0.25354362
#> [9,] 0.96007034
#> [10,] NA
# consistent w/base R
predict(fit_standard, pbc_na, na_action = 'omit')
#> [,1]
#> [1,] 0.10615858
#> [2,] 0.77370799
#> [3,] 0.43724275
#> [4,] 0.08677136
#> [5,] 0.06302206
#> [6,] 0.25354362
#> [7,] 0.96007034
# An option to allow NA values for out-of-range predictions
# instead of an error.
## I allow out of range predictions but, instead of returning NA when
## boundary_checks is FALSE, return prediction at the max observed time
## (notice column 2 (t = 10,000) is the same as 3 (t = 100,000))
## I did this to be consistent with the way I implemented boundary_checks
## in the orsf_pd functions and also just in case a user really wants
## to make predictions this way.
predict(fit_standard,
new_data = pbc_first10,
boundary_checks = FALSE,
pred_horizon = c(1000, 10000, 100000))
#> [,1] [,2] [,3]
#> [1,] 0.91966955 0.9870598 0.9870598
#> [2,] 0.04281353 0.5127637 0.5127637
#> [3,] 0.40236198 0.9077745 0.9077745
#> [4,] 0.22127622 0.8731020 0.8731020
#> [5,] 0.09070765 0.7597322 0.7597322
#> [6,] 0.02118349 0.4610088 0.4610088
#> [7,] 0.02227554 0.5664215 0.5664215
#> [8,] 0.09007197 0.8394871 0.8394871
#> [9,] 0.88790002 0.9849913 0.9849913
#> [10,] 0.04862403 0.6593526 0.6593526
# Allowing any order of time points for prediction, and sorting those
# internally to orsf().
predict(fit_standard,
new_data = pbc_first10,
pred_horizon = c(1000, 500, 100))
#> [,1] [,2] [,3]
#> [1,] 0.91966955 0.785869003 0.142670246
#> [2,] 0.04281353 0.010758594 0.000000000
#> [3,] 0.40236198 0.135624711 0.016068543
#> [4,] 0.22127622 0.095856957 0.014200541
#> [5,] 0.09070765 0.027139137 0.001000000
#> [6,] 0.02118349 0.005302408 0.000000000
#> [7,] 0.02227554 0.010475093 0.000000000
#> [8,] 0.09007197 0.025564204 0.001844517
#> [9,] 0.88790002 0.744081557 0.270957115
#> [10,] 0.04862403 0.017911853 0.001200000 Created on 2022-09-04 by the reprex package (v2.0.1) |
- use data with missings for testing - include out-of-range predictions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you, that looks great! Thanks for the changes to aorsf. I've updated the tests so that they make use of these new features, i.e. actually try to predict on data with missing values and for time points outside of the observed range.
Re out-of-range predictions: yes, that is what I was referring to! The change in aorsf
seems fine to me, we should just document that. Documentation of engine-specific detail like this one goes into parsnip. How to do that is documented here: https://github.com/tidymodels/parsnip/blob/main/inst/README-DOCS.md.
An example are the partykit engine docs for rand_forest()
- the top-level file that include the md https://github.com/tidymodels/parsnip/blob/main/R/rand_forest_partykit.R
- the underlying Rmd used to generate the md https://github.com/tidymodels/parsnip/blob/main/man/rmd/rand_forest_partykit.Rmd
Would you open a PR on parsnip for that?
Otherwise we're pretty much done here: other than the documentation, what's left to do is to add a bullet point about the new engine in NEWS
, including the PR number and our github handle.
I'll wait with merging this PR until the rOpenSci review is done and a new version of aorsf is on CRAN so that we can depend on that, rather than the development version.
todo for myself: add tests on case weights |
Thank you! I will review the examples you shared and open a PR on parsnip. I noticed that my use of |
Great, thanks! re and there is also https://github.com/r-lib/backports |
Howdy! I just want to make sure I have not forgotten anything on my list. I have
There is no rush to complete this, I just want to make sure I'm not holding it up. If there is anything else I can do, I would be happy to help. |
Awesome, thank you! I'll get to this soon and also resolve the merge conflicts 👍 |
Great =] Thank you! |
Getting closer @bcjaeger ! Once the parsnip PR is good to go, this one is as well 👍 |
Howdy! Just wanted to let you know |
Thank you! |
Thank you for the great PR! 🙌 |
This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue. |
Hello! Following up on #172,
aorsf
has made it to CRAN and its API is stable.1aorsf
that were based on thepartykit
templatesaorsf
One possible concern is that
aorsf
fit and predict functions are designed to throw hard errors when the input data have missing values. This may not be in line with the design ofparsnip
predict functions, which propagate rows with missing data into the returned object as a missing value for the prediction. Would you like me to write some additional code into theaorsf
wrappers for prediction to make it more consistent with theparsnip
design?A second possible concern is that
aorsf
predict functions don't let users compute predictions for times that are greater than the maximum observed time in the model's training data. E.g., for thelung
data, the max time is 1022, so trying to predict a survival probability at time of 1023 or higher would trigger an error. Would you like me to write some additional code to prevent that sort of error from triggering?1In #172, we agreed to consider this pull request when
aorsf
was on CRAN and the rOpenSci review was done. Technically, my review is still open at rOpenSci, but the review is finished in the sense that all updates toaorsf
based on the reviewer feedback are complete. The issue is still open but the editor has let me know that it will be wrapped up in a week or so.