Skip to content

aorsf - accelerated oblique random survival forest #211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Nov 9, 2022
Merged

aorsf - accelerated oblique random survival forest #211

merged 33 commits into from
Nov 9, 2022

Conversation

bcjaeger
Copy link
Contributor

Hello! Following up on #172, aorsf has made it to CRAN and its API is stable.1

  • I added wrappers for aorsf that were based on the partykit templates
  • I modified the matrix to tibble function (but just made a copy rather than changing the original)
    • and just made it so that the name of predictions was an input for the function so that it could be used for other time-specific predictions like hazard
  • I added tests and examples based on aorsf

One possible concern is that aorsf fit and predict functions are designed to throw hard errors when the input data have missing values. This may not be in line with the design of parsnip predict functions, which propagate rows with missing data into the returned object as a missing value for the prediction. Would you like me to write some additional code into the aorsf wrappers for prediction to make it more consistent with the parsnip design?

A second possible concern is that aorsf predict functions don't let users compute predictions for times that are greater than the maximum observed time in the model's training data. E.g., for the lung data, the max time is 1022, so trying to predict a survival probability at time of 1023 or higher would trigger an error. Would you like me to write some additional code to prevent that sort of error from triggering?

1In #172, we agreed to consider this pull request when aorsf was on CRAN and the rOpenSci review was done. Technically, my review is still open at rOpenSci, but the review is finished in the sense that all updates to aorsf based on the reviewer feedback are complete. The issue is still open but the editor has let me know that it will be wrapped up in a week or so.

@bcjaeger
Copy link
Contributor Author

bcjaeger commented Sep 1, 2022

Just wanted to share an update on the rOpenSci review. I did think it was about to be closed but yesterday I was assigned two more reviewers. My bad - it looks like the rOpenSci review will continue for at least another few weeks.

I am very excited to see what the new reviewers think and I will do everything I can to incorporate their feedback in a way that is compatible with the version of aorsf on CRAN. I'll let you know asap if breaking changes come up and I'll update this PR accordingly if they do.

Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! It looks great but there are a few major points atm:

The prediction type "hazard" in censored refers to the instantaneous hazard, while aorsf calculates the cumulative hazard. We currently don't have a prediction type for that so let's remove that aspect from this PR for now. (We can consider adding this in the future.)

The tidymodels principle is to return NA if predictions are not possible so all engines in censored need to adhere to that. This is relevant for both predictions with missing values and out-of-range predictions.

There are several aspects where I think the changes required to this PR would be of interest to all users of aorsf so would you consider the following changes to aorsf?

  • An option for how missing values are handled for predictions: For example, predict.lm() has an argument na.action which determines what happens with missing values. aorsf could still default to an error but in censored we could request that it pads the results with NA. See also ?na.fail.

  • An option to allow NA values for out-of-range predictions instead of an error.

  • Allowing any order of time points for prediction, and sorting those internally to orsf().

  • Allowing the standard flexibility of Surv() wrt the values of the event indicator.

I think all of these options would greatly enhance the user-friendliness of aorsf! Again, thanks for the PR already and good luck with the rOpenSci review.

@bcjaeger
Copy link
Contributor Author

bcjaeger commented Sep 3, 2022

Thanks for the great review! The changes you described will definitely make aorsf more user-friendly, and I'm happy to do that. I will get things updated in aorsf first and then work on the comments above.

@bcjaeger
Copy link
Contributor Author

bcjaeger commented Sep 4, 2022

Happy to share updates in aorsf based on your thoughts above. One follow-up question:

An option to allow NA values for out-of-range predictions instead of an error.

Were you referring to the case where times in predict() has at least one value over the maximum observed follow-up time? If so, great! That is allowed now with check_boundaries set to FALSE. The returned values are currently not NA (explanation in reprex comments).

Thank you again for the thoughts you shared - the changes definitely make the API friendlier. I'm happy to make more modifications in aorsf if that is helpful. Otherwise, I will work on updates in the censored code.

suppressPackageStartupMessages({
 library(aorsf)
 library(tidyverse)
})

# Allowing the standard flexibility of Surv() wrt the values of the 
# event indicator.

pbc_lgl <- mutate(pbc_orsf, status = as.logical(status))
pbc_12 <- mutate(pbc_orsf, status = status + 1)

set.seed(329)
fit_standard <- orsf(pbc_orsf, Surv(time, status) ~ . - id)
set.seed(329)
fit_lgl <- orsf(pbc_lgl, formula = Surv(time, status) ~ . - id)
set.seed(329)
fit_12 <- orsf(pbc_12, formula = Surv(time, status) ~ . - id)

all.equal(fit_standard$forest, fit_lgl$forest)
#> [1] TRUE
all.equal(fit_standard$forest, fit_12$forest)
#> [1] TRUE

# An option for how missing values are handled for predictions: For example, 
# predict.lm() has an argument na.action which determines what happens with 
# missing values. aorsf could still default to an error but in censored we 
# could request that it pads the results with NA. See also ?na.fail.


pbc_first10 <- pbc_na <- slice(pbc_orsf, 1:10)
pbc_na$age[c(1,5,10)] <- NA

# consistent w/tidymodels
predict(fit_standard, pbc_na, na_action = 'pass')
#>             [,1]
#>  [1,]         NA
#>  [2,] 0.10615858
#>  [3,] 0.77370799
#>  [4,] 0.43724275
#>  [5,]         NA
#>  [6,] 0.08677136
#>  [7,] 0.06302206
#>  [8,] 0.25354362
#>  [9,] 0.96007034
#> [10,]         NA
# consistent w/base R
predict(fit_standard, pbc_na, na_action = 'omit')
#>            [,1]
#> [1,] 0.10615858
#> [2,] 0.77370799
#> [3,] 0.43724275
#> [4,] 0.08677136
#> [5,] 0.06302206
#> [6,] 0.25354362
#> [7,] 0.96007034

# An option to allow NA values for out-of-range predictions 
# instead of an error.

## I allow out of range predictions but, instead of returning NA when
## boundary_checks is FALSE, return prediction at the max observed time
## (notice column 2 (t = 10,000) is the same as 3 (t = 100,000))
## I did this to be consistent with the way I implemented boundary_checks
## in the orsf_pd functions and also just in case a user really wants
## to make predictions this way.
predict(fit_standard, 
        new_data = pbc_first10, 
        boundary_checks = FALSE,
        pred_horizon = c(1000, 10000, 100000))
#>             [,1]      [,2]      [,3]
#>  [1,] 0.91966955 0.9870598 0.9870598
#>  [2,] 0.04281353 0.5127637 0.5127637
#>  [3,] 0.40236198 0.9077745 0.9077745
#>  [4,] 0.22127622 0.8731020 0.8731020
#>  [5,] 0.09070765 0.7597322 0.7597322
#>  [6,] 0.02118349 0.4610088 0.4610088
#>  [7,] 0.02227554 0.5664215 0.5664215
#>  [8,] 0.09007197 0.8394871 0.8394871
#>  [9,] 0.88790002 0.9849913 0.9849913
#> [10,] 0.04862403 0.6593526 0.6593526

# Allowing any order of time points for prediction, and sorting those 
# internally to orsf().
predict(fit_standard, 
        new_data = pbc_first10, 
        pred_horizon = c(1000, 500, 100))
#>             [,1]        [,2]        [,3]
#>  [1,] 0.91966955 0.785869003 0.142670246
#>  [2,] 0.04281353 0.010758594 0.000000000
#>  [3,] 0.40236198 0.135624711 0.016068543
#>  [4,] 0.22127622 0.095856957 0.014200541
#>  [5,] 0.09070765 0.027139137 0.001000000
#>  [6,] 0.02118349 0.005302408 0.000000000
#>  [7,] 0.02227554 0.010475093 0.000000000
#>  [8,] 0.09007197 0.025564204 0.001844517
#>  [9,] 0.88790002 0.744081557 0.270957115
#> [10,] 0.04862403 0.017911853 0.001200000

Created on 2022-09-04 by the reprex package (v2.0.1)

@bcjaeger bcjaeger requested a review from hfrick September 4, 2022 17:09
Copy link
Member

@hfrick hfrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you, that looks great! Thanks for the changes to aorsf. I've updated the tests so that they make use of these new features, i.e. actually try to predict on data with missing values and for time points outside of the observed range.

Re out-of-range predictions: yes, that is what I was referring to! The change in aorsf seems fine to me, we should just document that. Documentation of engine-specific detail like this one goes into parsnip. How to do that is documented here: https://github.com/tidymodels/parsnip/blob/main/inst/README-DOCS.md.

An example are the partykit engine docs for rand_forest()

Would you open a PR on parsnip for that?

Otherwise we're pretty much done here: other than the documentation, what's left to do is to add a bullet point about the new engine in NEWS, including the PR number and our github handle.

I'll wait with merging this PR until the rOpenSci review is done and a new version of aorsf is on CRAN so that we can depend on that, rather than the development version.

@hfrick
Copy link
Member

hfrick commented Sep 7, 2022

todo for myself: add tests on case weights

@bcjaeger
Copy link
Contributor Author

bcjaeger commented Sep 7, 2022

Thank you! I will review the examples you shared and open a PR on parsnip.

I noticed that my use of deparse1 causes our tests to fail on oldrel-3 since deparse1 was introduced in R 4.0.0. I think as.character can be used instead of deparse1, and will make sure this fix is included in 0.0.3.

@hfrick
Copy link
Member

hfrick commented Sep 7, 2022

Great, thanks!

re deparse1(): if you'd like some inspiration, this is how it's being dealt with in testthat: https://github.com/r-lib/testthat/blob/426eb523f6772c76175d6e8d7ca04355b1ee7ad5/R/utils.R#L53

and there is also https://github.com/r-lib/backports

@bcjaeger
Copy link
Contributor Author

Howdy! I just want to make sure I have not forgotten anything on my list. I have

  • opened the PR on parsnip
  • finished the rOpenSci review
  • uploaded aorsf 0.0.3 to CRAN

There is no rush to complete this, I just want to make sure I'm not holding it up. If there is anything else I can do, I would be happy to help.

@hfrick
Copy link
Member

hfrick commented Oct 28, 2022

Awesome, thank you! I'll get to this soon and also resolve the merge conflicts 👍

@bcjaeger
Copy link
Contributor Author

Great =] Thank you!

@hfrick
Copy link
Member

hfrick commented Nov 1, 2022

Getting closer @bcjaeger ! Once the parsnip PR is good to go, this one is as well 👍

@bcjaeger
Copy link
Contributor Author

bcjaeger commented Nov 7, 2022

Howdy! Just wanted to let you know aorsf 0.0.4 is on its way to CRAN and should be available soon. (It includes support for Surv objects in the formula.)

@hfrick hfrick self-requested a review November 9, 2022 10:13
@hfrick hfrick merged commit 4043ce6 into tidymodels:main Nov 9, 2022
@bcjaeger
Copy link
Contributor Author

Thank you!

@hfrick
Copy link
Member

hfrick commented Nov 10, 2022

Thank you for the great PR! 🙌

@bcjaeger bcjaeger deleted the aorsf branch November 13, 2022 21:38
@github-actions
Copy link

This pull request has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants