-
Notifications
You must be signed in to change notification settings - Fork 10
/
aorsf.Rmd
158 lines (92 loc) · 5.93 KB
/
aorsf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
---
title: "Introduction to aorsf"
description: >
Learn how to get started with the basics of aorsf.
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Introduction to aorsf}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.height = 5,
fig.width = 7
)
```
This article covers core features of the `aorsf` package.
```{r}
library(aorsf)
```
## Background
The oblique random forest (RF) is an extension of the traditional (axis-based) RF. Instead of using a single variable to split data and grow new branches, trees in the oblique RF use a weighted combination of multiple variables.
## Oblique RFs for survival, classification, and regression
The purpose of `aorsf` ('a' is short for accelerated) is to provide a unifying framework to fit oblique RFs that can scale adequately to large data sets. The fastest algorithms available in the package are used by default because they often have equivalent prediction accuracy to more computational approaches.
The center piece of `aorsf` is the `orsf()` function. In the initial versions of `aorsf`, the `orsf()` function only fit `o`blique `r`andom `s`urvival `f`orests, but now it allows for classification, regression, and survival forests. (I may introduce an `orf()` function in the future if the name `orsf()` is misleading to users.)
For classification, we fit an oblique RF to predict penguin species using `penguin` data from the magnificent `palmerpenguins` [R package](https://allisonhorst.github.io/palmerpenguins/)
```{r}
# An oblique classification RF
penguin_fit <- orsf(data = penguins_orsf, formula = species ~ .)
penguin_fit
```
For regression, we use the same data but predict bill length of penguins:
```{r}
# An oblique regression RF
bill_fit <- orsf(data = penguins_orsf, formula = bill_length_mm ~ .)
bill_fit
```
My personal favorite is the oblique survival RF with accelerated Cox regression because it has a great combination of prediction accuracy and computational efficiency (see [JCGS paper](https://doi.org/10.1080/10618600.2023.2231048)). Here, we predict mortality risk following diagnosis of primary biliary cirrhosis:
```{r}
# An oblique survival RF
pbc_fit <- orsf(data = pbc_orsf,
n_tree = 5,
formula = Surv(time, status) ~ . - id)
pbc_fit
```
you may notice that the first input of `aorsf` is `data`. This is a design choice that makes it easier to use `orsf` with pipes (i.e., `%>%` or `|>`). For instance,
```{r, eval=FALSE}
library(dplyr)
pbc_fit <- pbc_orsf |>
select(-id) |>
orsf(formula = Surv(time, status) ~ .,
n_tree = 5)
```
## Interpretation
`aorsf` includes several functions dedicated to interpretation of ORSFs, both through estimation of partial dependence and variable importance.
### Variable importance
There are multiple methods to compute variable importance, and each can be applied to any type of oblique forest.
- To compute *negation* importance, ORSF multiplies each coefficient of that variable by -1 and then re-computes the out-of-sample (sometimes referred to as out-of-bag) accuracy of the ORSF model.
```{r}
orsf_vi_negate(pbc_fit)
```
- You can also compute variable importance using *permutation*, a more classical approach that noises up a predictor and then assigned the resulting degradation in prediction accuracy to be the importance of that predictor.
```{r}
orsf_vi_permute(penguin_fit)
```
- A faster alternative to permutation and negation importance is ANOVA importance, which computes the proportion of times each variable obtains a low p-value (p < 0.01) while the forest is grown.
```{r}
orsf_vi_anova(bill_fit)
```
### Partial dependence (PD)
`r aorsf:::roxy_pd_explain()`
For more on PD, see the [vignette](https://docs.ropensci.org/aorsf/articles/pd.html)
### Individual conditional expectations (ICE)
`r aorsf:::roxy_ice_explain()`
For more on ICE, see the [vignette](https://docs.ropensci.org/aorsf/articles/pd.html#individual-conditional-expectations-ice)
## What about the original ORSF?
The original ORSF (i.e., `obliqueRSF`) used `glmnet` to find linear combinations of inputs. `aorsf` allows users to implement this approach using the `orsf_control_survival(method = 'net')` function:
```{r, eval=FALSE}
orsf_net <- orsf(data = pbc_orsf,
formula = Surv(time, status) ~ . - id,
control = orsf_control_survival(method = 'net'))
```
`net` forests fit a lot faster than the original ORSF function in `obliqueRSF`. However, `net` forests are still much slower than `cph` ones.
## aorsf and other machine learning software
The unique feature of `aorsf` is its fast algorithms to fit ORSF ensembles. `RLT` and `obliqueRSF` both fit oblique random survival forests, but `aorsf` does so faster. `ranger` and `randomForestSRC` fit survival forests, but neither package supports oblique splitting. `obliqueRF` fits oblique random forests for classification and regression, but not survival. `PPforest` fits oblique random forests for classification but not survival.
Note: The default prediction behavior for `aorsf` models is to produce predicted risk at a specific prediction horizon, which is not the default for `ranger` or `randomForestSRC`. I think this will change in the future, as computing time independent predictions with `aorsf` could be helpful.
## Learning more
`aorsf` began as a dedicated package for oblique random survival forests, and so most papers published so far have focused on survival analysis and risk prediction. However, the routines for regression and classification oblique RFs in `aorsf` have high overlap with the survival ones.
- See [orsf](https://docs.ropensci.org/aorsf/reference/orsf.html) for more details on oblique random survival forests.
- see the [JCGS](https://doi.org/10.1080/10618600.2023.2231048) paper for more details on algorithms used specifically by `aorsf`.