---
title: "Getting started with ERF"
output:
  rmarkdown::html_vignette:
    toc: TRUE
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting started with ERF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(raster)
library(viridisLite)
```
## Overview
Running an Ensemble Random Forests (ERF) model on a dataset is straightforward. The function `ens_random_forests()` takes a dataset in R `data.frame` format, prepares it for modeling using `erf_data_prep()` and `erf_formula_prep()`, runs each RF in the ensemble using `rf_ens_fn()`, and returns a fitted ERF object. This object can then be passed to various output functions: `erf_plotter()` and ... to visualize and summarize the results.

First, we load the package.
```{r setup}
library(EnsembleRandomForests)
```
## Datasets
### Using the provided simulated dataset
The provided dataset, `simData`, is a `list` object with three components: a `data.frame` of the sampled locations, the beta coefficients of the logistic model used to predict the probability of occurrence, and a `raster` `brick` object containing the gridded covariates, the log-odds of occurrence, and the probabilities of occurrence.
```{r sim_data_cov, fig.align='center', fig.height=5, fig.width=7}
# Visualize the five gridded covariates on a shared color scale
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1))
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0),2,6,byrow=TRUE))
# Common range across all five covariate layers, used as zlim below
r <- range(cellStats(simData$grid[[1:5]],'range'))
for(i in 1:5){
  image(simData$grid[[i]], col=inferno(100), zlim = r,
        xaxt='n', yaxt='n', xlab="", ylab="")
  title(paste0('Covariate ', i))
}
```
We can also inspect the beta coefficients that generated the probability of presence through the logistic model below:
$$\begin{equation}
\log\left[\frac{\hat{P}_{obs=1}}{1-\hat{P}_{obs=1}}\right] = \alpha + \beta_1X_1 + \cdots + \beta_nX_n
\end{equation}$$
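
To make the link between log-odds and probability concrete, here is a minimal base R sketch of the model above. The intercept, coefficients, and covariate values are made up for illustration; they are not the values stored in `simData$betas`.

```{r logit_sketch}
# Hypothetical intercept and coefficients, for illustration only
# (NOT the values stored in simData$betas)
alpha <- -2
beta  <- c(0.8, -0.5)
# Two made-up sites with two covariates each
X <- rbind(site1 = c(1.0, 0.2), site2 = c(-0.5, 1.5))
# Linear predictor: the log-odds of presence at each site
log_odds <- drop(alpha + X %*% beta)
# The inverse logit maps log-odds to a probability in (0, 1)
p <- plogis(log_odds)
# qlogis() is the logit, so it recovers the log-odds from p
all.equal(qlogis(p), log_odds)
```

`plogis()` and `qlogis()` are the inverse logit and logit in base R, so no extra packages are needed to move between the two scales.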
```{r sim_data_prob, fig.align='center', fig.height=3, fig.width=6}
# Beta coefficients of the logistic model, rounded to 3 decimals
print(round(simData$betas,3))
# We can visualize the log-odds and the probability of presence
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1), mfrow=c(1,2))
image(simData$grid[[6]], col=inferno(100), xaxt='n', yaxt='n', xlab="", ylab="")
title("Log-odds")
image(simData$grid[[7]], col=viridis(100), xaxt='n', yaxt='n', xlab="", ylab="")
# Overlay the sampled presence locations (obs == 1) as white points
with(simData$samples[simData$samples$obs==1,],
     points(x,y,pch=16,col='white'))
title("Probability of Presence")
```
## Running an Ensemble Random Forests model
Now that we have covered the dataset, let's run an ERF with `ens_random_forests()`.
```{r erf_ex}
ens_rf_ex <- ens_random_forests(df=simData$samples, var="obs",
                                covariates=grep("cov",colnames(simData$samples),value=TRUE),
                                header = NULL,
                                save=FALSE,
                                out.folder=NULL,
                                duplicate = TRUE,
                                n.forests = 10L,
                                importance = TRUE,
                                ntree = 1000,
                                mtry = 5,
                                var.q = c(0.1,0.5,0.9),
                                cores = parallel::detectCores()-2)
```
The arguments to `ens_random_forests()` are:

+ Data arguments:
    + `df`: the `data.frame` containing the presences/absences and the covariates
    + `var`: the column name of the presence/absence variable
    + `covariates`: the column names of the covariates to use. Here, we grabbed anything with "cov" in the column name
    + `header`: additional column names you may wish to append to the `data.frame` produced internally
+ Output arguments:
    + `save`: a logical flag indicating whether to save the model to the working directory or to an optional `out.folder` directory
+ Control arguments:
    + `duplicate`: a logical flag to control whether to duplicate observations with more than one presence
    + `n.forests`: the number of forests to generate in the ensemble. See the optimization vignette for more information on tuning this parameter
    + `importance`: a logical flag to calculate variable importance or not
    + `ntree`: the number of trees in each Random Forest in the ensemble
    + `mtry`: the number of covariates to try at each node of each tree in each Random Forest in the ensemble
    + `var.q`: quantiles for the distribution of the variable importance; only executed if `importance=TRUE`
    + `cores`: the number of cores to run the model on
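
As a small illustration of how two of these arguments might be prepared, the sketch below selects covariate columns by name and computes a core count that never drops below one. The column names are hypothetical stand-ins for `colnames(simData$samples)`, and the guard around `parallel::detectCores()` is a defensive choice assumed here, not something `ens_random_forests()` is known to require.

```{r arg_sketch}
# Hypothetical column names standing in for colnames(simData$samples)
cols <- c("x", "y", "obs", "cov1", "cov2", "cov3")
# Keep every column whose name contains "cov"
covs <- grep("cov", cols, value = TRUE)
covs
# detectCores() may return NA, and subtracting 2 on a small machine can
# give 0 or less, so fall back to a single core in those cases
n_cores <- max(1L, parallel::detectCores() - 2L, na.rm = TRUE)
n_cores
```

Note that `grep("cov", ...)` matches the pattern anywhere in a name; a stricter pattern such as `"^cov"` would match only names that start with it.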
We can look at some of the output produced by the random forests (see `help(ens_random_forests)` for a full list):
```{r view_erf}
#view the dataset used in the model
head(ens_rf_ex$data)
#view the ensemble model predictions
head(ens_rf_ex$ens.pred)
#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')])
#view the mean test threshold-free performance metrics for each RF
ens_rf_ex$mu.te.perf
#structure of the individual model predictions
str(ens_rf_ex$pred)
```
As we can see, the ensemble performs better than the mean of the individual RF test predictions. This is the advantage of ERF over other RF modifications for extreme class imbalance. [Siders et al. 2020](https://www.int-res.com/abstracts/esr/v43/p183-197/) discusses the performance of these other modifications if you are curious.
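
For intuition on why a combined prediction can outperform its members on average, here is a toy base R sketch. It is not the ERF algorithm: the data, the ten noisy "models", and the `auc()` helper are all hypothetical, there is no train/test split, and it assumes the ensemble prediction is a simple average across models.

```{r ens_avg_sketch}
set.seed(1)
# Toy data: 50 presences and 450 absences (hypothetical, not simData)
y <- rep(c(1, 0), times = c(50, 450))
# Ten hypothetical "models", each scoring every observation with
# independent noise around the true class
n_models <- 10
preds <- sapply(seq_len(n_models), function(m) y + rnorm(length(y), sd = 2))
# AUC computed from the rank-sum (Mann-Whitney) statistic
auc <- function(score, obs) {
  n1 <- sum(obs == 1)
  n0 <- sum(obs == 0)
  (sum(rank(score)[obs == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
# AUC of each model on its own
ind_auc <- apply(preds, 2, auc, obs = y)
# AUC of the averaged prediction
ens_auc <- auc(rowMeans(preds), y)
c(mean_individual = mean(ind_auc), ensemble = ens_auc)
```

Averaging cancels part of the independent noise in each model's scores, which is why the averaged prediction ranks presences above absences more reliably in this toy setting.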