vignettes/basic-workflow.Rmd

---
title: "Getting started with ERF"
output: 
  rmarkdown::html_vignette:
    toc: TRUE
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{Getting started with ERF}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(raster)
library(viridisLite)
```

## Overview
Running ERFs on a given dataset is easy. The function `ens_random_forests()` will take a given dataset in R `data.frame` format, amend it for modeling using `erf_data_prep()` and `erf_formula_prep()`, run each RF in the ensemble using `rf_ens_fn()`, and return a fitted ERF object. This object can then be passed to various output functions: `erf_plotter()` and ... to visualize and summarize.

First, we must load the R library.
```{r setup}
library(EnsembleRandomForests)
```

## Datasets

### Using the provided simulated dataset
The provided dataset is a `list` object that contains a `data.frame` of the sampled locations, the beta coefficients of the logistic model used to predict the probability of occurrence, and a `raster` `brick` object containing the gridded covariates, log-odds of occurrence, and probabilities of occurrence. 

```{r sim_data_cov, fig.align='center', fig.height=5, fig.width=7}
# We can also visualize the covariates
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1))
layout(matrix(c(1,1,2,2,3,3,0,4,4,5,5,0),2,6,byrow=TRUE))
r <- range(cellStats(simData$grid[[1:5]],'range'))
for(i in 1:5){
  image(simData$grid[[i]], col=inferno(100), zlim = r, 
        xaxt='n', yaxt='n', xlab="", ylab="")
  title(paste0('Covariate ', i))
}
```

We can also see the beta coefficients that produced the probability of presence using the model below:
$$\begin{equation}
  log\left[\frac{\hat{P}_{obs=1}}{1-\hat{P}_{obs=1}}\right] = \alpha + \beta_1X_1 + ... +\beta_nX_n
\end{equation}$$
```{r sim_data_prob, fig.align='center', fig.height=3, fig.width=6}
print(round(simData$betas,3))

# We can visualize the log-odds and the probability of presence
par(mar=c(0,0.5,2,0.5), oma=c(1,1,1,1), mfrow=c(1,2))
image(simData$grid[[6]], col=inferno(100), xaxt='n', yaxt='n', xlab="", ylab="")
title("Log-odds")
image(simData$grid[[7]], col=viridis(100), xaxt='n', yaxt='n', xlab="", ylab="")
with(simData$samples[simData$samples$obs==1,],
     points(x,y,pch=16,col='white'))
title("Probability of Presence")
```


## Running an Ensemble Random Forests model
Now that we have covered the datasets, let's run an ERF. This is simple using `ens_random_forests`. 
```{r erf_ex}
ens_rf_ex <- ens_random_forests(df=simData$samples, var="obs",
                                covariates=grep("cov",colnames(simData$samples),value=T),
                                header = NULL,
                                save=FALSE,
                                out.folder=NULL,
                                duplicate = TRUE,
                                n.forests = 10L,
                                importance = TRUE,
                                ntree = 1000,
                                mtry = 5,
                                var.q = c(0.1,0.5,0.9),
                                cores = parallel::detectCores()-2)
```
The arguments to `ens_random_forests` are:

+ Data arguments:
  + `df`: this is the data.frame containing the presences/absences and the covariates
  + `var`: this is the column name of the presence/absence
  + `covariates`: these are the column names of the covariates to use. Here, we grabbed anything with "cov" in the column name
  + `header`: these are additional column names you may wish to append to data.frame produced internally
+ Output arguments:
  + `save`: this is a logical whether to save the model to the working directory or an optional `out.folder` directory
+ Control arguments:
  + `duplicate`: a logical flag to control whether to duplicate observations with more than one presence. 
  + `n.forests`: this controls the number of forests to generate in the ensemble. See the optimization vignette for more information on tuning this parameter.
  + `importance`: a logical flag to calculate variable importance or not
  + `ntree`: number of trees in each Random Forests in the ensemble
  + `mtry`: number of covariates to try at each node in each tree in each Random Forests in the ensemble
  + `var.q`: quantiles for the distribution of the variable importance; only exectuted if importance=TRUE
  + `cores`: how many cores to run the model on.

We can look at some of the output produced by the random forests (see `help(ens_random_forests)` for a full list):

```{r view_erf}
#view the dataset used in the model
head(ens_rf_ex$data) 

#view the ensemble model predictions
head(ens_rf_ex$ens.pred)

#view the threshold-free ensemble performance metrics
unlist(ens_rf_ex$ens.perf[c('auc','rmse','tss')])

#view the mean test threshold-free performance metrics for each RF
ens_rf_ex$mu.te.perf

#structure of the individual model predictions
str(ens_rf_ex$pred)
```

As we can see, the ensemble performs better than the mean test predictions. This is advantage of ERF over other RF modifications for extreme class imbalance. [Siders et al. 2020](https://www.int-res.com/abstracts/esr/v43/p183-197/) discusses the various performance of these other modifications if you are curious.