/
v02_modeling.Rmd
406 lines (297 loc) · 14.1 KB
/
v02_modeling.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
---
title: "flexsdm: Overview of Modeling functions"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{flexsdm: Overview of Modeling functions}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
# fig.path = "man/figures/README-",
# out.width = "100%",
fig.width = 6,
fig.height = 6,
# dpi = 60,
echo = TRUE,
warning = FALSE,
eval = TRUE
)
```
```{r dependencies, include = FALSE}
library(knitr)
# devtools::install_github('sjevelazco/flexsdm')
```
## Introduction
Species distribution modeling (SDM) has become a standard tool in multiple research areas, including ecology, conservation biology, biogeography, paleobiogeography, and epidemiology. SDM is an area of active theoretical and methodological research. The *flexsdm* package provides users the ability to manipulate and parameterize models in a variety of ways that meet their unique research needs.
This flexibility enables users to define their own complete or partial modeling procedure specific for their modeling situations (e.g., number of variables, number of records, different algorithms and ensemble methods, algorithms tuning, etc.).
In this vignette, users will learn about the second set of functions in the *flexsdm* package that fall under the "modeling" umbrella. These functions were designed to construct and validate different types of models and can be grouped into fit_* , tune_* , and esm_* family functions. In addition there is a function to perform ensemble modeling.
The fit_* functions construct and validate models with default hyper-parameter values. The tune_* functions construct and validate models by searching for the best combination of hyper-parameter values, and esm_ functions can be used for constructing and validating Ensemble of Small Models. Finally, the fit_ensemble() function is for fitting and validating ensemble models.
These are the functions for model construction and validation:
**fit_* functions family**
+ fit_gam() Fit and validate Generalized Additive Models
+ fit_gau() Fit and validate Gaussian Process models
+ fit_gbm() Fit and validate Generalized Boosted Regression models
+ fit_glm() Fit and validate Generalized Linear Models
+ fit_max() Fit and validate Maximum Entropy models
+ fit_net() Fit and validate Neural Networks models
+ fit_raf() Fit and validate Random Forest models
+ fit_svm() Fit and validate Support Vector Machine models
**tune_* functions family**
+ tune_gbm() Fit and validate Generalized Boosted Regression models with exploration of hyper-parameters
+ tune_max() Fit and validate Maximum Entropy models with exploration of hyper-parameters
+ tune_net() Fit and validate Neural Networks models with exploration of hyper-parameters
+ tune_raf() Fit and validate Random Forest models with exploration of hyper-parameters
+ tune_svm() Fit and validate Support Vector Machine models with exploration of hyper-parameters
**model ensemble**
+ fit_ensemble() Fit and validate ensemble models with different ensemble methods
**esm_* functions family**
+ esm_gam() Fit and validate Generalized Additive Models with Ensemble of Small Model approach
+ esm_gau() Fit and validate Gaussian Process models Models with Ensemble of Small Model approach
+ esm_gbm() Fit and validate Generalized Boosted Regression models with Ensemble of Small Model approach
+ esm_glm() Fit and validate Generalized Linear Models with Ensemble of Small Model approach
+ esm_max() Fit and validate Maximum Entropy models with Ensemble of Small Model approach
+ esm_net() Fit and validate Neural Networks models with Ensemble of Small Model approach
+ esm_svm() Fit and validate Support Vector Machine models with Ensemble of Small Model approach
## Installation
First, install the flexsdm package. You can install the released version of *flexsdm* from [github](https://github.com/sjevelazco/flexsdm) with:
```{r download package}
# devtools::install_github('sjevelazco/flexsdm')
require(flexsdm)
require(terra)
require(dplyr)
```
## Project directory setup
Decide where on your computer you would like to store the inputs and outputs of your project (this will be your main directory). Use an existing one or use dir.create() to create your main directory. Then specify whether or not to include folders for projections, calibration areas, algorithms, ensembles, and thresholds. For more details see [Vignette 01_pre_modeling](https://sjevelazco.github.io/flexsdm/articles/01_pre_modeling.html)
## Data, species occurrence and background data
In this tutorial, we will be using species occurrences and environmental data that are available through the *flexsdm* package. The "abies" example dataset includes a pr_ab column (presence = 1, and absence = 0), location columns (x, y) and other environmental data. You can load the "abies" data into your local R environment by using the code below:
(THIS EXAMPLE LOOKS A LITTLE STRANGE BECAUSE WE ARE ALSO USING BACKGROUND DATA, WHILE THE ABIES DATASET CLEARLY HAS ABSENCES...)
```{r occ data}
data("abies")
data("backg")
dplyr::glimpse(abies)
dplyr::glimpse(backg)
```
If you want to replace the abies dataset with your own data, make sure that your dataset contains the environmental conditions related to presence-absence data. We use a pre-modeling family function for a k-fold partition method (to be used for cross-validation). In the partition method the number of folds or replications must be the same for presence-absence and for background points datasets.
```{r occ partition}
abies2 <- part_random(
data = abies,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 5)
)
dplyr::glimpse(abies2)
```
Now, in the abies2 object we have a new column called ".part" with the 5 k-folds (1, 2, 3, 4, 5), indicating which partition each record (row) is a member of. Next, we have to apply the same partition method and number of folds to the environmental conditions of the background points.
```{r background_points}
backg2 <- part_random(
data = backg,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 5)
)
dplyr::glimpse(backg2)
```
In backg2 object we have a new column called ".part" with the 5 k-folds (1, 2, 3, 4, 5).
### 1. Fit and validate models
We fit and validate models: I. a maximum entropy model with default hyper-parameter values (flexsdm::fit_max) and II. a random forest model with exploration of hyper-parameters (flexsdm::tune_raf).
I. Maximum Entropy models with default hyper-parameter values.
```{r fit_max}
max_t1 <- fit_max(
data = abies2,
response = "pr_ab",
predictors = c("aet", "ppt_jja", "pH", "awc", "depth"),
predictors_f = c("landform"),
partition = ".part",
background = backg2,
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
clamp = TRUE,
classes = "default",
pred_type = "cloglog",
regmult = 1
)
```
This function returns a list object with the following elements:
```{r fit_max_names}
names(max_t1)
```
model: A "MaxEnt" class object. This object can be used for predicting.
```{r fit_max_model}
options(max.print = 20)
max_t1$model
```
predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.
```{r fit_max_predictors}
max_t1$predictors
```
performance: The performance metric (see sdm_eval). Those metrics that are threshold dependent are calculated based on the threshold specified in the argument. We can see all the selected threshold values.
```{r fit_max_performance}
max_t1$performance
```
Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.
```{r fit_max_data_ens}
max_t1$data_ens
```
II- Random forest models with exploration of hyper-parameters.
First, we create a data.frame that provides hyper-parameters values to be tested. It Is recommended to generate this data.frame. Hyper-parameter needed for tuning is 'mtry'. The maximum mtry must be equal to total number of predictors.
```{r tune_grid}
tune_grid <-
expand.grid(mtry = seq(1, 7, 1))
```
We use the same data object abies2, with the same k-fold partition method:
```{r tune_raf}
rf_t <-
tune_raf(
data = abies2,
response = "pr_ab",
predictors = c(
"aet", "cwd", "tmin", "ppt_djf",
"ppt_jja", "pH", "awc", "depth"
),
predictors_f = c("landform"),
partition = ".part",
grid = tune_grid,
thr = "max_sens_spec",
metric = "TSS",
)
```
Let's see what the output object contains. This function returns a list object with the following elements:
```{r tune_raf_names}
names(rf_t)
```
model: A "randomForest" class object. This object can be used to see the formula details, a basic summary o fthe model, and for predicting.
```{r tune_raf_model}
rf_t$model
```
predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.
```{r tune_raf_predictors}
rf_t$predictors
```
performance: The performance metric (see sdm_eval). Those metrics that are threshold dependent are calculated based on the threshold specified in the argument. We can see all the selected threshold values.
```{r tune_raf_performance}
rf_t$performance
```
Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.
```{r tune_raf_data_ens}
rf_t$data_ens
```
These model objects can be used in flexsdm::fit_ensemble().
### 2. Model Ensemble
In this example we fit and validate and ensemble model using the two model objects that were just created.
```{r fit_ensemble_example}
# Fit and validate ensemble model
an_ensemble <- fit_ensemble(
models = list(max_t1, rf_t),
ens_method = "meansup",
thr = NULL,
thr_model = "max_sens_spec",
metric = "TSS"
)
```
```{r fit_ensemble_names}
# Outputs
names(an_ensemble)
an_ensemble$thr_metric
an_ensemble$predictors
an_ensemble$performance
```
### 3. Fit and validate models with Ensemble of Small Model approach
This method consists of creating bivariate models with all pair-wise combinations of predictors and perform an ensemble based on the average of suitability weighted by Somers' D metric (D = 2 x (AUC -0.5)). ESM is recommended for modeling species with very few occurrences. This function does not allow categorical variables because the use of these types of variables could be problematic when applied to species with few occurrences. For more detail see Breiner et al. (2015, 2018)
```{r esm_subset_kfold}
data("abies")
library(dplyr)
# Create a smaller subset of occurrences
set.seed(10)
abies2 <- abies %>%
na.omit() %>%
group_by(pr_ab) %>%
dplyr::slice_sample(n = 10) %>%
group_by()
```
We can use different methods in the flexsdm::part_random function according to our data. See [part_random](https://sjevelazco.github.io/flexsdm/reference/part_random.html) for more details.
```{r using_kfold}
# Using k-fold partition method for model cross validation
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 3)
)
abies2
```
This function constructs Generalized Additive Models using the Ensembles of Small Models (ESM) approach (Breiner et al., 2015, 2018).
```{r esm_gam}
# We set the model without threshold specification and with the kfold created above
esm_gam_t1 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
```
This function returns a list object with the following elements:
```{r esm_gam_t1_names}
names(esm_gam_t1)
```
esm_model: A list with "GAM" class object for each bivariate model. This object can be used for predicting using the ESM approachwith sdm_predict function.
```{r esm_gam_t1_model}
options(max.print = 10) # If you don't want to see printed all the output
esm_gam_t1$esm_model
```
predictors: A tibble with variables use for modeling.
```{r esm_gam_t1_predictors}
esm_gam_t1$predictors
```
performance: Performance metric (see sdm_eval). Those threshold dependent metrics are calculated based on the threshold specified in the argument.
```{r esm_gam_t1_performance}
esm_gam_t1$performance
```
Now, we test the rep_kfold partition method. In this method 'folds' refers to the number of partitions for data partitioning and 'replicate' refers to the number of replicates. Both assume values >=1.
```{r }
# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))
# Test with rep_kfold partition using 3 folds and 5 replicates
set.seed(10)
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "rep_kfold", folds = 3, replicates = 5)
)
abies2
```
We use the new rep_kfold partition in the gam model
```{r esm_gam_t2}
esm_gam_t2 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
```
Test with random bootstrap partitioning. In method 'replicate' refers to the number of replicates (assumes a value >=1), 'proportion' refers to the proportion of occurrences used for model fitting (assumes a value >0 and <=1). With this method we can configure the proportion of training and testing data according to the species occurrences. In this example, proportion='0.7' indicates that 70% of data will be used for model training, while 30% will be used for model testing. For this method, the function will return .partX columns with "train" or "test" words as the entries.
```{r test_bootstrap}
# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))
# Test with bootstrap partition using 10 replicates
set.seed(10)
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "boot", replicates = 10, proportion = 0.7)
)
abies2
```
Use the new rep_kfold partition in the gam model
```{r esm_gam_t3}
esm_gam_t3 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
```
#=========#=========#=========#=========#=========#=========#=========#
**Vignette still under construction and changes**
#=========#=========#=========#=========#=========#=========#=========#