---
title: 'flexsdm: Red Fir example'
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{flexsdm: Red Fir example}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
# fig.path = "man/figures/README-",
# out.width = "100%",
fig.width = 6,
fig.height = 6,
# dpi = 60,
echo = TRUE,
warning = FALSE,
eval = TRUE
)
```
```{r dependencies, include = FALSE}
# devtools::install_github('sjevelazco/flexsdm')
library(knitr)
```
# Example of full modeling process
## Study species & overview of methods
Here, we used the *flexsdm* package to model the current distribution of California red fir (*Abies magnifica*). Red fir is a high-elevation conifer species whose geographic range extends through the Sierra Nevada of California, USA, into the southern portion of the Cascade Range of Oregon. For this species, we used presence data compiled from several public datasets curated by natural resource agencies. We built the distribution models using four hydro-climatic variables: actual evapotranspiration, climatic water deficit, maximum temperature of the warmest month, and minimum temperature of the coldest month. All variables were resampled (aggregated) to a 1890-m spatial resolution to improve processing time.
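The resampling step itself is not shown in this vignette. With *terra*, such an aggregation could look like the following sketch (the input file path, original resolution, and aggregation factor are illustrative assumptions, not values from the original workflow):

```{r resample sketch, eval = FALSE}
library(terra)
# Illustrative sketch: aggregate fine-resolution layers to ~1890 m.
# fact = 7 assumes a 270-m input grid (270 m x 7 = 1890 m); adjust the
# factor to match the actual resolution of your rasters.
env_fine <- terra::rast("path/to/fine_res_layers.tif") # hypothetical file
env_coarse <- terra::aggregate(env_fine, fact = 7, fun = "mean")
res(env_coarse) # check the resulting cell size
```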
## Delimiting a calibration area
Delimiting the calibration area (a.k.a. accessible area) is an essential step in SDMs, in both methodological and theoretical terms. The calibration area affects several characteristics of an SDM, such as the range of the environmental variables, the number of absences, the distribution of background points and pseudo-absences, and, unfortunately, some performance metrics like AUC and TSS. There are several ways to delimit a calibration area in [calib_area()](https://sjevelazco.github.io/flexsdm/reference/calib_area.html). Here we used a method in which the calibration area is delimited by a 100-km buffer around presences (shown in the figure below).
```{r raw data, message=FALSE}
# devtools::install_github('sjevelazco/flexsdm')
library(flexsdm)
library(terra)
library(dplyr)
somevar <- system.file("external/somevar.tif", package = "flexsdm")
somevar <- terra::rast(somevar) # environmental data
names(somevar) <- c("aet", "cwd", "tmx", "tmn")
data(abies)
abies_p <- abies %>%
select(x, y, pr_ab) %>%
filter(pr_ab == 1) # filter only for presence locations
ca <-
calib_area(
data = abies_p,
x = 'x',
y = 'y',
method = c('buffer', width = 100000),
crs = crs(somevar)
) # create a calibration area with 100 km buffer around occurrence points
# visualize the species occurrences
layer1 <- somevar[[1]]
layer1[!is.na(layer1)] <- 1
plot(layer1, col="gray80", legend=FALSE, axes=FALSE)
plot(crop(ca, layer1), add=TRUE)
points(abies_p[,c("x", "y")], col = "#00000480")
```
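Besides the buffer method used above, calib_area() accepts other delimitation approaches, such as a minimum convex polygon around the presences. The sketch below assumes an "mcp" method option; check ?calib_area for the methods available in your installed version:

```{r mcp sketch, eval = FALSE}
# Sketch: delimit the calibration area as a minimum convex polygon
# around the presences instead of a 100-km buffer
ca_mcp <- calib_area(
  data = abies_p,
  x = "x",
  y = "y",
  method = c("mcp"),
  crs = crs(somevar)
)
plot(ca_mcp)
```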
## Occurrence filtering
Sample bias in species occurrence data has long been a recognized issue in SDM. Environmental filtering of observation data can improve model predictions by reducing redundancy in environmental (e.g., climatic) hyper-space (Varela et al. 2014). Here we use the function occfilt_env() to thin the red fir occurrences based on environmental space. This function is unique to *flexsdm*; in contrast to functions in other packages, it can use any number of environmental dimensions and does not perform a PCA before filtering.
Next we apply environmental occurrence filtering using 8 bins and display the resulting filtered occurrence data.
```{r occurrence filtering}
abies_p$id <- 1:nrow(abies_p) # adding unique id to each row
abies_pf <- abies_p %>%
occfilt_env(
data = .,
x = "x",
y = "y",
id = "id",
nbins = 8,
env_layer = somevar
) %>%
left_join(abies_p, by = c("id", "x", "y"))
plot(layer1, col="gray80", legend=FALSE, axes=FALSE)
plot(crop(ca, layer1), add=TRUE)
points(abies_p[,c("x", "y")], col = "#00000480")
points(abies_pf[,c("x", "y")], col = "#5DC86180")
```
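The choice of 8 bins is a tuning decision: fewer bins split environmental space more coarsely and discard more records. A quick way to gauge the sensitivity of the filtering to this setting is to rerun occfilt_env() over a few bin counts and compare how many presences are retained (the bin values below are illustrative):

```{r nbins sketch, eval = FALSE}
# Sketch: how filtering intensity changes with the number of bins
sapply(c(5, 8, 12), function(b) {
  nrow(occfilt_env(
    data = abies_p, x = "x", y = "y", id = "id",
    nbins = b, env_layer = somevar
  ))
})
```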
## Block partition with 4 folds
Data partitioning, or splitting data into testing and training groups, is a key step in building SDMs. *flexsdm* offers multiple options for data partitioning, and here we use a spatial block method. Geographically structured data partitioning methods are especially useful if users want to evaluate model transferability to different regions or time periods. The part_sblock() function explores spatial blocks with different raster cell sizes and returns the one best suited for the input dataset based on spatial autocorrelation, environmental similarity, and the number of presence/absence records in each block partition. The function's output provides users with 1) a tibble with presence/absence locations and the assigned partition number, 2) a tibble with information about the best partition, and 3) a SpatRaster showing the selected grid. Here we want to divide the data into 4 different partitions using the spatial block method.
```{r block partition}
set.seed(10)
occ_part <- abies_pf %>%
part_sblock(
data = .,
env_layer = somevar,
pr_ab = "pr_ab",
x = "x",
y = "y",
n_part = 4,
min_res_mult = 3,
max_res_mult = 200,
num_grids = 30,
prop = 1
)
abies_pf <- occ_part$part
# Transform best block partition to a raster layer with same resolution and extent than
# predictor variables
block_layer <- get_block(env_layer = somevar, best_grid = occ_part$grid)
cl <- c("#64146D", "#9E2962", "#F47C15", "#FCFFA4")
plot(block_layer, col=cl, legend=FALSE, axes=FALSE)
points(abies_pf[,c("x", "y")])
# Number of presences per block
abies_pf %>%
dplyr::group_by(.part) %>%
dplyr::count()
# Additional information of the best block
occ_part$best_part_info
```
## Pseudo-absence/background points (using partition previously created as a mask)
In this example, we only have species presence data. However, most SDM methods require either pseudo-absence or background data. Here, we will use the spatial block partition we just created to generate pseudo-absence and background points.
```{r pseudo/absence and background data}
# Sample background points throughout the study area with the random method,
# allocating 10x the number of presences as background
set.seed(10)
bg <- lapply(1:4, function(x) {
sample_background(
data = abies_pf,
x = "x",
y = "y",
n = sum(abies_pf$.part == x) * 10,
method = "random",
rlayer = block_layer,
maskval = x,
calibarea = ca
)
}) %>%
bind_rows()
bg <- sdm_extract(data = bg, x = "x", y = "y", env_layer = block_layer)
# Sample a number of pseudo-absences equal to the number of presences in each partition
set.seed(10)
psa <- lapply(1:4, function(x) {
sample_pseudoabs(
data = abies_pf,
x = "x",
y = "y",
n = sum(abies_pf$.part == x),
method = "random",
rlayer = block_layer,
maskval = x,
calibarea = ca
)
}) %>%
bind_rows()
psa <- sdm_extract(data = psa, x = "x", y = "y", env_layer = block_layer)
cl <- c("#280B50", "#9E2962", "#F47C15", "#FCFFA4")
plot(block_layer, col="gray80", legend=FALSE, axes=FALSE)
points(bg[,c("x", "y")], col=cl[bg$.part], cex=0.8) # Background points
points(psa[,c("x", "y")], bg=cl[psa$.part], cex=0.8, pch=21) # Pseudo-absences
# Bind presences and pseudo-absences
abies_pa <- bind_rows(abies_pf, psa)
abies_pa # Presence-Pseudo-absence database
bg # Background points
```
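Before fitting any models, it is worth confirming that the presence-pseudo-absence table really is balanced within each spatial block; a quick tally with *dplyr* does this:

```{r balance check, eval = FALSE}
# Count presences (pr_ab = 1) and pseudo-absences (pr_ab = 0) per block
abies_pa %>%
  dplyr::count(.part, pr_ab)
```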
Extract environmental data for the presence-absence and background data. View the distributions of presence points, pseudo-absence points, and background points using the blocks as a reference map.
```{r extract environmental data for two datasets}
abies_pa <- abies_pa %>%
sdm_extract(
data = .,
x = "x",
y = "y",
env_layer = somevar,
filter_na = TRUE
)
bg <- bg %>%
sdm_extract(
data = .,
x = "x",
y = "y",
env_layer = somevar,
filter_na = TRUE
)
```
## Fit models with tune_max, fit_gau, and fit_glm
Now we fit our models. The *flexsdm* package offers a wide range of modeling options, from traditional statistical methods like GLMs and GAMs to machine learning methods like random forests and support vector machines. For each modeling method, *flexsdm* provides both fit_ and tune_ functions, which allow users to use default settings or adjust hyperparameters depending on their research goals. Here, we will test out tune_max() (tuned Maximum Entropy model), fit_gau() (fit Gaussian Process model), and fit_glm() (fit Generalized Linear Model). For each model, we selected three threshold values to generate binary suitability predictions: the threshold that maximizes TSS (max_sens_spec), the threshold at which sensitivity and specificity are equal (equal_sens_spec), and the threshold at which the Sorensen index is highest (max_sorensen). In this example, we selected TSS as the performance metric used for selecting the best combination of hyperparameter values in the tuned Maximum Entropy model.
```{r fitting models}
t_max <- tune_max(
data = abies_pa,
response = "pr_ab",
predictors = names(somevar),
background = bg,
partition = ".part",
grid = expand.grid(
regmult = seq(0.1, 3, 0.5),
classes = c("l", "lq", "lqhpt")
),
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
metric = "TSS",
clamp = TRUE,
pred_type = "cloglog"
)
f_gau <- fit_gau(
data = abies_pa,
response = "pr_ab",
predictors = names(somevar),
partition = ".part",
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen")
)
f_glm <- fit_glm(
data = abies_pa,
response = "pr_ab",
predictors = names(somevar),
partition = ".part",
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
poly = 2
)
```
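The same presence-pseudo-absence table can be passed to the package's other algorithms to enlarge the candidate set. As an illustration, a random forest could be added with fit_raf(); the sketch below assumes it follows the same interface as fit_gau() above (check ?fit_raf in your installed version):

```{r raf sketch, eval = FALSE}
# Sketch: a random forest model with the same partitions and thresholds
f_raf <- fit_raf(
  data = abies_pa,
  response = "pr_ab",
  predictors = names(somevar),
  partition = ".part",
  thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen")
)
```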
## Fit an ensemble model
Spatial predictions from different SDM algorithms can vary substantially, and ensemble modeling has become increasingly popular. With the fit_ensemble() function, users can easily produce an ensemble SDM based on any of the individual fit_ and tune_ models included in the package. In this example, we fit an ensemble model for red fir based on the weighted average of the three individual models. We used the same threshold values and performance metric that were implemented in the individual models.
```{r ensemble model}
ens_m <- fit_ensemble(
models = list(t_max, f_gau, f_glm),
ens_method = "meanw",
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
thr_model = "max_sens_spec",
metric = "TSS"
)
ens_m$performance
```
The output of *flexsdm* model objects allows you to easily compare metrics across models, such as AUC or TSS. For example, we can use the sdm_summarize() function to merge model performance tables.
```{r}
model_perf <- sdm_summarize(list(t_max, f_gau, f_glm, ens_m))
model_perf
```
## Project the ensemble model
Next we project the ensemble model in space across the entire extent of our environmental layers, the California Floristic Province, using the sdm_predict() function. This function can be used to predict species suitability across any area, for either current or future conditions. In this example, we only project the ensemble model with one threshold, though users have the option to project multiple models with multiple threshold values. Here, we also specify that we want the function to return a SpatRaster with continuous suitability values above the threshold (con_thr = TRUE).
```{r predict models}
pr_1 <- sdm_predict(
models = ens_m,
pred = somevar,
thr = "max_sens_spec",
con_thr = TRUE,
predict_area = NULL
)
unconstrained <- pr_1$meanw[[1]]
names(unconstrained) <- "unconstrained"
cl <- c("#FDE725", "#B3DC2B", "#6DCC57", "#36B677", "#1F9D87", "#25818E", "#30678D", "#3D4988", "#462777", "#440154")
plot(unconstrained, col=cl, legend=FALSE, axes=FALSE)
```
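The prediction is a regular *terra* SpatRaster, so it can be written to disk for use in a GIS (the output file name below is illustrative):

```{r write raster sketch, eval = FALSE}
# Save the continuous suitability prediction as a GeoTIFF
terra::writeRaster(unconstrained, "red_fir_suitability.tif", overwrite = TRUE)
```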
## Constrain the model with msdm_posterior
Finally, *flexsdm* offers functions that help correct overprediction of SDMs based on occurrence records and suitability patterns. In this example we constrained the ensemble model using the "occurrence-based restriction" method, which assumes that suitable patches intercepting species occurrences are more likely part of the species distribution than suitable patches that do not intercept any occurrences. Because all methods of the msdm_posteriori() function work with presences, it is important to always use the original database (i.e., presences that have not been spatially or environmentally filtered).
All of the methods available in the msdm_posteriori() function are based on Mendes et al. (2020).
```{r constrain with msdm}
thr_val <- ens_m$performance %>%
dplyr::filter(threshold == "max_sens_spec") %>%
pull(thr_value)
m_pres <- msdm_posteriori(
records = abies_p,
x = "x",
y = "y",
pr_ab = "pr_ab",
cont_suit = pr_1$meanw[[1]],
method = c("obr"),
thr = c("sensitivity", sens = thr_val),
buffer = NULL
)
constrained <- m_pres$meanw[[1]]
names(constrained) <- "constrained"
cl <- c("#FDE725", "#B3DC2B", "#6DCC57", "#36B677", "#1F9D87", "#25818E", "#30678D", "#3D4988", "#462777", "#440154")
plot(constrained, col=cl, legend=FALSE, axes=FALSE)
```
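To see the effect of the occurrence-based restriction, the unconstrained and constrained predictions can be plotted side by side; both layers share the extent and resolution of the predictors, so they can be combined into one SpatRaster:

```{r compare sketch, eval = FALSE}
# Compare predictions before and after the occurrence-based restriction
plot(c(unconstrained, constrained), col = cl, legend = FALSE, axes = FALSE)
```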
#=========#=========#=========#=========#=========#=========#=========#
**Vignette still under construction and subject to change**
#=========#=========#=========#=========#=========#=========#=========#