/
05-stan-timeseries.Rmd
208 lines (159 loc) · 7.23 KB
/
05-stan-timeseries.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
---
title: "Fitting a timeseries model with Stan"
author: "Sean Anderson"
output:
html_document:
toc: true
toc_float: true
---
# Goals
- Gain exposure to programming a (timeseries) model in Stan that isn't a simple
linear regression.
- Practice interacting with Stan.
- Practice drawing inference from the posterior samples on a parameter.
- Practice comparing a parameter posterior to its prior distribution to assess
the information in the data.
# Package loading and option setting
We'll start by loading packages and setting some options:
```{r, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 6,
fig.asp = 0.618,
fig.align = "center"
)
show_file = function(file) {
cat(paste(readLines(file), collapse = "\n"))
}
```
```{r, warning=FALSE, message=FALSE}
library(tidyverse)
theme_set(theme_light())
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```
# The data
We are going to work with the time series of population counts for great herons in the UK. This is population ID 20579 from the Global Population Dynamics Database.
Data from Brooks, B., Traill, L. W., Bradshaw, C. J. A. 2006 Minimum viable
population sizes and global extinction risk are unrelated. Ecology Letters, 9
(4), 375-382.
Let's read in the data:
```{r}
d <- read.csv("data/grey-heron.csv") %>% as_tibble() %>%
select(sample_year, population_untransformed) %>%
rename(year = sample_year, abund_index = population_untransformed)
ggplot(d, aes(year, log(abund_index))) +
geom_point() + geom_line()
```
The large dips in abundance you see are a result of cold snowy winters when the herons couldn't find food.
# The model
Let's fit a Gompertz model with heavy-tailed process error to a time series of grey heron counts from England. We previously fit this model to this data set in:
Anderson, S.C., Branch, T.A., Cooper, A.B., and Dulvy, N.K. 2017. Black-swan events in animal populations. Proc. Natl. Acad. Sci. 114(12): 3252–3257. <https://doi.org/10.1073/pnas.1611525114>.
You can see the basic model we will work with in the file:
```{r}
show_file("stan/gompertz.stan")
```
Take a look at the model and make sure you understand it. This is fundamentally a timeseries model. How is it different from the other models we have worked with so far?
We can fit that model with Stan:
```{r, message=FALSE, warning=FALSE, results='hide'}
m <- stan("stan/gompertz.stan",
data = list(N = nrow(d), y = log(d$abund_index), nu_rate = 0.01),
iter = 2000)
```
Let's look at the parameters from our model:
```{r}
pars <- c("lambda", "b", "sigma", "nu")
print(m, pars = pars)
bayesplot::mcmc_trace(as.array(m), pars = pars)
bayesplot::mcmc_hist(as.array(m), pars = pars)
```
```{r}
post <- rstan::extract(m)
N <- 20
bayesplot::ppc_dens_overlay(y = log(d$abund_index), post$pred[1:N,])
bayesplot::ppc_ecdf_overlay(y = log(d$abund_index), post$pred[1:N,])
```
```{r}
bayesplot::ppc_intervals(y = log(d$abund_index), post$pred)
bayesplot::ppc_ribbon(y = log(d$abund_index), post$pred, prob = 0.5)
bayesplot::ppc_ribbon(y = log(d$abund_index), post$pred, prob = 0.9)
bayesplot::ppc_ribbon(y = d$abund_index, exp(post$pred), prob = 0.9)
```
Let's extract the posterior samples with `rstan::extract()`. Part of the Stan code calculated predictions from our model. Alternatively, we could calculate them in R ourselves.
As a comparison, we could fit the model with normally distributed process error. We will skip that here in the interest of time.
# Fitting the model to a timeseries without extremes
Let's try fitting our model to a second data set where there aren't extremes. This time we will use a data set of population counts for house wrens in the mid 20th century in Illinois, USA. GPDD population ID 28.
Data reference: Kendeigh, S.C. 1982 Bird populations in east central Illinois:
fluctuations, variation and development over a half-century. Illinois
Biological Monographs, 52:1-136.
```{r}
d_hw <- read.csv("data/house-wren.csv") %>% as_tibble() %>%
select(sample_year, population_untransformed) %>%
rename(year = sample_year, abund_index = population_untransformed)
ggplot(d_hw, aes(year, log(abund_index))) +
geom_point() + geom_line()
```
```{r, message=FALSE, warning=FALSE}
m_hw <- stan("stan/gompertz.stan",
data = list(N = nrow(d_hw), y = log(d_hw$abund_index), nu_rate = 0.01))
```
```{r}
print(m_hw, pars = pars)
```
So in this case our estimate of nu (the degrees of freedom parameter) is much higher. In fact, it approximately matches the prior because the data are not informative about heavy tails. In this case it reverts to be effectively normal:
```{r}
prior <- rexp(1e6, rate = 0.01)
prior <- prior[prior > 2]
hist(prior)
median(prior)
abline(v = median(prior), col = "red", lwd = 2)
```
This choice of prior is based on Fernandez C, Steel MFJ (1998) On Bayesian modeling of fat tails and skewness. J. Am. Stat. Assoc. 93(441):359–371.
Since our models are fit in a Bayesian framework, one useful value to calculate might be the probability density below some threshold value for the degrees of freedom parameter. For example:
```{r}
nu <- rstan::extract(m, pars = "nu")[[1]]
nu_hw <- rstan::extract(m_hw, pars = "nu")[[1]]
sum(nu < 10)/length(nu) # or mean(nu < 10)
sum(nu_hw < 10)/length(nu)
```
We've often used a value of 10 as a threshold since nu values much above 10 render distributions that are almost indistinguishable from the normal.
In the first case nu is less than 10 with `r mean(nu < 10)` probability. In the second case, there is only about a `r mean(nu_hw < 10)` probability of nu < 10. Remember that even with the prior, there is some probability that nu < 10.
```{r}
mean(prior < 10)
```
For many other possible heavy-tailed Stan population models, see <https://github.com/seananderson/heavy-tails/blob/master/analysis/1-compile-models.R>
```{r}
prior <- rexp(1e6, rate = 0.01)
prior <- prior[prior > 2]
g1 <- ggplot(tibble(prior), aes(prior)) + geom_density() +
ggtitle("original prior") +
coord_cartesian(xlim = c(0, 100))
prior <- rexp(1e6, rate = 0.005)
prior <- prior[prior > 2]
g2 <- ggplot(tibble(prior), aes(prior)) + geom_density() +
ggtitle("weaker prior") +
coord_cartesian(xlim = c(0, 100))
prior <- rexp(1e6, rate = 0.05)
prior <- prior[prior > 2]
g3 <- ggplot(tibble(prior), aes(prior)) + geom_density() +
ggtitle("stronger prior") +
coord_cartesian(xlim = c(0, 100))
cowplot::plot_grid(g1, g2, g3)
```
```{r, message=FALSE, warning=FALSE}
m_weaker_prior <- stan("stan/gompertz.stan",
data = list(N = nrow(d), y = log(d$abund_index), nu_rate = 0.005))
m_stronger_prior <- stan("stan/gompertz.stan",
data = list(N = nrow(d), y = log(d$abund_index), nu_rate = 0.05))
```
```{r}
nu <- rstan::extract(m, pars = "nu")[[1]]
mean(nu < 10)
nu_weaker_prior <- rstan::extract(m_weaker_prior, pars = "nu")[[1]]
mean(nu_weaker_prior < 10)
nu_stronger_prior <- rstan::extract(m_stronger_prior, pars = "nu")[[1]]
mean(nu_stronger_prior < 10)
```
So in this case, it doesn't really matter among this range of priors. The heron data contain a lot of information about the `nu` parameter. If we repeated this with the wren data, we would see that the posterior of `nu` would closely resemble the prior each time. In this case, there isn't information in the data that supports low values of `nu`.