/
loo2-example.Rmd
304 lines (235 loc) · 12 KB
/
loo2-example.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
---
title: "Using the loo package (version >= 2.0.0)"
author: "Aki Vehtari and Jonah Gabry"
date: "`r Sys.Date()`"
output:
html_vignette:
toc: yes
params:
EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Using the loo package}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```
```{r, child="children/SEE-ONLINE.txt", eval = if (isTRUE(exists("params"))) !params$EVAL else TRUE}
```
# Introduction
This vignette demonstrates how to use the __loo__ package to carry out
Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO)
for purposes of model checking and model comparison.
In this vignette we can't provide all necessary background information on
PSIS-LOO and its diagnostics (Pareto $k$ and effective sample size), so we
encourage readers to refer to the following papers for more details:
* Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. _Statistics and Computing_. 27(5), 1413--1432. \doi:10.1007/s11222-016-9696-4. Links: [published](https://link.springer.com/article/10.1007/s11222-016-9696-4) | [preprint arXiv](https://arxiv.org/abs/1507.04544).
* Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2024).
Pareto smoothed importance sampling. *Journal of Machine Learning Research*,
25(72):1-58. [PDF](https://jmlr.org/papers/v25/19-556.html)
# Setup
In addition to the __loo__ package, we'll also be using __rstanarm__ and
__bayesplot__:
```{r setup, message=FALSE}
library("rstanarm")
library("bayesplot")
library("loo")
```
# Example: Poisson vs negative binomial for the roaches dataset
## Background and model fitting
The Poisson and negative binomial regression models used below in our example,
as well as the `stan_glm` function used to fit the models,
are covered in more depth in the __rstanarm__ vignette
[_Estimating Generalized Linear Models for Count Data with rstanarm_](http://mc-stan.org/rstanarm/articles/count.html). In the rest of
this vignette we will assume the reader is already familiar with these kinds of
models.
### Roaches data
The example data we'll use comes from Chapter 8.3 of [Gelman and Hill
(2007)](http://www.stat.columbia.edu/~gelman/arm/). We want to make inferences
about the efficacy of a certain pest management system at reducing the number of
roaches in urban apartments. Here is how Gelman and Hill describe the experiment
and data (pg. 161):
> the treatment and control were applied to 160 and 104 apartments, respectively, and the outcome measurement $y_i$ in each apartment $i$ was the number of roaches caught in a set of traps. Different apartments had traps for different numbers of days
In addition to an intercept, the regression predictors for the model are
`roach1`, the pre-treatment number of roaches (rescaled above to be in units of
hundreds), the treatment indicator `treatment`, and a variable indicating
whether the apartment is in a building restricted to elderly residents `senior`.
Because the number of days for which the roach traps were used is not the same
for all apartments in the sample, we use the `offset` argument to specify that
`log(exposure2)` should be added to the linear predictor.
```{r data}
# the 'roaches' data frame is included with the rstanarm package
data(roaches)
str(roaches)
# rescale to units of hundreds of roaches
roaches$roach1 <- roaches$roach1 / 100
```
### Fit Poisson model
We'll fit a simple Poisson regression model using the `stan_glm` function from
the __rstanarm__ package.
```{r count-roaches-mcmc, results="hide"}
fit1 <-
stan_glm(
formula = y ~ roach1 + treatment + senior,
offset = log(exposure2),
data = roaches,
family = poisson(link = "log"),
prior = normal(0, 2.5, autoscale = TRUE),
prior_intercept = normal(0, 5, autoscale = TRUE),
seed = 12345
)
```
Usually we would also run posterior predictive checks as shown in the
__rstanarm__ vignette
[Estimating Generalized Linear Models for Count Data with rstanarm](http://mc-stan.org/rstanarm/articles/count.html), but
here we focus only on methods provided by the __loo__ package.
<br>
## Using the __loo__ package for model checking and comparison
_Although cross-validation is mostly used for model comparison, it is also useful for model checking._
### Computing PSIS-LOO and checking diagnostics
We start by computing PSIS-LOO with the `loo` function. Since we fit our model
using __rstanarm__ we can use the `loo` method for `stanreg` objects (fitted
model objects from __rstanarm__), which doesn't require us to first extract the
pointwise log-likelihood values. If we had written our own Stan program instead
of using __rstanarm__ we would pass an array or matrix of log-likelihood values
to the `loo` function (see, e.g. `help("loo.array", package = "loo")`). We'll
also use the argument `save_psis = TRUE` to save some intermediate results to be
re-used later.
```{r loo1}
loo1 <- loo(fit1, save_psis = TRUE)
```
`loo` gives us warnings about the Pareto diagnostics, which indicate that for
some observations the leave-one-out posteriors are different enough from the
full posterior that importance-sampling is not able to correct the difference.
We can see more details by printing the `loo` object.
```{r print-loo1}
print(loo1)
```
The table shows us a summary of Pareto $k$ diagnostic, which is used to assess
the reliability of the estimates. In addition to the proportion of leave-one-out
folds with $k$ values in different intervals, the minimum of the effective
sample sizes in that category is shown to give idea why higher $k$ values are
bad. Since we have some $k>1$, we are not able to compute an estimate for the
Monte Carlo standard error (SE) of the expected log predictive density
(`elpd_loo`) and `NA` is displayed. (Full details on the interpretation of
the Pareto $k$ diagnostics are available in the Vehtari, Gelman, and Gabry
(2017) and Vehtari, Simpson, Gelman, Yao, and Gabry (2024) papers referenced
at the top of this vignette.)
In this case the `elpd_loo` estimate should not be considered reliable. If we
had a well-specified model we would expect the estimated effective number
of parameters (`p_loo`) to be smaller than or similar to the total number of
parameters in the model. Here `p_loo` is almost 300, which is about 70 times
the total number of parameters in the model, indicating severe model
misspecification.
### Plotting Pareto $k$ diagnostics
Using the `plot` method on our `loo1` object produces a plot of the $k$ values
(in the same order as the observations in the dataset used to fit the model)
with horizontal lines corresponding to the same categories as in the
printed output above.
```{r plot-loo1, out.width = "70%"}
plot(loo1)
```
This plot is useful to quickly see the distribution of $k$ values, but it's
often also possible to see structure with respect to data ordering. In our case
this is mild, but there seems to be a block of data that is somewhat easier to
predict (indices around 90--150). Unfortunately even for these data points we
see some high $k$ values.
### Marginal posterior predictive checks
The `loo` package can be used in combination with the `bayesplot` package for
leave-one-out cross-validation marginal posterior predictive checks [Gabry et al
(2018)](https://arxiv.org/abs/1709.01449). LOO-PIT values are cumulative
probabilities for $y_i$ computed using the LOO marginal predictive distributions
$p(y_i|y_{-i})$. For a good model, the distribution of LOO-PIT values should be
uniform. In the following QQ-plot the LOO-PIT values for our model (y-axi) is
compared to standard uniform distribution (x-axis).
```{r ppc_loo_pit_overlay}
yrep <- posterior_predict(fit1)
ppc_loo_pit_qq(
y = roaches$y,
yrep = yrep,
lw = weights(loo1$psis_object)
)
```
The excessive number of LOO-PIT values close to 0 indicates that the model is
under-dispersed compared to the data, and we should consider a model that allows
for greater dispersion.
## Try alternative model with more flexibility
Here we will try [negative binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution)
regression, which is commonly used for overdispersed count data.
Unlike the Poisson distribution, the negative binomial distribution
allows the conditional mean and variance of $y$ to differ.
```{r count-roaches-negbin, results="hide"}
fit2 <- update(fit1, family = neg_binomial_2)
```
```{r loo2}
loo2 <- loo(fit2, save_psis = TRUE, cores = 2)
print(loo2)
```
```{r plot-loo2}
plot(loo2, label_points = TRUE)
```
Using the `label_points` argument will label any $k$ values larger than the
diagnostic threshold with
the index of the corresponding data point. These high values are often the
result of model misspecification and frequently correspond to data points that
would be considered ``outliers'' in the data and surprising according to the
model [Gabry et al (2019)](https://arxiv.org/abs/1709.01449). Unfortunately,
while large $k$ values are a useful indicator of model misspecification, small
$k$ values are not a guarantee that a model is well-specified.
If there are a small number of problematic $k$
values then we can use a feature in __rstanarm__ that lets us refit the model
once for each of these problematic observations. Each time the model is refit,
one of the observations with a high $k$ value is omitted and the LOO
calculations are performed exactly for that observation. The results are then
recombined with the approximate LOO calculations already carried out for the
observations without problematic $k$ values:
```{r reloo}
if (any(pareto_k_values(loo2) > 0.7)) {
loo2 <- loo(fit2, save_psis = TRUE, k_threshold = 0.7)
}
print(loo2)
```
In the print output we can see that the Monte Carlo SE is small compared to
the other uncertainties.
On the other hand, `p_loo` is about 7 and still a bit higher than the total
number of parameters in the model. This indicates that there is almost certainly
still some degree of model misspecification, but this is much better than the
`p_loo` estimate for the Poisson model.
For further model checking we again examine the LOO-PIT values.
```{r ppc_loo_pit_overlay-negbin}
yrep <- posterior_predict(fit2)
ppc_loo_pit_qq(roaches$y, yrep, lw = weights(loo2$psis_object))
```
The plot for the negative binomial model looks better than the Poisson plot, but
we still see that this model is not capturing all of the essential features in
the data.
## Comparing the models on expected log predictive density
We can use the `loo_compare` function to compare our two models on
expected log predictive density (ELPD) for new data:
```{r loo_compare}
loo_compare(loo1, loo2)
```
The difference in ELPD is much larger than several times the estimated standard
error of the difference again indicating that the negative-binomial model is
xpected to have better predictive performance than the Poisson model. However,
according to the LOO-PIT checks there is still some misspecification, and a
reasonable guess is that a hurdle or zero-inflated model would be an improvement
(we leave that for another case study).
<br>
# References
Gabry, J., Simpson, D., Vehtari, A., Betancourt, M. and Gelman, A. (2019),
Visualization in Bayesian workflow. _J. R. Stat. Soc. A_, 182: 389-402.
\doi:10.1111/rssa.12378. ([journal version](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12378),
[arXiv preprint](https://arxiv.org/abs/1709.01449),
[code on GitHub](https://github.com/jgabry/bayes-vis-paper))
<a id="gabry2019"></a>
Gelman, A. and Hill, J. (2007). _Data Analysis Using Regression and
Multilevel/Hierarchical Models._ Cambridge University Press, Cambridge, UK.
Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model
evaluation using leave-one-out cross-validation and WAIC. _Statistics and
Computing_. 27(5), 1413--1432. \doi:10.1007/s11222-016-9696-4.
[online](https://link.springer.com/article/10.1007/s11222-016-9696-4),
[arXiv preprint arXiv:1507.04544](https://arxiv.org/abs/1507.04544).
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2024).
Pareto smoothed importance sampling. *Journal of Machine Learning Research*,
25(72):1-58. [PDF](https://jmlr.org/papers/v25/19-556.html)