/
betareg.Rmd
250 lines (211 loc) · 9.33 KB
/
betareg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
title: "Modeling Rates/Proportions using Beta Regression with rstanarm"
author: "Imad Ali, Jonah Gabry and Ben Goodrich"
date: "`r Sys.Date()`"
output:
html_vignette:
toc: yes
---
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{stan_betareg: Models for Rate/Proportion Data}
-->
```{r, child="children/SETTINGS-knitr.txt"}
```
```{r, child="children/SETTINGS-gg.txt"}
```
# Introduction
This vignette explains how to model continuous outcomes on the open unit
interval using the `stan_betareg` function in the __rstanarm__ package.
```{r, child="children/four_steps.txt"}
```
Steps 3 and 4 are covered in more depth by the vignette entitled ["How to Use
the __rstanarm__ Package"](rstanarm.html). This vignette focuses on Step 1 when
the likelihood is the product of beta distributions.
# Likelihood
Beta regression uses the beta distribution as the likelihood for the data,
$$
f(y_i | a, b) = \frac{y_i^{(a-1)}(1-y_i)^{(b-1)}}{B(a,b)}
$$
where $B(\cdot)$ is the beta function. The shape parameters for the distribution
are $a$ and $b$ and enter into the model according to the following transformations,
$$
a = \mu\cdot\phi \\
b = (1-\mu)\cdot\phi
$$
Let $g_1(\cdot)$ be some link function. Then, in the specification of the shape
parameters above, $\mu = g_1^{-1}(\mathbf{X}\boldsymbol{\beta})$, where $\boldsymbol{X}$
is a $N\times K$ dimensional matrix of predictors, and $\boldsymbol{\beta}$ is a $K$
dimensional vector of parameters associated with each predictor.
In the simplest case (with only one set of regressors), $\phi$ is a scalar
parameter. Alternatively, it is possible to model $\phi$ using a second set of
regressors $\mathbf{Z}$. In this context let $g_2(\cdot)$ be some link function that is not
necessarily identical to $g_1(\cdot)$. Then $\phi = g_2^{-1}(\mathbf{Z}\boldsymbol{\gamma})$,
where $\boldsymbol{\gamma}$ is a $J$ dimensional vector of parameters associated with
the $N\times J$ dimensional matrix of predictors $\mathbf{Z}$.
After substituting the shape parameter values in, the likelihood used in beta regression takes the
following form,
$$
f(y_i | \mu, \phi) = \frac{y_i^{(\mu\phi-1)}(1-y_i)^{((1-\mu)\phi-1)}}{B(\mu\phi,(1-\mu)\phi)}
$$
# Priors
A full Bayesian analysis requires specifying prior distributions
$f(\boldsymbol{\beta})$ and $f(\phi)$ for the vector of
regression coefficients and $\phi$. When using `stan_betareg`, these
distributions can be set using the `prior_intercept`, `prior`, and `prior_phi`
arguments. The `stan_betareg` function supports a variety of prior distributions,
which are explained in the __rstanarm__ documentation
(`help(priors, package = 'rstanarm')`).
When modeling $\phi$ with a linear predictor a full Bayesian analysis requires
specifying the prior distributions $f(\boldsymbol{\beta})$ and $f(\boldsymbol{\gamma})$.
In `stan_betareg` the prior distributions on $\boldsymbol{\gamma}$ can be set using
the `prior_intercept_z` and `prior_z` arguments.
As an example, suppose we have $K$ predictors and believe --- prior to seeing
the data --- that $\beta_1, \dots, \beta_K$ and $\phi$ are as likely to be positive
as they are to be negative, but are highly unlikely to be far from zero. These
beliefs can be represented by normal distributions with mean zero and a small
scale (standard deviation). To give $\phi$ and each of the $\beta$s this prior
(with a scale of 1, say), in the call to `stan_betareg` we would include the
arguments `prior_intercept = normal(0,1)`, `prior = normal(0,1)`, and
`prior_phi = normal(0,1)`.
If, on the other hand, we have less a priori confidence that the parameters will
be close to zero then we could use a larger scale for the normal distribution
and/or a distribution with heavier tails than the normal like the Student t
distribution. __Step 1__ in the "How to Use the __rstanarm__ Package" vignette
discusses one such example.
After fitting the model we can use the `prior_summary` function to print
information about the prior distributions used when fitting the model.
# Posterior
When using only a *single set of regressors*, the posterior distribution of
$\boldsymbol{\beta}$ and $\phi$ is proportional to the product of the likelihood
contributions, the $K$ priors on the $\beta_k$ parameters, and $\phi$,
$$
f(\boldsymbol{\beta},\phi|\mathbf{y},\mathbf{X}) \propto
\prod_{i=1}^N f(y_i | a, b) \times
\prod_{k=1}^K f(\beta_k) \times
f(\phi)
$$
When using *two sets of regressors*, the posterior distribution of $\boldsymbol{\beta}$ and
$\boldsymbol{\gamma}$ is proportional to the product of the likelihood contribution, the $K$
priors on the $\beta_k$ parameters, and the $J$ priors on the $\gamma_j$ parameters,
$$
f(\boldsymbol{\beta},\boldsymbol{\gamma}|\mathbf{y},\mathbf{X}) \propto
\prod_{i=1}^N f(y_i | a, b) \times
\prod_{k=1}^K f(\beta_k) \times
\prod_{j=1}^J f(\gamma_j)
$$
# An Example Using Simulated Data
In this example the outcome variable $\mathbf{y}$ is simulated in a way that
warrants the use of beta regression. It is worth mentioning that the data
generation process is quite convoluted, which is apparent in the identification
of the likelihood above.
The data simulated below uses the logistic link function on the first set of
regressors and the log link function on the second set of regressors.
```{r simulated-data, fig.height=5}
SEED <- 1234
set.seed(SEED)
eta <- c(1, -0.2)
gamma <- c(1.8, 0.4)
N <- 200
x <- rnorm(N, 2, 2)
z <- rnorm(N, 0, 2)
mu <- binomial(link = logit)$linkinv(eta[1] + eta[2]*x)
phi <- binomial(link = log)$linkinv(gamma[1] + gamma[2]*z)
y <- rbeta(N, mu * phi, (1 - mu) * phi)
dat <- data.frame(cbind(y, x, z))
hist(dat$y, col = "darkgrey", border = F, main = "Distribution of Outcome Variable", xlab = "y", breaks = 20, freq = F)
```
The model can be fit by calling `stan_betareg`, using the appropriate link functions.
```{r simulated-fit, results = "hide"}
library(rstanarm)
fit1 <- stan_betareg(y ~ x | z, data = dat, link = "logit", link.phi = "log",
cores = 2, seed = 12345)
fit2 <- stan_betareg(y ~ -1 + x , data = dat, link = "logit", link.phi = "log",
cores = 2, seed = 12345)
round(coef(fit1), 2)
round(coef(fit2), 2)
```
``` {r simulated-fit-print, echo=FALSE}
round(coef(fit1), 2)
round(coef(fit2), 2)
```
For clarity we can use `prior_summary` to print the information about the prior
distributions used to fit the models. The priors used in `fit1` are provided below.
``` {r print-priors}
prior_summary(fit1)
```
The usual posterior analyses are available in **rstanarm**. The plots below
illustrate simulated values of the outcome variable. The incorrect model
noticeably fails to capture the top of the distribution consistently in
comparison to the true model.
```{r simulated-analysis, fig.height=5}
library(ggplot2)
library(bayesplot)
bayesplot_grid(
pp_check(fit1), pp_check(fit2),
xlim = c(0,1),
ylim = c(0,4),
titles = c("True Model: y ~ x | z", "False Model: y ~ x - 1"),
grid_args = list(ncol = 2)
)
```
We can also compare models by evaluating the expected log pointwise predictive
density (`elpd`), which can be calculated using the `loo` method, which provides
an interface for __rstanarm__ models to the functionality in the __loo__
package.
``` {r simulated-loo}
loo1 <- loo(fit1)
loo2 <- loo(fit2)
loo_compare(loo1, loo2)
```
The difference in `elpd` is negative indicating that the expected predictive
accuracy for the first model is higher.
# An Example Using Gasoline Data
In some applied contexts it may be necessary to work with an outcome variable
that is a proportion. If the proportion is bound on the open unit interval then
beta regression can be considered a reasonable estimation method. The `betareg`
package provides a dataset on the proportion of crude oil converted to gasoline
after distillation and fractionation. This variable is defined as yield. Below
`stan_betareg` is used to model yield as a function of temperature, pressure,
and the batch of conditions.
```{r, gas-fit, results="hide"}
library(rstanarm)
data("GasolineYield", package = "betareg")
gas_fit1 <- stan_betareg(yield ~ temp + batch, data = GasolineYield, link = "logit",
seed = 12345)
gas_fit2 <- stan_betareg(yield ~ temp + batch | pressure,
data = GasolineYield, link = "logit",
seed = 12345)
round(coef(gas_fit1), 2)
round(coef(gas_fit2), 2)
```
``` {r, gas-print, echo=FALSE}
round(coef(gas_fit1), 2)
round(coef(gas_fit2), 2)
```
The plots below illustrate simulated values of gasoline yield. While the first
model accounts for variation in batch conditions its predictions looks somewhat
uniform rather than resembling the peaked and right-skewed behavior of the true data.
The second model does a somewhat better job at capturing the shape of the
distribution, however its location is off as it is centered around 0.50 rather
than 0.20.
```{r gas-analysis, fig.height=5}
library(ggplot2)
bayesplot_grid(
pp_check(gas_fit1), pp_check(gas_fit2),
xlim = c(0,1),
ylim = c(0,5),
titles = c("gas_fit1", "gas_fit2"),
grid_args = list(ncol = 2)
)
```
``` {r, gas-loo}
gas_loo1 <- loo(gas_fit1)
gas_loo2 <- loo(gas_fit2)
loo_compare(gas_loo1, gas_loo2)
```
Evaluating the expected log predictive distribution using `loo`
reveals that the second of the two models is preferred.
# References
Ferrari, SLP and Cribari-Neto, F (2004) "Beta Regression for Modeling Rates
and Proportions". _Journal of Applied Statistics._ Vol. 31, No. 07, p799-815.