-
Notifications
You must be signed in to change notification settings - Fork 2
/
mr_mash_intro.Rmd
219 lines (181 loc) · 8.12 KB
/
mr_mash_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
---
title: "Introduction to mr.mash"
author: Peter Carbonetto & Fabio Morgante
date: "`r Sys.Date()`"
output:
html_document:
toc: no
toc_float: no
highlight: textmate
theme: readable
vignette: >
%\VignetteIndexEntry{Introduction to mr.mash}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
The aim of this vignette is to introduce the basic steps of a
mr.mash analysis through a toy example. To learn more about
mr.mash, please see the [paper][mr-mash-biorxiv].
```{r knitr-opts, include=FALSE}
knitr::opts_chunk$set(comment = "#",collapse = TRUE,
fig.align = "center",dpi = 120)
```
First, we set the seed to make the results more easily reproducible,
and we load the "mr.mash.alpha" package.
```{r load-pkgs}
library(mr.mash.alpha)
set.seed(12)
```
We illustrate the application of mr.mash to a data set simulated from
a multivariate, multiple linear regression with 5 responses in which
the coefficients are the same for all responses. In the target
application considered in the paper---prediction of multi-tissue gene
expression from genotypes---this would correspond to the situation in
which we would like to predict expression of a single gene in 5
different tissues from genotype data at multiple SNPs, and the SNPs
have the same effects on gene expression in all 5 tissues. (In
multi-tissue gene expression we would normally like to predict
expression of many genes, but to simplify this vignette here we
illustrate the key ideas with a single gene.)
Although this simulation is not particularly realistic, this is
meant to illustrate the benefits of mr.mash: by modeling the sharing
of effects across tissues, mr.mash is able to more accurately estimate
the effects in multiple tissues, and therefore is able to obtain
better predictions.
We start by simulating 150 samples from a multivariate, multiple
linear regression model in which 5 out of the 800 variables (SNPs)
affect the 5 responses (expression levels).
```{r sim-data-1, results="hide"}
dat <- simulate_mr_mash_data(n = 150,p = 800,p_causal = 5,r = 5,pve = 0.5,
V_cor = 0.25)
```
Next we split the samples into a training set (with 100 samples) and
test set (with 50 samples).
```{r sim-data-2}
ntest <- 50
Ytrain <- dat$Y[-(1:ntest),]
Xtrain <- dat$X[-(1:ntest),]
Ytest <- dat$Y[1:ntest,]
Xtest <- dat$X[1:ntest,]
```
Define the mr.mash prior
------------------------
To run mr.mash, we need to first specify the covariances in the
mixture of normals prior. The idea is that the chosen collection of
covariance matrices should include a variety of potential effect
sharing patterns, and in the model fitting stage the prior should then
assign most weight to the sharing patterns that are present in the
data, and little or no weight on patterns that are inconsistent with
the data. In general, we recommend learning
["data-driven" covariance matrices][mashr-dd-vignette]. But here, for
simplicity, we instead use "canonical" covariances which are not
adaptive, but nonetheless well suited for this toy example since the
true effects are the same across responses/tissues.
```{r choose-prior-1}
S0 <- compute_canonical_covs(r = 5,singletons = TRUE,
hetgrid = seq(0,1,0.25))
```
This gives a mixture of 10 covariance matrices capturing a variety of
"canonical" effect-sharing patterns:
```{r choose-prior-2}
names(S0)
```
To illustrate the benefits of modeling a variety of effect-sharing
patterns, we also try out mr.mash with a simpler mixture of covariance
matrices in which the effects are effectively independent across
tissues. Although this may seem to be a very poor choice of prior,
particularly for this example, it turns out that several multivariate
regression methods assume, implicitly or explicitly, this prior.
```{r choose-prior-3}
S0_ind <- compute_canonical_covs(r = 5,singletons = FALSE,
hetgrid = c(0,0.001,0.01))
names(S0_ind)
```
Regardless of the covariance matrices are chosen, it is recommended to
also consider a variety of effect scales in the prior. This is
normally achieved in mr.mash by expanding the mixture across a
specifed grid of scaling factors. Here we choose this grid in an
adaptive fashion based on the data:
```{r choose-prior-4}
univ_sumstats <- compute_univariate_sumstats(Xtrain,Ytrain,standardize = TRUE)
scaling_grid <- autoselect.mixsd(univ_sumstats,mult = sqrt(2))^2
S0 <- expand_covs(S0,scaling_grid)
S0_ind <- expand_covs(S0_ind,scaling_grid)
```
Fit a mr.mash model to the data
-------------------------------
Having specified the mr.mash prior, we are now ready to fit a mr.mash
model to the training data (this may take a few minutes to run). Given
that the majority of the SNPs have no effect, we initialize the mixture
weights with 99\% of the weight on the null component and the rest of the
weight split equally across the remaining components.
```{r fit-mr-mash-1, results="hide"}
null_weight <- 0.99
non_null_weight <- 1-null_weight
w0_init <- c(null_weight, rep(non_null_weight/(length(S0)-1), (length(S0)-1)))
fit <- mr.mash(X=Xtrain,Y=Ytrain,S0=S0, w0=w0_init, update_V = TRUE)
```
And for comparison we fit a second mr.mash model using the simpler and
less flexible prior:
```{r fit-mr-mash-2, results="hide"}
w0_init <- c(null_weight, rep(non_null_weight/(length(S0_ind)-1), (length(S0_ind)-1)))
fit_ind <- mr.mash(X=Xtrain,Y=Ytrain,S0=S0_ind,w0=w0_init,update_V = TRUE)
```
(Notice that the less complex model also takes less time to fit.)
For prediction, the key output is the posterior mean estimates of the
regression coefficients, stored in the "mu1" output. Let's compare the
estimates to the ground truth:
```{r plot-coefs, fig.height=3.5, fig.width=6}
par(mfrow = c(1,2))
plot(dat$B,fit_ind$mu1,pch = 20,xlab = "true",ylab = "estimated",
main = sprintf("cor = %0.3f",
cor(as.vector(dat$B),as.vector(fit_ind$mu1))))
abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
plot(dat$B,fit$mu1,pch = 20,xlab = "true",ylab = "estimated",
main = sprintf("cor = %0.3f",
cor(as.vector(dat$B),as.vector(fit$mu1))))
abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
```
As expected, the coefficients on the left-hand side obtained using an
"independent effects" prior are not as accurate as the the
coefficients estimated using the more flexible prior (right-hand side).
While perhaps not of primary interest, for diagnostic purposes it is
often helpful to examine the estimated mixture weights in the prior as
well as the estimated residual covariance matrix.
Inspecting the top prior mixture weights from the better model, it is
helpful to see that the "null" and "shared1" components are among the
top components by weight. (The top component is the null component
because most of the SNPs have no effect on gene expression.)
```{r prior-mixture-weights}
head(sort(fit$w0,decreasing = TRUE),n = 10)
```
Also, reassuringly, the estimated residual variance-covariance matrix
is close to the matrix used to simulate the data:
```{r resid-var}
dat$V
fit$V
```
Use the fitted mr.mash model to make predictions
------------------------------------------------
We can use the fitted mr.mash model to predict gene expression from
a genotype sample, including a sample not included in the training
set. This is implemented by the "predict" method. Let's compare the
predictions from the two mr.mash models:
```{r plot-pred-test, fig.height=3.5, fig.width=6}
par(mfrow = c(1,2))
Ypred <- predict(fit,Xtest)
Ypred_ind <- predict(fit_ind,Xtest)
plot(Ytest,Ypred_ind,pch = 20,col = "darkblue",xlab = "true",
ylab = "predicted",
main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred_ind))))
abline(a = 0,b = 1,col = "magenta",lty = "dotted")
plot(Ytest,Ypred,pch = 20,col = "darkblue",xlab = "true",
ylab = "predicted",
main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred))))
abline(a = 0,b = 1,col = "magenta",lty = "dotted")
```
Indeed, mr.mash with the more flexible prior (right-hand plot)
produces more accurate predictions than mr.mash with the "independent
effects" prior.
[mr-mash-biorxiv]: https://doi.org/10.1101/2022.11.22.517471
[mashr-dd-vignette]: https://stephenslab.github.io/mashr/articles/intro_mash_dd.html