diff --git a/docs/articles/mr_mash_intro.html b/docs/articles/mr_mash_intro.html index dd33009..360c3d1 100644 --- a/docs/articles/mr_mash_intro.html +++ b/docs/articles/mr_mash_intro.html @@ -33,7 +33,7 @@
@@ -42,6 +42,9 @@vignettes/mr_mash_intro.Rmd
mr_mash_intro.Rmd
The aim of this vignette is to introduce the basic steps of a -mr.mash analysis, fitting the mr.mash model then using -it to make predictions.
-First, we set the seed and load the mr.mash.alpha
R
-package.
The aim of this vignette is to introduce the basic steps of a mr.mash +analysis through a toy example. To learn more about mr.mash, please see +the paper.
+First, we set the seed to make the results more easily reproducible, +and we load the “mr.mash.alpha” package.
-library(mr.mash.alpha)
-set.seed(123)
We start by simulating a data set with 800 individuals, 1000 -predictors and 5 responses. We then 5 causal variables (randomly sampled -from the total 1000) are assigned equal effects across responses and -explain 20% of the total per-response variance. This would be roughly -equivalent to one gene in the “equal effects” scenario in the -mr.mash (with the difference being that genotypes are simulated -here).
+library(mr.mash.alpha)
+set.seed(123)
We illustrate the application of mr.mash to a data set simulated from +a multivariate, multiple linear regression with 5 responses in which the +coefficients are the same for all responses. In the target application +considered in the paper—prediction of multi-tissue gene expression from +genotypes—this would correspond to the situation in which we would like +to predict expression of a single gene in 5 different tissues from +genotype data at multiple SNPs, and the SNPs have the same effects on +gene expression in all 5 tissues. (In multi-tissue gene expression we +would normally like to predict expression of many genes, but to simplify +this vignette here we illustrate the key ideas with a single gene.)
+Although this simulation is not a particularly realistic, this is +meant to illustrate the benefits of mr.mash: by modeling the sharing of +effects across tissues, mr.mash is able to more accurately estimate the +effects in multiple tissues, and therefore is able to obtain better +predictions.
+We start by simulating 150 samples from a multivariate, multiple +linear regression model in which 5 out of the 800 variables (SNPs) +affect the 5 responses (expression levels).
-n <- 800
-p <- 1000
-p_causal <- 5
-r <- 5
-pve <- 0.2
-B_cor <- 1
-out <- simulate_mr_mash_data(n, p, p_causal, r, pve=pve, B_cor=B_cor,
- B_scale=1, X_cor=0, X_scale=1, V_cor=0)
We then split the data into a training set and a test set.
+dat <- simulate_mr_mash_data(n = 150,p = 800,p_causal = 5,r = 5,pve = 0.5,
+ V_cor = 0.25)
Next we split the samples into a training set (with 100 samples) and +test set (with 50 samples).
-Ytrain <- out$Y[-c(1:200),]
-Xtrain <- out$X[-c(1:200),]
-Ytest <- out$Y[c(1:200),]
-Xtest <- out$X[c(1:200),]
ntest <- 50
+Ytrain <- dat$Y[-(1:ntest),]
+Xtrain <- dat$X[-(1:ntest),]
+Ytest <- dat$Y[1:ntest,]
+Xtest <- dat$X[1:ntest,]
To run mr.mash, we need to first specify the covariances in
-the mixture-of-normals prior, which are supposed to capture the effect
-sharing patterns across responses. In this example, we use a mixture of
-“canonical” covariances computed using
-compute_canonical_covs()
. However, “data-driven”
-covariances can also be used – here’s an example
-of how to compute these matrices. Regardless of the type of covariance
-matrices, these are each multiplied by a grid of scaling factors
-computed using autoselect.mixsd()
, which are supposed to
-capture the magnitude of the effects. The expansion is done using
-expand_covs()
, which also adds a matrix of all zeros (our
-spike) when requested. The grid is derived from the regression
-coefficients and their standard errors from univariate simple linear
-regression which can be computed using
-compute_univariate_sumstats()
.
To run mr.mash, we need to first specify the covariances in the +mixture of normals prior. The idea is that the chosen collection of +covariance matrices should include a variety of potential effect sharing +patterns, and in the model fitting stage the prior should then assign +most weight to the sharing patterns that are present in the data, and +little or no weight on patterns that are inconsistent with the data. In +general, we recommend learning “data-driven” +covariance matrices. But here, for simplicity, we instead use +“canonical” covariances which are not adaptive, but nonetheless well +suited for this toy example since the true effects are the same across +responses/tissues.
-univ_sumstats <- compute_univariate_sumstats(Xtrain, Ytrain,
- standardize=TRUE, standardize.response=FALSE)
-grid <- autoselect.mixsd(univ_sumstats, mult=sqrt(2))^2
-S0 <- compute_canonical_covs(ncol(Ytrain), singletons=TRUE,
- hetgrid=c(0, 0.25, 0.5, 0.75, 1))
-S0 <- expand_covs(S0, grid, zeromat=TRUE)
S0 <- compute_canonical_covs(r = 5,singletons = TRUE,
+ hetgrid = seq(0,1,0.25))
This gives a mixture of 10 covariance matrices capturing a variety of +“canonical” effect-sharing patterns:
+
+names(S0)
+# [1] "singleton1" "singleton2" "singleton3" "singleton4" "singleton5"
+# [6] "independent" "shared0.25" "shared0.5" "shared0.75" "shared1"
To illustrate the benefits of modeling a variety of effect-sharing +patterns, we also try out mr.mash with a simpler mixture of covariance +matrices in which the effects are effectively independent across +tissues. Although this may seem to be a very poor choice of prior, +particularly for this example, it turns out that several multivariate +regression methods assume, implicitly or explicitly, this prior.
+
+S0_ind <- compute_canonical_covs(r = 5,singletons = FALSE,
+ hetgrid = c(0,0.001,0.01))
+names(S0_ind)
+# [1] "independent" "shared0.001" "shared0.01"
Regardless of the covariance matrices are chosen, it is recommended +to also consider a variety of effect scales in the prior. This is +normally achieved in mr.mash by expanding the mixture across a specifed +grid of scaling factors. Here we choose this grid in an adaptive fashion +based on the data:
+
+univ_sumstats <- compute_univariate_sumstats(Xtrain,Ytrain,standardize = TRUE)
+scaling_grid <- autoselect.mixsd(univ_sumstats,mult = sqrt(2))^2
+S0 <- expand_covs(S0,scaling_grid)
+S0_ind <- expand_covs(S0_ind,scaling_grid)
Now we are ready to fit a mr.mash model to the training data using
-mr.mash()
, to estimate the posterior mean of the regression
-coefficients.
-fit <- mr.mash(Xtrain, Ytrain, S0, update_V=TRUE, verbose=FALSE)
Having specified the mr.mash prior, we are now ready to fit a mr.mash +model to the training data (this may take a few minutes to run):
+
+fit <- mr.mash(Xtrain,Ytrain,S0,update_V = TRUE)
And for comparison we fit a second mr.mash model using the simpler +and less flexible prior:
+
+fit_ind <- mr.mash(Xtrain,Ytrain,S0_ind,update_V = TRUE)
(Notice that the less complex model also takes less time to fit.)
+For prediction, the key output is the posterior mean estimtes of the +regression coefficients, stored in the “mu1” output. Let’s compare the +estimates to the ground truth:
+
+par(mfrow = c(1,2))
+plot(dat$B,fit_ind$mu1,pch = 20,xlab = "true",ylab = "estimated",
+ main = sprintf("cor = %0.3f",
+ cor(as.vector(dat$B),as.vector(fit_ind$mu1))))
+abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
+plot(dat$B,fit$mu1,pch = 20,xlab = "true",ylab = "estimated",
+ main = sprintf("cor = %0.3f",
+ cor(as.vector(dat$B),as.vector(fit$mu1))))
+abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
As expected, the coefficients on the left-hand side obtained using an +“independent effects” prior are not as accurate as the the coefficients +estimated using the more flexible prior (right-hand side).
+While perhaps not of primary interest, for diagnostic purposes it is +often helpfl to examine the estimated mixture weights in the prior as +well as the estimated residual covariance matrix.
+Inspecting the top prior mixture weights from the better model, it is +helpful to see that the “null” and “shared1” components are among the +top components by weight. (The top component is the null component +because most of the SNPs have no effect on gene expression.)
+
+head(sort(fit$w0,decreasing = TRUE),n = 10)
+# null singleton2_grid1 singleton1_grid1 singleton5_grid1
+# 0.04741471 0.04243749 0.04202594 0.04141742
+# singleton4_grid1 singleton3_grid1 singleton2_grid2 shared1_grid1
+# 0.04038929 0.04022118 0.03798986 0.03788223
+# singleton1_grid2 singleton5_grid2
+# 0.03724834 0.03618774
Also, reassuringly, the estimated residual variance-covariance matrix +is close to the matrix used to simulate the data:
+
+dat$V
+# [,1] [,2] [,3] [,4] [,5]
+# [1,] 3.5145375 0.8786344 0.8786344 0.8786344 0.8786344
+# [2,] 0.8786344 3.5145373 0.8786343 0.8786343 0.8786343
+# [3,] 0.8786344 0.8786343 3.5145373 0.8786343 0.8786343
+# [4,] 0.8786344 0.8786343 0.8786343 3.5145373 0.8786343
+# [5,] 0.8786344 0.8786343 0.8786343 0.8786343 3.5145373
+fit$V
+# [,1] [,2] [,3] [,4] [,5]
+# [1,] 3.2407233 0.8502217 0.8841572 1.3563118 1.0189386
+# [2,] 0.8502217 3.5174997 1.2133736 0.6212035 0.1021557
+# [3,] 0.8841572 1.2133736 2.5068761 0.6631672 1.0174394
+# [4,] 1.3563118 0.6212035 0.6631672 2.7869990 0.7854292
+# [5,] 1.0189386 0.1021557 1.0174394 0.7854292 2.9687632
We then use the fitted model from step 4 to predict the response -values in the test set. In this plot, we compare the true and predicted -values
-
-Ytest_est <- predict(fit,Xtest)
-plot(Ytest_est,Ytest,pch = 20,col = "darkblue",xlab = "true",
- ylab = "predicted")
-abline(a = 0,b = 1,col = "magenta",lty = "dotted")
However, we also want to assess prediction accuracy more formally. -Here, we do that in terms of \(R^2\) -which is easy to interpret – its maximum value would be the proportion -of variance explained, 0.2 in this case).
-
-r2 <- vector("numeric", r)
-for(i in 1:r){
- fit_acc <- lm(Ytest[, i] ~ Ytest_est[, i])
- r2[i] <- summary(fit_acc)$r.squared
-}
-
-r2
-# [1] 0.1426443 0.1937896 0.1248826 0.1756781 0.2181555
We can see that the predictions are pretty accurate.
+We can use the fitted mr.mash model to predict gene expression from a +genotype sample, including a sample not included in the training set. +This is implemented by the “predict” method. Let’s compare the +predictions from the two mr.mash models:
+
+par(mfrow = c(1,2))
+Ypred <- predict(fit,Xtest)
+Ypred_ind <- predict(fit_ind,Xtest)
+plot(Ytest,Ypred_ind,pch = 20,col = "darkblue",xlab = "true",
+ ylab = "predicted",
+ main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred_ind))))
+abline(a = 0,b = 1,col = "magenta",lty = "dotted")
+plot(Ytest,Ypred,pch = 20,col = "darkblue",xlab = "true",
+ ylab = "predicted",
+ main = sprintf("cor = %0.3f",cor(as.vector(Ytest),as.vector(Ypred))))
+abline(a = 0,b = 1,col = "magenta",lty = "dotted")
Indeed, mr.mash with the more flexible prior (right-hand plot) +produces more accurate predictions than mr.mash with the “independent +effects” prior.