From ab7758b77f40a8db903443de12b57ebda66e7814 Mon Sep 17 00:00:00 2001
From: Peter Carbonetto
Date: Fri, 19 May 2023 15:31:38 -0500
Subject: [PATCH] More improvements to the introductory vignette.

---
 vignettes/mr_mash_intro.Rmd | 213 ++++++++++++++++++++++--------------
 1 file changed, 132 insertions(+), 81 deletions(-)

diff --git a/vignettes/mr_mash_intro.Rmd b/vignettes/mr_mash_intro.Rmd
index e7807af..63fd66c 100644
--- a/vignettes/mr_mash_intro.Rmd
+++ b/vignettes/mr_mash_intro.Rmd
@@ -15,124 +15,175 @@ vignette: >
 ---
 
 The aim of this vignette is to introduce the basic steps of a
-*mr.mash* analysis through a toy example.
+mr.mash analysis through a toy example. To learn more about
+mr.mash, please see the [paper][mr-mash-biorxiv].
 
 ```{r knitr-opts, include=FALSE}
 knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold",
                       fig.align = "center",dpi = 120)
 ```
 
-First, we set the seed and load the `mr.mash.alpha` R package.
+First, we set the seed to make the results more easily reproducible,
+and we load the "mr.mash.alpha" package.
 
-```{r set-seed}
+```{r load-pkgs}
 library(mr.mash.alpha)
 set.seed(123)
 ```
 
-## Step 1 -- Simulate example data
-
-We start by simulating a data set with 800 individuals, 1000
-predictors and 5 responses. We then 5 causal variables (randomly
-sampled from the total 1000) are assigned equal effects across
-responses and explain 20\% of the total per-response variance. This
-would be roughly equivalent to one gene in the "equal effects"
-scenario in the *mr.mash* (with the difference being that genotypes
-are simulated here).
+We illustrate the application of mr.mash to a data set simulated from
+a multivariate, multiple linear regression with 5 responses in which
+the coefficients are the same for all responses.
+In the target
+application considered in the paper---prediction of multi-tissue gene
+expression from genotypes---this would correspond to the situation in
+which we would like to predict expression of a single gene in 5
+different tissues from genotype data at multiple SNPs, and the SNPs
+have the same effects on gene expression in all 5 tissues. (In
+multi-tissue gene expression we would normally like to predict
+expression of many genes, but to simplify this vignette here we
+illustrate the key ideas with a single gene.)
+
+Although this simulation is not particularly realistic, it is meant
+to illustrate the benefits of mr.mash: by modeling the sharing of
+effects across tissues, mr.mash is able to more accurately estimate
+the effects in multiple tissues, and therefore is able to obtain
+better predictions.
+
+We start by simulating 150 samples from a multivariate, multiple
+linear regression model in which 5 out of the 800 variables (SNPs)
+affect the 5 responses (expression levels).
 
 ```{r sim-data-1, results="hide"}
-n <- 300
-p <- 1000
-p_causal <- 5
-r <- 5
-pve <- 0.2
-B_cor <- 1
-out <- simulate_mr_mash_data(n, p, p_causal, r, pve=pve, B_cor=B_cor,
-                             B_scale=1, X_cor=0, X_scale=1, V_cor=0)
+dat <- simulate_mr_mash_data(n = 150,p = 800,p_causal = 5,r = 5,pve = 0.5,
+                             V_cor = 0.25)
 ```
 
-## Step 2 -- Split the data into training and test sets
-
-We then split the data into a training set and a test set.
+Next we split the samples into a training set (with 100 samples) and
+a test set (with 50 samples).
```{r sim-data-2} -ntest <- 200 -Ytrain <- out$Y[-(1:ntest),] -Xtrain <- out$X[-(1:ntest),] -Ytest <- out$Y[1:ntest,] -Xtest <- out$X[1:ntest,] +ntest <- 50 +Ytrain <- dat$Y[-(1:ntest),] +Xtrain <- dat$X[-(1:ntest),] +Ytest <- dat$Y[1:ntest,] +Xtest <- dat$X[1:ntest,] ``` -## Step 3 -- Define the mixture prior - -To run *mr.mash*, we need to first specify the covariances in the -mixture-of-normals prior, which are supposed to capture the effect -sharing patterns across responses. In this example, we use a mixture of -"canonical" covariances computed using `compute_canonical_covs()`. -However, "data-driven" covariances can also be used -- here's an -[example](https://stephenslab.github.io/mashr/articles/intro_mash_dd.html) -of how to compute these matrices. -Regardless of the type of covariance matrices, these are each multiplied -by a grid of scaling factors computed using `autoselect.mixsd()`, which -are supposed to capture the magnitude of the effects. The expansion is done -using `expand_covs()`, which also adds a matrix of all zeros (our spike) -when requested. The grid is derived from the regression coefficients and -their standard errors from univariate simple linear regression which can be -computed using `compute_univariate_sumstats()`. - -```{r specify-prior} -univ_sumstats <- compute_univariate_sumstats(Xtrain, Ytrain, - standardize=TRUE, standardize.response=FALSE) -grid <- autoselect.mixsd(univ_sumstats, mult=sqrt(2))^2 -S0 <- compute_canonical_covs(ncol(Ytrain), singletons=TRUE, - hetgrid=c(0, 0.25, 0.5, 0.75, 1)) -S0 <- expand_covs(S0, grid, zeromat=TRUE) +Define the mr.mash prior +------------------------ + +To run mr.mash, we need to first specify the covariances in the +mixture of normals prior. 
+The idea is that the chosen collection of
+covariance matrices should include a variety of potential effect
+sharing patterns, and in the model fitting stage the prior should then
+assign most weight to the sharing patterns that are present in the
+data, and little or no weight to patterns that are inconsistent with
+the data. In general, we recommend learning
+["data-driven" covariance matrices][mashr-dd-vignette]. But here, for
+simplicity, we instead use "canonical" covariances, which are not
+adaptive but are nonetheless well suited to this toy example since the
+true effects are the same across responses/tissues.
+
+```{r choose-prior-1}
+S0 <- compute_canonical_covs(r = 5,singletons = TRUE,
+                             hetgrid = seq(0,1,0.25))
+```
 
-## Step 4 -- Fit *mr.mash* to the training data
+This gives a mixture of 10 covariance matrices capturing a variety of
+"canonical" effect-sharing patterns:
+
+```{r choose-prior-2}
+names(S0)
+```
 
-Now we are ready to fit a mr.mash model to the training data using `mr.mash()`,
-to estimate the posterior mean of the regression coefficients.
+To illustrate the benefits of modeling a variety of effect-sharing
+patterns, we also try out mr.mash with a simpler mixture of covariance
+matrices in which the effects are effectively independent across
+tissues. Although this may seem to be a very poor choice of prior,
+particularly for this example, it turns out that several multivariate
+regression methods assume, implicitly or explicitly, this prior.
 
-```{r fit-mr-mash, results="hide"}
-S0_ind <- compute_canonical_covs(ncol(Ytrain),singletons = FALSE,
+```{r choose-prior-3}
+S0_ind <- compute_canonical_covs(r = 5,singletons = FALSE,
                              hetgrid = c(0,0.001,0.01))
-S0_ind <- expand_covs(S0_ind,grid,zeromat = TRUE)
-fit_ind <- mr.mash(Xtrain,Ytrain,S0_ind,update_V = TRUE)
+names(S0_ind)
+```
+
+Regardless of how the covariance matrices are chosen, it is
+recommended to also consider a variety of effect scales in the prior.
+This is
+normally achieved in mr.mash by expanding the mixture across a
+specified grid of scaling factors. Here we choose this grid in an
+adaptive fashion based on the data:
+
+```{r choose-prior-4}
+univ_sumstats <- compute_univariate_sumstats(Xtrain,Ytrain,standardize = TRUE)
+scaling_grid <- autoselect.mixsd(univ_sumstats,mult = sqrt(2))^2
+S0 <- expand_covs(S0,scaling_grid)
+S0_ind <- expand_covs(S0_ind,scaling_grid)
+```
+
+Fit a mr.mash model to the data
+-------------------------------
+
+Having specified the mr.mash prior, we are now ready to fit a mr.mash
+model to the training data (this may take a few minutes to run):
+
+```{r fit-mr-mash-1}
 fit <- mr.mash(Xtrain,Ytrain,S0,update_V = TRUE)
+```
+
+And for comparison we fit a second mr.mash model using the simpler and
+less flexible prior:
+
+```{r fit-mr-mash-2}
+fit_ind <- mr.mash(Xtrain,Ytrain,S0_ind,update_V = TRUE)
+```
+
+(Notice that the less complex model also takes less time to fit.)
+
+For prediction, the key output is the matrix of posterior mean
+estimates of the regression coefficients, stored in the "mu1" output.
+Let's compare the
+estimates to the ground truth:
+
+```{r plot-coefs, fig.height=3.5, fig.width=6}
 par(mfrow = c(1,2))
-plot(out$B,fit_ind$mu1,pch = 20,xlim = c(-1.75,0.75),ylim = c(-1.75,0.75),
-     main = cor(as.vector(out$B),as.vector(fit_ind$mu1)))
+plot(dat$B,fit_ind$mu1,pch = 20,xlab = "true",ylab = "estimated",
+     main = sprintf("cor = %0.3f",
+                    cor(as.vector(dat$B),as.vector(fit_ind$mu1))))
 abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
-plot(out$B,fit$mu1,pch = 20,xlim = c(-1.75,0.75),ylim = c(-1.75,0.75),
-     main = cor(as.vector(out$B),as.vector(fit$mu1)))
+plot(dat$B,fit$mu1,pch = 20,xlab = "true",ylab = "estimated",
+     main = sprintf("cor = %0.3f",
+                    cor(as.vector(dat$B),as.vector(fit$mu1))))
 abline(a = 0,b = 1,col = "royalblue",lty = "dotted")
 ```
+
+As expected, the coefficients on the left-hand side, obtained using an
+"independent effects" prior, are not as accurate as the coefficients
+estimated using the more flexible prior (right-hand side).
+
+While perhaps not of primary interest, for diagnostic purposes it is
+often helpful to examine the estimated mixture weights in the prior as
+well as the estimated residual covariance matrix.
+
+```{r prior-mixture-weights}
+head(sort(fit$w0,decreasing = TRUE),n = 10)
+fit$V
+```
+
-## Step 5 -- Predict responses in the test
-
-We then use the fitted model from step 4 to predict the response
-values in the test set.
+Predict responses in the test set
+---------------------------------
+
+We now use both fitted models to predict the responses in the test
+set.
+In the plots below, we compare the true and predicted values:
 
-```{r plot-pred-test, fig.height=5, fig.width=5}
+```{r plot-pred-test, fig.height=3.5, fig.width=6}
+par(mfrow = c(1,2))
+Ytest_est <- predict(fit_ind,Xtest)
+plot(Ytest,Ytest_est,pch = 20,col = "darkblue",xlab = "true",
+     ylab = "predicted",
+     main = sprintf("cor = %0.3f",cor(as.vector(Ytest_est),as.vector(Ytest))))
+abline(a = 0,b = 1,col = "magenta",lty = "dotted")
 Ytest_est <- predict(fit,Xtest)
-plot(Ytest_est,Ytest,pch = 20,col = "darkblue",xlab = "true",
-     ylab = "predicted")
+plot(Ytest,Ytest_est,pch = 20,col = "darkblue",xlab = "true",
+     ylab = "predicted",
+     main = sprintf("cor = %0.3f",cor(as.vector(Ytest_est),as.vector(Ytest))))
 abline(a = 0,b = 1,col = "magenta",lty = "dotted")
 ```
 
-However, we also want to assess prediction accuracy more
-formally. Here, we do that in terms of $R^2$ which is easy to
-interpret -- its maximum value would be the proportion of variance
-explained, 0.2 in this case).
-
-```{r pred-acc-test}
-r2 <- vector("numeric", r)
-for(i in 1:r){
-  fit_acc <- lm(Ytest[, i] ~ Ytest_est[, i])
-  r2[i] <- summary(fit_acc)$r.squared
-}
-
-r2
-```
-
-We can see that the predictions are pretty accurate.
-
+[mr-mash-biorxiv]: https://doi.org/10.1101/2022.11.22.517471
+[mashr-dd-vignette]: https://stephenslab.github.io/mashr/articles/intro_mash_dd.html
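+The plots above can be complemented with a numerical summary of
+prediction accuracy. As a quick sketch (using only base R and the
+`Ytest` and `Ytest_est` objects created above; the chunk below is not
+part of the original vignette), we can compute, for each response, the
+squared correlation between the observed and predicted values:
+
+```{r pred-acc}
+# Per-response R^2 of the mr.mash predictions in the test set.
+r2 <- sapply(1:ncol(Ytest),function (i) cor(Ytest[,i],Ytest_est[,i])^2)
+round(r2,digits = 3)
+```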