/
intro_correlations.Rmd
141 lines (116 loc) · 5.24 KB
/
intro_correlations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
---
title: "Accounting for correlations among measurements"
author: "Matthew Stephens"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{mashr intro with correlations}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,comment = "#",fig.width = 5,
fig.height = 4,fig.align = "center",
eval = TRUE)
```
# Introduction
In some settings measurements and tests in different conditions may be
correlated with one another. For example, in eQTL applications
this can occur due to sample overlap among the different conditions.
Failure to deal with such correlations can cause false positives
in a `mashr` analysis.
To deal with these correlations `mashr` allows the user to specify
a correlation matrix $V$ when setting up the data in `mash_set_data`.
We introduce two methods to estimate this correlation matrix.
The first method is simple and fast. It estimates the correlation matrix
using `estimate_null_correlation_simple`, which,
as its name suggests, uses the null tests (specifically, tests
without a strong $z$ score) to estimate the correlations.
The second method may provide a better `mash` fit. It estimates
the correlations using `mash_estimate_corr_em`, which uses
an ad hoc EM algorithm.
# Method 1
The method is described in [Urbut et al.][urbut-biorxiv]
Here we simulate data with correlations.
```{r}
library(mashr)
set.seed(1)
simdata = simple_sims(500,5,1)
V = matrix(0.5,5,5)
diag(V) = 1
simdata$Bhat = simdata$B + mvtnorm::rmvnorm(2000, sigma = V)
```
Read in the data, and estimate correlations:
```{r}
data = mash_set_data(simdata$Bhat, simdata$Shat)
V.simple = estimate_null_correlation_simple(data)
data.Vsimple = mash_update_data(data, V=V.simple)
```
Now we have two mash data objects, one (`data.Vsimple`) with correlations specified,
and one without (`data`). So analyses using `data.Vsimple` will allow for correlations,
whereas analyses using `data` will assume measurements are independent.
Here, for illustration purposes,
we proceed to analyze the data with correlations,
using just the simple canonical covariances
as in the initial [introductory vignette](intro_mash.html).
```{r}
U.c = cov_canonical(data.Vsimple)
m.Vsimple = mash(data.Vsimple, U.c) # fits with correlations because data.V includes correlation information
print(get_loglik(m.Vsimple),digits=10) # log-likelihood of the fit with correlations set to V
```
We can also compare with the original analysis.
(Note that the canonical covariances
do not depend on the correlations, so we can use the same `U.c` here
for both analyses. If we used data-driven covariances we might prefer to
estimate these separately for each analysis as the correlations would
affect them.)
```{r}
m.orig = mash(data, U.c) # fits without correlations because data object was set up without correlations
print(get_loglik(m.orig),digits=10)
```
```{r}
loglik = c(get_loglik(m.orig), get_loglik(m.Vsimple))
significant = c(length(get_significant_results(m.orig)), length(get_significant_results(m.Vsimple)))
false_positive = c(sum(get_significant_results(m.orig) < 501),
sum(get_significant_results(m.Vsimple) < 501))
tb = rbind(loglik, significant, false_positive)
colnames(tb) = c('without cor', 'V simple')
row.names(tb) = c('log likelihood', '# significance', '# False positive')
tb
```
The log-likelihood with correlations is higher than without correlations.
The false positives reduce.
# Method 2
The method is described in Yuxin Zou's thesis.
To estimate the residual correlations using EM method, it requires
covariance matrices for the signals. We proceed with the simple canonical
covariances.
With `details = TRUE` in `mash_estimate_corr_em`, it returns
the estimates residual correlation matrix with the mash fit.
```{r}
V.em = mash_estimate_corr_em(data, U.c, details = TRUE)
m.Vem = V.em$mash.model
print(get_loglik(m.Vem),digits=10) # log-likelihood of the fit
```
```{r}
loglik = c(get_loglik(m.orig), get_loglik(m.Vsimple), get_loglik(m.Vem))
significant = c(length(get_significant_results(m.orig)), length(get_significant_results(m.Vsimple)),
length(get_significant_results(m.Vem)))
false_positive = c(sum(get_significant_results(m.orig) < 501),
sum(get_significant_results(m.Vsimple) < 501),
sum(get_significant_results(m.Vem) < 501))
tb = rbind(loglik, significant, false_positive)
colnames(tb) = c('without cor', 'V simple', 'V EM')
row.names(tb) = c('log likelihood', '# significance', '# False positive')
tb
```
Comparing with Method 1, the log likelihood from Method 2 is higher.
The EM updates in `mash_estimate_corr_em` needs some time to converge.
There are several things we can do to reduce the running time.
First of all, we can set the number of iterations to a small number.
Because there is a large improvement in the log-likelihood within
the first few iterations, running the algorithm with small number of
iterations provides estimates of correlation matrix that is better
than the initial value. Moreover, we can estimate the correlation matrix
using a random subset of genes, not the whole observed genes.
[urbut-biorxiv]: https://www.biorxiv.org/content/10.1101/096552v4