-
Notifications
You must be signed in to change notification settings - Fork 11
/
example_4.Rmd
252 lines (194 loc) · 9.6 KB
/
example_4.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
---
title: "Example 4: RSS-BVSR Fitting via Variational Bayes Methods"
author: Xiang Zhu
output: workflowr::wflow_html
editor_options:
chunk_output_type: console
---
[Carbonetto and Stephens (2012)]: https://projecteuclid.org/euclid.ba/1339616726
[Zhu and Stephens (2017)]: https://projecteuclid.org/euclid.aoas/1507168840
[Zhu and Stephens (2018)]: https://www.nature.com/articles/s41467-018-06805-x
[`enrich_datamaker.m`]: https://github.com/stephenslab/rss/blob/master/misc/enrich_datamaker.m
[Supplementary Note]: https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-06805-x/MediaObjects/41467_2018_6805_MOESM1_ESM.pdf
[Supplementary Figure 1]: https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-018-06805-x/MediaObjects/41467_2018_6805_MOESM1_ESM.pdf
[`example4.m`]: https://github.com/stephenslab/rss/blob/master/examples/example4.m
## Overview
This example illustrates how to fit an RSS-BVSR model using
variational Bayes (VB) approximation, and compares the results with
previous work based on individual-level data, notably, [Carbonetto and Stephens (2012)][].
This example is closely related to Section "Connection with enrichment
analysis of individual-level data" of [Zhu and Stephens (2018)][].
Proposition 2.1 of [Zhu and Stephens (2017)][] gives conditions
under which regression likelihood based on individual-level
data is equivalent to regression likelihood based on summary-level data.
Under the same conditions, we can show that variational inferences
based on individual-level data and summary-level data are also equivalent,
in the sense that they have the same fix point iteration
scheme and lower bound in variational approximations.
See [Supplementary Note][] of [Zhu and Stephens (2018)][] for proofs.
This example serves as a sanity check for our theoretical work.
The GWAS individual-level and summary-level data are simulated by [`enrich_datamaker.m`][],
which contains an enrichment signal of a target gene set.
Specifically, SNPs outside the target gene set are selected
to be causal ones with log10 odds ratio $\theta_0$,
whereas SNPs inside this gene set are selected with a higher
log10 odds ratio $\theta_0+\theta$ ($\theta>0$).
For more details, see [Supplementary Figure 1][] of [Zhu and Stephens (2018)][].
Next, we feed the simulated individual-level and summary-level data to
[`varbvs`](https://github.com/pcarbo/varbvs) and
[`rss-varbvsr`](https://github.com/stephenslab/rss/tree/master/src_vb) respectively,
and then compare the VB output from these two methods.
To reproduce results of Example 4,
please please read the step-by-step guide below and use [`example4.m`][].
Before running [`example4.m`][], please first install the
[VB subroutines](https://github.com/stephenslab/rss/tree/master/src_vb) of RSS
Please find installation instructions [here](setup_vb.html).
## Step-by-step illustration
**Step 1**. [Download](https://projects.rcc.uchicago.edu/mstephens/rss_wiki/example4/readme)
the input data `genotype.mat` and `AH_chr16.mat` for the function `enrich_datamaker.m`.
Please contact me if you have trouble accessing these files<sup>1</sup>.
**Step 2**. Install the MATLAB implementation of [`varbvs`](https://github.com/pcarbo/varbvs).
After the installation, add varbvs to the search path as follows.
```matlab
addpath('/home/xiangzhu/varbvs-master/varbvs-MATLAB/');
```
**Step 3**. Extract SNPs that inside the target gene set.
This step is where we need the input data
[`AH_chr16.mat`](https://projects.rcc.uchicago.edu/mstephens/rss_wiki/example4/readme).
The index of SNPs inside the gene set is stored as `snps`.
```matlab
AH = matfile('AH_chr16.mat');
H = AH.H; % hypothesis matrix 3323x3160
A = AH.A; % annotation matrix 12758x3323
paths = find(H(:,end)); % signal transduction (Biosystem, Reactome)
snps = find(sum(A(:,paths),2)); % index of variants inside the pathway
```
**Step 4**. Simulate the enrichment dataset.
In addition to `genotype.mat` and `AH_chr16.mat`,
four more input variables are required in order to run `example4.m`.
You can provide them through keyboard.
```matlab
% set the number of replications
prompt = 'What is the number of replications? ';
Nrep = input(prompt);
% set the genome-wide log-odds
prompt1 = 'What is the genome-wide log-odds? ';
theta0 = input(prompt1);
% set the log-fold enrichment
prompt2 = 'What is the log-fold enrichment? ';
theta = input(prompt2);
% set the true pve
prompt3 = 'What is the pve (between 0 and 1)? ';
pve = input(prompt3);
```
With this in place, the data generation is all set.
```matlab
for i = 1:Nrep
myseed = 617 + i;
% generate data under enrichment hypothesis
[true_para,individual_data,summary_data] = enrich_datamaker(C,thetatype,pve,myseed,snps);
... ...
end
```
**Step 5**. Ensure that `varbvs` and `rss-varbvsr` run in an almost identical environment.
We fix all the hyper-parameters as their true values,
and give the same parameter initialization.
```matlab
% fix all hyper-parameters as their true values
sigma = true_para{4}^2; % the true residual variance
sa = 1/sigma; % sa*sigma being the true prior variance of causal effects
sigb = 1; % the true prior SD of causal effects
theta0 = thetatype(1); % the true genome-wide log-odds (base 10)
theta = thetatype(2); % the true log-fold enrichment (base 10)
% initialize the variational parameters
rng(myseed, 'twister');
alpha0 = rand(p,1);
alpha0 = alpha0 ./ sum(alpha0);
rng(myseed+1, 'twister');
mu0 = randn(p,1);
```
**Step 6**. Run `varbvs` on the individual-level genotypes and phenotypes.
The VB analysis involves two steps: first fitting the baseline model assuming no enrichment;
second fitting the enrichment model assuming the gene set is enriched.
We then calculate a Bayes factor (ratio of marginal likelihoods) for these two models,
and use it to test whether the gene set is enriched for genetic associations.
```matlab
fit_null = varbvs(X,[],y,[],'gaussian',options_n);
fit_gsea = varbvs(X,[],y,[],'gaussian',options_e);
bf_b = exp(fit_gsea.logw - fit_null.logw);
```
**Step 7**. Run `rss-varbvsr` on the single-SNP association summary statistics.
Here we do not specify `se` and `R` as we actually do in our real data analyses.
Instead, we set their values such that `rss-varbvsr`
and `varbvs` are expected to produce the same output.
(This is because we want to verify our mathematical proofs in silico.)
Based on the proofs in [Zhu and Stephens (2018)][],
we find that `rss-varbvsr` is equivalent to `varbvs` if and only if
1. the variance of phenotype is the same as residual variance `sigma.^2`;
2. the input LD matrix `R` is the same as the sample correlation matrix of cohort genotypes `X`.
Reassuringly, these two assumptions are exactly the same
assumptions in Proposition 2.1 of [Zhu and Stephens (2017)][]
that guarantees that the RSS likelihood is equivalent to
the regression likelihood for individual-level data.
These two assumptions are implemented as follows.
```matlab
% set the summary-level data for a perfect match b/t rss-varbvsr and varbvs
betahat = summary_data{1};
se = sqrt(sigma ./ diag(X'*X)); % condition 1 for perfect matching
Si = 1 ./ se(:);
R = corrcoef(X); % condition 2 for perfect matching
SiRiS = sparse(repmat(Si, 1, p) .* R .* repmat(Si', p, 1));
```
Now we perform the VB analysis of summary data via `rss-varbvsr`.
```matlab
[logw_nr,alpha_nr,mu_nr,s_nr] = rss_varbvsr(betahat,se,SiRiS,sigb,logodds_n,options);
[logw_er,alpha_er,mu_er,s_er] = rss_varbvsr(betahat,se,SiRiS,sigb,logodds_e,options);
bf_r = exp(logw_er-logw_nr);
```
**Step 8**. Compare VB output from `varbvs` and `rss-varbvsr`.
Both `varbvs` and `rss-varbvsr` output an estimated posterior distribution of
$\boldsymbol\beta$ (multiple regression coefficients, or, multiple-SNP effect sizes).
Specifically, both estimated distributions have the following analytic form
(Equations 6-7 in [Carbonetto and Stephens (2012)][]).
Hence, it suffices to compare the estimated variational parameters `[alpha, mu, s]`,
and the estimated Bayes factors based on the VB output.
Below is the result when I only run `example4.m` with one replication.
```matlab
>> run example4.m
What is the number of replications? 1
What is the genome-wide log-odds? -4
What is the log-fold enrichment? 2
What is the pve (between 0 and 1)? 0.3
```
The log10 maximum absolute differences of `[alpha, mu, s]`
between `varbvs` and `rss-varbvsr`, under the baseline model:
```matlab
>> disp(log10([alpha_n_diff mu_n_diff s_n_diff]))
-4.7084 -4.5409 -7.6421
```
The log10 maximum absolute differences of `[alpha, mu, s]`
between `varbvs` and `rss-varbvsr`, under the enrichment model:
```matlab
>> disp(log10([alpha_e_diff mu_e_diff s_e_diff]))
-3.9865 -4.0157 -7.6421
```
The log10 Bayes factors estimated from `varbvs` and `rss-varbvsr`
are 4.0193 and 4.0193 respectively, and the log10 relative difference between them is -6.8345.
## More simulations
To see how `example4.m` behave on average,
we need run more replications. Below are the results from 100 replications.
```matlab
>> run example4.m
What is the number of replications? 100
What is the genome-wide log-odds? -4
What is the log-fold enrichment? 2
What is the pve (between 0 and 1)? 0.3
```
<center>
<img src="images/rss_example4_rep100.png" width="600">
</center>
--------
**Footnotes:**
1. Currently these files are locked, since they contain individual-level genotypes
from Wellcome Trust Case Control Consortium (WTCCC, <https://www.wtccc.org.uk/>).
You need to get permission from WTCCC before we can share these files with you.