/
gene_set.Rmd
206 lines (166 loc) · 5.4 KB
/
gene_set.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
---
title: "Overview of 4,026 pre-processed gene sets"
author: "Xiang Zhu"
date: "2018-09-16"
output: workflowr::wflow_html
---
[Zhu and Stephens (2018)]: https://www.nature.com/articles/s41467-018-06805-x
[`xiangzhu/rss-gsea`]: https://github.com/xiangzhu/rss-gsea/tree/master/data
[`biological_pathway`]: https://github.com/xiangzhu/rss-gsea/tree/master/data/biological_pathway
[`tissue_set`]: https://github.com/xiangzhu/rss-gsea/tree/master/data/tissue_set
[zenodo-geneset]: https://zenodo.org/badge/latestdoi/55633948
All 4,026 gene sets used in [Zhu and Stephens (2018)][] are freely available at [`xiangzhu/rss-gsea`],
where the folder [`biological_pathway`](https://github.com/xiangzhu/rss-gsea/tree/master/data/biological_pathway)
contains 3,913 biological pathways,
and the folder [`tissue_set`](https://github.com/xiangzhu/rss-gsea/tree/master/data/tissue_set)
contains 113 GTEx tissue-based gene sets.
These gene sets can be referenced in a journal's "Data availability" section
as [![DOI](https://zenodo.org/badge/55633948.svg)][zenodo-geneset].
```zsh
$ tree -L 2 data/
data/
├── README.md
├── biological_pathway
│ ├── gene_37.3.mat
│ └── pathway.mat
└── tissue_set
├── de_genes
├── he_genes
└── se_genes
5 directories, 3 files
```
## Biological pathways
The 3,913 public biological pathway used in [Zhu and Stephens (2018)][]
are available in the folder [`biological_pathway`][],
which are represented by two files `gene_37.3.mat` and `pathway.mat`.
The file `gene_37.3.mat` contains basic information of genes.
```
>> load gene_37.3.mat
>> gene
gene =
struct with fields:
id: [18732x1 double]
symbol: {18732x1 cell}
chr: [18732x1 double]
desc: {18732x1 cell}
start: [18732x1 double]
stop: [18732x1 double]
>> [gene.id(10) gene.chr(10) gene.start(10) gene.stop(10)]
ans =
18 16 8768444 8878432
>> gene.symbol(10)
ans =
1x1 cell array
{'ABAT'}
>> gene.desc(10)
ans =
1x1 cell array
{'4-aminobutyrate aminotransferase'}
```
Note that only 18,313 genes mapped to reference sequence were used in our analyses.
```
>> [min(gene.start) min(gene.stop)]
ans =
-1 -1
>> inref_genes = ~(gene.start == -1 | gene.stop == -1);
>> sum(inref_genes)
ans =
18313
```
The file `pathway.mat` contains basic information of pathways.
```
>> load pathway.mat
>> pathway
pathway =
struct with fields:
label: {4076x1 cell}
database: {4076x1 cell}
source: {4076x1 cell}
genes: [18732x4076 double]
synonyms: {4076x1 cell}
>> pathway.label(100)
ans =
1x1 cell array
{'Activation of NOXA and translocation to mitochondria'}
>> pathway.database(100)
ans =
1x1 cell array
{'PC'}
>> pathway.source(100)
ans =
1x1 cell array
{'reactome'}
```
The gene-pathway information is represented as a sparse zero-one matrix `pathway.genes`,
where `genes(i,j)==1` if gene `i` is a member of pathway `j` and `genes(i,j)==0` otherwise.
```
>> genes = pathway.genes;
>> whos genes
Name Size Bytes Class Attributes
genes 18732x4076 3257512 double sparse
>> genes(:,100)
ans =
(1243,1) 1
(3410,1) 1
(4567,1) 1
(4668,1) 1
```
Finally, our analyses only used 3,913 of 4,076 pathways that
- include 2-499 RefSeq-mapped genes;
- have clear `database` and `source` definitions;
- exclude one pathway `Viral RNP Complexes in the Host Cell Nucleus (PC, reactome)` (because no HapMap3 SNP was mapped to this pathway).
```
>> numgenes = pathway.genes' * inref_genes;
>> size(numgenes)
ans =
4076 1
>> paths = find(numgenes > 1 & numgenes < 500);
>> size(paths)
ans =
3916 1
>> database = pathway.database;
>> source = pathway.source;
>> database_na = find(not(cellfun('isempty', strfind(database, 'NA'))));
>> source_na = find(not(cellfun('isempty', strfind(source, 'NA'))));
>> length(union(database_na, source_na))
ans =
2
>> label = pathway.label;
>> pathway_exclude = 'Viral RNP Complexes in the Host Cell Nucleus';
>> label_include = find(cellfun('isempty', strfind(label, pathway_exclude)));
>> label_exclude = setdiff(1:4076, label_include);
>> label(label_exclude)
ans =
1x1 cell array
{'Viral RNP Complexes in the Host Cell Nucleus'}
>> database(label_exclude)
ans =
1x1 cell array
{'PC'}
>> source(label_exclude)
ans =
1x1 cell array
{'reactome'}
```
## Tissue-based gene sets
The 113 GTEx tissue-based gene sets used in [Zhu and Stephens (2018)][]
are available in the folder [`tissue_set`][].
There are 44 "highly expressed" (HE) gene sets, 49 "selectively expressed" (SE) gene sets
and 20 "distinctively expressed" (DE) gene sets.
The creation of SE sets uses a method described in [Yang et al (2018)](https://doi.org/10.1101/311563).
The creation of DE sets uses a method described in [Dey et al (2017)](https://www.ncbi.nlm.nih.gov/pubmed/28333934).
```{r, engine='bash'}
ls data/tissue_set/he_genes | wc -l
```
```{r, engine='bash'}
ls data/tissue_set/se_genes | wc -l
```
```{r, engine='bash'}
ls data/tissue_set/de_genes | wc -l
```
Each of the tissue-based gene sets has the following format.
```{r, engine='bash'}
head -n 5 data/tissue_set/he_genes/gene_names_expr_LIVER.txt
```
Note that the gene information of tissue-based sets was provided by GTEx,
which may not be the same as `gene_37.3.mat` above.