/
dorothea.Rmd
196 lines (160 loc) · 6.07 KB
/
dorothea.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
---
title: "DoRothEA regulons."
author:
- name: Pau Badia i Mompel
affiliation: Institute for Computational Biomedicine, Heidelberg University
email: pau.badia@uni-heidelberg.de
package: dorothea
output:
BiocStyle::html_document
vignette: |
%\VignetteIndexEntry{DoRothEA regulons.}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
DoRothEA is a gene regulatory network (GRN) containing signed transcription factor
(TF) - target gene interactions. DoRothEA regulons, the collection of TFs and
their transcriptional targets, were curated and collected from different types of
evidence for both human and mouse.
<img src="../man/figures/overview.png" align="center" width="600">
For each TF-target interaction we assigned a confidence level based on the
number of supporting evidence. The confidence assignment comprises five levels,
ranging from A (highest confidence) to E (lowest confidence). Interactions that
are supported by all four lines of evidence, manually curated by experts in
specific reviews, or supported both in at least two curated resources are
considered to be highly reliable and were assigned an A level. Level B-D are
reserved for curated and/or ChIP-seq interactions with different levels of
additional evidence. Finally, E level is used for interactions that are
uniquely supported by computational predictions. To provide the most confident
regulon for each TF, we aggregated the TF-target interactions with the highest
possible confidence score that resulted in a regulon size equal to or greater
than ten targets. The final confidence level assigned to the TF regulon is the
lowest confidence score of its component targets.
# Activity estimation
DoRothEA regulons can be coupled with any statistical method to infer TF
activities from bulk or single-cell transcriptomics.
In this vignette we just show how to access these regulons and some of their
properties. To infer TF activities, please check out
[decoupleR](https://doi.org/10.1093/bioadv/vbac016), available in
[R](https://saezlab.github.io/decoupleR/) or
[python](https://github.com/saezlab/decoupler-py).
# Load
First we load the necessary packages:
```{r load, message=FALSE}
library(dorothea)
library(decoupleR)
library(ggplot2)
library(dplyr)
```
Here is how to retrieve all regulons from human:
```{r model}
net <- decoupleR::get_dorothea(levels = c('A', 'B', 'C', 'D'))
head(net)
```
Here we can observe some of the target genes for the TF ADNP. We can see their
confidence level, in this case D, and their mode of regulation, in this case
positive. To better estimate TF activities, we recommend to select regulons from
the confidence levels A, B and C.
# Exploration
We can observe the total number of genes per TF:
```{r n_genes}
n_genes <- net %>%
group_by(source) %>%
summarize(n = n())
ggplot(data=n_genes, aes(x=n)) +
geom_density() +
theme(text = element_text(size=12)) +
xlab('Number of target genes') +
ylab('densities') +
theme_bw() +
theme(legend.position = "none")
```
The majority of TFs have around 20 target genes, but there are some that reach
more than 1000.
Additionally, we can visualize how many edges each confidence level adds:
```{r n_edges}
n_edges <- net %>%
group_by(confidence) %>%
summarize(n = n())
ggplot(data=n_edges, aes(x=confidence, y=log10(n), color=confidence, fill=confidence)) +
geom_bar(stat="identity") +
theme(text = element_text(size=12)) +
xlab('Confidence') +
ylab('log10(Number of edges)') +
theme_bw() +
theme(legend.position = "none")
```
Each confidence level contributes around 10,000 TF - target relationships, B
and E being the exceptions.
We can also check how many TFs are repressors, TFs with most of their edges with
negative mode of regulation (`mor`), and how many are activators, TFs with most
of their edges with positive `mor`:
```{r prop}
prop <- net %>%
mutate(mor = case_when(mor < 0 ~ -1, mor > 0 ~ 1)) %>%
group_by(source, mor) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n)) %>%
filter(mor == 1)
ggplot(data=prop, aes(x=freq)) +
geom_density() +
theme(text = element_text(size=12)) +
xlab('% of positive edges') +
ylab('densities') +
theme_bw() +
theme(legend.position = "none")
```
Most TFs in DoRothEA are activators, but there are also some of them that
are repressors.
# CollecTRI
Recently, we have released a new literature based GRN with increased coverage and better
performance at identifying perturbed TFs, called [CollecTRI](https://github.com/saezlab/CollecTRI).
We encourage users to use CollecTRI instead of DoRothEA. Vignettes on how to
obtain activities are available at the [decoupleR package](https://saezlab.github.io/decoupleR/).
Here's how to access it. The argument `split_complexes` keeps complexes or
splits them into subunits, by default we recommend to keep complexes together.
```{r collectri}
net <- decoupleR::get_collectri(split_complexes = FALSE)
head(net)
```
We can observe the total number of genes per TF:
```{r n_genes_collectri}
n_genes <- net %>%
group_by(source) %>%
summarize(n = n())
ggplot(data=n_genes, aes(x=n)) +
geom_density() +
theme(text = element_text(size=12)) +
xlab('Number of target genes') +
ylab('densities') +
theme_bw() +
theme(legend.position = "none")
```
Similarly to DoRothEA, the majority of TFsin CollecTRI have around 20 target genes, but
there are some that reach more than 1000.
We can also check how many TFs are repressors, TFs with most of their edges with
negative mode of regulation (`mor`), and how many are activators, TFs with most
of their edges with positive `mor`:
```{r prop_collectri}
prop <- net %>%
mutate(mor = case_when(mor < 0 ~ -1, mor > 0 ~ 1)) %>%
group_by(source, mor) %>%
summarize(n = n()) %>%
mutate(freq = n / sum(n)) %>%
filter(mor == 1)
ggplot(data=prop, aes(x=freq)) +
geom_density() +
theme(text = element_text(size=12)) +
xlab('% of positive edges') +
ylab('densities') +
theme_bw() +
theme(legend.position = "none")
```
As seen in DoRothEA, Most TFs in CollecTRI are also activators.