/
an-introduction.Rmd
467 lines (375 loc) · 15.3 KB
/
an-introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
---
title: "Getting started with the plyranges package"
author: "Stuart Lee"
package: plyranges
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc_float: true
vignette: >
%\VignetteIndexEntry{Introduction}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
---
# `Ranges` revisited
In Bioconductor there are two classes, `IRanges` and `GRanges`, that are
standard data structures for representing genomics data. Throughout this
document I refer to either of these classes as `Ranges` if an operation can be
performed on either class, otherwise I explicitly mention if a function is
appropriate for an `IRanges` or `GRanges`.
`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals (which have
additional attributes, sequence name, and strand) as `GRanges`. In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example GC content over a genomic interval).
`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, while each column will represent a variable about
that interval, and generally each object will represent a single unit of
observation (like gene annotations).
Consequently, `Ranges` objects provide a powerful representation for reasoning
about genomic data. In this vignette, you will learn more about `Ranges`
objects and how via grouping, restriction and summarisation you can perform
common data tasks.
# Constructing `Ranges`
To construct an `IRanges` we require that there are at least two columns that
represent at either a starting coordinate, finishing coordinate or the width of
the interval.
```{r}
suppressPackageStartupMessages(library(plyranges))
set.seed(100)
df <- data.frame(start=c(2:-1, 13:15),
width=c(0:3, 2:0))
# produces IRanges
rng <- df %>% as_iranges()
rng
```
To construct a `GRanges` we require a column that represents that sequence name
( contig or chromosome id), and an optional column to represent the
strandedness of an interval.
```{r}
# seqname is required for GRanges, metadata is automatically kept
grng <- df %>%
transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
strand = sample(c("+", "-"), 7, replace = TRUE),
gc = runif(7)) %>%
as_granges()
grng
```
# Arithmetic on Ranges
Sometimes you want to modify a genomic interval by altering the width of the
interval while leaving the start, end or midpoint of the coordinates unaltered.
This is achieved with the `mutate` verb along with `anchor_*` adverbs.
The act of anchoring fixes either the start, end, center coordinates of the
`Range` object, as shown in the figure and code below and anchors are used in
combination with either `mutate` or `stretch`.
By default, the start coordinate will be anchored, so regardless of strand.
For behavior similar to `GenomicRanges::resize`, use `anchor_5p`.
```{r anchor_fig, echo = FALSE, out.width="400px", fig.align="center"}
knitr::include_graphics("anchors.png", dpi = 150)
```
```{r}
rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8)))
grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8),
seqnames = "seq1",
strand = c("+", "*", "-")))
mutate(rng, width = 10)
mutate(anchor_start(rng), width = 10)
mutate(anchor_end(rng), width = 10)
mutate(anchor_center(rng), width = 10)
mutate(anchor_3p(grng), width = 10) # leave negative strand fixed
mutate(anchor_5p(grng), width = 10) # leave positive strand fixed
```
Similarly, you can modify the width of an interval using the `stretch` verb.
Without anchoring, this function will extend the interval in either direction
by an integer amount. With anchoring, either the start, end or midpoint are
preserved.
```{r}
rng2 <- stretch(anchor_center(rng), 10)
rng2
stretch(anchor_end(rng2), 10)
stretch(anchor_start(rng2), 10)
stretch(anchor_3p(grng), 10)
stretch(anchor_5p(grng), 10)
```
`Ranges` can be shifted left or right. If strand information is available we
can also shift upstream or downstream.
```{r}
shift_left(rng, 100)
shift_right(rng, 100)
shift_upstream(grng, 100)
shift_downstream(grng, 100)
```
# Grouping `Ranges`
`plyranges` introduces a new class of `Ranges` called `RangesGrouped`, this is
a similar idea to the grouped `data.frame\tibble` in `dplyr`.
Grouping can act on either the core components or the metadata columns of a
`Ranges` object.
It is most effective when combined with other verbs such as `mutate()`,
`summarise()`, `filter()`, `reduce_ranges()` or `disjoin_ranges()`.
```{r}
grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
strand = sample(c("+", "-"), 7, replace = TRUE),
gc = runif(7),
start = 1:7,
width = 10) %>%
as_granges()
grng_by_strand <- grng %>%
group_by(strand)
grng_by_strand
```
# Restricting `Ranges`
The verb `filter` can be used to restrict rows in the `Ranges`. Note that
grouping will cause the `filter` to act within each group of the data.
```{r}
grng %>% filter(gc < 0.3)
# filtering by group
grng_by_strand %>% filter(gc == max(gc))
```
We also provide the convenience methods `filter_by_overlaps` and
`filter_by_non_overlaps` for restricting by any overlapping `Ranges`.
```{r}
ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>%
as_iranges()
ir1 <- data.frame(start = 2:6, width = 3:7) %>%
as_iranges()
ir0
ir1
ir0 %>% filter_by_overlaps(ir1)
ir0 %>% filter_by_non_overlaps(ir1)
```
# Summarising `Ranges`
The `summarise` function will return a `DataFrame` because the information
required to return a `Ranges` object is lost. It is often most useful to use
`summarise()` in combination with the `group_by()` family of functions.
```{r}
ir1 <- ir1 %>%
mutate(gc = runif(length(.)))
ir0 %>%
group_by_overlaps(ir1) %>%
summarise(gc = mean(gc))
```
# Joins, or another way at looking at overlaps between `Ranges`
A join acts on two GRanges objects, a query and a subject.
```{r}
query <- data.frame(seqnames = "chr1",
strand = c("+", "-"),
start = c(1, 9),
end = c(7, 10),
key.a = letters[1:2]) %>%
as_granges()
subject <- data.frame(seqnames = "chr1",
strand = c("-", "+"),
start = c(2, 6),
end = c(4, 8),
key.b = LETTERS[1:2]) %>%
as_granges()
```
```{r olap, echo = FALSE, fig.cap = "Query and Subject Ranges", message = FALSE}
library(ggplot2)
query_df <- as.data.frame(query)[, -6]
query_df$key <- "Query"
subject_df <- as.data.frame(subject)[, -6]
subject_df$key <- "Subject"
melted_ranges <- rbind(query_df, subject_df)
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
geom_rect() +
facet_grid(key ~ .) +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
xlab("Position") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
```
The join operator is relational in the sense that metadata from the query and
subject ranges is retained in the joined range. All join operators in the
`plyranges` DSL generate a set of hits based on overlap or proximity of ranges
and use those hits to merge the two datasets in different ways. There are four
supported matching algorithms: _overlap_, _nearest_, _precede_, and _follow_.
We can further restrict the matching by whether the query is completely
_within_ the subject, and adding the _directed_ suffix ensures that matching
ranges have the same direction (strand).
```{r olaps-diagram, echo = FALSE, out.width="600px"}
knitr::include_graphics("olaps.png")
```
The first function, `join_overlap_intersect()` will return a `Ranges` object
where the start, end, and width coordinates correspond to the amount of any
overlap between the left and right input `Ranges`. It also returns any
metadatain the subject range if the subject overlaps the query.
```{r}
intersect_rng <- join_overlap_intersect(query, subject)
intersect_rng
```
```{r intersect-join, echo = FALSE, fig.cap="Intersect Join"}
intersect_df <- as.data.frame(intersect_rng)[, -c(6,7)]
intersect_df$key <- "Intersect Join"
melted_ranges <- rbind(query_df, subject_df, intersect_df)
melted_ranges$key <- factor(melted_ranges$key,
levels = c("Query", "Subject", "Intersect Join"))
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
geom_rect() +
facet_grid(key ~ .) +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
xlab("Position") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
```
The `join_overlap_inner()` function will return the `Ranges` in the query that
overlap any `Ranges` in the subject. Like the `join_overlap_intersect()`
function metadata of the subject `Range` is returned if it overlaps the query.
```{r}
inner_rng <- join_overlap_inner(query, subject)
inner_rng
```
```{r inner-join, echo = FALSE, fig.cap = "Inner Join", message = FALSE}
inner_df <- as.data.frame(inner_rng)[, -c(6,7)]
inner_df$ymin <- c(1,4)
inner_df$ymax <- c(3,6)
inner_df$key <- "Inner Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, inner_df)
melted_ranges$key <- factor(melted_ranges$key,
levels = c("Query", "Subject", "Inner Join"))
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
geom_rect() +
facet_grid(key ~ ., scales = "free_y") +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
xlab("Position") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
```
We also provide a convenience method called `find_overlaps` that computes the
same result as `join_overlap_inner()`.
```{r}
find_overlaps(query, subject)
```
The `join_overlap_left()` method will perform an outer left join.
First any overlaps that are found will be returned similar to
`join_overlap_inner()`. Then any non-overlapping ranges will be returned, with
missing values on the metadata columns.
```{r}
left_rng <- join_overlap_left(query, subject)
left_rng
```
```{r olap-left, echo = FALSE, fig.cap = "Left Join", message = FALSE}
left_df <- as.data.frame(left_rng)[, -c(6,7)]
left_df$ymin <- c(1,4, 1)
left_df$ymax <- c(3,6, 3)
left_df$key <- "Left Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, left_df)
melted_ranges$key <- factor(melted_ranges$key,
levels = c("Query", "Subject", "Left Join"))
ggplot(melted_ranges,
aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
geom_rect() +
facet_grid(key ~ ., scales = "free_y") +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
xlab("Position") +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank())
```
Compared with `filter_by_overlaps()` above, the overlap left join expands the
`Ranges` to give information about each interval on the query `Ranges` that
overlap those on the subject `Ranges` as well as the intervals on the left that
do not overlap any range on the right.
## Finding your neighbours
We also provide methods for finding nearest, preceding or following `Ranges`.
Conceptually this is identical to our approach for finding overlaps, except the
semantics of the join are different.
```{r neigbours, echo = FALSE, out.width="600px"}
knitr::include_graphics("neighbours.png")
```
```{r}
join_nearest(ir0, ir1)
join_follow(ir0, ir1)
join_precede(ir0, ir1) # nothing precedes returns empty `Ranges`
join_precede(ir1, ir0)
```
## Example: dealing with multi-mapping
This example is taken from the Bioconductor support
[site](https://support.bioconductor.org/p/100046/).
We have two `Ranges` objects. The first contains single nucleotide positions
corresponding to an intensity measurement such as a ChiP-seq experiment, while
the other contains coordinates for two genes of interest.
We want to identify which positions in the `intensities` `Ranges` overlap the
genes, where each row corresponds to a position that overlaps a single gene.
First we create the two `Ranges` objects
```{r ex1}
intensities <- data.frame(seqnames = "VI",
start = c(3320:3321,3330:3331,3341:3342),
width = 1) %>%
as_granges()
intensities
genes <- data.frame(seqnames = "VI",
start = c(3322, 3030),
end = c(3846, 3338),
gene_id=c("YFL064C", "YFL065C")) %>%
as_granges()
genes
```
Now to find where the positions overlap each gene, we can perform an overlap
join. This will automatically carry over the gene_id information as well as
their coordinates (we can drop those by only selecting the gene_id).
```{r}
olap <- join_overlap_inner(intensities, genes) %>%
select(gene_id)
olap
```
Several positions match to both genes. We can count them using `summarise` and
grouping by the `start` position:
```{r}
olap %>%
group_by(start) %>%
summarise(n = n())
```
## Grouping by overlaps
It's also possible to group by overlaps. Using this approach we can count the
number of overlaps that are greater than 0.
```{r}
grp_by_olap <- ir0 %>%
group_by_overlaps(ir1)
grp_by_olap
grp_by_olap %>%
mutate(n_overlaps = n())
```
Of course we can also add overlap counts via the `count_overlaps()` function.
```{r}
ir0 %>%
mutate(n_overlaps = count_overlaps(., ir1))
```
# Data Import/Output
We provide convenience functions via `rtracklayer` and `GenomicAlignments` for
reading/writing the following data formats to/from `Ranges` objects.
| `plyranges` functions | File Format |
|-----------------------|-------------|
| `read_bam()` | BAM |
| `read_bed()`/`write_bed()` | BED |
| `read_bed_graph()`/ `write_bed_graph()` | BEDGraph |
| `read_narrowpeaks()`/`write_narrowpeaks()` | narrowPeaks |
| `read_gff()` / `write_gff()` | GFF(1-3)/ GTF |
| `read_bigwig()` / `write_bigwig()` | BigWig |
| `read_wig()` /`write_wig()` | Wig |
# Learning more
There are many other resources and workshops available to learn to use
`plyranges` and related Bioconductor packages, especially for more realistic
analyses than the ones covered here:
- The [fluentGenomics](https://bioconductor.org/packages/release/workflows/html/fluentGenomics.html) workflow package is an end-to-end workflow package for integrating differential
expression results with differential accessibility results.
- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.
- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
walk through of using plyranges for coverage analysis.
- The [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html) to follow
up on interesting hits from a combined RNA-seq and ATAC-seq analysis.
- The [journal article](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1597-8) ([preprint here](https://www.biorxiv.org/content/early/2018/05/23/327841)) has
details about the overall philosophy and design of plyranges.
# Appendix
```{r}
sessionInfo()
```