vignettes/an-introduction.Rmd

---
title: "Getting started with the plyranges package"
author: "Stuart Lee"
package: plyranges
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Introduction}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# `Ranges` revisited

In Bioconductor there are two classes, `IRanges` and `GRanges`, that are
standard data structures for representing genomics data.  Throughout this
document I refer to either of these classes as `Ranges` if an operation can be
performed on either class, otherwise I explicitly mention if a function is
appropriate for an `IRanges` or `GRanges`.

`Ranges` objects can either represent sets of integers as `IRanges` (which have
start, end and width attributes) or represent genomic intervals (which have
additional attributes, sequence name, and strand) as `GRanges`. In addition,
both types of `Ranges` can store information about their intervals as metadata
columns (for example GC content over a genomic interval).

`Ranges` objects follow the tidy data principle: each row of a `Ranges` object
corresponds to an interval, while each column will represent a variable about
that interval, and generally each object will represent a single unit of
observation (like gene annotations).

Consequently, `Ranges` objects provide a powerful representation for reasoning
about genomic data. In this vignette, you will learn more about `Ranges`
objects and how via grouping, restriction and summarisation you can perform
common data tasks.


# Constructing `Ranges`

To construct an `IRanges` we require that there are at least two columns that
represent at either a starting coordinate, finishing coordinate or the width of
the interval.

```{r}
suppressPackageStartupMessages(library(plyranges))
set.seed(100)
df <- data.frame(start=c(2:-1, 13:15),
                 width=c(0:3, 2:0))

# produces IRanges
rng <- df %>% as_iranges()
rng

```

To construct a `GRanges` we require a column that represents that sequence name
( contig or chromosome id), and an optional column to represent the
strandedness of an interval.

```{r}
# seqname is required for GRanges, metadata is automatically kept
grng <- df %>%
  transform(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7)) %>%
  as_granges()

grng
```

# Arithmetic on Ranges

Sometimes you want to modify a genomic interval by altering the width of the
interval while leaving the start, end or midpoint of the coordinates unaltered.
This is achieved with the `mutate` verb along with `anchor_*` adverbs.

The act of anchoring fixes either the start, end, center coordinates of the
`Range` object, as shown in the figure and code below and anchors are used in
combination with either `mutate` or `stretch`.
By default, the start coordinate will be anchored, so regardless of strand. 
For behavior similar to `GenomicRanges::resize`, use `anchor_5p`.

```{r anchor_fig, echo = FALSE, out.width="400px", fig.align="center"}
knitr::include_graphics("anchors.png", dpi = 150)
```

```{r}
rng <- as_iranges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8)))
grng <- as_granges(data.frame(start=c(1, 2, 3), end=c(5, 2, 8),
                          seqnames = "seq1",
                          strand = c("+", "*", "-")))
mutate(rng, width = 10)
mutate(anchor_start(rng), width = 10)
mutate(anchor_end(rng), width = 10)
mutate(anchor_center(rng), width = 10)
mutate(anchor_3p(grng), width = 10) # leave negative strand fixed
mutate(anchor_5p(grng), width = 10) # leave positive strand fixed
```

Similarly, you can modify the width of an interval using the `stretch` verb.
Without anchoring, this function will extend the interval in either direction
by an integer amount. With anchoring, either the start, end or midpoint are
preserved.

```{r}
rng2 <- stretch(anchor_center(rng), 10)
rng2
stretch(anchor_end(rng2), 10)
stretch(anchor_start(rng2), 10)
stretch(anchor_3p(grng), 10)
stretch(anchor_5p(grng), 10)
```

`Ranges` can be shifted left or right. If strand information is available we
can also shift upstream or downstream.

```{r}
shift_left(rng, 100)
shift_right(rng, 100)
shift_upstream(grng, 100)
shift_downstream(grng, 100)
```

# Grouping `Ranges`

`plyranges` introduces a new class of `Ranges` called `RangesGrouped`, this is
a similar idea to the grouped `data.frame\tibble` in `dplyr`.

Grouping can act on either the core components or the metadata columns of a
`Ranges` object.

It is most effective when combined with other verbs such as `mutate()`,
`summarise()`, `filter()`, `reduce_ranges()` or `disjoin_ranges()`.
```{r}
grng <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
         strand = sample(c("+", "-"), 7, replace = TRUE),
         gc = runif(7),
         start = 1:7,
         width = 10) %>%
  as_granges()

grng_by_strand <- grng %>%
  group_by(strand)

grng_by_strand
```

# Restricting `Ranges`

The verb `filter` can be used to restrict rows in the `Ranges`. Note that
grouping will cause the `filter` to act within each group of the data.
```{r}
grng %>% filter(gc < 0.3)
# filtering by group
grng_by_strand %>% filter(gc == max(gc))
```

We also provide the convenience methods `filter_by_overlaps` and
`filter_by_non_overlaps` for restricting by any overlapping `Ranges`.

```{r}
ir0 <- data.frame(start = c(5,10, 15,20), width = 5) %>%
  as_iranges()
ir1 <- data.frame(start = 2:6, width = 3:7) %>%
  as_iranges()
ir0
ir1
ir0 %>% filter_by_overlaps(ir1)
ir0 %>% filter_by_non_overlaps(ir1)
```

# Summarising `Ranges`

The `summarise` function will return a `DataFrame` because the information
required to return a `Ranges` object is lost. It is often most useful to use
`summarise()` in combination with the `group_by()` family of functions.

```{r}
ir1 <- ir1 %>%
  mutate(gc = runif(length(.)))

ir0 %>%
  group_by_overlaps(ir1) %>%
  summarise(gc = mean(gc))
```

# Joins, or another way at looking at overlaps between `Ranges`

A join acts on two GRanges objects, a query and a subject.

```{r}
query <- data.frame(seqnames = "chr1",
               strand = c("+", "-"),
               start = c(1, 9),
               end =  c(7, 10),
               key.a = letters[1:2]) %>%
  as_granges()

subject <- data.frame(seqnames = "chr1",
               strand = c("-", "+"),
               start = c(2, 6),
               end = c(4, 8),
               key.b = LETTERS[1:2]) %>%
  as_granges()
```

```{r olap, echo = FALSE, fig.cap = "Query and Subject Ranges", message = FALSE}
library(ggplot2)
query_df <- as.data.frame(query)[, -6]
query_df$key <- "Query"
subject_df <- as.data.frame(subject)[, -6]
subject_df$key <- "Subject"
melted_ranges <- rbind(query_df, subject_df)
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
  geom_rect() +
  facet_grid(key ~ .) +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

The join operator is relational in the sense that metadata from the query and
subject ranges is retained in the joined range.  All join operators in the
`plyranges` DSL generate a set of hits based on overlap or proximity of ranges
and use those hits to merge the two datasets in different ways.  There are four
supported matching algorithms: _overlap_, _nearest_, _precede_, and _follow_.
We can further restrict the matching by whether the query is completely
_within_ the subject, and adding the _directed_ suffix ensures that matching
ranges have the same direction (strand).


```{r olaps-diagram, echo = FALSE, out.width="600px"}
knitr::include_graphics("olaps.png")
```

The first function, `join_overlap_intersect()` will return a `Ranges` object
where the start, end, and width coordinates correspond to the amount of any
overlap between the left and right input `Ranges`. It also returns any
metadatain the subject range if the subject overlaps the query.

```{r}
intersect_rng <- join_overlap_intersect(query, subject)
intersect_rng
```

```{r intersect-join, echo = FALSE, fig.cap="Intersect Join"}
intersect_df <- as.data.frame(intersect_rng)[, -c(6,7)]
intersect_df$key <- "Intersect Join"
melted_ranges <- rbind(query_df, subject_df, intersect_df)
melted_ranges$key <- factor(melted_ranges$key,
                              levels = c("Query", "Subject", "Intersect Join"))
ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = 1, ymax = 3)) +
  geom_rect() +
  facet_grid(key ~ .) +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```


The `join_overlap_inner()` function will return the `Ranges` in the query that
overlap any `Ranges` in the subject. Like the `join_overlap_intersect()`
function metadata of the subject `Range` is returned if it overlaps the query.

```{r}
inner_rng <- join_overlap_inner(query, subject)
inner_rng
```


```{r inner-join, echo = FALSE, fig.cap = "Inner Join", message = FALSE}
inner_df <- as.data.frame(inner_rng)[, -c(6,7)]
inner_df$ymin <- c(1,4)
inner_df$ymax <- c(3,6)
inner_df$key <- "Inner Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, inner_df)
melted_ranges$key <- factor(melted_ranges$key,
                              levels = c("Query", "Subject", "Inner Join"))

ggplot(melted_ranges, aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
  geom_rect() +
  facet_grid(key ~ ., scales = "free_y") +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

We also provide a convenience method called `find_overlaps` that computes the
same result as `join_overlap_inner()`.

```{r}
find_overlaps(query, subject)
```

The `join_overlap_left()` method will perform an outer left join.


First any overlaps that are found will be returned similar to
`join_overlap_inner()`. Then any non-overlapping ranges will be returned, with
missing values on the metadata columns.

```{r}
left_rng <- join_overlap_left(query, subject)
left_rng
```

```{r olap-left, echo = FALSE, fig.cap = "Left Join", message = FALSE}
left_df <- as.data.frame(left_rng)[, -c(6,7)]
left_df$ymin <- c(1,4, 1)
left_df$ymax <- c(3,6, 3)
left_df$key <- "Left Join"
melted_ranges <- rbind(query_df, subject_df)
melted_ranges$ymin <- 1
melted_ranges$ymax <- 3
melted_ranges <- rbind(melted_ranges, left_df)
melted_ranges$key <- factor(melted_ranges$key,
                              levels = c("Query", "Subject", "Left Join"))

ggplot(melted_ranges,
       aes(xmin = start, xmax = end, ymin = ymin, ymax = ymax)) +
  geom_rect() +
  facet_grid(key ~ ., scales = "free_y") +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) +
  xlab("Position") +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.title.y = element_blank())
```

Compared with `filter_by_overlaps()` above, the overlap left join expands the
`Ranges` to give information about each interval on the query `Ranges` that
overlap those on the subject `Ranges` as well as the intervals on the left that
do not overlap any range on the right.

## Finding your neighbours
We also provide methods for finding nearest, preceding or following `Ranges`.
Conceptually this is identical to our approach for finding overlaps, except the
semantics of the join are different.

```{r neigbours, echo = FALSE, out.width="600px"}
knitr::include_graphics("neighbours.png")
```

```{r}
join_nearest(ir0, ir1)
join_follow(ir0, ir1)
join_precede(ir0, ir1) # nothing precedes returns empty `Ranges`
join_precede(ir1, ir0)
```


## Example: dealing with multi-mapping
This example is taken from the Bioconductor support
[site](https://support.bioconductor.org/p/100046/).

We have two `Ranges` objects. The first contains single nucleotide positions
corresponding to an intensity measurement such as a ChiP-seq experiment, while
the other contains coordinates for two genes of interest.


We want to identify which positions in the `intensities` `Ranges` overlap the
genes, where each row corresponds to a position that overlaps a single gene.

First we create the two `Ranges` objects
```{r ex1}
intensities <- data.frame(seqnames = "VI",
                          start = c(3320:3321,3330:3331,3341:3342),
                          width = 1) %>%
  as_granges()

intensities

genes <- data.frame(seqnames = "VI",
                    start = c(3322, 3030),
                    end = c(3846, 3338),
                    gene_id=c("YFL064C", "YFL065C")) %>%
  as_granges()

genes
```

Now to find where the positions overlap each gene, we can perform an overlap
join. This will automatically carry over the gene_id information as well as
their coordinates (we can drop those by only selecting the gene_id).
```{r}
olap <- join_overlap_inner(intensities, genes) %>%
  select(gene_id)
olap
```

Several positions match to both genes. We can count them using `summarise` and
grouping by the `start` position:

```{r}
olap %>%
  group_by(start) %>%
  summarise(n = n())
```

## Grouping by overlaps

It's also possible to group by overlaps. Using this approach we can count the
number of overlaps that are greater than 0.

```{r}
grp_by_olap <- ir0 %>%
  group_by_overlaps(ir1)
grp_by_olap
grp_by_olap %>%
  mutate(n_overlaps = n())
```

Of course we can also add overlap counts via the `count_overlaps()` function.

```{r}
ir0 %>%
  mutate(n_overlaps = count_overlaps(., ir1))
```

# Data Import/Output

We provide convenience functions via `rtracklayer` and `GenomicAlignments` for
reading/writing the following data formats to/from `Ranges` objects.

| `plyranges` functions | File Format |
|-----------------------|-------------|
| `read_bam()` | BAM | 
| `read_bed()`/`write_bed()` | BED | 
| `read_bed_graph()`/ `write_bed_graph()` | BEDGraph | 
| `read_narrowpeaks()`/`write_narrowpeaks()` | narrowPeaks | 
| `read_gff()` / `write_gff()` | GFF(1-3)/ GTF | 
| `read_bigwig()` / `write_bigwig()` | BigWig | 
| `read_wig()` /`write_wig()` | Wig |


# Learning more

There are many other resources and workshops available to learn to use
`plyranges` and related Bioconductor packages, especially for more realistic
analyses than the ones covered here:

- The [fluentGenomics](https://bioconductor.org/packages/release/workflows/html/fluentGenomics.html) workflow package is an end-to-end workflow package for integrating differential
expression results with differential accessibility results.
- The [Bioc 2018 Workshop book](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using `plyranges` to analyse publicly available genomics data.
- The [extended vignette in the plyrangesWorkshops package](https://github.com/sa-lee/plyrangesWorkshops) has a detailed
walk through of using plyranges for coverage analysis.
- The [case study](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love using plyranges with [tximeta](https://bioconductor.org/packages/release/bioc/html/tximeta.html) to follow
up on interesting hits from a combined RNA-seq and ATAC-seq analysis.
- The [journal article](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1597-8) ([preprint here](https://www.biorxiv.org/content/early/2018/05/23/327841)) has
details about the overall philosophy and design of plyranges.

# Appendix

```{r}
sessionInfo()
```