# Genomic ranges for organizing and interrogating genome-scale data

Period II has the following outline.  We want to understand the basic IRanges and GRanges infrastructure and then use it to organize and interrogate genomic experiments.

```
Period II. Working with general genomic features using GenomicRanges
  IRanges introduced
  Intra-range operations
  Inter-range operations
  GRanges
  Calculating overlaps
Range-oriented solutions for current experimental paradigms
  SummarizedExperiment: for RNA-seq and 450k methylation
  External storage for very large assays
  GenomicFiles for families of BAM or BED
  DNA Variants: VCF handling with VariantAnnotation and VariantTools
```


## Introducing IRanges

The following schematic diagram should be read from the bottom up.  The horizontal scale can be regarded
as genomic base positions.  

### Intra-range operations

We are working with the positions in the interval [5, 10].  We will learn how to interpret the methods
`shift`, `narrow`, `flank`, `resize`, and various arithmetic operations.

<img src="iranges.png" height="450" width="450" />

We create our basic IRanges instance:

In [None]:
suppressPackageStartupMessages({
    library(IRanges)
    library(Homo.sapiens)
    library(GenomicRanges)
    })
ir = IRanges(5, 10)
ir

The following calls illustrate selected 'intra-range' operations.

In [None]:
shift(ir, -2)

In [None]:
resize(ir, 1)
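
The remaining operations named above can be tried in the same way. The block below is a minimal sketch; the object name `x` is introduced here only for illustration, and the comments give the range each call should return for the interval [5, 10].

```r
library(IRanges)

# A single range covering positions 5 through 10 (same interval as in the text).
x <- IRanges(5, 10)

shift(x, -2)               # move two positions left: 3-8
narrow(x, start = 2)       # drop the first position: 6-10
flank(x, 2)                # two positions just before the start: 3-4
flank(x, 2, start = FALSE) # two positions just after the end: 11-12
resize(x, 1)               # width-1 range anchored at the start: 5-5
x + 1                      # extend by one position on each side: 4-11
x * 2                      # zoom in by a factor of 2, halving the width
```

Note that `narrow` takes positions relative to the range itself, while `shift` and the arithmetic operators work on the genomic coordinates.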

### Multi-range objects

We can create a family of ranges by passing vectors to the `IRanges` constructor.

In [None]:
ir <- IRanges(c(3, 8, 14, 15, 19, 34, 40),
  width = c(12, 6, 6, 15, 6, 2, 7))
ir

This range set is displayed in the figure below.  The intra-range operations will be applied elementwise.

In [None]:
resize(ir,1)  # leftmost width-1 position 

### Inter-range operations

Information about inter-range operations can be obtained using `?"inter-range-methods"`.  For example, for a multi-range instance `ir`, `reduce(ir)`
produces a new IRanges instance representing the merging of all locations occupied by any range.

In [None]:
reduce(ir)
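
A few other inter-range operations can be compared side by side on the same seven ranges. This is a sketch only; the name `multi` is hypothetical and is used so that the block runs on its own without touching `ir`.

```r
library(IRanges)

# The same seven ranges as in the text, under a separate (hypothetical) name.
multi <- IRanges(c(3, 8, 14, 15, 19, 34, 40),
                 width = c(12, 6, 6, 15, 6, 2, 7))

range(multi)      # one range spanning everything: 3-46
reduce(multi)     # merged occupied regions: 3-29, 34-35, 40-46
gaps(multi)       # unoccupied stretches between them: 30-33, 36-39
disjoin(multi)    # non-overlapping pieces that respect every range boundary
coverage(multi)   # run-length encoded count of ranges covering each position
```

Each of these returns an object whose length generally differs from the input, which is what distinguishes inter-range from intra-range operations.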

<img src="multirange.png" height="500" width="500" />

### Metadata and indexing for ranges

We can give names to ranges, associate multiple fields of metadata with each range (using `mcols`), and use bracket-style indexing.

In [None]:
names(ir) = letters[1:7]
ir[c("a", "d")]

In [None]:
mcols(ir) = mtcars[1:7,1:3]
ir

In [None]:
resize(ir,1) # metadata are propagated for intra-range operations

In [None]:
gaps(ir) # not for inter-range operations
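
Metadata columns can also drive the indexing. The sketch below rebuilds the named ranges under the hypothetical name `named` so that it is self-contained; `high_mpg` is likewise an illustrative name.

```r
library(IRanges)

# Hypothetical object `named`: the seven ranges from the text, with names a-g
# and the first three mtcars columns attached as metadata.
named <- IRanges(c(3, 8, 14, 15, 19, 34, 40),
                 width = c(12, 6, 6, 15, 6, 2, 7),
                 names = letters[1:7])
mcols(named) <- mtcars[1:7, 1:3]

mcols(named)$mpg                          # one metadata column as a plain vector
high_mpg <- named[mcols(named)$mpg > 21]  # logical indexing on that column
names(high_mpg)                           # "c" "d"
mcols(reduce(named))                      # NULL: inter-range results carry no metadata
```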

`IRanges` is the name of a formal class, and we can enumerate all known methods on this class:

In [None]:
length(methods(class="IRanges"))

Clearly there is substantial infrastructure defined for this concept.  The roles of some of these
methods in genome-scale analysis become clearer in the next section.

## GRanges to handle the context of genomic coordinates

Base positions and intervals on genomic sequences can be modeled using IRanges, but it is essential
to add metadata that establish a number of contextual details.  It is typical to maintain information
about chromosome identity and chromosome length, along with labels for genome build and origin.
We saw one example earlier: applying the `genes` method to `Homo.sapiens`.

In [None]:
library(Homo.sapiens)
hg = genes(Homo.sapiens)
hg

There is an obligatory metadata construct called `seqnames` that gives the chromosome occupied by the gene whose start and end positions are modeled by the associated `IRanges`.  Strand is also recorded.  

Plus-strand features have the biological direction from left to right on the number line, and minus-strand features run from right to left. In terms of the IRanges, a plus-strand feature is read from start to end and a minus-strand feature from end to start. Start is always the smaller coordinate because width is defined as end - start + 1 and negative-width ranges are not allowed. Because DNA has two strands with opposite directionality, strand is needed to refer to a DNA feature unambiguously.

Strand may have values `+`, `-`, or `*` for unspecified.  `seqinfo` collects information on the chromosome names, lengths, circularity, and reference build.
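
The effect of strand is easiest to see on a small hand-built object. The sketch below uses made-up coordinates on a made-up 1000-bp `chr1`; the name `toy` and all values are assumptions for illustration, not taken from `Homo.sapiens`.

```r
library(GenomicRanges)

# Hypothetical toy data: two 50-bp features on a 1000-bp "chr1".
toy <- GRanges(seqnames = "chr1",
               ranges = IRanges(start = c(101, 201), width = 50),
               strand = c("+", "-"),
               seqlengths = c(chr1 = 1000))

seqnames(toy)     # chromosome of each feature
strand(toy)       # "+" and "-"
seqinfo(toy)      # sequence names, lengths, circularity, genome build

# start <= end on both strands; strand says which end is biologically first.
start(toy)                 # 101 201
end(toy)                   # 150 250
start(resize(toy, 1))      # 101 250: the 5' end is `start` on "+", `end` on "-"
start(flank(toy, 10))      # 91 251: upstream is leftward on "+", rightward on "-"
```

The same `resize` and `flank` calls that were purely positional for IRanges become strand-aware once the ranges live in a GRanges.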

### Vector operations

GRanges can be treated as any standard vector.

In [None]:
hg[1:4] # first four in the lexical ordering of `names(hg)`

In [None]:
sort(hg)[1:4]  # physical ordering on plus strand

In [None]:
savestrand = strand(hg)
strand(hg) = "*"
sort(hg)[1:4] # different!

In [None]:
strand(hg) = savestrand  # restore
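
The influence of strand on sorting can be checked on a toy object, and `sort` accepts an `ignore.strand` argument that avoids overwriting and restoring the strand. This is a sketch with made-up coordinates; the name `g` is hypothetical.

```r
library(GenomicRanges)

# Hypothetical toy ranges on one chromosome, mixing strands.
g <- GRanges("chr1", IRanges(start = c(300, 100, 200), width = 10),
             strand = c("+", "-", "+"))

start(sort(g))                        # 200 300 100: "+" ranges precede "-"
start(sort(g, ignore.strand = TRUE))  # 100 200 300: ordered by position only
```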

### Multichromosome context

`seqinfo` is both an accessor method and an important component of well-annotated GenomicRanges instances.

In [None]:
seqinfo(hg)

In [None]:
sum(isCircular(hg), na.rm=TRUE) # how many circular chromosomes?

In [None]:
seqinfo(hg)["chrM"]

In [None]:
# table(seqnames(hg)) # counts of genes per chromosome (or random/unmapped contig)

In [None]:
hg[ which(seqnames(hg)=="chr22") ]

In [None]:
hgs = keepStandardChromosomes(hg, pruning.mode="coarse") # eliminate random/unmapped
hgs

In [None]:
table(seqnames(hgs))
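
When the set of sequences to retain is known in advance, it can also be named explicitly. The sketch below uses `keepSeqlevels` as an explicit alternative to `keepStandardChromosomes` on made-up data; the object names `contigs` and `main_chr` and the coordinates are assumptions for illustration.

```r
library(GenomicRanges)

# Hypothetical toy data: three features, one on an unplaced contig.
contigs <- GRanges(seqnames = c("chr1", "chr2", "chrUn_gl000211"),
                   ranges = IRanges(start = c(10, 20, 30), width = 5))

seqlevels(contigs)          # all sequence names known to the object
table(seqnames(contigs))    # features per sequence

# Keep only the named sequences; "coarse" drops features on any other sequence.
main_chr <- keepSeqlevels(contigs, c("chr1", "chr2"), pruning.mode = "coarse")
length(main_chr)            # 2
seqlevels(main_chr)         # "chr1" "chr2"
```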

### GRangesList for grouped genomic elements

Exons are elements of gene models.  The `exons` method returns a flat GRanges recording exon positions.  `exonsBy` organizes the exons into genes, yielding a special structure called `GRangesList`.

In [None]:
ebg = exonsBy(Homo.sapiens, by = "gene")
ebg

In [None]:
elementNROWS(ebg)[1:10] # number of exons recorded per gene
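
The grouping can be reproduced on a small scale by splitting a flat GRanges on a metadata column. The sketch below uses made-up exons for two made-up genes; `exons_toy`, `by_gene`, and the `gene` column are hypothetical names.

```r
library(GenomicRanges)

# Hypothetical toy exons for two made-up genes.
exons_toy <- GRanges("chr1",
                     IRanges(start = c(100, 300, 1000), width = c(50, 80, 200)),
                     strand = c("+", "+", "-"),
                     gene = c("geneA", "geneA", "geneB"))

# Splitting by a metadata column gives one GRanges per gene.
by_gene <- split(exons_toy, exons_toy$gene)

is(by_gene, "GRangesList")   # TRUE
elementNROWS(by_gene)        # exons per gene: geneA 2, geneB 1
sum(width(by_gene))          # total exonic bases per gene: geneA 130, geneB 200
by_gene[["geneA"]]           # one gene's exons as a GRanges
unlist(by_gene)              # flatten back to a single GRanges
```

Most accessors (`start`, `width`, `strand`) return list-like results on a GRangesList, one element per group, so summaries such as `sum(width(...))` work gene by gene.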

In [None]:
# length(methods(class="GRangesList"))

In [None]:
# keepStandardChromosomes(ebg, pruning.mode="coarse")

In [None]:
# Draw each range of a GRanges as a rectangle, stacking overlapping ranges
# in separate rows.  The title is the expression passed as `x`.
plotGRanges = function (x, xlim = x, col = "black", sep = 0.5, ...) 
{
    main = deparse(substitute(x))
    ch = as.character(seqnames(x)[1])   # x-axis label: first chromosome
    x = ranges(x)
    height <- 1
    # xlim may be a ranges object (use its extent) or a numeric c(min, max)
    if (is(xlim, "Ranges")) 
        xlim <- c(min(start(xlim)), max(end(xlim)))
    bins <- disjointBins(IRanges(start(x), end(x) + 1)) 
    plot.new()
    plot.window(xlim = xlim, c(0, max(bins) * (height + sep)))
    ybottom <- bins * (sep + height) - height
    rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, 
        col = col, ...)
    title(main, xlab = ch) 
    axis(1)
}
par(mfrow=c(4,1), mar=c(4,2,2,2))
library(GenomicRanges)
# first four ranges on the plus strand, last three on the minus strand
gir = GRanges(seqnames="chr1", ir, strand=c(rep("+", 4), rep("-",3)))
plotGRanges(gir, xlim=c(0,60))
plotGRanges(resize(gir,1), xlim=c(0,60),col="green")           # 5' ends
plotGRanges(flank(gir,3), xlim=c(0,60), col="purple")          # 3 bases upstream
plotGRanges(flank(gir,2,start=FALSE), xlim=c(0,60), col="brown") # 2 bases downstream
