# Brown Univ. Introduction to Bioconductor 2018, Period 3

## Genomic annotation with Bioconductor

### A hierarchy of annotation concepts

Bioconductor includes many different types of genomic annotation.
We can think of these annotation resources in a hierarchical structure.

- At the base is the __reference genomic sequence__ for an organism.
This is always arranged into chromosomes, specified by linear
sequences of nucleotides.
- Above this is the organization of chromosomal sequence into
__regions of interest__.  The most prominent regions of interest are
genes, but other structures like SNPs or CpG sites are
annotated as well.  Genes have internal structure,
with parts that are transcribed and parts that are not,
and "gene models" define the ways in which
these structures are labeled and laid out in genomic coordinates.
- Within this concept of __regions of interest__ we also identify
__platform-oriented annotation__.  This type of annotation is typically
provided first by the manufacturer of an assay, but then refined
as research identifies ambiguities or updates to
initially declared roles for assay probe elements.  The
[brainarray project](http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp) 
at the University of Michigan illustrates this process for Affymetrix array annotation.  We
address this topic of platform-oriented annotation at the very end of this chapter.
- Above regions of interest is the organization of regions (most often
genes or gene products) into
__groups with shared structural or functional properties__.  Examples
include pathways, groups of genes found together in cells, or
genes identified as cooperating in biological processes.

In [None]:
suppressPackageStartupMessages({
    library(BSgenome)
    library(DT)
    library(Homo.sapiens)
    library(TxDb.Hsapiens.UCSC.hg19.knownGene)
    library(org.Hs.eg.db)
    library(ensembldb)
    library(EnsDb.Hsapiens.v75)
    library(AnnotationHub)
})

### Discovering available reference genomes

Bioconductor's collection of annotation packages brings
all elements of this hierarchy into a programmable environment.
Reference genomic sequences are managed using the infrastructure
of the Biostrings and BSgenome packages, and the `available.genomes`
function lists the reference genome builds currently available for
humans and various model organisms.

In [None]:
library(BSgenome)
library(DT)
ag = available.genomes()
datatable(data.frame(packs=ag))
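
The full list is long, so it can help to narrow it to a single organism. The following is a minimal sketch, assuming the package names follow the usual `BSgenome.<organism>.<provider>.<build>` pattern; the variable names `hs`, `parts`, and `builds` are ours, and `available.genomes()` needs network access to the Bioconductor repositories.

```r
library(BSgenome)

ag <- available.genomes()   # queries the Bioconductor repositories

# Keep only the human genome packages
hs <- grep("Hsapiens", ag, value = TRUE)
hs

# Split the dotted package names; the fourth field is the build label
parts <- strsplit(hs, ".", fixed = TRUE)
builds <- unique(vapply(parts, function(p) p[4], character(1)))
builds
```

The companion function `installed.genomes()` reports which of these packages are already installed locally.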

### Reference build versions are important

The reference build for an organism is created de novo
and then refined as algorithms and sequencing data improve.
For humans, the Genome Reference Consortium signed off on
build 37 in 2009, and on build 38 in 2013.

Once a reference build is completed, it becomes easy to
perform informative genomic sequence analysis on individuals, because one can
focus on regions that are known to harbor allelic diversity.

Note that the genome sequence packages have long names
that include build versions.  It is very important to avoid
mixing coordinates from different reference builds.
In the liftOver video we show how to convert genomic coordinates of
features between different reference builds, using the UCSC
"liftOver" utility interfaced to R in the
rtracklayer package.

To help users avoid mixing up data collected on incompatible
genomic coordinate systems from different reference builds, we
include a "genome" tag that can be filled out for most objects
that hold sequence information.  We'll see some examples of
this shortly.  Software for sequence comparison can check
for compatible tags on the sequences
being compared, and thereby help to ensure meaningful results.
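
To make the role of the "genome" tag concrete, here is a small sketch using toy ranges (the coordinates are invented for illustration). It assumes the GenomicRanges package is installed; two ranges on the same chromosome are tagged with different builds, and the overlap computation is expected to refuse to compare them.

```r
library(GenomicRanges)

# Two toy ranges on the same chromosome, tagged with different builds
g19 <- GRanges("chr1", IRanges(start = 100, end = 200))
g38 <- GRanges("chr1", IRanges(start = 150, end = 250))
genome(g19) <- "hg19"
genome(g38) <- "hg38"

genome(g19)
genome(g38)

# Capture the error raised when the tags disagree
res <- tryCatch(findOverlaps(g19, g38),
                error = function(e) conditionMessage(e))
res
```

If both objects carried the same tag, or no tag at all, `findOverlaps` would return a `Hits` object instead of an error, so filling in the tag is what enables the check.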

<a name="hsap"></a>

## A reference genomic sequence for H. sapiens

The reference sequence for *Homo sapiens* is acquired by installing
and attaching
a single package.  This is in contrast to downloading and parsing
FASTA files.  The package defines an object `Hsapiens`
that is the source of chromosomal sequence, but when
evaluated on its own
provides a report of the origins of the sequence data that
it contains.

In [None]:
library(BSgenome.Hsapiens.UCSC.hg19)
Hsapiens

We acquire a chromosome's sequence using the `$` operator.

In [None]:
Hsapiens$chr17
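
Whole chromosomes are large, so in practice one often extracts only a region. A minimal sketch, assuming the hg19 package above is installed; the interval is arbitrary and chosen only for illustration, and the names `roi` and `s` are ours.

```r
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicRanges)

# An arbitrary 100-base interval on chr17
roi <- GRanges("chr17", IRanges(start = 41196312, width = 100))

s <- getSeq(Hsapiens, roi)   # a DNAStringSet holding one sequence
width(s)
alphabetFrequency(s, baseOnly = TRUE)

# The genome object carries the build tag discussed earlier
unique(genome(Hsapiens))
```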

<a name="txUCSCnENSEMBLE"></a>

## The transcripts and genes for a reference sequence

### UCSC annotation

The `TxDb` family of packages and data objects manages
information on transcripts and gene models.  We consider
those derived from annotation tables prepared for the
UCSC genome browser.

In [None]:
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene # abbreviate
txdb

We can use `genes()` to get the genomic addresses of genes, which
are identified by their Entrez Gene IDs.

In [None]:
ghs = genes(txdb)
ghs

Filtering is supported, with suitable identifiers.
Here we select all exons identified for two
different genes, identified by their Entrez Gene ids:

In [None]:
eForTwo = exons(txdb, columns=c("EXONID", "TXNAME", "GENEID"),
                  filter=list(gene_id=c(100, 101)))
eForTwo

In [None]:
split(eForTwo, unlist(eForTwo$GENEID)) #notice that GENEID is a CharacterList
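
The exon query identifies genes only by Entrez Gene ID. One way to see which genes these are is to look the IDs up in `org.Hs.eg.db`, which was attached at the start of this section. This is a sketch; the names `ids` and `sym` are ours.

```r
library(org.Hs.eg.db)

# Entrez Gene IDs are stored as character keys in org.Hs.eg.db
ids <- c("100", "101")
sym <- AnnotationDbi::select(org.Hs.eg.db, keys = ids,
                             keytype = "ENTREZID",
                             columns = c("SYMBOL", "GENENAME"))
sym
```

Calling `select` with the `AnnotationDbi::` prefix avoids ambiguity when other attached packages also define a function named `select`.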

### ENSEMBL annotation

From the [Ensembl home page](http://www.ensembl.org/index.html):
"Ensembl creates, integrates and distributes reference datasets and
analysis tools that enable genomics".  This project is hosted
at the [European Bioinformatics Institute](https://www.ebi.ac.uk/)
(EMBL-EBI), part of the European Molecular Biology Laboratory,
which has been supportive of general interoperation of
annotation resources with
Bioconductor.

The [ensembldb](http://www.bioconductor.org/packages/ensembldb) package includes a vignette
with the following commentary:

_The ensembldb package provides functions to create and use
transcript centric annotation databases/packages. The annotation for the
databases are
directly fetched from Ensembl using their Perl
API. The functionality and data is similar to
that of the TxDb packages from the  GenomicFeatures
package, but, in addition to retrieve all gene/transcript models
and annotations from the database, the
ensembldb package provides also a filter framework allowing
to retrieve annotations for specific entries like
genes encoded on a chromosome region or transcript
models of lincRNA genes. From version 1.7 on,
EnsDb databases created by the ensembldb package contain
also protein annotation data
(see [Section 11](http://bioconductor.org/packages/release/bioc/vignettes/ensembldb/inst/doc/ensembldb.html#org35014ed) for
the database layout and an
overview of available attributes/columns). For more information
on the use of the protein annotations refer to the proteins vignette._

In [None]:
library(ensembldb)
library(EnsDb.Hsapiens.v75)
names(listTables(EnsDb.Hsapiens.v75))

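As an illustration, we retrieve the transcripts annotated to the gene ZBTB16, together with their protein identifiers, UniProt identifiers, and transcript biotypes: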

In [None]:
edb = EnsDb.Hsapiens.v75  # abbreviate
txs <- transcripts(edb, filter = GenenameFilter("ZBTB16"),
                   columns = c("protein_id", "uniprot_id", "tx_biotype"))
txs
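
The filter framework mentioned in the vignette excerpt also lets several conditions be combined. The sketch below asks for lincRNA genes on chromosome Y; it assumes the `AnnotationFilterList` constructor from the AnnotationFilter package (attached along with ensembldb), and the names `flt` and `lincY` are ours.

```r
library(ensembldb)
library(EnsDb.Hsapiens.v75)

edb <- EnsDb.Hsapiens.v75

# Combine two filters: genes on chromosome Y with biotype lincRNA.
# Ensembl uses seqnames like "Y", not the UCSC style "chrY".
flt <- AnnotationFilterList(SeqNameFilter("Y"),
                            GeneBiotypeFilter("lincRNA"))
lincY <- genes(edb, filter = flt,
               columns = c("gene_name", "gene_biotype"))
lincY

# The result carries a genome tag, as discussed earlier
unique(genome(lincY))
```

Because the Ensembl and UCSC resources use different chromosome naming styles, ranges from `edb` and `txdb` should not be compared without first harmonizing the sequence names.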

## AnnotationHub -- curated access to reference annotation

From the [AnnotationHub](http://www.bioconductor.org/packages/AnnotationHub) vignette:

_The AnnotationHub server provides easy R / Bioconductor access to large collections of publicly available whole genome resources, e.g., ENSEMBL genome fasta or gtf files, UCSC chain resources, ENCODE data tracks at UCSC, etc._

We will get a general overview and then carry out a detailed query.  We start by loading the package
and obtaining a hub object.

In [None]:
library(AnnotationHub)
ah = AnnotationHub()
ah

Note that there is a specific snapshot date.  The `mcols` method produces metadata about
the various resources.  Individual metadata fields can also be reached with the `$` shortcut, as in `ah$rdataclass`.

In [None]:
dim(mcols(ah))

The `rdataclass` field of the metadata tells us what kinds of representations are available.

In [None]:
table(ah$rdataclass)

In [None]:
mcols(ah)[which(ah$rdataclass=="VcfFile"),]
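
Subsetting on a metadata column works, but the hub also supports keyword search through `query()`. A minimal sketch, assuming network access to the hub; the search terms are only an example, the name `q` is ours, and the retrieval step is left commented out because it downloads data.

```r
library(AnnotationHub)

ah <- AnnotationHub()   # contacts the hub; metadata are cached locally

# Records whose metadata match both search terms
q <- query(ah, c("Homo sapiens", "VcfFile"))
length(q)
head(q$title)

# A resource is retrieved with double brackets and a hub ID from names(q):
# res <- ah[[names(q)[1]]]   # downloads and caches the resource
```

Note that `query()` matches the terms against several metadata fields, so the result can differ from a strict filter on `rdataclass`.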