Permalink
Fetching contributors…
Cannot retrieve contributors at this time
376 lines (262 sloc) 12.4 KB
---
title: 'valr overview'
date: '`r Sys.Date()`'
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 3
vignette: >
%\VignetteIndexEntry{valr-overview}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r knitr_opts, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.align = "center"
)
```
```{r init, echo = FALSE, message = FALSE}
library(valr)
library(dplyr)
library(ggplot2)
library(tibble)
```
## Why `valr`?
**Why another tool set for interval manipulations?** There are several [other software packages](#related_work) available for genome interval analysis. However, based on our experiences teaching genome analysis, we were motivated to develop a toolset that:
* Combines analysis and visualization in RStudio.
* Can be used to generate reports with Rmarkdown.
* Is highly extensible. New tools are quickly implemented on the R side.
* Leverages the "modern R" syntax, using `dplyr` and the pipe operator from `magrittr` (`%>%`).
* Maximizes speed by implementing compute-intensive algorithms in `Rcpp`.
* Facilitates interactive visulaizations with [`shiny`][13].
`valr` can currently be used for analysis of pre-processed data in BED and related formats. We plan to support BAM and VCF files soon via tabix indexes.
### Familiar tools, natively in R
The functions in `valr` have similar names to their `BEDtools` counterparts, and so will be familiar to users coming from the `BEDtools` suite. Similar to [`pybedtools`](https://daler.github.io/pybedtools/#why-pybedtools), `valr` has a terse syntax:
```{r syntax_demo, message = FALSE}
library(valr)
library(dplyr)
snps <- read_bed(valr_example('hg19.snps147.chr22.bed.gz'), n_fields = 6)
genes <- read_bed(valr_example('genes.hg19.chr22.bed.gz'), n_fields = 6)
# find snps in intergenic regions
intergenic <- bed_subtract(snps, genes)
# distance from intergenic snps to nearest gene
nearby <- bed_closest(intergenic, genes)
nearby %>%
select(starts_with('name'), .overlap, .dist) %>%
filter(abs(.dist) < 1000)
```
### Input data
`valr` assigns common column names to facilitate comparisons between tbls. All tbls will have `chrom`, `start`, and `end` columns, and some tbls from multi-column formats will have additional pre-determined column names. See the `read_bed()` documentation for details.
```{r file_io}
bed_file <- valr_example("3fields.bed.gz")
read_bed(bed_file) # accepts filepaths or URLs
```
`valr` can also operate on BED-like data.frames already constructed in R, provided that columns named `chrom`, `start` and `end` are present. New tbls can also be contructed using `trbl_interval()`.
```{r trbl_ivls}
bed <- trbl_interval(
~chrom, ~start, ~end,
"chr1", 1657492, 2657492,
"chr2", 2501324, 3094650
)
bed
```
### Interval coordinates
`valr` adheres to the BED [format](http://genome.ucsc.edu/FAQ/FAQformat#format1) which specifies that the start position for an interval is zero based and the end position is one-based. The first position in a chromosome is 0. The end position for a chromosome is one position passed the last base, and is not included in the interval. For example:
```{r zero-based}
# a chromosome 100 basepairs in length
chrom <- trbl_interval(
~chrom, ~start, ~end,
"chr1", 0, 100
)
chrom
# single basepair intervals
bases <- trbl_interval(
~chrom, ~start, ~end,
"chr1", 0, 1, # first base of chromosome
"chr1", 1, 2, # second base of chromosome
"chr1", 99, 100 # last base of chromosome
)
bases
```
### Remote databases
Remote databases can be accessed with `db_ucsc()` (to access the UCSC Browser) and `db_ensembl()` (to access Ensembl databases).
```{r db, warning = FALSE, eval = FALSE}
# access the `refGene` tbl on the `hg38` assembly.
if(require(RMySQL)) {
ucsc <- db_ucsc('hg38')
tbl(ucsc, 'refGene')
}
```
### Visual documentation
The `bed_glyph()` tool illustrates the results of operations in `valr`, similar to those found in the `BEDtools` documentation. This glyph shows the result of intersecting `x` and `y` intervals with `bed_intersect()`:
```{r intersect_glyph}
x <- tibble::tribble(
~chrom, ~start, ~end,
'chr1', 25, 50,
'chr1', 100, 125
)
y <- tibble::tribble(
~chrom, ~start, ~end,
'chr1', 30, 75
)
bed_glyph(bed_intersect(x, y))
```
And this glyph illustrates `bed_merge()`:
```{r merge_glyph}
x <- tibble::tribble(
~chrom, ~start, ~end,
'chr1', 1, 50,
'chr1', 10, 75,
'chr1', 100, 120
)
bed_glyph(bed_merge(x))
```
### Reproducible reports
`valr` can be used in RMarkdown documents to generate reproducible work-flows for data processing. Because `valr` is reasonably fast (see the [benchmarks](#benchmarks)), we now use it in lieu of other tools for exploratory analysis of genomic data sets in R.
Command-line tools like `BEDtools` and `bedops` can be used in reproducible workflows (e.g., with [`snakemake`][12]), but it is cumbersome to move from command-line tools to exploratory analysis and plotting software. [`pybedtools`][4] can be used within `ipython notebooks` to accomplish a similar goal, but others have pointed out [issues with this approach][11], including clunky version control. Because RMarkdown files are text files, they are readily kept under version control. Moreover, new features in RStudio (e.g. notebook viewing) enable similar functionality to `ipython`.
### Grouping data
The `group_by` function in dplyr can be used to perform fuctions on subsets of single and multiple `data_frame`s. Functions in `valr` leverage grouping to enable a variety of comparisons. For example, intervals can be grouped by `strand` to perform comparisons among intervals on the same strand.
```{r strand}
x <- tibble::tribble(
~chrom, ~start, ~end, ~strand,
'chr1', 1, 100, '+',
'chr1', 50, 150, '+',
'chr2', 100, 200, '-'
)
y <- tibble::tribble(
~chrom, ~start, ~end, ~strand,
'chr1', 50, 125, '+',
'chr1', 50, 150, '-',
'chr2', 50, 150, '+'
)
# intersect tbls by strand
x <- group_by(x, strand)
y <- group_by(y, strand)
bed_intersect(x, y)
```
Comparisons between intervals on opposite strands are done using the `flip_strands()` function:
```{r strand_opp}
x <- group_by(x, strand)
y <- flip_strands(y)
y <- group_by(y, strand)
bed_intersect(x, y)
```
Both single set (e.g. `bed_merge()`) and multi set operations will respect groupings in the input intervals.
### Column specification
Columns in `BEDtools` are referred to by position:
```bash
# calculate the mean of column 6 for intervals in `b` that overlap with `a`
bedtools map -a a.bed -b b.bed -c 6 -o mean
```
In `valr`, columns are referred to by name and can be used in multiple name/value expressions for summaries.
```{r NSE, eval = FALSE}
# calculate the mean and variance for a `value` column
bed_map(a, b, .mean = mean(value), .var = var(value))
# report concatenated and max values for merged intervals
bed_merge(a, .concat = concat(value), .max = max(value))
```
## Getting started
### Meta-analysis
This demonstration illustrates how to use `valr` tools to perform a "meta-analysis" of signals relative to genomic features. Here we to analyze the distribution of histone marks surrounding transcription start sites.
First we load libraries and relevant data.
```{r demo-tss, warning = FALSE, message = FALSE}
# `valr_example()` identifies the path of example files
bedfile <- valr_example('genes.hg19.chr22.bed.gz')
genomefile <- valr_example('hg19.chrom.sizes.gz')
bgfile <- valr_example('hela.h3k4.chip.bg.gz')
genes <- read_bed(bedfile, n_fields = 6)
genome <- read_genome(genomefile)
y <- read_bedgraph(bgfile)
```
Then we generate 1 bp intervals to represent transcription start sites (TSSs). We focus on `+` strand genes, but `-` genes are easily accomodated by filtering them and using `bed_makewindows()` with `reverse`d window numbers.
```{r tss}
# generate 1 bp TSS intervals, `+` strand only
tss <- genes %>%
filter(strand == '+') %>%
mutate(end = start + 1)
# 1000 bp up and downstream
region_size <- 1000
# 50 bp windows
win_size <- 50
# add slop to the TSS, break into windows and add a group
x <- tss %>%
bed_slop(genome, both = region_size) %>%
bed_makewindows(win_size)
x
```
Now we use the `.win_id` group with `bed_map()` to caluclate a sum by mapping `y` signals onto the intervals in `x`. These data are regrouped by `.win_id` and a summary with `mean` and `sd` values is calculated.
```{r map}
# map signals to TSS regions and calculate summary statistics.
res <- bed_map(x, y, win_sum = sum(value, na.rm = TRUE)) %>%
group_by(.win_id) %>%
summarize(win_mean = mean(win_sum, na.rm = TRUE),
win_sd = sd(win_sum, na.rm = TRUE))
res
```
Finally, these summary statistics are used to construct a plot that illustrates histone density surrounding TSSs.
```{r plot, warning = FALSE, message = FALSE, fig.align='center', fig.width=6}
library(ggplot2)
x_labels <- seq(-region_size, region_size, by = win_size * 5)
x_breaks <- seq(1, 41, by = 5)
sd_limits <- aes(ymax = win_mean + win_sd, ymin = win_mean - win_sd)
ggplot(res, aes(x = .win_id, y = win_mean)) +
geom_point() + geom_pointrange(sd_limits) +
scale_x_continuous(labels = x_labels, breaks = x_breaks) +
xlab('Position (bp from TSS)') + ylab('Signal') +
ggtitle('Human H3K4me3 signal near transcription start sites') +
theme_classic()
```
## API
Function names are similar to their their [BEDtools][1] counterparts, with some additions.
### Data types
* Create interval sets with `tbl_interval()` and `tbl_genome()`, which enforce strict column naming.
### Reading data
* Read BED and related files with `read_bed()`, `read_bed12()`, `read_bedgraph()`, `read_narrowpeak()` and `read_broadpeak()`.
* Read genome files containing chromosome name and size information with `read_genome()`.
* Load VCF files with `read_vcf()`.
* Access remote databases with `db_ucsc()` and `db_ensembl()`.
### Transforming single interval sets
* Adjust interval coordinates with `bed_slop()` and `bed_shift()`, and create new flanking intervals with `bed_flank()`.
* Combine nearby intervals with `bed_merge()` and identify nearby intervals with `bed_cluster()`.
* Generate intervals not covered by a query with `bed_complement()`.
* Order intervals with `dplyr::arrange()`.
### Comparing multiple interval sets
* Find overlaps between sets of intervals with `bed_intersect()`.
* Apply functions to overlapping sets of intervals with `bed_map()`.
* Remove intervals based on overlaps with `bed_subtract()`.
* Find overlapping intervals within a window with `bed_window()`.
* Find closest intervals independent of overlaps with `bed_closest()`.
### Randomizing intervals
* Generate random intervals with `bed_random()`.
* Shuffle the coordinates of intervals with `bed_shuffle()`.
* Sample input intervals with `dplyr::sample_n()` and `dplyr::sample_frac()`.
### Interval statistics
* Calculate significance of overlaps between sets of intervals with `bed_fisher()` and `bed_projection()`.
* Quantify relative and absolute distances between sets of intervals with `bed_reldist()` and `bed_absdist()`.
* Quantify extent of overlap between sets of intervals with `bed_jaccard()`.
### Utilities
* Create features from BED12 files with `create_introns()`, `create_tss()`, `create_utrs5()`, and `create_utrs3()`.
* Visualize the actions of valr functions with `bed_glyph()`.
* Constrain intervals to a genome reference with `bound_intervals()`.
* Subdivide intervals with `bed_makewindows()`.
* Convert BED12 to BED6 format with `bed12_to_exons()`.
* Calculate spacing between intervals with `interval_spacing()`.
## Related work
* Command-line tools [BEDtools][1] and [bedops][5].
* The Python library [pybedtools][4] wraps BEDtools.
* The R packages [GenomicRanges][6], [bedr][7], [IRanges][8] and [GenometriCorr][9] provide similar capability with a different philosophy.
[1]: http://bedtools.readthedocs.org/en/latest/
[2]: https://github.com/hadley/dplyr
[3]: http://www.rcpp.org/
[4]: https://pythonhosted.org/pybedtools/
[5]: http://bedops.readthedocs.org/en/latest/index.html
[6]: https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html
[7]: https://CRAN.R-project.org/package=bedr
[8]: https://bioconductor.org/packages/release/bioc/html/IRanges.html
[9]: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002529
[10]: http://rmarkdown.rstudio.com/
[11]: https://www.r-bloggers.com/why-i-dont-like-jupyter-fka-ipython-notebook/
[12]: https://bitbucket.org/snakemake/snakemake/wiki/Home
[13]: http://shiny.rstudio.com/