Sequential lm for methylation data
R
Latest commit 9f9bfe2 Aug 4, 2016 @raivokolde committed on GitHub Added citation to README
Permalink
Failed to load latest commit information.
R Fixed a bug Jul 15, 2016
data Added example data Oct 14, 2013
man Now importing S4Vectors Oct 28, 2015
DESCRIPTION Now importing S4Vectors Oct 28, 2015
NAMESPACE Now importing S4Vectors Oct 28, 2015
NEWS Added the necessary files to create a package Oct 10, 2013
README.md Added citation to README Aug 4, 2016

README.md

seqlm

An R package for identification of differentially methylated regions (DMRs) from high density chip, for example Illumina 450K, data.

Citation

R Kolde*, K Märtens*, K Lokk, S Laur, J Vilo 2016. “Seqlm: an MDL Based Method for Identifying Differentially Methylated Regions in High Density Methylation Array Data.” Bioinformatics (Oxford, England): btw304.

Installation

The most convenient way to install the package is by using the devtools package.

library(devtools)
install_github("raivokolde/seqlm")

To start using the package just load it as any other package.

library(seqlm)

Usage

For running the algorithm one has to have three objects:

  • a matrix with methylation values;
  • a vector specifying the classes of columns (only two-class case is supported currently) or a continuous variable;
  • location information about the methylation probes in GRanges format (a file for Illumina 450K platform can be downloaded from here).

All the work is done by one command seqlm that takes as input all the objects described above and also parameters:

  • max_block_length - that determines the maximal number of CpG sites in a region and is used to speed up the computations (default 50)
  • max_dist - that determines the maximal allowed genomic distance between two consecutive probes in a region (default 1000bp)

For more information use ?seqlm in R.

Optionally one can visualise the results using seqlmreport command that visualises the identified regions and creates a html page to show them. The main input for this function is the result of seqlm and the three objects described above. There are additional options to control the appearance of the plots, see ?seqlmreport for details.

Example

An example dataset tissue_small is included in the package. It contains data comparing adipose tissue and brainstem, from chromosomes 17 and 18. Finding regions in this data can be accomplished using commands:

data(tissue_small)
segments = seqlm(values = tissue_small$values, genome_information = tissue_small$genome_information, annotation =  tissue_small$annotation)

The result of the analysis is a GRanges object containing the locations of the regions and associated statistics.

The analysis can be time consuming, if the whole genome is analysed at once. If the computer has multicore capabilities it is easy to parallelize the calculations. We use the foreach framework by Revolution Computing for parallelization. To enable the parallelization one has to register the parallel backend before and this will be used by seqlm. Ideally, the next commands should take roughly half the time compared to the previous.

library(doParallel)
registerDoParallel(cores = 2)
segments = seqlm(values = tissue_small$values, genome_information = tissue_small$genome_information, annotation =  tissue_small$annotation)

To visualise the results it is possible to plot the most imortant sites and generate a HTML report

temp = tempdir()
seqlmreport(segments[1:10], tissue_small$values, tissue_small$genome_information, tissue_small$annotation, dir = temp)

Here is an example of the resulting file.

Method

The method is described in a poster, that was presented at the Epigenetics of Common Diseases conference. Briefly, the seqlm method works in three stages.

Stage 1: The genome is divided into smaller pieces based on a genomic distance cutoff.

Stage 2: In each piece probes are segmented into regions that have approximately constant difference between the groups of interest. Example of the segmentation and its process is shown in schema.

  • In sliding windows with variable sizes we fit a linear models to the data.
  • For each model we record the description length - the amount of bits needed to describe the data using the model
  • Using dynamic programming we find the segmentation that minimizes total description length

Stage 3: We assess the relevance of each segment, by using a mixed model where the classes are a fixed effect and a sample is a random effect. This model takes into account the repeated nature of the consecutive methylation measurements. The segments are ordered by their significance.

Example of seqlm segmentation