Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny

yuchaojiang/MARATHON

MARATHON

Copy number variation is an important and abundant source of variation in the human genome and has been associated with a number of diseases, especially cancer. Massively parallel next-generation sequencing allows copy number profiling at fine resolution. Such efforts, however, have met with mixed success, with setbacks arising partly from the lack of reliable analytical methods to meet the diverse and unique challenges posed by the myriad experimental designs and study goals in genetic studies. In cancer genomics, detection of somatic copy number changes and profiling of allele-specific copy number (ASCN) are complicated by experimental biases and artifacts as well as by normal cell contamination and cancer subclone admixture. Furthermore, careful statistical modeling is warranted to reconstruct tumor phylogeny from both somatic ASCN changes and single nucleotide variants. Here we describe a flexible computational pipeline, MARATHON (copy nuMber vARiAtion and Tumor pHylOgeNy), which integrates multiple related statistical software packages for copy number profiling and downstream analyses in disease genetic studies.

Manuscript

Urrutia E, Chen H, Zhou Z, Zhang NR, Jiang Y. Integrative pipeline for profiling DNA copy number and inferring tumor phylogeny. Bioinformatics, 34 (12), 2126-2128, 2018. (link)

Questions & Problems

If you have any questions or problems when using MARATHON, you can: (i) open a new issue here; (ii) post in our Google user group https://groups.google.com/d/forum/marathon_genomics or email us at marathon_genomics@googlegroups.com; (iii) email the maintainers of the corresponding packages -- the contact information is shown under Developers & Maintainers. The first two contact options are preferred and we will try our best to reply as soon as possible.

Installation

Installation Option 1: Docker Image - Good for ease of installation

A Docker image is available here. The image provides an RStudio GUI built on rocker/tidyverse, with MARATHON and all of its dependent packages and datasets pre-installed. Note that pulling the image can take a while, since it includes the human reference genome as well as the toy sequencing dataset. Instructions for using Docker can be found here.

docker pull lzeppelini/marathon

Installation Option 2: Install to R/RStudio - Good for performance

Install all packages in the latest version of R.

install.packages(c("falcon", "falconx", "devtools"))
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("WES.1KG.WUGSC", "GenomeInfoDbData", "GenomeInfoDb", "VariantAnnotation"))
devtools::install_github(c("yuchaojiang/CODEX/package", "yuchaojiang/CODEX2/package", "yuchaojiang/Canopy/package", "zhouzilu/iCNV", "yuchaojiang/MARATHON/package"))
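After running the commands above, it can be useful to confirm that every component actually installed. The following is a minimal sketch, not part of MARATHON itself: `check_installed` is a hypothetical helper, and the installed package name `MARATHON` is an assumption based on the repository name.

```r
# Packages named in the installation commands above.
# ASSUMPTION: the package installed from yuchaojiang/MARATHON/package is named "MARATHON".
marathon_pkgs <- c("falcon", "falconx", "CODEX", "CODEX2", "Canopy", "iCNV", "MARATHON")

# Hypothetical helper: returns a named logical vector,
# TRUE where the package's namespace can be loaded.
check_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

status <- check_installed(marathon_pkgs)
missing_pkgs <- names(status)[!status]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All MARATHON components are installed.")
}
```

Any package reported as missing can be reinstalled individually with the corresponding `install.packages`, `BiocManager::install`, or `devtools::install_github` call.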

Pipeline overview

The possible analysis scenarios are listed in Table 1. Figure 1 outlines the relationships among the software packages: CODEX and CODEX2 perform read depth normalization for total copy number profiling; iCNV takes the read depth normalized by CODEX/CODEX2 and combines it with allele-specific read counts and microarray data to detect CNVs; FALCON and FALCON-X perform ASCN analysis; and Canopy takes input from FALCON/FALCON-X to perform tumor phylogeny reconstruction.

Figure 1. A flowchart outlining the procedures for profiling CNV, ASCN, and reconstructing tumor phylogeny. CNVs with common and rare population frequencies can be profiled by CODEX and CODEX2, with and without negative control samples. iCNV integrates sequencing and microarray data for CNV detection. ASCNs can be profiled by FALCON and FALCON-X using allelic read counts at germline heterozygous loci. Canopy infers tumor phylogeny using somatic SNVs and ASCNs.

Table 1. Analysis scenarios and pipeline design. The last column shows the sequence of software that should be used for each analysis scenario. * By “normal” we mean samples that are not derived from tumor tissue, which are not expected to carry chromosome-level copy number changes.
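The data flow described in the overview and in Figure 1 can be written down as a small lookup structure. This is only an illustrative sketch of which tool feeds which; `downstream` and `tool_chain` are hypothetical names, not functions from any of the packages, and the sketch does not reproduce the scenarios in Table 1.

```r
# Which tool consumes each tool's output, per the overview text.
# Illustration only; this is not an API of MARATHON or its components.
downstream <- list(
  CODEX      = "iCNV",
  CODEX2     = "iCNV",
  FALCON     = "Canopy",
  "FALCON-X" = "Canopy"
)

# Hypothetical helper: follow the links from a starting tool
# until reaching a tool with no listed consumer.
tool_chain <- function(start) {
  chain <- start
  while (!is.null(downstream[[chain[length(chain)]]])) {
    chain <- c(chain, downstream[[chain[length(chain)]]])
  }
  chain
}

print(paste(tool_chain("FALCON-X"), collapse = " -> "))
```

The downstream step is optional in each case: for example, CODEX or CODEX2 can be used on their own for total copy number profiling without passing results to iCNV.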

Running MARATHON

An R notebook with a step-by-step demonstration and rich display is available here. The corresponding Rmd script is available here.

Citation

Please cite MARATHON as well as all the dependent packages that you use.

Developers & Maintainers

  • Gene Urrutia (gene dot urrutia at gmail dot com)
    Innovation, Hill-Rom Corp.

  • Yuchao Jiang (yuchaoj at email dot unc dot edu)
    Department of Biostatistics & Department of Genetics, UNC-Chapel Hill

  • Hao Chen (hxchen at ucdavis dot edu)
    Department of Statistics, UC Davis

  • Zilu Zhou (zhouzilu at pennmedicine dot upenn dot edu)
    Genomics and Computational Biology Graduate Group, UPenn

  • Nancy R. Zhang (nzh at wharton dot upenn dot edu)
    Department of Statistics, UPenn
