Personal Cancer Genome Reporter (PCGR)

Personal Cancer Genome Reporter (PCGR) - variant interpretation report for precision oncology

Overview

The Personal Cancer Genome Reporter (PCGR) is a stand-alone software package for functional annotation and translation of individual cancer genomes for precision oncology. It interprets both somatic SNVs/InDels and copy number aberrations. The software extends basic gene and variant annotations from Ensembl's Variant Effect Predictor (VEP) with oncology-relevant, up-to-date annotations retrieved flexibly through vcfanno, and produces interactive HTML reports intended for clinical interpretation.

PCGR overview

News

  • Nov 12th 2018:
    • Dev version is available for download (installation instructions below)
    • Reporting germline variants for cancer predisposition? Check out github.com/sigven/cpsr
  • May 14th 2018: 0.6.2.1 release
  • May 9th 2018: 0.6.2 release
    • Fixed various bugs reported by users (see CHANGELOG)
    • Data bundle update (ClinVar, KEGG, CIViC, UniProt, DiseaseOntology)
  • May 2nd 2018: 0.6.1 release
    • Fixed bugs in tier assignment
  • April 25th 2018: 0.6.0 release
    • Updated data sources
    • Enabling specification of tumor type of input sample
    • New tier system for classification of variants (ACMG-like)
    • VCF validation can be turned off
    • Tumor DP/AF presets
    • JSON dump of report content
    • GRCh38 support
    • Runs under Python3
  • November 29th 2017: 0.5.3 release
    • Fixed bug with propagation of default options
  • November 23rd 2017: 0.5.2 release
  • November 15th 2017: 0.5.1 pre-release
    • Bug fixing (VCF validation)
  • November 14th 2017: 0.5.0 pre-release
    • Updated version of VEP (v90)
    • Updated versions of ClinVar, UniProt KB, CIViC, CBMDB
    • Removal of ExAC (replaced by gnomAD), removal of COSMIC due to licensing restrictions
    • Users can analyze samples run without matching control (i.e. tumor-only)
    • PCGR pipeline is now configured through a TOML-based configuration file
    • Bug fixes / general speed improvements
    • Work in progress: Export of report data through JSON

Example reports

PCGR documentation

IMPORTANT: If you use PCGR, please cite the publication:

Sigve Nakken, Ghislain Fournous, Daniel Vodák, Lars Birger Aasheim, Ola Myklebost, and Eivind Hovig. Personal Cancer Genome Reporter: variant interpretation report for precision oncology (2017). Bioinformatics. 34(10):1778–1780. doi:10.1093/bioinformatics/btx817

Annotation resources included in PCGR

  • VEP v94 - Variant Effect Predictor (GENCODE v28/v19 as the gene reference dataset)
  • CIViC - Clinical interpretations of variants in cancer (November 12th 2018)
  • ClinVar - Database of variants with clinical significance (November 2018)
  • DoCM - Database of curated mutations (v3.2, April 2016)
  • CBMDB - Cancer Biomarkers database (January 17th 2018)
  • IntOGen catalog of driver mutations - (May 2016)
  • DisGeNET - Database of gene-tumor type associations (May 2017)
  • Cancer Hotspots - Resource for statistically significant mutations in cancer (v2 - 2017)
  • dbNSFP v3.5 - Database of non-synonymous functional predictions (August 2017)
  • TCGA release 13 - somatic mutations discovered across 33 tumor type cohorts (The Cancer Genome Atlas)
  • UniProt/SwissProt KnowledgeBase 2018_09 - Resource on protein sequence and functional information (October 2018)
  • Pfam v32 - Database of protein families and domains (September 2018)
  • DGIdb - Database of targeted cancer drugs (v3.0.2, January 2018)
  • ChEMBL - Manually curated database of bioactive molecules (v24.1, June 2018)
  • CancerMine v6 - Literature-derived database of tumor suppressor genes/proto-oncogenes (November 2018)

Getting started

STEP 0: Python

An installation of Python (version 3.6) is required to run PCGR. Check that Python is installed by typing python --version in your terminal window. In addition, a Python library for parsing configuration files encoded with TOML is needed. To install, simply run the following command:

pip install toml
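
Both prerequisites can also be checked from within Python. The sketch below is illustrative only: the helper name check_prerequisites is not part of PCGR, and it assumes that version 3.6 is a minimum rather than an exact requirement.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 6)):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_version:
        problems.append(
            "Python %d.%d or newer is required, found %d.%d"
            % (tuple(min_version) + tuple(sys.version_info[:2]))
        )
    # find_spec returns None when the module cannot be located
    if importlib.util.find_spec("toml") is None:
        problems.append("The 'toml' package is missing (pip install toml)")
    return problems


for problem in check_prerequisites():
    print(problem)
```

No output means that both checks passed.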

STEP 1: Installation of Docker

  1. Install the Docker engine on your preferred platform
    • installing Docker on Linux
    • installing Docker on Mac OS
    • NOTE: We have not yet been able to perform enough testing on the Windows platform, and we have received feedback that particular versions of Docker/Windows do not work with PCGR (an example being mounting of data volumes)
  2. Test that Docker is running, e.g. by typing docker ps or docker images in the terminal window
  3. Adjust the computing resources dedicated to Docker
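
The Docker test in item 2 can be scripted, for instance as a pre-flight check before a batch of runs. This is a minimal sketch: the function name docker_is_running and the 20-second timeout are assumptions made for illustration. It relies only on the docker ps command mentioned above.

```python
import shutil
import subprocess


def docker_is_running(timeout=20):
    """Return True if the docker client is on PATH and `docker ps` succeeds."""
    if shutil.which("docker") is None:
        return False  # client not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A non-zero exit code typically means the daemon is not reachable
    return result.returncode == 0


print("Docker reachable:", docker_is_running())
```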

STEP 2: Download PCGR and data bundle

Development version

a. Clone the PCGR GitHub repository (includes run script and configuration file): git clone https://github.com/sigven/pcgr.git

b. Download and unpack the latest data bundles in the PCGR directory

c. Pull the PCGR Docker image (dev) from DockerHub (approx. 5.1 GB):

  • docker pull sigven/pcgr:dev (PCGR annotation engine)

Latest release

a. Download and unpack the latest software release (0.6.2.1)

b. Download and unpack the assembly-specific data bundle in the PCGR directory

A data/ folder within the pcgr-X.X software folder should now have been produced.

c. Pull the PCGR Docker image (0.6.2.1) from DockerHub (approx. 5.1 GB):

  • docker pull sigven/pcgr:0.6.2.1 (PCGR annotation engine)

STEP 3: Input preprocessing

The PCGR workflow accepts two types of input files:

  • An unannotated, single-sample VCF file (>= v4.2) with called somatic variants (SNVs/InDels)
  • A copy number segment file

PCGR can be run with either or both of the two input files present.

  • We strongly recommend that the input VCF is compressed and indexed using bgzip and tabix
  • If the input VCF contains multi-allelic sites, these will be subject to decomposition
  • Variants used for reporting should be designated as 'PASS' in the VCF FILTER column
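
To see how many records in a VCF would qualify for reporting, the FILTER column can be tallied before running PCGR. The sketch below uses only the Python standard library; the function names and the demo records are made up for illustration and are not part of PCGR.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def count_filter_values(lines):
    """Count the values found in the FILTER column (7th field) of VCF records."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            raise ValueError("Malformed VCF record: %r" % line)
        counts[fields[6]] += 1
    return counts


# Made-up records, for demonstration only
demo_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tT\t.\tPASS\t.\n",
    "1\t200\t.\tG\tC\t.\tLowQual\t.\n",
    "1\t300\t.\tG\tA\t.\tPASS\t.\n",
]
demo_counts = count_filter_values(demo_lines)
print(demo_counts["PASS"], "of", sum(demo_counts.values()), "records are PASS")
```

For a real file, pass the handle returned by open_vcf to count_filter_values, since an open text file is iterated line by line.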

The tab-separated values file with copy number aberrations MUST contain the following four columns:

  • Chromosome
  • Start
  • End
  • Segment_Mean

Here, Chromosome, Start, and End denote the chromosomal segment, and Segment_Mean denotes the log2 ratio for that segment, which is a common output of somatic copy number alteration callers. Note that coordinates must be one-based (i.e. chromosomes start at 1, not 0). The initial part of a copy number segment file that is formatted correctly according to PCGR's requirements looks as follows:

Chromosome	Start	End	Segment_Mean
1	3218329	3550598	0.0024
1	3552451	4593614	0.1995
1	4593663	6433129	-1.0277
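
The format requirements above can be checked with a few lines of Python before a segment file is handed to PCGR. This is a minimal sketch, not PCGR's own validation: the function name read_cna_segments is hypothetical, and the checks cover only the four required columns, tab separation, and one-based coordinates.

```python
import csv
import io

REQUIRED_COLUMNS = ["Chromosome", "Start", "End", "Segment_Mean"]


def read_cna_segments(handle):
    """Parse tab-separated copy number segments and check basic consistency."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError("Missing required column(s): " + ", ".join(missing))
    segments = []
    for row in reader:
        start, end = int(row["Start"]), int(row["End"])
        # Coordinates are one-based, so 0 is never a valid start position
        if start < 1 or end < start:
            raise ValueError(
                "Invalid segment %s:%d-%d" % (row["Chromosome"], start, end)
            )
        segments.append(
            (row["Chromosome"], start, end, float(row["Segment_Mean"]))
        )
    return segments


# The three example segments shown above, as tab-separated text
example = io.StringIO(
    "Chromosome\tStart\tEnd\tSegment_Mean\n"
    "1\t3218329\t3550598\t0.0024\n"
    "1\t3552451\t4593614\t0.1995\n"
    "1\t4593663\t6433129\t-1.0277\n"
)
segments = read_cna_segments(example)
print(len(segments), "segments parsed")
```

A file on disk can be checked the same way by passing an open file handle instead of the StringIO object.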

STEP 4: Configure PCGR

The PCGR configuration file, formatted using TOML (an easy-to-read file format), enables the user to configure a number of options in the PCGR workflow, related to the following:

  • Tumor type of input sample
  • Tier model
  • Sequencing depth/allelic support thresholds
  • MSI prediction
  • Mutational signatures analysis
  • Mutational burden analysis (e.g. target size)
  • VCF to MAF conversion
  • Tumor-only analysis options (i.e. exclusion of germline variants/enrichment for somatic calls)
  • VEP/vcfanno options
  • Log-ratio thresholds for gains/losses in CNA analysis

See the PCGR documentation for more details about the exact usage of the configuration options.

The PCGR software bundle comes with a default configuration file (pcgr.toml), to be used as a starting point for running the PCGR workflow.
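
To illustrate how a TOML option such as a log-ratio threshold might be read and applied, the sketch below parses a small configuration fragment and labels copy number segments as gains or losses. The section name, key names, and threshold values are hypothetical and chosen for illustration; the actual option names are defined in the pcgr.toml file shipped with PCGR. The classification rule is likewise an assumption, not PCGR's implementation.

```python
try:
    import tomllib as toml_parser  # standard library from Python 3.11
except ImportError:
    import toml as toml_parser  # the package installed in STEP 0

# Hypothetical configuration fragment: section and key names are
# illustrative only and must be checked against the real pcgr.toml.
EXAMPLE_CONF = """
[cna]
log_ratio_gain = 0.8
log_ratio_loss = -0.8
"""


def classify_segment(segment_mean, gain_threshold, loss_threshold):
    """Label a log2 ratio as 'gain', 'loss' or 'neutral' using two thresholds."""
    if segment_mean >= gain_threshold:
        return "gain"
    if segment_mean <= loss_threshold:
        return "loss"
    return "neutral"


conf = toml_parser.loads(EXAMPLE_CONF)
gain = conf["cna"]["log_ratio_gain"]
loss = conf["cna"]["log_ratio_loss"]

# Segment_Mean values taken from the example segment file in STEP 3
labels = [classify_segment(x, gain, loss) for x in (0.0024, 0.1995, -1.0277)]
print(labels)
```

With these illustrative thresholds, only the third segment (log2 ratio of -1.0277) is labelled as a loss.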

STEP 5: Run example

A tumor sample report is generated by calling the Python script pcgr.py, which takes the following arguments and options:

usage: pcgr.py [-h] [--input_vcf INPUT_VCF] [--input_cna INPUT_CNA]
		[--force_overwrite] [--version] [--basic]
		[--docker-uid DOCKER_USER_ID] [--no-docker]
		pcgr_dir output_dir {grch37,grch38} configuration_file
		sample_id

Personal Cancer Genome Reporter (PCGR) workflow for clinical interpretation of
somatic nucleotide variants and copy number aberration segments

positional arguments:
pcgr_dir              PCGR base directory with accompanying data directory,
			    e.g. ~/pcgr-0.6.2.1
output_dir            Output directory
{grch37,grch38}       Genome assembly build: grch37 or grch38
configuration_file    PCGR configuration file (TOML format)
sample_id             Tumor sample/cancer genome identifier - prefix for
			    output files

optional arguments:
-h, --help            show this help message and exit
--input_vcf INPUT_VCF
			    VCF input file with somatic query variants
			    (SNVs/InDels). (default: None)
--input_cna INPUT_CNA
			    Somatic copy number alteration segments (tab-separated
			    values) (default: None)
--force_overwrite     By default, the script will fail with an error if any
			    output file already exists. You can force the
			    overwrite of existing result files by using this flag
			    (default: False)
--version             show program's version number and exit
--basic               Run functional variant annotation on VCF through
			    VEP/vcfanno, omit other analyses (i.e. CNA, MSI,
			    report generation etc. (STEP 4) (default: False)
--docker-uid DOCKER_USER_ID
			    Docker user ID. Default is the host system user ID. If
			    you are experiencing permission errors, try setting
			    this up to root (`--docker-uid root`) (default: None)
--no-docker           Run the PCGR workflow in a non-Docker mode (see
			    install_no_docker/ folder for instructions (default:
			    False)
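
When PCGR is run for many samples, it can help to assemble the command line programmatically. The sketch below maps the positional arguments and a subset of the options listed above onto an argument list. The function name build_pcgr_command and the example paths are hypothetical; the command is only printed here, not executed.

```python
import shlex


def build_pcgr_command(pcgr_dir, output_dir, assembly, configuration_file,
                       sample_id, input_vcf=None, input_cna=None,
                       force_overwrite=False):
    """Return the pcgr.py invocation as a list of arguments."""
    if assembly not in ("grch37", "grch38"):
        raise ValueError("assembly must be 'grch37' or 'grch38'")
    if input_vcf is None and input_cna is None:
        raise ValueError("provide input_vcf, input_cna, or both")
    command = ["python", "pcgr.py"]
    if input_vcf is not None:
        command += ["--input_vcf", input_vcf]
    if input_cna is not None:
        command += ["--input_cna", input_cna]
    if force_overwrite:
        command.append("--force_overwrite")
    # Positional arguments, in the order given by the usage message
    command += [pcgr_dir, output_dir, assembly, configuration_file, sample_id]
    return command


# Hypothetical paths, for illustration only
command = build_pcgr_command(
    pcgr_dir="/opt/pcgr",
    output_dir="/data/out",
    assembly="grch37",
    configuration_file="/data/sample1.toml",
    sample_id="sample1",
    input_vcf="/data/sample1.vcf.gz",
)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run as is, which avoids shell quoting issues with unusual file names.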

The examples folder contains input files from two tumor samples sequenced within TCGA (GRCh37 only). It also contains PCGR configuration files customized for these cases. A report for a colorectal tumor case can be generated by running the following command in your terminal window:

python pcgr.py --input_vcf ~/pcgr-0.6.2.1/examples/tumor_sample.COAD.vcf.gz --input_cna ~/pcgr-0.6.2.1/examples/tumor_sample.COAD.cna.tsv ~/pcgr-0.6.2.1 ~/pcgr-0.6.2.1/examples grch37 ~/pcgr-0.6.2.1/examples/pcgr_conf.COAD.toml tumor_sample.COAD

This command will run the Docker-based PCGR workflow and produce the following output files in the examples folder:

  1. tumor_sample.COAD.pcgr_acmg.grch37.html - An interactive HTML report for clinical interpretation
  2. tumor_sample.COAD.pcgr_acmg.grch37.pass.vcf.gz - Bgzipped VCF file with rich set of annotations for precision oncology
  3. tumor_sample.COAD.pcgr_acmg.grch37.pass.tsv.gz - Compressed vcf2tsv-converted file with rich set of annotations for precision oncology
  4. tumor_sample.COAD.pcgr_acmg.grch37.snvs_indels.tiers.tsv - Tab-separated values file with variants organized according to tiers of functional relevance
  5. tumor_sample.COAD.pcgr_acmg.grch37.json.gz - Compressed JSON dump of HTML report content
  6. tumor_sample.COAD.pcgr_acmg.grch37.cna_segments.tsv.gz - Compressed tab-separated values file with annotations of gene transcripts that overlap with somatic copy number aberrations
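
The output names listed above follow a common pattern of sample identifier, the label pcgr_acmg, and genome assembly. The sketch below reconstructs the expected paths and loads the compressed JSON dump. The helper names are hypothetical, the naming pattern is inferred solely from the six file names above, and no assumption is made about the internal structure of the JSON content.

```python
import gzip
import json
import os

# Suffixes copied from the list of output files above
OUTPUT_SUFFIXES = [
    "html",
    "pass.vcf.gz",
    "pass.tsv.gz",
    "snvs_indels.tiers.tsv",
    "json.gz",
    "cna_segments.tsv.gz",
]


def expected_outputs(output_dir, sample_id, assembly, label="pcgr_acmg"):
    """Return the output paths implied by the naming pattern shown above."""
    prefix = "%s.%s.%s" % (sample_id, label, assembly)
    return [
        os.path.join(output_dir, "%s.%s" % (prefix, suffix))
        for suffix in OUTPUT_SUFFIXES
    ]


def load_report_json(path):
    """Read the gzip-compressed JSON dump of the report content."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


example_paths = expected_outputs("examples", "tumor_sample.COAD", "grch37")
example_names = [os.path.basename(path) for path in example_paths]
for path in example_paths:
    status = "found" if os.path.exists(path) else "missing"
    print(status, path)
```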

Contact

sigven@ifi.uio.no