Split from the consensus caller repo. This is a set of QC procedures used to validate consensus caller performance.

CGES QC

Quality summary pipeline for the consensus caller. The program takes four VCF files as input and produces a set of plots as output. The four inputs are the call sets generated by Atlas, GATK Unified Genotyper, and Freebayes, plus the consensus set generated by CGES. The tool is written for easy integration with Galaxy.

Plots produced:

  • Mendel inconsistencies
  • Sample Missingness
  • Sample F-statistics (summarizing heterozygosity)
  • Minor allele frequency spectra
  • Transition-transversion mutation ratio
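For readers unfamiliar with the last metric, the following is a minimal sketch of how a transition-transversion (Ts/Tv) ratio can be computed from reference/alternate allele pairs. It is not taken from this repository; the function names `is_transition` and `tstv_ratio` and the example data are hypothetical and serve only to illustrate the definition.

```python
# Illustrative only: classify single-nucleotide variants and compute Ts/Tv.
# None of these names come from the CGES QC code base.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True when ref -> alt stays within the purines or within the pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def tstv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs.

    Records that are not single-base A/C/G/T substitutions (for example
    indels) are skipped. Returns NaN when there are no transversions.
    """
    bases = PURINES | PYRIMIDINES
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in bases or alt not in bases or ref == alt:
            continue
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")


# Made-up example: three transitions, two transversions, one indel (skipped).
example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C"), ("AT", "A")]
print(tstv_ratio(example))  # 1.5
```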

Extensions:

A few plots reported in the manuscript are not yet produced:

  • Site missingness spectra
  • Site/Sample concordance plots for consensus
  • Rediscovery plots (the obstacle here is packaging the reference sets)

Options:

-h, --help            show this help message and exit
--cges-vcf=CGES       File path for CGES VCF for which to generate QC
                      metrics.
--atlas-vcf=ATLAS     File path for ATLAS VCF for which to generate QC
                      metrics.
--gatk-vcf=GATK       File path for GATK VCF for which to generate QC
                      metrics.
--freebayes-vcf=FREEBAYES
                      File path for Freebayes VCF for which to generate QC
                      metrics.
--ped-file=PEDFILE    Pedigree file for samples (Optional).
--tstv-out=TSTVOUT    Output file location for TsTv plots PDF.
--het-out=HETOUT      Output file location for heterozygosity plots PDF.
--maf-out=MAFOUT      Output file location for minor allele frequency plots
                      PDF.
--miss-out=MISSOUT    Output file location for missingness plots PDF.
--rediscover-out=REDISCOVEROUT
                      Output file location for rediscovery rate plots PDF.
--mendel-out=MENDELOUT
                      Output file location for Mendel inconsistency plots
                      PDF.
--temp-dir=TEMPDIR    Directory for writing intermediate analysis files.

Usage:

Note: This tool wraps several other programs: PLINK, vcftools, and R scripts run with Rscript. To run it, all of these programs must be on your search path.
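A quick way to confirm that the wrapped programs are reachable before launching the pipeline is sketched below. The executable names `plink`, `vcftools`, and `Rscript` are assumptions about how the tools are installed on your system, and `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("plink", "vcftools", "Rscript")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of `tools` that cannot be found on the search path."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found.")
```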

This is a simple Python executable in which the input VCF files and the output PDFs of plots are specified explicitly. A temporary directory is also required; intermediate files produced by PLINK and vcftools are written there.

A full summary can be generated with the command:

python python/qc.pipeline.py \
  --atlas-vcf "test/atlas.test.vcf" \
  --gatk-vcf "test/gatk.test.vcf" \
  --freebayes-vcf "test/freebayes.test.vcf" \
  --cges-vcf "test/cges.test.vcf" \
  --ped-file "test/test.pedigree.txt" \
  --tstv-out "test/tstv.dat" \
  --het-out "test/het.dat" \
  --maf-out "test/maf.dat" \
  --miss-out "test/miss.dat" \
  --mendel-out "test/mendel.dat" \
  --temp-dir "test/"
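The same invocation can also be scripted, for example when running the QC over several call sets. The sketch below only assembles the argument list from the options shown above; `build_command` and `run_pipeline` are hypothetical helpers, and the interpreter name `python` is assumed to match the one used in the command above.

```python
import subprocess


def build_command(options, script="python/qc.pipeline.py", interpreter="python"):
    """Turn a {flag: value} mapping into an argument list for the pipeline."""
    cmd = [interpreter, script]
    for flag, value in options.items():
        cmd.extend(["--" + flag, str(value)])
    return cmd


def run_pipeline(options):
    """Run the pipeline; needs the repository plus PLINK, vcftools, and Rscript."""
    subprocess.run(build_command(options), check=True)


# Same paths as in the command above.
options = {
    "atlas-vcf": "test/atlas.test.vcf",
    "gatk-vcf": "test/gatk.test.vcf",
    "freebayes-vcf": "test/freebayes.test.vcf",
    "cges-vcf": "test/cges.test.vcf",
    "ped-file": "test/test.pedigree.txt",
    "tstv-out": "test/tstv.dat",
    "het-out": "test/het.dat",
    "maf-out": "test/maf.dat",
    "miss-out": "test/miss.dat",
    "mendel-out": "test/mendel.dat",
    "temp-dir": "test/",
}

cmd = build_command(options)
print(" ".join(cmd))  # inspect the command before calling run_pipeline(options)
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.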