CGES QC

Quality summary pipeline for the CGES consensus caller. The program takes four VCF files as input and produces a set of QC plots as output. The four inputs are the call sets generated by Atlas, GATK Unified Genotyper, and Freebayes, plus the consensus set generated by CGES. The tool is written for easy integration with Galaxy.

Plots produced:

  • Mendel inconsistencies
  • Sample Missingness
  • Sample F-statistics (summarizing heterozygosity)
  • Minor allele frequency spectra
  • Transition-transversion mutation ratio
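
As an illustration of the last metric, the transition-transversion (Ts/Tv) ratio can be sketched in a few lines of Python. This is not the pipeline's own implementation (the pipeline delegates to external tools); the function names are hypothetical, and the input is assumed to be biallelic SNPs given as (ref, alt) base pairs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if the substitution stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def tstv_ratio(snps):
    """Ts/Tv ratio for an iterable of (ref, alt) single-base pairs.

    Assumes every pair is a biallelic SNP over A, C, G, T; anything that
    is not a transition is counted as a transversion.
    """
    ts = tv = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    # Undefined when there are no transversions.
    return float(ts) / tv if tv else float("nan")
```

For example, `tstv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion.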

Extensions:

A few plots reported in the manuscript are not yet produced by this tool:

  • Site missingness spectra
  • Site/Sample concordance plots for consensus
  • Rediscovery plots (the difficulty here is packaging the reference sets)

Options:

-h, --help            show this help message and exit
--cges-vcf=CGES       File path for CGES VCF for which to generate QC
                      metrics.
--atlas-vcf=ATLAS     File path for ATLAS VCF for which to generate QC
                      metrics.
--gatk-vcf=GATK       File path for GATK VCF for which to generate QC
                      metrics.
--freebayes-vcf=FREEBAYES
                      File path for Freebayes VCF for which to generate QC
                      metrics.
--ped-file=PEDFILE    Pedigree file for samples (Optional).
--tstv-out=TSTVOUT    Output file location for TsTv plots PDF.
--het-out=HETOUT      Output file location for heterozygosity plots PDF.
--maf-out=MAFOUT      Output file location for minor allele frequency plots
                      PDF.
--miss-out=MISSOUT    Output file location for missingness plots PDF.
--rediscover-out=REDISCOVEROUT
                      Output file location for rediscovery rate plots PDF.
--mendel-out=MENDELOUT
                      Output file location for Mendel inconsistency plots
                      PDF.
--temp-dir=TEMPDIR    Directory for writing intermediate analysis files.

Usage:

Note: This tool wraps several other programs: PLINK, vcftools, and R code files run with Rscript. To run it, all of these programs must be on your search path.
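
One way to confirm this before starting a run is a small check like the sketch below. The executable names `plink`, `vcftools`, and `Rscript` are assumptions about how the programs are installed, and `missing_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("plink", "vcftools", "Rscript")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the search path."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

`shutil.which` requires Python 3.3 or later.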

The pipeline is a simple Python executable: input VCF files and output PDFs of plots are specified explicitly on the command line. A temporary directory is also required; intermediate files produced by PLINK and vcftools are written there.

A full summary can be generated with the command:

python python/qc.pipeline.py \
  --atlas-vcf "test/atlas.test.vcf" \
  --gatk-vcf "test/gatk.test.vcf" \
  --freebayes-vcf "test/freebayes.test.vcf" \
  --cges-vcf "test/cges.test.vcf" \
  --ped-file "test/test.pedigree.txt" \
  --tstv-out "test/tstv.dat" \
  --het-out "test/het.dat" \
  --maf-out "test/maf.dat" \
  --miss-out "test/miss.dat" \
  --mendel-out "test/mendel.dat" \
  --temp-dir "test/"
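
When the pipeline is driven from another script, the same command can be assembled programmatically. The sketch below is one possible wrapper; `build_qc_command` is a hypothetical helper, not part of this repository, and it only uses the options listed above.

```python
def build_qc_command(vcfs, outputs, temp_dir, ped_file=None,
                     script="python/qc.pipeline.py"):
    """Assemble the argument list for the QC pipeline.

    vcfs     maps a caller name ("atlas", "gatk", "freebayes", "cges")
             to the path of its VCF file
    outputs  maps a plot name ("tstv", "het", "maf", "miss", "mendel")
             to the path of its output file
    """
    cmd = ["python", script]
    for caller, path in vcfs.items():
        cmd += ["--%s-vcf" % caller, path]
    if ped_file is not None:  # the pedigree file is optional
        cmd += ["--ped-file", ped_file]
    for plot, path in outputs.items():
        cmd += ["--%s-out" % plot, path]
    cmd += ["--temp-dir", temp_dir]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.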