GitHub - szpiech/garlic: Genomic Autozygosity Regions Likelihood Inference and Classification

szpiech / garlic Public
Notifications You must be signed in to change notification settings
Fork 2
Star 6
Genomic Autozygosity Regions Likelihood Inference and Classification
Notifications
Name		Name	Last commit message	Last commit date
Latest commit History 281 Commits
bin		bin
example		example
include		include
lib		lib
manual		manual
src		src
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README		README
Repository files navigation

GARLIC is a program for calling runs of homozygosity in genotype data.  It implements the ROH calling method of Pemberton et al. AJHG (2012) and Blant et al. (2017).

Citations:

A Blant, et al. (2017) BMC genomics 18 (1), 928.
ZA Szpiech, et al. (2017) Bioinformatics 33 (13), 2059-2062.
TJ Pemberton, et al. (2012) AJHG, 91: 275–292

**NOTE TO WINDOWS USERS: In order to successfully run GARLIC you must have ann_figtree_version.dll and figtree.dll in the same folder as garlic.exe 

This ROH calling pipeline has four phases.

I.		Window-based LOD score calculation
II.		Kernel Density Estimation of LOD score distribution
III.		Assembly of ROH
IV.		Gaussian Mixture Modeling of ROH size distribution

Input files:

tped/tfam (required): 
	http://zzz.bwh.harvard.edu/plink/data.shtml#tr

map (required when --weighted or --cm are set):
	<chr> <snpid> <genetic pos> <physical pos>
	http://zzz.bwh.harvard.edu/plink/data.shtml#map 
	(negative positions will cause errors)

tgls (optional):
	<chr> <snpid> <unused> <physical pos> <GL ind1> <GL ind2> ... <GL indN>

	Per-genotype likelihoods should be given as either a phred-scaled (--gl-type PL) or a log10-scaled (--gl-type GL) representing the probability that the genotype is correct or a phred-scaled likelihood that the genotype is wrong (--gl-type GQ).  For example, if p = 0.999 is the probability that the genotype is correct, then PL = -10*log10(p) = 0.00434511774018, GL = log10(p) = -0.000434511774018, and GQ = -10*log10(1-p) = 30.

freq (optional, output during computation, can be generated without running whole pipeling with --freq-only): 
	CHR SNP POS ALLELE FREQ
	<chr> <locus ID> <allele> <freq>
	
centromere file (required if no --build is set):
	<chr> <centromere start> <centromere end>

GARLIC depends on the following libraries, which are included in this repository.

GNU GSL: http://www.gnu.org/software/gsl/ 
FIGTree: http://www.umiacs.umd.edu/~morariu/figtree/
zlib: http://zlib.net/

---CHANGES---
05JUN2020 - v1.1.6a Fixes a bug when using --auto-overlap-frac that can cause segmentation faults when not using chromosomes/contigs with numbers. ***I'm having trouble getting mscOS versions to compile, so the current binary for mscOS is still v1.6.0 and still has this bug.***

24AUG2017 - v1.1.6 Modified behavior of --cm flag based on user feedback. When this flag is set ROH lengths and length classifications will be reported in cM but ROH boundaries will be reported in bps.

18JUL2017 - v1.1.5 Introduced --ld-subsample argument. If set > 0, will use only a subset of individuals for computing LD statistics when calculating wLOD scores. 

18MAY2017 - v1.1.4 If a freq file is provided, GARLIC now ensures that internal coding of alleles is consistent with the allele specified in the file. Improved handling of loci with 100% missing data. Improved reading of gzipped freq files. Improved lod score cutoff discovery. Removed default setting for --winsize, which must now be specified. Now allows --auto-winsize to be used with --winsize-multi, will choose best from the list.

12MAY2017 - v1.1.3 adds a new option for --gl-type called GQ, which is a phred-scaled probability that the genotype is wrong. If p = 0.999 is the probability that the genotype is correct, then GQ = -10*log10(1-p) = 30.  The default for --gl-type is now set to none, so if you provide a TGLS file you will be required to choose between GQ, PL, and GL.

08MAY2017 - v1.1.2b fixes a bug that causes a crash when --overlap-frac is set to 0. Setting to zero now chooses the smallest sensible overlap fraction (1/winsize).  If you wish to let garlic automatically guess overlap fraction, use the new command line option: --auto-overlap-frac. Also includes an interation counter to monitor progress of GMM portion of the code.

07MAY2017 - v1.1.2 fixes a bug that causes a crash when a tgls file indicates a genotype has 0 likelihood. Add progress bars for LD calculations and LOD score calculations. Improve efficiency of tgls file loading.

06MAY2017 - v1.1.1a fixes a bug introduced with 1.1.1 that caused crashed when --resample is used.

05MAY2017 - Update to version 1.1.1. For some large datasets (e.g. WGS), KDE can take extremely long to complete, even when only a subset of individuals are considered.  Here we introduce some changes to speed this calculation up.  Instead of passing all LOD score windows to the KDE function, we now pass only non-overlapping windows.  This reduces the number of points sent to the function by a factor of WINSIZE. If you wish to retain all LOD scores, you may set the --no-kde-thinning flag. We also change the default number of random individuals for KDE (--kde-subsample) to 20, which still may be set to 0 to use all individuals. 

We also introduce a lower bound on the size of ROH reported.  Now, an ROH will not be called unless it contains at least WINSIZE*OVERLAP_FRAC number of SNPs. For example, for a WINSIZE (--winsize) of 100 and an OVERLAP_FRAC (--overlap-frac) of 0.25 the smallest possible ROH that will be reported will be 25 SNPs long.

Finally, if --overlap-frac is set to 0, GARLIC will attempt to guess a good choice.

02MAY2017 - Update to version 1.1.0a. Bug fixed that caused crashes when using TGLS file. 

28APR2017 - Update to version 1.1.0.  Includes several updates, including a method for reweighting LOD scores by gaps between SNPs, the option to report ROH lengths in cM, and the ability to provide per-genotype error rates.

25OCT2016 - Update to version 1.0.1.  This update fixes a bug where ROH that extended upto the end of the chromosome failed to get assembled and reported.
This update also introduces a new command-line flag --overlap-frac which is designed to reduce false positive calls by requiring a SNP be covered by at 
least OVERLAP_FRAC (default 0.25) proportion of high scoring windows to be included in an ROH call.  This helps to reduce false positive calls near the 
boundaries of true ROH.

garlic v1.1.4 -- a program to call runs of homozygosity in genetic data.
Source code and binaries can be found at <https://www.github.com/szpiech/garlic>.

Citations:

ZA Szpiech, et al. (2017) Bioinformatics, doi: 10.1093/bioinformatics/btx102.
TJ Pemberton, et al. (2012) AJHG, 91: 275–292.

----------Command Line Arguments----------

--M <int>: The expected number of meioses since a recent common ancestor for --weighted calcualtion.
	Default: 7

--auto-overlap-frac <bool>: If set, GARLIC attempts to guess based on marker density.
	Default: false

--auto-winsize <bool>: If --weighted is set, guesses the best window size based on SNP density, otherwise
	initiates an ad hoc method for automatically selecting the # of SNPs in which to
	calculate LOD scores. Starts at the value specified by --winsize and increases
	by <step size> SNPs until finished.
	Default: false

--auto-winsize-step <int>: Step size for automatic window selection algorithm.
	Default: 10

--build <string>: Choose which genome build to use for centromere locations (hg18, hg19, or hg38).
	Default: none

--centromere <string>: Provide custom centromere boundaries. Format <chr> <start> <end>.
	Default: none

--cm <bool>: Construct ROH in genetic distance units. This requires a mapfile.
	Default: false

--error <double>: The assumed genotyping error rate.
	Default: -1.000000e+00

--freq-file <string>: A file specifying allele frequencies for
	each population for all variants. File format:
	CHR SNP POS ALLELE FREQ
	<chr> <locus ID> <allele> <freq>
	By default, this is calculated automatically
	from the provided data.
	Default: none

--freq-only <bool>: If set, calculates a freq file from provided data and then exits. Uses minimal RAM.
	Default: false

--gl-type <string>: Specify the form of the genotype likelihood data: GQ, GL, or PL.
	GQ is a phred-scaled likelihood of the genotype being incorrect.
	PL is a phred-scaled likelihood of the genotype being correct.
	GL is a log10-scaled likelihood of the genotype being correct.
	Default: none

--help <bool>: Prints this help dialog.
	Default: false

--kde-subsample <int>: The number of individuals to randomly sample for LOD score KDE. If there
	are fewer individuals in the population all are used.
Set <= 0 to use all individuals (may use large amounts of RAM).
	Default: 20

--lod-cutoff <double>: For LOD based ROH calling, specify a single LOD score cutoff
	above which ROH are called in all populations.  By default, this is chosen
	automatically with KDE.
	Default: -9.999990e+05

--map <string>: Provide a scaffold genetic map, sites that aren't present within this file are interpolated.
	Sites outside the bounds are filtered. This is required for wLOD calcualtions
	and any runs for which you wish to report ROH in units of cM.
	Default: none

--max-gap <int>: A LOD score window is not calculated if the gap (in bps)
	between two loci is greater than this value.
	Default: 200000

--mu <double>: Mutation rate per bp per generation for --weighted calculation.
	Default: 1.000000e-09

--nclust <int>: Set number of clusters for GMM classification of ROH lengths.
	Default: 3

--no-kde-thinning <bool>: Set this flag to send all LOD score data to KDE function. This may dramatically
	increase runtime.
	Default: false

--out <string>: The base name for all output files.
	Default: outfile

--overlap-frac <double>: The minimum fraction of overlapping windows above the LOD cutoff required
	to begin constructing a run. ROH will have a lower bound size threshold of WINSIZE*OVERLAP_FRAC.
	If set to 0, GARLIC sets the value to the lowest sensible value: 1/winsize.
	Default: 2.500000e-01

--phased <bool>: Set if data are phased and you want to calculate r2 instead of hr2 while --weighted is set.
	Uses extra RAM. Has no effect on computations without --weighted.
	Default: false

--raw-lod <bool>: If set, LOD scores will be output to gzip compressed files.
	Default: false

--resample <int>: Number of resamples for estimating allele frequencies.
	When set to 0 (default), garlic will use allele
	frequencies as calculated from the data.
	Default: 0

--size-bounds <double1> ... <doubleN>: Specify the size class boundaries
	ROH boundaries.  By default, this is chosen automatically
	with a 3-component GMM.  Must provide numbers in increasing order.
	Default: -1.000000

--tfam <string>: A tfam formatted file containing population and individual IDs.
	Default: none

--tgls <string>: A tgls file containing per-genotype likelihoods.
	Default: none

--threads <int>: The number of threads to spawn during weighted calculations.
	Default: 1

--tped <string>: A tped formatted file containing map and genotype information.
	Default: none

--tped-missing <char>: Single character missing data code for TPED files.
	Default: 0

--weighted <bool>: Compute LOD scores weighted by LD and probability of mutation.
	Default: false

--winsize <int>: The window size in # of SNPs in which to calculate LOD scores.
	Default: 10

--winsize-multi <int1> ... <intN>: Provide several window sizes (in # of SNPs) to calculate LOD scores.
	LOD score KDEs for each window size will be output for inspection.
	Default: -1