-
Notifications
You must be signed in to change notification settings - Fork 6
Association testing with GPAT
GPAT++ has three tools to test for population divergence at a SNV/SNP: pFst, wcFst, and bFst. A dataset from a recent publication (Shapiro et al.) is provided with GPAT++ for testing and tutorial purposes only. The file scaffold612.vcf, found in vcflib/sampls comprises genotypes of pigeons. The Ephb2 locus ~600KB is strongly associated with the head crest phenotype.
##pFst
pFst is a likelihood ratio test for allele frequency differences between populations. The parameters of the test are estimated using genotype counts or genotype likelihoods. By default genotype likelihoods are used as it can compensate for NGS genotyping errors. If genotype likelihoods are not available the "count" flag should be specified.
INFO: help
INFO: description:
pFst is a probabilistic approach for detecting differences in allele frequencies between two populations.
Output : 3 columns :
1. seqid
2. position
3. pFst probability
INFO: usage: pFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1 --type PL
INFO: required: t,target -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns
INFO: required: f,file -- argument: a properly formatted VCF.
INFO: required: y,type -- argument: genotype likelihood format ; genotypes: GP,GL or PL; pooled: PO
INFO: optional: d,deltaaf -- argument: skip sites where the difference in allele frequencies is less than deltaaf, default is zero
INFO: optional: r,region -- argument: a tabix compliant genomic range : seqid or seqid:start-end
INFO: optional: c,counts -- switch : use genotype counts rather than genotype likelihoods to estimate parameters, default false
INFO: version 1.0.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu
WARNING code blocks scroll horizontally
../bin/pFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf --counts --type PL > 612.counts
R --vanilla < ../bin/plotPfst.R --args 612.counts
The resulting output:
cd samples
../bin/pFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf --type PL > 612.nocounts
R --vanilla < ../bin/plotPfst.R --args 612.nocounts
The resulting output:
While the differences are visually subtle, genome-wide pFst run with counts has a higher signal to noise ratio. Use genotype likelihood when available!
##wcFst
Weir and Cockerham's Fst estimator is implemented in wcFst (paywal - paper link). wcFst uses the called genotypes with no error modeling.
INFO: help
INFO: description:
wcFst is Weir & Cockerham's Fst for two populations. Negative values are VALID,
they are sites which can be treated as zero Fst. For more information see Evolution, Vol. 38 N. 6 Nov 1984.
Specifically wcFst uses equations 1,2,3,4.
Output : 3 columns :
1. seqid
2. position
3. target allele frequency
4. background allele frequency
5. wcFst
INFO: usage: wcFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1 --type PL
INFO: required: t,target -- argument: a zero based comma separated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- argument: a zero based comma separated list of background individuals corrisponding to VCF columns
INFO: required: f,file -- argument: proper formatted VCF
INFO: required, y,type -- argument: genotype likelihood format; genotype : GL,PL,GP
INFO: optional: r,region -- argument: a tabix compliant genomic range: seqid or seqid:start-end
INFO: optional: d,deltaaf -- argument: skip sites where the difference in allele frequencies is less than deltaaf, default is zero
INFO: version 1.1.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu
cd samples
../bin/wcFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf > 612.wcfst.txt
R --vanilla < ../bin/plotWCfst.R --args 612.wcfst.txt
The resulting output:
##bFst
A Bayesian approach to Fst is calculated in bFST. The fundamental concept was described by Kent Holsinger (paper link). This concept has been extended to use genotype likelihoods. The major advantage of bFst is Fst credible intervals can be obtained. bFst uses Metropolis Hastings to sample the posterior of Fst and therefore it is very computationally expensive. bFst is only fit for assaying small genomic regions regions.
INFO: help:
bFst is a Bayesian approach to Fst. Importantly bFst account for genotype uncertainty in the model using genotype likelihoods.
For a more detailed description see: Holsinger et al. Molecular Ecology Vol 11, issue 7 2002. The likelihood function has been
modified to use genotype likelihoods provided by variant callers. There are five free parameters estimated in the model: each
subpopulation's allele frequency and Fis (fixation index, within each subpopulation), a free parameter for the total population's
allele frequency, and Fst.
Output : 11 columns :
1. Seqid
2. Position
3. Observed allele frequency in target.
4. Estimated allele frequency in target.
5. Observed allele frequency in background.
6. Estimated allele frequency in background.
7. Observed allele frequency combined.
8. Estimated allele frequency in combined.
9. ML estimate of Fst (mean)
10. Lower bound of the 95% credible interval
11. Upper bound of the 95% credible interval
INFO: usage: bFst --target 0,1,2,3,4,5,6,7 --background 11,12,13,16,17,19,22 --file my.vcf --deltaaf 0.1
INFO: required: t,target -- a zero based comma seperated list of target individuals corrisponding to VCF columns
INFO: required: b,background -- a zero based comma seperated list of background individuals corrisponding to VCF columns
INFO: required: f,file a -- a proper formatted VCF file. the FORMAT field MUST contain "PL"
INFO: required: d,deltaaf -- skip sites were the difference in allele frequency is less than deltaaf
INFO: version 1.0.0 ; date: April 2014 ; author: Zev Kronenberg; email : zev.kronenberg@utah.edu
cd samples
../bin/bFst --target 1,20,25,29,30,38,43,46 --background 2,3,4,5,6,7,21,22,22,23,24,26,26,28,31,32,33,34,35,36,37,39,40,41,42,44,45 --deltaaf 0.0 --file scaffold612.vcf > 612.bFst
R --vanilla < ../bin/plotBfst.R --args 612.bFst
The resulting output:
Installing GPAT++ / updating:
SNP methods:
pFst, wcFst, bFst
Basic population statistics
ABBA-BABA test
Haplotype methods:
Haplotype plotting
Linkage disequilibrium
Nucleotide diversity
xpEHH
iHS
hapLrt
Utilities:
Smoothing GPAT++ output
1KG target and background index
Permutation