Fst
- How to calculate Fst: using genetic distances (Nei's) and nucleotide diversity (pi; Hohenlohe et al. 2010 methods) instead of expected heterozygosity (HE)? Or using Mark's frequency-based formula (Fst = (p1 – p2)^2 / (4 pBar qBar)), which I will describe more thoroughly when I understand its derivation.
- Calculate Fst at each locus (or at each window?).
- Develop null distribution for comparison
  - Should this be calculated PER LOCUS (i.e., remove pop labels from the data (permute), re-draw locus data with sample sizes matching each population's size, and recalculate Fst (bootstrap) 1000-100000 times to develop a per-locus distribution) -- OR --
  - Should this null distribution be calculated PER SLIDING WINDOW (i.e., permute
  - If you calculated Fst at each window, how to compare windows to windows? Averages?
- Call significance.
- Expected heterozygosity vs nucleotide diversity; Hohenlohe et al. 2010 say they're equivalent, but they're not, are they? They're calculated differently and the results are different. HE = 1 - (p^2 + q^2), which is the same as HE = 2pq for two alleles. pi is the average number of differences per pairwise comparison of sequences.
- Fst: Hohenlohe et al. calculate Fst using a weighting scheme such that each sample's unequal size doesn't bias the output. However, as Mark points out, there is no need to use sample size as a proxy for population size, which is what this is really doing, when you can just use the allele frequencies instead. Yes, you are making the same assumption (that the sample is representative of the population) in both cases, so maybe it doesn't matter at all.
- Mark's method obviates the need for weighting.
- The necessity of weighting comes into play when you try to calculate either pi or expected heterozygosity and have trouble calculating Ht or pi-t. If you simply pool individuals, but have unequal numbers of individuals per pop, you'll get a super biased pool; for example, 10 alleles from pop1, 50 from pop2: here, if you call pi or HE from a pool of 60 alleles it'll be totally skewed. Instead, either weight each allele by the pop size it came from, or simply use the allele freqs from within the pool (Mark's equation that I haven't derived yet)
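The per-locus workflow in these notes (compute Fst, permute population labels, call significance) can be sketched roughly as below. This is only an illustration under stated assumptions: pBar is taken to be the unweighted mean of p1 and p2 and qBar = 1 - pBar, alleles are coded 0/1, monomorphic loci are assigned Fst = 0, and the function names are made up for this sketch. It is not Hohenlohe et al.'s procedure. The example data reuse the 10-allele vs 50-allele case above to show how far the pooled frequency drifts from the mean of the per-population frequencies.

```python
import random

def allele_freq(alleles):
    """Frequency of the allele coded 1 in a list of 0/1 allele calls."""
    return sum(alleles) / len(alleles)

def fst_two_pops(p1, p2):
    """Frequency-based Fst = (p1 - p2)^2 / (4 * pBar * qBar).

    Assumes pBar is the unweighted mean of p1 and p2 (equal population weights).
    """
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    if p_bar == 0 or p_bar == 1:
        return 0.0  # monomorphic locus: defined as 0 here (an assumption)
    return (p1 - p2) ** 2 / (4 * p_bar * q_bar)

def permutation_null(pop1, pop2, n_perm=1000, seed=0):
    """Shuffle population labels, keep the original sample sizes, recompute Fst."""
    rng = random.Random(seed)
    pooled = list(pop1) + list(pop2)
    n1 = len(pop1)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        null.append(fst_two_pops(allele_freq(pooled[:n1]),
                                 allele_freq(pooled[n1:])))
    return null

def p_value(observed, null):
    """One-sided permutation p-value with the usual add-one correction."""
    return (1 + sum(f >= observed for f in null)) / (1 + len(null))

# 10 alleles from pop1 (p1 = 0.8) and 50 alleles from pop2 (p2 = 0.2)
pop1 = [1] * 8 + [0] * 2
pop2 = [1] * 10 + [0] * 40

p1, p2 = allele_freq(pop1), allele_freq(pop2)
pooled_p = allele_freq(pop1 + pop2)   # dominated by the larger sample
mean_p = (p1 + p2) / 2                # each population counts equally

observed = fst_two_pops(p1, p2)
null = permutation_null(pop1, pop2, n_perm=1000)
print("pooled p:", pooled_p, "mean p:", mean_p)
print("observed Fst:", observed, "p-value:", p_value(observed, null))
```

Working from p1 and p2 directly means the larger sample cannot swamp the smaller one, which is the point of the frequency-based formula; whether equal weights are the right choice is still the open question raised above.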
Notes from Hohenlohe et al. 2010:
- How many SNPs per X bp-wide window?
- How to set window size?
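One way to explore both questions is simply to tabulate how many SNPs fall in each window for a few candidate window sizes. The sketch below assumes SNP positions are given as base-pair coordinates on one chromosome and that windows are half-open intervals `[start, start + window)`. The function name, the example positions, and the half-window step are all made up for illustration, and this is not the smoothing approach used in the paper.

```python
def snps_per_window(positions, chrom_length, window, step):
    """Count SNPs in each window [start, start + window), sliding by `step` bp."""
    counts = []
    start = 0
    while start < chrom_length:
        end = start + window
        n_snps = sum(1 for pos in positions if start <= pos < end)
        counts.append((start, end, n_snps))
        start += step
    return counts

# Hypothetical SNP positions on a 5 kb stretch
snp_positions = [100, 250, 900, 1200, 1250, 1300, 4800]

for window in (500, 1000, 2000):
    table = snps_per_window(snp_positions, chrom_length=5000,
                            window=window, step=window // 2)
    n = [count for _, _, count in table]
    empty = sum(1 for count in n if count == 0)
    print(f"window={window} bp: min={min(n)} max={max(n)} empty windows={empty}")
```

A window size that leaves many windows empty or with only one or two SNPs is probably too small for a stable per-window Fst.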
Notes from Implications of use of Wright’s FST for the role of probability and causation in evolution, by Marshall Abrams:
"For diploid genetics, we can think of FST as the degree to which the overall population departs from Hardy-Weinberg equilibrium due to genes being sorted in a partially nonrandom manner into populations."
"FST (and other F-statistics: FIT, FIS, etc.) are usually defined in terms of distributions of alleles in populations. Generally we don’t have complete data about these distributions. Therefore we take random samples from actual populations. Call this statistical sampling (Weir, 1996; Holsinger and Weir, 2009). We then compute a function of the distribution in the samples, where this function would approach the true value of FST as we collected more samples. The value of such a function is an estimate of FST. In statistical terminology, this means that FST is a parameter of group of populations; it is a property of the overall metapopulation. Parameters are contrasted with estimates, which are statistics—properties of our data. (The term “F-statistics” is thus misleading.)"
Notes from Shane's Simple Guide to F-statistics:
"Hardy-Weinberg Equilibrium: This is also a central concept in the derivation of F-statistics. The most basic point here (as illustrated below) is that, regardless of the genotype frequencies you start with in one generation, if there is completely random mating, then the genotype frequencies of the next generation will tend towards a highly predicted ratio - the HWE – that is determined entirely by the allele frequencies."
"But how is this genotype ratio derived? Very simply, it comes from the probabilities of getting each of the three types of allele pairs (genotypes) shown above. Each of these three combined probabilities comes merely from the product of the probabilities (or frequencies) of the two alleles. This is shown below in both example numbers and in terms of the generalised allele frequencies, p and q. You might recognise that the formulae for the genotype frequencies are in the form of a binomial expansion. Thus, with more than two alleles the H-W frequencies can be easily determined using the appropriate expansion."
"It can also be seen below that the formulae for calculating expected (as opposed to observed) heterozygosity and homozygosity come straight from the HWE. It is useful to note here that when there are more than 2 alleles, it is easier to calculate heterozygosity (H) using H = 1 - homozygosity"
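The quoted points can be checked numerically with a small sketch. It assumes allele frequencies are supplied as a dictionary that sums to 1; the function names are invented for this example. It builds the Hardy-Weinberg genotype frequencies for any number of alleles and confirms that H = 1 - homozygosity matches the sum of all heterozygote frequencies.

```python
from itertools import combinations_with_replacement

def hwe_genotype_freqs(allele_freqs):
    """Expected genotype frequencies under random mating (HWE).

    Homozygote (a, a) has frequency p_a^2; heterozygote (a, b) has 2 * p_a * p_b.
    """
    genotypes = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        if a == b:
            genotypes[(a, b)] = allele_freqs[a] ** 2
        else:
            genotypes[(a, b)] = 2 * allele_freqs[a] * allele_freqs[b]
    return genotypes

def expected_heterozygosity(allele_freqs):
    """H = 1 - expected homozygosity = 1 - sum of squared allele frequencies."""
    return 1 - sum(p ** 2 for p in allele_freqs.values())

# Two alleles: p = 0.7, q = 0.3 gives p^2, 2pq, q^2
two = hwe_genotype_freqs({"A": 0.7, "a": 0.3})
print(two)

# Three alleles: the shortcut H = 1 - homozygosity avoids summing every heterozygote
three_alleles = {"A": 0.5, "B": 0.3, "C": 0.2}
three = hwe_genotype_freqs(three_alleles)
het_sum = sum(f for (a, b), f in three.items() if a != b)
print(expected_heterozygosity(three_alleles), het_sum)
```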
Notes/quotes from Charlesworth 1998:
"...to point out that measures of the relative amounts of between-population and total diversity, such as FST, are not necessarily appropriate if we wish to compare loci with very different levels of within-population variation."
"Measures of population differentiation can be computed using either data on allele frequencies, as in the case of allozyme or microsatellite data, or data on variation at the nucleotide site level. ... I will concentrate here on nucleotide site diversity values, because these were used in the above Drosophila studies, and because a simple evolutionary model (the infinite-sites model [Kimura 1971]) is plausible as an underlying mechanism. For simplicity, I will describe the measures in terms of population parameters rather than sample statistics. A diversity measure is then defined as the probability that two alleles with a defined origin differ at a random nucleotide site. In practice, an estimate of a diversity measure is obtained from the mean number of differences between a pair of alleles, normalizing by the numbers of bases in the sequence (Nei 1987, chapter 10)."
"Pairwise measures of diversity per nucleotide site for a set of n populations of the same species can be partitioned into the total diversity (piT), calculated by pooling the set of populations and averaging over pairs of distinct alleles sampled randomly from the set, and the mean within-population diversity (piS) (cf. Nei 1973; Holsinger and Mason-Gamer 1996). In general, the contribution of a population to the pool should be weighted by its size, but this is usually unknown, so either sample sizes are used as surrogates for population size, or all populations are weighted equally. "
RC Note on this above paragraph: Mark's method uses frequencies within pops so that sample size is irrelevant, if you trust that you've sampled enough genotypes to accurately estimate the true frequency in the population. But to me, it still seems like weighting is important? But difficult in practice.
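To make the weighting question concrete, here is a rough sketch of piS, piT, and Fst = (piT - piS) / piT computed from aligned sequences. The assumptions are mine: sequences are equal-length strings, piT under explicit weights w is taken as the sum over population pairs of w_i * w_j * pi_ij (with pi_ii the within-population diversity), and "pooled" piT is just pi over the combined sample, which implicitly weights by sample size. The function names are hypothetical. The example is a single biallelic site with 10 alleles in one population and 50 in the other, where pi reduces to 2pq times n/(n-1).

```python
from itertools import combinations

def pairwise_diffs(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def pi(seqs):
    """Mean pairwise differences per site among distinct sequences in one sample."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diffs(a, b) for a, b in pairs) / (len(pairs) * len(seqs[0]))

def pi_between(pop_a, pop_b):
    """Mean per-site differences between one sequence from each population."""
    total = sum(pairwise_diffs(a, b) for a in pop_a for b in pop_b)
    return total / (len(pop_a) * len(pop_b) * len(pop_a[0]))

def pi_s(pops, weights):
    """Weighted mean within-population diversity."""
    return sum(w * pi(p) for w, p in zip(weights, pops))

def pi_t(pops, weights):
    """Total diversity under explicit population weights (assumed definition)."""
    total = 0.0
    for i, pop_i in enumerate(pops):
        for j, pop_j in enumerate(pops):
            d = pi(pop_i) if i == j else pi_between(pop_i, pop_j)
            total += weights[i] * weights[j] * d
    return total

# One biallelic site: pop1 has 8 A / 2 T, pop2 has 10 A / 40 T
pop1 = ["A"] * 8 + ["T"] * 2
pop2 = ["A"] * 10 + ["T"] * 40
pops = [pop1, pop2]
equal = [0.5, 0.5]

pi_t_equal = pi_t(pops, equal)
pi_t_pooled = pi(pop1 + pop2)          # sample sizes act as the weights
fst_equal = (pi_t_equal - pi_s(pops, equal)) / pi_t_equal

print("pi pop1:", pi(pop1), "vs 2pq:", 2 * 0.8 * 0.2)
print("piT equal weights:", pi_t_equal, "piT pooled:", pi_t_pooled)
print("Fst (equal weights):", fst_equal)
```

In this example the pooled piT is noticeably lower than the equally weighted piT because the larger sample dominates the pool, which is the skew the notes worry about.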
"The Use of Absolute and Relative Measures of Divergence. Relative measures of between-population divergence, such as FST, are inherently dependent on the extent of within-population diversity (Nei 1973, 1987, p. 190). Indeed, for loci with very high levels of diversity such as microsatellites, FST is a poor measure of between-population divergence even in the absence of forces that affect diversity, since FST is necessarily low even if absolute divergence is high (T. Nagylaki, personal communication). As shown above, factors which affect within-population diversity at otherwise comparable loci can cause substantial among-locus differences in these relative measures in the absence of any differences in absolute measures of between-population diversity. ... The use of absolute measures of divergence is thus necessary when comparing genomic regions with different levels of recombination or species with different breeding systems. As mentioned above, tests of significance will then require adjustment for possible differences in mutation rates, by the use of comparisons with related species (Hudson, Kreitman, and Aguadé 1987)."
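A toy calculation illustrates why FST is "necessarily low" at highly diverse loci. It assumes two equally weighted populations that share no alleles at all, each carrying k equally frequent private alleles, and uses the heterozygosity-based form Fst = (HT - HS) / HT. The setup and function names are invented for this illustration. Even with complete differentiation, Fst works out to 1 / (2k - 1), so it shrinks as within-population diversity grows.

```python
def expected_het(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1 - sum(p * p for p in freqs)

def fst_no_shared_alleles(k):
    """Fst = (HT - HS) / HT for two equally weighted populations with
    k equally frequent private alleles each (no allele shared between them)."""
    hs = expected_het([1 / k] * k)                   # same in both populations
    ht = expected_het([1 / (2 * k)] * (2 * k))       # 2k alleles in the total
    return (ht - hs) / ht

for k in (1, 2, 5, 10, 20):
    print(f"k={k:2d} alleles per pop: HS={expected_het([1 / k] * k):.3f} "
          f"Fst={fst_no_shared_alleles(k):.3f}")
```

With k = 1 (fixed differences) Fst is 1, but with 10 private alleles per population it is already close to 0.05 despite no overlap in alleles.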
Notes/quotes from Nielsen et al. 2009, Darwinian and demographic forces affecting human protein coding genes (http://genome.cshlp.org/content/19/5/838.full.pdf); Hohenlohe et al. 2010 used modified equations from this paper to determine Fst based on nucleotide diversity, weighting for sample size.
"Methods for detecting selection based on allele frequency distributions include Tajima’s D (Tajima 1989), Fay and Wu’s H (Fay and Wu 2000), the use of FST (e.g., Carlson et al. 2005), the HKA test (Hudson et al. 1987), and a number of other statistics. These statistics can all be calculated as functions of the so-called two-dimensional site frequency spectrum (2D-SFS). The 2D-SFS summarizes the joint allele frequencies in the two populations in a matrix containing the number of SNPs with sample frequency i in one population and j in the other population in the (i, j)th entry of the matrix. To detect selection we devise a number of different statistics that summarize specific aspects of the 2D-SFS. This allows us to examine different aspect of the action of selection and distinguish between different types of selection."
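The matrix described in the quote can be built in a few lines. This sketch assumes each SNP is summarized by the count of one chosen allele (say, the derived allele) in each sample, with n1 and n2 chromosomes sampled; the function name and the example counts are hypothetical, and this is not the paper's code.

```python
def two_d_sfs(counts1, counts2, n1, n2):
    """Build the 2D-SFS: entry [i][j] is the number of SNPs whose counted allele
    appears i times in sample 1 (of n1 chromosomes) and j times in sample 2 (of n2)."""
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in zip(counts1, counts2):
        sfs[i][j] += 1
    return sfs

# Five hypothetical SNPs, 4 chromosomes sampled per population
counts_pop1 = [1, 1, 0, 4, 2]
counts_pop2 = [0, 0, 3, 4, 2]

sfs = two_d_sfs(counts_pop1, counts_pop2, n1=4, n2=4)
for row in sfs:
    print(row)

# Row sums recover the ordinary (1D) SFS of population 1
sfs_pop1 = [sum(row) for row in sfs]
print(sfs_pop1)
```

Entries far from the diagonal (common in one sample, rare in the other) are the SNPs that drive high Fst.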