Ascertainment bias correction

ddarriba edited this page Jun 3, 2016 · 5 revisions

Ascertainment bias correction is useful for binary or morphological datasets that contain only variable sites. Invariant characters are usually left out of such alignments, so the likelihood must be corrected for their absence; see, e.g., Lewis (2001).

PLL currently supports 3 different algorithms for ascertainment bias correction:

  • Standard acquisition bias correction (Lewis 2001): PLL_ATTRIB_ASC_BIAS_LEWIS
  • Reconstituted likelihood (Kuhner et al. 2000, McGill et al. 2013): PLL_ATTRIB_ASC_BIAS_FELSENSTEIN. This correction, introduced by Joe Felsenstein, lets you explicitly specify the number of invariable sites to correct for.
  • Refined reconstituted likelihood correction (Leaché et al. 2015): PLL_ATTRIB_ASC_BIAS_STAMATAKIS. This correction, introduced by Alexandros Stamatakis, requires you to explicitly specify the number of invariable sites for each character (if known) to correct for.

In general, the correction is transparent to the client code. The only required steps are to set the appropriate flag (PLL_ATTRIB_ASC_BIAS_%) when the partition is created and, for the methods that depend on them (Felsenstein's and Stamatakis'), to set the number of invariant sites (the asc_state_weights) by calling pll_set_asc_state_weights(). Felsenstein's method needs only the total number of invariant sites, not their composition, so the individual weights can be set arbitrarily as long as they sum to that total.
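As a minimal sketch of the weight setup for the two methods that need it, the helpers below build and check a per-state weight array. The names asc_weights_from_total and asc_weights_sum are hypothetical and not part of PLL; the resulting array is what one would hand to pll_set_asc_state_weights(), whose exact argument list should be checked against pll.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PLL.
 * Felsenstein's method only uses the total number of invariant sites, so
 * the whole count is put on state 0 and every other state gets 0. Any
 * other split with the same sum would be equally valid. */
void asc_weights_from_total(unsigned int *weights, unsigned int states,
                            unsigned int invariant_total)
{
  assert(weights != NULL && states > 0);
  for (unsigned int i = 0; i < states; ++i)
    weights[i] = 0;
  weights[0] = invariant_total;
}

/* Hypothetical helper, not part of PLL.
 * Sums the per-state weights, e.g. to check that an array built for
 * Stamatakis' method adds up to the expected number of invariant sites. */
unsigned int asc_weights_sum(const unsigned int *weights, unsigned int states)
{
  assert(weights != NULL);
  unsigned int total = 0;
  for (unsigned int i = 0; i < states; ++i)
    total += weights[i];
  return total;
}
```

For Stamatakis' method the array would instead hold the known count for each state, and asc_weights_sum() can serve as a sanity check on the total.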

Note:

  • PLL will return an error if ascertainment bias correction is used together with a proportion of invariant sites. Both approaches can still share the same pll_partition_t structure, provided that either the proportion of invariant sites is set to 0.0 for every matrix, or no PLL_ATTRIB_ASC_BIAS algorithm attribute is set (PLL_ATTRIB_ASC_BIAS_FLAG may remain active). The client code is responsible for preventing the use of ascertainment bias correction when the data contains empirical invariable sites (i.e., all tip CLVs are equal for a particular site).
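Since detecting empirical invariable sites is left to the client, a check along the following lines could be run before enabling the correction. This is only a sketch: site_is_invariant and count_invariant_sites are hypothetical names, the alignment is assumed to be an array of equal-length character strings, and raw characters are compared, so ambiguity codes and gaps would need extra handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PLL.
 * Returns 1 if column `site` holds the same character in every sequence,
 * 0 otherwise. */
int site_is_invariant(const char *const *seqs, size_t n_seqs, size_t site)
{
  assert(seqs != NULL && n_seqs > 0);
  for (size_t t = 1; t < n_seqs; ++t)
    if (seqs[t][site] != seqs[0][site])
      return 0;
  return 1;
}

/* Hypothetical helper, not part of PLL.
 * Counts the invariant columns among the first `n_sites` columns. A
 * non-zero result means the data contains empirical invariable sites. */
size_t count_invariant_sites(const char *const *seqs, size_t n_seqs,
                             size_t n_sites)
{
  size_t count = 0;
  for (size_t s = 0; s < n_sites; ++s)
    count += (size_t)site_is_invariant(seqs, n_seqs, s);
  return count;
}
```

A client could refuse to set an ascertainment bias attribute whenever count_invariant_sites() returns a value greater than zero.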

Implementation

When an ascertainment bias correction flag (PLL_ATTRIB_ASC_BIAS_%) is passed to pll_partition_create(), the library allocates additional space in the CLVs to store the invariant sites as well. Hence, if there are n sites, the CLVs have capacity for n+s sites, where s is the number of states. However, partition->sites still contains the value n. The additional space is used only internally and is transparent to the client code. In this section we distinguish between regular sites (n) and additional sites (s).
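The site bookkeeping can be pictured with the small sketch below. Both function names are hypothetical, and the placement of the additional sites directly after the regular ones is an assumption made for illustration, not something this page states.

```c
#include <assert.h>

/* Hypothetical helper, not part of PLL.
 * Number of site slots in one CLV: n regular sites, plus one additional
 * site per state when an ascertainment bias correction is enabled. */
unsigned int clv_site_capacity(unsigned int sites, unsigned int states,
                               int asc_bias_enabled)
{
  return asc_bias_enabled ? sites + states : sites;
}

/* Hypothetical helper, not part of PLL.
 * Slot of the additional site for state k, ASSUMING the additional sites
 * are stored directly after the n regular sites. */
unsigned int asc_site_slot(unsigned int sites, unsigned int states,
                           unsigned int k)
{
  assert(k < states);
  return sites + k;
}
```

With 100 regular DNA sites and the correction enabled, this gives 104 slots, while partition->sites would still report 100.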

pll_update_partials() and pll_update_sumtable() will update the additional sites as if they were regular sites. The library will automatically determine the effective number of sites according to partition->attributes.

However, these additional sites receive special treatment in pll_compute_%_likelihood() and in pll_compute_likelihood_derivatives(). The likelihood and its derivatives are each the sum of two terms: the regular likelihood and derivatives, computed over the n regular sites, and the correction, computed as a method-specific function of the per-site likelihood scores of the additional sites.
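As an illustration of this two-term structure, the conditional likelihood of Lewis (2001) can be written on the log scale as below. The notation is an assumption made here: L_i is the likelihood of regular site i, and L_k^inv is the likelihood of the additional site in which every tip shows state k. The other two methods use different correction functions, and this is not meant as a term-by-term description of the PLL code.

```latex
\ln L_{\mathrm{corrected}}
  = \underbrace{\sum_{i=1}^{n} \ln L_i}_{\text{regular sites}}
  \; \underbrace{-\, n \,\ln\!\Bigl(1 - \sum_{k=1}^{s} L^{\mathrm{inv}}_{k}\Bigr)}_{\text{correction}}
```

If site patterns are compressed and weighted, n in the correction term is the sum of the pattern weights rather than the number of distinct patterns.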

References

  • Kuhner, M. K., P. Beerli, J. Yamato, and J. Felsenstein. 2000. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics 156:439–447.
  • Leaché, A. D., B. L. Banbury, J. Felsenstein, A. Nieto-Montes de Oca, and A. Stamatakis. 2015. Short tree, long tree, right tree, wrong tree: new acquisition bias corrections for inferring SNP phylogenies. Syst. Biol., syv053.
  • Lewis, P. O. 2001. A likelihood approach to estimating phylogeny from discrete morphological character data. Syst. Biol. 50:913–925.
  • McGill, J. R., E. A. Walkup, and M. K. Kuhner. 2013. Correcting coalescent analyses for panel-based SNP ascertainment. Genetics 193:1185–1196.