Skip to content
Permalink
Browse files

Merge pull request #1 from laduplessis/master

Edit formatting, references and add LaTeX tutorial
  • Loading branch information...
rbouckaert committed Mar 20, 2019
2 parents fc37eae + f91ed03 commit 03410a6088a45bb028f32267a960e7970e2245f3
Showing with 910 additions and 68 deletions.
  1. +73 −68 README.md
  2. +201 −0 biblatex_macros.tex
  3. +636 −0 main.tex
141 README.md
@@ -2,7 +2,7 @@
author: Remco Bouckaert
level: Intermediate
title: Model selection
subtitle: with nested sampling
subtitle: Model selection with nested sampling
beastversion: 2.5.2
nsversion: 1.0.3
---
@@ -14,9 +14,14 @@ BEAST provides a bewildering number of models. Bayesians have two techniques to

Bayesian model selection is based on estimating the marginal likelihood: the term forming the denominator in Bayes formula. This is generally a computationally intensive task and there are several ways to estimate them. Here, we concentrate on nested sampling as a way to estimate the marginal likelihood as well as the uncertainty in that estimate.

Say, we have two models, M1 and M2, and estimates of the (log) marginal likelihood, ML1 and ML2, then we can calculate the Bayes factor, which is the fraction BF=ML1/ML2 (or in log space, the difference log(BF) = log(ML1)-log(ML2)). If BF is larger than 1, model M1 is favoured, and otherwise M2 is favoured. How much it is favoured can be found in the following table ([Kass 1995](#kass1995bayes)):
Say, we have two models, M1 and M2, and estimates of the (log) marginal likelihood, ML1 and ML2, then we can calculate the Bayes factor, which is the fraction BF=ML1/ML2 (or in log space, the difference log(BF) = log(ML1)-log(ML2)). If BF is larger than 1, model M1 is favoured, and otherwise M2 is favoured. How much it is favoured can be found in the following table {% cite kass1995bayes --file NS-tutorial/master-refs %}:

<img style="width:80.0%;" src="figures/BFs.png" alt="">
<figure>
<a name="fig:bfs"></a>
<img style="width:80.0%;" src="figures/BFs.png" alt="">
<figcaption>Figure 1: Bayes factor support.</figcaption>
</figure>
<br>

Note that sometimes a factor 2 is used for multiplying BFs, so when comparing BFs from different publications, be aware which definition that was used.

@@ -37,7 +42,7 @@ So, the main parameters of the algorithm are the number of particles `N` and the

### BEAST2 - Bayesian Evolutionary Analysis Sampling Trees 2

BEAST2 is a free software package for Bayesian evolutionary analysis of molecular sequences using MCMC and strictly oriented toward inference using rooted, time-measured phylogenetic trees ([Bouckaert et al, 2014](#Bouckaert2014), [Bouckaert et al. 2019](#bouckaert2018beast)). This tutorial uses the BEAST2 version 2.5.2.
BEAST2 is a free software package for Bayesian evolutionary analysis of molecular sequences using MCMC and strictly oriented toward inference using rooted, time-measured phylogenetic trees {% cite Bouckaert2014 --file NS-tutorial/master-refs %}, {% cite bouckaert2018beast --file NS-tutorial/master-refs %}. This tutorial uses the BEAST2 version 2.5.2.

### BEAUti2 - Bayesian Evolutionary Analysis Utility

@@ -58,50 +63,52 @@ The most popular clock models are the strict clock model and uncorrelated relaxe

## Setting up the Strict clock analysis

First thing to do is set up the two analyses in BEAUti, and run them in order to make sure there are differences in the analyses. The alignment can be downloaded here: [https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex). We will set up a model with tip dates, HKY substitution model, Coalescent prior with constant population, and a fixed clock rate. In BEAUti:

* Start a new analysis using the Standard template.
* Import [HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex) using menu `File > Import alignment`.
* In the tip-dates panel, select `tip dates`, click `Auto configure` and select the `split on character` option, taking group 2 (see [Fig 1](#fig:auto-config)).
* In the site model panel, select `HKY` as substitution model and leave the rest as is.
* In the clock model panel, set the clock rate to 2e-5. Though usually, we want to estimate the rate, to speed things up for the tutorial, we fix the clock rate at that number as follows:
* Uncheck menu `Mode > Automatic set clock rate`. Now the estimate entry should not be grayed out any more.
* Uncheck the `estimate` box next to the clock rate entry.
* In the priors panel, select `Coalescent Constant Population` as tree prior.
* Also in the priors panel, change to `popSize` prior to `Gamma` with alpha = 0.01, beta = 100 ([Fig 2](#fig:priors)).
* In the MCMC panel, change the `Chain Length` to 1 million.
* You can rename the file for trace log and tree file to include "Strict" to distinguish them for the relaxed clock ones.
* Save the file as `HBVStrict.xml`. __Do not close BEAUti just yet!__
* Run the analysis with BEAST.
First thing to do is set up the two analyses in BEAUti, and run them in order to make sure there are differences in the analyses. The alignment can be downloaded here: [https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex). We will set up a model with tip dates, HKY substitution model, Coalescent prior with constant population, and a fixed clock rate.

> **In BEAUti:**
>
> * Start a new analysis using the Standard template.
> * Import [HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex) using menu `File > Import alignment`.
> * In the tip-dates panel, select `tip dates`, click `Auto configure` and select the `split on character` option, taking group 2 (see [Fig 2](#fig:auto-config)).
> * In the site model panel, select `HKY` as substitution model and leave the rest as is.
> * In the clock model panel, set the clock rate to 2e-5. Though usually, we want to estimate the rate, to speed things up for the tutorial, we fix the clock rate at that number as follows:
> * Uncheck menu `Mode > Automatic set clock rate`. Now the estimate entry should not be grayed out any more.
> * Uncheck the `estimate` box next to the clock rate entry.
> * In the priors panel, select `Coalescent Constant Population` as tree prior.
> * Also in the priors panel, change to `popSize` prior to `Gamma` with alpha = 0.01, beta = 100 ([Fig 3](#fig:priors)).
> * In the MCMC panel, change the `Chain Length` to 1 million.
> * You can rename the file for trace log and tree file to include "Strict" to distinguish them for the relaxed clock ones.
> * Save the file as `HBVStrict.xml`. __Do not close BEAUti just yet!__
> * Run the analysis with BEAST.
>
> Do you have a clock rate prior in the priors panel? If so, the clock rate is estimated, and you should revisit the part where the clock is set up!
> **Do you have a clock rate prior in the priors panel?** If so, the clock rate is estimated, and you should revisit the part where the clock is set up!
>

<a name="fig:auto-config">
<figcaption>Figure 1: Configuring tip dates in BEAUti</figcaption>
</a>

<img style="width:80.0%;" src="figures/BEAUti-configure-tip-dates.png" alt="">
<figure>
<a name="fig:auto-config"></a>
<img style="width:80.0%;" src="figures/BEAUti-configure-tip-dates.png" alt="">
<figcaption>Figure 2: Configuring tip dates in BEAUti</figcaption>
</figure>
<br>


<a name="fig:priors">
<figcaption>Figure 2: Priors panel for strict clock analysis in BEAUti</figcaption>
</a>

<img style="width:80.0%;" src="figures/BEAUti-priors.png" alt="">
<figure>
<a name="fig:priors"></a>
<img style="width:80.0%;" src="figures/BEAUti-priors.png" alt="">
<figcaption>Figure 3: Priors panel for strict clock analysis in BEAUti</figcaption>
</figure>
<br>

## Setting up the relaxed clock analysis

While you are waiting for BEAST to finish, it is time to set up the relaxed clock analysis. This is now straightforward if BEAUti is still open (if BEAUti was closed, open BEAUti, and load the file `HBVStrict.xml` through the menu `File > Load`):
* In the clock model panel, change `Strict clock` to `Relaxed Clock Log Normal`.
* Set the clock rate to 2e-5, and uncheck the `estimate` box.
* In the MCMC panel, replace `Strict` in the file names for trace and tree log to `UCLN`.
* Save file as `HBVUCLN.xml` **Do not click the `File > Save` menu, but `File > Save as`, otherwise the strict clock XML file will be overwritten**
* Run the analysis in BEAST
> While you are waiting for BEAST to finish, it is time to set up the relaxed clock analysis. This is now straightforward if BEAUti is still open (if BEAUti was closed, open BEAUti, and load the file `HBVStrict.xml` through the menu `File > Load`):
>
> * In the clock model panel, change `Strict clock` to `Relaxed Clock Log Normal`.
> * Set the clock rate to 2e-5, and uncheck the `estimate` box.
> * In the MCMC panel, replace `Strict` in the file names for trace and tree log to `UCLN`.
> * Save file as `HBVUCLN.xml` **Do not click the `File > Save` menu, but `File > Save as`, otherwise the strict clock XML file will be overwritten**
> * Run the analysis in BEAST
Once the analyses have run, open the log file in Tracer and compare estimates and see whether the analyses substantially differ. You can also compare the trees in DensiTree.

@@ -110,40 +117,47 @@ Once the analyses have run, open the log file in Tracer and compare estimates an
> Which analysis is preferable and why?
<!-- depends on the question you want to answer: if tree height is of interest, strict clock is preferred, since it reduces the uncertainty. If kappa is of interest, things are not that different -->
If there are no substantial differences between the analysis for the question you are interested in, you do not have to commit to one model or another, and you can claim that the results are robust under different models. However, if there are significant differences, you may want to do a formal test to see which model is preferred over other models. In a Bayesian context, in practice this comes down to estimating the marginal likelihood, and calculating Bayes factors: the ratios of marginal likelihoods. Nested sampling ([Maturana et al, 2018](#russel2018model)) is one way to estimate marginal likelihoods.
If there are no substantial differences between the analysis for the question you are interested in, you do not have to commit to one model or another, and you can claim that the results are robust under different models. However, if there are significant differences, you may want to do a formal test to see which model is preferred over other models. In a Bayesian context, in practice this comes down to estimating the marginal likelihood, and calculating Bayes factors: the ratios of marginal likelihoods. Nested sampling {% cite russel2018model --file NS-tutorial/master-refs %} is one way to estimate marginal likelihoods.

## Installing the NS Package

To use nested sampling, first have to install the NS (version {{ page.nsversion }} or above) package.

> Open BEAUti and navigate to **File > Manage Packages**. Select NS and then click **Install/Upgrade** ([Fig 3](#fig:install)). Then **_restart BEAUti_** to load the package.
> Open BEAUti and navigate to **File > Manage Packages**. Select NS and then click **Install/Upgrade** ([Fig 4](#fig:install)). Then **_restart BEAUti_** to load the package.
>
<a id="fig:install">
<figcaption>Figure 3: Installing NS in the Manage Packages window in BEAUti</figcaption>
</a>

<img style="width:80.0%;" src="figures/install_NS.png" alt="">
<figure>
<a id="fig:install"></a>
<img style="width:80.0%;" src="figures/install_NS.png" alt="">
<figcaption>Figure 4: Installing NS in the Manage Packages window in BEAUti</figcaption>
</figure>
<br>


## Setting up the nested sampling analyses

* copy the file `HBVStrict.xml` to `HBVStric-NS.xml` and
* copy `HBVUCLN.xml` to `HBVUCLN-NS.xml`
* start a text editor and in both copied files, change
```xml
<run id="mcmc" spec="MCMC" chainLength="1000000">
```
to
```xml
<run id="mcmc" spec="beast.gss.NS" chainLength="20000" particleCount="1" subChainLength="5000">
```
Here the `particleCount` represents the number of active points used in nested sampling: the more points used, the more accurate the estimate, but the longer the analysis takes. The `subChainLength` is the number of MCMC samples taken to get a new point that is independent (enough) from the point that is saved. Longer lengths mean longer runs, but also more independent samples. In practice, running with different `subChainLength` is necessary to find out which length is most suitable (see [FAQ](#nested-sampling-faq)).
* change the file names for the trace and tree log to include `NS` (searching for `fileName=` will get you there fastest).
* save the files, and run with BEAST.
> * copy the file `HBVStrict.xml` to `HBVStric-NS.xml` and
> * copy `HBVUCLN.xml` to `HBVUCLN-NS.xml`
> * start a text editor and in both copied files, change
>
> ```xml
> <run id="mcmc" spec="MCMC" chainLength="1000000">
> ```
>
> to
>
> ```xml
> <run id="mcmc" spec="beast.gss.NS" chainLength="20000" particleCount="1" subChainLength="5000">
> ```
>
> * Here the `particleCount` represents the number of active points used in nested sampling: the more points used, the more accurate the estimate, but the longer the analysis takes. The `subChainLength` is the number of MCMC samples taken to get a new point that is independent (enough) from the point that is saved. Longer lengths mean longer runs, but also more independent samples. In practice, running with different `subChainLength` is necessary to find out which length is most suitable (see [FAQ](#nested-sampling-faq)).
> * change the file names for the trace and tree log to include `NS` (searching for `fileName=` will get you there fastest).
> * save the files, and run with BEAST.

The end of the BEAST run for nested sampling with the strict clock should look something like this:

```
Total calculation time: 34.146 seconds
End likelihood: -202.93224422946253
@@ -270,6 +284,7 @@ to


# Nested sampling FAQ
<a name="nested-sampling-faq"> </a>

## The analysis prints out multiple ML estimates with their SDs. Which one to choose?

@@ -325,7 +340,7 @@ The ESSs in Tracer of log files with the posterior samples are meaningless, beca

# Useful Links

- [Bayesian Evolutionary Analysis with BEAST 2](http://www.beast2.org/book.html) [Drummond 2014](#BEAST2book2014)
- [Bayesian Evolutionary Analysis with BEAST 2](http://www.beast2.org/book.html) {% cite BEAST2book2014 --file NS-tutorial/master-refs %}
- BEAST 2 website and documentation: [http://www.beast2.org/](http://www.beast2.org/)
- Nested sampling website and documentation: [https://github.com/BEAST2-Dev/nested-sampling](https://github.com/BEAST2-Dev/nested-sampling)
- Join the BEAST user discussion: [http://groups.google.com/group/beast-users](http://groups.google.com/group/beast-users)
@@ -334,14 +349,4 @@ The ESSs in Tracer of log files with the posterior samples are meaningless, beca

# Relevant References

<!--{% bibliography --cited --file NS-tutorial/master-refs.bib %}-->

* <a name="Bouckaert2014">Remco Bouckaert, Joseph Heled, Denise Kuhnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. Beast 2: a software platform for bayesian evolutionary analysis. PLoS computational biology, 10(4):e1003537, apr 2014.</a>

* <a name="bouckaert2018beast">Remco Bouckaert, Timothy G Vaughan, Joelle Barido-Sottani, Sebastian Duchene, Mathieu Fourment, Alexandra Gavryushkina, Joseph Heled, Gra- ham Jones, Denise Kuhnert, Nicola De Maio, et al. Beast 2.5: An advanced software platform for bayesian evolutionary analysis. PLoS computational biology, 2019.</a>

* <a name="BEAST2book2014">Alexei J. Drummond and Remco R. Bouckaert. Bayesian evolutionary analysis with BEAST 2. Cambridge University Press, 2014.</a>

* <a name="kass1995bayes">Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.</a>

* <a name="russel2018model">Patricio Maturana Russel, Brendon J Brewer, Steffen Klaere, and Remco R Bouckaert. Model selection and parameter inference in phylogenetics using nested sampling. Systematic biology, 68(2):219–233, 2018.</a>
{% bibliography --cited --file NS-tutorial/master-refs.bib %}
Oops, something went wrong.

0 comments on commit 03410a6

Please sign in to comment.
You can’t perform that action at this time.