Skip to content

Commit

Permalink
Merge pull request #1 from laduplessis/master
Browse files Browse the repository at this point in the history
Edit formatting, references and add LaTeX tutorial
  • Loading branch information
rbouckaert committed Mar 20, 2019
2 parents fc37eae + f91ed03 commit 03410a6
Show file tree
Hide file tree
Showing 3 changed files with 910 additions and 68 deletions.
141 changes: 73 additions & 68 deletions README.md
Original file line number Original file line Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
author: Remco Bouckaert author: Remco Bouckaert
level: Intermediate level: Intermediate
title: Model selection title: Model selection
subtitle: with nested sampling subtitle: Model selection with nested sampling
beastversion: 2.5.2 beastversion: 2.5.2
nsversion: 1.0.3 nsversion: 1.0.3
--- ---
Expand All @@ -14,9 +14,14 @@ BEAST provides a bewildering number of models. Bayesians have two techniques to


Bayesian model selection is based on estimating the marginal likelihood: the term forming the denominator in Bayes formula. This is generally a computationally intensive task and there are several ways to estimate them. Here, we concentrate on nested sampling as a way to estimate the marginal likelihood as well as the uncertainty in that estimate. Bayesian model selection is based on estimating the marginal likelihood: the term forming the denominator in Bayes formula. This is generally a computationally intensive task and there are several ways to estimate them. Here, we concentrate on nested sampling as a way to estimate the marginal likelihood as well as the uncertainty in that estimate.


Say, we have two models, M1 and M2, and estimates of the (log) marginal likelihood, ML1 and ML2, then we can calculate the Bayes factor, which is the fraction BF=ML1/ML2 (or in log space, the difference log(BF) = log(ML1)-log(ML2)). If BF is larger than 1, model M1 is favoured, and otherwise M2 is favoured. How much it is favoured can be found in the following table ([Kass 1995](#kass1995bayes)): Say, we have two models, M1 and M2, and estimates of the (log) marginal likelihood, ML1 and ML2, then we can calculate the Bayes factor, which is the fraction BF=ML1/ML2 (or in log space, the difference log(BF) = log(ML1)-log(ML2)). If BF is larger than 1, model M1 is favoured, and otherwise M2 is favoured. How much it is favoured can be found in the following table {% cite kass1995bayes --file NS-tutorial/master-refs %}:


<img style="width:80.0%;" src="figures/BFs.png" alt=""> <figure>
<a name="fig:bfs"></a>
<img style="width:80.0%;" src="figures/BFs.png" alt="">
<figcaption>Figure 1: Bayes factor support.</figcaption>
</figure>
<br>


Note that sometimes a factor 2 is used for multiplying BFs, so when comparing BFs from different publications, be aware which definition that was used. Note that sometimes a factor 2 is used for multiplying BFs, so when comparing BFs from different publications, be aware which definition that was used.


Expand All @@ -37,7 +42,7 @@ So, the main parameters of the algorithm are the number of particles `N` and the


### BEAST2 - Bayesian Evolutionary Analysis Sampling Trees 2 ### BEAST2 - Bayesian Evolutionary Analysis Sampling Trees 2


BEAST2 is a free software package for Bayesian evolutionary analysis of molecular sequences using MCMC and strictly oriented toward inference using rooted, time-measured phylogenetic trees ([Bouckaert et al, 2014](#Bouckaert2014), [Bouckaert et al. 2019](#bouckaert2018beast)). This tutorial uses the BEAST2 version 2.5.2. BEAST2 is a free software package for Bayesian evolutionary analysis of molecular sequences using MCMC and strictly oriented toward inference using rooted, time-measured phylogenetic trees {% cite Bouckaert2014 --file NS-tutorial/master-refs %}, {% cite bouckaert2018beast --file NS-tutorial/master-refs %}. This tutorial uses the BEAST2 version 2.5.2.


### BEAUti2 - Bayesian Evolutionary Analysis Utility ### BEAUti2 - Bayesian Evolutionary Analysis Utility


Expand All @@ -58,50 +63,52 @@ The most popular clock models are the strict clock model and uncorrelated relaxe


## Setting up the Strict clock analysis ## Setting up the Strict clock analysis


First thing to do is set up the two analyses in BEAUti, and run them in order to make sure there are differences in the analyses. The alignment can be downloaded here: [https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex). We will set up a model with tip dates, HKY substitution model, Coalescent prior with constant population, and a fixed clock rate. In BEAUti: First thing to do is set up the two analyses in BEAUti, and run them in order to make sure there are differences in the analyses. The alignment can be downloaded here: [https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex). We will set up a model with tip dates, HKY substitution model, Coalescent prior with constant population, and a fixed clock rate.


* Start a new analysis using the Standard template. > **In BEAUti:**
* Import [HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex) using menu `File > Import alignment`. >
* In the tip-dates panel, select `tip dates`, click `Auto configure` and select the `split on character` option, taking group 2 (see [Fig 1](#fig:auto-config)). > * Start a new analysis using the Standard template.
* In the site model panel, select `HKY` as substitution model and leave the rest as is. > * Import [HBV.nex](https://raw.githubusercontent.com/rbouckaert/NS-tutorial/master/data/HBV.nex) using menu `File > Import alignment`.
* In the clock model panel, set the clock rate to 2e-5. Though usually, we want to estimate the rate, to speed things up for the tutorial, we fix the clock rate at that number as follows: > * In the tip-dates panel, select `tip dates`, click `Auto configure` and select the `split on character` option, taking group 2 (see [Fig 2](#fig:auto-config)).
* Uncheck menu `Mode > Automatic set clock rate`. Now the estimate entry should not be grayed out any more. > * In the site model panel, select `HKY` as substitution model and leave the rest as is.
* Uncheck the `estimate` box next to the clock rate entry. > * In the clock model panel, set the clock rate to 2e-5. Though usually, we want to estimate the rate, to speed things up for the tutorial, we fix the clock rate at that number as follows:
* In the priors panel, select `Coalescent Constant Population` as tree prior. > * Uncheck menu `Mode > Automatic set clock rate`. Now the estimate entry should not be grayed out any more.
* Also in the priors panel, change to `popSize` prior to `Gamma` with alpha = 0.01, beta = 100 ([Fig 2](#fig:priors)). > * Uncheck the `estimate` box next to the clock rate entry.
* In the MCMC panel, change the `Chain Length` to 1 million. > * In the priors panel, select `Coalescent Constant Population` as tree prior.
* You can rename the file for trace log and tree file to include "Strict" to distinguish them for the relaxed clock ones. > * Also in the priors panel, change to `popSize` prior to `Gamma` with alpha = 0.01, beta = 100 ([Fig 3](#fig:priors)).
* Save the file as `HBVStrict.xml`. __Do not close BEAUti just yet!__ > * In the MCMC panel, change the `Chain Length` to 1 million.
* Run the analysis with BEAST. > * You can rename the file for trace log and tree file to include "Strict" to distinguish them for the relaxed clock ones.
> * Save the file as `HBVStrict.xml`. __Do not close BEAUti just yet!__
> * Run the analysis with BEAST.
> >
> Do you have a clock rate prior in the priors panel? If so, the clock rate is estimated, and you should revisit the part where the clock is set up! > **Do you have a clock rate prior in the priors panel?** If so, the clock rate is estimated, and you should revisit the part where the clock is set up!
> >

<a name="fig:auto-config">
<figcaption>Figure 1: Configuring tip dates in BEAUti</figcaption>
</a>


<img style="width:80.0%;" src="figures/BEAUti-configure-tip-dates.png" alt=""> <figure>
<a name="fig:auto-config"></a>
<img style="width:80.0%;" src="figures/BEAUti-configure-tip-dates.png" alt="">
<figcaption>Figure 2: Configuring tip dates in BEAUti</figcaption>
</figure>
<br> <br>



<figure>
<a name="fig:priors"> <a name="fig:priors"></a>
<figcaption>Figure 2: Priors panel for strict clock analysis in BEAUti</figcaption> <img style="width:80.0%;" src="figures/BEAUti-priors.png" alt="">
</a> <figcaption>Figure 3: Priors panel for strict clock analysis in BEAUti</figcaption>

</figure>
<img style="width:80.0%;" src="figures/BEAUti-priors.png" alt="">
<br> <br>


## Setting up the relaxed clock analysis ## Setting up the relaxed clock analysis


While you are waiting for BEAST to finish, it is time to set up the relaxed clock analysis. This is now straightforward if BEAUti is still open (if BEAUti was closed, open BEAUti, and load the file `HBVStrict.xml` through the menu `File > Load`): > While you are waiting for BEAST to finish, it is time to set up the relaxed clock analysis. This is now straightforward if BEAUti is still open (if BEAUti was closed, open BEAUti, and load the file `HBVStrict.xml` through the menu `File > Load`):
* In the clock model panel, change `Strict clock` to `Relaxed Clock Log Normal`. >
* Set the clock rate to 2e-5, and uncheck the `estimate` box. > * In the clock model panel, change `Strict clock` to `Relaxed Clock Log Normal`.
* In the MCMC panel, replace `Strict` in the file names for trace and tree log to `UCLN`. > * Set the clock rate to 2e-5, and uncheck the `estimate` box.
* Save file as `HBVUCLN.xml` **Do not click the `File > Save` menu, but `File > Save as`, otherwise the strict clock XML file will be overwritten** > * In the MCMC panel, replace `Strict` in the file names for trace and tree log to `UCLN`.
* Run the analysis in BEAST > * Save file as `HBVUCLN.xml` **Do not click the `File > Save` menu, but `File > Save as`, otherwise the strict clock XML file will be overwritten**
> * Run the analysis in BEAST
Once the analyses have run, open the log file in Tracer and compare estimates and see whether the analyses substantially differ. You can also compare the trees in DensiTree. Once the analyses have run, open the log file in Tracer and compare estimates and see whether the analyses substantially differ. You can also compare the trees in DensiTree.


Expand All @@ -110,40 +117,47 @@ Once the analyses have run, open the log file in Tracer and compare estimates an
> Which analysis is preferable and why? > Which analysis is preferable and why?
<!-- depends on the question you want to answer: if tree height is of interest, strict clock is preferred, since it reduces the uncertainty. If kappa is of interest, things are not that different --> <!-- depends on the question you want to answer: if tree height is of interest, strict clock is preferred, since it reduces the uncertainty. If kappa is of interest, things are not that different -->
If there are no substantial differences between the analysis for the question you are interested in, you do not have to commit to one model or another, and you can claim that the results are robust under different models. However, if there are significant differences, you may want to do a formal test to see which model is preferred over other models. In a Bayesian context, in practice this comes down to estimating the marginal likelihood, and calculating Bayes factors: the ratios of marginal likelihoods. Nested sampling ([Maturana et al, 2018](#russel2018model)) is one way to estimate marginal likelihoods. If there are no substantial differences between the analysis for the question you are interested in, you do not have to commit to one model or another, and you can claim that the results are robust under different models. However, if there are significant differences, you may want to do a formal test to see which model is preferred over other models. In a Bayesian context, in practice this comes down to estimating the marginal likelihood, and calculating Bayes factors: the ratios of marginal likelihoods. Nested sampling {% cite russel2018model --file NS-tutorial/master-refs %} is one way to estimate marginal likelihoods.


## Installing the NS Package ## Installing the NS Package


To use nested sampling, first have to install the NS (version {{ page.nsversion }} or above) package. To use nested sampling, first have to install the NS (version {{ page.nsversion }} or above) package.


> Open BEAUti and navigate to **File > Manage Packages**. Select NS and then click **Install/Upgrade** ([Fig 3](#fig:install)). Then **_restart BEAUti_** to load the package. > Open BEAUti and navigate to **File > Manage Packages**. Select NS and then click **Install/Upgrade** ([Fig 4](#fig:install)). Then **_restart BEAUti_** to load the package.
> >
<a id="fig:install">
<figcaption>Figure 3: Installing NS in the Manage Packages window in BEAUti</figcaption>
</a>


<img style="width:80.0%;" src="figures/install_NS.png" alt=""> <figure>
<a id="fig:install"></a>
<img style="width:80.0%;" src="figures/install_NS.png" alt="">
<figcaption>Figure 4: Installing NS in the Manage Packages window in BEAUti</figcaption>
</figure>
<br> <br>



## Setting up the nested sampling analyses ## Setting up the nested sampling analyses


* copy the file `HBVStrict.xml` to `HBVStric-NS.xml` and > * copy the file `HBVStrict.xml` to `HBVStric-NS.xml` and
* copy `HBVUCLN.xml` to `HBVUCLN-NS.xml` > * copy `HBVUCLN.xml` to `HBVUCLN-NS.xml`
* start a text editor and in both copied files, change > * start a text editor and in both copied files, change
```xml >
<run id="mcmc" spec="MCMC" chainLength="1000000"> > ```xml
``` > <run id="mcmc" spec="MCMC" chainLength="1000000">
to > ```
```xml >
<run id="mcmc" spec="beast.gss.NS" chainLength="20000" particleCount="1" subChainLength="5000"> > to
``` >
Here the `particleCount` represents the number of active points used in nested sampling: the more points used, the more accurate the estimate, but the longer the analysis takes. The `subChainLength` is the number of MCMC samples taken to get a new point that is independent (enough) from the point that is saved. Longer lengths mean longer runs, but also more independent samples. In practice, running with different `subChainLength` is necessary to find out which length is most suitable (see [FAQ](#nested-sampling-faq)). > ```xml
* change the file names for the trace and tree log to include `NS` (searching for `fileName=` will get you there fastest). > <run id="mcmc" spec="beast.gss.NS" chainLength="20000" particleCount="1" subChainLength="5000">
* save the files, and run with BEAST. > ```
>
> * Here the `particleCount` represents the number of active points used in nested sampling: the more points used, the more accurate the estimate, but the longer the analysis takes. The `subChainLength` is the number of MCMC samples taken to get a new point that is independent (enough) from the point that is saved. Longer lengths mean longer runs, but also more independent samples. In practice, running with different `subChainLength` is necessary to find out which length is most suitable (see [FAQ](#nested-sampling-faq)).
> * change the file names for the trace and tree log to include `NS` (searching for `fileName=` will get you there fastest).
> * save the files, and run with BEAST.
The end of the BEAST run for nested sampling with the strict clock should look something like this: The end of the BEAST run for nested sampling with the strict clock should look something like this:
``` ```
Total calculation time: 34.146 seconds Total calculation time: 34.146 seconds
End likelihood: -202.93224422946253 End likelihood: -202.93224422946253
Expand Down Expand Up @@ -270,6 +284,7 @@ to




# Nested sampling FAQ # Nested sampling FAQ
<a name="nested-sampling-faq"> </a>


## The analysis prints out multiple ML estimates with their SDs. Which one to choose? ## The analysis prints out multiple ML estimates with their SDs. Which one to choose?


Expand Down Expand Up @@ -325,7 +340,7 @@ The ESSs in Tracer of log files with the posterior samples are meaningless, beca


# Useful Links # Useful Links


- [Bayesian Evolutionary Analysis with BEAST 2](http://www.beast2.org/book.html) [Drummond 2014](#BEAST2book2014) - [Bayesian Evolutionary Analysis with BEAST 2](http://www.beast2.org/book.html) {% cite BEAST2book2014 --file NS-tutorial/master-refs %}
- BEAST 2 website and documentation: [http://www.beast2.org/](http://www.beast2.org/) - BEAST 2 website and documentation: [http://www.beast2.org/](http://www.beast2.org/)
- Nested sampling website and documentation: [https://github.com/BEAST2-Dev/nested-sampling](https://github.com/BEAST2-Dev/nested-sampling) - Nested sampling website and documentation: [https://github.com/BEAST2-Dev/nested-sampling](https://github.com/BEAST2-Dev/nested-sampling)
- Join the BEAST user discussion: [http://groups.google.com/group/beast-users](http://groups.google.com/group/beast-users) - Join the BEAST user discussion: [http://groups.google.com/group/beast-users](http://groups.google.com/group/beast-users)
Expand All @@ -334,14 +349,4 @@ The ESSs in Tracer of log files with the posterior samples are meaningless, beca


# Relevant References # Relevant References


<!--{% bibliography --cited --file NS-tutorial/master-refs.bib %}--> {% bibliography --cited --file NS-tutorial/master-refs.bib %}

* <a name="Bouckaert2014">Remco Bouckaert, Joseph Heled, Denise Kuhnert, Tim Vaughan, Chieh-Hsi Wu, Dong Xie, Marc A Suchard, Andrew Rambaut, and Alexei J Drummond. Beast 2: a software platform for bayesian evolutionary analysis. PLoS computational biology, 10(4):e1003537, apr 2014.</a>

* <a name="bouckaert2018beast">Remco Bouckaert, Timothy G Vaughan, Joelle Barido-Sottani, Sebastian Duchene, Mathieu Fourment, Alexandra Gavryushkina, Joseph Heled, Gra- ham Jones, Denise Kuhnert, Nicola De Maio, et al. Beast 2.5: An advanced software platform for bayesian evolutionary analysis. PLoS computational biology, 2019.</a>

* <a name="BEAST2book2014">Alexei J. Drummond and Remco R. Bouckaert. Bayesian evolutionary analysis with BEAST 2. Cambridge University Press, 2014.</a>

* <a name="kass1995bayes">Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statistical association, 90(430):773–795, 1995.</a>

* <a name="russel2018model">Patricio Maturana Russel, Brendon J Brewer, Steffen Klaere, and Remco R Bouckaert. Model selection and parameter inference in phylogenetics using nested sampling. Systematic biology, 68(2):219–233, 2018.</a>
Loading

0 comments on commit 03410a6

Please sign in to comment.