sabifo4/Tutorial_MCMCtree


Timetree inference analysis with PAML: step-by-step tutorials

DISCLAIMER: These tutorials are based on a phylogenetics tool that I am working on at the moment, which I am still developing and is yet to be published. While some of the scripts/tools that you will find here have been validated and used in published research (Álvarez-Carretero et al., 2022), I am actively implementing new features as part of the current workflow of this pipeline as well as developing new scripts/tools. In other words, the code is not stable and I am still validating the new features. If you want to use the tools that I have developed as part of these tutorials/pipelines, please first contact me. Thank you :)

Introduction

Tutorial 1: reproducible timetree inference

In this repository, you will find a first tutorial with step-by-step guidelines for timetree inference with PAML under a reproducible environment. To that end, you will follow all the steps within a self-contained environment that follows a specific file structure. Consequently, every command and script that you run relies on this file structure to ensure data reproducibility. Please note that basic bioinformatics skills regarding data handling and parsing are required when going through this tutorial. For analyses with PAML programs that do not require a pre-determined file structure, please read the next section on the "Quick Start tutorial".

At the start of this tutorial, we will assume that we have already (1) collected our data, (2) inferred the corresponding sequence alignment, and (3) inferred the corresponding phylogeny. We will focus on the following:

  1. Getting the example data ready (i.e., correct format to run PAML programs).
  2. Setting a prior for the rates using R.
  3. Running BASEML to calculate the branch lengths, the gradient, and the Hessian, which we will then use to approximate the likelihood calculation implemented in MCMCtree for timetree inference.
  4. Using the estimated branch lengths, gradient, and Hessian for timetree inference with MCMCtree.

Specific README.md files and scripts have been generated for each part of the tutorial so that users can follow the guidelines regardless of their operating system:

  • 00_data_formatting
  • 01_PAML/00_BASEML
    • Linux and WSL users can follow the instructions detailed in the README.md file.
    • Mac OSX users can follow the instructions detailed in the README_MacOSX.md file.
    • Users that want to submit scripts to a High-Performance Computing cluster (SGE scheduler) can follow the instructions detailed in the README_HPC_SGE.md file.
  • 01_PAML/01_MCMCtree
    • Linux and WSL users can follow the instructions detailed in the README.md file.
    • Mac OSX users can follow the instructions detailed in the README_MacOSX.md file.
    • Users that want to submit scripts to a High-Performance Computing cluster (SGE scheduler) can follow the instructions detailed in the README_HPC_SGE.md file.

NOTE: While not addressed in this tutorial, it is noteworthy that everyone needs to be familiar with their dataset/s before proceeding with timetree inference: how were the data collected? How were the alignments generated? How are the files going to be organised? Here, we will just focus on the subsequent steps, and will assume that all the checks required for timetree and alignment inference have been carried out with the example dataset and are understood by everyone following this tutorial. For a summary on how to approach this data parsing process, which includes phylogeny and alignment inference, you may want to read Álvarez-Carretero & dos Reis, 2022.

Quick Start tutorial

The second tutorial is a "Quick Start tutorial" that you can follow to run a simple analysis with PAML. This type of analysis is suitable for running small tests with your dataset when using PAML programs as well as for timetree inference with small datasets. Only basic bash scripting (e.g., changing directories, modifying file content, and executing programs from the terminal) will be required to follow this tutorial.

For details on how to run PAML with phylogenomic datasets without relying on a specific file structure, you can visit the divtime GitHub repository developed and maintained by @mariodosreis. This repository is meant to be followed alongside the protocol "Bayesian Molecular Clock Dating Using Genome-Scale Datasets" (dos Reis and Yang, 2019).

Software requirements

Before you start this tutorial, please make sure you have the following software installed on your PCs:

  • PAML: you will be using the latest PAML release (at the time of writing, v4.10.7), available from the PAML GitHub repository. If you do not want to install the software from the source code, then follow (A). If you want to install PAML from the source code, then follow (B). If you have a Mac with the latest chips (or if you have other chips but neither option A nor B works for you), please follow (C):

    • Installation (A): if you have problems installing PAML from the source code or you do not have the tools required to compile the source code, then you can download the pre-compiled binaries available from the latest release by following this link. Please choose the pre-compiled binaries you need according to your OS, download the corresponding compressed file, and save it in your preferred directory. Then, after decompressing the file, please give executable permissions, export the path to this binary file so you can execute it from a terminal, and you should be ready to go!

      Windows users: I suggest you install the Windows Subsystem for Linux (i.e., WSL) on your PCs to properly follow this tutorial -- otherwise, you may experience problems with the Windows Command Prompt. Once you have the WSL installed, then you can download the binaries for Linux.
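The last two steps of option (A), giving executable permissions and exporting the path, can be sketched as a small shell function. This is a minimal sketch, not part of the PAML distribution: the function name `add_paml_to_path` and the example directory are hypothetical, and the permission mode `775` simply mirrors the one used in option (B).

```shell
# Hypothetical helper: make the PAML binaries in a directory executable
# and prepend that directory to PATH for the current shell session.
add_paml_to_path() {
  paml_bin="$1"
  # Stop early if the directory does not exist
  if [ ! -d "$paml_bin" ]; then
    echo "Directory not found: $paml_bin" >&2
    return 1
  fi
  # Give executable permissions to every file in the directory
  chmod 775 "$paml_bin"/*
  # Prepend the directory to PATH only if it is not already there
  case ":$PATH:" in
    *":$paml_bin:"*) ;;
    *) export PATH="$paml_bin:$PATH" ;;
  esac
}
```

For example, if you decompressed the binaries into `$HOME/paml/bin` (an assumed location, please adapt it to your filesystem), you would run `add_paml_to_path "$HOME/paml/bin"`. The change only lasts for the current terminal session unless you also add the export line to your shell start-up file.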

    • Installation (B): to install PAML from the latest source code, please follow the instructions given in the code snippet below:

      # Clone the `PAML` GitHub repository to get the latest `PAML` version
      # You can go to "https://github.com/abacus-gene/paml" and manually clone
      # the repository or continue below from the command line
      git clone https://github.com/abacus-gene/paml
      ##> NOTE: You can also download the source code from the latest release
      ##> if you want to download a stable version!
      ##> https://github.com/abacus-gene/paml/releases
      # Change name of cloned directory to keep track of version
      mv paml paml4.10.7
      # Move to `src` directory and compile programs
      cd paml4.10.7/src
      make -f Makefile
      rm *.o
      # Move the new executable files to the `bin` directory and give executable
      # permissions
      mkdir ../bin
      mv baseml basemlg chi2 codeml evolver infinitesites mcmctree pamp yn00 ../bin
      chmod 775 ../bin/*

      Now, you just need to export the path to the bin directory where you have saved the executable files. If you want to automatically export this path from your ~/.bashrc, ~/.zshrc, ~/.bash_profile, or equivalent file (i.e., the file name depends on your OS and shell), you can run the following commands AFTER ADAPTING the absolute paths written in the code snippet below to those in your filesystem:

      # Run from any location. Make sure you change
      # `~/.bashrc` if you are using another file!
      printf "\n# Export path to PAML\n" >> ~/.bashrc
      # Replace "/c/usr/bioinfo_tools/" with the path
      # that leads to the location where you have saved the
      # `paml4.10.7` directory. Modify any other part of the
      # absolute path if you have made other changes to the 
      # name of the directory where you have downloaded `PAML`
      printf "export PATH=/c/usr/bioinfo_tools/paml4.10.7/bin:\$PATH\n" >>  ~/.bashrc
      # Now, source the `~/.bashrc` file (or the file you are 
      # using) to update the changes
      source ~/.bashrc

      Alternatively, you can edit this file using your preferred text editor (e.g., vim, nano, etc.).

      Windows users: I suggest you install the Windows Subsystem for Linux (i.e., WSL) on your PCs to properly follow this tutorial -- otherwise, you may experience problems with the Windows Command Prompt. Once you have the WSL installed, then download the source code and follow the instructions listed above.

  • Installation (C) for M1/M2 chips or Mac users with other chips that experience problems with options A and/or B (Mac OSX): you will need to download the dev branch on the PAML GitHub repository and compile the binaries from the dev source code. Please follow this link and click the green button [<> Code] to start the download. You will see that a compressed file called paml-dev.zip will start to download. Once you decompress this file, you can go to directory src and follow the instructions in (B) to compile the binaries from the source code. If you wanted to do this from the terminal, you could also clone the repository as explained above and then change the branch using command git checkout dev.
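The terminal route for option (C) can be sketched as follows. This is only an illustration: the function name `fetch_paml_dev`, the default directory name `paml-dev`, and the optional second argument (an alternative repository location, handy for trying the function against a local mirror) are assumptions rather than part of PAML. The repository URL, the `dev` branch, and the `make -f Makefile` call are the ones mentioned in options (B) and (C).

```shell
# Hypothetical helper: clone PAML, switch to the `dev` branch, and compile.
#   $1: target directory (defaults to "paml-dev")
#   $2: repository location (defaults to the PAML GitHub repository)
fetch_paml_dev() {
  target_dir="${1:-paml-dev}"
  repo_url="${2:-https://github.com/abacus-gene/paml}"
  # Each step only runs if the previous one succeeded
  git clone "$repo_url" "$target_dir" &&
    git -C "$target_dir" checkout dev &&
    make -C "$target_dir/src" -f Makefile
}
```

Running `fetch_paml_dev` without arguments would leave the compiled programs inside `paml-dev/src`. You would then move them to a `bin` directory, give them executable permissions, and export the path as described in (B).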

  • R and RStudio: please download R and RStudio as they are used throughout the tutorial. The packages we will be using should work with R v4.1.2 or newer. If you are a Windows user, please make sure that you have the correct version of RTools installed, which will allow you to install packages from the source code if required. For instance, if you have R v4.1.2, then installing RTools4.0 should be fine. If you have another R version installed on your PC, please check whether you need to install RTools 4.2 or RTools 4.3. For more information on which version you should download, please go to the CRAN website by following this link and download the version you need.

    Before you proceed, however, please make sure that you install the following packages too:

    # Run from the R console in RStudio
    # Check that you have at least R v4.1.2
    version$version.string
    # Now, install the packages we will be using
    # Note that it may take a while if you have not 
    # installed all these packages before
    install.packages( c('rstudioapi', 'ape', 'phytools', 'sn', 'stringr', 'rstan'), dependencies = TRUE )
    ## NOTE: If you are a Windows user and see the message "Do you want to install from sources the 
    ## packages which need compilation?", please make sure that you have installed the `RTools`
    ## version mentioned above.
  • FigTree: you can use this graphical interface to display tree topologies with/without branch lengths and with/without additional labels. You can then decide what you want to be displayed by selecting the buttons and options that you require for that to happen. You can download the latest pre-compiled binaries, FigTree v1.4.4 at the time of writing, from the FigTree GitHub repository.

  • Tracer: you can use this graphical interface to visually assess the quality of the MCMCs you have run during the analyses with MCMCtree (e.g., chain efficiency, chain convergence, autocorrelation, etc.). You can download the latest pre-compiled binaries, Tracer v1.7.2 at the time of writing, from the Tracer GitHub repository.

  • Visual Studio Code (optional): for the best experience with the tutorials, we highly recommend you install Visual Studio Code and run the tutorial from this IDE to keep everything tidy, organised, and self-contained. You can download VSC from their website. If you are new to VSC, you can check their webinars to learn about its various features and how to make the most out of it. In addition, you may also want to install the following extensions:

    • Markdown PDF -- developer: yzane
    • markdownlint -- developer: David Anson
    • Spell Right -- developer: Bartosz Antosik
    • vscode-pdf -- developer: tomoki1207

Are you ready?

If you have gone through the previous sections, have a clear understanding of how this repository is organised, and have installed the required software... Then you are ready to go!
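If you want to double-check that the PAML programs can be found from your terminal before you start, a quick test along the following lines may help. It is a minimal sketch: the function name `check_paml_programs` is hypothetical, and the list of programs is the one compiled in installation option (B), so a pre-compiled release may ship a slightly different set.

```shell
# Hypothetical helper: report which PAML programs are reachable via PATH.
# Succeeds (exit status 0) only if all of them are found.
check_paml_programs() {
  missing=0
  for prog in baseml basemlg chi2 codeml evolver infinitesites mcmctree pamp yn00; do
    if command -v "$prog" >/dev/null 2>&1; then
      printf 'found:   %s -> %s\n' "$prog" "$(command -v "$prog")"
    else
      printf 'missing: %s\n' "$prog"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}
```

After pasting the function into your terminal, run `check_paml_programs`. Any line starting with `missing:` suggests that the path to the `bin` directory has not been exported correctly or that the corresponding program was not compiled or downloaded.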

You can start the tutorial on reproducibility by jumping to the 00_data_formatting directory and choosing the README.md file that suits your OS. If you want to first go through the quick start tutorial, then go to the quick_start directory, choose the README.md file that suits your OS, and...

Happy timetree inference! :)


ⓒ Dr Sandra Álvarez-Carretero | @sabifo4