DISCLAIMER: These tutorials are based on a phylogenetics tool that I am currently developing and that is yet to be published. While some of the scripts/tools that you will find here have been validated and used in published research (Álvarez-Carretero et al., 2022), I am actively implementing new features as part of the current workflow of this pipeline as well as developing new scripts/tools. In other words, the code is not stable and I am still validating the new features. If you want to use the tools that I have developed as part of these tutorials/pipelines, please contact me first. Thank you :)
In this repository, you will find a first tutorial with step-by-step guidelines for timetree inference with `PAML` under a reproducible environment. To that end, you will follow all the steps within a self-contained environment that follows a specific file structure. Consequently, every command and script that you run relies on this file structure to ensure data reproducibility. Please note that basic bioinformatics skills regarding data handling and parsing are required when going through this tutorial. For analyses with `PAML` programs that do not require a pre-determined file structure, please read the next section on the "Quick Start Tutorial".
At the start of this tutorial, we will assume that we have already (1) collected our data, (2) inferred the corresponding sequence alignment, and (3) inferred the corresponding phylogeny. We will focus on the following:
- Getting the example data ready (i.e., correct format to run `PAML` programs).
- Setting a prior for the rates using `R`.
- Running `BASEML` to calculate the branch lengths, the gradient, and the Hessian, which we will then use to approximate the likelihood calculation implemented in `MCMCtree` for timetree inference.
- Using the estimated branch lengths, gradient, and Hessian for timetree inference with `MCMCtree`.
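The hand-off between the last two steps is usually controlled by the `usedata` option of the `MCMCtree` control file: according to the PAML documentation, `usedata = 3` prepares the input that `BASEML` needs to estimate the branch lengths, gradient, and Hessian, while `usedata = 2` runs `MCMCtree` with the approximate likelihood. The snippet below is only a minimal sketch of how this option could be switched from the command line. The helper `set_usedata` and the file name `mcmctree.ctl` are hypothetical and are not part of this repository, so the step-specific `README.md` files remain the reference for the actual commands.

```shell
# Hypothetical helper (not part of this repository): set the `usedata`
# option in an MCMCtree control file.
#   $1 = path to the control file
#   $2 = new value (e.g., 3 before the BASEML step, 2 for the
#        approximate likelihood)
# NOTE: anything written after the `=` sign on that line (old value and
# trailing comments) is replaced by the new value.
set_usedata() {
  ctl_file="$1"
  new_value="$2"
  # Write to a temporary file first so that the command behaves the
  # same with GNU sed (Linux/WSL) and BSD sed (Mac OSX)
  sed "s/^\([[:space:]]*usedata[[:space:]]*=\).*/\1 $new_value/" "$ctl_file" > "$ctl_file.tmp" &&
    mv "$ctl_file.tmp" "$ctl_file"
}

# Example calls (the file name is an assumption):
# set_usedata mcmctree.ctl 3   # prepare the BASEML step
# set_usedata mcmctree.ctl 2   # timetree inference, approximate likelihood
```

Keeping the change in a small function makes it easy to reuse the same control file for both stages without editing it by hand.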
Specific `README.md` files and scripts have been generated for each part of the tutorial so that users can follow the guidelines regardless of their operating system:

- `00_data_formatting`
  - Linux and WSL users can follow the instructions detailed in the `README.md` file.
  - Mac OSX users can follow the instructions detailed in the `README_MacOSX.md` file.
- `01_PAML/00_BASEML`
  - Linux and WSL users can follow the instructions detailed in the `README.md` file.
  - Mac OSX users can follow the instructions detailed in the `README_MacOSX.md` file.
  - Users that want to submit scripts to a High-Performance Computing cluster (SGE scheduler) can follow the instructions detailed in the `README_HPC_SGE.md` file.
- `01_PAML/01_MCMCtree`
  - Linux and WSL users can follow the instructions detailed in the `README.md` file.
  - Mac OSX users can follow the instructions detailed in the `README_MacOSX.md` file.
  - Users that want to submit scripts to a High-Performance Computing cluster (SGE scheduler) can follow the instructions detailed in the `README_HPC_SGE.md` file.
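If you are unsure which file applies to your machine, the kernel name printed by `uname -s` is enough to tell Mac OSX (`Darwin`) apart from Linux and WSL (`Linux`). The sketch below uses a hypothetical helper, `pick_readme`, that is not part of this repository. It does not cover `README_HPC_SGE.md`, which you would choose deliberately when working on a cluster rather than based on your OS.

```shell
# Hypothetical helper (not part of this repository): suggest which
# README file to open inside a tutorial directory.
#   $1 = kernel name as printed by `uname -s`
pick_readme() {
  case "$1" in
    Darwin) echo "README_MacOSX.md" ;;  # Mac OSX
    *)      echo "README.md" ;;         # Linux, including WSL
  esac
}

# Print the suggestion for the machine you are working on
pick_readme "$(uname -s)"
```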
NOTE: While not addressed in this tutorial, everyone needs to be familiar with their dataset(s) before proceeding with timetree inference: how were the data collected? How were the alignments generated? How are the files going to be organised? Here, we will just focus on the subsequent steps, and will assume that all the checks required for timetree and alignment inference have been carried out with the example dataset and are understood by everyone following this tutorial. For a summary of how to approach this data parsing process, which includes phylogeny and alignment inference, you may want to read Álvarez-Carretero & dos Reis, 2022.
The second tutorial is a "Quick Start tutorial" that you can follow to run a simple analysis with `PAML`. This type of analysis is suitable for running small tests with your dataset when using `PAML` programs as well as for timetree inference with small datasets. Only basic bash scripting (e.g., changing directories, modifying file content, and executing programs from the terminal) will be required to follow this tutorial.
For details on how to run `PAML` with phylogenomic datasets without relying on a specific file structure, you can visit the `divtime` GitHub repository developed and maintained by @mariodosreis. This repository is meant to be followed alongside the protocol "Bayesian Molecular Clock Dating Using Genome-Scale Datasets" (dos Reis and Yang, 2019).
Before you start this tutorial, please make sure you have the following software installed on your PCs:
- `PAML`: you will be using the latest `PAML` release (at the time of writing, v4.10.7), available from the `PAML` GitHub repository. If you do not want to install the software from the source code, then follow (A). If you want to install `PAML` from the source code, then follow (B). If you have a Mac with the latest chips (or if you have other chips but neither option A nor B works for you), please follow (C):

  - Installation (A): if you have problems installing `PAML` from the source code or you do not have the tools required to compile the source code, then you can download the pre-compiled binaries available from the latest release by following this link. Please choose the pre-compiled binaries you need according to your OS, download the corresponding compressed file, and save it in your preferred directory. Then, after decompressing the file, please give executable permissions to the binary files and export the path to them so you can execute them from a terminal, and you should be ready to go!

    Windows users: I suggest you install the Windows Subsystem for Linux (i.e., WSL) on your PCs to properly follow this tutorial -- otherwise, you may experience problems with the Windows Command Prompt. Once you have the WSL installed, then you can download the binaries for Linux.
  - Installation (B): to install `PAML` from the latest source code, please follow the instructions given in the code snippet below:

    ```shell
    # Clone the `PAML` GitHub repository to get the latest `PAML` version
    # You can go to "https://github.com/abacus-gene/paml" and manually clone
    # the repository or continue below from the command line
    git clone https://github.com/abacus-gene/paml
    ##> NOTE: You can also download the source code from the latest release
    ##> if you want to download a stable version!
    ##> https://github.com/abacus-gene/paml/releases
    # Change name of cloned directory to keep track of version
    mv paml paml4.10.7
    # Move to `src` directory and compile programs
    cd paml4.10.7/src
    make -f Makefile
    # Remove the object files generated during compilation
    rm -f *.o
    # Move the new executable files to the `bin` directory and give executable
    # permissions
    mkdir -p ../bin
    mv baseml basemlg chi2 codeml evolver infinitesites mcmctree pamp yn00 ../bin
    chmod 775 ../bin/*
    ```
    Now, you just need to export the path to the `bin` directory where you have saved the executable files. If you want to automatically export this path to your `~/.bashrc`/`~/.zshrc`/`~/.bash_profile`/`<you_name_it>` (i.e., the file name depends on your OS), you can run the following commands AFTER ADAPTING the absolute paths written in the code snippet below to those in your filesystem:

    ```shell
    # Run from any location. Make sure you change
    # `~/.bashrc` if you are using another file!
    printf "\n# Export path to PAML\n" >> ~/.bashrc
    # Replace "/c/usr/bioinfo_tools/" with the path
    # that leads to the location where you have saved the
    # `paml4.10.7` directory. Modify any other part of the
    # absolute path if you have made other changes to the
    # name of the directory where you have downloaded `PAML`
    printf "export PATH=/c/usr/bioinfo_tools/paml4.10.7/bin:\$PATH\n" >> ~/.bashrc
    # Now, source the `~/.bashrc` file (or the file you are
    # using) to update the changes
    source ~/.bashrc
    ```

    Alternatively, you can edit this file using your preferred text editor (e.g., `vim`, `nano`, etc.).

    Windows users: I suggest you install the Windows Subsystem for Linux (i.e., WSL) on your PCs to properly follow this tutorial -- otherwise, you may experience problems with the Windows Command Prompt. Once you have the WSL installed, then download the source code and follow the instructions listed above.
  - Installation (C) for M1/M2 chips or Mac users with other chips that experience problems with options A and/or B (Mac OSX): you will need to download the `dev` branch of the `PAML` GitHub repository and compile the binaries from the `dev` source code. Please follow this link and click the green button [`<> Code`] to start the download. You will see that a compressed file called `paml-dev.zip` will start to download. Once you decompress this file, you can go to directory `src` and follow the instructions in (B) to compile the binaries from the source code. If you want to do this from the terminal, you can also clone the repository as explained above and then change the branch using the command `git checkout dev`.

- `R` and `RStudio`: please download R and RStudio as they are used throughout the tutorial. The packages we will be using should work with `R` v4.1.2 or newer. If you are a Windows user, please make sure that you have the correct version of `RTools` installed, which will allow you to install packages from the source code if required. For instance, if you have `R` v4.1.2, then installing `RTools4.0` should be fine. If you have another `R` version installed on your PC, please check whether you need to install `RTools 4.2` or `RTools 4.3`. For more information on which version you should download, please go to the CRAN website by following this link and download the version you need.

  Before you proceed, however, please make sure that you install the following packages too:

  ```r
  # Run from the R console in RStudio
  # Check that you have at least R v4.1.2
  version$version.string
  # Now, install the packages we will be using
  # Note that it may take a while if you have not
  # installed all these packages before
  install.packages( c('rstudioapi', 'ape', 'phytools', 'sn', 'stringr', 'rstan'),
                    dependencies = TRUE )
  ## NOTE: If you are a Windows user and see the message "Do you want to install from sources the
  ## packages which need compilation?", please make sure that you have installed the `RTools`
  ## aforementioned.
  ```
- `FigTree`: you can use this graphical interface to display tree topologies with/without branch lengths and with/without additional labels. You can then decide what you want to be displayed by selecting the buttons and options that you require for that to happen. You can download the latest pre-compiled binaries, `FigTree v1.4.4` at the time of writing, from the `FigTree` GitHub repository.

- `Tracer`: you can use this graphical interface to visually assess the quality of the MCMCs you have run during the analyses with `MCMCtree` (e.g., chain efficiency, chain convergence, autocorrelation, etc.). You can download the latest pre-compiled binaries, `Tracer v1.7.2` at the time of writing, from the `Tracer` GitHub repository.

- Visual Studio Code (optional): for best experience with the tutorials, we highly recommend you install Visual Studio Code and run the tutorial from this IDE to keep everything tidy, organised, and self-contained. You can download VSC from their website. If you are new to VSC, you can check their webinars to learn about its various features and how to make the most out of it. In addition, you may also want to install the following extensions:
  - Markdown PDF -- developer: yzane
  - markdownlint -- developer: David Anson
  - Spell Right -- developer: Bartosz Antosik
  - vscode-pdf -- developer: tomoki1207
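Once the software above is installed, a quick check from the terminal can confirm that the command-line programs are reachable. The snippet below is only a sketch: `add_to_path` and `check_tools` are hypothetical helpers (not part of this repository), and the directory in the example call is a placeholder that you would need to adapt. `add_to_path` covers the permission and path steps described for the pre-compiled binaries in option (A) for the current terminal session only, while `check_tools` reports which programs cannot be found. `FigTree` and `Tracer` are graphical programs and are not checked here.

```shell
# Hypothetical helper: make the files in a directory executable and
# prepend that directory to PATH (unless it is already listed).
# The change only lasts for the current terminal session.
#   $1 = directory with the PAML binaries
add_to_path() {
  paml_bin_dir="$1"
  if [ ! -d "$paml_bin_dir" ]; then
    echo "Directory not found: $paml_bin_dir" >&2
    return 1
  fi
  # Give executable permissions; errors are ignored if the directory is empty
  chmod u+x "$paml_bin_dir"/* 2>/dev/null || true
  case ":$PATH:" in
    *":$paml_bin_dir:"*) ;;                        # already in PATH
    *) PATH="$paml_bin_dir:$PATH"; export PATH ;;
  esac
}

# Hypothetical helper: report which of the given programs can be found.
# Returns 0 if all of them are available, 1 otherwise.
check_tools() {
  tools_missing=0
  for tool_name in "$@"; do
    if command -v "$tool_name" >/dev/null 2>&1; then
      echo "found:   $tool_name"
    else
      echo "missing: $tool_name"
      tools_missing=1
    fi
  done
  return "$tools_missing"
}

# Example calls (the path is a placeholder, adapt it to your filesystem):
# add_to_path "$HOME/bioinfo_tools/paml4.10.7/bin"
# check_tools baseml mcmctree Rscript
```

If `check_tools` reports a missing program, revisiting the corresponding installation option above is usually the quickest fix.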
If you have gone through the previous sections, have a clear understanding of how this repository is organised, and have installed the required software... Then you are ready to go!
You can start the tutorial on reproducibility by jumping to the `00_data_formatting` directory and choosing the `README.md` file that suits your OS. If you want to first go through the quick start tutorial, then go to the `quick_start` directory, choose the `README.md` file that suits your OS, and...

Happy timetree inference! :)
ⓒ Dr Sandra Álvarez-Carretero | @sabifo4