# Phylogenetic inference





<br>


## Overview


In the first two modules, we assembled RADseq data with ipyrad and ran some population clustering analyses. **Next, we'll make a couple of quick phylogenetic trees of our samples from the ipyrad assembly.** These analyses require only the output from submodule 1, not the population structure output from submodule 2.

We'll use two different programs that have different theoretical underpinnings: IQ-TREE (IQ-TREE 2 paper [here](https://academic.oup.com/mbe/article/37/5/1530/5721363?ref=https://githubhelp.com) and documentation [here](http://www.iqtree.org/doc/)) and [SVDQuartets](https://academic.oup.com/bioinformatics/article/30/23/3317/206559?login=true). IQ-TREE is a maximum likelihood method for estimating phylogenies from concatenated alignments. Most models in IQ-TREE and similar maximum likelihood and Bayesian phylogeny programs assume that all sites in an alignment share a single underlying tree. This assumption may be violated for many reasons, including incomplete lineage sorting and gene flow.

We will demonstrate two specific methods to infer phylogenies, but explaining the theoretical underpinnings of phylogenetic inference and the various classes of methods is beyond the scope of this tutorial. There are a great many programs and approaches for inferring phylogenies, and which one you choose will ultimately depend on your data and the questions you want to answer. If you need a general introduction to phylogenetics, Felsenstein's [Inferring Phylogenies](https://global.oup.com/academic/product/inferring-phylogenies-9780878931774?cc=us&lang=en&), Baum and Smith's [Tree Thinking: An Introduction to Phylogenetic Biology](https://store.macmillanlearning.com/us/product/Tree-Thinking-An-Introduction-to-Phylogenetic-Biology/p/1936221160?srsltid=AfmBOorzK_ikT0AD1kCOswzDK_hGQ4t5ht7iQp_l5MnGxGDe9JtSKkzg), and Kubatko and Knowles' [Species Tree Inference: A Guide to Methods and Applications](https://press.princeton.edu/books/hardcover/9780691207599/species-tree-inference?srsltid=AfmBOorMTxQfQ0L9s4oXURRlGYZhYRM2N3ggYRKpAhIz1mPR8_ffdf9T) are excellent places to start. [Maddison (1997)](https://academic.oup.com/sysbio/article/46/3/523/1651369) and [Degnan and Rosenberg (2006)](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020068) both offer good introductions to the differences between species trees and gene trees and the problems that incomplete lineage sorting can cause.

Gene flow is still notoriously hard to detect and adequately model in phylogenetic inference, but many programs now exist that are robust to variation in phylogenetic signal across sites/genes caused by incomplete lineage sorting.

SVDQuartets is one such method. SVDQuartets uses a quartet approach that is statistically consistent with the multi-species coalescent without the computational burden of explicitly modeling the multi-species coalescent process.




## Learning objectives

1. Estimate phylogeny using IQ-TREE
2. Estimate phylogeny using SVDQuartets


## Prerequisites

All necessary software is included in the container that we will use on Google Cloud: `us-east4-docker.pkg.dev/nih-cl-shared-resources/nigms-sandbox/nigms-vertex-r-wy`

If you are not using the container, you will need to install the following software:

- [PAUP](https://paup.phylosolutions.com/get-paup/) for running SVDQuartets
- [IQ-TREE](http://www.iqtree.org/#download)



<br>
<br>
<br>
<br>
<br>

# Get started

<br>

If you did not run through the previous tutorial or are running this tutorial from a fresh instance, you can download the ipyrad output we provide in the "radseq_cloud" Google bucket. Only uncomment and run these next lines if you want to download the ipyrad assembly.

In [None]:
#! gsutil -m cp -r gs://radseq_cloud/ .
#! mkdir -p ./ipyrad_out/ruber_reduced_denovo_outfiles/
#! cp ./radseq_cloud/ruber-ipyrad-out/* ./ipyrad_out/ruber_reduced_denovo_outfiles/

<br>
<br>
<br>


## IQ-TREE

IQ-TREE is a widely used and easy-to-use program for estimating maximum likelihood phylogenies. We'll start with this.

Running it is simple: we mostly just need to point `iqtree2` at the input file, which here is the PHYLIP-formatted output from ipyrad, the `.phy` file.


We'll set up our input and output paths as variables. That way, different datasets only require changing these variables, not the program call itself.

- `INFILE` is the path to the input file.
- `OUTFIX` is the prefix that gets prepended to each output file name.
- `outdir` is the directory that we want all output to go into.


Options that we'll set in the program call include:

`-s $INFILE` sets the input sequence file.

`-m MFP` runs ModelFinder Plus: instead of specifying a model of sequence evolution ourselves, we let IQ-TREE select the best-fitting model and then infer the tree under it.

`-T auto` tells IQ-TREE to automatically determine the best number of threads to use, up to the maximum we specify based on what we've allocated. The `-nt AUTO` in the command below is the older spelling of the same option.

`--prefix $OUTFIX` sets the prefix for our output files to the value of our `OUTFIX` variable.

`-B 1000` tells IQ-TREE to run 1000 ultrafast bootstrap replicates to assess branch support.

`-alrt 1000` runs 1000 replicates of the SH-aLRT test (a likelihood-based measure of branch support).

`-ntmax 8` sets the maximum number of threads to use. This should not exceed the number of cores in your instance.

`-safe` turns on IQ-TREE's safe numerical mode, which helps avoid numerical underflow in the likelihood calculations on large datasets.
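The options above can also be assembled in Python before being handed to the shell. The following is a minimal sketch, not part of the original workflow: the helper name `build_iqtree_command` and its default values are hypothetical, and the list it returns simply mirrors the flags described above.

```python
import shlex


def build_iqtree_command(infile, prefix, max_threads=8, n_boot=1000, n_alrt=1000):
    """Assemble the iqtree2 argument list described in this section.

    The function name and defaults are illustrative assumptions; only the
    flags themselves come from the tutorial.
    """
    return [
        "iqtree2",
        "-s", infile,                 # input alignment
        "-m", "MFP",                  # ModelFinder Plus
        "-T", "auto",                 # choose thread count automatically
        "-ntmax", str(max_threads),   # upper bound on threads
        "--prefix", prefix,           # output file prefix
        "-B", str(n_boot),            # ultrafast bootstrap replicates
        "-alrt", str(n_alrt),         # SH-aLRT replicates
    ]


cmd = build_iqtree_command("ruber_reduced_denovo.phy", "ruber")
print(shlex.join(cmd))
```

A list like this could be passed to `subprocess.run(cmd, check=True)` if you prefer not to use the `!` shell magic, and changing the dataset then only means changing the two arguments.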


In [1]:
import os
from jupyterquiz import display_quiz

# set up the input file, outfile prefix, and output directory
os.environ["INFILE"] = "/home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovo.phy"
os.environ["OUTFIX"] = "ruber"
outdir = "/home/jupyter/RADseq_cloud_learn/iqtree_out"

In [None]:
os.makedirs(outdir, exist_ok=True) # create the output directory if it doesn't already exist
os.chdir(outdir)

In [None]:
## Execute iqtree


! iqtree2 -s $INFILE -m MFP -T auto --prefix $OUTFIX -B 1000 -alrt 1000 -nt AUTO -ntmax 8 -safe

You may see a number of "likelihood is underflown" warnings. These aren't ideal, but the tree we get is reasonable even with them, so we'll ignore them for now.

If IQ-TREE runs successfully, you should see something that ends like this:


<img src="images/IQ_end.png" width=40% />

and your output directory should contain several files, most importantly `ruber.treefile`. We'll visualize the tree you estimated in R in the next tutorial.
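When both `-alrt` and `-B` are used, IQ-TREE labels each internal node in the tree file with the two support values separated by a slash (SH-aLRT first, then ultrafast bootstrap). As a rough sketch of how those labels could be read without any phylogenetics library, the snippet below parses a made-up Newick string; the toy tree, the function name `parse_supports`, and the regular expression are illustrative assumptions, and the 80/95 thresholds follow the rule of thumb suggested in the IQ-TREE documentation.

```python
import re

# Matches a node label of the form ")SH-aLRT/UFBoot", e.g. ")98.7/100"
SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def parse_supports(newick):
    """Return a list of (SH-aLRT, UFBoot) pairs found in a Newick string."""
    return [(float(alrt), float(ufboot)) for alrt, ufboot in SUPPORT_PATTERN.findall(newick)]


# Hypothetical tree, not output from the ruber dataset
toy_tree = "((A:0.01,B:0.02)98.7/100:0.03,(C:0.01,D:0.04)62.1/71:0.02,E:0.05);"

supports = parse_supports(toy_tree)

# Keep nodes with SH-aLRT >= 80 and UFBoot >= 95
well_supported = [pair for pair in supports if pair[0] >= 80 and pair[1] >= 95]
print(supports)
print(well_supported)
```

To try it on real output, you would read the contents of `ruber.treefile` into a string and pass that to `parse_supports` instead of the toy tree.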



* Note that if you are using SNP data that does not include invariant characters, you need to use a model that accounts for this (in IQ-TREE, an ascertainment bias correction such as `+ASC`). If your alignment contains only variable characters and you use standard substitution models, IQ-TREE will overestimate the substitution rate, which can distort branch lengths and, in some cases, the inferred topology.
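One quick way to see whether an alignment includes invariant characters is to count them. The sketch below is a simplified check under stated assumptions: it works on a list of equal-length sequence strings (reading the `.phy` file is left out), the function name `count_site_classes` is hypothetical, and anything other than A, C, G, or T (including `N`, `-`, and ambiguity codes) is treated as missing.

```python
def count_site_classes(sequences):
    """Count invariant and variable columns in equal-length aligned sequences.

    Only A/C/G/T are considered; all other characters are treated as missing.
    Columns with no unambiguous bases at all are counted as invariant.
    """
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")

    invariant = 0
    variable = 0
    for column in zip(*sequences):
        bases = {char.upper() for char in column if char.upper() in "ACGT"}
        if len(bases) > 1:
            variable += 1
        else:
            invariant += 1
    return invariant, variable


# Hypothetical four-sample alignment; only the fifth column varies
toy_alignment = ["ACGTAC", "ACGTTC", "ACNTGC", "AC-TAC"]
n_invariant, n_variable = count_site_classes(toy_alignment)
print(n_invariant, n_variable)
```

If the invariant count comes back as zero, the alignment is SNP-only and the note above about ascertainment bias applies.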



In [2]:
display_quiz("quizzes/submodule_3/quiz1.json")


## SVDQuartets


SVDQuartets is a quartet-based method designed to estimate species trees from SNPs, but it can also be used with full concatenated alignments to generate trees of individuals, as we've done with IQ-TREE.

It is somewhat more involved to set up, and we'll again set it up with a bunch of bash variables.

We'll run a single search for the best tree and save it, then run a second search that includes bootstrapping and save those trees. Later, in R, we'll plot the bootstrap support values onto the best tree. Note that if you run only a bootstrap analysis and plot the tree that comes out of it with bootstrap values at the nodes, the values will be plotted on a consensus of the bootstrap trees, not on the best tree estimated from your actual data. I consider this to be highly undesirable.
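To make the idea of mapping bootstrap support onto the best tree concrete, here is a toy sketch in Python. It is not how the tutorial does it (that happens in R in the next notebook), and it makes simplifying assumptions: trees are written by hand as nested tuples rather than parsed from Newick, all trees share the same tips, and the function names are hypothetical. Because the trees are treated as unrooted, each internal branch is stored as a bipartition, represented by the side that does not contain a fixed reference tip.

```python
from collections import Counter


def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree, reference):
    """Collect non-trivial bipartitions, each stored as the side lacking `reference`."""
    all_tips = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        side = all_tips - clade if reference in clade else clade
        # Keep only splits with at least two tips on each side
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def support_on_best_tree(best, replicates):
    """Percentage of replicate trees containing each split of the best tree."""
    reference = min(leaves(best))
    counts = Counter()
    for replicate in replicates:
        counts.update(splits(replicate, reference))
    return {s: 100.0 * counts[s] / len(replicates) for s in splits(best, reference)}


# Hypothetical best tree and four hypothetical bootstrap trees
best = ((("A", "B"), "C"), ("D", "E"))
boots = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), ("D", "E")), "C"),
]

support = support_on_best_tree(best, boots)
for split, pct in sorted(support.items(), key=lambda item: sorted(item[0])):
    print(sorted(split), pct)
```

The key point is that only the splits present in the best tree are scored, so the topology shown never changes; a consensus of the bootstrap trees could instead display splits that the best tree does not contain.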

### Edit the nexus file

You will need to edit the nexus file created by ipyrad to create a nexus file that SVDQuartets/PAUP will correctly read in. The character sets specified in the file we got from ipyrad will cause issues, and so we need to delete them. 

We need to make a copy of `ruber_reduced_denovo.nex` as `ruber_reduced_denovoPAUP.nex` and then in the new file, we will delete the line `BEGIN SETS;` 


<img src="images/start_sets.png" width=30% />


along with all of the lines that begin with `charset` and the `END;` that marks the end of the charsets block.

<img src="images/end_sets.png" width=30% />

We will do this programmatically using a `sed` command. Be sure to modify the path depending on whether you ran ipyrad yourself or copied in the output we provide.

In [None]:
! sed '/BEGIN SETS;/,/^END;/d' /home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovo.nex > /home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovoPAUP.nex


Note that there are other `end;` statements in the nexus file that you do not want to delete. The end of your Nexus file should look like this after deleting the charsets block:
 
<img src="images/no_charsets.png" width=40% />
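For readers less familiar with `sed`, the same range deletion can be sketched in Python. This is an illustrative equivalent rather than a replacement for the command above: the function name `strip_sets_block` and the toy Nexus lines are hypothetical, and the matching is case-sensitive in the same way, so the lowercase `end;` that closes the data block is left alone.

```python
def strip_sets_block(lines):
    """Drop everything from a line containing 'BEGIN SETS;' through the
    next line that starts with 'END;' (inclusive), keeping all other lines."""
    kept = []
    in_sets = False
    for line in lines:
        if not in_sets and "BEGIN SETS;" in line:
            in_sets = True          # start of the block to delete
            continue
        if in_sets:
            if line.startswith("END;"):
                in_sets = False     # this END; is the last deleted line
            continue
        kept.append(line)
    return kept


# Hypothetical miniature Nexus file, one list element per line
toy_nexus = [
    "#nexus",
    "begin data;",
    "  dimensions ntax=2 nchar=4;",
    "  format datatype=dna missing=N gap=-;",
    "  matrix",
    "  sampleA ACGT",
    "  sampleB ACGA",
    "  ;",
    "end;",
    "BEGIN SETS;",
    "  charset locus_0 = 1-4;",
    "END;",
]

cleaned = strip_sets_block(toy_nexus)
print("\n".join(cleaned))
```

On a real file you would read the lines of `ruber_reduced_denovo.nex` (stripping the trailing newlines), pass them through the function, and write the result to `ruber_reduced_denovoPAUP.nex`.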
 




Once that is done, you can proceed. We will set this up to use variables to specify the input, output, and some options for SVDQuartets so that in most cases, you should not need to edit anything in the second part of this code block. Note, however, that there are some options that we have defined in the program call, and in some cases you may want to change these.

In [None]:
%%bash
PAUP="/usr/bin/paup4" # set up PAUP path
OUTDIR="/home/jupyter/RADseq_cloud_learn/svdq_out"


# define variables for the PAUP block
filebname="ruber_reduced_denovo" # basename for all output files
# double check that this path points to the edited nexus file written by the sed command above
infile="/home/jupyter/RADseq_cloud_learn/ipyrad_out/ruber_reduced_denovo_outfiles/ruber_reduced_denovoPAUP.nex" # input nexus file; giving a full path means the input does not have to be in the working directory
nthreads=8 # number of threads to use
nreps=200 # number of bootstrap replicates



################################################################################################################################################################
################################################################################################################################################################
####    Run based on the parameters set above
################################################################################################################################################################
################################################################################################################################################################


# change working directory to where your output files will go
mkdir -p "$OUTDIR"
cd "$OUTDIR"


cat <<EOF > $filebname.paup.txt
Begin paup;
set autoclose=yes warntree=no warnreset=no flock=no;
log start file=$filebname.log ;
execute $infile;
svdQuartets evalQuartets=all showScores=no ambigs=distribute bootstrap=no nthreads=$nthreads;
savetrees file=$filebname.besttree.tre;
svdQuartets evalQuartets=all showScores=no ambigs=distribute bootstrap=standard nreps=$nreps nthreads=$nthreads treefile=$filebname.svdqboots.tre;  
quit; 
end;
EOF

$PAUP $filebname.paup.txt #execute your new paup block file





If SVDQuartets runs successfully, your output directory should contain several files, most importantly `ruber_reduced_denovo.svdqboots.tre` and `ruber_reduced_denovo.besttree.tre`. You should also see a text representation of the tree that was inferred. We'll visualize the tree you estimated in R in the next tutorial.
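The `evalQuartets=all` setting in the PAUP block above asks SVDQuartets to score every possible set of four tips, and that number grows quickly with the number of samples. The short calculation below shows the scaling; the tip counts are arbitrary examples rather than the size of the ruber dataset, and the helper name `n_quartets` is hypothetical.

```python
from math import comb


def n_quartets(n_tips):
    """Number of distinct four-tip subsets among n_tips samples."""
    return comb(n_tips, 4)


# Arbitrary example sizes, chosen only to show how fast the count grows
for n_tips in (10, 30, 100):
    print(n_tips, n_quartets(n_tips))
```

For a dataset of modest size, evaluating all quartets is usually feasible. For datasets with many more tips, this count is one reason you might revisit the options in the program call.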


In [3]:
display_quiz("quizzes/submodule_3/quiz2.json")


# Conclusion

In this tutorial, we took the RADseq data that was assembled using ipyrad in the first submodule and used two different phylogenetic methods to estimate trees from the data. In the next tutorial, we will show how to plot these trees in R.



<br>

# Cleanup

If you are not immediately moving on to the next notebook, shut down your GCP instance to prevent being charged while it sits idle.