# Lab 3: Tree inference using parsimony; bootstrapping

<img src="http://bitesizebio.s3.amazonaws.com/wp-content/uploads/2012/01/Pseudoreplication_header-image_cropped.jpg" />

### By the end of the lab, you will know how to:
1. Reconstruct a tree using the maximum parsimony optimality criterion.
2. Perform a bootstrap analysis to estimate confidence in the inferred topology.
3. View the results in FigTree.

We will use the rhodopsin and _wingless_ alignments from last week. In case those files have gone missing, they can be downloaded from the course website:

- [LWRh.afa](https://sites.google.com/a/fieldmuseum.org/rtol/lab-exercises/LWRh.afa)
- [Wg.afa](https://sites.google.com/a/fieldmuseum.org/rtol/lab-exercises/Wg.afa)

## Tree inference using PAUP*

[__PAUP*__](http://phylosolutions.org) is one of the original phylogeny inference programs. Its original name was just PAUP, for Phylogenetic Analysis using Parsimony, but as likelihood-based methods were implemented, it was changed to PAUP* - Phylogenetic Analysis using PAUP - a [recursive acronym](https://en.wikipedia.org/wiki/Recursive_acronym) in the venerable UNIX tradition.

The PAUP* program and manual can be downloaded from its website: http://phylosolutions.org. There is a GUI (mouse-driven) version for Mac computers, but we will use the command-line version, which works across operating systems and will soon be free and open-source.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Download PAUP* (if not already installed)

The Mac command-line version is here: http://phylosolutions.com/paup-test/paup4a157_osx.gz

Download and extract the binary executable file (**`paup4a157_osx`**) to your working directory.

Open a Terminal and, if necessary, **`cd`** to the working directory.

For convenience, rename the file **`paup4a157_osx`** to just **`paup`**.

  ```bash
  mv paup4a157_osx paup
  ```

Make sure it is recognized as an executable program:

  ```bash
  chmod +x paup
  ```
  
Run the program. Since it resides in the current working directory **`./`**, which is not in our [PATH environment variable], we need to specify its path:

  ```bash
  ./paup
  ```

Note how it replaces the Terminal shell prompt with its own command prompt. You can access a rudimentary help system for the available commands by typing **`?`**:

  ```
  paup> ?
  ```
  
To quit the program, enter **`q`**.

  ```
  paup> q
  ```

[PATH environment variable]: https://en.wikipedia.org/wiki/PATH_(variable)

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Convert your aligned FASTA files to NEXUS files

PAUP* and Mesquite require input files in [NEXUS format](https://en.wikipedia.org/wiki/Nexus_file). Like FASTA files, NEXUS files are just text files with a particular structure (and vocabulary).

Last time we used MUSCLE to create FASTA alignments, **`LWRh.afa`** and **`Wg.afa`**.

Open each file in Mesquite. This will prompt you to save it in NEXUS format.

Save each file with the default names, **`LWRh.afa.nex`** and **`Wg.afa.nex`**.

Open one in a text editor (you can do this in Jupyter, from the Home tab) and inspect the NEXUS format. Note:

* the leading **`#NEXUS`** token
* the modular structure of named _blocks_, each opened by a **BEGIN _BLOCKNAME_;** statement and closed by an **END;** statement
* comments (ignored by programs reading the file) are enclosed in square brackets: **`[this is a comment]`**
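
To make these three features concrete, here is a minimal Python sketch that checks for the leading token, strips comments, and lists the block names. The NEXUS text is a made-up toy example (hypothetical taxa and sequences), and the regular expressions assume comments are not nested; it is not a full NEXUS parser.

```python
import re

# A tiny hypothetical NEXUS file; taxon names and sequences are made up
nexus_text = """#NEXUS
[written by hand for illustration]
BEGIN DATA;
  DIMENSIONS NTAX=2 NCHAR=4;
  FORMAT DATATYPE=DNA GAP=-;
  MATRIX
    taxon_A ACGT
    taxon_B AC-T
  ;
END;
begin paup;
  log file=example.log replace;
end;
"""

def strip_comments(text):
    # Remove [bracketed comments]; assumes comments are not nested
    return re.sub(r'\[[^\]]*\]', '', text)

def block_names(text):
    # Find each "BEGIN name;" statement, ignoring case
    found = re.findall(r'\bbegin\s+(\w+)\s*;', text, flags=re.IGNORECASE)
    return [name.lower() for name in found]

clean = strip_comments(nexus_text)
is_nexus = clean.lstrip().upper().startswith('#NEXUS')
names = block_names(clean)
print(is_nexus, names)
```

On this toy input the sketch should report one `data` block and one `paup` block.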

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Create and run a NEXUS file of commands to infer the _wingless_ gene tree for _Pseudomyrmex_

In the Jupyter Home tab, create a new text file and name it **`run-parsimony.nex`**. Enter the following commands:

```
#NEXUS

begin paup;
log file=Wg-parsimony.log replace;
execute Wg.afa.nex;
outgroup 'Myrcidris_epicharis_AY703651';
set increase=auto;
hsearch addseq=random;
savetrees file=Wg-mp-trees.nex root=yes brlens=yes replace;
end;
```

Save the file and make sure it is in the working directory with your alignment files. If you are on a lab computer, move all files to the Documents folder, where PAUP is located.

Run the file in PAUP. If your PAUP session is still open in the Terminal, use the **`execute`** command:

  ```
  paup> execute run-parsimony.nex
  ```
  
Otherwise, you can pass the file as an argument to the PAUP executable:

  ```bash
  ./paup run-parsimony.nex
  ```
  
Look at the output, either on the Terminal screen, or in the log file **`Wg-parsimony.log`**. You have just done a bare-bones heuristic search for the most parsimonious trees for the _wingless_ gene. Congratulations!

Let's examine our NEXUS command file to understand what it's doing. The first line simply declares it to be a NEXUS file. Then there is a single **`paup`** block, demarcated by the **`begin ...`** and **`end;`** statements.

>***Note***: NEXUS commands are case-insensitive.

Line-by-line breakdown:

    log file=Wg-parsimony.log replace;
    
* Record all output to the specified file, overwriting (replacing) the file if it already exists. **`replace`** is an option that can be changed: type **`help log`** for details.


    execute Wg.afa.nex;
    
* Execute (read commands from) the file `Wg.afa.nex`, which in this case just contains the sequence data matrix.


    outgroup 'Myrcidris_epicharis_AY703651';
    
*  Because we are interested in the rooted tree of _Pseudomyrmex_, we specify an outgroup, **`Myrcidris_epicharis_AY703651`**. We could specify the outgroup as any subset (including all) of the non-_Pseudomyrmex_ sequences, but for simplicity we'll use just the first one.


    set increase=auto;

* By default, PAUP will save only 100 most-parsimonious trees in memory at a time. This command increases this limit automatically as additional trees are found, at the risk of exhausting all the computer's memory.


    hsearch addseq=random;
    
* Do the heuristic search, using random addition for each starting tree replicate. This command has many options, for building the starting tree, branch swapping, replicating the search, etc.; type **`help hsearch`** for the full list.


    savetrees file=Wg-mp-trees.nex root=yes brlens=yes replace;

* Save all the most-parsimonious trees, with their parsimony branch lengths, to a NEXUS file, replacing it if necessary.

***How many trees were found?***

***What was their score, meaning their length - the number of nucleotide changes required to explain the data?***

***Open the treefile in FigTree and page through some of the trees, to get a sense of how different they are.***
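
To see where a tree length comes from, here is a minimal Python sketch of Fitch's algorithm, which counts the minimum number of changes at one site on a fixed rooted binary tree. It assumes unordered, equally weighted characters; the four-taxon tree and the two alignment columns are hypothetical, and this is an illustration rather than PAUP*'s own code.

```python
def fitch_length(tree, states):
    """Return (state_set, changes) for one site on a rooted binary tree.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> nucleotide at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_length(left, states)
    rset, rcost = fitch_length(right, states)
    common = lset & rset
    if common:
        # The children can share a state: no change needed at this node
        return common, lcost + rcost
    # The children disagree: one change is required
    return lset | rset, lcost + rcost + 1

# Hypothetical four-taxon tree and two alignment columns
tree = (('A', 'B'), ('C', 'D'))
site1 = {'A': 'G', 'B': 'G', 'C': 'T', 'D': 'T'}
site2 = {'A': 'G', 'B': 'T', 'C': 'G', 'D': 'T'}

# The tree length is the sum of per-site changes
length = sum(fitch_length(tree, site)[1] for site in (site1, site2))
print(length)
```

The first column fits the tree with a single change, while the second needs two, so different trees can have different total lengths for the same data.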

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Make a strict consensus tree

Given a number of equally optimal trees, one naturally wants to know what relationships they have in common -- i.e., their ***consensus***. There are different kinds and degrees of consensus; the simplest, easiest to interpret, and arguably the most useful is the **strict consensus**. This method computes the set of clades (or in the case of unrooted trees, the bipartitions) that are common to ALL of the input trees. This set can then be summarized as a single tree, the ***strict consensus tree***.
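
The idea can be sketched in a few lines of Python: collect the clades (sets of tip labels) of each rooted tree, then keep only those found in every tree. The nested-tuple tree representation and the three toy trees are assumptions made for illustration; PAUP* works on its own internal tree structures.

```python
def leaf_set(tree, found):
    """Return the frozenset of tips under `tree`, adding the tip set of
    every internal node (i.e., every clade) to the set `found`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_set(child, found) for child in tree))
    found.add(leaves)
    return leaves

def clades(tree):
    found = set()
    leaf_set(tree, found)
    return found

def strict_consensus_clades(trees):
    # Keep only the clades present in every input tree
    return set.intersection(*(clades(t) for t in trees))

# Three hypothetical rooted trees on taxa A-E
trees = [
    ((('A', 'B'), 'C'), ('D', 'E')),
    (('A', ('B', 'C')), ('D', 'E')),
    ((('A', 'C'), 'B'), ('D', 'E')),
]
shared = strict_consensus_clades(trees)
for clade in sorted(shared, key=len):
    print(sorted(clade))
```

Here the trees disagree about relationships among A, B, and C, so the strict consensus keeps the clade A+B+C but leaves it unresolved, as a polytomy.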

Here are the commands to do this in PAUP*. We will save the tree both as a NEXUS file and a plain Newick file.

    contree / strict treefile=Wg-mp-contree.nex replace;
    gettrees file=Wg-mp-contree.nex;
    savetrees file=Wg-mp-contree.newick format=newick;

***View the consensus tree in FigTree***. What does the number of polytomies suggest to you about how informative the _wingless_ gene is to _Pseudomyrmex_ phylogeny?

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Estimate confidence in the tree using the non-parametric bootstrap

Create a new NEXUS file for a bootstrap analysis:

```
#NEXUS

begin paup;
execute Wg.afa.nex;
outgroup 'Myrcidris_epicharis_AY703651';
set maxtrees=1 increase=no;
hsearch addseq=random;
bootstrap nrep=100 keepall=yes;
savetrees file=Wg-mp-bootcon.newick format=newick root=yes replace;
end;
```

In a new Terminal, **`cd`** to the working directory and execute this file in PAUP*.

***How quickly does the analysis finish, compared to the previous heuristic search?***

The commands tell PAUP to do _**100** replicate heuristic searches_ - compared to the previous _single_ search.

***Why then did this analysis finish so quickly?***

The basic output of a bootstrap analysis is a table of clades or bipartitions and their associated frequencies -- the percentage of the replicates in which the clade was found by the heuristic search. These percentages are commonly called ***bootstrap values***.
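
As a rough illustration, the sketch below shows the two bookkeeping steps of a non-parametric bootstrap: building a pseudoreplicate by sampling alignment columns with replacement, and tabulating clade frequencies across replicates. The tree search on each pseudoreplicate is omitted, and the alignment and the per-replicate clade sets are hypothetical.

```python
import random
from collections import Counter

def bootstrap_columns(alignment, rng):
    """Build one pseudoreplicate by sampling columns with replacement.

    alignment: dict mapping label -> aligned sequence (equal lengths).
    """
    nchar = len(next(iter(alignment.values())))
    picks = [rng.randrange(nchar) for _ in range(nchar)]
    return {label: ''.join(seq[i] for i in picks)
            for label, seq in alignment.items()}

def clade_frequencies(replicate_clades):
    """Percentage of replicates in which each clade appears."""
    counts = Counter(c for clade_set in replicate_clades for c in clade_set)
    return {c: 100.0 * n / len(replicate_clades) for c, n in counts.items()}

# Hypothetical toy alignment (labels and sequences are made up)
alignment = {'taxon_A': 'ACGTAC', 'taxon_B': 'ACGTTC', 'taxon_C': 'ATGTTC'}
replicate = bootstrap_columns(alignment, random.Random(1))

# Hypothetical clades recovered from four replicate searches
AB, CD = frozenset('AB'), frozenset('CD')
replicate_clades = [{AB, CD}, {AB}, {AB, CD}, {AB}]
freqs = clade_frequencies(replicate_clades)
print(freqs[AB], freqs[CD])
```

Each pseudoreplicate has the same dimensions as the original alignment, but some columns appear more than once and others not at all.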

By default, PAUP computes a ***majority-rule consensus tree*** from this table. The tree is composed of clades found in over 50% of the replicate searches. Because we set **`keepall=yes`**, it will also contain less frequent clades that are _not contradicted_ by any clade already in the tree, added in order of decreasing frequency.
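
One way to picture this rule is as a greedy pass over the frequency table, sketched below in Python. Two clades are treated as compatible if they are disjoint or one contains the other. The frequencies are hypothetical, and this is a sketch of the rule as described here, not PAUP*'s actual implementation.

```python
def compatible(a, b):
    # Two clades can coexist in one tree if they are nested or disjoint
    return a.isdisjoint(b) or a <= b or b <= a

def consensus_clades(freqs, threshold=50.0):
    """Pick clades in order of decreasing frequency: always keep those
    above the threshold, then any that conflict with nothing chosen."""
    chosen = []
    for clade, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if f > threshold or all(compatible(clade, c) for c in chosen):
            chosen.append(clade)
    return chosen

# Hypothetical bootstrap frequencies (percent) for taxa A-E
freqs = {
    frozenset('AB'): 90.0,
    frozenset('ABC'): 60.0,
    frozenset('CD'): 40.0,
    frozenset('DE'): 35.0,
    frozenset('BC'): 10.0,
}
chosen = consensus_clades(freqs)
for clade in chosen:
    print(sorted(clade), freqs[clade])
```

In this toy table, C+D is rejected because it conflicts with A+B+C, while D+E is accepted at only 35% because nothing in the tree contradicts it.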

***Open the bootstrap consensus tree*** (the file `Wg-mp-bootcon.newick`) in a text editor and look at the Newick format:

    ...((Pseudomyrmex_elongatulus_AY703659:100,Pseudomyrmex_apache_AY703652:100):55,...
    
The Newick convention is to write _branch lengths_ as numbers following colons. PAUP writes bootstrap consensus trees with ***bootstrap values*** as branch lengths. So here, the clade of sister species _Pseudomyrmex elongatulus_ + _P. apache_ has a bootstrap value of 55%.
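
Because of this convention, the support values can be pulled out of the file with a simple pattern: a number that follows `):` belongs to an internal node (a clade), while a number that follows a label belongs to a tip. The Python sketch below uses a made-up Newick string in the same style; for real work a proper tree parser would be more robust than a regular expression.

```python
import re

# Hypothetical Newick string; the labels and values are made up
newick = "((taxon_A:100,taxon_B:100):55,(taxon_C:100,taxon_D:100):82,taxon_E:100);"

# Numbers after a closing parenthesis are clade (bootstrap) values
supports = [float(x) for x in re.findall(r'\):([0-9.]+)', newick)]

# Clades above the conventional 70% cutoff
strong = [s for s in supports if s > 70]
print(supports, strong)
```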

***Open the bootstrap consensus tree in FigTree.*** Use the side panel to show the bootstrap values:

- Check and expand **Branch Labels**
- Choose **Display: Branch lengths (raw)**

***How many internal nodes are "strongly supported" -- with bootstrap values > 70%?***

FigTree will show the tree as a ***phylogram*** (if you can't remember what this means, refer to your lecture notes!) because the tree contains branch lengths. We know they aren't branch lengths, but bootstrap values. Visually, this means that longer branches are more strongly supported.

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### One gene down, one to go

Repeat the above analyses for the rhodopsin alignment, `LWRh.afa.nex`. Edit copies of your NEXUS command files so that the input file, outgroup label, and output file names match `LWRh.afa.nex`, and so that the new runs do not overwrite your existing _wingless_ results.

***Which gene has more strongly supported clades?***

***Are there any strongly supported clades in common?***

---
<img style="float: right;" src="http://www.reelab.net/todo.png">

### Final exercise: a combined analysis of both genes

We will repeat the exercise with the _wingless_ and rhodopsin genes laid end-to-end, i.e., ***concatenated***.

***The problem:*** recall that when we created the single-gene FASTA files in Lab 1, we labeled the sequences like this:

    >Myrcidris_epicharis_AY703651
    GTTGCCGAACTTTCGCGTGGTTGGAGACAATCTGAAAGATCGCTTCGATGGAGCATCCCG
    ...

That is, we _included the GenBank accession number_ in the sequence label. This seemed like a good idea at the time, since it kept the provenance of the data in the file. But now we want to ***concatenate*** the alignments, and make a "supermatrix" where the row labels contain only the species name.

The script below concatenates the alignments and saves the result as a new FASTA file.

***Create a new code cell below and execute the script.***

```python
from Bio import AlignIO

# A convenience function for stripping the accession number
def stripacc(label):
    # We know the label is of the form: Genus_species_accession
    # so we split it on underscores and rejoin it, minus the last element
    return '_'.join(label.split('_')[:-1])

# Let's read in the rhodopsin alignment first
fn = 'LWRh.afa'
aln = AlignIO.read(fn, format='fasta')
print('{} has length {}'.format(fn, aln.get_alignment_length()))

# Strip the accession number from each Bio.SeqRecord in the alignment
for s in aln:
    s.id = stripacc(s.id)
    s.description = ''

# Make a dictionary for looking up SeqRecords by id
d = {s.id: s for s in aln}

# Now read in the wingless alignment
fn = 'Wg.afa'
wg = AlignIO.read(fn, format='fasta')
print('{} has length {}'.format(fn, wg.get_alignment_length()))

# Concatenate the sequences. Every wingless sequence has one,
# and only one, matching rhodopsin sequence, making things simple
for s in wg:
    k = stripacc(s.id)
    lwrh = d[k]  # matching rhodopsin sequence
    lwrh.seq += s.seq  # this 'adds' the sequences together, i.e., concatenates them

print('combined alignment has length {}'.format(aln.get_alignment_length()))

AlignIO.write(aln, 'LWRh.Wg.afa', format='fasta')
```

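As before, open the new file **`LWRh.Wg.afa`** in Mesquite to save it in NEXUS format. The sequence labels no longer include accession numbers, so the **`outgroup`** command in your command file must use the shortened label.

***Do a bootstrap analysis of the combined matrix. How do the support values compare to the individual gene analyses?***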

---
## Optional: interactive tree comparison

Few GUI tools exist for interactively comparing trees. One is http://phylo.io/. Another is implemented in a Python package, [**`ivy`**](https://github.com/rhr/ivy), written by yours truly (RR). The code is not well documented, nor extensively tested, and contains bugs. With that in mind, I've written a tutorial on how to use **`ivy`** to visually and interactively compare trees such as the bootstrap consensus trees computed above. Please try it out if you have time: download the notebook below and run it from this working directory.

https://www.dropbox.com/s/etsaf6lpwr0jcl8/tree-compare.ipynb

---
<img src="http://3.bp.blogspot.com/-F_kwLarXHcc/Ux5fciXTyYI/AAAAAAAANNY/3MrB_L23WxQ/s1600/Porky_Pig_Thats_All_Folks.jpg" />