# MCB-IB Phylogenetics practical

Here we will examine the phylogenetic relationships between great apes.  The data set includes two sequences from each of 12 populations: 2 human, 3 gorilla, 1 bonobo, 4 chimpanzee and 2 orang-utan.

We will use functions from Biopython's `Bio.Phylo` module to explore some basic properties, then run the more powerful iqtree software to explore maximum likelihood calculations.

A side trip into the iqtree manual will demonstrate, we hope, that you have acquired enough background to understand, at least in part, its rich range of options.

Finally we will investigate incomplete lineage sorting (ILS) in this data set, and in a corresponding data set simulated by iqtree from its best fit model.

### First make sure external libraries are installed

In [None]:
%%sh
pip install biopython pandas seaborn scipy

### Load what we need

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import *

# Open MSA

In [None]:
great_apes_msa = AlignIO.read("great_apes.phy", format="phylip-relaxed")

## Convert to a dictionary from ape name to pair of sequences
For each population there are two entries, e.g. Pongo_abelii-1 and Pongo_abelii-2.  We strip the final two characters (the `-1` or `-2` suffix) to recover the population name.

In [None]:
great_ape_names = sorted(set([seq.id[:-2] for seq in great_apes_msa.alignment.sequences]))

display(great_ape_names)

In [None]:
name_to_sequences = {}

for name in great_ape_names:
    name_to_sequences[name] = []

for sequence in great_apes_msa.alignment.sequences:
    name = sequence.id[:-2]
    
    name_to_sequences[name].append(sequence.seq)
    

In [None]:
display(name_to_sequences)

# Calculate pairwise divergences within species

In [None]:
for name in great_ape_names:
    seq1 = name_to_sequences[name][0]
    seq2 = name_to_sequences[name][1]
    
    seq_length = len(seq1)
    n_diffs = 0
    for ***:
        ***
            
    theta = n_diffs/seq_length
            
    print(f"{name:40}{n_diffs}\t{seq_length}\t{theta}")

These numbers look a bit lower than we expect, e.g. Homo_sapiens_afr should be ~0.0012 and Homo_sapiens_nonAfr ~0.0008. This is most likely a consequence of how these sites were chosen. We will proceed regardless.
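As an illustration of the per-site divergence computed above, here is a minimal sketch on two made-up aligned sequences. The helper name `pairwise_divergence` and the toy sequences are assumptions for this example, not part of the practical's data, and this is only one possible way to count differences.

```python
def pairwise_divergence(seq1, seq2):
    """Return (number of differing sites, alignment length, per-site divergence)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Count positions where the two aligned sequences carry different characters
    n_diffs = sum(a != b for a, b in zip(seq1, seq2))
    return n_diffs, len(seq1), n_diffs / len(seq1)


# Toy, made-up sequences (not taken from great_apes.phy)
toy_seq1 = "ACGTACGTAC"
toy_seq2 = "ACGTACGAAC"
print(pairwise_divergence(toy_seq1, toy_seq2))  # (1, 10, 0.1)
```

The same function accepts Biopython `Seq` objects, since they support `len` and iteration in the same way as strings.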

# Calculate the distance matrix

In [None]:
def view_distance_matrix(distance_matrix, return_matrix=False):
    """Plot a distance matrix as a heatmap, or return it as a DataFrame."""
    n_names = len(distance_matrix)
    D = np.zeros((n_names, n_names))
    # The matrix is stored lower-triangular: row i holds entries 0..i
    for i in range(n_names):
        D[i,:i+1] = distance_matrix.matrix[i]
    # Mirror into the upper triangle (the diagonal is zero, so it is unaffected)
    D += D.T
    names = np.array(distance_matrix.names).astype(str)
    df = pd.DataFrame(data=D, index=names, columns=names)

    if return_matrix:
        return df

    fig, ax = plt.subplots(figsize=(8,6))
    sns.heatmap(df, vmin=0, square=True, ax=ax)
    ax.set_title("Distance matrix")

#     sns.clustermap(df)

In [None]:
print(DistanceCalculator.dna_models)

In [None]:
distance_matrix = DistanceCalculator('identity').get_distance(great_apes_msa)

In [None]:
view_distance_matrix(distance_matrix)
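To make the 'identity' distance concrete, the following pure-Python sketch computes the fraction of differing sites for every pair in a made-up three-sequence alignment, stores it in lower-triangular form, and mirrors it into a full symmetric matrix. The function names and toy sequences are assumptions for illustration; the sketch ignores any special handling of gap characters that Biopython may apply.

```python
def identity_distance(seq1, seq2):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def lower_triangle(names, seqs):
    """Row i holds distances from sequence i to sequences 0..i (diagonal included)."""
    return [
        [identity_distance(seqs[names[i]], seqs[names[j]]) for j in range(i + 1)]
        for i in range(len(names))
    ]


def to_full_matrix(lower):
    """Mirror a lower-triangular list of rows into a full symmetric matrix."""
    n = len(lower)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            full[i][j] = d
            full[j][i] = d
    return full


# Toy, made-up alignment
toy_names = ["a", "b", "c"]
toy_seqs = {"a": "ACGT", "b": "ACGA", "c": "TCGA"}

lower = lower_triangle(toy_names, toy_seqs)
for name, row in zip(toy_names, to_full_matrix(lower)):
    print(name, row)
```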


# Build a tree

## Neighbor joining

In [None]:
NJ_constructor = DistanceTreeConstructor(DistanceCalculator('identity'), "nj")

In [None]:
NJ_tree = NJ_constructor.build_tree(great_apes_msa)

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
Phylo.draw(NJ_tree, axes=ax)

Why does this tree look different from the one we started with?

Are all the sequence pairs from the same population neighbours?  Why might they not be?

## UPGMA

In [None]:
UPGMA_constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'upgma')

In [None]:
UPGMA_tree = UPGMA_constructor.build_tree(great_apes_msa)

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
Phylo.draw(UPGMA_tree, axes=ax)
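For intuition about what UPGMA does, here is a minimal sketch of the clustering on a made-up four-taxon distance table. It repeatedly merges the closest pair of clusters and averages distances weighted by cluster size. The function `upgma`, the taxon labels and the distances are assumptions for this example; Biopython's own implementation may differ in details such as tie-breaking and how branch lengths are reported.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster with UPGMA; dist maps frozenset({a, b}) -> distance.

    Returns a nested tuple (left, right, height), where height is the
    distance from that internal node down to the tips.
    """
    sizes = {name: 1 for name in names}  # active cluster -> number of tips
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the closest pair of active clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        new = (a, b, d[frozenset((a, b))] / 2)
        # Distance to the merged cluster is the size-weighted average
        for c in sizes:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Made-up distances (differences per 1000 sites), chosen to be ultrametric
toy_names = ["human", "chimp", "gorilla", "orang"]
toy_dist = {
    frozenset(("human", "chimp")): 12,
    frozenset(("human", "gorilla")): 16,
    frozenset(("chimp", "gorilla")): 16,
    frozenset(("human", "orang")): 30,
    frozenset(("chimp", "orang")): 30,
    frozenset(("gorilla", "orang")): 30,
}
print(upgma(toy_names, toy_dist))
```

Because UPGMA places every tip at the same distance from the root, it implicitly assumes a constant rate of evolution, which is worth keeping in mind when comparing it with the NJ tree.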

# Bootstrapping

In [None]:
bootstrap_alns = list(bootstrap(great_apes_msa, 10))

In [None]:
for bootstrap_aln in bootstrap_alns[:3]:
    fig, ax = plt.subplots(figsize=(15,10))
    Phylo.draw(UPGMA_constructor.build_tree(bootstrap_aln), axes=ax)
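The idea behind a bootstrap replicate can be sketched without Biopython: alignment columns are drawn at random with replacement until the replicate has the same length as the original. The helper `bootstrap_columns` and the toy alignment are assumptions for illustration and are not how `Bio.Phylo.Consensus.bootstrap` is necessarily implemented.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence string.
    rng: a random.Random instance, passed in so results can be reproduced.
    """
    length = len(next(iter(alignment.values())))
    # Draw `length` column indices with replacement; the same indices are
    # used for every row so that columns stay intact
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


# Toy, made-up alignment
toy = {"a": "ACGTAC", "b": "ACGAAC", "c": "TCGAAC"}
rng = random.Random(0)  # fixed seed so the replicate is reproducible
replicate = bootstrap_columns(toy, rng)
for name, seq in replicate.items():
    print(name, seq)
```

Some original columns will appear more than once in a replicate and others not at all, which is why trees built from different replicates can disagree.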


### Build on a shorter segment

In [None]:
for bootstrap_aln in bootstrap_alns[:3]:
    shorter_aln = bootstrap_aln[:, :500]
    
    fig, ax = plt.subplots(figsize=(15,10))
    Phylo.draw(NJ_constructor.build_tree(shorter_aln), axes=ax)


## Run IQtree

In [None]:
%%sh
iqtree2 -s ./great_apes.phy -B 1000 --redo

## Open IQtree outputs

### Maximum likelihood tree

In [None]:
fig, ax = plt.subplots(figsize=(15,10))
Phylo.draw(Phylo.read("./great_apes.phy.treefile", "newick"), axes=ax)

Is this the same as the NJ and UPGMA trees?  What is different?

Not all the bootstrap values are 100%, even with nearly 20,000 sites.  Why is that?

### See ML distances

In [None]:
with open("./great_apes.phy.mldist") as handle:  # skip the header line, drop the name column
    mat = np.array([line.split()[1:] for line in handle.readlines()[1:]]).astype(float)
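The parsing above discards the sequence names. As a sketch, assuming the `.mldist` file is a square matrix with the number of taxa on the first line and one row per taxon (name followed by distances), the names can be kept alongside the values. The function name `parse_square_distances` and the toy text are made up for this example.

```python
def parse_square_distances(text):
    """Parse a square distance matrix: first line is n, then n rows of 'name d1 ... dn'."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0])
    names, rows = [], []
    for line in lines[1 : n + 1]:
        fields = line.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


# Toy, made-up matrix in the assumed layout
toy_text = """3
a 0.0 0.1 0.2
b 0.1 0.0 0.3
c 0.2 0.3 0.0
"""
names, rows = parse_square_distances(toy_text)
print(names)
print(rows)
```

With the names available, the matrix could be wrapped in a `pd.DataFrame(rows, index=names, columns=names)` so that both heatmaps carry the same axis labels.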
    

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15,7), sharey=True, sharex=True)

sns.heatmap(
    mat, 
    square=True, 
    ax=ax[0],
    vmin=0,
    vmax=0.03,
)
ax[0].set_title("Maximum Likelihood");

sns.heatmap(
    view_distance_matrix(distance_matrix, return_matrix=True), 
    square=True, 
    ax=ax[1],
    vmin=0,
    vmax=0.03,
)
ax[1].set_title("Identity distance");


# ILS

Here we will explore incomplete lineage sorting within great ape ancestors.  We do this by taking quartets of species such as (orang, (gorilla, (human, chimp))) and looking at sites which segregate 2:2, i.e. sites where two species share one allele and the other two share a different one.  If there is no ILS and there was only ever one mutation per site, then we should see only the (orang, gorilla)(human, chimp) pattern.  If there is ILS as shown in the lectures, then we would also see (in equal numbers) the (orang, human)(gorilla, chimp) and (orang, chimp)(gorilla, human) patterns. Repeated mutations at the same site create noise.

First, before we do the actual tests, we will ask iqtree to create a data set that matches the one we are looking at, based on its best fit model.  This model doesn't know about within-population variation, so doesn't simulate ILS, but it can and will simulate recurrent mutations.

(For the record, iqtree does in fact support models that allow for ILS, via the "PoMo" models.  You can look in the iqtree documentation for how to do this.)

## Simulate from the tree

In [None]:
!iqtree2 -s ./great_apes.phy --alisim mimicked_MSA --redo

## Create a new mapping from the names to the simulated sequences

We need this mapping in order to access the simulated sequences by population name.

In [None]:
great_apes_simulated_msa = AlignIO.read("mimicked_MSA.phy", format="phylip-relaxed")

In [None]:
sim_name_to_sequences = {}

for name in great_ape_names:
    sim_name_to_sequences[name] = []

for sequence in great_apes_simulated_msa.alignment.sequences:
    name = sequence.id[:-2]
    
    sim_name_to_sequences[name].append(sequence.seq)
    

## Define ILS function

In [None]:
def ils_quartet(names, name_to_sequences):
    # Check there are four names
    assert len(names) == 4
    
    # Get the first sequence per name
    sequences = []
    for name in names:
        sequences.append(name_to_sequences[name][0])
        
    # Sequence length
    seq_length = len(sequences[0])
    
    # Pattern count - xxyy, xyxy or xyyx
    patterns = [0, 0, 0]
    
    # Go through the positions one by one
    for i in range(seq_length):
        # Per position, check whether two alleles are each present twice, and in which pattern
        alleles = [seq[i] for seq in sequences]
        
        # First pattern
        if (alleles[0] == alleles[1]) and *** and ***:
            patterns[0] += 1

        # Second pattern
        if (alleles[0] == alleles[2]) and *** and ***:
            patterns[1] += 1

        # Third pattern
        if (alleles[0] == alleles[3]) and *** and ***:
            patterns[2] += 1
            
    return patterns

# ['Gorilla_beringei_graueri',
#  'Gorilla_gorilla_dielhi',
#  'Gorilla_gorilla_gorilla',
#  'Homo_sapiens_afr',
#  'Homo_sapiens_nonAfr',
#  'Pan_paniscus',
#  'Pan_troglodytes_ellioti',
#  'Pan_troglodytes_schweinfurthii',
#  'Pan_troglodytes_troglodytes',
#  'Pan_troglodytes_verus',
#  'Pongo_abelii',
#  'Pongo_pygmaeus']
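The three 2:2 patterns can be illustrated on single sites. The sketch below is one possible way to classify a site given the four alleles in quartet order; the helper `classify_site` and the toy sequences are assumptions for illustration and are independent of the `ils_quartet` function above.

```python
def classify_site(alleles):
    """Return 0, 1 or 2 for the patterns xxyy, xyxy, xyyx; None if the site is not 2:2."""
    a, b, c, d = alleles
    if a == b and c == d and a != c:
        return 0  # xxyy: first with second, third with fourth
    if a == c and b == d and a != b:
        return 1  # xyxy: first with third, second with fourth
    if a == d and b == c and a != b:
        return 2  # xyyx: first with fourth, second with third
    return None  # invariant, singleton, or more than two alleles


# Toy, made-up quartet: one sequence per species, in quartet order
toy_quartet = ["AAAAC", "AGGAC", "AGAGC", "AAGGC"]

patterns = [0, 0, 0]
for column in zip(*toy_quartet):
    pattern = classify_site(column)
    if pattern is not None:
        patterns[pattern] += 1
print(patterns)  # [1, 1, 1]
```

The explicit inequality check in each branch matters: without it, an invariant site such as `AAAA` would be counted under all three patterns.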

## Run on the real data

In [None]:
# Pattern count - xxyy, xyxy or xyyx
ils_quartet(
    [
        'Pongo_abelii',
        'Homo_sapiens_afr',
        'Pan_troglodytes_ellioti',
        'Gorilla_gorilla_gorilla',        
    ],
    name_to_sequences,
)    

## Run on the simulated data

In [None]:
# Pattern count - xxyy, xyxy or xyyx
ils_quartet(
    [
        'Pongo_abelii',
        'Homo_sapiens_afr',
        'Pan_troglodytes_ellioti',
        'Gorilla_gorilla_gorilla',        
    ],
    sim_name_to_sequences,
)    

The "canonical" pattern is the third: xyyx, which groups Pongo with Gorilla, and Homo with Pan.  This is what we see as the dominant value in the simulated data, for which there should be no ILS.  Why are there non-zero values in the first and second patterns?

Why might there be a high value for the first pattern (Pongo, Human)(Chimp, Gorilla)?  Hint: perhaps the human genome was privileged in some way in generating this data set?