# Day 15 notebook

The objectives of this notebook are to practice

* the UPGMA algorithm
* the neighbor joining algorithm

In [62]:
import toytree

## The case of possible HIV transmission from dentist to patient
In the early 1990s, a controversial case arose in which an HIV-positive dentist was suspected of having transmitted HIV to at least one of his patients.  For background on the case, you can read the [obituary of one of the patients who died of AIDS](https://www.nytimes.com/1991/12/09/obituaries/kimberly-bergalis-is-dead-at-23-symbol-of-debate-over-aids-tests.html).  Ultimately, the Centers for Disease Control and Prevention (CDC) became involved and obtained HIV samples from the dentist, from HIV-positive patients of this dentist, and from a number of HIV-positive individuals in the local community, who served as controls.  The CDC sequenced the DNA from these samples and ran a phylogenetic analysis to determine whether the molecular data provided evidence that the dentist had transmitted HIV to his patients ([Ou et al. Science, 1992](http://science.sciencemag.org/content/256/5060/1165)).

We will revisit the data from this study and run the UPGMA and neighbor joining algorithms to see whether we can reproduce its conclusions.  Below is a pairwise distance matrix for HIV genomic segments obtained from viral samples from five individuals: the dentist (D), patient A (PA), patient B (PB), local control 1 (C1), and local control 2 (C2).  These distances come from a multiple sequence alignment of the V3 variable region of the HIV genome for these samples.

              C1      C2       D      PA      PB
      C1     0.0    0.09   0.098   0.105    0.12
      C2    0.09     0.0   0.072   0.076   0.101
       D   0.098   0.072     0.0    0.04   0.061
      PA   0.105   0.076    0.04     0.0   0.068
      PB    0.12   0.101   0.061   0.068     0.0

Here is that same matrix in the form of a Python dictionary-based distance matrix:

In [5]:
v3_matrix = {
    ('C1', 'C1'): 0.0,   ('C1', 'C2'): 0.09,  ('C1', 'D'): 0.098, ('C1', 'PA'): 0.105, ('C1', 'PB'): 0.12,
    ('C2', 'C1'): 0.09,  ('C2', 'C2'): 0.0,   ('C2', 'D'): 0.072, ('C2', 'PA'): 0.076, ('C2', 'PB'): 0.101, 
     ('D', 'C1'): 0.098,  ('D', 'C2'): 0.072,  ('D', 'D'): 0.0,    ('D', 'PA'): 0.04,   ('D', 'PB'): 0.061,
    ('PA', 'C1'): 0.105, ('PA', 'C2'): 0.076, ('PA', 'D'): 0.04,  ('PA', 'PA'): 0.0,   ('PA', 'PB'): 0.068,
    ('PB', 'C1'): 0.12,  ('PB', 'C2'): 0.101, ('PB', 'D'): 0.061, ('PB', 'PA'): 0.068, ('PB', 'PB'): 0.0}

In [6]:
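import itertools

def d(a, b):
    """Average distance between clusters a and b (iterables of leaf labels)."""
    # Pass tuples such as ('D', 'PA'); a bare string like 'PA' would be split into characters.
    lst = [v3_matrix[k] for k in itertools.product(a, b)]
    return sum(lst) / len(lst)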
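
As a usage illustration, the sketch below calls an equivalent cluster-distance helper on one pair of clusters. It rebuilds the matrix from the table above so that it runs on its own; the names `labels`, `rows`, `matrix`, and `cluster_distance` are introduced only for this example and are not part of the notebook.

```python
import itertools

labels = ['C1', 'C2', 'D', 'PA', 'PB']
rows = [
    [0.0,   0.09,  0.098, 0.105, 0.12],
    [0.09,  0.0,   0.072, 0.076, 0.101],
    [0.098, 0.072, 0.0,   0.04,  0.061],
    [0.105, 0.076, 0.04,  0.0,   0.068],
    [0.12,  0.101, 0.061, 0.068, 0.0],
]
# Same dictionary layout as v3_matrix: (row label, column label) -> distance.
matrix = {(a, b): rows[i][j]
          for i, a in enumerate(labels)
          for j, b in enumerate(labels)}

def cluster_distance(a, b, dist=matrix):
    """Mean of all pairwise distances between leaf labels in a and in b."""
    pairs = list(itertools.product(a, b))
    return sum(dist[p] for p in pairs) / len(pairs)

# Clusters are tuples of labels, including one-element tuples for single leaves.
d_pa_to_pb = cluster_distance(('D', 'PA'), ('PB',))  # (0.061 + 0.068) / 2
print(round(d_pa_to_pb, 4))
```

Because the average is always taken over the original leaf-to-leaf distances, no intermediate matrix needs to be stored or rounded.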

## PROBLEM 1: UPGMA (3 POINTS)
Run the UPGMA algorithm (by hand, or by code if you really want to) on this distance matrix.  Record your result by assigning the resulting tree, in Newick string format, to the variable `upgma_tree_newick` below.  Your tree should have branch lengths, with lengths rounded to three digits after the decimal point.  **Important note:** do *not* round the intermediate distances computed during the algorithm.
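
If you prefer code to hand calculation, the following is a minimal sketch of UPGMA under a few stated assumptions: the function name `upgma_newick`, its `digits` parameter, and the variables `labels`, `rows`, and `matrix` are hypothetical and introduced only here, ties are broken by taking the first minimum found, and the matrix is rebuilt from the table above so the block is self-contained. Heights are kept unrounded and rounding is applied only when branch lengths are written out.

```python
import itertools

def upgma_newick(dist, labels, digits=3):
    """Build a UPGMA tree and return it as a Newick string (illustrative sketch)."""
    # Map each cluster (a tuple of leaf labels) to (Newick subtree, node height).
    clusters = {(x,): (x, 0.0) for x in labels}

    def avg(a, b):
        # Average over the original leaf-to-leaf distances, never rounded.
        pairs = list(itertools.product(a, b))
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the closest pair of clusters; ties go to the first pair found.
        a, b = min(itertools.combinations(clusters, 2), key=lambda p: avg(*p))
        height = avg(a, b) / 2
        (sub_a, h_a), (sub_b, h_b) = clusters.pop(a), clusters.pop(b)
        # Branch length = parent height minus child height; round only when printing.
        subtree = f"({sub_a}:{height - h_a:.{digits}f},{sub_b}:{height - h_b:.{digits}f})"
        clusters[a + b] = (subtree, height)

    (newick, _), = clusters.values()
    return newick + ";"

labels = ['C1', 'C2', 'D', 'PA', 'PB']
rows = [
    [0.0,   0.09,  0.098, 0.105, 0.12],
    [0.09,  0.0,   0.072, 0.076, 0.101],
    [0.098, 0.072, 0.0,   0.04,  0.061],
    [0.105, 0.076, 0.04,  0.0,   0.068],
    [0.12,  0.101, 0.061, 0.068, 0.0],
]
matrix = {(a, b): rows[i][j]
          for i, a in enumerate(labels)
          for j, b in enumerate(labels)}

tree = upgma_newick(matrix, labels)
print(tree)
```

The order of siblings inside each pair of parentheses depends on the merge order, so the string may list children differently from a hand-built tree while describing the same topology.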

In [18]:
upgma_tree_newick = "(C1:0.051625,(C2:0.0415,((D:0.020,PA:0.020):0.01225,PB:0.03225):0.00925):0.010125);" 

In [19]:
# draw your UPGMA tree
upgma_tree = toytree.tree(upgma_tree_newick)
canvas, axes = upgma_tree.draw(use_edge_lengths=True, scalebar=True)

In [6]:
# test upgma_tree_newick valid tree
upgma_tree = toytree.tree(upgma_tree_newick)
assert sorted(upgma_tree.get_tip_labels()) == ['C1', 'C2', 'D', 'PA', 'PB']
assert upgma_tree.is_rooted()
print("SUCCESS: upgma_tree_newick valid tree test passed")

SUCCESS: upgma_tree_newick valid tree test passed


In [None]:
# test upgma_tree_newick tree topology
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# test upgma_tree_newick branch lengths
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## PROBLEM 2: Neighbor joining (3 POINTS)
Run the neighbor joining algorithm (by hand, or by code if you really want to) on this distance matrix.  Record your result by assigning the resulting tree, in Newick string format, to the variable `nj_tree_newick` below.  Your tree should have branch lengths, with lengths rounded to three digits after the decimal point.  **Important note:** do *not* round the intermediate distances computed during the algorithm.
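
For reference, one common textbook formulation of a neighbor joining step is summarized below. The notation is chosen here for illustration: $n$ is the number of currently active nodes, $d_{ij}$ the current distance between nodes $i$ and $j$, and $u$ the new internal node created by joining the pair $(i, j)$ that minimizes $D_{ij}$.

```latex
\begin{aligned}
r_i    &= \frac{1}{n-2} \sum_{k \ne i} d_{ik} \\
D_{ij} &= d_{ij} - r_i - r_j \\
d_{iu} &= \tfrac{1}{2}\,(d_{ij} + r_i - r_j), \qquad
d_{ju}  = \tfrac{1}{2}\,(d_{ij} + r_j - r_i) \\
d_{mu} &= \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij}) \quad \text{for every other active node } m
\end{aligned}
```

The step is repeated until only a few nodes remain, at which point the remaining branch lengths follow directly from the remaining distances.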

In [15]:
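import itertools
import pprint

# Neighbor joining on v3_matrix; each internal node is named by the pair (i, j) it joins.
# L is built from a set, so its order (and therefore tie-breaking and node names) can vary between runs.
L = list({e1 for (e1, e2) in v3_matrix})
dist = v3_matrix.copy()  # named dist so the helper function d() defined earlier is not overwritten

while len(L) > 2:
    # r[e]: sum of distances from e to the other active nodes, divided by (n - 2)
    r = {e: sum(dist[(e, k)] for k in L if e != k) / (len(L) - 2) for e in L}
    # corrected distances; the pair with the smallest value is joined next
    D = {(i, j): dist[(i, j)] - r[i] - r[j] for (i, j) in itertools.combinations(L, r=2)}
    k = (i, j) = min(D, key=D.get)

    # branch lengths from i and j to the new internal node k
    dist[(i, k)] = (dist[(i, j)] + r[i] - r[j]) / 2
    dist[(j, k)] = (dist[(i, j)] + r[j] - r[i]) / 2
    # distances from every other active node m to k
    dist.update({(m, k): (dist[(i, m)] + dist[(j, m)] - dist[(i, j)]) / 2 for m in L if m not in (i, j)})
    # keep the matrix symmetric
    dist.update({tuple(reversed(e)): dist[e] for e in dist})

    L = [e for e in L if e not in (i, j)] + [k]

    print(L)

pprint.pprint(dist)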

['PA', 'PB', 'D', ('C2', 'C1')]
['PB', ('C2', 'C1'), ('PA', 'D')]
[('PA', 'D'), ('PB', ('C2', 'C1'))]
{('C1', 'C1'): 0.0,
 ('C1', 'C2'): 0.09,
 ('C1', 'D'): 0.098,
 ('C1', 'PA'): 0.105,
 ('C1', 'PB'): 0.12,
 (('C2', 'C1'), 'C1'): 0.05733333333333334,
 ('C1', ('C2', 'C1')): 0.05733333333333334,
 ('C2', 'C1'): 0.09,
 ('D', ('PA', 'D')): 0.016875,
 (('C2', 'C1'), ('PA', 'D')): 0.022749999999999996,
 ('PA', ('PA', 'D')): 0.023125000000000007,
 (('C2', 'C1'), 'D'): 0.039999999999999994,
 (('C2', 'C1'), 'PA'): 0.0455,
 ('C2', ('C2', 'C1')): 0.03266666666666666,
 ('C2', 'C2'): 0.0,
 ('C2', 'D'): 0.072,
 ('C2', 'PA'): 0.076,
 ('C2', 'PB'): 0.101,
 ('D', 'C1'): 0.098,
 ('PB', ('PA', 'D')): 0.0445,
 (('PB', ('C2', 'C1')), 'PB'): 0.043625,
 ('PB', ('PB', ('C2', 'C1'))): 0.043625,
 (('C2', 'C1'), 'C2'): 0.03266666666666666,
 (('C2', 'C1'), 'PB'): 0.0655,
 (('PB', ('C2', 'C1')), ('PA', 'D')): 0.0008749999999999938,
 ('D', ('C2', 'C1')): 0.039999999999999994,
 ('D', 'C2'): 0.072,
 ('D', 'D'): 0.0,
 

In [2]:
nj_tree_newick = "(PB:0.044,(C1:0.057,C2:0.033):0.022,(D:0.017,PA:0.023):0.001);"
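
As a cross-check on the three branches that meet at the central node of this tree, the sketch below applies the three-point formula to the distances among the last three active nodes (PB, the `('C2', 'C1')` node, and the `('PA', 'D')` node), with values copied from the printed output above. The variable names are introduced only for this example.

```python
# Pairwise distances among the last three active nodes, copied from the printed output.
d_pb_c = 0.0655    # PB to the (C2, C1) node
d_pb_p = 0.0445    # PB to the (PA, D) node
d_c_p = 0.02275    # (C2, C1) node to the (PA, D) node

# Three-point formula: the branch from a node to the central node is
# (sum of the two distances involving that node - the remaining distance) / 2.
branch_pb = (d_pb_c + d_pb_p - d_c_p) / 2
branch_c = (d_pb_c + d_c_p - d_pb_p) / 2
branch_p = (d_pb_p + d_c_p - d_pb_c) / 2

print(f"{branch_pb:.3f} {branch_c:.3f} {branch_p:.3f}")
```

Rounded to three decimals, these values can be compared with the lengths attached to `PB`, the `(C1, C2)` clade, and the `(D, PA)` clade in the Newick string.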

In [90]:
# draw your NJ tree
nj_tree = toytree.tree(nj_tree_newick)
canvas, axes = nj_tree.draw(use_edge_lengths=True, scalebar=True)

In [67]:
# test nj_tree_newick valid tree
nj_tree = toytree.tree(nj_tree_newick)
assert sorted(nj_tree.get_tip_labels()) == ['C1', 'C2', 'D', 'PA', 'PB']
assert not nj_tree.is_rooted()
print("SUCCESS: nj_tree_newick valid tree test passed")

SUCCESS: nj_tree_newick valid tree test passed


In [68]:
# test nj_tree_newick tree topology
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [25]:
# test nj_tree_newick branch lengths
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Follow-up analysis

Consider the following questions:
1. How do the two trees that you constructed differ, if at all?
2. Are these trees consistent with the possibility that the dentist transmitted HIV to the two patients considered here?
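
One possible way to approach the first question is to compare the trees as unrooted topologies by listing the non-trivial bipartitions (splits) of the tips that each tree induces. The sketch below assumes the two topologies have been transcribed by hand from the Newick strings above into nested tuples; `splits`, `upgma_shape`, and `nj_shape` are hypothetical names used only here, and branch lengths and root position are deliberately ignored.

```python
def splits(tree, all_tips):
    """Return the set of non-trivial bipartitions induced by a nested-tuple tree."""
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(tips(child) for child in node))
        other = all_tips - below
        if len(below) > 1 and len(other) > 1:
            # Store both sides together so the split does not depend on the rooting.
            found.add(frozenset([below, other]))
        return below

    tips(tree)
    return found

all_tips = frozenset(['C1', 'C2', 'D', 'PA', 'PB'])
# Topologies transcribed from upgma_tree_newick and nj_tree_newick (lengths dropped).
upgma_shape = ('C1', ('C2', (('D', 'PA'), 'PB')))
nj_shape = ('PB', ('C1', 'C2'), ('D', 'PA'))

same_unrooted_topology = splits(upgma_shape, all_tips) == splits(nj_shape, all_tips)
print(same_unrooted_topology)
```

A comparison like this says nothing about branch lengths or where the root falls, so those differences still need to be examined separately.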

###
### Your answers here
###
