# Day 15 notebook

The objectives of this notebook are to practice

* the UPGMA algorithm
* the neighbor joining algorithm

In [1]:
import toytree

## The case of possible HIV transmission from dentist to patient
In the early 1990s, a controversial case arose in which it was suspected that an HIV-positive dentist had transmitted HIV to at least one of his patients.  For some background on the case, you can read the [obituary of one of the patients who died of AIDS](https://www.nytimes.com/1991/12/09/obituaries/kimberly-bergalis-is-dead-at-23-symbol-of-debate-over-aids-tests.html).  Ultimately, the Centers for Disease Control and Prevention (CDC) became involved and obtained HIV samples from the dentist, HIV-positive patients of this dentist, and a number of HIV-positive individuals in the local community (who served as controls).  The CDC performed DNA sequencing on these samples and a subsequent phylogenetic analysis to determine whether the molecular data provided evidence that the dentist had transmitted HIV to his patients ([Ou et al. Science, 1992](http://science.sciencemag.org/content/256/5060/1165)).

We will revisit the data from this study and run the UPGMA and neighbor joining algorithms to see if we can reproduce its conclusions.  Below is a pairwise distance matrix for HIV genomic segments obtained from viral samples from five individuals: the dentist (D), patient A (PA), patient B (PB), local control 1 (C1), and local control 2 (C2).  These distances come from a multiple alignment of the V3 variable region of the HIV genome for these samples.

              C1      C2       D      PA      PB
      C1     0.0    0.09   0.098   0.105    0.12
      C2    0.09     0.0   0.072   0.076   0.101
       D   0.098   0.072     0.0    0.04   0.061
      PA   0.105   0.076    0.04     0.0   0.068
      PB    0.12   0.101   0.061   0.068     0.0

Here is that same matrix in the form of a Python dictionary-based distance matrix:

In [2]:
v3_matrix = {
    ('C1', 'C1'): 0.0,   ('C1', 'C2'): 0.09,  ('C1', 'D'): 0.098, ('C1', 'PA'): 0.105, ('C1', 'PB'): 0.12,
    ('C2', 'C1'): 0.09,  ('C2', 'C2'): 0.0,   ('C2', 'D'): 0.072, ('C2', 'PA'): 0.076, ('C2', 'PB'): 0.101, 
     ('D', 'C1'): 0.098,  ('D', 'C2'): 0.072,  ('D', 'D'): 0.0,    ('D', 'PA'): 0.04,   ('D', 'PB'): 0.061,
    ('PA', 'C1'): 0.105, ('PA', 'C2'): 0.076, ('PA', 'D'): 0.04,  ('PA', 'PA'): 0.0,   ('PA', 'PB'): 0.068,
    ('PB', 'C1'): 0.12,  ('PB', 'C2'): 0.101, ('PB', 'D'): 0.061, ('PB', 'PA'): 0.068, ('PB', 'PB'): 0.0}
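
Before running either algorithm, it can be worth confirming that a hand-typed matrix like this one is a complete, symmetric distance matrix. The block below is a minimal sketch of such a check. The helper name `check_distance_matrix` is hypothetical (it is not part of this notebook or of toytree), and the matrix is rebuilt from its upper triangle under the name `matrix` only so that the block runs on its own.

```python
from itertools import combinations

labels = ['C1', 'C2', 'D', 'PA', 'PB']

# Upper triangle of the V3 distance matrix, copied from the table above.
upper = {
    ('C1', 'C2'): 0.09,  ('C1', 'D'): 0.098,  ('C1', 'PA'): 0.105, ('C1', 'PB'): 0.12,
    ('C2', 'D'): 0.072,  ('C2', 'PA'): 0.076, ('C2', 'PB'): 0.101,
    ('D', 'PA'): 0.04,   ('D', 'PB'): 0.061,
    ('PA', 'PB'): 0.068,
}

# Mirror the upper triangle and add the zero diagonal.
matrix = {(a, a): 0.0 for a in labels}
for (a, b), value in upper.items():
    matrix[(a, b)] = matrix[(b, a)] = value

def check_distance_matrix(dist, names):
    """Hypothetical helper: assert that dist is a full, symmetric distance matrix."""
    for a in names:
        assert dist[(a, a)] == 0.0, f"nonzero diagonal at {a}"
    for a, b in combinations(names, 2):
        assert dist[(a, b)] == dist[(b, a)], f"asymmetric entry for {a}, {b}"
        assert dist[(a, b)] > 0, f"non-positive distance for {a}, {b}"
    return True

check_distance_matrix(matrix, labels)

# The closest pair of samples, which is the first pair UPGMA merges.
closest = min(upper, key=upper.get)
print(closest)
```

Inside the notebook, the same check can be applied directly with `check_distance_matrix(v3_matrix, labels)`.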

## PROBLEM 1: UPGMA (3 POINTS)
Run the UPGMA algorithm (by hand, or by code if you really want to) on this distance matrix.  Record your result by assigning the resulting tree, in Newick string format, to the variable `upgma_tree_newick` below.  Your tree should have branch lengths, with lengths rounded to three digits after the decimal point.  **Important note:** do *not* round the intermediate distances computed during the algorithm.
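
If you do choose to solve this by code, the following is one possible sketch of UPGMA on a dictionary-based distance matrix, not a reference implementation. The function name `upgma` and the representation of clusters as tuples of leaf names are assumptions made for this example. Intermediate distances are kept unrounded, and only the branch lengths written into the Newick string are rounded.

```python
from itertools import combinations

labels = ['C1', 'C2', 'D', 'PA', 'PB']
upper = {
    ('C1', 'C2'): 0.09,  ('C1', 'D'): 0.098,  ('C1', 'PA'): 0.105, ('C1', 'PB'): 0.12,
    ('C2', 'D'): 0.072,  ('C2', 'PA'): 0.076, ('C2', 'PB'): 0.101,
    ('D', 'PA'): 0.04,   ('D', 'PB'): 0.061,
    ('PA', 'PB'): 0.068,
}

def upgma(dist, names):
    """Sketch of UPGMA; returns (newick, merges). Clusters are tuples of leaf names."""
    # cluster -> (Newick text of its subtree, height of the subtree's root)
    info = {(x,): (x, 0.0) for x in names}
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items() if a != b}
    merges = []
    while len(info) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(info, 2), key=lambda p: d[frozenset(p)])
        dab = d[frozenset((a, b))]
        (text_a, h_a), (text_b, h_b) = info.pop(a), info.pop(b)
        new = tuple(sorted(a + b))
        height = dab / 2  # the new internal node sits halfway between a and b
        # Distance to every other cluster is the size-weighted average.
        for c in info:
            d[frozenset((new, c))] = (
                len(a) * d[frozenset((a, c))] + len(b) * d[frozenset((b, c))]
            ) / len(new)
        text = (f"({text_a}:{round(height - h_a, 3)},"
                f"{text_b}:{round(height - h_b, 3)})")
        info[new] = (text, height)
        merges.append((new, dab))
    newick, _ = next(iter(info.values()))
    return newick + ";", merges

newick, merges = upgma(upper, labels)
print(newick)
for cluster, distance in merges:
    print(cluster, distance)
```

The order of children inside each pair of parentheses depends on iteration order, so the string may list siblings in a different order than a hand-built tree with the same topology. Branch lengths that land exactly on a rounding boundary, such as the 0.0415 leading to C2, may round either way in floating point, so compare the printed string against your hand calculation rather than assuming it matches digit for digit.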

In [3]:
### BEGIN SOLUTION TEMPLATE=upgma_tree_newick=?
upgma_tree_newick = '(C1:0.052,(C2:0.042,((D:0.02,PA:0.02):0.012,PB:0.032):0.009):0.01);'

# Distance matrices:
#               C1      C2       D      PA      PB
#       C1       0    0.09   0.098   0.105    0.12
#       C2    0.09       0   0.072   0.076   0.101
#        D   0.098   0.072       0    0.04   0.061
#       PA   0.105   0.076    0.04       0   0.068
#       PB    0.12   0.101   0.061   0.068       0

#               C1      C2     DPA      PB
#       C1       0    0.09  0.1015    0.12
#       C2    0.09       0   0.074   0.101
#      DPA  0.1015   0.074       0  0.0645
#       PB    0.12   0.101  0.0645       0

#               C1      C2   DPAPB
#       C1       0    0.09  0.1077
#       C2    0.09       0   0.083
#    DPAPB  0.1077   0.083       0

#               C1 C2DPAPB
#       C1       0  0.1033
#  C2DPAPB  0.1033       0

### END SOLUTION

In [4]:
# draw your UPGMA tree
upgma_tree = toytree.tree(upgma_tree_newick)
canvas, axes = upgma_tree.draw(use_edge_lengths=True, scalebar=True)

In [5]:
# test upgma_tree_newick valid tree
upgma_tree = toytree.tree(upgma_tree_newick)
assert sorted(upgma_tree.get_tip_labels()) == ['C1', 'C2', 'D', 'PA', 'PB']
assert upgma_tree.is_rooted()
print("SUCCESS: upgma_tree_newick valid tree test passed")

SUCCESS: upgma_tree_newick valid tree test passed


In [6]:
# test upgma_tree_newick tree topology
### BEGIN HIDDEN TESTS
upgma_tree = toytree.tree(upgma_tree_newick)
upgma_tree.treenode.sort_descendants()
assert upgma_tree.write(tree_format=9) == '(C1,(C2,((D,PA),PB)));'
print("SUCCESS: upgma_tree_newick topology test passed")
### END HIDDEN TESTS

SUCCESS: upgma_tree_newick topology test passed


In [7]:
# test upgma_tree_newick branch lengths
### BEGIN HIDDEN TESTS
upgma_tree = toytree.tree(upgma_tree_newick)
upgma_tree.treenode.sort_descendants()
upgma_tree = toytree.tree(upgma_tree.write(tree_format=5)) # hack to make sure edges are in same order
upgma_branch_lengths = [round(x, 3) for x in upgma_tree.get_edge_values(feature="dist")]
assert upgma_branch_lengths == [0.052, 0.042, 0.032, 0.02, 0.02, 0.012, 0.009, 0.01]
print("SUCCESS: upgma_tree_newick branch lengths test passed")
### END HIDDEN TESTS

SUCCESS: upgma_tree_newick branch lengths test passed


## PROBLEM 2: Neighbor joining (3 POINTS)
Run the neighbor joining algorithm (by hand, or by code if you really want to) on this distance matrix.  Record your result by assigning the resulting tree, in Newick string format, to the variable `nj_tree_newick` below.  Your tree should have branch lengths, with lengths rounded to three digits after the decimal point.  **Important note:** do *not* round the intermediate distances computed during the algorithm.
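
If you do choose to solve this by code, the following is one possible sketch of neighbor joining, using the quantities r_i = (sum of distances from i) / (n - 2) and D_ij = d_ij - r_i - r_j. The function name `neighbor_joining` and the edge-list output format are assumptions made for this example. The sketch keeps joining until two nodes remain and then connects them with a single edge. Ties for the minimum D value are broken by whichever pair `min` encounters first, which in floating point may differ from the choice you make by hand.

```python
from itertools import combinations

labels = ['C1', 'C2', 'D', 'PA', 'PB']
upper = {
    ('C1', 'C2'): 0.09,  ('C1', 'D'): 0.098,  ('C1', 'PA'): 0.105, ('C1', 'PB'): 0.12,
    ('C2', 'D'): 0.072,  ('C2', 'PA'): 0.076, ('C2', 'PB'): 0.101,
    ('D', 'PA'): 0.04,   ('D', 'PB'): 0.061,
    ('PA', 'PB'): 0.068,
}

def neighbor_joining(dist, names):
    """Sketch of NJ; returns a list of edges (node, node, length).

    Nodes are tuples of the leaf names they were built from.
    """
    nodes = [(x,) for x in names]
    d = {frozenset(((a,), (b,))): v for (a, b), v in dist.items() if a != b}
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # r_i: total distance from i to all other nodes, divided by n - 2.
        r = {u: sum(d[frozenset((u, v))] for v in nodes if v != u) / (n - 2)
             for u in nodes}
        # Join the pair minimizing D_ij = d_ij - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        k = tuple(sorted(i + j))
        edges.append((i, k, 0.5 * (dij + r[i] - r[j])))
        edges.append((j, k, 0.5 * (dij + r[j] - r[i])))
        nodes = [u for u in nodes if u not in (i, j)]
        # Distance from the new node k to each remaining node.
        for u in nodes:
            d[frozenset((k, u))] = 0.5 * (
                d[frozenset((i, u))] + d[frozenset((j, u))] - dij)
        nodes.append(k)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

edges = neighbor_joining(upper, labels)
for u, v, length in edges:
    print(u, v, round(length, 3))
```

For this matrix the tied joins all lead to the same unrooted tree, so the set of edge lengths should not depend on how the ties are broken, even though the order of the joins might.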

In [8]:
### BEGIN SOLUTION TEMPLATE=nj_tree_newick=?
nj_tree_newick = '(((C1:0.057,C2:0.033):0.022,PB:0.044):0.001,D:0.017,PA:0.023);'

# d, r, and D values at each iteration:
# Iteration 1:
# d
#               C1      C2       D      PA      PB
#       C1       0    0.09   0.098   0.105    0.12
#       C2    0.09       0   0.072   0.076   0.101
#        D   0.098   0.072       0    0.04   0.061
#       PA   0.105   0.076    0.04       0   0.068
#       PB    0.12   0.101   0.061   0.068       0
# r
#       C1   0.138
#       C2   0.113
#        D    0.09
#       PA   0.096
#       PB   0.117
# D
#               C1      C2       D      PA      PB
#       C1       0 -0.1607   -0.13  -0.129 -0.1343
#       C2 -0.1607       0 -0.1313 -0.1333 -0.1287
#        D   -0.13 -0.1313       0 -0.1467  -0.146
#       PA  -0.129 -0.1333 -0.1467       0  -0.145
#       PB -0.1343 -0.1287  -0.146  -0.145       0
# Join ('C1', 'C2')
#
# Iteration 2:
# d
#             C1C2       D      PA      PB
#     C1C2       0    0.04  0.0455  0.0655
#        D    0.04       0    0.04   0.061
#       PA  0.0455    0.04       0   0.068
#       PB  0.0655   0.061   0.068       0
# r
#     C1C2   0.075
#        D   0.071
#       PA   0.077
#       PB   0.097
# D
#             C1C2       D      PA      PB
#     C1C2       0  -0.106 -0.1067 -0.1073
#        D  -0.106       0 -0.1073 -0.1068
#       PA -0.1067 -0.1073       0  -0.106
#       PB -0.1073 -0.1068  -0.106       0
# Tie for minimum D value: ('C1C2', 'PB') or ('D', 'PA')
# Arbitrarily join ('D', 'PA')
#
# Iteration 3:
# d
#             C1C2     DPA      PB
#     C1C2       0  0.0227  0.0655
#      DPA  0.0227       0  0.0445
#       PB  0.0655  0.0445       0
# r
#     C1C2   0.088
#      DPA   0.067
#       PB    0.11
# D
#             C1C2     DPA      PB
#     C1C2       0 -0.1327 -0.1327
#      DPA -0.1327       0 -0.1327
#       PB -0.1327 -0.1327       0
# Tie for minimum D value: ('C1C2', 'DPA'), ('C1C2', 'PB'), or ('DPA', 'PB')
# Arbitrarily join ('DPA', 'PB')
#
# Iteration 4:
# d
#             C1C2   DPAPB
#     C1C2       0  0.0219
#    DPAPB  0.0219       0
# Connect last two nodes 'C1C2' and 'DPAPB' with an edge.

### END SOLUTION

In [9]:
# draw your NJ tree
nj_tree = toytree.tree(nj_tree_newick)
canvas, axes = nj_tree.draw(use_edge_lengths=True, scalebar=True)

In [10]:
# test nj_tree_newick valid tree
nj_tree = toytree.tree(nj_tree_newick)
assert sorted(nj_tree.get_tip_labels()) == ['C1', 'C2', 'D', 'PA', 'PB']
assert not nj_tree.is_rooted()
print("SUCCESS: nj_tree_newick valid tree test passed")

SUCCESS: nj_tree_newick valid tree test passed


In [11]:
# test nj_tree_newick tree topology
### BEGIN HIDDEN TESTS
nj_tree = toytree.tree(nj_tree_newick)
rooted_nj_tree = nj_tree.root(names="C2")
rooted_nj_tree.treenode.sort_descendants()
assert rooted_nj_tree.write(tree_format=9) == '((C1,((D,PA),PB)),C2);'
print("SUCCESS: nj_tree_newick topology test passed")
### END HIDDEN TESTS

SUCCESS: nj_tree_newick topology test passed


In [12]:
# test nj_tree_newick branch lengths
### BEGIN HIDDEN TESTS
nj_tree = toytree.tree(nj_tree_newick)
rooted_nj_tree = nj_tree.root(names="C2")
rooted_nj_tree.treenode.sort_descendants()
rooted_nj_tree = toytree.tree(rooted_nj_tree.write(tree_format=5)) # hack to make sure edges are in same order
rooted_nj_branch_lengths = [round(x, 3) for x in rooted_nj_tree.get_edge_values(feature="dist")]
assert rooted_nj_branch_lengths == [0.017, 0.057, 0.044, 0.017, 0.023, 0.001, 0.022, 0.017]
print("SUCCESS: nj_tree_newick branch lengths test passed")
### END HIDDEN TESTS

SUCCESS: nj_tree_newick branch lengths test passed


## Follow-up analysis

Consider the following questions:
1. How do the two trees that you constructed differ, if at all?
2. Are these trees consistent with the possibility that the dentist transmitted HIV to the two patients considered here?

### BEGIN SOLUTION TEMPLATE=Your answers here

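1. Both trees have the same unrooted topology.  The branch lengths differ because UPGMA assumes a molecular clock, which forces every leaf to be the same distance from the root, whereas neighbor joining makes no such assumption.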
2. Yes, they are consistent with a scenario in which the dentist transmitted HIV to the two patients: the dentist leaf clusters with the two patients rather than with the two controls, and the dentist/patient cluster is clearly separated from the control cluster (by an internal branch of about 0.022 in the neighbor joining tree).  However, more samples (from additional patients and controls) would be needed to be confident that dentist-patient transmission occurred.  See the cited paper for how the trees looked with additional samples.
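
One way to check whether two trees share an unrooted topology is to compare their nontrivial splits, that is, the two-sided partitions of the leaves induced by internal edges. The sketch below does this for trees written as nested tuples. The function name `splits` is hypothetical, and the two nested tuples are transcribed by hand from the Newick strings in Problems 1 and 2, with branch lengths dropped.

```python
def splits(tree, all_leaves):
    """Return the set of nontrivial leaf bipartitions of a nested-tuple tree."""
    found = set()

    def leaves_below(node):
        if isinstance(node, str):  # a leaf
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        other = all_leaves - below
        # Keep only splits with at least two leaves on each side.
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))
        return below

    leaves_below(tree)
    return found

leaves = frozenset(['C1', 'C2', 'D', 'PA', 'PB'])

# Topologies only, copied from upgma_tree_newick and nj_tree_newick.
upgma_shape = ('C1', ('C2', (('D', 'PA'), 'PB')))
nj_shape = ((('C1', 'C2'), 'PB'), 'D', 'PA')

same_topology = splits(upgma_shape, leaves) == splits(nj_shape, leaves)
print(same_topology)
```

Because a split ignores where the root sits, this comparison treats the rooted UPGMA tree and the unrooted neighbor joining tree on equal footing.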

### END SOLUTION