# BMI/CS 576 - HW5
The objectives of this homework are to practice

* K-means clustering
* Gaussian mixture model-based clustering
* hierarchical clustering

## HW policies
Before starting this homework, please read over the [homework policies](https://canvas.wisc.edu/courses/167969/pages/hw-policies) for this course.  In particular, note that homeworks are to be completed *individually*.

You are welcome to use any code from the weekly notebooks in your solutions to the HW.

## PROBLEM 1: K-means algorithm (10 points)

Run the $k$-means algorithm (either by hand or with code) on the following set of one-dimensional points: $X = (x_1,x_2,x_3,x_4,x_5) = (2,4,5,9,10)$. Let $k = 2$ and the initial cluster centers be $f_1 = 2$ and $f_2 = 5$. After each iteration, show 

**(i)** the assignment of points to clusters and 

**(ii)** the updated cluster centers.

*If you run the algorithm via code, you must present the output in a nicely formatted manner.*
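
As a reminder of the mechanics, the two alternating steps of $k$-means can be sketched as below. This is a minimal illustration on a different toy dataset, not the points above. The function name `kmeans_1d`, the rule that ties go to the lower-indexed center, and the choice to leave an empty cluster's center unchanged are all assumptions made for this sketch, not part of the course code.

```python
def kmeans_1d(points, centers, max_iters=100):
    """Runs k-means on 1-D points (hypothetical helper for illustration).

    Returns a list with one (assignments, centers) tuple per iteration.
    """
    centers = list(centers)
    history = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center
        # (assumption: ties go to the center with the lower index).
        assignments = [
            min(range(len(centers)), key=lambda c: abs(x - centers[c]))
            for x in points
        ]
        # Update step: each center becomes the mean of its assigned points
        # (assumption: an empty cluster keeps its previous center).
        new_centers = []
        for c in range(len(centers)):
            members = [x for x, a in zip(points, assignments) if a == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        history.append((assignments, new_centers))
        if new_centers == centers:  # converged: centers did not move
            break
        centers = new_centers
    return history

toy_history = kmeans_1d([1, 2, 8, 9], [1, 2])
for iteration, (assignments, centers) in enumerate(toy_history, start=1):
    print(f"iteration {iteration}: assignments={assignments} centers={centers}")
```

Recording every iteration in a list makes it straightforward to print the assignments and centers in whatever format you choose.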

In [None]:
###
### YOUR CODE HERE
###


###
### solution to problem 1
###


## PROBLEM 2: Gaussian mixture model-based clustering (20 points)

Run the EM algorithm (either by hand or with code) for Gaussian mixture model-based
clustering on the set of points in Problem 1 for **three** iterations.
Let $k=2$, the initial cluster means be $\mu_1 = 2$ and $\mu_2 = 5$,
the initial cluster prior probabilities be $P_1 = P_2 = 0.5$, and the
variances be $\sigma^2_1 = \sigma^2_2 = 3$.  You should treat the variances as fixed
parameters that are not updated during EM.  After each iteration, show 

**(i)** the probabilities of each point being assigned to each cluster, 

**(ii)** the updated cluster means, and 

**(iii)** the updated cluster prior probabilities. 

*If you run the algorithm via code, you must present the output in a nicely formatted manner.*

**For your own understanding (not graded)**: compare and contrast your results from the EM algorithm to those from $k$-means in Problem 1.
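
The E-step and M-step for a one-dimensional mixture with a shared, fixed variance can be sketched as follows. The function names (`gaussian_pdf`, `e_step`, `m_step`) and the toy inputs are assumptions chosen for illustration; they are not the data or code for this problem.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def e_step(points, means, priors, variance):
    """Returns one row per point: the posterior probability of each cluster."""
    responsibilities = []
    for x in points:
        weighted = [p * gaussian_pdf(x, m, variance) for m, p in zip(means, priors)]
        total = sum(weighted)
        responsibilities.append([w / total for w in weighted])
    return responsibilities

def m_step(points, responsibilities):
    """Re-estimates the means and priors; the variance is treated as fixed."""
    k = len(responsibilities[0])
    totals = [sum(row[c] for row in responsibilities) for c in range(k)]
    means = [
        sum(row[c] * x for row, x in zip(responsibilities, points)) / totals[c]
        for c in range(k)
    ]
    priors = [t / len(points) for t in totals]
    return means, priors

# Toy, symmetric example (hypothetical data).
toy_points = [0.0, 1.0, 3.0, 4.0]
toy_resp = e_step(toy_points, means=[0.0, 4.0], priors=[0.5, 0.5], variance=1.0)
toy_means, toy_priors = m_step(toy_points, toy_resp)
print(toy_resp, toy_means, toy_priors)
```

Unlike the hard assignments of $k$-means, each point here contributes to every cluster mean in proportion to its posterior probability.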

###
### solution to problem 2
###


## PROBLEM 3: Bottom-up hierarchical clustering (60 points)
Implement bottom-up hierarchical clustering with the function `cluster_bottom_up` below.  The function takes as input a list of profiles and a list of corresponding names (which will be the names for the leaves of the resulting tree).  In addition, the function will take as input a string specifying which linkage function (e.g., single) to use for the clustering as well as a distance function that computes the distance (e.g., euclidean distance) between a pair of profiles.  The function will output a `TreeNode` object, representing the root of the hierarchical clustering tree.
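
To make the three linkage options concrete, here is a small sketch of how the distance between two clusters follows from the pairwise distances between their members. The helper name `linkage_distance` and its signature are hypothetical. Recomputing from the raw profiles, as done here, is for illustration only; an efficient implementation would more likely cache distances between clusters.

```python
def linkage_distance(cluster_a, cluster_b, linkage, distance):
    """Distance between two clusters (lists of profiles) under a linkage rule."""
    pairwise = [distance(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":      # closest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all pairs of members
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: {}".format(linkage))

def manhattan(p1, p2):
    return sum(abs(a - b) for a, b in zip(p1, p2))

# Two small 1-D clusters: {6, 8} and {11}.
left, right = [(6,), (8,)], [(11,)]
for name in ("single", "complete", "average"):
    print(name, linkage_distance(left, right, name, manhattan))
```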

**Distances:** The tree should have branch lengths computed in the same way as for the UPGMA algorithm for phylogenetic trees.  Each node should have a "height", with the leaf nodes being at height zero and the root node being the highest node.  After merging two nodes (clusters), $i$ and $j$, into a new node $k$, the height of node $k$ should be half the distance from cluster $i$ to cluster $j$, i.e., $height(k) = d_{ij} / 2$.
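
The height rule implies that the branch length from a node to its parent is the difference of their heights. The sketch below works this out by hand for the three profiles of the "triple" test dataset, using Manhattan distance and complete linkage. The variable names are illustrative assumptions.

```python
def manhattan(p1, p2):
    return sum(abs(a - b) for a, b in zip(p1, p2))

A, B, C = (4, 2), (-2, -6), (1, 6)

# First merge: A and C are the closest pair, so their parent sits at d(A, C) / 2.
height_AC = manhattan(A, C) / 2

# Second merge: under complete linkage, the distance from {A, C} to {B}
# is the largest pairwise distance between their members.
d_AC_B = max(manhattan(A, B), manhattan(C, B))
height_root = d_AC_B / 2

# Branch length = height of parent - height of child (leaves have height 0).
branch_A = height_AC - 0
branch_AC = height_root - height_AC
branch_B = height_root - 0

print(branch_A, branch_AC, branch_B)
```

These branch lengths agree with the expected Newick string `((A:3.5,C:3.5):4,B:7.5);` in the `triple_complete_manhattan` test.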

**Tie-breaking:** For tie-breaking purposes, you should keep track of an index for each node (cluster) in the tree.  The input profiles will correspond to the leaves of the tree, and should have indices 0 to $n-1$ where $n$ is the number of profiles.  Each successive node that is created should have the next available integer index (e.g., the very first merge of the algorithm should produce the node with index $n$ and the following merge should produce the node with index $n+1$).  When finding the next pair of clusters to merge, if two or more pairs have the same minimum distance, pick the pair with the lexicographically smallest pair of indices $(i, j)$.  For example, if the pairs of clusters (3, 8) and (5, 7) have the same minimum distance, you should choose the pair (3, 8) to merge next.
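
One way to realize this rule, sketched under the assumption that candidate pairs are stored as `(i, j)` tuples with `i < j`, is to compare `(distance, pair)` tuples: Python orders tuples lexicographically, so equal distances fall back to the pair of indices. The dictionary `candidate_distances` is a hypothetical stand-in for whatever structure you use to hold cluster distances.

```python
# Hypothetical distances between pairs of cluster indices.
candidate_distances = {
    (5, 7): 2.0,
    (3, 8): 2.0,
    (1, 2): 3.0,
}

# Minimize on distance first, then on the index pair itself.
best_pair = min(candidate_distances, key=lambda pair: (candidate_distances[pair], pair))
print(best_pair)
```

Note that this relies on tied distances comparing exactly equal as floating-point numbers.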

**Efficiency:** Your implementation should have runtime complexity of $O(n^3)$.  You are welcome to implement the more efficient $O(n^2)$ (for single-link) and $O(n^2 \log n)$ (for complete and average-link) algorithms, but this is not required.

**Hierarchical clustering data structure:** You are to use objects of the `TreeNode` class as we did in notebook 22 to build your hierarchical clustering structure.

**Tests:** Visible and hidden tests for your function are found at the bottom of this notebook.

## Modules for this HW

In [None]:
import toytree                        # for working with trees
from toytree.TreeNode import TreeNode # make TreeNode directly available
import math

In [None]:
def euclidean_distance(p1, p2):
    """The Euclidean distance between two profiles."""
    return math.sqrt(sum((e1 - e2)**2 for e1, e2 in zip(p1, p2)))

def manhattan_distance(p1, p2):
    """The Manhattan distance between two profiles."""
    return sum(abs(e1 - e2) for e1, e2 in zip(p1, p2))

def cluster_bottom_up(profiles, profile_names, linkage="single", distance=euclidean_distance):
    """Performs a bottom-up hierarchical clustering of a list of profiles, returning
    a tree that has the given profile names labeling the leaves.
    
    Args:
        profiles: a list of profiles/points (each of which is represented as a tuple)
        profile_names: a list of the same length as profiles giving the names of the profiles
        linkage: a string indicating the linkage method to use, which should be one of 
            "single", "complete", and "average"
        distance: a function that takes as input two profiles and returns a number giving
                  the distance between the two profiles, i.e., distance(p1, p2) should return
                  the distance between profile p1 and profile p2.
    Returns:
        A TreeNode instance representing the root of the hierarchical clustering tree.
    """
    ###
    ### YOUR CODE HERE
    ###



## PROBLEM 4: Evaluation of cell type clusterings (10 points)

In this problem we will revisit the cell type expression profile dataset that we used in the day 22 activity.  This dataset is read in via the cell below.  One major division between human cell types is between blood cell types and non-blood cell types.  The cell below also reads in a list of the cell types in the dataset that are blood cell types.  
In this problem you are to evaluate the results of bottom-up clusterings computed from this dataset by determining how well the blood cell types cluster within each tree.  For a given hierarchical clustering tree, we may compute how well the blood cell types cluster within the tree by finding the subtree that has the maximum Jaccard index with the set of blood cell types.  The Jaccard index is a measure of the similarity of two sets, $A$ and $B$, and is defined as:

$jaccard\_index(A, B) = \frac{|A \cap B|}{|A \cup B|}$

For the *six* possible combinations of distance measure (euclidean and manhattan) and linkage function (single, average, and complete), compute the bottom-up clustering of the cell type dataset, and then compute the maximum Jaccard index of a subtree of that tree with the blood cell types.  **Give the value of the maximum Jaccard index for each clustering and determine which distance measure and linkage function gives the best clustering with respect to this evaluation.**

*Implementation notes:* Note that each node in a tree corresponds to a subtree (the subtree rooted by that node), and thus one can iterate through all subtrees in a tree by iterating through the nodes.  You will likely find the `traverse` method of TreeNode objects helpful for iterating through nodes.  Also, the `get_leaf_names` method of TreeNode objects may be helpful in retrieving the names of the leaves in a subtree.
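
The Jaccard index itself takes only a few lines on Python sets. The sketch below uses made-up leaf names and hand-written candidate leaf sets in place of real subtrees. The function name `jaccard_index`, the example names, and the convention of returning 0.0 for two empty sets are assumptions for illustration.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:  # assumption: define the index of two empty sets as 0.0
        return 0.0
    return len(a & b) / len(union)

# Hypothetical reference set and candidate leaf sets (one per subtree).
reference = {"T cell", "B cell", "monocyte"}
candidate_leaf_sets = [
    {"T cell"},
    {"T cell", "B cell", "fibroblast"},
    {"T cell", "B cell", "fibroblast", "astrocyte"},
]

best_score = max(jaccard_index(leaves, reference) for leaves in candidate_leaf_sets)
print(best_score)
```

In the actual problem, each candidate leaf set would come from the leaves of one subtree of the clustering tree.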


In [None]:
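def read_gene_expression_profiles(filename):
    """Reads a tab-delimited file whose first row holds the sample names and whose
    remaining rows hold expression values, one column per sample."""
    with open(filename) as f:
        rows = [line.rstrip().split("\t") for line in f]
    sample_names = rows[0]
    columns = zip(*rows[1:])  # transpose so that each column becomes one profile
    profiles = [tuple(map(float, column)) for column in columns]
    return profiles, sample_names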

cell_type_profiles, cell_type_names = read_gene_expression_profiles("cell_type_expression.txt")

with open("blood_cell_types.txt") as f:
    blood_cell_type_names = [line.rstrip() for line in f]

In [None]:
###
### YOUR CODE HERE
###


## Tests for PROBLEM 3

In [None]:
# we will generate a random dataset of 100-dimensional profiles from the vertices of a hypercube
import random
def random_hypercube_vertex(dims):
    return tuple(random.randint(0, 1) for i in range(dims))

random.seed(42)
dims = 100
num_profiles = 250
random_100_250_profiles = [random_hypercube_vertex(dims) for i in range(num_profiles)]
random_100_250_names = ["P{}".format(i) for i in range(num_profiles)]

# a dictionary of datasets for testing
# the key is the name of the test
# the value is a tuple (profiles, profile_names)
datasets = {
    "pair": ([(4, 2), (-2, -6)], 
             ["A", "B"]),
    "triple": ([(4, 2), (-2, -6), (1, 6)],
               ["A", "B", "C"]),
    "quintet": ([(0,), (6,), (8,), (11,), (15,)],
                ["A", "B", "C", "D", "E"]),
    "tiebreaker": ([(0,), (1,), (2,), (3,), (4,), (5,)],
                   ["A", "B", "C", "D", "E", "F"]),
    "cell_type_sample": (
        [(2.7, 0.0, 2.2, 0.0, 1.6, 2.2, 0.8, 2.6, 0.0, 0.0),
         (3.0, 0.0, 2.0, 0.0, 3.2, 1.6, 1.2, 2.6, 1.1, 1.1),
         (2.9, 0.0, 2.3, 0.0, 2.2, 1.9, 0.5, 2.3, 1.0, 0.3),
         (2.6, 0.0, 2.3, 0.0, 2.2, 2.0, 0.7, 2.0, 1.3, 0.8),
         (2.7, 0.0, 2.1, 0.0, 4.0, 2.1, 1.3, 2.6, 1.0, 1.0),
         (2.8, 0.0, 2.1, 0.0, 2.6, 2.2, 2.3, 2.4, 1.4, 1.1),
         (2.7, 0.0, 2.1, 0.0, 3.2, 2.8, 1.9, 1.9, 1.2, 0.8),
         (2.8, 0.0, 2.5, 1.3, 2.8, 2.1, 2.8, 2.0, 1.0, 0.7),
         (2.9, 0.0, 2.0, 0.8, 1.4, 2.2, 1.9, 2.6, 1.0, 0.7)],
        ["placental pericyte",
         "stromal cell",
         "pericyte cell",
         "skin fibroblast",
         "hematopoietic cell",
         "stromal cell of ovary",
         "calvarial osteoblast",
         "osteoblast",
         "astrocyte"]),
    "random_100_250": (random_100_250_profiles, random_100_250_names),
    "cell_type": (cell_type_profiles, cell_type_names)
}

# testing functions
def test_case_newick(name, linkage, dist):
    profiles, names = datasets[name]
    tree = cluster_bottom_up(profiles, names, linkage=linkage, distance=dist)
    tree.sort_descendants()
    return tree.write(format=1)    

def test_case(name, linkage, distance, correct_newick):
    output_newick = test_case_newick(name, linkage, distance)
    if output_newick != correct_newick:
        assert False, "Failed test\n Output: %s\nCorrect: %s" % (output_newick, correct_newick)
    else:
        print("SUCCESS: test passed")

In [None]:
# test pair_single_euclidean (4 points)
test_case("pair", "single", euclidean_distance, "(A:5,B:5);")

In [None]:
# test pair_single_manhattan (4 points)
test_case("pair", "single", manhattan_distance, "(A:7,B:7);")

In [None]:
# test triple_single_manhattan (3 points)
test_case("triple", "single", manhattan_distance, "((A:3.5,C:3.5):3.5,B:7);")

In [None]:
# test triple_complete_manhattan (3 points)
test_case("triple", "complete", manhattan_distance, "((A:3.5,C:3.5):4,B:7.5);")

In [None]:
# test triple_average_manhattan (4 points)
test_case("triple", "average", manhattan_distance, "((A:3.5,C:3.5):3.75,B:7.25);")

In [None]:
# test quintet_single_manhattan (3 points)
test_case("quintet", "single", manhattan_distance, "(A:3,(((B:1,C:1):0.5,D:1.5):0.5,E:2):1);")

In [None]:
# test quintet_complete_manhattan (3 points)
test_case("quintet", "complete", manhattan_distance, "((A:4,(B:1,C:1):3):3.5,(D:2,E:2):5.5);")

In [None]:
# test quintet_average_manhattan (4 points)
test_case("quintet", "average", manhattan_distance, "(A:5,((B:1,C:1):2,(D:2,E:2):1):2);")

In [None]:
# test tiebreaker_single_manhattan (5 points)
test_case("tiebreaker", "single", manhattan_distance, "(((A:0.5,B:0.5):0,(C:0.5,D:0.5):0):0,(E:0.5,F:0.5):0);")

In [None]:
# test cell_type_sample_average_euclidean (5 points)
test_case("cell_type_sample", "average", euclidean_distance, 
          "((astrocyte:0.936623,((pericyte cell:0.377492,skin fibroblast:0.377492):0.3967,placental pericyte:0.774191):0.162432):0.172292,(((calvarial osteoblast:0.563471,stromal cell of ovary:0.563471):0.223098,(hematopoietic cell:0.504975,stromal cell:0.504975):0.281594):0.24476,osteoblast:1.03133):0.0775861);")

In [None]:
# test cell_type_sample_single_chebyshev (5 points)
test_case("cell_type_sample", 
          "single", 
          lambda p1, p2: max(abs(e1 - e2) for e1, e2 in zip(p1, p2)), # Chebyshev distance
          '((astrocyte:0.55,(((calvarial osteoblast:0.3,stromal cell of ovary:0.3):0.1,(hematopoietic cell:0.4,stromal cell:0.4):0):0.1,((pericyte cell:0.25,skin fibroblast:0.25):0.25,placental pericyte:0.5):0):0.05):0.1,osteoblast:0.65);')

In [None]:
# test random_100_250_runtime (10 points)
import timeit
random_100_250_newick = "((((((((P0:18,P75:18):5,(P178:18.5,P246:18.5):4.5):5,((P118:17,P33:17):7.5,((P125:19.5,P18:19.5):1,P235:20.5):4):3.5):2,(((P13:19,P133:19):3,(P162:17.5,P4:17.5):4.5):4,((P154:20,P93:20):3,(P73:19.5,P85:19.5):3.5):3):4):1.5,((((P10:18.5,P28:18.5):6,(P117:19,P37:19):5.5):3.5,(((P115:18.5,P48:18.5):4,(P138:17.5,P160:17.5):5):3,(P186:19.5,(P249:19,P47:19):0.5):6):2.5):2,((((P101:17.5,P137:17.5):6.5,(P108:17.5,P89:17.5):6.5):2,((P155:18.5,P54:18.5):3,P165:21.5):4.5):3,(((P119:20,P70:20):3.5,(P236:15,P61:15):8.5):3,((P187:18.5,P36:18.5):5,(P27:18,P80:18):5.5):3):2.5):1):1.5):1,(((((P100:19.5,P157:19.5):3,(P145:18.5,P19:18.5):4):3,((P179:17.5,P42:17.5):5,(P245:17,P52:17):5.5):3):3.5,(((P14:18.5,P144:18.5):5,((P17:20,P202:20):1,P46:21):2.5):3.5,((P159:18,P22:18):5.5,(P244:22,P51:22):1.5):3.5):2):2,((((P111:16,P128:16):7,(P20:16,P229:16):7):1,((P114:18,P168:18):3,P205:21):3):5.5,((((P123:14.5,P161:14.5):8.5,(P230:18,P88:18):5):4.5,((P129:19,P32:19):3.5,(P53:16.5,P6:16.5):6):5):1,(((P140:19.5,P141:19.5):1,P234:20.5):4.5,(P30:20,P49:20):5):3.5):1):1.5):1.5):1.5,((((((P1:19.5,P237:19.5):3,(P189:19.5,P228:19.5):3):4,((P122:18.5,P217:18.5):3,(P57:20.5,P84:20.5):1):5):2,((((P2:17,P226:17):1.5,P69:18.5):4.5,(P240:17.5,P241:17.5):5.5):3.5,(P81:20.5,(P90:18,P99:18):2.5):6):2):2,(((P105:17,P181:17):5.5,(P150:17,P77:17):5.5):4.5,((P188:18,P225:18):7,(P201:18.5,P95:18.5):6.5):2):3.5):1,(((((P11:19,P142:19):3.5,(P210:18.5,P34:18.5):4):3,(P164:16.5,P86:16.5):9):2,(((P158:20,P180:20):2,P182:22):2,(P247:18.5,P44:18.5):5.5):3.5):2,(((P120:19.5,P16:19.5):5.5,((P127:17.5,P64:17.5):2,P167:19.5):5.5):2.5,(((P131:18,P166:18):4,(P215:20.5,P222:20.5):1.5):2,(P172:19,P183:19):5):3.5):2):2):2.5):1,(((((((P102:18.5,P191:18.5):3,(P132:17.5,P63:17.5):4):2.5,(P106:19.5,(P194:16,P72:16):3.5):4.5):4,(((P112:17.5,P24:17.5):6,(P151:17.5,P71:17.5):6):0.5,(P146:19.5,P232:19.5):4.5):4):2,((((P110:18.5,P134:18.5):5.5,((P15:17.5,P203:17.5):4.5,(P238:17.5,P40:17.5):4.5):2):2.5,((P163:17.5,
P198:17.5):5.5,(P169:19,P26:19):4):3.5):2.5,(((P116:18,P25:18):4.5,(P62:18.5,P83:18.5):4):3,((P121:21,P45:21):3,(P184:18.5,P39:18.5):5.5):1.5):3.5):1):2.5,(((((P103:17,P218:17):4.5,(P200:18,P56:18):3.5):5,(((P147:18.5,P176:18.5):5,(P177:18,P66:18):5.5):2,((P209:19,P219:19):1,P223:20):5.5):1):2.5,((((P124:18.5,P96:18.5):1,P21:19.5):5,(P185:19,P211:19):5.5):2.5,((P152:18.5,P204:18.5):4,(P29:16,P60:16):6.5):4.5):2):1.5,((((P107:21,(P5:16.5,P67:16.5):4.5):5.5,((P199:18,P231:18):5.5,(P92:18.5,P94:18.5):5):3):1,((P213:19,P68:19):5,(P227:20.5,(P248:18,P3:18):2.5):3.5):3.5):1.5,(((P149:19.5,P197:19.5):2.5,(P192:19,P91:19):3):3,(P156:21,(P221:19,P74:19):2):4):4):1.5):2):1,((((((P104:18,P78:18):5,(P113:19.5,P87:19.5):3.5):3.5,((P208:18,P239:18):5.5,(P58:20,P98:20):3.5):3):2.5,(((P171:17,P243:17):5.5,(P216:17,P23:17):5.5):4,((P242:18,P31:18):4,(P59:18.5,P76:18.5):3.5):4.5):2.5):1.5,((((P12:17.5,P38:17.5):4.5,P65:22):2.5,((P135:20.5,P8:20.5):2,(P173:17,P174:17):5.5):2):4,(((P170:16,P206:16):6,(P233:18,P82:18):4):3.5,((P207:15,P50:15):6.5,P43:21.5):4):3):2):0.5,(((((P109:18.5,P136:18.5):5,(P143:18.5,P212:18.5):5):2,(P153:19.5,P214:19.5):6):1.5,((P126:19.5,(P130:18.5,P195:18.5):1):6,((P190:18.5,P9:18.5):5,(P193:18,P220:18):5.5):2):1.5):2,(((P139:20.5,P224:20.5):4,((P175:15.5,P35:15.5):5.5,P55:21):3.5):3.5,((P148:17.5,P196:17.5):7.5,((P41:18,P7:18):5.5,(P79:19,P97:19):4.5):1.5):3):1):2):2.5):1.5);"
test_statement = 'test_case("random_100_250", "complete", manhattan_distance, random_100_250_newick)'
random_100_250_runtime = timeit.timeit(test_statement, number=1, globals=globals())
assert random_100_250_runtime < 12, "your cluster_bottom_up implementation is too inefficient"
print("SUCCESS: random_100_250_runtime test passed" )

In [None]:
# hidden test 1
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# hidden test 2
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# hidden test 3
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# hidden test 4
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# hidden test 5
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# hidden test 6
###
### AUTOGRADER TEST - DO NOT REMOVE
###
