# Handout 05
#### Sara Díaz del Ser

In [20]:
import matplotlib.pyplot as plt
import numpy as np
from termcolor import colored
plt.style.use('dark_background')
from data.showtree import showtree

In [2]:
# Files
small_distances_file = './data/small-distances.txt'
diistances_file = './data/distances.txt'

### Ex. 1 _(15 pts)_ Hierarchical clustering algorithm

#### (a) _(2 pts)_ Reading in distance matrices
Assume that all the distances have already been calculated and are stored in a text file
similar to the BLOSUM matrices of the previous weeks. Two files are given: one containing
pairwise distances between 5 objects (```small-distances.txt```) and one containing pairwise
distances between 13 objects (```distances.txt```). Write a function that reads a distance
matrix and stores it, for instance, in a dictionary of dictionaries so that distances can
be accessed like ```dist['D']['B']```.

In [33]:
def read_distance_matrix(filename: str) -> dict:
	"""Read a distance matrix from a .txt file and return it as a dictionary of dictionaries"""
	with open(filename, 'r') as f:
		# split() without arguments splits on any run of whitespace and drops empty fields
		rows = [line.split() for line in f if line.strip()]
	# First row holds the column headers
	header = rows[0]
	# Drop the first column of every remaining row (it repeats the headers)
	matrix = [row[1:] for row in rows[1:]]
	# Values are kept as the strings read from the file
	return {key: dict(zip(header, values)) for key, values in zip(header, matrix)}

In [73]:
# Test it (the outputs shown in this handout come from the 5-object file)
d = read_distance_matrix(small_distances_file)
d['D']['B']

'2'
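
The value returned above is the string `'2'`, so any arithmetic on the matrix needs a conversion first. A minimal sketch of a variant that stores floats directly is given below. It assumes the file layout is a header row of labels followed by one row per object starting with its label; `parse_distance_matrix` and the inline sample text are hypothetical stand-ins, with the numbers taken from the 5-object matrix printed further down in this handout.

```python
import io


def parse_distance_matrix(lines):
    """Parse whitespace-separated rows into a dict of dicts with float values."""
    rows = [line.split() for line in lines if line.strip()]
    header = rows[0]  # column labels
    # Each remaining row is assumed to be: label, then one distance per column
    return {
        row[0]: {col: float(value) for col, value in zip(header, row[1:])}
        for row in rows[1:]
    }


# Inline stand-in for small-distances.txt (assumed layout)
sample = io.StringIO(
    "  A B C D E\n"
    "A 0 4 1 2 5\n"
    "B 4 0 5 2 3\n"
    "C 1 5 0 3 6\n"
    "D 2 2 3 0 3\n"
    "E 5 3 6 3 0\n"
)
dist = parse_distance_matrix(sample)
print(dist['D']['B'])  # 2.0
```

Because an open file is also an iterable of lines, the same function could be called as `parse_distance_matrix(open(path))` on a real file with that layout.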

#### (b) _(2 pts)_ Number of elements of a nested tuple
First, write a function that counts the number of elementary objects in a nested tuple.
I.e., the function should return 3 for (('A','B'),'C') and 5 for ((('A','B'),'C'),('D','E')).
This function will be helpful when determining cluster distances.

In [38]:
# Flatten the nested tuples
def flatten(nested_tuple):
	"""Generator that flattens a tuple"""
	for i in nested_tuple:
		yield from [i] if not isinstance(i, tuple) else flatten(i)

In [39]:
# Get length of flattened generator
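def n_elements(nested_tuple):
	"""Number of elementary objects in a nested tuple"""
	# A bare label such as 'A' (or a longer name) is a single elementary object
	if not isinstance(nested_tuple, tuple):
		return 1
	return sum(1 for _ in flatten(nested_tuple))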

In [40]:
# Test it
print(n_elements((('A','B'),'C')))
print(n_elements(((('A','B'),'C'),('D','E'))))

3
5


#### (c) _(4 pts)_ Merging clusters
When two clusters are merged, the distance of the merged cluster to all other clusters
has to be determined. Given two clusters R and S that are merged into a new cluster
M = (R, S), the distance of M to a cluster T can be determined using

$$ d(M, T) = \frac{|R| \, d(R, T) + |S| \, d(S, T)}{|R| + |S|} $$

where |R| and |S| denote the number of elementary objects in R and S.

Write a function taking three parameters: a distance matrix (i.e. a dictionary of
dictionaries as in part (a)) and two clusters (represented as strings/tuples). The function
merges the two clusters by updating the distance matrix.

Note that after merging clusters R and S into cluster M = (R, S), the clusters R and
S are no longer needed. Their keys should be removed from the distance matrix. You
can use ```del dist[key]``` to remove a key from a dictionary. To remove R, for instance,
you need to remove not only ```dist[R]``` but also ```dist[T][R]``` for all other clusters T.
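
To make the update rule concrete, here is a small worked sketch of the size-weighted average on its own. The helper name `merged_distance` is hypothetical; the first call uses values from the 5-object matrix (merging 'A' and 'C', with T = 'B'), and the second shows what happens when the two clusters differ in size.

```python
def merged_distance(size_r, size_s, d_rt, d_st):
    """Size-weighted average of d(R, T) and d(S, T)."""
    return (size_r * d_rt + size_s * d_st) / (size_r + size_s)


# R = 'A', S = 'C' (one object each), T = 'B': d(A, B) = 4, d(C, B) = 5
print(merged_distance(1, 1, 4, 5))    # 4.5

# R = ('A', 'C') with two objects, S = 'D' with one, T = 'B':
# d(R, B) = 4.5 and d(D, B) = 2, so the larger cluster weighs twice as much
print(merged_distance(2, 1, 4.5, 2))  # about 3.67
```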


In [67]:
def calc_distance(dist: dict, R, S, T) -> float:
	"""Distance from the merged cluster (R, S) to another cluster T"""
	# |R| and |S| are cluster sizes (numbers of elementary objects), not absolute values
	nR = n_elements(R)
	nS = n_elements(S)
	# Entries read from the file are strings, so convert before doing arithmetic
	return (nR * float(dist[R][T]) + nS * float(dist[S][T])) / (nR + nS)

In [74]:
def cluster_merger(dist: dict, R, S) -> dict:
	"""Merge clusters R and S into (R, S), updating the distance matrix in place"""
	new_cluster = (R, S)
	others = [T for T in dist if T not in (R, S)]

	# Distances from the merged cluster to every remaining cluster
	new_row = {T: calc_distance(dist, R, S, T) for T in others}

	# Remove R and S, both as rows and as columns
	del dist[R], dist[S]
	for T in others:
		del dist[T][R], dist[T][S]
		dist[T][new_cluster] = new_row[T]

	# Add the merged cluster; its distance to itself is 0
	new_row[new_cluster] = 0.0
	dist[new_cluster] = new_row
	return dist

In [75]:
cluster_merger(d,'A','C')

{'B': {'B': '0', 'D': '2', 'E': '3', ('A', 'C'): 4.5},
 'D': {'B': '2', 'D': '0', 'E': '3', ('A', 'C'): 2.5},
 'E': {'B': '3', 'D': '3', 'E': '0', ('A', 'C'): 5.5},
 ('A', 'C'): {'B': 4.5, 'D': 2.5, 'E': 5.5, ('A', 'C'): 0.0}}

#### (d) _(3 pts)_ Find closest clusters
Write a function that takes a distance matrix as input and returns the two clusters that
should be merged, i.e. whose distance is smallest.

In [43]:
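from itertools import combinations

def find_closest_clusters(dist_matrix) -> tuple:
	"""Find the two clusters whose distance is smallest"""
	# Compare every unordered pair of distinct clusters; entries may be strings or floats
	return min(combinations(dist_matrix, 2), key=lambda pair: float(dist_matrix[pair[0]][pair[1]]))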

{'A': {'A': '0', 'B': '4', 'C': '1', 'D': '2', 'E': '5'},
 'B': {'A': '4', 'B': '0', 'C': '5', 'D': '2', 'E': '3'},
 'C': {'A': '1', 'B': '5', 'C': '0', 'D': '3', 'E': '6'},
 'D': {'A': '2', 'B': '2', 'C': '3', 'D': '0', 'E': '3'},
 'E': {'A': '5', 'B': '3', 'C': '6', 'D': '3', 'E': '0'}}

#### (e) _(4 pts)_ Hierarchical clustering
Write a function implementing the hierarchical clustering according to the pseudocode.
The function should return the final clustering as a tuple and the heights for each
cluster. The height should be stored as a dictionary, where the key is the cluster and
the value the height. Test your program using the two files ```small-distances.txt``` and
```distances.txt```. To visualize the result you can use the function ```showtree``` provided in
```showtree.py``` by copying that file from the workshop folder and using:
 ```from showtree import showtree```
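
The pseudocode referred to above is not reproduced in this handout, so the following is only a sketch of one plausible reading: repeatedly merge the two closest clusters, using the size-weighted update from part (c), until a single cluster remains. Two points are assumptions and are marked as such in the comments: single objects get height 0, and a merged cluster gets half the distance at which it was merged. The names `count_leaves` and `hierarchical_clustering` are hypothetical, and `showtree` is not called because its signature is not shown here.

```python
from itertools import combinations


def count_leaves(cluster):
    """Number of elementary objects in a (possibly nested) cluster."""
    if not isinstance(cluster, tuple):
        return 1
    return sum(count_leaves(child) for child in cluster)


def hierarchical_clustering(dist):
    """Agglomerative clustering with the size-weighted update from part (c).

    Returns (final_cluster, heights). The input matrix is not modified.
    """
    # Work on a numeric copy so the caller's matrix stays untouched
    dist = {a: {b: float(v) for b, v in row.items()} for a, row in dist.items()}
    # Assumption: single objects sit at height 0
    heights = {cluster: 0.0 for cluster in dist}

    while len(dist) > 1:
        # Closest pair of distinct clusters (first pair wins on ties)
        r, s = min(combinations(dist, 2), key=lambda pair: dist[pair[0]][pair[1]])
        merged = (r, s)
        # Assumption: a merged cluster sits at half its merge distance
        heights[merged] = dist[r][s] / 2

        n_r, n_s = count_leaves(r), count_leaves(s)
        others = [t for t in dist if t not in (r, s)]
        new_row = {
            t: (n_r * dist[r][t] + n_s * dist[s][t]) / (n_r + n_s) for t in others
        }

        # Drop R and S as rows and columns, then add the merged cluster
        del dist[r], dist[s]
        for t in others:
            del dist[t][r], dist[t][s]
            dist[t][merged] = new_row[t]
        new_row[merged] = 0.0
        dist[merged] = new_row

    (final_cluster,) = dist  # exactly one key is left
    return final_cluster, heights


# The 5-object matrix as printed earlier in this handout
small = {
    'A': {'A': '0', 'B': '4', 'C': '1', 'D': '2', 'E': '5'},
    'B': {'A': '4', 'B': '0', 'C': '5', 'D': '2', 'E': '3'},
    'C': {'A': '1', 'B': '5', 'C': '0', 'D': '3', 'E': '6'},
    'D': {'A': '2', 'B': '2', 'C': '3', 'D': '0', 'E': '3'},
    'E': {'A': '5', 'B': '3', 'C': '6', 'D': '3', 'E': '0'},
}
tree, heights = hierarchical_clustering(small)
print(tree)  # (('A', 'C'), ('E', ('B', 'D')))
for cluster, height in heights.items():
    print(cluster, height)
```

Copying the matrix at the start keeps the function free of side effects, so the same dictionary can be clustered again or inspected afterwards. If the course pseudocode defines heights differently, only the two lines marked as assumptions would need to change.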