## Clustering

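We will compute a max-spacing $k$-clustering with Kruskal's algorithm. The spacing of a clustering is the smallest distance between two points that lie in different clusters, and we want the partition into $k$ clusters that makes this spacing as large as possible.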

**Kruskal's Algorithm with Union-Find**

**Input:** A complete undirected graph $G=(V,E)$ with distance $d_{xy}$ for each vertex pair $x,y\in V$

**Output:** A minimum spanning tree of G

1. $T:=\emptyset$

2. $U:=Initialize(V)$

3. sort edges of E by distance in increasing order

4. **for** each edge $(v,w)\in E$ **do**

   4.1 **if** $Find(U,v)\neq Find(U,w)$ **then** (no $v$-$w$ path in $T$, so it is safe to add $(v,w)$)

   4.2 $T:=T \cup \{(v,w)\}$

   4.3 $Union(U,v,w)$

5. return T
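
The pseudocode above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in this notebook: the function name `kruskal_mst`, the integer vertex labels `0..num_vertices-1`, and the toy edge list are assumptions made for the example.

```python
def kruskal_mst(num_vertices, edges):
    """Return the MST edges of a connected graph on vertices 0..num_vertices-1.

    `edges` is a list of (u, v, distance) tuples (hypothetical input format).
    """
    parent = list(range(num_vertices))  # Initialize(V): every vertex is its own root

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # Step 3: scan the edges in order of increasing distance.
    for u, v, d in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:              # no u-v path in the tree yet
            tree.append((u, v, d))
            parent[ru] = rv       # Union
    return tree


# Toy graph, made up for illustration.
edges = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 4), (1, 3, 5)]
mst = kruskal_mst(4, edges)
print(mst)  # [(0, 1, 1), (1, 2, 2), (2, 3, 4)]
```

The edges $(0,2)$ and $(1,3)$ are skipped because their endpoints are already connected when they are examined.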

**Single-Link Clustering Algorithm**

1. Define a complete undirected graph $G=(V,E)$ with distance $d_{xy}$ for each vertex pair $x,y\in V$.

2. Run Kruskal's algorithm on $G$ and stop as soon as $T$ contains $|V|-k$ edges, that is, when $(V,T)$ has exactly $k$ connected components.

3. Compute the connected components of $(V,T)$ and return the corresponding partition of $V$.
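
As a rough sketch of the three steps, the following function returns the partition itself rather than only the spacing. The name `single_link_clusters`, the `dist` callback, and the one-dimensional toy points are assumptions for illustration; points are assumed to be distinct and hashable.

```python
from itertools import combinations


def single_link_clusters(points, dist, k):
    """Partition `points` into k clusters by merging the closest pairs first."""
    leader = {p: p for p in points}

    def find(p):
        # Follow leader pointers to the representative of p's cluster.
        while leader[p] != p:
            leader[p] = leader[leader[p]]
            p = leader[p]
        return p

    components = len(points)
    # Step 1: the complete graph is the set of all pairs, sorted by distance.
    pairs = sorted(combinations(points, 2), key=lambda pq: dist(*pq))
    # Step 2: Kruskal's merging, stopped once k components remain.
    for p, q in pairs:
        if components == k:
            break
        rp, rq = find(p), find(q)
        if rp != rq:
            leader[rp] = rq
            components -= 1

    # Step 3: group the points by representative.
    clusters = {}
    for p in points:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


points = [0, 1, 2, 10, 11, 20]
clusters = single_link_clusters(points, lambda a, b: abs(a - b), 3)
print(clusters)  # [[0, 1, 2], [10, 11], [20]]
```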

In [10]:
# Union-Find data structure
class UnionFind:
    def __init__(self, nodes):
        # root[node] is the representative of the set containing node
        self.root = dict(zip(nodes, nodes))
        # subtree[r] lists the members of the set whose representative is r
        self.subtree = dict(zip(nodes, [[node] for node in nodes]))

    def find(self, node):
        """ find the root of a node """
        return self.root[node]

    def union(self, i, j):
        """ union two nodes i and j by merging the smaller tree into the larger one """
        pi, pj = self.root[i], self.root[j]
        if pi == pj:
            # i and j are already in the same set
            return

        # make pi the root of the larger set
        if len(self.subtree[pj]) > len(self.subtree[pi]):
            pi, pj = pj, pi

        # relabel every member of the smaller set, then merge the member lists
        for node in self.subtree[pj]:
            self.root[node] = pi
        self.subtree[pi] += self.subtree[pj]
        del self.subtree[pj]
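
The class above keeps an explicit member list per set, so `find` is a single dictionary lookup and `union` relabels the smaller set. A common alternative stores only parent pointers and compresses paths during `find`. The sketch below shows that variant for comparison; the class name `ForestUnionFind` and its `count` attribute are hypothetical and are not used elsewhere in this notebook.

```python
class ForestUnionFind:
    """Union-find with union by size and path compression (illustrative sketch)."""

    def __init__(self, nodes):
        self.parent = {n: n for n in nodes}
        self.size = {n: 1 for n in nodes}
        self.count = len(self.parent)  # current number of disjoint sets

    def find(self, node):
        # First pass: locate the root.
        root = node
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[node] != root:
            nxt = self.parent[node]
            self.parent[node] = root
            node = nxt
        return root

    def union(self, a, b):
        """Merge the sets of a and b; return True if a merge happened."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        self.count -= 1
        return True


uf = ForestUnionFind("abcde")
uf.union("a", "b")
uf.union("c", "d")
uf.union("b", "d")
print(uf.count)  # 2
```

After the three unions, `a`, `b`, `c`, and `d` share one set and `e` remains alone.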

In [11]:
def clustering(graph, k):
    """ compute the maximum spacing of a k-clustering

    graph is a list of (u, v, cost) edges
    """
    nodes = set()
    for u, v, d in graph:
        nodes.add(u)
        nodes.add(v)

    group = UnionFind(nodes)
    # scan the edges in order of increasing cost
    edges = iter(sorted(graph, key=lambda x: x[2]))

    # merge the closest pairs until only k clusters remain
    while len(group.subtree) > k:
        u, v, d = next(edges)
        group.union(u, v)

    # the spacing is the cost of the cheapest remaining edge that crosses two clusters;
    # skip edges whose endpoints are both in the same cluster
    for u, v, d in edges:
        if group.find(u) != group.find(v):
            return d

    # no crossing edge is left (for example when k == 1)
    return None

In [12]:
input_file = '/workspace/Algorithms/clustering1.txt'

with open(input_file, 'r') as f:
    lines = f.readlines()

# skip the first line; each remaining line holds "node1 node2 cost"
G = [tuple(int(x) for x in line.split()) for line in lines[1:]]
cost = clustering(G, 4)
cost

106

Next, we will cluster binary data points by Hamming distance. For example, the third line of the file, "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1", denotes the 24 bits associated with node #2. The distance between two nodes u and v in this problem is the Hamming distance between their labels, that is, the number of bit positions in which the labels differ. For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3, since they differ in the 3rd, 7th, and 21st bits.

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3? Equivalently, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common gets split into different clusters? Any two nodes at Hamming distance at most 2 must therefore share a cluster. Two members of the same cluster can still be farther apart than 2 when they are linked through a chain of close neighbors.
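
The worked example can be checked directly: packing each label into an integer and counting the set bits of the XOR gives the Hamming distance. The helper names `label_to_int` and `hamming_distance` are introduced here only for illustration.

```python
def label_to_int(label):
    """Pack a space-separated bit string into an integer."""
    return int("".join(label.split()), 2)


def hamming_distance(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")


node2 = "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1"
other = "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1"

distance = hamming_distance(label_to_int(node2), label_to_int(other))
print(distance)  # 3
```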

In [13]:
from collections import defaultdict
from itertools import combinations

In [14]:



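def hamming1(num):
    """ return the numbers obtained by flipping one of the low num.bit_length() bits of num """
    # Bits above num.bit_length() are never flipped here. A neighbor that has such a bit
    # set has a longer bit length, so the pair is found when that neighbor is processed.
    masks = [1 << i for i in range(num.bit_length())]
    code = [num ^ mask for mask in masks]
    return code


def hamming2(num):
    """ return the numbers obtained by flipping two of the low num.bit_length() bits of num """
    # The same symmetry argument as in hamming1 covers the higher bit positions.
    masks = [(1 << i) ^ (1 << j) for (i, j) in combinations(range(num.bit_length()), 2)]
    code = [num ^ mask for mask in masks]
    return code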
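
The two helpers above take the mask range from `num.bit_length()`. A variant with an explicit width makes the candidate count easy to see: for 24-bit labels each node has 24 neighbors at distance 1 and 276 at distance 2, so only 300 lookups per node are needed instead of a comparison against every other node. The function name `neighbors_within_2` and the default `bits=24` (taken from the 24-bit labels described above) are assumptions for this sketch.

```python
from itertools import combinations


def neighbors_within_2(num, bits=24):
    """All integers that differ from num in exactly one or two of the low `bits` bits."""
    one = [num ^ (1 << i) for i in range(bits)]
    two = [num ^ (1 << i) ^ (1 << j) for i, j in combinations(range(bits), 2)]
    return one + two


candidates = neighbors_within_2(0)
print(len(candidates))  # 300
```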

In [15]:
def clustering2(nodes):
    """ merge all nodes within Hamming distance 2 of each other and return the number of clusters """
    # nodes should support fast membership tests (a set or a dict keyed by label)
    clusters = UnionFind(nodes)
    for num in nodes:
        # every label at distance 1 or 2 that is present joins the cluster of num
        for code in hamming1(num) + hamming2(num):
            if code in nodes:
                clusters.union(num, code)

    return len(clusters.subtree)

In [16]:



input_file = '/workspace/Algorithms/clustering_big.txt'

with open(input_file, 'r') as f:
    lines = f.readlines()

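# map each distinct label (as an integer) to the indices of the lines that carry it;
# identical labels are at distance 0 and collapse into a single key
graph = defaultdict(list)
for i, line in enumerate(lines[1:]):
    num = int(''.join(line.split()), 2)
    graph[num].append(i)
num_clusters = clustering2(graph)
num_clusters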

6118