
# Kruskal's MST for Clustering and Advanced Union-Find


In [1]:
import numpy as np
import random
import urllib3
import math
from collections import defaultdict
import time
import heapq
from heapq import heappush, heappop, heapify
import pandas as pd
import sys
from itertools import count

## Union Find

In [4]:
# Union-find (disjoint-set) structure with path compression and union by rank
class UnionFind():
  def __init__(self, n):
    self.parent = [i for i in range(n)]
    self.rank = [0 for _ in range(n)]
    self.n = n

  def find(self, x):
    # path compression: When finding the root r of the tree containing x,
    # change the parent pointer of all nodes along the path to point directly to r
    if x != self.parent[x]:
        self.parent[x] = self.find(self.parent[x])
    return self.parent[x]

  def union(self, x, y):
    # Maintain an integer rank for each node, initially 0. Link root of
    # smaller rank to root of larger rank; if tie, increase rank of larger root by 1.
    px, py = self.find(x), self.find(y)
    # If x and y are already in the same set, do nothing
    if px != py:
      if self.rank[px] > self.rank[py]:
        self.parent[py] = px
      elif self.rank[px] < self.rank[py]:
        self.parent[px] = py
      else:
        self.parent[py] = px
        self.rank[px] += 1
      self.n -= 1

  def getCount(self):
    return self.n
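
As a quick illustration of how such a structure behaves, here is a minimal standalone sketch. `MiniUnionFind` is a hypothetical name, a condensed restatement of the class above so that the cell runs on its own; it is not a replacement for it.

```python
class MiniUnionFind:
    """Condensed union-find with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.n = n  # number of disjoint sets

    def find(self, x):
        if x != self.parent[x]:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, x, y):
        px, py = self.find(x), self.find(y)
        if px == py:
            return  # already in the same set
        if self.rank[px] < self.rank[py]:
            px, py = py, px  # make px the root of larger (or equal) rank
        self.parent[py] = px
        if self.rank[px] == self.rank[py]:
            self.rank[px] += 1
        self.n -= 1


uf_demo = MiniUnionFind(6)
uf_demo.union(0, 1)
uf_demo.union(2, 3)
uf_demo.union(1, 3)  # merges {0, 1} with {2, 3}
uf_demo.union(0, 2)  # already connected: no change

same_set = uf_demo.find(0) == uf_demo.find(3)
different_set = uf_demo.find(0) != uf_demo.find(4)
num_sets = uf_demo.n  # remaining sets: {0, 1, 2, 3}, {4}, {5}
print(same_set, different_set, num_sets)
```

The set counter only decreases when two distinct roots are linked, which is what lets the clustering code below stop as soon as exactly k sets remain.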

## Programming Assignment: Part 1


In this programming problem and the next you'll code up the clustering algorithm from lecture for computing a max-spacing k-clustering.

Download the text file below - **clustering1.txt**

This file describes a distance function (equivalently, a complete graph with edge costs).  It has the following format:

[number_of_nodes]

[edge 1 node 1] [edge 1 node 2] [edge 1 cost]

[edge 2 node 1] [edge 2 node 2] [edge 2 cost]

...

There is one edge (i,j) for each choice of 1≤i<j≤n, where n is the number of nodes.

For example, the third line of the file is "1 3 5250", indicating that the distance between nodes 1 and 3 (equivalently, the cost of the edge (1,3)) is 5250.  You can assume that distances are positive, but you should NOT assume that they are distinct.

Your task in this problem is to run the clustering algorithm from lecture on this data set, where the target number k of clusters is set to 4.  What is the maximum spacing of a 4-clustering?
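
To make the notion of spacing concrete, the sketch below computes it directly from the definition: the smallest cost among edges whose endpoints fall in different clusters. The helper `spacing` and the four-node graph are made up for illustration and are not part of the assignment data.

```python
def spacing(edges, cluster_of):
    """Smallest cost among edges whose endpoints lie in different clusters."""
    return min(cost for u, v, cost in edges if cluster_of[u] != cluster_of[v])


# Hypothetical complete graph on 4 points: (node, node, cost)
toy_edges = [(0, 1, 1), (0, 2, 7), (0, 3, 8), (1, 2, 6), (1, 3, 9), (2, 3, 2)]

# Two candidate 2-clusterings of the same points
good = {0: "a", 1: "a", 2: "b", 3: "b"}
bad = {0: "a", 1: "b", 2: "b", 3: "b"}

print(spacing(toy_edges, good), spacing(toy_edges, bad))
```

Grouping the two close pairs together gives the larger spacing, which is the quantity the clustering algorithm is meant to maximize.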

### Single Link Clustering function using Kruskal's Algorithm

In [21]:
def single_link_clustering(EdgeList, k):
  """Return the maximum spacing of a k-clustering (single-link, via Kruskal).

  Nodes are assumed to be labelled 0..n-1; each edge is [node1, node2, cost].
  """
  sorted_edges = sorted(EdgeList, key=lambda x: x[2])
  # Collect the distinct node labels to size the UnionFind structure
  unique_nodes = set()
  for node1, node2, _ in sorted_edges:
    unique_nodes.update((node1, node2))
  uf = UnionFind(len(unique_nodes))
  max_spacing = 0
  for node1, node2, weight in sorted_edges:
    # Only edges that join two different clusters matter
    if uf.find(node1) != uf.find(node2):
      # With exactly k clusters left, the cheapest remaining crossing edge
      # is the spacing of the clustering
      if uf.getCount() == k:
        max_spacing = weight
        break
      uf.union(node1, node2)
  return max_spacing


### Testcases to test single-link clustering

In [22]:
## Test 1
EdgeList = [[0, 1, 1], [0, 2, 2], [0, 3, 4], [0, 4, 5], [1, 2, 4], [1, 3, 3], [1, 4, 6], [2, 3, 1], [2, 4, 7], [3, 4, 8]]
max_spacing = single_link_clustering(EdgeList, 2)
assert(max_spacing==5)
max_spacing = single_link_clustering(EdgeList, 3)
assert(max_spacing==2)
max_spacing = single_link_clustering(EdgeList, 4)
assert(max_spacing==1)

#### Assignment Part 1

In [25]:
df = pd.read_table('https://d3c33hcgiwev3.cloudfront.net/_fe8d0202cd20a808db6a4d5d06be62f4_clustering1.txt?Expires=1713916800&Signature=Owya0c2Ur4p4vMlnUMw-GjU3zTVF4BPxPSggFXUXhQXgBnaOHs-QY5tA0fKyhhQwqDVPG-~wkHrDBkFhFAQvVpB2Su-dZ8WlmZyfcF2gvumgT7LZpXZ1F1Nc7CfsPYh2KEAb4D93G43Y0k25a5GNlt4trZHoCtFon1hyvZ4i0PE_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A',header=None)

In [26]:
num_nodes = df[0].iloc[0]
print(num_nodes)


500


In [27]:
df = df.drop(df.index[0])
df.head()

Unnamed: 0,0
1,1 2 6808
2,1 3 5250
3,1 4 74
4,1 5 3659
5,1 6 8931


In [29]:
edgeArray = df[0].to_list()
edgeArray[:5]

['1 2 6808', '1 3 5250', '1 4 74', '1 5 3659', '1 6 8931']

In [34]:
EdgeList = []
for line in edgeArray:
  # Parse "u v cost" and shift node labels from 1-based to 0-based
  u, v, cost = (int(x) for x in line.split())
  EdgeList.append([u - 1, v - 1, cost])

print(EdgeList[:5])


[[0, 1, 6808], [0, 2, 5250], [0, 3, 74], [0, 4, 3659], [0, 5, 8931]]


In [35]:
max_spacing = single_link_clustering(EdgeList, 4)
print(max_spacing)

106


Answer is: 106

## Programming Assignment: Part 2

In this question your task is again to run the clustering algorithm from lecture, but on a MUCH bigger graph.  So big, in fact, that the distances (i.e., edge costs) are only defined implicitly, rather than being provided as an explicit list.

The data set is below -- **clustering_big.txt**

The format is:

[# of nodes] [# of bits for each node's label]

[first bit of node 1] ... [last bit of node 1]

[first bit of node 2] ... [last bit of node 2]

...

For example, the third line of the file "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1" denotes the 24 bits associated with node #2.

The distance between two nodes u and v in this problem is defined as the Hamming distance--- the number of differing bits --- between the two nodes' labels.  For example, the Hamming distance between the 24-bit label of node #2 above and the label "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1" is 3 (since they differ in the 3rd, 7th, and 21st bits).
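
The worked example above can be checked with a few lines of Python. This is a minimal sketch; `label_to_int` and `hamming` are illustrative helper names, not functions from the assignment.

```python
def label_to_int(bits):
    """Parse a space-separated bit string such as "0 1 1 0" as a binary number."""
    return int(bits.replace(" ", ""), 2)


def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")


node2 = "0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1"
other = "0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1"

dist = hamming(label_to_int(node2), label_to_int(other))
print(dist)  # 3
```

Storing each label as an integer makes the distance a single XOR followed by a bit count.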

The question is: what is the largest value of k such that there is a k-clustering with spacing at least 3?  That is, how many clusters are needed to ensure that no pair of nodes with all but 2 bits in common get split into different clusters?

NOTE: The graph implicitly defined by the data file is so big that you probably can't write it out explicitly, let alone sort the edges by cost.  So you will have to be a little creative to complete this part of the question.  For example, is there some way you can identify the smallest distances without explicitly looking at every pair of nodes?
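
One possible way to act on this hint, sketched under the assumption that labels are stored as integers in a set or dictionary: flipping one or two bits of a label (an XOR with a mask) yields every label at Hamming distance 1 or 2, so only those candidates need to be looked up. The names `masks_within_two` and `close_labels` are hypothetical.

```python
from itertools import combinations


def masks_within_two(n_bits):
    """All masks with exactly one or exactly two bits set."""
    one_bit = [1 << i for i in range(n_bits)]
    two_bits = [(1 << i) | (1 << j) for i, j in combinations(range(n_bits), 2)]
    return one_bit + two_bits


def close_labels(label, present, masks):
    """Labels in `present` at Hamming distance 1 or 2 from `label`."""
    return sorted(label ^ m for m in masks if (label ^ m) in present)


masks4 = masks_within_two(4)  # 4 one-bit masks + 6 two-bit masks
present = {0b0000, 0b0001, 0b0011, 0b0111, 0b1111}
near_zero = close_labels(0b0000, present, masks4)

print(len(masks_within_two(24)), near_zero)
```

For 24-bit labels this is 24 + 276 = 300 lookups per distinct label, rather than a comparison against every other node.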


In [51]:
# Masks with exactly one bit set: XOR with one of these flips a single bit,
# i.e. produces a label at Hamming distance 1
maskArray = [1 << i for i in range(24)]
print(len(maskArray))


24


In [47]:
from itertools import combinations

def create_all_integers_with_two_bits_1(n):
  """Creates a list of all n-bit integers that have exactly two bits set.

  Args:
    n: The number of bits in the integers.

  Returns:
    A sorted list of the n*(n-1)/2 integers with exactly two 1 bits.
  """
  # Choose the two bit positions directly instead of scanning all 2**n integers
  return sorted((1 << i) | (1 << j) for i, j in combinations(range(n), 2))

In [52]:
maskHam2= create_all_integers_with_two_bits_1(24)
print(len(maskHam2))

276


#### Combine the masks for Hamming distance 1 and Hamming distance 2 into maskArray

In [53]:
maskArray = maskArray + maskHam2
print(len(maskArray))

300


#### Read the clustering_big.txt

In [58]:
df = pd.read_table('https://d3c33hcgiwev3.cloudfront.net/_fe8d0202cd20a808db6a4d5d06be62f4_clustering_big.txt?Expires=1714003200&Signature=RBMuv5P1PAv5IGTSOuum0QHn-MmIxqTOfJDsGNtzcdcxsNtWwbHPZpGcNEHqUha31jFF5ZyiEjCR1Bh2wpcoYhYjIuLQ36WB~5DBPDuIcWXnJCuq2w8-LwrLXp2HHLldjYJrAJmjRafn86HzAcKNoiRvSZSmZjrTjLP8qCyfiMg_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A', header=None)
df.head()

Unnamed: 0,0
0,200000 24
1,1 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1
2,0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1
3,0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1
4,1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 0 0


In [59]:
df = df.drop(index=0)
df.head()

Unnamed: 0,0
1,1 1 1 0 0 0 0 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1
2,0 1 1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 1
3,0 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 1
4,1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 0 1 0 0 1 1 0 0 0
5,0 1 0 0 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 0


In [60]:
# Now work on the df
inArray = df[0].to_list()
print(len(inArray))


200000


In [96]:
# Map each distinct label (as an integer) to the set of node indices carrying it
labelDict = defaultdict(set)
for i, line in enumerate(inArray):
  # Join the space-separated bits and parse them as a binary number
  label = int("".join(line.split()), 2)
  labelDict[label].add(i)


In [97]:
print(len(labelDict))

198788


In [99]:
uf = UnionFind(len(inArray))
for key, value in labelDict.items():
  # Pick one representative node for this label
  node1 = next(iter(value))
  # Nodes sharing a label are at Hamming distance 0: merge them with the representative
  for i in value:
    uf.union(node1, i)
  # XOR with each mask gives every label at Hamming distance 1 or 2; merge if present
  for mask in maskArray:
    if key ^ mask in labelDict:
      node2 = next(iter(labelDict[key ^ mask]))
      uf.union(node1, node2)

print(uf.getCount())

6118


Answer: 6118
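
A result like this can be sanity-checked on a small input where comparing every pair is still feasible. The sketch below is an assumption-laden illustration, not part of the assignment: `count_clusters` is a hypothetical helper with its own tiny union-find, and the labels are random 10-bit integers rather than the real data. It merges all nodes at Hamming distance at most 2 in two ways (mask lookups versus brute-force pairs) and compares the cluster counts.

```python
import random
from itertools import combinations


def count_clusters(labels, n_bits, brute_force):
    """Count clusters after merging all nodes at Hamming distance <= 2."""
    parent = list(range(len(labels)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    if brute_force:
        # Compare every pair of nodes explicitly
        for i, j in combinations(range(len(labels)), 2):
            if bin(labels[i] ^ labels[j]).count("1") <= 2:
                union(i, j)
    else:
        # Keep one representative node per distinct label
        first_node = {}
        for i, lab in enumerate(labels):
            if lab in first_node:
                union(i, first_node[lab])  # identical labels: distance 0
            else:
                first_node[lab] = i
        masks = [1 << i for i in range(n_bits)]
        masks += [(1 << i) | (1 << j) for i, j in combinations(range(n_bits), 2)]
        for lab, i in first_node.items():
            for m in masks:
                j = first_node.get(lab ^ m)
                if j is not None:
                    union(i, j)
    return len({find(i) for i in range(len(labels))})


rng = random.Random(0)
labels = [rng.getrandbits(10) for _ in range(200)]
fast = count_clusters(labels, 10, brute_force=False)
slow = count_clusters(labels, 10, brute_force=True)
print(fast == slow)
```

Agreement on small random inputs does not prove the full-size answer, but it does catch mistakes in the mask construction or in the handling of duplicate labels.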