# Disjoint Set Union (DSU) / Union Find

- The DSU is a data structure that tracks a collection of disjoint sets and merges them as elements are found to be connected or equivalent

- This algorithm is useful for answering questions like:
    1. Are there cycles in my (undirected) graph?
    2. Given a set of nodes and edges, which nodes are reachable from each other (i.e. in the same set)?
    3. How do I group related things into sets?
        - in image processing, you can merge pixels belonging to the same object into one set
        - in social networks, you can group people who are in the same network
        - in clustering, you can union intervals, strings, etc.

- DSU assumes the relationships between nodes are **undirected**: if A is linked to B, then B is linked to A.

## Idea

- Inputs:
    - Nodes: You have some collection of elements
    - Edges: You have some set of edges indicating relationships between the nodes

- Given these nodes and edges, DSU will help you answer:
    1. Which nodes are linked to each other, and which are not?
    2. Which nodes can reach each other, and which cannot?
    3. When new edges appear over time, which components merge? DSU updates the grouping incrementally, so the relationships between nodes stay current

- For all $N$ nodes:
    - Maintain an array of size $N$ called `parents`, where index `i` holds the parent of node `i`. If a node is a root, its entry in `parents` is the node itself.
    - Maintain an array of size $N$ called `size`, where index `i` holds the number of nodes in the set whose root is node `i`
    - Each node has exactly one parent
    - `size` is only meaningful for root nodes; the entries for non-root nodes are never read
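
As a minimal sketch of the two arrays, assuming nodes are labelled `0..N-1`: the values below are set by hand purely for illustration, not produced by running the algorithm.

```python
N = 5

# Initially every node is its own root, so each set has exactly one member.
parents = list(range(N))   # parents[i] == i means node i is a root
size = [1] * N             # size[i] is only meaningful while i is a root

# Hand-built illustration: attach nodes 1 and 2 directly under root 0.
parents[1] = 0
parents[2] = 0
size[0] = 3                # root 0 now represents the set {0, 1, 2}

# A node is a root exactly when it is its own parent.
roots = [i for i in range(N) if parents[i] == i]
print(roots)                     # [0, 3, 4]
print([size[r] for r in roots])  # [3, 1, 1]
```

Note that the sizes stored at the roots always add up to `N`, since every node belongs to exactly one set.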

- Define a method that joins nodes for each edge encountered. We'll call this `union()`
    - Given 2 singleton nodes $A, B$, and an edge $E$ between them
        - Both sets have size 1, so either node can become the parent. Say we pick `A` as `B`'s parent
        - Then `parents[B] = A`
        - Then `size[A] += 1`
    - Given 2 non-singleton nodes $A, B$, and an edge $E$ between them
        - `find()` the root of $A$, and `find()` the root of $B$
            - Each step up the tree only moves one level, so a naive `find()` can take many steps before it reaches the root
            - To avoid this, whenever `find()` is called, repoint every node along the search path directly at the root
            - This is known as **path compression**
        - If the roots match, $A$ and $B$ are already in the same set, so the edge would close a cycle. Skip it
        - If the roots don't match, call the 2 roots $P_A$ and $P_B$
        - If `size[P_A] > size[P_B]`, set `parents[P_B] = P_A`, then `size[P_A] += size[P_B]`
        - Otherwise, set `parents[P_A] = P_B`, then `size[P_B] += size[P_A]`
        - Attaching the smaller set under the larger one is known as **union by size**; it keeps the trees shallow
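
To make path compression concrete, here is a hedged sketch of an iterative `find()`. The function signature (passing `parents` in as an argument) is an assumption made for illustration. A recursive formulation works equally well; the loop version simply avoids hitting Python's recursion limit on very long chains.

```python
def find(parents: list[int], node: int) -> int:
    """Return the root of `node`, flattening the path walked along the way."""
    # First pass: climb until we reach a node that is its own parent.
    root = node
    while parents[root] != root:
        root = parents[root]

    # Second pass: point every node on the path directly at the root.
    while parents[node] != root:
        parents[node], node = root, parents[node]

    return root


# A worst-case chain: 3 -> 2 -> 1 -> 0, where 0 is the root.
parents = [0, 0, 1, 2]
print(find(parents, 3))  # 0
print(parents)           # [0, 0, 0, 0]
```

After the single call, nodes 2 and 3 both point straight at the root, so later lookups on them take one step.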




```python
import numpy as np

N_NODES = 10
N_EDGES = 5

# Draw random edges (lo, hi) with 0 <= lo < hi < N_NODES, so there are no self-loops.
# Duplicates collapse in the set, so EDGES may hold fewer than N_EDGES pairs.
hi = np.random.randint(1, N_NODES, size=N_EDGES)
lo = np.random.randint(hi, size=N_EDGES)
EDGES = {(int(a), int(b)) for a, b in zip(lo, hi)}

PARENT = list(range(N_NODES))  # every node starts as its own parent (root)
SIZE = [1] * N_NODES           # every set starts with size 1


def find_parent(node: int) -> int:
    """Return the root of `node`, compressing the path along the way."""
    if PARENT[node] != node:
        PARENT[node] = find_parent(PARENT[node])
    return PARENT[node]


def union(node1: int, node2: int) -> bool:
    """Merge the sets containing the two nodes; return False if already merged."""
    p1, p2 = find_parent(node1), find_parent(node2)
    if p1 == p2:
        # Cannot union because these already belong to the same set
        print(f'CANNOT UNION BECAUSE {node1=} AND {node2=} ARE ALREADY IN THE SAME SET')
        return False

    # Union by size: attach the smaller root under the larger one
    if SIZE[p1] <= SIZE[p2]:
        PARENT[p1] = p2
        SIZE[p2] += SIZE[p1]
    else:
        PARENT[p2] = p1
        SIZE[p1] += SIZE[p2]

    return True


for n1, n2 in EDGES:
    union(n1, n2)

PARENT, SIZE
```

Example output from one run (the edges are random, so results vary):

```
([0, 5, 4, 3, 5, 5, 5, 7, 5, 9], [1, 1, 1, 1, 2, 6, 1, 1, 1, 1])
```
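
The two questions from the start of these notes (is there a cycle, and which nodes share a set) can be sketched on top of the same structure. The helper names `make_dsu`, `has_cycle` and `components` are hypothetical, chosen for this example. The `find` here uses path halving, a one-pass variant of path compression, rather than the recursive version above.

```python
from collections import defaultdict


def make_dsu(n: int):
    """Build a fresh DSU over nodes 0..n-1 and return its (find, union) functions."""
    parent = list(range(n))
    size = [1] * n

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving: skip one level
            x = parent[x]
        return x

    def union(a: int, b: int) -> bool:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                   # already in the same set
        if size[ra] < size[rb]:
            ra, rb = rb, ra                # make ra the larger root
        parent[rb] = ra
        size[ra] += size[rb]
        return True

    return find, union


def has_cycle(n: int, edges) -> bool:
    """True if some undirected edge joins two nodes that are already connected."""
    _, union = make_dsu(n)
    return any(not union(a, b) for a, b in edges)


def components(n: int, edges) -> list[list[int]]:
    """Group nodes 0..n-1 by the set they end up in."""
    find, union = make_dsu(n)
    for a, b in edges:
        union(a, b)
    groups = defaultdict(list)
    for node in range(n):
        groups[find(node)].append(node)
    return sorted(groups.values())


edges = [(0, 1), (1, 2), (3, 4)]
print(has_cycle(5, edges))             # False
print(has_cycle(5, edges + [(2, 0)]))  # True
print(components(5, edges))            # [[0, 1, 2], [3, 4]]
```

One caveat of this sketch: a duplicated edge or a self-loop is also reported as a cycle, so deduplicate the edge list first if that is not the intended meaning.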