### Trees

- As discussed previously, a naive implementation of a disjoint set leads to $O(N)$ run time for the `Union` operation
    - We tried linked lists to resolve this issue, but while that fixes `Union`, the other operation `Find` becomes $O(N)$ instead

- We now explore how representing each set as a rooted tree can help resolve this problem
    - Let each set be a rooted tree
    - Let the ID of the set be the root of the tree
    - Define a separate array `parent` such that `parent[i]` is the parent of `i`

- Assume the same 3 sets again {5}, {6,8,1}, {9,3,2,4,7}
    - Define 3 trees: 
        - 5
        - 6 
            - 8
            - 1
        - 4
            - 2
            - 3
            - 9
                - 7
    - Then `parent` array is $[6,4,4,4,5,6,9,6,4]$
        - Remember, array[0] refers to the parent of the value 1, which is 6
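As a quick check of the example, the following minimal sketch follows parent pointers up to the root for every value. It assumes a layout slightly different from the one above: index 0 is an unused placeholder so that `parent[i]` can be indexed by the value `i` directly. The helper name `find_root` is hypothetical.

```python
# Parent array for the sets {5}, {6,8,1}, {9,3,2,4,7}.
# Assumption: index 0 is an unused placeholder, so parent[i] is the parent of value i.
parent = [None, 6, 4, 4, 4, 5, 6, 9, 6, 4]


def find_root(parent, value):
    """Follow parent pointers until reaching a node that is its own parent."""
    while parent[value] != value:
        value = parent[value]
    return value


# Map every value to the ID (root) of the set it belongs to.
roots = {value: find_root(parent, value) for value in range(1, 10)}
print(roots)
```

Values that share a root belong to the same set, so comparing `find_root` results is enough to answer "are these two values in the same set?".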

- Representing the sets as trees gives us a shortcut for `Union`
    - We can perform `Union` simply by pointing the root of one tree to the root of the other!

- Using the example above, let's suppose we call `Union(3, 8)`. This gives us 2 options. We can union the second tree (rooted at 6) into the third (rooted at 4), or the other way around. 

- For ease of traversal (to minimise tree height), we always hang the shorter tree under the root of the taller one. So now the trees are:
    - 5
    - 4
        - 2
        - 3
        - 9
            - 7
        - 6 
            - 8
            - 1
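The merge above can be sketched directly on the `parent` array. This is only an illustration under stated assumptions, not the final implementation: the helper names `find_root`, `height` and `union_by_height` are hypothetical, index 0 is an unused placeholder, and the height is recomputed by brute force purely for clarity.

```python
# Assumption: index 0 is an unused placeholder, so parent[i] is the parent of value i.
parent = [None, 6, 4, 4, 4, 5, 6, 9, 6, 4]


def find_root(parent, value):
    """Follow parent pointers until reaching a node that is its own parent."""
    while parent[value] != value:
        value = parent[value]
    return value


def height(parent, root):
    """Brute-force height: the longest parent-pointer chain that ends at `root`."""
    best = 0
    for value in range(1, len(parent)):
        depth, curr = 0, value
        while parent[curr] != curr:
            curr = parent[curr]
            depth += 1
        if curr == root:
            best = max(best, depth)
    return best


def union_by_height(parent, a, b):
    """Point the root of the shorter tree at the root of the taller tree."""
    root_a, root_b = find_root(parent, a), find_root(parent, b)
    if root_a == root_b:
        return  # already in the same set
    if height(parent, root_a) < height(parent, root_b):
        root_a, root_b = root_b, root_a
    # root_a now has the taller (or equally tall) tree, so root_b hangs under it.
    parent[root_b] = root_a


union_by_height(parent, 3, 8)
print(parent[1:])  # [6, 4, 4, 4, 5, 4, 9, 6, 4]
```

Only one entry changes: the old root 6 now points at 4, so every value that used to resolve to 6 resolves to 4.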

In [1]:
class DisjointSetTree:
    def __init__(self, input_array):
        self.input_array = input_array
        self.parents = [None] * len(input_array)

    def make_set(self, value):
        '''
        O(1): a new singleton set is its own root
        '''
        self.parents[value] = value

    def find(self, value):
        '''
        O(height of the tree): follow parent pointers up to the root.
        Without a merging rule the height can grow to N; hanging the shorter
        tree under the taller one keeps it at most log2(N)
        '''
        curr_value = value
        while self.parents[curr_value] != curr_value:
            curr_value = self.parents[curr_value]
        return curr_value


### Union by rank

- As discussed, it is better to hang a shorter tree under the root of the taller one, to minimise tree height

- To know the height of a tree without traversing it, we will store the height of each subtree in an array `rank`, where `rank[i]` is the height of the subtree with root at $i$

- This is known as the union by rank heuristic

- Note that for any node $i$ at any point, the value `rank[i]` is equal to the height of the subtree rooted at $i$

In [23]:
class DisjointSetTree:
    def __init__(self, max_size):
        self.parents = [None] * max_size
        self.rank = [None] * max_size

    def make_set(self, value):
        '''
        O(1)
        '''
        self.parents[value] = value
        self.rank[value] = 0

    def find_group_id(self, value):
        '''
        O(log(N)) because, with union by rank, a tree of height k has at least 2^k nodes, so there are at most log2(N) levels to traverse
        '''
        curr_value = value
        while self.parents[curr_value] != curr_value:
            curr_value = self.parents[curr_value]
        return curr_value

    def union(self, value1, value2):
        '''
        O(log(N)) because of the `find_group_id` operation. Otherwise, everything else is actually O(1)
        '''
        ## Get the group IDs of both requested values
        value1_group_id = self.find_group_id(value1)
        value2_group_id = self.find_group_id(value2)
        
        ## If the group IDs are the same, then the values are already in the same group
        if value1_group_id == value2_group_id:
            return

        ## If the group IDs are different, check which root has the higher rank, and hang the lower-rank root under the higher-rank root
        if self.rank[value1_group_id] > self.rank[value2_group_id]:
            self.parents[value2_group_id] = value1_group_id
        
        elif self.rank[value1_group_id] < self.rank[value2_group_id]:
            self.parents[value1_group_id] = value2_group_id
        
        ## If the group ranks are equal, then arbitrarily pick the group ID of the second one as the parent, and increment the rank of the second group ID by 1
        else:
            self.parents[value1_group_id] = value2_group_id
            self.rank[value2_group_id] += 1

dst = DisjointSetTree(max_size=10)
for x in range(1, 7):
    dst.make_set(x)
dst.union(2,4)
dst.union(5,2)
dst.union(3,1)
dst.union(2,3)
# dst.union(2,6)

display(dst.parents)
display(dst.rank)

[None, 1, 4, 1, 1, 4, 6, None, None, None]

[None, 2, 0, 0, 1, 0, 0, None, None, None]

### Path compression

- Recap
    - We started by representing the sets as linked lists, incurring O(1)/O(N) performance for `Union`/`Find`
    - We next discussed representing the sets as trees with union by rank, incurring O(log(N))/O(log(N)) performance for `Union`/`Find`

- Notice that, in the tree representation, `Union` is O(log(N)) because it relies on `Find`, which is itself O(log(N))
    - We now talk about an approach which transforms the work of `Union`/`Find` into **almost** constant time by performing `Find` more efficiently

- In the `Find` algorithm, we traverse the path of a given node to its root
    - What if, while traversing, we re-point every node on the path directly at the root?
    - For example, imagine we have this path up a tree `2 --> 4 --> 6`
    - Instead of merely returning the root 6, `Find` modifies the `parent` array along the way such that `parent[2] = 6` and `parent[4] = 6` as well
    - Through this, every find operation flattens the tree!
    - This is known as `path compression`

In [None]:
def find_group_id(self, value):
    '''
    Recursively call find_group_id until we hit the case where value == self.parents[value] (the root),
    re-pointing every node on the path directly at the root on the way back
    '''
    if self.parents[value] != value:
        self.parents[value] = self.find_group_id(self.parents[value])
    return self.parents[value]
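
To see the flattening on the `2 --> 4 --> 6` path from the example, here is a small self-contained sketch. The class name `CompressedDisjointSet` is hypothetical, and the class stores parent pointers only (no `rank`) to keep the illustration short.

```python
class CompressedDisjointSet:
    """Hypothetical minimal class: parent pointers only, every value starts as its own root."""

    def __init__(self, max_size):
        self.parents = list(range(max_size))

    def find_group_id(self, value):
        # Recurse up to the root, then re-point each node on the path at the root.
        if self.parents[value] != value:
            self.parents[value] = self.find_group_id(self.parents[value])
        return self.parents[value]


ds = CompressedDisjointSet(7)
# Build the path 2 --> 4 --> 6 by hand.
ds.parents[2] = 4
ds.parents[4] = 6

print(ds.find_group_id(2))           # 6
print(ds.parents[2], ds.parents[4])  # 6 6
```

After a single call, both 2 and 4 point straight at the root, so later lookups for either value take one step.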

### Analysis of algorithm

- Let's first discuss the definition of iterated logarithm
    - The iterated logarithm of some value $n$ is the number of times the (base-2) logarithm function needs to be applied to $n$ before the result is less than or equal to 1
    $$ \log^{*} n = \begin{cases} 0 & \text{if } n \le 1 \\ 1 + \log^{*}(\log n) & \text{if } n > 1 \end{cases} $$

    - Practically speaking, for all practical values of $n$, the $\log^*$ function is at most 5 (it only exceeds 5 once $n > 2^{65536}$)
        - That is, if I take the log of any such number 5 times, the result will be at most 1
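The definition translates almost directly into code. The sketch below assumes base-2 logarithms; the function name `iterated_log2` is hypothetical.

```python
import math


def iterated_log2(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


# Labels are used for printing because 2**65536 has almost 20,000 digits.
samples = [("1", 1), ("2", 2), ("4", 4), ("16", 16), ("65536", 65536), ("2**65536", 2 ** 65536)]
for label, n in samples:
    print(f"log*({label}) = {iterated_log2(n)}")
```

The values grow extremely slowly: going from 4 to 5 requires jumping from $2^{16}$ to $2^{65536}$.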

- Bounds of the Disjoint set data structure
    - Assume the disjoint set is initially empty
    - Assume we apply both the path compression and union by rank heuristics
    - Assume we make a sequence of $m$ operations, including $n$ calls to `MakeSet`
    - Then the total running time is $O(m \log^*(n))$
    - This gives an amortized time for a single operation of $O(\log^*(n))$
    - As established above, $\log^*(n) \le 5$ for all practical values of $n$
    - Hence, disjoint set operations are effectively amortized constant time!

### Proof that disjoint set's union-find is almost constant time

- $\text{Height} \le \text{Rank}$
    - When using path compression, the `rank` array no longer represents the exact height of the tree (since `find` re-points nodes directly at the root as it goes along, which can make the tree shorter)
    - BUT `rank` is still an upper bound on the height of the tree
    - And path compression never changes any `rank` value; in particular, the `rank` of a root is unaffected

- Properties of the forest
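    - There are at most $\frac{n}{2^k}$ nodes of rank $k$
        - The trees are not necessarily binary. Instead, a root only reaches rank $k$ by merging two trees whose roots both have rank $k-1$, so by induction a node of rank $k$ has at least $2^k$ nodes in its subtree at the moment it gets that rank
        - No node is counted under two different rank-$k$ nodes, so with $n$ total nodes there can be at most $\frac{n}{2^k}$ such nodes
    - For any given non-root node $i$, $\text{rank}[i] \lt \text{rank}[\text{parent}[i]]$
        - That is, the rank of a child must be strictly smaller than the rank of its parent
        - `union` only hangs a root under a root of strictly larger rank (after the increment in the tie case), and path compression only re-points a node to an ancestor, whose rank is larger still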
    - If a node is an internal node, it will always be an internal node
        - No combination of `union` and `find` can change a node with a parent to a root
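The first two properties can be checked empirically. The sketch below is a sanity check under stated assumptions, not a proof: the class name `DisjointSet`, the seed, and the sizes are all arbitrary choices made here, and the class combines union by rank with path compression as described above.

```python
import random


class DisjointSet:
    """Hypothetical compact version with union by rank and path compression."""

    def __init__(self, n):
        self.parents = list(range(n))
        self.rank = [0] * n

    def find(self, value):
        if self.parents[value] != value:
            self.parents[value] = self.find(self.parents[value])
        return self.parents[value]

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.rank[root_a] > self.rank[root_b]:
            root_a, root_b = root_b, root_a
        # Here rank[root_a] <= rank[root_b]; bump the rank first on a tie.
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_b] += 1
        self.parents[root_a] = root_b


random.seed(0)  # arbitrary seed, only for reproducibility
n = 1000
ds = DisjointSet(n)
for _ in range(5000):
    ds.union(random.randrange(n), random.randrange(n))
    ds.find(random.randrange(n))

# Property 1: at most n / 2^k nodes have rank k.
for k in range(max(ds.rank) + 1):
    assert sum(1 for r in ds.rank if r == k) <= n / 2 ** k

# Property 2: every non-root node has a strictly smaller rank than its parent.
for i in range(n):
    if ds.parents[i] != i:
        assert ds.rank[i] < ds.rank[ds.parents[i]]

print("max rank:", max(ds.rank))
```

Since a node of rank $k$ needs at least $2^k$ nodes beneath it, the maximum rank with $n = 1000$ cannot exceed 9.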