## Meta-goals of the Coming Lectures: Data Structure Refinement

Over the next couple of weeks, we'll work on deriving solutions to interesting problems, with an emphasis on how sets, maps, and priority queues are implemented.

Today, we'll work on deriving the Disjoint Sets data structure for solving the Dynamic Connectivity problem. We'll see:
* How a data structure design can evolve from basic to sophisticated
* How our choice of underlying abstraction can affect asymptotic runtime (using our formal Big-Theta notation) and code complexity.

## The Disjoint Sets Data Structure

The Disjoint Sets data structure has 2 operations:
* `connect(x, y)`: Connects `x` and `y`
* `isConnected(x, y)`: Returns true if `x` and `y` are connected
    * Connections can be transitive, i.e. they don't need to be direct.

This is useful for many purposes such as:
* Percolation theory, e.g. in computational chemistry
* Implementation of other algorithms, such as Kruskal's algorithm

## Disjoint Sets on Integers

To keep things simple, we're going to:
* Force all items to be integers instead of arbitrary data
* Declare the number of items in advance
    * Initially, everything is disconnected
    * Then we try to connect some integers

```java
ds = DisjointSets(7)
ds.connect(0, 1)
ds.connect(1, 2)
ds.connect(0, 4)
ds.connect(3, 5)
```

![](images/integers.png)

Then we check connectedness:

```java
ds.isConnected(2, 4) // returns true
ds.isConnected(3, 0) // returns false
```

Now, if we connect more numbers:

```java
ds.connect(4, 2)
ds.connect(4, 6)
ds.connect(3, 6)
```

![](images/integers2.png)

```java
ds.isConnected(3, 0) // returns true
```

## The Disjoint Sets Interface

```java
public interface DisjointSets {
    /** Connects 2 items P and Q */
    void connect(int p, int q);

    /** Checks to see if 2 items are connected. */
    boolean isConnected(int p, int q);
}
```

Goal: Design an efficient `DisjointSets` implementation. Some things to take into account:
* Number of elements `N` can be huge
* Number of method calls `M` can be huge
* Calls to methods may be interspersed
    * e.g. We can't assume that it'll be only `connect` operations followed by only `isConnected` operations

## The Naive Approach

* Connecting 2 things: Record every single connecting line in some data structure
* Checking connectedness: Do some sort of iteration over the lines to see if one thing can be reached from the other. 

This approach is too much work: we must store every connection ever made, and each `isConnected` call has to search through them.
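To make the cost of the naive approach concrete, here is a minimal sketch of it. The class name `NaiveDisjointSets`, the adjacency-list representation, and the use of breadth-first search are all assumptions made for illustration; the notes only say to record the lines and iterate over them. The class mirrors the two methods of the `DisjointSets` interface but does not declare `implements`, so that the snippet compiles on its own.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical naive version: records every line, searches on every query. */
class NaiveDisjointSets {
    // neighbors.get(i) holds every item directly connected to i.
    private final List<List<Integer>> neighbors = new ArrayList<>();

    NaiveDisjointSets(int n) {
        for (int i = 0; i < n; i++) {
            neighbors.add(new ArrayList<>());
        }
    }

    /** Records the line between p and q in both directions. */
    public void connect(int p, int q) {
        neighbors.get(p).add(q);
        neighbors.get(q).add(p);
    }

    /** Breadth-first search from p along recorded lines, looking for q. */
    public boolean isConnected(int p, int q) {
        boolean[] visited = new boolean[neighbors.size()];
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(p);
        visited[p] = true;
        while (!queue.isEmpty()) {
            int current = queue.remove();
            if (current == q) {
                return true;
            }
            for (int next : neighbors.get(current)) {
                if (!visited[next]) {
                    visited[next] = true;
                    queue.add(next);
                }
            }
        }
        return false;
    }
}
```

Note that redundant calls such as `connect(4, 2)` on already-connected items still add entries, so storage grows with the number of `connect` calls rather than the number of items, and every query may walk a large part of what has been stored.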

## A Better Approach: Connected Components

Rather than manually writing out every single connecting line, only record the set that each item belongs to.

```
{0}, {1}, {2}, {3}, {4}, {5}, {6}
connect(0, 1)
{0, 1}, {2}, {3}, {4}, {5}, {6}
connect(1, 2)
{0, 1, 2}, {3}, {4}, {5}, {6}
connect(0, 4)
{0, 1, 2, 4}, {3}, {5}, {6}

connect(3, 5)
{0, 1, 2, 4}, {3, 5}, {6}

isConnected(2, 4) // return true
isConnected(3, 0) // return false

connect(4, 2) // doesn't change anything, they're already connected
connect(4, 6)
{0, 1, 2, 4, 6}, {3, 5}

connect(3, 6) // connects everything!
{0, 1, 2, 3, 4, 5, 6}

isConnected(3, 0) // return true
```


**The idea**: if we only keep track of the set that each element belongs to, then the problem becomes much simpler (we don't have to draw any lines!).

## A Better Approach: Connected Components 2

For each item, its **connected component** is the set of all items that are connected to that item.

For example, if we have the following:

```
{0, 1, 2, 4}, {3, 5}, {6}
```

...then 1's connected component is `{0, 1, 2, 4}`.

The better approach: **model connectedness in terms of sets**
* We don't need to know how things are connected
* We only need to keep track of which connected component each item belongs to.
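One direct way to turn this idea into code is to store the connected components literally, as a list of sets. The sketch below is only one possible representation, chosen for illustration; the class name `ListOfSetsDS` and the helper names `findComponent` and `componentOf` are hypothetical. As with any standalone snippet, it mirrors the `DisjointSets` methods without declaring `implements`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical set-based version: stores components, not individual lines. */
class ListOfSetsDS {
    // Each set is one connected component; the sets never overlap.
    private final List<Set<Integer>> components = new ArrayList<>();

    /** Starts with n singleton components: {0}, {1}, ..., {n-1}. */
    ListOfSetsDS(int n) {
        for (int i = 0; i < n; i++) {
            Set<Integer> singleton = new HashSet<>();
            singleton.add(i);
            components.add(singleton);
        }
    }

    /** Scans the list for the component that contains item. */
    private Set<Integer> findComponent(int item) {
        for (Set<Integer> component : components) {
            if (component.contains(item)) {
                return component;
            }
        }
        throw new IllegalArgumentException("Unknown item: " + item);
    }

    /** Read-only view of the connected component that item belongs to. */
    public Set<Integer> componentOf(int item) {
        return Collections.unmodifiableSet(findComponent(item));
    }

    /** Merges the components of p and q; does nothing if they already match. */
    public void connect(int p, int q) {
        Set<Integer> a = findComponent(p);
        Set<Integer> b = findComponent(q);
        if (a == b) {
            return;
        }
        a.addAll(b);
        components.removeIf(c -> c == b); // drop the absorbed set by identity
    }

    /** Two items are connected exactly when they share a component. */
    public boolean isConnected(int p, int q) {
        return findComponent(p) == findComponent(q);
    }
}
```

Replaying the earlier trace with this class gives the same answers, and `componentOf(1)` yields `{0, 1, 2, 4}` after the first four `connect` calls. Each lookup still scans the list of components, so this is a starting point for the design rather than an efficient final version.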