# Greedy Algorithm, Minimum Spanning Tree, and Dynamic Programming

## Application - internet routing

- Ex. Stanford gateway router needs to send data to the Cornell gateway router
- Dijkstra's algorithm does this (with nonnegative edge lengths)
- Issue is that Stanford gateway router would need to know entire Internet
- Need a shortest-path algorithm that uses only local computation
- Solution is Bellman-Ford algorithm (also handles negative edge costs)

## Application - sequence alignment

- Input: two strings over the alphabet {A,C,G,T}
- Problem: figure out how similar the two strings are
- Measure similarity via quality of "best" alignment
    - Penalty $pen_{gap} \ge 0$ for each gap
    - Penalty $pen_{AT} \ge 0$ for mismatching A and T
    - Etc
- Output: alignment of the strings that minimizes the total penalty (Needleman-Wunsch score)
- Solution: straightforward dynamic programming
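
The dynamic program can be sketched as below. This is a minimal sketch, not a definitive implementation: the function name `alignment_penalty` and the single `pen_mismatch` parameter are assumptions made for illustration (the notes allow a separate penalty per mismatched pair, ex. $pen_{AT}$).

```python
def alignment_penalty(x, y, pen_gap, pen_mismatch):
    """Minimum total penalty of aligning strings x and y.

    Simplifying assumption: every mismatched pair costs the same
    pen_mismatch, instead of one penalty per letter pair.
    """
    m, n = len(x), len(y)
    # dp[i][j] = minimum penalty of aligning x[:i] with y[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * pen_gap  # x[:i] aligned against gaps only
    for j in range(1, n + 1):
        dp[0][j] = j * pen_gap  # y[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if x[i - 1] == y[j - 1] else pen_mismatch
            dp[i][j] = min(
                dp[i - 1][j - 1] + mismatch,  # match x[i-1] with y[j-1]
                dp[i - 1][j] + pen_gap,       # match x[i-1] with a gap
                dp[i][j - 1] + pen_gap,       # match y[j-1] with a gap
            )
    return dp[m][n]
```

For example, `alignment_penalty("AGGGCT", "AGGCA", 1, 2)` evaluates to 3 (one gap plus one mismatch).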

## Greedy algorithm

- Ex. Dijkstra's shortest path algorithm
- Easy to propose
- Easy runtime analysis
- Hard to establish correctness
- Most greedy algorithms are not correct

## Application - optimal caching

- Cache is faster than memory
- On a fault (cache miss), need to evict something from cache to make room
- Theorem: the "furthest-in-future" algorithm is optimal (minimizes the number of cache misses)
- Serves as guideline for practical algorithm ("Least Recently Used" should do well provided data exhibits locality of reference)
- Serves as idealized benchmark for caching algorithms
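
A small simulation of the "furthest-in-future" rule can make the benchmark concrete. This is a sketch under assumptions: the function name `furthest_in_future_misses` is hypothetical, and the whole request sequence is assumed to be known in advance (which is exactly why the rule is an idealized benchmark rather than a practical policy).

```python
def furthest_in_future_misses(requests, cache_size):
    """Count cache misses when evicting the item requested furthest in the future."""
    cache = set()
    misses = 0
    for t, item in enumerate(requests):
        if item in cache:
            continue  # cache hit
        misses += 1
        if len(cache) >= cache_size:
            def next_use(cached):
                # Position of the next request for a cached item (infinity if none)
                for s in range(t + 1, len(requests)):
                    if requests[s] == cached:
                        return s
                return float("inf")
            # Evict the cached item whose next request is furthest away
            cache.remove(max(cache, key=next_use))
        cache.add(item)
    return misses
```

For example, with a cache of size 2 the request sequence `["a", "b", "c", "b", "a"]` incurs 4 misses.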

## Application - scheduling

- Setup
    - One shared resource (ex. a processor)
    - Many "jobs" to do (ex. processes)
- Question
    - In what order should we sequence the jobs?
- Assume: each job has a 
    - Weight $w_{j}$ ("priority")
    - Length $l_{j}$
- Definition: the completion time $c_{j}$ of job $j$ = sum of job lengths up to and including $j$
- Goal: minimize the weighted sum of completion times $\min \displaystyle\sum_{j=1}^{n}w_{j}c_{j}$
- Intuition
    - With equal lengths, schedule larger or smaller weight jobs earlier? larger
    - With equal weights, schedule shorter or longer jobs earlier? shorter
- What if $w_{i} \gt w_{j}$ but $l_{i} \gt l_{j}$?
    - Assign "scores" to jobs that are
        - Increasing in weight
        - Decreasing in length
- Guess #1: order jobs by decreasing value of $w_{j} - l_{j}$ (not always correct)
- Guess #2: order jobs by decreasing ratio $\dfrac{w_{j}}{l_{j}}$ (always correct, runs in $O(n \log n)$ - just need to sort)
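
The two guesses can be compared on a tiny instance. This is a minimal sketch; the function names and the representation of a job as a `(weight, length)` tuple are assumptions made for illustration.

```python
def weighted_completion_time(jobs):
    """Sum of weight * completion time for jobs run in the given order."""
    total, elapsed = 0, 0
    for weight, length in jobs:
        elapsed += length          # completion time of this job
        total += weight * elapsed
    return total


def schedule_by_difference(jobs):
    """Guess #1: decreasing value of weight - length."""
    return sorted(jobs, key=lambda job: job[0] - job[1], reverse=True)


def schedule_by_ratio(jobs):
    """Guess #2: decreasing ratio weight / length."""
    return sorted(jobs, key=lambda job: job[0] / job[1], reverse=True)
```

With `jobs = [(3, 5), (1, 2)]`, guess #1 schedules the second job first and gets $1 \cdot 2 + 3 \cdot 7 = 23$, while guess #2 schedules the first job first and gets $3 \cdot 5 + 1 \cdot 7 = 22$, so guess #1 is not always correct.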

### claim - guess #2 is always correct

- By an exchange argument
- Fix arbitrary input of $n$ jobs
- Consider proof by contradiction
- Let $\sigma$ = greedy schedule, $\sigma^{*}$ = optimal schedule (with $\sigma^{*}$ better than $\sigma$)
- Assume all $\dfrac{w_{j}}{l_{j}}$'s are distinct
- Assume by just renaming jobs $\dfrac{w_{1}}{l_{1}} \gt \dfrac{w_{2}}{l_{2}} \gt \dots \gt \dfrac{w_{n}}{l_{n}}$
- Thus, greedy schedule $\sigma$ is just $1,2,3 \dots n$
- Thus, if optimal schedule $\sigma^{*} \ne \sigma$, then there are consecutive jobs $i,j$ in $\sigma^{*}$ with $i \gt j$ ($i$ scheduled immediately before $j$)
- Suppose we exchange order of $i,j$ in $\sigma^{*}$ (leaving other jobs unchanged)
    - Cost of exchange is $w_{i}l_{j}$ ($c_{i}$ goes up by $l_{j}$)
    - Benefit of exchange is $w_{j}l_{i}$ ($c_{j}$ goes down by $l_{i}$)
    - $i \gt j => \dfrac{w_{i}}{l_{i}} \lt \dfrac{w_{j}}{l_{j}} => w_{i}l_{j} \lt w_{j}l_{i}$
        - cost $\lt$ benefit, meaning swap improves $\sigma^{*}$, contradicts optimality of $\sigma^{*}$
        
### claim - guess #2 is correct even with ties

- Fix arbitrary input of $n$ jobs
- Let $\sigma$ = greedy schedule, $\sigma^{*}$ = any other schedule
- Will show $\sigma$ at least as good as $\sigma^{*}$
    - Implies that greedy schedule is optimal
- Assume by just renaming jobs, greedy schedule $\sigma$ is just $1,2,3 \dots n$ (and so $\dfrac{w_{1}}{l_{1}} \ge \dfrac{w_{2}}{l_{2}} \ge \dots \ge \dfrac{w_{n}}{l_{n}}$, since ties are now allowed)
- Consider arbitrary schedule $\sigma^{*}$. If $\sigma^{*} = \sigma$, done
- Else recall there exist consecutive jobs $i,j$ in $\sigma^{*}$ with $i \gt j$
- Exchanging $i$ and $j$ in $\sigma^{*}$ has net benefit of $w_{j}l_{i}-w_{i}l_{j} \ge 0$
- Exchanging an "adjacent inversion" like $i,j$ only makes $\sigma^{*}$ better, and it decreases the number of inverted pairs (jobs $i,j$ with $i \gt j$ and $i$ scheduled earlier)
- After at most $\binom{n}{2}$ such exchanges, can transform $\sigma^{*}$ into $\sigma$
- $\sigma$ at least as good as $\sigma^{*}$
- Greedy is optimal

## Minimum spanning trees

- Input: undirected graph $G=(V,E)$ and a cost $c_{e}$ for each edge $e \in E$
    - Assume adjacency list representation
    - OK if edge costs are negative
- Output: minimum cost tree $T \subseteq E$ that spans all vertices
    - $T$ has no cycles
    - Subgraph $(V,T)$ is connected
- Assumption #1: input graph $G$ is connected
    - Else no spanning trees
    - Easy to check in preprocessing (ex. depth-first search)
- Assumption #2: edge costs are distinct
    - Prim + Kruskal remain correct with ties (which can be broken arbitrarily)
    
## Prim's MST algorithm

- Runs in $O(mn)$ with a straightforward implementation ($n-1$ iterations, $O(m)$ time each)
- Initialize $X = \{s\}$ # $s \in V$ chosen arbitrarily
- $T$ = empty set # invariant: $X$ = vertices spanned by tree-so-far $T$
- While $X \ne V$ # increases the number of spanned vertices in cheapest way possible
    - Let $e = (u,v)$ be the cheapest edge with $u \in X$ and $v \notin X$
    - Add $e$ to $T$
    - Add $v$ to $X$
    
### claim - Prim's algorithm outputs a spanning tree

- Definition: a cut of a graph $G = (V,E)$ is a partition of $V$ into 2 non-empty sets
- Empty cut lemma
    - A graph is not connected <=> there exists a cut$(A,B)$ with no crossing edges
    - Proof: (<=)
        - Assume RHS
        - Pick any $u \in A$ and $v \in B$
        - Since no edges cross $(A,B)$, there is no $u,v$ path in $G$ 
        - Thus, $G$ not connected
    - Proof: (=>)
        - Suppose $G$ has no $u,v$ path
        - Define $A$ = {vertices reachable from $u$ in $G$} ($u$'s connected component)
        - Define $B$ = {all other vertices} (all other connected components)
        - Note: no edges cross $(A,B)$ (otherwise $A$ would be bigger!)
- Double-crossing lemma
    - Suppose the cycle $C \subseteq E$ has an edge crossing the cut$(A,B)$, then so does some other edge of $C$
- Lonely cut corollary
    - If $e$ is the only edge crossing some cut$(A,B)$, then it is not in any cycle (if it were in a cycle, some other edge would have to cross the cut!)
- In summary
    - (1) Algorithm maintains invariant that $T$ spans $X$
    - (2) Can't get stuck with $X \ne V$ (otherwise the cut $(X, V-X)$ must be empty - by empty cut lemma, input graph $G$ is disconnected)
    - (3) No cycles ever get created in $T$
        - Consider any iteration with current sets $X$ and $T$
        - Suppose $e$ gets added
        - $e$ is the first edge crossing $(X, V-X)$ that gets added to $T$ => its addition can't create a cycle in $T$ (by lonely cut corollary)
        
### claim - Prim's algorithm always outputs a minimum-cost spanning tree

- Cut property: consider an edge $e$ of $G$. Suppose there is a cut $(A,B)$ such that $e$ is the cheapest edge of $G$ that crosses it. Then $e$ belongs to the MST of $G$
- Claim: cut property => Prim's algorithm is correct
    - Already proved Prim's algorithm outputs a spanning tree $T^{*}$
    - Key point: every edge $e \in T^{*}$ is explicitly justified by the cut property
        - $T^{*}$ is a subset of the MST
        - Since $T^{*}$ is already a spanning tree, it must be the MST
- Proof of cut property
    - Suppose there is an edge $e$ that is the cheapest one crossing a cut$(A,B)$, yet $e$ is not in the MST $T^{*}$
    - Idea: exchange $e$ with another edge in $T^{*}$ to make it even cheaper (contradiction)
    - Since $T^{*}$ is connected, it must contain an edge $f (\ne e)$ crossing $(A,B)$
    - Idea: exchange $e$ and $f$ to get a spanning tree cheaper than $T^{*}$ (contradiction)
    - Let $C$ = cycle created by adding $e$ to $T^{*}$
    - By double-crossing lemma: some other edge $e^{'}$ of $C$ (with $e^{'} \ne e$ and $e^{'} \in T^{*}$) crosses $(A,B)$ 
    - Note: $T = T^{*}\cup\{e\}-\{e^{'}\}$ is also a spanning tree
    - Since $c_{e} \lt c_{e^{'}}$, $T$ is cheaper than purported MST $T^{*}$, contradiction!
    
## Prim's algorithm with heaps

- Invariant #1: elements in heap = vertices of $V-X$
- Invariant #2: for $v \in V-X$, $key[v]$ = cost of cheapest edge $(u,v)$ with $u \in X$ (or $+\infty$ if no such edges exist)
- Check: can initialize heap with $O(m+n \log n) = O(m \log n)$ preprocessing
- Note: given invariants, extract-min yields next vertex $v \notin X$ and edge $(u,v)$ crossing $(X, V-X)$ to add to $X$ and $T$, respectively
- Issue: might need to recompute some keys to maintain invariant #2 after each extract-min

When $v$ added to $X$
- For each edge $(v,w) \in E$
    - If $w \in V-X$
        - (Update key if needed)
        - Delete $w$ from heap
        - Recompute $key[w] = min[key[w], c_{vw}]$
        - Re-insert into heap
        
Running time with heaps
- Dominated by time required for heap operations
- $(n-1)$ inserts during preprocessing
- $(n-1)$ extract-mins (one per iteration of while loop)
- Each edge $(v,w)$ triggers one delete/insert combo (when its first endpoint is sucked into $X$)
- $O(m)$ heap operations (recall $m \ge n-1$ since $G$ connected)
- $O(mlogn)$ time
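
A heap-based version can be sketched with Python's `heapq`. Note the swapped-in technique: `heapq` has no delete operation, so instead of the delete/re-insert step described above, this sketch stores candidate crossing edges in the heap and skips stale entries when they are popped ("lazy deletion"). The heap then holds up to $O(m)$ entries, which still gives $O(m \log n)$ time. The function name `prim_heap` and the edge-list input format `(u, v, cost)` are assumptions for illustration.

```python
import heapq
from collections import defaultdict


def prim_heap(edges, start):
    """Return the MST cost of a connected undirected graph given as (u, v, cost) tuples."""
    adjacency = defaultdict(list)
    for u, v, cost in edges:
        adjacency[u].append((cost, v))
        adjacency[v].append((cost, u))

    explored = {start}                   # the set X
    frontier = list(adjacency[start])    # candidate edges leaving X
    heapq.heapify(frontier)
    total = 0
    while frontier:
        cost, v = heapq.heappop(frontier)
        if v in explored:
            continue                     # stale entry: edge no longer crosses the cut
        explored.add(v)
        total += cost
        for edge in adjacency[v]:
            if edge[1] not in explored:
                heapq.heappush(frontier, edge)
    return total
```

For example, `prim_heap([(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4)], 1)` evaluates to 7.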

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data
    
    Args:
    file_path (string) -- location of file to read
    
    Returns:
    tuple_data (tuple of list and integer) -- edge list representation of graph and number of nodes
    """
    
    data_array = []
    num_nodes = 0
    
    with open(file_path, 'r') as file:
        array_of_array = file.read().split("\n")
        num_nodes = int(array_of_array[0].split(" ")[0])
        del array_of_array[0] # delete first element, which is just the length of data
        for array in array_of_array:
            if not array.strip():
                continue # skip blank lines (ex. a trailing newline at end of file)
            subarray = array.split()
            node1 = int(subarray[0])
            node2 = int(subarray[1])
            cost = int(subarray[2])
            data_array.append((node1, node2, cost))
            
    tuple_data = (data_array, num_nodes)
    return tuple_data

In [None]:
def greedy_search(array, X, T):
    """
    For all node1 in X, find node2 that is not in X, that makes the cheapest edge between node1 and node2
    
    Args:
    array (list) -- edge list representation of graph
    X (list) -- stores all vertices spanned by the minimum spanning tree so far
    T (list) -- stores the costs of all edges in the minimum spanning tree so far
    
    Returns:
    None
    """
    
    minimum_cost = float("inf") # no crossing edge found yet (avoids assuming all costs are below a fixed bound)
    minimum_node1 = 0
    minimum_node2 = 0
    for node1 in X:
        for node2 in get_connected_node(node1, array):
            if node2 not in X:
                cost = get_cost(node1, node2, array)
                if cost < minimum_cost:
                    minimum_node1 = node1
                    minimum_node2 = node2
                    minimum_cost = cost
    
    X.append(minimum_node2)
    T.append(minimum_cost)

In [None]:
def get_connected_node(node1, array):
    """
    Find all nodes that are connected by an edge for node1
    
    Args:
    node1 (integer) -- input node
    array (list) -- edge list representation of graph
    
    Returns:
    nodes (list) - all nodes connected to node1
    """
    
    nodes = []
    
    for item in array:
        if item[0] == node1:
            nodes.append(item[1])
        elif item[1] == node1:
            nodes.append(item[0])
            
    return nodes

In [None]:
def get_cost(node1, node2, array):
    """
    Find cost of edge between node1 and node2
    
    Args:
    node1 (integer) -- first node of an edge
    node2 (integer) -- second node of an edge
    array (list) -- edge list representation of graph
    
    Returns:
    cost (integer) -- cost of edge between node1 and node2
    """
    
    cost = 0
    
    for item in array:
        if item[0] == node1 and item[1] == node2:
            cost = item[2]
        if item[0] == node2 and item[1] == node1:
            cost = item[2]
            
    return cost

In [None]:
def prim(file_path):
    """
    Implements Prim's MST algorithm
    
    Args:
    file_path (string) -- location of file to read
    
    Returns:
    cost (integer) -- cost of minimum spanning tree
    """
    
    tuple_obj = open_file(file_path)
    array = tuple_obj[0]
    num_nodes = tuple_obj[1]
    
    X = [] # store explored nodes
    s = array[0][0] # pick an arbitrary start node (first endpoint of the first edge)
    X.append(s)
    
    T = [] # store costs
    T.append(0)
    
    while len(X) < num_nodes:
        greedy_search(array, X, T)
    
    cost = sum(T)
    return cost

In [None]:
assert(prim("data/1-3-1-edge.txt") == -3612829)
assert(prim("data/1-3-1-edge1.txt") == 7)
assert(prim("data/1-3-1-edge2.txt") == 15)
assert(prim("data/1-3-1-edge3.txt") == 14)

## MST review

- Input: undirected graph $G = (V,E)$, edge cost $c_{e}$
- Output: min-cost spanning tree (no cycles, connected)
- Assumptions: $G$ is connected, distinct edge costs
- Cut property: if $e$ is the cheapest edge crossing some cut$(A,B)$, then $e$ belongs to the MST


## Kruskal's MST Algorithm 

- Runs in $O(mn)$ with a straightforward implementation
- Sort edges in order of increasing cost (rename edges 1,2,3,... so that $c_{1} < c_{2} < \dots < c_{m}$)
- Let $T$ = empty set
- For $i=1 \dots m$ # $O(m)$ iterations
    - If $T\cup\{i\}$ has no cycles # $O(n)$ per check: use BFS/DFS in the graph $(V,T)$, which contains $\le n-1$ edges
        - Add $i$ to $T$
- Return $T$

### correctness

- Let $T^{*}$ = output of Kruskal's algorithm on input graph $G$
- Clearly $T^{*}$ has no cycles
- $T^{*}$ is connected
    - By empty cut lemma, only need to show that $T^{*}$ crosses every cut
    - Fix a cut$(A,B)$. Since $G$ is connected, at least one of its edges crosses $(A,B)$
- Key point: Kruskal will include first edge crossing $(A,B)$ that it sees (by lonely cut corollary, cannot create a cycle)
- Every edge of $T^{*}$ is justified by the cut property (implies $T^{*}$ is the MST)
    - Consider iteration where edge $(u,v)$ added to current set $T$. Since $T\cup\{(u,v)\}$ has no cycle, $T$ has no $u-v$ path
        - There exists an empty cut$(A,B)$ separating $u$ and $v$ (as in proof of empty cut lemma)
        - No edges crossing $(A,B)$ were previously considered by Kruskal's algorithm
        - $(u,v)$ is the first (hence the cheapest!) edge crossing $(A,B)$
        - $(u,v)$ justified by the cut property

## Union-Find data structure

- Maintain partition of a set of objects
- Find$(x)$: return name of group that $x$ belongs to
- Union$(c_{i},c_{j})$: fuse groups $c_{i},c_{j}$ into a single one

### why useful for Kruskal's?

- Objects = vertices
- Groups = connected components w.r.t. chosen edges $T$
- Adding new edge $(u,v)$ to $T$ <=> fusing connected components of $u,v$

### Union-Find basics

- Motivation: $O(1)$ time cycle checks in Kruskal's algorithm
- Idea #1: maintain one linked structure per connected component of $(V,T)$
    - Each component has an arbitrary leader vertex
- Invariant: each vertex points to the leader of its component ("name" of a component inherited from leader vertex)
- Key point: given edge$(u,v)$, can check if $u$ and $v$ already in same component in $O(1)$ time (iff leader pointers of $u$ and $v$ match <=> Find$(u)$ = Find$(v)$ => $O(1)$ time cycle checks!)
- Note: when new edge $(u,v)$ added to $T$, connected components of $u$ and $v$ merge
- How many times does a single vertex $v$ have its leader pointer updated over the course of Kruskal's algorithm?
    - $O(\log n)$ because every time $v$'s leader gets updated, population of its component at least doubles => can only happen $\le \log_{2} n$ times

### Running time

- $O(mlogn)$ for sorting
- $O(m)$ for cycle checks ($O(1)$ per iteration)
- $O(nlogn)$ overall for leader pointer updates
- $O(mlogn)$ total, matching Prim's
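
The leader-pointer scheme can be sketched as follows. This is a minimal sketch under assumptions: vertices are numbered $1 \dots n$, edges are `(u, v, cost)` tuples, and the names `kruskal`, `leader`, and `members` are made up for illustration. The smaller component always inherits the leader of the larger one, which is what bounds the leader updates per vertex by $\log_{2} n$.

```python
def kruskal(num_vertices, edges):
    """Return the MST cost using Kruskal's algorithm with leader pointers."""
    # leader[v] = name of v's component; members[l] = vertices led by l
    leader = {v: v for v in range(1, num_vertices + 1)}
    members = {v: [v] for v in range(1, num_vertices + 1)}
    total = 0
    for u, v, cost in sorted(edges, key=lambda edge: edge[2]):
        lu, lv = leader[u], leader[v]
        if lu == lv:
            continue  # same component: adding (u, v) would create a cycle
        if len(members[lu]) < len(members[lv]):
            lu, lv = lv, lu  # make lu the leader of the larger component
        for w in members[lv]:
            leader[w] = lu   # smaller component inherits the larger one's leader
        members[lu].extend(members.pop(lv))
        total += cost
    return total
```

For example, `kruskal(4, [(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4)])` evaluates to 7.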

### State-of-the-art MST

- $O(m)$ randomized algorithm (Karger-Klein-Tarjan JACM 1995)
- $O(m\alpha(n))$ deterministic (Chazelle JACM 2000)
    - "inverse Ackermann function": grows much slower than $\log^{*} n$

## Clustering

- "unsupervised learning"
- Informal goal: given $n$ points, classify into "coherent groups"
- Assumptions
    - As input, given a (dis)similarity measure - a distance $d(p,q)$ between each point pair
    - Symmetric (i.e., $d(p,q) = d(q,p)$)
- Ex. Euclidean distance, genome similarity, etc

### Max-spacing k-clusterings

- Assume: we know $k$ = number of clusters desired (in practice, can experiment with a range of values)
- Call points $p,q$ separated if they are assigned to different clusters
- Definition: the spacing of a $k$-clustering is $\min_{separated\ p,q}d(p,q)$ (the bigger the better)
- Problem: given a distance measure $d$ and $k$, compute the $k$-clustering with maximum spacing

### A greedy algorithm

- Initially, each point in a separate cluster
- Repeat until only $k$ clusters
    - Let $p,q$ = closest pair of separated points (determines the current spacing)
    - Merge the clusters containing $p$ and $q$ into a single cluster
- Just like Kruskal's MST, but stopped early (single-link clustering)
    - Points <=> vertices
    - Distances <=> edge costs
    - Point pairs <=> edges
    
### Correctness

- Claim: single-link clustering finds the max-spacing $k$-clustering 
- Proof
    - Let $c_{1} \dots c_{k}$ = greedy clustering with spacing $S$
    - Let $\hat{c_{1}} \dots \hat{c_{k}}$ = arbitrary other clustering
    - Need to show spacing of $\hat{c_{1}} \dots \hat{c_{k}}$ is $\le S$
    - Case #1: $\hat{c_{i}}$'s are the same as the $c_{i}$'s (maybe after renaming) => has the same spacing $S$
    - Case #2: otherwise, can find a point pair $p,q$ such that 
        - $p,q$ in the same greedy cluster $c_{i}$
        - $p,q$ in different clusters $\hat{c_{i}},\hat{c_{j}}$
    - Property of greedy algorithm: if two points $x,y$ "directly merged at some point", then $d(x,y) \le S$ (distance between merged point pairs only goes up)
    - Easy case: if $p,q$ directly merged at some point, $S \ge d(p,q) \ge$ spacing of $\hat{c_{1}} \dots \hat{c_{k}}$
    - Tricky case: $p,q$ "indirectly merged" through multiple direct merges
        - Let $p,a_{1} \dots a_{l},q$ be the path of direct greedy merges connecting $p$ and $q$
        - Key point: since $p \in \hat{c_{i}}$ and $q \notin \hat{c_{i}}$, $\exists$ consecutive pair $a_{j}, a_{j+1}$ with $a_{j} \in \hat{c_{i}}, a_{j+1} \notin \hat{c_{i}}$ => $S \ge d(a_{j}, a_{j+1}) \ge$ spacing of $\hat{c_{1}} \dots \hat{c_{k}}$
        
## Advanced Union-Find

### Previous solution (for Kruskal's MST)

- Each $x \in X$ points directly to the "leader" of its group
- $O(1)$ Find (just return $x$'s leader)
- $O(n \log n)$ total work for $n$ Unions (when 2 groups merge, smaller group inherits leader of larger one)

### Lazy Union

- New idea: update only one pointer each merge!
- In general: when two groups merge in a Union, make one group's leader (ex. root of the tree) a child of the other one
- Pro: Union reduces to 2 Finds ($r_{1}$ = Find$(x)$, $r_{2}$ = Find$(y)$) and $O(1)$ extra work (link $r_{1}, r_{2}$ together)
- Con: to recover leader of an object, need to follow a path of parent pointers (not just one!) => not clear if Find still takes $O(1)$
- New implementation: each object $x \in X$ has a parent field
- Invariant: parent pointers induce a collection of directed trees on $X$ ($x$ is root <=> parent$[x] = x$)
- Initially: for all $x$, parent$[x] = x$
- Find$(x)$: traverse parent pointers from $x$ until you hit the root
- Union$(x,y)$: $s_{1}$ = Find$(x)$; $s_{2}$ = Find$(y)$. Reset parent of one of $s_{1}, s_{2}$ to be the other

### Union by rank

- For each $x \in X$, maintain field rank$[x]$ (in general rank$[x] = 1 + $(max rank of $x$'s children))
- Invariant: for all $x \in X$, rank$[x]$ = maximum number of hops from some leaf to $x$ (initially, rank$[x] = 0$ for all $x \in X$)
- To avoid scraggly trees, given $x$ and $y$
    - $s_{1}$ = Find$(x)$, $s_{2}$ = Find$(y)$
    - If rank$[s_{1}]$ $\gt$ rank$[s_{2}]$, then set parent$[s_{2}]$ to $s_{1}$, else set parent$[s_{1}]$ to $s_{2}$
- Make old root with smaller rank child of the root with the larger rank (choose new root arbitrarily in case of a tie and add $1$ to its rank)    
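
Lazy union with union by rank can be sketched as a small class. The class name `RankedUnionFind` and its method names are assumptions for illustration; path compression is deliberately left out here.

```python
class RankedUnionFind:
    """Lazy-union Union-Find with union by rank (no path compression)."""

    def __init__(self, objects):
        self.parent = {x: x for x in objects}  # every object starts as its own root
        self.rank = {x: 0 for x in objects}

    def find(self, x):
        # Follow parent pointers until reaching a root (parent[x] == x)
        while self.parent[x] != x:
            x = self.parent[x]
        return x

    def union(self, x, y):
        s1, s2 = self.find(x), self.find(y)
        if s1 == s2:
            return  # already in the same group
        if self.rank[s1] > self.rank[s2]:
            self.parent[s2] = s1
        else:
            self.parent[s1] = s2
            if self.rank[s1] == self.rank[s2]:
                self.rank[s2] += 1  # tie: the new root's rank goes up by 1
```

For example, after `union(0, 1)`, `union(2, 3)`, and `union(0, 2)` on four objects, all four share one root, and that root has rank 2.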

### Properties of rank

- Immediate from invariant/rank maintenance
    - For all objects $x$, rank$[x]$ only goes up over time
    - Only rank of roots can go up (once $x$ is a non-root, rank$[x]$ is frozen forevermore)
    - Ranks strictly increase along a path to the root
    
### Rank lemma

- Consider an arbitrary sequence of Union (+ Find) operations. For every $r \in \{0,1,2,\dots\}$, there are at most $n/2^{r}$ objects with rank $r$
- Corollary: max rank always $\le log_{2}n$
- Corollary: worst-case running time of Find, Union is $O(logn)$
- Claim: if $x,y$ have the same rank $r$, then their subtrees (objects from which can reach $x,y$) are disjoint
- Proof
    - Suppose subtrees of $x,y$ have object $z$ in common
        - $\exists$ paths $z \to x$, $z \to y$
        - One of $x,y$ is an ancestor of the other
        - The ancestor has strictly larger rank, contradicting rank$[x]$ = rank$[y]$
- Claim: the subtree of a rank $r$ object has size $\ge 2^{r}$
- Proof
    - Rank $r$ => subtree size $\ge 2^{r}$
    - Base case: initially all ranks $= 0$, all subtree sizes $= 1$
    - Inductive step: nothing to prove unless the rank of some object changes (subtree sizes only go up)
    - Interesting case: Union$(x,y)$, with $s_{1}=$ Find$(x)$, $s_{2}=$ Find$(y)$, and rank$[s_{1}] =$ rank$[s_{2}] = r$ => $s_{2}$'s new rank $= r+1$ => $s_{2}$'s new subtree size $= s_{2}$'s old subtree size $+ s_{1}$'s old subtree size (each at least $2^{r}$ by the inductive hypothesis) $\ge 2^{r+1}$ 
    
### Path compression

- Idea: why bother traversing a leaf-root path multiple times? After Find$(x)$, install shortcuts (i.e., revise parent pointers) to $x$'s root all along the $x \to$ root path
- Con: constant-factor overhead to Find (from "multitasking")
- Pro: speeds up subsequent Finds

### On ranks
- Important: maintain all rank fields exactly as without path compression
    - Rank initially all 0
    - In Union, new root = old root with bigger rank
    - When merging two nodes of common rank $r$, reset new root's rank to $r+1$
- Bad news: now rank$[x]$ is only an upper bound on the maximum number of hops on a path from a leaf to $x$ (which could be much less)
- Good news: rank lemma still holds ($\le n/2^{r}$ objects with rank $r$), and we still always have rank$[$parent$[x]]$ > rank$[x]$ for all non-roots $x$

### Hopcroft-Ullman theorem

- With union by rank and path compression, $m$ Union + Find operations take $O(mlog^{*}n)$ time, where $log^{*}n$ = the number of times you need to apply $log$ to $n$ before the result is $\le 1$
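
The iterated logarithm is easy to compute directly. This is a small sketch assuming base-2 logarithms; the function name `log_star` is made up for illustration.

```python
import math


def log_star(n):
    """Number of times log2 must be applied to n before the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

For example, `log_star(16)` is 3 and `log_star(65536)` is 4. Even $n = 2^{65536}$ only has $\log^{*} n = 5$, which is why $O(m \log^{*} n)$ is nearly linear in practice.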

### Measuring progress

- Intuition: installing shortcuts should significantly speed up subsequent Finds + Unions
- Question: how to track this progress and quantify the benefit? 
- Idea: consider a non-root object $x$
    - Progress measure: rank$[$parent$[x]]$ - rank$[x]$
- Path compression increases this progress measure: if $x$ has old parent $p$, new parent $p^{'} \ne p$, then rank$[p^{'}] \gt$ rank$[p]$

### Proof setup

- Rank blocks: $\{0\},\{1\},\{2,3,4\},\{5 \dots 2^{4}\},\{17,18 \dots 2^{16}\},\{65537 \dots 2^{65536}\} \dots \{\dots n\}$
- Note: there are $O(log^{*}n)$ different rank blocks
- Semantics: traversal $x$ -> parent$(x)$ is "fast progress" <=> rank$[$parent$[x]]$ is larger block than rank$[x]$
- Definition: at a given point in time, call object $x$ "good" if 
    - $x$ or $x$'s parent is a root OR
    - Rank[parent$[x]$] in larger block than rank$[x]$
    
### Proof of Hopcroft-Ullman

- Point: every Find visits only $O(log^{*}n)$ good nodes $(2 + $number of rank blocks = $O(log^{*}n)$ $)$
- Upshot: total work done during $m$ operations = $O(mlog^{*}n)$ (visits to good objects) + total number of visits to bad nodes (need to bound globally by separate argument)
- Consider: a rank block $\{k+1, k+2, \dots, 2^{k}\}$
- Note: when a bad node $x$ is visited, its parent is changed to one with strictly larger rank => can only happen $2^{k}$ times before $x$ becomes good (forevermore)
- Rank lemma: total number of objects $x$ with final rank in this rank block is $\le \displaystyle\sum_{i=k+1}^{2^{k}}n/2^{i} \le n/2^{k}$
- Thus, total number of visits to bad nodes with rank in this block is $\le (n/2^{k}) \cdot 2^{k} = n$
- Recall: only $O(log^{*}n)$ rank blocks
- Total work: $O((m+n)log^{*}n)$

### Tarjan's bound

- Theorem: with union by rank and path compression, $m$ Union + Find operations take $O(m\alpha(n))$ time, where $\alpha(n)$ is the inverse Ackermann function

### Ackermann function

- Define $A_{k}(r)$ for all integers $k$ and $r \ge 1$ (recursively)
- Base case: $A_{0}(r) = r+1$ for all $r \ge 1$
- In general, for $k,r \ge 1$
    - $A_{k}(r)$ = apply $A_{k-1}$ $r$ times to $r$ $= (A_{k-1} \circ A_{k-1} \circ \dots \circ A_{k-1})(r)$ ($r$-fold composition)
    
### Inverse Ackermann function

- Definition: for every $n \ge 4, \alpha(n)$ = minimum value of $k$ such that $A_{k}(2) \ge n$
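
Both definitions can be evaluated for small inputs. This is a sketch under assumptions: the names `ackermann` and `inverse_ackermann` are made up for illustration, and the `cap` parameter is an addition (not part of the definition) that stops the recursion once the value reaches `cap`, since $A_{4}(2)$ is far too large to compute. Capping is safe because each $A_{k}$ is increasing and $A_{k}(r) \gt r$.

```python
def ackermann(k, r, cap):
    """Return min(A_k(r), cap), following the recursive definition above."""
    if r >= cap:
        return cap
    if k == 0:
        return min(r + 1, cap)   # base case: A_0(r) = r + 1
    value = r
    for _ in range(r):           # apply A_{k-1} r times to r
        value = ackermann(k - 1, value, cap)
        if value >= cap:
            return cap           # the true value can only be larger
    return value


def inverse_ackermann(n):
    """Minimum k such that A_k(2) >= n (intended for n >= 4)."""
    k = 0
    while ackermann(k, 2, n) < n:
        k += 1
    return k
```

For example, $A_{1}(2) = 4$, $A_{2}(2) = 8$, and $A_{3}(2) = 2048$, so `inverse_ackermann(2048)` is 3 and `inverse_ackermann(2049)` is 4.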

### Building blocks of Hopcroft-Ullman analysis

- Block #1: rank lemma (at most $n/2^{r}$ objects of rank $r$)
- Block #2: path compression => If $x$'s parent pointer updated from $p$ to $p'$, then rank$(p')$ $\ge$ rank$(p)+1$
- New idea: stronger version of building block #2. In most cases, rank of new parent is much bigger than rank of old parent (not just by 1)

### Quantifying rank gaps

- Definition: consider a non-root object $x$ (so rank$[x]$ fixed forevermore)
- Define: $\delta(x)$ = max value of $k$ such that rank$[$parent$[x]] \ge A_{k}($rank$[x])$ 
- Ex. always have $\delta(x) \ge 0$
    - $\delta(x) \ge 1$ <=> rank$[$parent$[x]] \ge 2$rank$[x]$
    - $\delta(x) \ge 2$ <=> rank$[$parent$[x]] \ge $rank$[x]2^{rank[x]}$
- For all objects $x$ with rank$[x] \ge 2$, $\delta(x) \le \alpha(n)$ (since $A_{\alpha(n)}(2) \ge n$)

### Bad objects

- Definition: an object $x$ is bad if all of the following hold
    - $x$ is not a root
    - Parent$(x)$ is not a root
    - Rank$[x] \ge 2$
    - $x$ has an ancestor $y$ with $\delta(y) = \delta(x)$
    
### Proof of Tarjan's bound

- Upshot: total work of $m$ operations = $O(m\alpha(n))$ (visits to good objects) + total number of visits to bad objects (will show is $O(n\alpha(n))$)
- Main argument: suppose a Find operation visits a bad object $x$
    - Let $p$ = parent of $x$, $k = \delta(x)$, $y$ = ancestor of $x$ with $\delta(y) = k$, and $p^{'}$ = parent of $y$
- Path compression: $x$'s new parent will be $p^{'}$ or even higher
    - Rank$[x$'s new parent$] \ge$ rank$[p^{'}] \ge A_{k}($rank$[y]) \ge A_{k}($rank$[p])$ (second step since $\delta(y) = k$, third since $y$ is $p$ or an ancestor of $p$)
- Point: path compression (at least) applies the $A_{k}$ function to rank$[x$'s parent$]$
- Consequence: if $r = $rank$[x] (\ge 2)$, then after $r$ such pointer updates we have 
    - Rank$[x$'s parent$] \ge (A_{k} \circ \dots r$ times $ \dots \circ A_{k})(r) = A_{k+1}(r)$
- Thus, while $x$ is bad, every $r$ visits increases $\delta(x)$
    - $\le r\alpha(n)$ visits to $x$ while it's bad
- Total number of visits to bad objects $\le \displaystyle\sum_{objects\ x}$ rank$[x]\alpha(n) = \alpha(n)\displaystyle\sum_{r \ge 0}r \cdot$ (number of objects with rank $r$) $\le n\alpha(n)\displaystyle\sum_{r \ge 0} r/2^{r} = O(n\alpha(n))$

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path (string) -- location of file to read

    Returns:
    tuple_date (tuple) -- an array that holds data and an integer representing size of data
    """

    data_array = []
    num_nodes = 0

    with open(file_path, 'r') as file:
        array_of_array = file.read().split("\n")
        num_nodes = int(array_of_array[0].split(" ")[0])
        del array_of_array[0] # delete first element, which is just the length of data
        for array in array_of_array:
            if not array.strip():
                continue # skip blank lines (ex. a trailing newline at end of file)
            subarray = array.split()
            node1 = int(subarray[0])
            node2 = int(subarray[1])
            cost = int(subarray[2])
            data_array.append((node1, node2, cost))
            
    tuple_date = (data_array, num_nodes)
    return tuple_date

In [None]:
def find_closest_pair_and_merge(sorted_array, T):
    """
    Find two nodes that are in different clusters, and merge them into a single cluster

    Args:
    sorted_array (list) -- holds tuples sorted by their third element (which is the cost between two nodes)
    T (list of list) -- contains "clusters"

    Returns:
    None
    """

    node1 = sorted_array[0][0]
    node2 = sorted_array[0][1]
    cost = sorted_array[0][2]

    index_of_cluster_to_expand = find_cluster(node1, T)
    index_of_cluster_to_remove = find_cluster(node2, T)

    if index_of_cluster_to_expand != index_of_cluster_to_remove: # if two nodes are already in the same cluster, no need to perform merge on T
        T[index_of_cluster_to_expand].extend(T[index_of_cluster_to_remove]) # add all nodes in the cluster where node2 belongs to node1's cluster
        del T[index_of_cluster_to_remove] # remove node2's cluster
    del sorted_array[0] # remove current tuple, whether or not a merge happened

In [None]:
def find_cluster(node, T):
    """
    Find a list inside T that the node belongs to

    Args:
    node (integer) -- represents a node in a graph
    T (list of list) -- contains "clusters"

    Returns:
    i (integer) -- index of cluster of T
    """

    for i in range(0, len(T)):
        if node in T[i]:
            return i
    return -1

In [None]:
def get_max_spacing(T, sorted_array):
    """
    Return the minimum distance between two nodes that are in different clusters

    Args:
    sorted_array (list) -- holds tuples sorted by their third element (which is the cost between two nodes)
    T (list of list) -- contains "clusters"

    Returns:
    min_cost (integer) -- the minimum cost
    """

    for item in sorted_array:
        cluster_of_node1 = find_cluster(item[0], T)
        cluster_of_node2 = find_cluster(item[1], T)
        if cluster_of_node1 != cluster_of_node2:
            min_cost = item[2]
            return min_cost

In [None]:
def clustering(file_path):
    """
    Implements clustering algorithm
    
    Args:
    file_path (string) -- location of file to read
    
    Returns:
    max_spacing (integer) -- maximum distance between elements in different clusters
    """
    
    tuple_obj = open_file(file_path)
    
    array = tuple_obj[0]
    sorted_array = sorted(array, key=lambda x: (x[2])) # sort by third element
    num_nodes = tuple_obj[1]
    
    T = []
    for node in range(1, num_nodes+1):
        T.append([node])

    while len(T) > 4 and len(sorted_array) > 0:
        find_closest_pair_and_merge(sorted_array, T)

    max_spacing = get_max_spacing(T, sorted_array)
    return max_spacing

In [None]:
assert(clustering("data/1-3-2-clustering.txt") == 106)

In [None]:
from networkx.utils.union_find import UnionFind

In [None]:
def open_file(file_path):
    """
    Read-in a file whose rows are bit strings (one node per row)

    Args:
    file_path -- location of file to read

    Returns:
    tuple -- (data_array, data_dict, num_nodes, num_bits), where data_dict maps each bit string (stored as an integer) to the set of node numbers that have it
    """

    data_dict = {}
    data_array = []
    num_nodes = 0

    with open(file_path, 'r') as file:
        array_of_array = file.read().split("\n")
        num_nodes = int(array_of_array[0].split(" ")[0])
        num_bits = int(array_of_array[0].split(" ")[1])
        del array_of_array[0] # delete first element, which is just metadata
        for i in range(0, len(array_of_array)):
            if not array_of_array[i].strip():
                continue # skip blank lines (ex. a trailing newline at end of file)
            number = int(array_of_array[i].replace(" ", ""))
            data_array.append(number)
            if number not in data_dict:
                data_dict[number] = set()
            data_dict[number].add(i+1)
                  
    return (data_array, data_dict, num_nodes, num_bits)

In [None]:
def convert_base_10_to_2(array):
    """
    Convert, in place, a list of integers to integers whose decimal digits spell the binary representation (ex. 5 -> 101)
    
    Args:
    array - list of integers
    
    Returns:
    None
    """
    for i in range(0, len(array)):
        array[i] = int(bin(array[i])[2:])

In [None]:
def produce_heming_distance_1(num_bits):
    """
    Produce every bit mask at Hamming distance 1 from zero (exactly one bit set)
    
    Args: 
    num_bits (integer) -- number of bits in binary
    
    Returns:
    heming_distance_1 (list) -- masks with one bit set, each stored as an integer whose decimal digits are the bits
    """
    
    heming_distance_1 = [1 << i for i in range(num_bits)]

    convert_base_10_to_2(heming_distance_1)
    
    return heming_distance_1

In [None]:
def produce_heming_distance_2(heming_distance_1):
    """
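    Produce every bit mask at Hamming distance 2 from zero (exactly two bits set)
    
    Args: 
    heming_distance_1 (list) -- masks with one bit set, as returned by produce_heming_distance_1
    
    Returns:
    heming_distance_2 (list) -- masks with two bits set, in the same digit-per-bit integer form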
    """
    
    heming_distance_2 = []
    for i in range(0, len(heming_distance_1)):
        for j in range(0, len(heming_distance_1)):
            if j > i:
                dist = int(str(heming_distance_1[i]),2) ^ int(str(heming_distance_1[j]),2)
                heming_distance_2.append(dist)

    convert_base_10_to_2(heming_distance_2)
    
    return heming_distance_2

In [None]:
def clustering_big(file_path):
    """
    Count the clusters formed when every pair of labels within Hamming distance 2 is merged

    Args:
    file_path -- location of file to read

    Returns:
    num_clusters -- number of clusters remaining after all merges
    """
    
    tuple_obj = open_file(file_path)
    data_array = tuple_obj[0]
    data_dict = tuple_obj[1]
    num_nodes = tuple_obj[2]
    num_bits = tuple_obj[3]

    unionFind = UnionFind()

    heming_distance_1 = produce_heming_distance_1(num_bits)
    heming_distance_2 = produce_heming_distance_2(heming_distance_1)
    distances = heming_distance_1 + heming_distance_2 

    for distance in distances:
        for key1 in data_dict:
            key2 = int(str(distance),2) ^ int(str(key1),2)
            key2 = int(bin(key2)[2:]) 
            if key2 in data_dict:
                unionFind.union(key1, key2)

    pointer_set = set([unionFind[x] for x in data_dict])
    num_clusters = len(pointer_set)
    
    return num_clusters

In [None]:
assert(clustering_big("data/1-3-2-clustering-big1.txt") == 3)
assert(clustering_big("data/1-3-2-clustering-big2.txt") == 15)
assert(clustering_big("data/1-3-2-clustering-big.txt") == 6118)
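The cells above store each label as an integer whose decimal digits are the bits and convert back and forth with `int(str(...), 2)`. The same neighbor-enumeration idea can be sketched with plain integer bit masks, which avoids the conversions. This is a minimal sketch under stated assumptions, not the notebook's method: `hamming_masks` and `count_clusters` are hypothetical names, labels are assumed to be ordinary Python integers, and a small hand-written union-find stands in for `networkx`'s `UnionFind`.

```python
from itertools import combinations

def hamming_masks(num_bits):
    """Return every bit mask with exactly one or exactly two bits set."""
    one_bit = [1 << i for i in range(num_bits)]
    two_bits = [a | b for a, b in combinations(one_bit, 2)]
    return one_bit + two_bits

def count_clusters(labels, num_bits):
    """Count groups of labels linked by chains of Hamming distance <= 2."""
    parent = {x: x for x in set(labels)}

    def find(x):
        # Path halving keeps the trees shallow
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    masks = hamming_masks(num_bits)
    for x in list(parent):
        for mask in masks:
            y = x ^ mask            # flip one or two bits of x
            if y in parent:         # a neighbor that is actually present
                parent[find(x)] = find(y)

    return len({find(x) for x in parent})

# Hypothetical 5-bit labels: two groups that are more than 2 bits apart
labels = [0b00000, 0b00001, 0b00011, 0b11111, 0b11110]
print(count_clusters(labels, 5))  # 2
```

Flipping bits with XOR touches only about `num_bits**2 / 2` candidate neighbors per label, instead of comparing every pair of labels.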

## Huffman Codes

### Binary code

- Maps each character of an alphabet $\Sigma$ to a binary string
- Ex. $\Sigma$ = a-z plus various punctuation (size 32 overall, say)
- Obvious encoding: use the 32 5-bit binary strings to encode this $\Sigma$
- Can we do better? Yes, if some characters of $\Sigma$ are much more frequent than others, by using a variable-length code

### Prefix-free codes

- Problem: with variable length codes, not clear where one character ends + the next one begins
- Solution: make sure that for every pair $i,j \in \Sigma$, neither of the encodings $f(i),f(j)$ is a prefix of the other
- Ex. {0, 10, 110, 111}
- Why useful? can give shorter encodings with non-uniform character frequencies

### Code as trees

- Goal: best binary prefix-free encoding for a given set of character frequencies
- Useful fact: binary codes <-> binary trees
- Example: ($\Sigma = \{A,B,C,D\}$)

### Prefix-free codes as trees

- In general, left child edges get "0" and right child edges get "1"
- For each $i \in \Sigma$, there is exactly one node labelled "$i$"
- Encoding: bits along path from root to node $i$
- Decoding: repeatedly follow path from root until hitting a leaf (ex. 0110111 <-> ACD)
- Encoding length of $i$ = depth of $i$ in a tree
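The decoding rule above (follow the path from the root until a leaf is hit, then restart) can be sketched as follows. The mapping A=0, B=10, C=110, D=111 is an assumption chosen to agree with the example 0110111 <-> ACD; `decode` is a hypothetical helper name, and the tree is represented as nested dictionaries purely for brevity.

```python
def decode(bits, code):
    """Decode a bit string with a prefix-free code given as {symbol: codeword}."""
    # Build the code tree: internal nodes are dicts keyed by "0"/"1", leaves are symbols
    root = {}
    for symbol, word in code.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = symbol

    decoded = []
    node = root
    for bit in bits:
        node = node[bit]                 # follow the edge labelled with this bit
        if not isinstance(node, dict):   # reached a leaf: emit it and restart at the root
            decoded.append(node)
            node = root
    return "".join(decoded)

code = {"A": "0", "B": "10", "C": "110", "D": "111"}  # assumed mapping
encoded = "".join(code[s] for s in "ACD")
print(encoded)                # 0110111
print(decode(encoded, code))  # ACD
```

Because no codeword is a prefix of another, every symbol sits at a leaf, so the decoder never has to look ahead.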

### Problem definition

- Given a probability $p_{i}$ for each character $i \in \Sigma$, find the tree $T$ that minimizes the average encoding length

$L(T) = \displaystyle\sum_{i \in \Sigma}p_{i} \cdot$ [depth of $i$ in $T$]

Idea #1
- Top-down / divide and conquer
- Partition $\Sigma$ into $\Sigma_{1},\Sigma_{2}$ each with ~50% of total frequency
- Recursively compute $T_{1}$ for $\Sigma_{1}$, $T_{2}$ for $\Sigma_{2}$
- This is sub-optimal

Idea #2 
- Build the tree bottom up using successive mergers

### A greedy approach

- Question: which pair of symbols is "safe" to merge?
- Observation: final encoding length of $i \in \Sigma$ = number of mergers its subtree endures (each merger increases encoding length of participating symbols by 1)
- Greedy heuristic: in first iteration, merge the two symbols with the smallest frequencies

### Huffman's algorithm

- (Given frequencies $p_{i}$ as input)
- If $|\Sigma| = 2$, return the tree with one leaf per symbol (one symbol encoded as 0, the other as 1)
- Let $a, b \in \Sigma$ have the smallest frequencies
- Let $\Sigma^{'}$ = $\Sigma$ with $a$, $b$ replaced by a new meta-symbol $ab$
- Define $p_{ab} = p_{a} + p_{b}$
- Recursively compute $T^{'}$ (for the alphabet $\Sigma^{'}$)
- Extend $T^{'}$ to $T$ by splitting leaf $ab$ into two leaves $a$ and $b$
- Return $T$
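The recursive steps above translate almost line by line into code. The sketch below is one possible rendering, assuming frequencies arrive as a dictionary with at least two symbols; `huffman_codes` is a hypothetical name, the meta-symbol $ab$ is represented as a tuple, and the example frequencies are made up for illustration.

```python
def huffman_codes(freqs):
    """Return {symbol: codeword}, following the recursive description literally."""
    if len(freqs) == 2:
        first, second = freqs              # base case: one bit per symbol
        return {first: "0", second: "1"}

    # a, b = the two symbols with the smallest frequencies
    a, b = sorted(freqs, key=freqs.get)[:2]
    merged = (a, b)                        # meta-symbol "ab"
    reduced = {s: p for s, p in freqs.items() if s not in (a, b)}
    reduced[merged] = freqs[a] + freqs[b]  # p_ab = p_a + p_b

    codes = huffman_codes(reduced)         # T' for the reduced alphabet
    prefix = codes.pop(merged)             # split leaf "ab" into leaves a and b
    codes[a] = prefix + "0"
    codes[b] = prefix + "1"
    return codes

# Hypothetical frequencies (percentages) for a four-symbol alphabet
codes = huffman_codes({"A": 60, "B": 25, "C": 10, "D": 5})
print({s: len(w) for s, w in codes.items()})
```

Re-sorting on every call makes this quadratic or worse, and the recursion depth grows with the alphabet size, so it is meant to mirror the bullets rather than to be fast.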

### Correctness

- By induction on $n = |\Sigma|$ (can assume $n \ge 2$)
- Base case: when $n = 2$, algorithm outputs the optimal tree (needs 1 bit per symbol)
- Inductive step: fix input with $n = |\Sigma| \gt 2$
- By inductive hypothesis: algorithm solves smaller subproblems (for $\Sigma^{'}$) optimally

### Inductive step

- Let $\Sigma^{'} = \Sigma$ with $a,b$ (symbols with smallest frequencies) replaced by meta-symbol $ab$
- Define $p_{ab} = p_{a} + p_{b}$
- For $T^{'}$ and its extension $T$ (with $d$ = depth of leaf $ab$ in $T^{'}$), $L(T)-L(T^{'}) = p_{a}(d+1) + p_{b}(d+1) - (p_{a} + p_{b})d = p_{a} + p_{b}$ (independent of $T,T^{'}$)
- Let $X_{ab}$ = trees for $\Sigma$ that have $a,b$ as siblings

### Summarizing

- Inductive hypothesis: Huffman's algorithm computes a tree $\hat{T^{'}}$ that minimizes $L(T^{'})$ for $\Sigma^{'}$
- Upshot: corresponding tree $\hat{T}$ minimizes $L(T)$ for $\Sigma$ over all trees in $X_{ab}$ (where $a,b$ are siblings)
- Key lemma: (completes proof of theorem) there is an optimal tree (for $\Sigma$) in $X_{ab}$ ($a,b$ were "safe" to merge)
- Intuition: can make an optimal tree better by pushing $a,b$ as deep as possible (since $a,b$ have smallest frequencies)

### Proof of key lemma

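- By exchange argument. Let $T^{*}$ be any tree that minimizes $L(T)$ for $\Sigma$. Let $x,y$ be siblings at the deepest level of $T^{*}$
- The exchange: obtain $\hat{T}$ from $T^{*}$ by swapping $a$ <-> $x$, $b$ <-> $y$
- Note: $\hat{T} \in X_{ab}$ (by choice of $x,y$)
- To finish: will show that $L(\hat{T}) \le L(T^{*})$
    - Then $\hat{T}$ is also optimal, which completes the proof
- Reason
    - $L(T^{*}) - L(\hat{T}) = (p_{x}-p_{a})$[depth of $x$ in $T^{*}$ - depth of $a$ in $T^{*}$] $+ (p_{y}-p_{b})$[depth of $y$ in $T^{*}$ - depth of $b$ in $T^{*}$] $\ge 0$
    - Every factor is non-negative: $a,b$ have the smallest frequencies, and $x,y$ sit at the deepest level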
    
### Running time

- Naive implementation: $O(n^{2})$ where $n = |\Sigma|$
- Speed up: heap! (to perform repeated minimum computations)
    - Use keys = frequencies
    - After extracting the two smallest-frequency symbols, re-insert the new meta-symbol (new key = sum of the 2 old ones)
    - Iterative, $O(n \log n)$
- Even faster: sorting + $O(n)$ additional work
    - Manage (meta-)symbols using two queues
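The heap-based variant can be sketched with Python's `heapq`. This is an illustrative sketch rather than the notebook's implementation: `huffman_code_lengths` is a hypothetical name, it returns only codeword lengths, and the example frequencies are made up. The heap operations total $O(n \log n)$; the per-symbol bookkeeping is kept simple for readability and adds extra work on top of that.

```python
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {symbol: codeword length} using a heap keyed by frequency."""
    tiebreak = count()  # avoids comparing symbol lists when frequencies tie
    heap = [(p, next(tiebreak), [s]) for s, p in freqs.items()]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}

    while len(heap) > 1:
        # Extract the two smallest-frequency (meta-)symbols
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        # Each merger adds one bit to every symbol inside the merged subtrees
        for s in group1 + group2:
            lengths[s] += 1
        # Re-insert the meta-symbol with key = sum of the two old keys
        heapq.heappush(heap, (p1 + p2, next(tiebreak), group1 + group2))
    return lengths

print(huffman_code_lengths({"A": 60, "B": 25, "C": 10, "D": 5}))
# {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```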

In [None]:
import itertools

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path (string) -- location of file to read

    Returns:
    tuple_data (tuple) -- dictionary representing nodes in a tree and integer representing number of nodes
    """

    data_dict = {}
    num_nodes = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        num_nodes = int(data_array[0].split(" ")[0])
        del data_array[0] # delete first element, which is just the length of data
        for item in data_array:
            data_dict[str(index)] = int(item)
            index += 1
            
    tuple_data = (data_dict, num_nodes)
    return tuple_data

In [None]:
def huffman(data_dict):
    """
    Implement Huffman encoding
    
    Args:
    data_dict (dictionary) -- stores key [index] - value [item] pair
    
    Returns:
    return_tuple (tuple) -- (maximum, minimum) codeword length, counted as the number of mergers a symbol takes part in
    """
    
    sorted_dict_by_value = {k: v for k, v in sorted(data_dict.items(), key=lambda item: item[1])}
    tree_merge_track = []

    while len(sorted_dict_by_value) > 2:
        first_two_items = dict(itertools.islice(sorted_dict_by_value.items(), 2)) # get two smallest values
        first_node = ""
        second_node = ""
        new_weight = 0
        for key, value in first_two_items.items():
            if first_node == "":
                first_node = key
            else:
                second_node = key
            new_weight += value
            del sorted_dict_by_value[key] # delete two smallest nodes

        new_node = first_node + " " + second_node 
        tree_merge_track.append(new_node) 
        sorted_dict_by_value[new_node] = new_weight # create a new node that is a combination of the two smallest nodes
        sorted_dict_by_value = {k: v for k, v in sorted(sorted_dict_by_value.items(), key=lambda item: item[1])}

    # Count how many mergers each symbol takes part in. Counts start at 1 to
    # account for the final merger of the last two (meta-)symbols, which the
    # loop above does not record.
    count_dict = {}
    for item in tree_merge_track:
        for char in item.split(" "):
            if char not in count_dict:
                count_dict[char] = 1
            count_dict[char] += 1

    # A symbol's merger count equals its depth, i.e. its codeword length
    lengths = list(count_dict.values())
    return_tuple = (max(lengths), min(lengths))
    return return_tuple

In [None]:
tuple_obj = open_file("data/1-3-3-huffman.txt")
assert(huffman(tuple_obj[0])[0] == 19)
assert(huffman(tuple_obj[0])[1] == 9)

    
## Dynamic programming

### Weighted independent sets

- Input: a path graph $G = (V,E)$ with non-negative weights on vertices
- Desired output: subset of nonadjacent vertices - an independent set of maximum total weight
- Brute force: exponential time

### Optimal structure

- Reason about structure of an optimal solution
- Let $S \subseteq V$ be a max-weight independent set (IS)
- Let $v_{n}$ = last vertex of path

### A case analysis

- Case #1: suppose $v_{n} \notin S$. Let $G^{'} = G$ with $v_{n}$ deleted
    - Note: $S$ is also an IS of $G^{'}$
    - Note: $S$ must be a max-weight IS of $G^{'}$ - if $S^{*}$ were better, it would also be better than $S$ in $G$ (contradiction)
- Case #2: suppose $v_{n} \in S$
    - Note: previous vertex $v_{n-1} \notin S$ (by definition of IS). Let $G^{''} = G$ with $v_{n-1}, v_{n}$ deleted
    - Note: $S-\{v_{n}\}$ is an IS of $G^{''}$
    - Note: must in fact be a max-weight IS of $G^{''}$ - if $S^{*}$ is better than $S-\{v_{n}\}$ in $G^{''}$, then $S^{*}\cup\{v_{n}\}$ is better than $S$ in $G$ (contradiction)
    
### Proposed algorithm

- Recursively compute $s_{1}$ = max-weight IS of $G^{'}$
- Recursively compute $s_{2}$ = max-weight IS of $G^{''}$
- Return $s_{1}$ or $s_{2}\cup\{v_{n}\}$, whichever is better
- Runs in exponential time

### Eliminating redundancy

- Reformulate as a bottom-up iterative algorithm. Let $G_{i}$ = first $i$ vertices of $G$
- Populate array $A$ left to right with $A[i]$ = value of max-weight IS of $G_{i}$
- Initialize $A[0] = 0, A[1] = w_{1}$
- For $i = 2,3 \dots n$
    - $A[i] = \max\{A[i-1], A[i-2]+w_{i}\}$
- Runs in $O(n)$

### Optimal solution

- Trace back through filled-in array to reconstruct optimal solution
- Key point: we know that a vertex $v_{i}$ belongs to a max-weight IS of $G_{i}$ <=> $w_{i}$ + max-weight IS of $G_{i-2} \ge$ max-weight IS of $G_{i-1}$

### A reconstruction algorithm

Then trace back through filled-in array to reconstruct optimal solution
- Let $A$ = filled-in array
- Let $S$ = empty set, $i = n$
- While $i \ge 1$ 
    - If $A[i-1] \ge A[i-2] + w_{i}$ (case 1 wins; treat $A[-1]$ as 0 when $i = 1$)
        - Decrease i by 1
    - Else (case 2 wins)
        - Add $v_{i}$ to $S$ 
        - Decrease $i$ by 2
- Return $S$
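A small worked instance makes the fill and trace-back steps concrete. The sketch below assumes the path's vertex weights are given as a plain list and uses made-up weights 1, 4, 5, 4; `max_weight_is` is a hypothetical name. The guard for $i = 1$ is needed because `A[-1]` in Python would silently read the last entry of the array.

```python
def max_weight_is(weights):
    """weights[k] is the weight of vertex k+1. Returns (best value, chosen vertices)."""
    n = len(weights)
    w = [0] + list(weights)            # shift to 1-based indexing
    A = [0] * (n + 1)
    if n >= 1:
        A[1] = w[1]
    for i in range(2, n + 1):
        A[i] = max(A[i - 1], A[i - 2] + w[i])

    chosen = set()
    i = n
    while i >= 1:
        two_back = A[i - 2] if i >= 2 else 0   # value of the empty prefix when i == 1
        if A[i - 1] >= two_back + w[i]:
            i -= 1                     # case 1 wins: v_i is excluded
        else:
            chosen.add(i)              # case 2 wins: v_i is included
            i -= 2
    return A[n], chosen

print(max_weight_is([1, 4, 5, 4]))  # (8, {2, 4})
```

For these weights the array is `[0, 1, 4, 6, 8]`, and the trace-back picks vertex 4, jumps to vertex 2, and stops.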

### Principle of Dynamic Programming
1. Identify a small number of sub-problems
2. Can quickly + correctly solve "larger" sub-problems given the solutions to "smaller" sub-problems
3. Solving all sub-problems computes final solution

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path (string) -- location of file to read

    Returns:
    tuple_data (tuple) -- dictionary representing a graph and integer representing number of nodes
    """

    data_dict = {}
    num_nodes = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        num_nodes = int(data_array[0].split(" ")[0])
        del data_array[0] # delete first element, which is just the length of data
        for item in data_array:
            data_dict[index] = int(item)
            index += 1
            
    tuple_data = (data_dict, num_nodes)
    return (data_dict, num_nodes)

In [None]:
def max_weight_independent_set(data_dict, num_nodes):
    """
    Find max-weight independent set using dynamic programming
    
    Args:
    data_dict (dictionary) -- stores key [index] - value [item] pair
    num_nodes (integer) -- total number of nodes in the set
    
    Returns:
    ret (string) -- bit string marking which of the vertices 1, 2, 3, 4, 17, 117, 517, 997 belong to the set
    """
    
    A = {}
    A[0] = 0
    A[1] = data_dict[1]
    for i in range(2, num_nodes + 1):
        A[i] = max(A[i-1], A[i-2] + data_dict[i])

    S = set()
    while num_nodes > 1:
        if A[num_nodes-1] >= A[num_nodes-2] + data_dict[num_nodes]:
            num_nodes -= 1
        else:
            S.add(num_nodes)
            num_nodes -= 2
    if 2 not in S:
        S.add(1)

    ret = ""
    for i in [1, 2, 3, 4, 17, 117, 517, 997]:
        if i in S:
            ret += "1"
        else:
            ret += "0"
            
    return ret

In [None]:
tuple_obj = open_file("data/1-3-3-max-weight-independent-set1.txt")
assert(max_weight_independent_set(tuple_obj[0], tuple_obj[1]) == "01010000")

tuple_obj = open_file("data/1-3-3-max-weight-independent-set2.txt")
assert(max_weight_independent_set(tuple_obj[0], tuple_obj[1]) == "10100000")

tuple_obj = open_file("data/1-3-3-max-weight-independent-set.txt")
assert(max_weight_independent_set(tuple_obj[0], tuple_obj[1]) == "10100110")

## Knapsack problem

- Input: $n$ items, each with
    - Value $v_{i}$ (non-negative)
    - Size $w_{i}$ (non-negative and integral)
- Input: capacity $W$ (non-negative integer)
- Output: subset $S \subseteq \{1 \dots n\}$ that maximizes $\displaystyle\sum_{i \in S}v_{i}$ subject to $\displaystyle\sum_{i \in S}w_{i} \le W$

Step #1
- Let $S$ = a max-value solution
- Suppose item $n \notin S$. Then $S$ must be optimal for the first $n-1$ items with capacity $W$
    - If $S^{*}$ were better than $S$ with respect to the first $n-1$ items, then this is equally true with respect to all $n$ items -> contradiction
- Suppose item $n \in S$. Then $S-\{n\}$ must be optimal for the first $n-1$ items with capacity $W-w_{n}$
    - If $S^{*}$ has higher value than $S-\{n\}$ and total size $\le W-w_{n}$, then $S^{*}\cup\{n\}$ has size $\le W$ and value more than $S$ -> contradiction

Step #2
- Let $V_{i,x}$ = value of the best solution that
    - Uses only the first $i$ items
    - Has total size $\le x$
- For $i = 1 \dots n$ and any $x$
    - $V_{i,x} = \max\{V_{i-1,x}, v_{i} + V_{i-1,x-w_{i}}\}$ (first term: item $i$ is excluded; second term: item $i$ is included)
- If $w_{i} > x$, then $V_{i,x} = V_{i-1,x}$

Step #3
- Let $A$ = 2D array
- Init $A[0,x] = 0$ for $x = 0 \dots W$
- For $i = 1 \dots n$
    - For $x = 0 \dots W$
        - $A[i,x] = \max\{A[i-1, x], A[i-1, x-w_{i}] + v_{i}\}$ (ignore second term if $w_{i} \gt x$)
- Return $A[n,W]$
- Runs in $\Theta(nW)$
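Step #3 returns only the optimal value, while the problem asks for the subset $S$. One way to recover it is to trace back through the filled-in table, in the same spirit as the independent-set reconstruction. The sketch below is an illustration under assumptions: `knapsack_with_items` is a hypothetical name, items are passed as a list of (value, size) pairs, and the four-item instance is made up.

```python
def knapsack_with_items(items, capacity):
    """items is a list of (value, size) pairs. Returns (best value, chosen 1-based indices)."""
    n = len(items)
    A = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, size = items[i - 1]
        for x in range(capacity + 1):
            A[i][x] = A[i - 1][x]                    # item i excluded
            if size <= x:                            # item i fits: try including it
                A[i][x] = max(A[i][x], A[i - 1][x - size] + value)

    # Trace back: take item i when leaving it out would give a strictly smaller value
    chosen, x = [], capacity
    for i in range(n, 0, -1):
        if A[i][x] != A[i - 1][x]:
            chosen.append(i)
            x -= items[i - 1][1]
    return A[n][capacity], sorted(chosen)

items = [(3, 4), (2, 3), (4, 2), (4, 3)]  # hypothetical (value, size) pairs
print(knapsack_with_items(items, 6))      # (8, [3, 4])
```

The trace-back needs the full table, so it does not combine directly with a two-row space optimization.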

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path (string) -- location of file to read

    Returns:
    tuple_data (tuple) -- dictionary representing value and weight, integer representing total knapsack-size, integer representing number of items
    """

    data_dict = {}
    knapsack_size = 0
    num_items = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        knapsack_size = int(data_array[0].split(" ")[0])
        num_items = int(data_array[0].split(" ")[1])
        del data_array[0] # delete first element, which is just metadata
        for item in data_array:
            value = int(item.split(" ")[0])
            weight = int(item.split(" ")[1])
            data_dict[index] = (value, weight)
            index += 1
            
    tuple_data = (data_dict, knapsack_size, num_items)
    return tuple_data

In [None]:
def knapsack(data_dict, knapsack_size, num_items):
    """
    Implement dynamic programming algorithm for knapsack problem
    
    Args:
    data_dict (dictionary) -- has value and weight of each item
    knapsack_size (integer) -- total knapsack size/weight
    num_items (integer) -- total number of items
    
    Returns:
    result (integer) -- maximum value achievable given the size
    """
    
    A = []
    
    for i in range(0, num_items + 1):
        A.append([])
        for j in range(0, knapsack_size + 1):
            A[i].append(0)


    for i in range(1, num_items + 1):
        for j in range(0, knapsack_size + 1):
            if data_dict[i][1] > j:
                A[i][j] = A[i-1][j]
            else:
                A[i][j] = max(A[i-1][j], A[i-1][j-data_dict[i][1]] + data_dict[i][0])

    result = A[num_items][knapsack_size]
    return result      

In [None]:
tuple_obj = open_file("data/1-3-4-knapsack1.txt")
assert(knapsack(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 14)

tuple_obj = open_file("data/1-3-4-knapsack2.txt")
assert(knapsack(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 150)

tuple_obj = open_file("data/1-3-4-knapsack3.txt")
assert(knapsack(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 147)

tuple_obj = open_file("data/1-3-4-knapsack4.txt")
assert(knapsack(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 8)

tuple_obj = open_file("data/1-3-4-knapsack.txt")
assert(knapsack(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 2493893)

In [None]:
def knapsack_big(data_dict, knapsack_size, num_items):
    """
    Implement (optimized) dynamic programming algorithm for large knapsack problem
    
    Args:
    data_dict (dictionary) -- has value and weight of each item
    knapsack_size (integer) -- total knapsack size/weight
    num_items (integer) -- total number of items
    
    Returns:
    result (integer) -- maximum value achievable given the size
    """
    
    # Keep only two rows: A[0] is the row for the previous item, A[1] for the current one
    A = [[0] * (knapsack_size + 1) for _ in range(2)]

    for i in range(1, num_items + 1):
        value, weight = data_dict[i]
        # Capacities below the item's weight cannot hold item i, so carry the old values over
        A[1][0:weight] = A[0][0:weight]
        for j in range(weight, knapsack_size + 1):
            A[1][j] = max(A[0][j], A[0][j - weight] + value)
        A[0] = A[1][:] # copy array by value, not reference

    # Only two rows exist, so read the answer from the latest row rather than A[num_items]
    result = A[0][knapsack_size]
    return result

In [None]:
tuple_obj = open_file("data/1-3-4-knapsack-big.txt")
assert(knapsack_big(tuple_obj[0], tuple_obj[1], tuple_obj[2]) == 4243395)

## Sequence alignment

- Input: strings $X = x_{1} \dots x_{m}$, $Y = y_{1} \dots y_{n}$ over some alphabet (like $\{A,C,G,T\}$)
    - Penalty $\alpha_{gap} \ge 0$ for inserting a gap, $\alpha_{ab} \ge 0$ for matching $a$ and $b$
- Alignment: insert gaps to equalize the lengths of the strings
- Goal: alignment with minimum possible total penalty

Final position of string can be one of
- Case1: $x_{m}$ and $y_{n}$ matched
- Case2: $x_{m}$ is matched with a gap
- Case3: $y_{n}$ is matched with a gap

Let $X^{'} = X - x_{m}$ and $Y^{'} = Y - y_{n}$ 
- Case1: induced alignment of $X^{'}$ and $Y^{'}$ is optimal
- Case2: induced alignment of $X^{'}$ and $Y$ is optimal
- Case3: induced alignment of $X$ and $Y^{'}$ is optimal

Subproblem $(X_{i}, Y_{j})$
- $X_{i}$ = 1st $i$ letters of $X$
- $Y_{j}$ = 1st $j$ letters of $Y$

### Recurrence

- Let $P_{ij}$ = penalty of optimal alignment of $X_{i}$ and $Y_{j}$
- For all $i = 1 \dots m$ and $j = 1 \dots n$, $P_{ij}$ is the **minimum** of the following three cases
- Case1: $\alpha_{x_{i}y_{j}}$ + $P_{i-1,j-1}$
- Case2: $\alpha_{gap}$ + $P_{i-1,j}$
- Case3: $\alpha_{gap}$ + $P_{i,j-1}$

### Algorithm

- Let $A$ = 2D array
- $A[i,0] = i \cdot \alpha_{gap}$ for all $i \ge 0$ and $A[0,j] = j \cdot \alpha_{gap}$ for all $j \ge 0$
- For $i = 1 \dots m$
    - For $j = 1 \dots n$
        - $A[i,j]$ = $\min\{A[i-1,j-1]+\alpha_{x_{i}y_{j}}, A[i-1,j]+\alpha_{gap}, A[i,j-1]+\alpha_{gap}\}$
- Runs in $O(mn)$

### Reconstructing a solution
        
- Trace back through filled-in table $A$, starting at $A[m,n]$
- When reaching subproblem $A[i,j]$
    - If $A[i,j]$ filled using case1, match $x_{i}$ and $y_{j}$, and go to $A[i-1, j-1]$
    - If $A[i,j]$ filled using case2, match $x_{i}$ and a gap, and go to $A[i-1, j]$
    - If $A[i,j]$ filled using case3, match $y_{j}$ and a gap, and go to $A[i, j-1]$
- If $i=0$ or $j=0$, match remaining substring with gaps
- Runs in $O(m+n)$
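The table fill and the trace-back can be sketched together. This is a minimal sketch under simplifying assumptions: `align` is a hypothetical name, a single `mismatch_penalty` stands in for the general $\alpha_{ab}$ (with penalty 0 when the letters are equal), and gaps are written as `-`. The example strings and penalties are made up.

```python
def align(X, Y, gap_penalty, mismatch_penalty):
    """Return (minimum penalty, aligned X, aligned Y), with '-' marking gaps."""
    m, n = len(X), len(Y)

    def alpha(a, b):
        # Assumed penalty scheme: equal letters cost nothing, any mismatch costs the same
        return 0 if a == b else mismatch_penalty

    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i * gap_penalty
    for j in range(n + 1):
        A[0][j] = j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = min(A[i - 1][j - 1] + alpha(X[i - 1], Y[j - 1]),  # case 1
                          A[i - 1][j] + gap_penalty,                    # case 2
                          A[i][j - 1] + gap_penalty)                    # case 3

    # Trace back from A[m][n], collecting the alignment right to left
    top, bottom = [], []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + alpha(X[i - 1], Y[j - 1]):
            top.append(X[i - 1])
            bottom.append(Y[j - 1])
            i, j = i - 1, j - 1
        elif A[i][j] == A[i - 1][j] + gap_penalty:
            top.append(X[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(Y[j - 1])
            j -= 1
    while i > 0:                       # match the rest of X with gaps
        top.append(X[i - 1])
        bottom.append("-")
        i -= 1
    while j > 0:                       # match the rest of Y with gaps
        top.append("-")
        bottom.append(Y[j - 1])
        j -= 1
    return A[m][n], "".join(reversed(top)), "".join(reversed(bottom))

penalty, top, bottom = align("AGGGCT", "AGGCA", gap_penalty=1, mismatch_penalty=2)
print(penalty)
print(top)
print(bottom)
```

When several cases tie, the order of the `if` branches decides which optimal alignment is returned; the penalty is the same either way.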

## Optimal binary search tree

- What is the best search tree for a given set of keys?
- Input: frequencies $p_{1} \dots p_{n}$ for items $1 \dots n$ (assume items in sorted order $1 \lt \dots \lt n$)
- Goal: compute a valid search tree that minimizes weighted search time

$C(T) = \displaystyle\sum_{i}p_{i} \cdot$ [search time for $i$ in $T$]

- Ex. if $T$ is a red-black tree, then $C(T) = O(\log n)$ (assuming $\displaystyle\sum_{i}p_{i} = 1$)

### Comparison with Huffman codes

- Similarities 
    - Output = a binary tree
    - Goal is (essentially) to minimize average depth with respect to given probabilities
- Difference
    - With Huffman codes, constraint was prefix-freeness, but here constraint is search tree property

### Optimal structure

- Suppose an optimal BST for keys $\{1,2 \dots n\}$ has root $r$, left subtree $T_{1}$, right subtree $T_{2}$
- Then, subtrees $T_{1}$ and $T_{2}$ are optimal BSTs for the keys $\{1 \dots r-1\}$ and $\{r+1 \dots n\}$
- Proof
    - Let $T$ be an optimal BST for keys $\{1 \dots n\}$ with frequencies $p_{1} \dots p_{n}$
    - Suppose $T$ has root $r$
    - Suppose for contradiction that $T_{1}$ is not optimal for $\{1,2 \dots r-1\}$ (other case is similar) with $C(T^{*}_{1}) \lt C(T_{1})$
    - Obtain $T^{*}$ from $T$ by "cutting + pasting" $T^{*}_{1}$ in for $T_{1}$
    - Need to show $C(T^{*}) \lt C(T)$
    - Decompose $C(T)$ by separating the root $r$ from the two subtrees
        - $C(T) = \displaystyle\sum_{i=1}^{n}p_{i}$ [search time for $i$ in $T$]
        - $= p_{r} + \displaystyle\sum_{i=1}^{r-1}p_{i}$ [search time for $i$ in $T$] $+ \displaystyle\sum_{i=r+1}^{n}p_{i}$ [search time for $i$ in $T$]
        - $= \displaystyle\sum_{i=1}^{n}p_{i} + \displaystyle\sum_{i=1}^{r-1}p_{i}$ [search time for $i$ in $T_{1}$] $+ \displaystyle\sum_{i=r+1}^{n}p_{i}$ [search time for $i$ in $T_{2}$]
        - $=$ a constant (independent of $T$) $+ C(T_{1}) + C(T_{2})$
    - $C(T^{*}_{1}) \lt C(T_{1})$ implies $C(T^{*}) \lt C(T)$, contradicting optimality of $T$

### Relevant subproblems

- Keys $\{1 \dots n\}$ = original items. For which subsets $S \subseteq \{1 \dots n\}$ might we need to compute the optimal BST for $S$?
    - Contiguous intervals ($S = \{i, i+1 \dots j-1, j\}$) for every $i \le j$
    
### Recurrence

- For $1 \le i \le j \le n$, let $C_{ij}$ = weighted search cost of an optimal BST for items $\{i, i+1 \dots j-1, j\}$ with probabilities $\{p_{i}, p_{i+1} \dots p_{j}\}$
- For every $1 \le i \le j \le n$

$C_{ij} = \underset{r=i \dots j}{\text{min}}\left[\displaystyle\sum_{k=i}^{j}p_{k}+C_{i,r-1}+C_{r+1,j}\right]$ where $C_{xy} = 0$ if $x>y$

- Correctness: optimal substructure narrows candidates down to $(j-i+1)$ possibilities, recurrence picks the best by brute force

### Algorithm

- Let $A$ = 2D array
- For $s = 0 \dots n-1$ ($s$ represents $j-i$)
    - For $i = 1 \dots n-s$ (so $i+s$ plays role of $j$)
        - $A[i, i+s]$ = $\underset{r=i \dots i+s}{\text{min}}\left[\displaystyle\sum_{k=i}^{i+s}p_{k}+A[i,r-1]+A[r+1,i+s]\right]$ where $A[x,y] = 0$ if $x \gt y$
- Return $A[1,n]$
- Runs in $\Theta(n^{3})$ ($\Theta(n^{2})$ subproblems, $\Theta(j-i)$ time to compute $A[i,j]$)
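The interval-by-interval fill can be sketched as follows. This is an illustrative sketch, not a reference implementation: `optimal_bst_cost` is a hypothetical name, search time is counted as the number of nodes visited (so the root costs 1), prefix sums are an added convenience for the $\sum_{k} p_{k}$ term, and the example frequencies are made-up integer percentages.

```python
def optimal_bst_cost(p):
    """p[k] is the frequency of key k+1 (keys in sorted order). Returns the minimum weighted search cost."""
    n = len(p)
    # prefix[k] = p_1 + ... + p_k, so p_i + ... + p_j = prefix[j] - prefix[i - 1]
    prefix = [0] * (n + 1)
    for k in range(1, n + 1):
        prefix[k] = prefix[k - 1] + p[k - 1]

    # A[i][j] for 1 <= i <= j <= n; entries with i > j stay 0 (empty subtree)
    A = [[0] * (n + 2) for _ in range(n + 2)]
    for s in range(n):                         # s = j - i, smallest intervals first
        for i in range(1, n - s + 1):
            j = i + s
            total = prefix[j] - prefix[i - 1]
            # Try every key r in [i, j] as the root and keep the cheapest
            A[i][j] = total + min(A[i][r - 1] + A[r + 1][j] for r in range(i, j + 1))
    return A[1][n]

print(optimal_bst_cost([80, 10, 10]))  # 130
```

With frequencies 80, 10, 10 the best tree puts key 1 at the root, giving cost 80·1 + 10·2 + 10·3 = 130, compared with 190 for the balanced tree rooted at key 2.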