# Greedy Algorithm, Minimum Spanning Tree, and Dynamic Programming

## Application - internet routing

- ex. Stanford gateway router needs to send data to the Cornell gateway router
- Dijkstra's algorithm does this (with nonnegative edge lengths)
- issue is that Stanford gateway router would need to know the entire Internet
- need a shortest-path algorithm that uses only local computation
- solution is Bellman-Ford algorithm (also handles negative edge costs)

## Application - sequence alignment

- input: two strings over the alphabet {A,C,G,T}
- problem: figure out how similar the two strings are
- measure similarity via quality of "best" alignment
    - penalty $pen_{gap} \ge 0$ for each gap
    - penalty $pen_{AT} \ge 0$ for mismatching A and T
    - etc
- output: alignment of the strings that minimizes the total penalty (Needleman-Wunsch score)
- solution: straightforward dynamic programming

## Greedy algorithm

- ex. Dijkstra's shortest path algorithm
- easy to propose
- easy runtime analysis
- hard to establish correctness
- most greedy algorithms are not correct

## Application - optimal caching

- cache is faster than memory
- on a fault (cache miss), need to evict something from cache to make room
- theorem: the "furthest-in-future" algorithm is optimal (minimizes the number of cache misses)
- serves as guideline for practical algorithm ("Least Recently Used" should do well provided data exhibits locality of reference)
- serves as idealized benchmark for caching algorithms
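
A minimal sketch of the benchmark idea above, assuming the whole request sequence is known in advance (which is exactly why furthest-in-future is only an idealized yardstick). The function names `furthest_in_future_misses` and `lru_misses` are made up for illustration.

```python
def furthest_in_future_misses(requests, capacity):
    """Count cache misses when evicting the page requested furthest in the future."""
    cache = set()
    misses = 0
    for t, page in enumerate(requests):
        if page in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(p):
                # Position of the next request for p (infinity if never requested again)
                for k in range(t + 1, len(requests)):
                    if requests[k] == p:
                        return k
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(page)
    return misses


def lru_misses(requests, capacity):
    """Count cache misses under Least Recently Used eviction."""
    cache = []  # least recently used page first
    misses = 0
    for page in requests:
        if page in cache:
            cache.remove(page)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.pop(0)
        cache.append(page)
    return misses


requests = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(furthest_in_future_misses(requests, 3))  # 7
print(lru_misses(requests, 3))                 # 10
```

On this request sequence the idealized rule misses 7 times and LRU misses 10 times, which is the kind of gap the benchmark is meant to expose.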

## Application - scheduling

- setup
    - one shared resource (ex. a processor)
    - many "jobs" to do (ex. processes)
- question
    - in what order should we sequence the jobs?
- assume: each job has a 
    - weight $w_{j}$ ("priority")
    - length $l_{j}$
- definition: the completion time $c_{j}$ of job $j$ = sum of job lengths up to and including $j$
- goal: minimize the weighted sum of completion times $\min \displaystyle\sum_{j=1}^{n}w_{j}c_{j}$
- intuition
    - with equal lengths, schedule larger or smaller weight jobs earlier? larger
    - with equal weights, schedule shorter or longer jobs earlier? shorter
- what if $w_{i} \gt w_{j}$ but $l_{i} \gt l_{j}$?
    - assign "scores" to jobs that are
        - increasing in weight
        - decreasing in length
- guess #1: order jobs by decreasing value of $w_{j} - l_{j}$ (not always correct)
- guess #2: order jobs by decreasing ratio $\dfrac{w_{j}}{l_{j}}$ (always correct, runs in $O(n \log n)$ - just need to sort)
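
A small sketch of the two guesses on a two-job instance where they disagree. Jobs are represented as `(weight, length)` tuples; the helper names are assumptions for illustration only.

```python
def weighted_completion_time(jobs):
    """jobs: list of (weight, length) tuples in the order they are scheduled."""
    total, elapsed = 0, 0
    for weight, length in jobs:
        elapsed += length          # completion time c_j of this job
        total += weight * elapsed  # contributes w_j * c_j
    return total


def schedule_by_difference(jobs):
    # Guess #1: decreasing w_j - l_j
    return sorted(jobs, key=lambda job: job[0] - job[1], reverse=True)


def schedule_by_ratio(jobs):
    # Guess #2: decreasing w_j / l_j
    return sorted(jobs, key=lambda job: job[0] / job[1], reverse=True)


jobs = [(3, 5), (1, 2)]
print(weighted_completion_time(schedule_by_difference(jobs)))  # 23
print(weighted_completion_time(schedule_by_ratio(jobs)))       # 22
```

The difference rule puts the short, light job first (cost 1·2 + 3·7 = 23), while the ratio rule puts the heavy job first (cost 3·5 + 1·7 = 22), so guess #1 is not always correct.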

### claim - guess #2 is always correct

- by an exchange argument
- fix arbitrary input of $n$ jobs
- consider proof by contradiction
- let $\sigma$ = greedy schedule, $\sigma^{*}$ = optimal schedule (with $\sigma^{*}$ better than $\sigma$)
- assume all $\dfrac{w_{j}}{l_{j}}$'s are distinct
- assume by just renaming jobs $\dfrac{w_{1}}{l_{1}} \gt \dfrac{w_{2}}{l_{2}} \gt \dots \gt \dfrac{w_{n}}{l_{n}}$
- thus, greedy schedule $\sigma$ is just $1,2,3 \dots n$
- thus, if optimal schedule $\sigma^{*} \ne \sigma$, then there are consecutive jobs $i,j$ in $\sigma^{*}$ with $i \gt j$ and $i$ scheduled immediately before $j$
- suppose we exchange order of $i,j$ in $\sigma^{*}$ (leaving other jobs unchanged)
    - cost of exchange is $w_{i}l_{j}$ ($c_{i}$ goes up by $l_{j}$)
    - benefit of exchange is $w_{j}l_{i}$ ($c_{j}$ goes down by $l_{i}$)
    - $i \gt j \Rightarrow \dfrac{w_{i}}{l_{i}} \lt \dfrac{w_{j}}{l_{j}} \Rightarrow w_{i}l_{j} \lt w_{j}l_{i}$
        - cost $\lt$ benefit, meaning swap improves $\sigma^{*}$, contradicts optimality of $\sigma^{*}$
        
### claim - guess #2 is correct even with ties

- fix arbitrary input of $n$ jobs
- let $\sigma$ = greedy schedule, $\sigma^{*}$ = any other schedule
- will show $\sigma$ at least as good as $\sigma^{*}$
    - implies that greedy schedule is optimal
- assume by just renaming jobs, greedy schedule $\sigma$ is just $1,2,3 \dots n$ (and so $\dfrac{w_{1}}{l_{1}} \ge \dfrac{w_{2}}{l_{2}} \ge \dots \ge \dfrac{w_{n}}{l_{n}}$)
- consider arbitrary schedule $\sigma^{*}$. If $\sigma^{*} = \sigma$, done
- else recall there exist consecutive jobs $i,j$ in $\sigma^{*}$ with $i \gt j$
- exchanging $i$ and $j$ in $\sigma^{*}$ has net benefit of $w_{j}l_{i}-w_{i}l_{j} \ge 0$
- exchanging an "adjacent inversion" like $i,j$ only makes $\sigma^{*}$ better (or no worse), and it decreases the number of inverted pairs (jobs $i,j$ with $i \gt j$ and $i$ scheduled earlier)
- after at most $\binom{n}{2}$ such exchanges, can transform $\sigma^{*}$ into $\sigma$
- $\sigma$ at least as good as $\sigma^{*}$
- greedy is optimal
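
As a sanity check of the claim (not a proof), the sketch below compares the greedy ratio schedule against brute force over all orderings on small random instances, ties included. The instance sizes, seed and function names are arbitrary choices made for illustration.

```python
import random
from fractions import Fraction
from itertools import permutations


def weighted_completion_time(jobs):
    """jobs: sequence of (weight, length) tuples in scheduled order."""
    total, elapsed = 0, 0
    for weight, length in jobs:
        elapsed += length
        total += weight * elapsed
    return total


def greedy_schedule(jobs):
    # Decreasing w_j / l_j; exact fractions avoid floating-point ties
    return sorted(jobs, key=lambda job: Fraction(job[0], job[1]), reverse=True)


def greedy_matches_brute_force(trials=200, n=5, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        jobs = [(rng.randint(1, 9), rng.randint(1, 9)) for _ in range(n)]
        best = min(weighted_completion_time(order) for order in permutations(jobs))
        if weighted_completion_time(greedy_schedule(jobs)) != best:
            return False
    return True


print(greedy_matches_brute_force())  # True
```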

## Minimum spanning trees

- input: undirected graph $G=(V,E)$ and a cost $c_{e}$ for each edge $e \in E$
    - assume adjacency list representation
    - OK if edge costs are negative
- output: minimum cost tree $T \subseteq E$ that spans all vertices
    - $T$ has no cycles
    - subgraph $(V,T)$ is connected
- assumption #1: input graph $G$ is connected
    - else no spanning trees
    - easy to check in preprocessing (ex. depth-first search)
- assumption #2: edge costs are distinct
    - Prim + Kruskal remain correct with ties (which can be broken arbitrarily)
    
## Prim's MST algorithm

- runs in $O(mn)$
- initialize $X = \{s\}$ # $s \in V$ chosen arbitrarily
- $T = \emptyset$ # invariant: $X$ = vertices spanned by tree-so-far $T$
- while $X \ne V$ # increases the number of spanned vertices in cheapest way possible
    - let $e = (u,v)$ be the cheapest edge with $u \in X$ and $v \notin X$
    - add $e$ to $T$
    - add $v$ to $X$
    
### claim - Prim's algorithm outputs a spanning tree

- definition: a cut of a graph $G = (V,E)$ is a partition of $V$ into 2 non-empty sets
- empty cut lemma
    - a graph is not connected <=> there exists a cut$(A,B)$ with no crossing edges
    - proof: (<=)
        - assume RHS
        - pick any $u \in A$ and $v \in B$
        - since no edges cross $(A,B)$, there is no $u,v$ path in $G$ 
        - thus, $G$ not connected
    - proof: (=>)
        - suppose $G$ has no $u,v$ path
        - define $A$ = {vertices reachable from $u$ in $G$} ($u$'s connected component)
        - define $B$ = {all other vertices} (all other connected components)
        - note: no edges cross out $(A,B)$ (otherwise $A$ would be bigger!)
- double-crossing lemma
    - suppose the cycle $C \subseteq E$ has an edge crossing the cut$(A,B)$, then so does some other edge of $C$
- lonely cut corollary
    - if $e$ is the only edge crossing some cut$(A,B)$, then it is not in any cycle (if it were in a cycle, some other edge would have to cross the cut!)
- in summary
    - (1) algorithm maintains invariant that $T$ spans $X$
    - (2) can't get stuck with $X \ne V$ (otherwise the cut $(X, V-X)$ must be empty - by empty cut lemma, input graph $G$ is disconnected)
    - (3) no cycles ever get created in $T$
        - consider any iteration with current sets $X$ and $T$
        - suppose $e$ gets added
        - $e$ is the first edge crossing $(X, V-X)$ that gets added to $T$ => its addition can't create a cycle in $T$ (by lonely cut corollary)
        
### claim - Prim's algorithm always outputs a minimum-cost spanning tree

- cut property: consider an edge $e$ of $G$. suppose there is a cut $(A,B)$ such that $e$ is the cheapest edge of $G$ that crosses it. then $e$ belongs to the MST of $G$
- claim: cut property => Prim's algorithm is correct
    - already proved Prim's algorithm outputs a spanning tree $T^{*}$
    - key point: every edge $e \in T^{*}$ is explicitly justified by the cut property
        - $T^{*}$ is a subset of the MST
        - since $T^{*}$ is already a spanning tree, it must be the MST
- proof of cut property
    - suppose there is an edge $e$ that is the cheapest one crossing a cut$(A,B)$, yet $e$ is not in the MST $T^{*}$
    - idea: exchange $e$ with another edge in $T^{*}$ to make it even cheaper (contradiction)
    - since $T^{*}$ is connected, it must contain an edge $f (\ne e)$ crossing $(A,B)$
    - idea: exchange $e$ and $f$ to get a spanning tree cheaper than $T^{*}$ (contradiction)
    - let $C$ = cycle created by adding $e$ to $T^{*}$
    - by double-crossing lemma: some other edge $e^{'}$ of $C$ (with $e^{'} \ne e$ and $e^{'} \in T^{*}$) crosses $(A,B)$ 
    - note: $T = T^{*}\cup\{e\}-\{e^{'}\}$ is also a spanning tree
    - since $c_{e} \lt c_{e^{'}}$ ($e$ is the cheapest edge crossing the cut), $T$ is cheaper than purported MST $T^{*}$, contradiction!
    
## Prim's algorithm with heaps

- invariant #1: elements in heap = vertices of $V-X$
- invariant #2: for $v \in V-X$, $key[v]$ = cost of the cheapest edge $(u,v)$ with $u \in X$ (or $+\infty$ if no such edge exists)
- check: can initialize heap with $O(m+n\log n) = O(m\log n)$ preprocessing
- note: given invariants, extract-min yields next vertex $v \notin X$ and edge $(u,v)$ crossing $(X, V-X)$ to add to $X$ and $T$, respectively
- issue: might need to recompute some keys to maintain invariant #2 after each extract-min

When $v$ added to $X$
- for each edge $(v,w) \in E$
    - if $w \in V-X$
        - (update key if needed)
        - delete $w$ from heap
        - recompute $key[w] = min[key[w], c_{vw}]$
        - re-insert into heap
        
Running time with heaps
- dominated by time required for heap operations
- $(n-1)$ inserts during preprocessing
- $(n-1)$ extract-mins (one per iteration of while loop)
- each edge $(v,w)$ triggers one delete/insert combo (when its first endpoint is sucked into $X$)
- $O(m)$ heap operations (recall $m \ge n-1$ since $G$ connected)
- $O(mlogn)$ time
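
A minimal sketch of heap-based Prim using Python's `heapq`. Note the swapped-in technique: `heapq` has no delete operation, so instead of the delete/re-insert key updates described above, this version keeps candidate crossing edges in the heap and skips stale entries when they are popped (lazy deletion). The heap can then hold up to $O(m)$ entries, which still gives $O(m \log n)$ time. The function name and the `(u, v, cost)` edge format are assumptions.

```python
import heapq
from collections import defaultdict


def prim_mst_cost(edges, start=None):
    """edges: iterable of (u, v, cost) for a connected undirected graph."""
    adjacency = defaultdict(list)
    for u, v, cost in edges:
        adjacency[u].append((cost, v))
        adjacency[v].append((cost, u))
    if start is None:
        start = next(iter(adjacency))

    visited = {start}                  # the set X
    frontier = list(adjacency[start])  # candidate edges leaving X, as (cost, endpoint)
    heapq.heapify(frontier)
    total = 0
    while frontier and len(visited) < len(adjacency):
        cost, v = heapq.heappop(frontier)
        if v in visited:
            continue                   # stale entry: edge no longer crosses the cut
        visited.add(v)
        total += cost
        for edge in adjacency[v]:
            if edge[1] not in visited:
                heapq.heappush(frontier, edge)
    return total


edges = [(1, 2, 1), (2, 3, 2), (1, 3, 4), (3, 4, 3), (2, 4, 5)]
print(prim_mst_cost(edges))  # 6
```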

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data
    
    Args:
    file_path -- location of file to read
    
    Returns:
    (data_array, num_nodes) -- a tuple with a list of (node1, node2, cost) edges and an integer representing number of nodes
    """
    
    data_array = []
    num_nodes = 0
    
    with open(file_path, 'r') as file:
        rows = file.read().split("\n")
        num_nodes = int(rows[0].split(" ")[0])
        for row in rows[1:]: # skip first row, which is just the size of the data
            if not row.strip():
                continue # ignore blank rows (ex. a trailing newline)
            subarray = row.split(" ")
            node1 = int(subarray[0])
            node2 = int(subarray[1])
            cost = int(subarray[2])
            data_array.append((node1, node2, cost))
    return (data_array, num_nodes)


def greedy_search(array, X, T):
    """
    For all node1 in X, find node2 that is not in X, that makes the cheapest edge between node1 and node2
    
    Args:
    array -- a list of tuples representing a graph
    X -- a list to store all vertices spanned by the minimum spanning tree so far
    T -- a list to store the costs of all edges in the minimum spanning tree so far
    
    Returns:
    None
    """
    
    minimum_cost = float("inf") # no crossing edge found yet (edge costs may exceed any fixed sentinel)
    minimum_node2 = None
    for node1 in X:
        for node2 in get_connected_node(node1, array):
            if node2 not in X:
                cost = get_cost(node1, node2, array)
                if cost < minimum_cost:
                    minimum_node2 = node2
                    minimum_cost = cost
    
    X.append(minimum_node2)
    T.append(minimum_cost)
    
    
def get_connected_node(node1, array):
    """
    Find all nodes that are connected by an edge for node1
    
    Args:
    node1 -- input node
    array -- a list of tuples representing a graph
    """
    
    nodes = []
    
    for item in array:
        if item[0] == node1:
            nodes.append(item[1])
        elif item[1] == node1:
            nodes.append(item[0])
            
    return nodes


def get_cost(node1, node2, array):
    """
    Find cost of edge between node1 and node2
    
    Args:
    node1 -- first node of an edge
    node2 -- second node of an edge
    array -- a list of tuples representing a graph
    
    Returns:
    cost -- cost of edge between node1 and node2
    """
    
    cost = 0
    
    for item in array:
        if item[0] == node1 and item[1] == node2:
            cost = item[2]
        if item[0] == node2 and item[1] == node1:
            cost = item[2]
            
    return cost
            
    
tuple_obj = open_file("data/edge.txt")
# tuple_obj = open_file("data/edge-test1.txt") #7
# tuple_obj = open_file("data/edge-test2.txt") #15
# tuple_obj = open_file("data/edge-test3.txt") #14
array = tuple_obj[0]
num_nodes = tuple_obj[1]
s = array[0][0] # pick an arbitrary start node
X = [] # store explored nodes
X.append(s)
T = [] # store costs
T.append(0)

while len(X) < num_nodes:
    greedy_search(array, X, T)
    
print(sum(T))
# -3612829

## Kruskal's MST Algorithm

- sort edges in order of increasing cost (rename edges 1,2,3,... so that $c_{1} < c_{2} < \dots < c_{m}$)
- let T = empty set
- for i=1 to m
    - if T + {i} has no cycles
        - add i to T
- return T

Union-Find
- $Union(C_{i}, C_{j})$: fuse groups $C_{i}, C_{j}$ into a single one
- maintain one linked structure per connected component
- each vertex points to the leader of its component (name of a component is inherited from its leader vertex)
- given edge $(u,v)$, can check if $u$ and $v$ are already in the same component in $O(1)$ time (leader pointers of $u$ and $v$ match <=> Find(u) = Find(v))
- when new edge $(u,v)$ added to T, connected components of $u$ and $v$ merge
- when two components merge, have smaller one inherit the leader of the larger one
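
A minimal sketch of the leader-pointer structure described above, plugged into Kruskal's loop. The class name `LeaderUnionFind`, the `members` bookkeeping, and the `(u, v, cost)` edge format are assumptions made for illustration; this is not the `networkx` class used later in these notes.

```python
class LeaderUnionFind:
    def __init__(self, vertices):
        self.leader = {v: v for v in vertices}     # each vertex points to its leader
        self.members = {v: [v] for v in vertices}  # leader -> vertices in its component

    def find(self, v):
        return self.leader[v]                      # O(1): just follow the pointer

    def union(self, u, v):
        """Merge the components of u and v; return False if already merged."""
        a, b = self.leader[u], self.leader[v]
        if a == b:
            return False
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a                            # make a the leader of the larger one
        for w in self.members[b]:
            self.leader[w] = a                     # smaller component inherits leader a
        self.members[a].extend(self.members.pop(b))
        return True


def kruskal_mst(vertices, edges):
    """edges: list of (u, v, cost). Returns the list of tree edges."""
    components = LeaderUnionFind(vertices)
    tree = []
    for u, v, cost in sorted(edges, key=lambda edge: edge[2]):
        if components.union(u, v):                 # no cycle: endpoints were separate
            tree.append((u, v, cost))
    return tree


edges = [(1, 2, 1), (2, 3, 2), (1, 3, 4), (3, 4, 3), (2, 4, 5)]
tree = kruskal_mst([1, 2, 3, 4], edges)
print(sum(cost for _, _, cost in tree))  # 6
```

Because the smaller component always adopts the larger one's leader, a vertex's pointer changes at most $\log_{2} n$ times (its component at least doubles each time), so all leader updates together take $O(n \log n)$ time.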

Clustering
- given n points, classify into coherent groups
- initially, each point in a separate cluster
- repeat until only k clusters
    - let p,q = closest pair of separated points (points in different clusters)
    - merge the clusters containing p and q into a single cluster

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path -- location of file to read

    Returns:
    (data_array, num_nodes) -- a tuple with a list of (node1, node2, cost) edges and an integer representing number of nodes
    """

    data_array = []
    num_nodes = 0

    with open(file_path, 'r') as line:
        array_of_array = line.read().split("\n")
        num_nodes = int(array_of_array[0].split(" ")[0])
        del array_of_array[0] # delete first element, which is just the length of data
        for array in array_of_array:
            subarray = array.split(" ")
            node1 = int(subarray[0])
            node2 = int(subarray[1])
            cost = int(subarray[2])
            data_array.append((node1, node2, cost))
    return (data_array, num_nodes)


def find_closest_pair_and_merge(sorted_array, T):
    """
    Find two nodes that are in different clusters, and merge them into a single cluster

    Args:
    sorted_array -- a list of tuples that is sorted by its third element (that is, the cost between two nodes)
    T -- a list of lists that contains "clusters"

    Returns:
    None
    """

    node1 = sorted_array[0][0]
    node2 = sorted_array[0][1]
    cost = sorted_array[0][2]

    index_of_cluster_to_expand = find_cluster(node1, T)
    index_of_cluster_to_remove = find_cluster(node2, T)

    print(str(node1) + " and " + str(node2) + ": " + str(index_of_cluster_to_expand) + " => " + str(index_of_cluster_to_remove))

    if index_of_cluster_to_expand != index_of_cluster_to_remove: # if two nodes are already in the same cluster, no need to perform merge on T
        T[index_of_cluster_to_expand].extend(T[index_of_cluster_to_remove]) # add all nodes in the cluster where node2 belongs to node1's cluster
        del T[index_of_cluster_to_remove] # remove node2's cluster
    del sorted_array[0] # remove current tuple in either case


def find_cluster(node, T):
    """
    Find a list inside T where node belongs

    Args:
    node -- an integer representing a node in a graph
    T -- a list of lists that contains "clusters"

    Returns:
    i -- index of cluster of T
    """

    for i in range(0, len(T)):
        if node in T[i]:
            return i
    return -1


def get_max_spacing(T, sorted_array):
    """
    Return the minimum distance of two nodes that are in different clusters

    Args:
    sorted_array -- a list of tuples that is sorted by its third element (that is, the cost between two nodes)
    T -- a list of lists that contains "clusters"

    Returns:
    item[2] -- the minimum cost
    """

    for item in sorted_array:
        cluster_of_node1 = find_cluster(item[0], T)
        cluster_of_node2 = find_cluster(item[1], T)
        if cluster_of_node1 != cluster_of_node2:
            return item[2]


tuple_obj = open_file("data/clustering.txt")
# tuple_obj = open_file("data/clustering-test1.txt")
array = tuple_obj[0]
sorted_array = sorted(array, key=lambda x: (x[2])) # sort by third element
num_nodes = tuple_obj[1]
print("len(array):" + str(len(sorted_array)))


T = []
for node in range(1, num_nodes+1):
    T.append([node])
print("len(T): " + str(len(T)))
print(T)

while len(T) > 4 and len(sorted_array) > 0:
    find_closest_pair_and_merge(sorted_array, T)

print(get_max_spacing(T, sorted_array))

# Max-spacing:100, two clusters: Nodes(1,2) Nodes(3,4,5)
# Max-spacing:105, four clusters
# Max-spacing:106

In [None]:
from networkx.utils.union_find import UnionFind


def open_file(file_path):
    """
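    Read-in a file containing one bit string per row (after a metadata row)

    Args:
    file_path -- location of file to read

    Returns:
    (data_array, data_dict, num_nodes, num_bits) -- a list of node labels (each bit string stored as an integer whose decimal digits are the bits), a dictionary mapping each label to the set of row numbers carrying it, the number of nodes, and the number of bits per label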
    """

    data_dict = {}
    data_array = []
    num_nodes = 0

    with open(file_path, 'r') as line:
        array_of_array = line.read().split("\n")
        num_nodes = int(array_of_array[0].split(" ")[0])
        num_bits = int(array_of_array[0].split(" ")[1])
        del array_of_array[0] # delete first element, which is just metadata
        for i in range(0, len(array_of_array)):
            number = int(array_of_array[i].replace(" ", ""))
            data_array.append(number)
            if number not in data_dict:
                data_dict[number] = set()
            data_dict[number].add(i+1)
                  
    return (data_array, data_dict, num_nodes, num_bits)


def convert_base_10_to_2(array):
    """
    Convert a list of integers (base 10) to a list of integers (base 2)
    
    Args:
    array - list of integers
    
    Returns:
    None
    """
    for i in range(0, len(array)):
        array[i] = int(bin(array[i])[2:])
    
    
tuple_obj = open_file("data/clustering-big.txt")
# tuple_obj = open_file("data/clustering-big-test1.txt")
# tuple_obj = open_file("data/clustering-big-test2.txt")
data_array = tuple_obj[0]
data_dict = tuple_obj[1]
num_nodes = tuple_obj[2]
num_bits = tuple_obj[3]
print("len(data_array): " + str(len(data_array)))
print("len(data_dict): " + str(len(data_dict)))
print("num_nodes: " + str(num_nodes))
print("num_bits: " + str(num_bits))

unionFind = UnionFind()

# Hamming distance of 1: every mask with exactly one bit set
hamming_distance_1 = [1 << i for i in range(num_bits)]

convert_base_10_to_2(hamming_distance_1)
print(len(hamming_distance_1)) #24

# Hamming distance of 2: XOR of every pair of distinct one-bit masks
hamming_distance_2 = []
for i in range(0, len(hamming_distance_1)):
    for j in range(i + 1, len(hamming_distance_1)):
        dist = int(str(hamming_distance_1[i]),2) ^ int(str(hamming_distance_1[j]),2)
        hamming_distance_2.append(dist)
        
convert_base_10_to_2(hamming_distance_2)
print(len(hamming_distance_2)) # 276

distances = hamming_distance_1 + hamming_distance_2
print(len(distances))

for distance in distances:
    for key1 in data_dict:
        key2 = int(str(distance),2) ^ int(str(key1),2)
        key2 = int(bin(key2)[2:]) 
        if key2 in data_dict:
            unionFind.union(key1, key2)

pointer_set = set([unionFind[x] for x in data_dict])
num_clusters = len(pointer_set)
print(num_clusters)

# 3
# 15
# 6118

# Huffman Codes

Binary code: maps each character of an alphabet to a binary string. For example, {A, B, C, D} => {00, 01, 10, 11}
How about using a variable-length "prefix-free" code instead, such as {A, B, C, D} => {0, 10, 110, 111}?

In general
- left child edges get "0"
- right child edges get "1"
- for each $i$, there is exactly one leaf labelled $i$
- encoding: bits along path from root to node $i$
- decoding: repeatedly follow path from root until hitting a leaf
- encoding length of $i$ = depth of $i$ in a tree

Given probability $p_{i}$ for each character $i$, find the tree $T$ that minimizes the average encoding length defined by

$L(T) = \displaystyle\sum_{i}p_{i} \cdot$ (depth of $i$ in $T$)

Idea: build the tree bottom up using successive merges
- if len(set) = 2, return the tree with those two characters as leaves
- let $a$,$b$ have the smallest frequencies
- let new_set = set with $a$ & $b$ replaced by $ab$
- define $p_{ab} = p_{a} + p_{b}$
- recursively compute $T^{'}$ (for new_set)
- extend $T^{'}$ to $T$ by splitting leaf $ab$ into two leaves $a$ & $b$
- return $T$
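
The successive merges can be sketched iteratively with a min-heap instead of recursion. This is a minimal sketch that only tracks codeword lengths (depths), not the codewords themselves; the function name and the dictionary input format are assumptions, and it assumes at least two symbols.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """weights: dict mapping symbol -> weight. Returns dict symbol -> codeword length."""
    tiebreak = count()  # unique counter so the heap never has to compare symbol lists
    depth = {symbol: 0 for symbol in weights}
    heap = [(weight, next(tiebreak), [symbol]) for symbol, weight in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        weight_a, _, group_a = heapq.heappop(heap)  # two smallest weights
        weight_b, _, group_b = heapq.heappop(heap)
        merged = group_a + group_b
        for symbol in merged:
            depth[symbol] += 1                      # every leaf below the merge sinks one level
        heapq.heappush(heap, (weight_a + weight_b, next(tiebreak), merged))
    return depth


print(huffman_code_lengths({"A": 60, "B": 25, "C": 10, "D": 5}))
# {'A': 1, 'B': 2, 'C': 3, 'D': 3}
```

With these weights the lengths match the prefix-free code {0, 10, 110, 111} from the example above.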

In [None]:
import itertools 


def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path -- location of file to read

    Returns:
    (data_dict, num_nodes) -- a tuple with a dictionary mapping each symbol (its row index, as a string) to its weight and an integer representing number of symbols
    """

    data_dict = {}
    num_nodes = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        num_nodes = int(data_array[0].split(" ")[0])
        del data_array[0] # delete first element, which is just the length of data
        for item in data_array:
            data_dict[str(index)] = int(item)
            index += 1
    return (data_dict, num_nodes)


tuple_obj = open_file("data/huffman.txt")
# tuple_obj = open_file("data/huffman-test1.txt")
data_dict = tuple_obj[0]
num_nodes = tuple_obj[1]

sorted_dict_by_value = {k: v for k, v in sorted(data_dict.items(), key=lambda item: item[1])}
tree_merge_track = []

while len(sorted_dict_by_value) > 2:
    first_two_items = dict(itertools.islice(sorted_dict_by_value.items(), 2)) # get two smallest values
    first_node = ""
    second_node = ""
    new_weight = 0
    for key, value in first_two_items.items():
        if first_node == "":
            first_node = key
        else:
            second_node = key
        new_weight += value
        del sorted_dict_by_value[key] # delete two smallest nodes
        
    new_node = first_node + " " + second_node 
    tree_merge_track.append(new_node) 
    sorted_dict_by_value[new_node] = new_weight # create a new node that is a combination of the two smallest nodes
    sorted_dict_by_value = {k: v for k, v in sorted(sorted_dict_by_value.items(), key=lambda item: item[1])}
    
# print(sorted_dict_by_value)

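# Count how many merges each node takes part in. The count starts at 1 to account for
# the final merge of the last two trees (the loop above stops before it), so for every
# node that appears in at least one merge, count_dict[node] is its codeword length (depth)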
count_dict = {}
for item in tree_merge_track:
    for char in item.split(" "):
        if char not in count_dict:
            count_dict[char] = 1
        count_dict[char] += 1
        
print(len(count_dict))
print(count_dict)
sorted_count_dict_by_value = {k: v for k, v in sorted(count_dict.items(), key=lambda item: item[1])}
# Max: 19, Min: 9

# Dynamic Programming

Ex. Path graph G = (V,E) with non-negative weights on vertices. Compute the subset of non-adjacent vertices (an independent set) with the maximum total weight

Let $S \subseteq V$ be a max-weight independent set
- suppose $v_{n}$ not in $S$
- let $G^{'}$ = $G$ with $v_{n}$ deleted
- S is also an independent set of $G^{'}$
- S must be a max-weight independent set of $G^{'}$

This time
- suppose $v_{n}$ in $S$
- then, previous vertex $v_{n-1}$ not in $S$
- let $G^{''}$ = $G$ with $v_{n}$ and $v_{n-1}$ deleted
- S-{$v_{n}$} is also an independent set of $G^{''}$
- S-{$v_{n}$} must be a max-weight independent set of $G^{''}$

Thus, max-weight independent set must be either
- max-weight independent set of $G^{'}$ or
- $v_{n}$ + max-weight independent set of $G^{''}$

Algorithm
- let $G_{i}$ = 1st $i$ vertices of $G$
- populate array $A$ left to right with $A[i]$ = value of max-weight independent set of $G_{i}$
- init: $A[0] = 0$ and $A[1] = w_{1}$
- main loop: for $i = 2,3,4 \dots n$, $A[i] = max[A[i-1], A[i-2]+w_{i}]$

Then trace back through filled-in array to reconstruct optimal solution
- let $A$ = filled-in array
- let $S$ = empty set, $i = n$
- while $i \ge 2$
    - if $A[i-1] \ge A[i-2] + w_{i}$
        - decrease $i$ by 1
    - else
        - add $v_{i}$ to $S$
        - decrease $i$ by 2
- if $i = 1$, add $v_{1}$ to $S$
- return $S$
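
A compact sketch of the fill-in and trace-back steps on a 4-vertex path, assuming the weights are given as a plain list (`weights[i-1]` is $w_{i}$). The function name is made up for illustration.

```python
def max_weight_independent_set(weights):
    """Return (best total weight, set of chosen 1-based vertex indices) for a path graph."""
    n = len(weights)
    A = [0] * (n + 1)
    if n >= 1:
        A[1] = weights[0]
    for i in range(2, n + 1):
        A[i] = max(A[i - 1], A[i - 2] + weights[i - 1])

    chosen = set()
    i = n
    while i >= 2:
        if A[i - 1] >= A[i - 2] + weights[i - 1]:
            i -= 1             # vertex i is not needed
        else:
            chosen.add(i)      # vertex i is used, so vertex i-1 is skipped
            i -= 2
    if i == 1:
        chosen.add(1)          # weights are non-negative, so taking vertex 1 never hurts
    return A[n], chosen


print(max_weight_independent_set([1, 4, 5, 4]))  # (8, {2, 4})
```

For weights 1, 4, 5, 4 the array is $A = [0, 1, 4, 6, 8]$, and the trace-back picks $v_{4}$ and then $v_{2}$.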

### Principle of Dynamic Programming
1. Identify a small number of sub-problems
2. Given solutions to smaller sub-problems, can solve larger sub-problems
3. Solving all sub-problems computes final solution

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path -- location of file to read

    Returns:
    (data_dict, num_nodes) -- a tuple with a dictionary mapping each vertex index (1-based) to its weight and an integer representing number of vertices
    """

    data_dict = {}
    num_nodes = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        num_nodes = int(data_array[0].split(" ")[0])
        del data_array[0] # delete first element, which is just the length of data
        for item in data_array:
            data_dict[index] = int(item)
            index += 1
    return (data_dict, num_nodes)


tuple_obj = open_file("data/max-weight-independent-set.txt")
# tuple_obj = open_file("data/max-weight-independent-set-test1.txt")
# tuple_obj = open_file("data/max-weight-independent-set-test2.txt")
data_dict = tuple_obj[0]
num_nodes = tuple_obj[1]

A = {}
A[0] = 0
A[1] = data_dict[1]
for i in range(2, num_nodes + 1):
    A[i] = max(A[i-1], A[i-2] + data_dict[i])

S = set()
i = num_nodes # trace back without overwriting num_nodes
while i > 1:
    if A[i-1] >= A[i-2] + data_dict[i]:
        i -= 1
    else:
        S.add(i)
        i -= 2
if i == 1: # vertex 1 is still undecided; weights are non-negative, so taking it never hurts
    S.add(1)

ret = ""
for i in [1, 2, 3, 4, 17, 117, 517, 997]:
    if i in S:
        ret += "1"
    else:
        ret += "0"
print(ret)
# 10100110

## Knapsack Problem

Ex. n items
- value $v_{i}$ (non-negative)
- size $w_{i}$ (non-negative and integral)
- capacity $W$ (non-negative integer)

Find subset $S \subseteq \{1 \dots n\}$ that maximizes $\displaystyle\sum_{i \in S}v_{i}$ subject to $\displaystyle\sum_{i \in S}w_{i} \le W$

Let S = a max-value solution
- suppose item n not in $S$. Then $S$ must be optimal for the first $n-1$ items with capacity $W$
- suppose item n in $S$. Then $S-\{n\}$ must be optimal for the first $n-1$ items with capacity $W-w_{n}$

Let $v_{i,x}$ = value of the best solution that
- uses only the first $i$ items
- has total size $\le x$

Then,
- for i = 1 to n and any x
    - $v_{i,x}$ = max{$v_{i-1,x}$ (case when item $i$ is excluded), $v_{i} + v_{i-1,x-w_{i}}$ (case when item $i$ is included)}
- if $w_{i} > x$, item $i$ does not fit, so $v_{i,x} = v_{i-1,x}$

Pseudo code
- let A = 2-D array
- init $A[0,x] = 0$ for $x = 0 \dots W$
- for i = 1 to n
    - for $x = 0 \dots W$
        - if $w_{i} > x$, $A[i,x] = A[i-1, x]$
        - else $A[i,x] = \max\{A[i-1, x], A[i-1, x-w_{i}] + v_{i}\}$
- return $A[n,W]$
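
A small worked instance of the pseudo code, with values and sizes passed as plain lists. The trace-back that recovers which items were chosen is an addition not covered in the notes above; it follows the same idea as the independent-set reconstruction (item $i$ is in the solution exactly when dropping it changes the table entry). The function name is an assumption.

```python
def knapsack(values, sizes, capacity):
    """Return (best value, sorted 1-based indices of the chosen items)."""
    n = len(values)
    A = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, size = values[i - 1], sizes[i - 1]
        for x in range(capacity + 1):
            A[i][x] = A[i - 1][x]                        # item i excluded
            if size <= x:                                # item i fits
                A[i][x] = max(A[i][x], A[i - 1][x - size] + value)

    chosen = []
    x = capacity
    for i in range(n, 0, -1):
        if A[i][x] != A[i - 1][x]:                       # entry changed, so item i was used
            chosen.append(i)
            x -= sizes[i - 1]
    return A[n][capacity], sorted(chosen)


print(knapsack([3, 2, 4, 4], [4, 3, 2, 3], 6))  # (8, [3, 4])
```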

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path -- location of file to read

    Returns:
    (data_dict, knapsack_size, num_items) -- a tuple with a dictionary mapping each item index (1-based) to (value, weight), the knapsack capacity, and the number of items
    """

    data_dict = {}
    knapsack_size = 0
    num_items = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        knapsack_size = int(data_array[0].split(" ")[0])
        num_items = int(data_array[0].split(" ")[1])
        del data_array[0] # delete first element, which is just metadata
        for item in data_array:
            value = int(item.split(" ")[0])
            weight = int(item.split(" ")[1])
            data_dict[index] = (value, weight)
            index += 1
    return (data_dict, knapsack_size, num_items)


# tuple_obj = open_file("data/knapsack-test1.txt")
# tuple_obj = open_file("data/knapsack-test2.txt")
# tuple_obj = open_file("data/knapsack-test3.txt")
tuple_obj = open_file("data/knapsack-test4.txt")
# tuple_obj = open_file("data/knapsack.txt")
data_dict = tuple_obj[0]
knapsack_size = tuple_obj[1]
num_items = tuple_obj[2]
print(data_dict)
print(knapsack_size)
print(num_items)

A = []
for i in range(0, num_items + 1):
    A.append([])
    for j in range(0, knapsack_size + 1):
        A[i].append(0)
    
    
for i in range(1, num_items + 1):
    for j in range(0, knapsack_size + 1):
#         print(str(A[i-1][j]) +" vs "+ str(A[i-1][j-data_dict[i][1]]) + " + " + str(data_dict[i][0]))
        if data_dict[i][1] > j:
            A[i][j] = A[i-1][j]
        else:
            A[i][j] = max(A[i-1][j], A[i-1][j-data_dict[i][1]] + data_dict[i][0])
        
print(A[num_items][knapsack_size])

# 14
# 150
# 147
# 8
# 2493893

In [None]:
def open_file(file_path):
    """
    Read-in a file containing rows of data

    Args:
    file_path -- location of file to read

    Returns:
    (data_dict, knapsack_size, num_items) -- a tuple with a dictionary mapping each item index (1-based) to (value, weight), the knapsack capacity, and the number of items
    """

    data_dict = {}
    knapsack_size = 0
    num_items = 0
    index = 1

    with open(file_path, 'r') as line:
        data_array = line.read().split("\n")
        knapsack_size = int(data_array[0].split(" ")[0])
        num_items = int(data_array[0].split(" ")[1])
        del data_array[0] # delete first element, which is just metadata
        for item in data_array:
            value = int(item.split(" ")[0])
            weight = int(item.split(" ")[1])
            data_dict[index] = (value, weight)
            index += 1
    return (data_dict, knapsack_size, num_items)


# tuple_obj = open_file("data/knapsack-test1.txt")
# tuple_obj = open_file("data/knapsack-test2.txt")
# tuple_obj = open_file("data/knapsack-test3.txt")
# tuple_obj = open_file("data/knapsack-test4.txt")
tuple_obj = open_file("data/knapsack-big.txt")
data_dict = tuple_obj[0]
knapsack_size = tuple_obj[1]
num_items = tuple_obj[2]
# print(data_dict)
# print(knapsack_size)
# print(num_items)

A = []
for i in range(0, 2):
    A.append([]) 
    for j in range(0, knapsack_size + 1):
        A[i].append(0)
    
i = 1
while i <= num_items:
    value, weight = data_dict[i]
    A[1][0:weight] = A[0][0:weight] # item i does not fit in capacities below its weight
    for j in range(weight, knapsack_size + 1):
        A[1][j] = max(A[0][j], A[0][j-weight] + value)
    A[0] = A[1][:] # copy array by value, not reference
    print(str(i) + " -> " + str(A[1][knapsack_size]))
    i += 1
     
# 14
# 150
# 147
# 8
# 4243395

## Sequence Alignment

- strings $X = x_{1} \dots x_{m}$, $Y = y_{1} \dots y_{n}$
- penalty $\alpha_{gap} \ge 0$ for inserting a gap, $\alpha_{ab}$ for matching $a$ and $b$
- insert gaps to equalize the lengths of the strings

Final position of the alignment can be one of
- case1: $x_{m}$ and $y_{n}$ matched
- case2: $x_{m}$ is matched with a gap
- case3: $y_{n}$ is matched with a gap

Let $X^{'} = X - x_{m}$ and $Y^{'} = Y - y_{n}$
- case1: alignment of $X^{'}$ and $Y^{'}$ is optimal
- case2: alignment of $X^{'}$ and $Y$ is optimal
- case3: alignment of $X$ and $Y^{'}$ is optimal

Subproblem $(X_{i}, Y_{j})$
- $X_{i}$ = 1st $i$ letters of $X$
- $Y_{j}$ = 1st $j$ letters of $Y$

Let $P_{ij}$ = penalty of optimal alignment of $X_{i}$ and $Y_{j}$
- For all i = 1 to m and j = 1 to n, $P_{ij}$ is the **minimum** of the following three cases
- case1: $\alpha_{x_{i}y_{j}}$ + $P_{i-1,j-1}$
- case2: $\alpha_{gap}$ + $P_{i-1,j}$
- case3: $\alpha_{gap}$ + $P_{i,j-1}$

Pseudo code
- let A = 2-D array
- $A[i,0] = i * \alpha_{gap}$ for all $i \ge 0$ and $A[0,j] = j * \alpha_{gap}$ for all $j \ge 0$
- for i = 1 to m
    - for j = 1 to n
        - $A[i,j]$ = $min\{A[i-1,j-1]+\alpha_{x_{i}y_{j}}, A[i-1,j]+\alpha_{gap}, A[i,j-1]+\alpha_{gap}\}$
        
Trace back through filled-in table $A$ starting at $A[m,n]$
- when reaching subproblem $A[i,j]$
    - if $A[i,j]$ filled using case1, match $x_{i}$ and $y_{j}$, and go to $A[i-1, j-1]$
    - if $A[i,j]$ filled using case2, match $x_{i}$ and a gap, and go to $A[i-1, j]$
    - if $A[i,j]$ filled using case3, match $y_{j}$ and a gap, and go to $A[i, j-1]$
- if $i=0$ or $j=0$, match remaining substring with gaps
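
A minimal sketch of the table fill and trace-back. It assumes a simplified penalty scheme: matching identical letters costs 0 and every mismatch costs the same `mismatch_penalty` (the notes allow a separate $\alpha_{ab}$ per letter pair). The function and parameter names are hypothetical.

```python
def align(x, y, gap_penalty, mismatch_penalty):
    """Return (minimum penalty, aligned x, aligned y), with '-' marking gaps."""
    m, n = len(x), len(y)

    def alpha(a, b):
        return 0 if a == b else mismatch_penalty

    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i * gap_penalty
    for j in range(n + 1):
        A[0][j] = j * gap_penalty
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = min(A[i - 1][j - 1] + alpha(x[i - 1], y[j - 1]),  # case1
                          A[i - 1][j] + gap_penalty,                    # case2
                          A[i][j - 1] + gap_penalty)                    # case3

    top, bottom = [], []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i][j] == A[i - 1][j - 1] + alpha(x[i - 1], y[j - 1]):
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap_penalty:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    while i > 0:  # match the remaining prefix of x with gaps
        top.append(x[i - 1]); bottom.append("-"); i -= 1
    while j > 0:  # match the remaining prefix of y with gaps
        top.append("-"); bottom.append(y[j - 1]); j -= 1
    return A[m][n], "".join(reversed(top)), "".join(reversed(bottom))


penalty, aligned_x, aligned_y = align("AGGGCT", "AGGCA", gap_penalty=1, mismatch_penalty=2)
print(penalty)  # 3
print(aligned_x)
print(aligned_y)
```

When several cases tie, the trace-back here prefers case1, then case2, then case3, so other equally good alignments may exist.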

## Optimal Binary Search Tree

- what is the best search tree for a given set of keys?
- let $p_{1} \dots p_{n}$ be the frequencies of items $1 \dots n$
- goal: find a valid search tree that minimizes the weighted (average) search time

$C(T) = \displaystyle\sum_{i}p_{i} \cdot$ [search time for $i$ in $T$]

- if an optimal BST has root $r$, its left and right subtrees $T_{1}$ and $T_{2}$ are optimal BSTs for the keys $\{1 \dots r-1\}$ and $\{r+1 \dots n\}$
- for $1 \le i \le j \le n$, let $C_{ij}$ = weighted search cost of optimal BST for items $\{i, i+1 \dots j-1, j\}$ with frequencies $\{p_{i}, p_{i+1} \dots p_{j}\}$
- for every $1 \le i \le j \le n$

$C_{ij} = \underset{i \le r \le j}{\min}\left[\displaystyle\sum_{k=i}^{j}p_{k}+C_{i,r-1}+C_{r+1,j}\right]$ where $C_{xy} = 0$ if $x>y$

- let A = 2-D array
- for s = 0 to n-1 (s represents j-i)
    - for i = 1 to n-s
        - $A[i, i+s]$ = $\underset{i \le r \le i+s}{\min}\left[\displaystyle\sum_{k=i}^{i+s}p_{k}+A[i,r-1]+A[r+1,i+s]\right]$ where $A[x,y] = 0$ if $x \gt y$
- return $A[1,n]$
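
A minimal sketch of this table fill, assuming the frequencies are given as a list with `p[k-1]` holding $p_{k}$. Prefix sums are an added convenience for computing $\sum_{k=i}^{j}p_{k}$ in constant time; the function name is made up for illustration.

```python
def optimal_bst_cost(p):
    """p[k-1] is the frequency of key k (keys in sorted order). Returns the minimum weighted search cost."""
    n = len(p)
    prefix = [0] * (n + 1)
    for k in range(1, n + 1):
        prefix[k] = prefix[k - 1] + p[k - 1]

    # A[i][j] for 1 <= i <= n+1 and 0 <= j <= n; entries with i > j stay 0 (empty subtree)
    A = [[0] * (n + 1) for _ in range(n + 2)]
    for s in range(n):                      # s = j - i
        for i in range(1, n - s + 1):
            j = i + s
            frequency_sum = prefix[j] - prefix[i - 1]
            A[i][j] = frequency_sum + min(
                A[i][r - 1] + A[r + 1][j] for r in range(i, j + 1)  # try each root r
            )
    return A[1][n]


print(optimal_bst_cost([8, 1, 1]))  # 13
```

With frequencies 8, 1, 1 the best tree puts key 1 at the root (cost 8·1 + 1·2 + 1·3 = 13), beating the balanced tree rooted at key 2 (cost 1·1 + 8·2 + 1·2 = 19). The three nested loops give $O(n^{3})$ time.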