In [None]:
# setup
from IPython.display import display, HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML(open('rise.css').read()))

# imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(style="whitegrid", font_scale=1.5, rc={'figure.figsize':(12, 6)})


# CMPS 6610
# Algorithms

## Edit Distance, Longest Increasing Subsequence


Today's agenda:

- Edit Distance and Longest Increasing Subsequence

### Edit Distance

Given two strings $S, T \in \Sigma^*$, how similar are they?

We can measure this using *edit distance*, which is the number of single-character insertions and deletions needed to turn $S$ into $T$. Note that we can also go from $T$ to $S$ if we just reverse the edits (by turning insertions into deletions and vice versa).

Example: $S$ = `abcdefghijkl`, $T$ = `abcdghikjl`. How many edits are needed?

Consider the following edit sequence:

$S$: `abcdefghijkl---`<br>
$T$: `abcd--ghi---kjl`

This has 5 deletions and 3 insertions, for a total of 8 edits. What about this one:

$S$: `abcdefghijk-l`<br>
$T$: `abcd--ghi-kjl`

We have 3 deletions and 1 insertion for a total of 4 edits.
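
To double-check counts like these, here is a minimal sketch that tallies the edits in a gapped alignment. The helper name `count_edits` and the convention that `-` marks a gap are assumptions made for this illustration, not part of the lecture code.

```python
def count_edits(s_aligned, t_aligned, gap="-"):
    """Count (deletions, insertions) in a gapped alignment of S against T.

    A gap in the T row means a character of S was deleted;
    a gap in the S row means a character of T was inserted.
    """
    assert len(s_aligned) == len(t_aligned), "rows must have equal length"
    deletions = insertions = 0
    for s_char, t_char in zip(s_aligned, t_aligned):
        if t_char == gap and s_char != gap:
            deletions += 1
        elif s_char == gap and t_char != gap:
            insertions += 1
        else:
            # aligned columns must match, since substitutions are not allowed
            assert s_char == t_char, "mismatched column"
    return deletions, insertions

print(count_edits("abcdefghijkl---", "abcd--ghi---kjl"))  # (5, 3)
print(count_edits("abcdefghijk-l", "abcd--ghi-kjl"))      # (3, 1)
```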

Our goal is to compute the **minimum edit distance** between two strings $S$ and $T$ of lengths $m$ and $n$, respectively.

It might seem like a toy problem, but it is critical for comparing gene and protein sequences. By attaching weights to insertions and deletions, we can assess the evolutionary distance between two sequences.



Notice that once again, if we greedily apply edits to the beginning or end of the string we might miss a set of edits interspersed throughout the string. 


**Can we identify an optimal substructure property for this problem?**

<br>

Let's use case-based reasoning about the optimal solution as we did for Knapsack. Let $\mathit{MED}(S, T)$ be the optimal number of edits between $S$ and $T$. 

<br><br><br>

In an optimal sequence of edits, how would we deal with the first characters of $S$ and $T$, that is, $S[0]$ and $T[0]$?

<br><br>

For the base cases, if $S$ is empty and $T$ is not, what is the edit cost?  
S=` ` T=`abcde`

<br><br><br>


If either string is empty, then the edit cost is simply the length of the other string.

<br><br>

What if $S[0] = T[0]$?  
S=`abc` T=`ade`

<br><br>

then there is no benefit to editing and $\mathit{MED}(S, T) = \mathit{MED}(S[1:], T[1:])$. 

<br><br>
What if $S[0] \neq T[0]$?  
S=`abc` T=`bde`

<br><br><br>
then we must incur 1 edit. The less costly edit is either:  

$\rightarrow 1+\mathit{MED}(S[1:], T)~~~~$    e.g., $1+\mathit{MED}($ `bc` , `bde` $)$  
or   
$\rightarrow 1+\mathit{MED}(S, T[1:])~~~~$  e.g., $1+\mathit{MED}($ `abc` , `de` $)$  


<br><br>

**Optimal Substructure for Edit Distance**: Let $S$ and $T$ be strings of length $m$ and $n$. Then,

$$\mathit{MED}(S, T) = 
\begin{cases}
n, & \mbox{if}~~~m = 0 \\
m, & \mbox{if}~~~n = 0 \\
\mathit{MED}(S[1:], T[1:]), & \mbox{if}~~~S[0]=T[0] \\
1+\min\{\mathit{MED}(S[1:], T),\mathit{MED}(S, T[1:])\}, & \mbox{otherwise} \\
\end{cases}
$$
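
As a small worked example, take $S$ = `ab` and $T$ = `b`, and write $\epsilon$ for the empty string (notation assumed here for the illustration). Using the base case that an empty string costs the length of the other string:

$$
\begin{aligned}
\mathit{MED}(\texttt{ab}, \texttt{b}) &= 1 + \min\{\mathit{MED}(\texttt{b}, \texttt{b}),\ \mathit{MED}(\texttt{ab}, \epsilon)\} \\
&= 1 + \min\{\mathit{MED}(\epsilon, \epsilon),\ 2\} \\
&= 1 + \min\{0,\ 2\} = 1.
\end{aligned}
$$

The single edit deletes `a` from $S$.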

Just as with Knapsack, the recursion tree for this recurrence has an exponential number of nodes. How many nodes are there, and what is the depth? 


The recursion tree has $O(2^{m+n})$ nodes and depth $O(m+n)$. Are there shared subproblems?

For $S$=`ABC` and $T$=`DBC` we have the following DAG:

<img src="edit_distance_DAG.jpg" width="60%">

How much sharing is possible? In other words, how many distinct subproblems are there?

Each recursive call removes the first character of $S$, of $T$, or of both, so every subproblem is a pair of suffixes $(S[i:], T[j:])$. There are only $(m+1)(n+1) = O(mn)$ such pairs, and each can be computed in $O(1)$ time (if we have precomputed the necessary dependencies). The longest path in the recursion DAG is $O(m+n)$.
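
To see how much sharing there is, the sketch below instruments the recursion and compares the total number of recursive calls with the number of distinct subproblems. The function name `count_subproblems` and the choice of two strings with no characters in common (which forces the branching case at every step) are assumptions made for this illustration.

```python
def count_subproblems(S, T):
    """Return (total recursive calls, distinct (S, T) suffix pairs visited)."""
    calls = 0
    seen = set()

    def med(s, t):
        nonlocal calls
        calls += 1
        seen.add((s, t))
        if s == "" or t == "":
            # base case: one string is empty, so the cost is the other's length
            return len(s) + len(t)
        if s[0] == t[0]:
            return med(s[1:], t[1:])
        return 1 + min(med(s[1:], t), med(s, t[1:]))

    med(S, T)
    return calls, len(seen)

m = n = 8
calls, distinct = count_subproblems("a" * m, "b" * n)
# total calls, distinct subproblems, and the (m+1)(n+1) upper bound
print(calls, distinct, (m + 1) * (n + 1))
```

The number of calls grows exponentially with the string lengths, while the number of distinct subproblems never exceeds $(m+1)(n+1)$.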


In [None]:
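def MED(S, T):
    """Minimum number of insertions and deletions needed to turn S into T."""
    # base cases: if either string is empty, the cost is the length of the other
    if S == "":
        return len(T)
    elif T == "":
        return len(S)
    elif S[0] == T[0]:
        # first characters match: no edit is needed for them
        return MED(S[1:], T[1:])
    else:
        # first characters differ: insert T[0] or delete S[0], whichever is cheaper
        return 1 + min(MED(S, T[1:]), MED(S[1:], T))

# the example from above
print(MED("abcdefghijkl", "abcdghikjl"))
# a second pair of strings
print(MED("kitten", "sitting"))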
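
The recursion above recomputes shared subproblems. One way to reuse them is top-down memoization; the following is a minimal sketch, not the only possible implementation. The name `MED_memo` and the use of index pairs $(i, j)$ to stand for the suffixes $S[i:]$ and $T[j:]$ are choices made for this illustration.

```python
from functools import lru_cache

def MED_memo(S, T):
    """Insert/delete edit distance using top-down memoization."""
    m, n = len(S), len(T)

    @lru_cache(maxsize=None)
    def med(i, j):
        # med(i, j) is the edit distance between the suffixes S[i:] and T[j:]
        if i == m:
            return n - j
        if j == n:
            return m - i
        if S[i] == T[j]:
            return med(i + 1, j + 1)
        return 1 + min(med(i + 1, j), med(i, j + 1))

    return med(0, 0)

print(MED_memo("abcdefghijkl", "abcdghikjl"))  # 4
print(MED_memo("kitten", "sitting"))           # 5
```

Each pair $(i, j)$ is evaluated at most once, so the work is $O(mn)$. Passing indices instead of slices also avoids copying the strings at every call.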

## Longest Increasing Subsequence

We previously looked at approaches to identify trends in a sequence: longest run, maximum contiguous subsequence, longest gap. 

Let's look at another trend we might want to identify in a sequence. Given a sequence $S = \langle s_0, s_1, \ldots, s_{n-1} \rangle$, what is the longest increasing subsequence? Note that subsequences don't need to be contiguous.

Example: $S=\langle 5, 2, 8, 6, 3, 6, 9, 7\rangle$. Every subsequence of length 1 is trivially increasing. Also $\langle 2, 6, 9 \rangle$, $\langle 2, 8, 9 \rangle$ are increasing, as is $\langle 5, 6, 7\rangle$. What is the longest?
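
For a sequence this short, we can check the answer by brute force. The sketch below tries every subsequence, longest first, so it takes exponential time and is only meant as a sanity check. The function name `lis_brute_force` is an assumption made for this illustration.

```python
from itertools import combinations

def lis_brute_force(S):
    """Return one longest strictly increasing subsequence by trying every subsequence."""
    for k in range(len(S), 0, -1):            # try the longest lengths first
        for candidate in combinations(S, k):  # combinations keep the original order
            if all(a < b for a, b in zip(candidate, candidate[1:])):
                return list(candidate)
    return []

print(lis_brute_force([5, 2, 8, 6, 3, 6, 9, 7]))  # [2, 3, 6, 9]
```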



Let's reduce this problem to something slightly simpler with the observation that the longest increasing subsequence must start somewhere in $S$.

Let $\mathit{LIS}(S, i)$ be the length of the longest increasing subsequence of $S$ that starts with $S[i]$ as the first element. 

How can we use the function $\mathit{LIS}(S, i)$ to solve the original problem?



If we can compute $\mathit{LIS}(S, i)$, then we can compute $\mathit{LIS}(S) = \max_{0\leq i < n} \mathit{LIS}(S, i).$


- If $S[i]$ is the first element, then the next element $S[j]$ must have $j>i$ and $S[j] > S[i]$. 
- Whichever element is next, we must have $\mathit{LIS}(S, i) = 1 + \max_{j > i,\, S[j] > S[i]} \mathit{LIS}(S, j).$




**Optimal Substructure for Longest Increasing Subsequence**: Given a sequence $S$, the length of the longest increasing subsequence of $S$ is $\mathit{LIS}(S) = \max_{0\leq i < n} \mathit{LIS}(S, i)$ where
$$\mathit{LIS}(S, i) = 1 + \max_{j > i,\, S[j] > S[i]} \mathit{LIS}(S, j),$$
and the max is taken to be $0$ when no such $j$ exists.

To compute this optimal substructure property, how many distinct subproblems must be computed from scratch? 


This optimal substructure property is a little different from what we've seen so far. There are only a linear number of subproblems, one for each starting point of an optimal solution. But the work to compute each subproblem, even if all the subproblems it depends on have already been computed, is linear in the size of the sequence we consider (instead of $O(1)$). 



In [None]:
# length of the longest increasing subsequence that starts with S[0]
def LIS_helper(S):
    if S == []:
        return 0
    # positions of later elements that are larger than S[0]
    rest = [j for j in range(1, len(S)) if S[j] > S[0]]
    if rest == []:
        return 1
    # S[0] followed by the best increasing subsequence starting at some S[j]
    return 1 + max([LIS_helper(S[j:]) for j in rest])

def LIS(S):
    # try every starting position; default=0 handles the empty list
    return max([LIS_helper(S[i:]) for i in range(len(S))], default=0)

L = [5,2,8,6,3,6,9,7]
print(LIS(L))


So, for a list $S$ of length $n$, we incur $O(n^2)$ work if we reuse the results from already visited subproblems. Since we decrease the length of the list by at least one element in each recursive call, the longest path in the DAG is $n$. At each node, we require $O(\lg n)$ span to compute the max (e.g., using `reduce`) and at most $O(\lg n)$ span to compute `rest` (using `filter`), so the span is $O(n \lg n).$
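
To make the reuse concrete, here is a minimal bottom-up sketch that fills in the values $\mathit{LIS}(S, i)$ from right to left, so every value a position depends on is already available. The function name `lis_table` is an assumption made for this illustration, and the code is sequential, so it illustrates the $O(n^2)$ work bound but not the span.

```python
def lis_table(S):
    """Return the list [LIS(S, 0), ..., LIS(S, n-1)] of lengths, filled right to left."""
    n = len(S)
    table = [1] * n  # every element alone is an increasing subsequence of length 1
    for i in range(n - 2, -1, -1):
        # lengths of the increasing subsequences that could follow S[i]
        candidates = [table[j] for j in range(i + 1, n) if S[j] > S[i]]
        if candidates:
            table[i] = 1 + max(candidates)
    return table

table = lis_table([5, 2, 8, 6, 3, 6, 9, 7])
print(table)       # [3, 4, 2, 2, 3, 2, 1, 1]
print(max(table))  # 4
```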
