<h1 align="center">The Real Problem of DNA Comparison</h1>

- **Biological applications** often **need** to **compare** the **DNA** of **two different organisms**.


- A strand of **DNA consists** of a string of molecules called **bases**, where the possible bases are **Adenine, Thymine, Cytosine**, and **Guanine**.


  <center><img src="images/L7_DNA.png" width="500" alt="Example" /></center>


- **Representing** each of these **bases** by its **initial letter**, we **can express** a strand of **DNA** as a **string** over the finite set $\{A, T, C, G\}$.


- For example, the DNA of two organisms may be:

  $$S_1 = ACCGGTCGAGTGCGCGGAAGCCGGCCG,$$
  
  $$S_2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA.$$
  
  
- **One reason** to compare two strands of DNA is to **determine how similar** the two strands are.


- One can **define similarity** in **many different ways**. Can you give an example?


- We **measure the similarity** of strands $S_1$ and $S_2$ by **finding a third** strand $S_3$ in which the **bases** in $S_3$ **appear in each** of $S_1$ and $S_2$.  

  These **bases must appear in the same order**, but **not necessarily consecutively**.


- The **longer** the strand $S_3$ we can find, the **more similar** $S_1$ and $S_2$ are. In our example, the longest such strand is $S_3 = GTCGTCGGAAGCCGGCCG$.


- We **formalize** this last notion of similarity as the **longest-common-subsequence problem**.
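The similarity measure above relies on checking that one strand is a subsequence of another. Here is a minimal sketch of that check, assuming the strands are stored as plain Python strings. The helper name `is_subsequence` is hypothetical and not part of the lecture code.

```python
def is_subsequence(z, x):
    """Return True if z can be obtained from x by deleting zero or more symbols."""
    it = iter(x)
    # `ch in it` consumes the iterator up to the first match, so the symbols
    # of z must be found in x in the same order, not necessarily consecutively.
    return all(ch in it for ch in z)


S1 = "ACCGGTCGAGTGCGCGGAAGCCGGCCG"
S2 = "GTCGTTCGGAATGCCGTTGCTCTGTAAA"
S3 = "GTCGTCGGAAGCCGGCCG"

print(is_subsequence(S3, S1), is_subsequence(S3, S2))  # True True
```

The check only confirms that $S_3$ is common to both strands. It does not show that $S_3$ is the longest such strand.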


<h1 align="center">Longest Common Subsequence</h1>


- Given a sequence $X = \left \langle x_1, x_2, ..., x_m \right \rangle$, another sequence $Z = \left \langle z_1, z_2, ..., z_k \right \rangle$ is a **subsequence** of $X$ if there exists a strictly increasing sequence $\left \langle i_1, i_2, ..., i_k \right \rangle$ of indices of $X$ such that for all $j = 1, 2, ... , k$, we have $x_{i_j} = z_j$.
  
  
- For example, $Z = \left \langle B, C, D, B \right \rangle$ is a **subsequence** of $X = \left \langle A, B, C, B, D, A, B \right \rangle$ with corresponding index sequence $\left \langle 2, 3, 5, 7\right \rangle$.
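The definition can be made concrete by computing an index sequence explicitly. The sketch below greedily picks the leftmost possible match for each element of $Z$ and reports 1-based indices to agree with the notation above. The function name `index_sequence` is an assumption made for illustration.

```python
def index_sequence(z, x):
    """Return 1-based indices i_1 < ... < i_k with x[i_j] == z[j], or None."""
    indices = []
    start = 0  # 0-based position in x where the next search begins
    for symbol in z:
        try:
            pos = x.index(symbol, start)  # leftmost match at or after start
        except ValueError:
            return None  # symbol not found, so z is not a subsequence of x
        indices.append(pos + 1)  # convert to the 1-based convention
        start = pos + 1  # keep the index sequence strictly increasing
    return indices


X = ['A', 'B', 'C', 'B', 'D', 'A', 'B']
Z = ['B', 'C', 'D', 'B']
print(index_sequence(Z, X))  # [2, 3, 5, 7]
```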


- Given **two sequences** $X$ and $Y$, we say that a sequence $Z$ is a **common subsequence** of $X$ and $Y$ if $Z$ is a **subsequence** of both $X$ and  $Y$.
 
 
- For example, if $X = \left \langle A, B, C, B, D, A, B \right \rangle$ and $Y = \left \langle B, D, C, A, B, A \right \rangle$, then the sequence $Z = \left \langle B, C, A\right \rangle$ is a **common subsequence** of $X$ and $Y$.

  However, $Z$ is **not a Longest Common Subsequence** (**LCS**) of $X$ and $Y$: it has length $3$, while the sequence $\left \langle B, C, B, A \right \rangle$, which is also common to both $X$ and $Y$, has length $4$.
   
   
- **Longest-Common-Subsequence problem**:

  Given **two sequences** $X = \left \langle x_1, x_2, ..., x_m \right \rangle$ and $Y = \left \langle y_1, y_2, ..., y_n \right \rangle$, find a **maximum-length common subsequence** of $X$ and $Y$.
  

- In a **brute-force approach** to solving the **LCS problem**, we would **enumerate all subsequences** of $X$ and **check each subsequence** to see whether **it is also a subsequence** of $Y$, keeping track of the longest subsequence we find. 

  Each subsequence of $X$ corresponds to a subset of the indices $\{1, 2, ..., m\}$ of $X$. 

  Because $X$ has $2^m$ **subsequences**, this approach requires **exponential time**, making it impractical for long sequences.
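The brute-force approach can be sketched as follows. This is an illustrative implementation only, and the names `is_subsequence` and `brute_force_lcs` are assumptions rather than part of the lecture code. It enumerates index subsets of $X$ from longest to shortest, so the first candidate that is also a subsequence of $Y$ is an LCS.

```python
from itertools import combinations


def is_subsequence(z, x):
    """Return True if z is a subsequence of x."""
    it = iter(x)
    return all(ch in it for ch in z)


def brute_force_lcs(x, y):
    """Return an LCS of x and y by enumerating subsequences of x (exponential time)."""
    # Try the longest candidates first; the empty subsequence (k = 0) always matches,
    # so the loop is guaranteed to return.
    for k in range(len(x), -1, -1):
        for idx in combinations(range(len(x)), k):
            candidate = [x[i] for i in idx]
            if is_subsequence(candidate, y):
                return candidate


X = ['A', 'B', 'C', 'B', 'D', 'A', 'B']
Y = ['B', 'D', 'C', 'A', 'B', 'A']
print(len(brute_force_lcs(X, Y)))  # 4
```

In the worst case the loops visit all $2^m$ index subsets, which is why this only works for very short inputs.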

<h1 align="center">Dynamic Programming for LCS</h1>

- Let's show how to **efficiently solve** the **LCS** problem **using dynamic programming**.

  For this we need to follow a sequence of **four steps**:
  1. **Characterize** the structure of an optimal solution.
  2. **Define** the value of an optimal solution recursively.
  3. **Compute** the value of an optimal solution (typically in a bottom-up fashion).
  4. **Construct** an optimal solution from computed information.

<h1 align="center">Step 1: Characterizing the Optimal Solution of LCS</h1>

- Let's introduce some notation for simplicity.

  To be precise, given a sequence $X = \left \langle x_1, x_2, ..., x_m \right \rangle$, we define the $i$-th **prefix** of $X$, for $i = 0, 1, ..., m$, as $X_i = \left \langle x_1, x_2, ..., x_i \right \rangle$.
  

- For example, if $X = \left \langle A, B, C, B, D, A, B \right \rangle$, then $X_4 = \left \langle A, B, C, B \right \rangle$ and $X_0 = \left \langle \right \rangle$ is the empty sequence.


 - **Optimal substructure Theorem of an LCS**:
 
   Let $X = \left \langle x_1, x_2, ..., x_m \right \rangle$ and $Y = \left \langle y_1, y_2, ..., y_n \right \rangle$ be sequences, and let $Z = \left \langle z_1, z_2, ..., z_k \right \rangle$ be any LCS of $X$ and $Y$.
   
   1. If $x_m = y_n$, then $z_k = x_m = y_n$ and $Z_{k-1}$ is an LCS of $X_{m-1}$ and $Y_{n-1}$.
   2. If $x_m \neq y_n$, then $z_k \neq x_m$ implies that $Z$ is an LCS of $X_{m-1}$ and $Y$.
   3. If $x_m \neq y_n$, then $z_k \neq y_n$ implies that $Z$ is an LCS of $X_{m}$ and $Y_{n-1}$.
   

- **Let's prove it!**

- The way the **theorem** characterizes longest common subsequences **tells us** that an **LCS** of **two sequences** **contains** within it an **LCS** of **prefixes** of the **two sequences**.


- Thus, the **LCS problem has an optimal-substructure property**.

<h1 align="center">Step 2: Define the Recursive Solution of LCS</h1>

- We can readily see the **overlapping-subproblems property** in the LCS problem.


- If $x_m  = y_n$, we must find an LCS of $X_{m-1}$ and $Y_{n-1}$. Appending $x_m = y_n$ to this LCS yields an LCS of $X$ and $Y$.


- In case $x_m \neq y_n$, to find an LCS of $X$ and $Y$, we may **need to find** the LCSs of $X$ and $Y_{n-1}$ and of $X_{m-1}$ and $Y$. 

  **Whichever** of these **two LCSs** is **longer** is an **LCS** of $X$ and $Y$.

  But each of these subproblems has the subsubproblem of finding an LCS of $X_{m-1}$ and $Y_{n-1}$. Many other subproblems share subsubproblems.
  
  
- Let us define $c[i, j]$ to be the **length of an LCS** of the sequences $X_i$ and $Y_j$.

  The optimal substructure of the LCS problem gives the recursive formula:
  
  $$c[i, j] = 
  \left\{\begin{matrix}
  0, & \text{ if } i=0 \text{ or } j=0, \\
  c[i-1,j-1] + 1, & \text{ if } i,j>0 \text{ and } x_i = y_j, \\
  \max \left ( c[i, j-1], c[i-1, j] \right ), & \text{ if } i,j >0 \text{ and } x_i \neq y_j. 
  \end{matrix}\right.  
  $$
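The recurrence translates almost line by line into a memoized top-down function. The sketch below is one possible rendering, not the procedure used in the next step. The name `lcs_length_memo` is hypothetical, and `functools.lru_cache` stores each $c[i, j]$ so that shared subsubproblems are solved only once.

```python
from functools import lru_cache


def lcs_length_memo(X, Y):
    """Return the length of an LCS of X and Y using the recurrence for c[i, j]."""

    @lru_cache(maxsize=None)
    def c(i, j):
        # Base case: an empty prefix has an empty LCS.
        if i == 0 or j == 0:
            return 0
        # x_i and y_j are X[i-1] and Y[j-1] with 0-based Python indexing.
        if X[i - 1] == Y[j - 1]:
            return c(i - 1, j - 1) + 1
        return max(c(i, j - 1), c(i - 1, j))

    return c(len(X), len(Y))


print(lcs_length_memo("ABCBDAB", "BDCABA"))  # 4
```

The recursion depth can reach $m + n$, so for long sequences a bottom-up table is the safer choice in Python.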
  

- The **main difference between** the **LCS dynamic-programming algorithm** and the dynamic-programming algorithms we examined earlier for **rod cutting** and **matrix-chain multiplication** is that here we **rule out subproblems based on conditions in the problem**: which subproblems we consider depends on whether $x_i = y_j$.

<h1 align="center">Step 3: Computing the Length of an LCS</h1>

- Using the recurrence equation, we **can use dynamic programming** to compute the solutions **bottom up**:

  Procedure `lengthLCS(X,Y)` takes two sequences $X = \left \langle x_1, x_2, ..., x_m \right \rangle$ and $Y = \left \langle y_1, y_2, ..., y_n \right \rangle$ as inputs. 

  It stores the $c[i,j]$ values in a table $c[0..m, 0..n]$, and it computes the entries in **row-major** order.
  
  The procedure also maintains the table $b[1..m, 1..n]$ to help us **construct an optimal solution**.

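```python
import numpy as np


def lengthLCS(X, Y):
    """Fill the LCS length table c and the direction table b for X and Y."""
    m = len(X)
    n = len(Y)
    # b[i, j] records which subproblem produced c[i, j]; row 0 and column 0 stay empty.
    b = np.zeros((m+1, n+1), dtype=str)
    # c[i, j] is the LCS length of the prefixes X_i and Y_j; row 0 and column 0 stay 0.
    c = np.zeros((m+1, n+1))
    # Python indices are 0-based, so X[i] is x_{i+1} and Y[j] is y_{j+1}.
    for i in range(m):
        for j in range(n):
            if X[i] == Y[j]:
                c[i+1, j+1] = c[i, j] + 1
                b[i+1, j+1] = 'D'  # diagonal (up-left) arrow
            elif c[i, j+1] >= c[i+1, j]:
                c[i+1, j+1] = c[i, j+1]
                b[i+1, j+1] = 'U'  # up arrow
            else:
                c[i+1, j+1] = c[i+1, j]
                b[i+1, j+1] = 'L'  # left arrow
    return b, c
```
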

```python
X = ['A', 'B', 'C', 'B', 'D', 'A', 'B']
Y = ['B', 'D', 'C', 'A', 'B', 'A']

b, c = lengthLCS(X, Y)
print("b = \n", b, "\n \n c = \n", c)
```

```text
b = 
 [['' '' '' '' '' '' '']
 ['' 'U' 'U' 'U' 'D' 'L' 'D']
 ['' 'D' 'L' 'L' 'U' 'D' 'L']
 ['' 'U' 'U' 'D' 'L' 'U' 'U']
 ['' 'D' 'U' 'U' 'U' 'D' 'L']
 ['' 'U' 'D' 'U' 'U' 'U' 'U']
 ['' 'U' 'U' 'U' 'D' 'U' 'D']
 ['' 'D' 'U' 'U' 'U' 'D' 'U']] 
 
 c = 
 [[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 1. 1.]
 [0. 1. 1. 1. 1. 2. 2.]
 [0. 1. 1. 2. 2. 2. 2.]
 [0. 1. 1. 2. 2. 3. 3.]
 [0. 1. 2. 2. 2. 3. 3.]
 [0. 1. 2. 2. 3. 3. 4.]
 [0. 1. 2. 2. 3. 4. 4.]]
```


- The figure below shows the tables produced by the `lengthLCS(X,Y)` procedure on the sequences $X = \left \langle A, B, C, B, D, A, B\right \rangle$ and $Y = \left \langle B, D, C, A, B, A \right \rangle$. 

  The **running time** of the procedure is $\Theta(mn)$, since each **table entry takes** $\Theta(1)$ **time to compute**.
  
  
<center><img src="images/L7_Table.png" width="500" alt="Example" /></center>


<h1 align="center">Step 4: Constructing an LCS</h1>

- The $b$ **table** returned by the `lengthLCS(X,Y)` procedure enables us to quickly **construct** an LCS of $X = \left \langle x_1, x_2, ..., x_m \right \rangle$ and $Y = \left \langle y_1, y_2, ..., y_n \right \rangle$.


- We simply begin at $b[m,n]$ and **trace through the table** by **following** the **arrows**.

  Whenever we encounter a "$\nwarrow$" (stored as `'D'` in the code above) in entry $b[i, j]$, it implies that $x_i = y_j$ is an element of the LCS that the `lengthLCS(X,Y)` procedure found.
  

  With this method, we encounter the elements of this LCS in reverse order.
  

- The following recursive procedure **prints out** an LCS of $X$ and $Y$ in the proper, forward order.


- The procedure takes time $O(m+n)$, since it decrements at least one of $i$ and $j$ in each recursive call.

  
- The **initial call** is `printLCS(b, X, len(X), len(Y) )`.

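```python
def printLCS(b, X, i, j):
    """Print an LCS of X_i and Y_j in forward order by following the arrows in b."""
    # An empty prefix has an empty LCS, so there is nothing left to print.
    if i == 0 or j == 0:
        return
    if b[i,j] == 'D':
        # x_i = y_j belongs to the LCS: print the earlier elements first, then this one.
        printLCS(b, X, i-1, j-1)
        print(X[i-1])
    elif b[i,j] == "U":
        printLCS(b, X, i-1, j)
    else:
        printLCS(b, X, i, j-1)

printLCS(b, X, len(X), len(Y))
```

```text
B
C
B
A
```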


<h1 align="center">Improving the Code</h1>

- Once you have developed an algorithm, you will often find that **you can improve on** the **time** or **space** it uses.

  **Some** changes **can** simplify the code and **improve constant factors** but otherwise **yield no asymptotic improvement** in performance. 
  
  
- In the **LCS algorithm**, for example, we **can eliminate** the $b$ **table** altogether.

  Each $c[i,j]$ entry **depends on only three other $c$ table entries**: $c[i-1, j-1]$, $c[i-1, j]$, and $c[i,j-1]$. 
  
  Given the value of $c[i,j]$, we can determine in $O(1)$ time which of these three values was used to compute $c[i,j]$, without inspecting table $b$. 
  
  Thus, we can reconstruct an LCS in $O(m+n)$ time using a procedure similar to `printLCS`.
  
  Although we save $\Theta(mn)$ space by this method, the **auxiliary space requirement** for computing an LCS **does not asymptotically decrease**, since **we need** $\Theta(mn)$ **space** for the $c$ **table** anyway.
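A reconstruction that uses only the $c$ table might look like the sketch below. It is an illustration under stated assumptions rather than the lecture's own code: the names `lcs_table` and `lcs_from_c` are hypothetical, the table is a plain list of lists instead of a NumPy array, and ties are broken in favor of moving up, as in `lengthLCS`.

```python
def lcs_table(X, Y):
    """Build the (m+1) x (n+1) table of LCS lengths for all prefix pairs."""
    m, n = len(X), len(Y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if X[i - 1] == Y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_from_c(c, X, Y):
    """Walk back from c[m][n] to recover an LCS without a direction table."""
    i, j = len(X), len(Y)
    out = []
    while i > 0 and j > 0:
        if X[i - 1] == Y[j - 1]:
            out.append(X[i - 1])  # diagonal step: this element is in the LCS
            i -= 1
            j -= 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # the value came from the entry above
        else:
            j -= 1  # the value came from the entry to the left
    out.reverse()  # elements were collected in reverse order
    return out


X = ['A', 'B', 'C', 'B', 'D', 'A', 'B']
Y = ['B', 'D', 'C', 'A', 'B', 'A']
print("".join(lcs_from_c(lcs_table(X, Y), X, Y)))  # BCBA
```

Each loop iteration decrements $i$, $j$, or both, so the walk takes $O(m+n)$ steps.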


- We **can**, however, **reduce the asymptotic space requirements for** `lengthLCS`, since **it needs only two rows of table** $c$ **at a time**.

  This **improvement works** if we **need only** the **length** of an **LCS**.
  
  If we need to reconstruct the elements of an LCS, the smaller table does not keep enough information to retrace our steps in $O(m+n)$ time.
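The two-row idea can be sketched as follows, assuming only the length is required. The function name `lcs_length_two_rows` is hypothetical. Only the previous row and the row being filled are kept, so the extra space is $\Theta(n)$ instead of $\Theta(mn)$.

```python
def lcs_length_two_rows(X, Y):
    """Return the LCS length of X and Y while storing only two rows of the c table."""
    prev = [0] * (len(Y) + 1)  # row i-1 of c; initially row 0, which is all zeros
    for x in X:
        curr = [0]  # column 0 of every row is 0
        for j, y in enumerate(Y, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # diagonal entry plus one
            else:
                curr.append(max(prev[j], curr[j - 1]))  # entry above vs. entry to the left
        prev = curr  # the finished row becomes the previous row
    return prev[-1]


print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```

Passing the shorter sequence as `Y` keeps the rows as short as possible, giving $\Theta(\min(m, n))$ extra space.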

<h1 align="center">End of Lecture</h1>