### String comparison

- This problem was inspired by gene matching in biology

- **Problem:** Imagine you have 2 strings, and you want to measure the similarity between them. Starting from the beginning of both strings, you can take the following actions
    - You can pop both starting letters from the strings
    - You can insert a space into 1 of the strings
    - You can delete a letter from 1 of the strings
    
    - If you pop both letters and they match, you get +1 score. For every mismatch, you get $-\mu$ score, and every insertion/deletion gets $-\sigma$ score
        - $\text{Score} = \text{matches} - \mu \cdot \text{mismatches} - \sigma \cdot \text{indels}$

    - What is the maximum score for 2 given strings?

- **Input:** Two strings $s_1, s_2$, mismatch penalty $\mu$, indel (insert/delete) penalty $\sigma$

- **Output:** An alignment of the strings maximising the score

### Approach

- We can recast this problem into a simpler one: if $\mu = \sigma = 0$, maximising the alignment score is the same as maximising the length of a common subsequence of the 2 strings
    - To see this, note that with no penalty for mismatches or for insert/delete operations, only matches change the score, so the best alignment is the one that lines up as many common letters as possible
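
- The scoring rule from the problem statement can be sketched as a small table-filling routine. This is a minimal sketch, not a definitive implementation: the function name `max_alignment_score` and its argument names are assumptions made for illustration. With $\mu = \sigma = 0$ the value it returns is the length of a longest common subsequence.

```python
def max_alignment_score(s1, s2, mu, sigma):
    """Best score over all alignments: +1 per match, -mu per mismatch, -sigma per indel."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # score[i][j]: best score for aligning s1[:i] with s2[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = -sigma * i          # s1[:i] against nothing: i indels
    for j in range(1, cols):
        score[0][j] = -sigma * j          # nothing against s2[:j]: j indels
    for i in range(1, rows):
        for j in range(1, cols):
            # Pop both letters: +1 on a match, -mu on a mismatch
            diagonal = score[i-1][j-1] + (1 if s1[i-1] == s2[j-1] else -mu)
            # Or spend an indel on either string
            score[i][j] = max(diagonal, score[i-1][j] - sigma, score[i][j-1] - sigma)
    return score[-1][-1]

# With no penalties, only matches count
print(max_alignment_score('EDITING', 'DISTANCE', mu=0, sigma=0))  # 4
```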

- A closely related problem, and the one worked through below, is edit distance: given 2 strings, what is the minimum number of insertions, deletions and substitutions needed to transform one string into the other?
    - Example: Given `EDITING` and `DISTANCE` as strings, the optimal alignment is
    - `E - D - I - _ - T - I - N - G - _`
    - `_ - D - I - S - T - A - N - C - E`
    - There are
        - 4 matches: `D`, `I`, `T`, `N`
        - 2 mismatches: `I <-> A`, `G <-> C`
        - 2 insertions + 1 deletion (or vice versa, depending on which string you transform into the other)
    - So the edit distance is 2 mismatches + 3 insertions/deletions = 5
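
- The counts for this alignment can be checked mechanically. The sketch below is illustrative only: `score_alignment` is a hypothetical helper, `_` is assumed to be the gap character, and $\mu = \sigma = 1$ are arbitrary example penalties rather than values from the problem.

```python
def score_alignment(top, bottom, mu, sigma, gap='_'):
    """Count matches, mismatches and indels column by column, then apply the scoring rule."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    matches = mismatches = indels = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            indels += 1          # a gap on either side is an insertion or deletion
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    score = matches - mu * mismatches - sigma * indels
    return matches, mismatches, indels, score

# The alignment from the example, written without the separators
print(score_alignment('EDI_TING_', '_DISTANCE', mu=1, sigma=1))  # (4, 2, 3, -1)
```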

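- Let's represent this as a table. Rows are prefixes of `EDITING`, columns are prefixes of `DISTANCE`, and each cell holds the minimum number of edits needed to turn the row prefix into the column prefix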

|       | _ | D | I | S | T | A | N | C | E | 
|   -   | - | - | - | - | - | - | - | - | - |
| **_** | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| **E** | 1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 7 |
| **D** | 2 | **(ii) 1** | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| **I** | 3 | 2 | 1 | **(i) 2** | 3 | 4 | 5 | 6 | 7 |
| **T** | 4 | 3 | **(iii) 2** | **(iv) 2** | 2 | 3 | 4 | 5 | 6 |
| **I** | 5 | 4 | 3 | 3 | 3 | 3 | 4 | 5 | 6 |
| **N** | 6 | 5 | 4 | 4 | 4 | 4 | 3 | 4 | 5 |
| **G** | 7 | 6 | 5 | 5 | 5 | 5 | 4 | 4 | 5 |

- The first row and first column are trivial
    - Going from an empty string to a prefix of either word simply means adding the required number of characters, and going the other way means deleting them. So going from `E` in `EDITING` to `_` (or back) requires 1 operation

- What about the other rows and columns?
    - Remember, a dynamic programming problem is simply a **recursion problem** with cached intermediate solutions
    - So if we want a DP solution, we need to set up a recursive relationship between each element and its earlier elements!
    - What is the recursive solution here? 
        1. Note that to get from 1 string to another, we must first get to the target with its last character removed
            - For example, to get from `ABC --> ABCDE`, we must first get from `ABC --> ABCD`, then from `ABCD --> ABCDE`
            - So the optimal path from string A to string B is the optimal path from string A to string B' plus 1 step, where B' has 1 fewer character than B
- It is easiest to work with examples:
    - Every cell has 3 possible routes into it, one per action (insertion, deletion, substitution or match), and takes the cheapest. The four labelled cells in the table each illustrate one of them
        - **(i) Insertion:** `EDI -> DI -> DIS`
            - Cell **(i)** asks for the minimum number of steps for `EDI -> DIS`
            - Coming from the cell to its left, we count the number of steps to get from `EDI` to `DI`, then add 1 to insert the `S`
            - The `EDI -> DI` route takes 1 step, and getting from `DI -> DIS` takes 1 insertion, giving 2 steps in total
            - Hence, this is considered an insertion
        - **(ii) Match:** `ED -> D`
            - Since the last letters of both strings match, we don't need to take any action once we know the subcase diagonally above to the left, which is `E -> _`
            - Since `E -> _` takes 1 step, `ED -> D` also takes 1 step
        - **(iii) Deletion:** `EDIT -> EDI -> DI`
            - Coming from the cell above, going from `EDIT` to `EDI` removes a letter from our first word, which costs 1 step
            - The `EDI -> DI` route takes 1 step, giving 2 steps in total
        - **(iv) Mismatch:** `EDIT -> DIT -> DIS`
            - This is considered a substitution case
            - We take `EDIT` and apply the same transformation as the cell diagonally above to the left, `EDI -> DI` (chop off the `E`), which takes 1 step
            - Then we substitute the `T` with `S` for 1 more step, giving 2 steps in total
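
- The four cases can be collected into a single recurrence. The notation is an assumption introduced here for illustration: $D(i, j)$ is the minimum number of edits needed to turn the first $i$ letters of $s_1$ into the first $j$ letters of $s_2$, and $[\,\cdot\,]$ is 1 when the condition inside holds and 0 otherwise.

$$
D(i, j) =
\begin{cases}
j & \text{if } i = 0 \\
i & \text{if } j = 0 \\
\min\big(\, D(i, j-1) + 1,\;\; D(i-1, j) + 1,\;\; D(i-1, j-1) + [\, s_1[i] \neq s_2[j] \,] \,\big) & \text{otherwise}
\end{cases}
$$

- As a check against cell **(i)** (`EDI -> DIS`): $D(3, 3) = \min(D(3, 2) + 1,\; D(2, 3) + 1,\; D(2, 2) + 1) = \min(2, 4, 3) = 2$, which is the value in the table.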

In [38]:
import numpy as np
str1 = 'EDITING'
str2 = 'DISTANCE'

def edit_distance(str1, str2, return_path=False):
    """Minimum number of insertions, deletions and substitutions to turn str1 into str2."""
    # memoize[i, j] holds the edit distance from str1[:i] to str2[:j]
    memoize = np.zeros((len(str1)+1, len(str2)+1), dtype=int)
    # Pad with a leading blank so that index i lines up with row i of the table
    str1_split = [' '] + list(str1)
    str2_split = [' '] + list(str2)

    for i in range(len(str1_split)):
        for j in range(len(str2_split)):
            # Base cases: the first row and first column of the table
            if i == 0:
                memoize[i, j] = j
                continue
            if j == 0:
                memoize[i, j] = i
                continue

            insertion = memoize[i, j-1] + 1   # cell to the left
            deletion = memoize[i-1, j] + 1    # cell above
            # Diagonal cell: free if the letters match, 1 substitution otherwise
            if str1_split[i] == str2_split[j]:
                diagonal = memoize[i-1, j-1]
            else:
                diagonal = memoize[i-1, j-1] + 1

            memoize[i, j] = min(insertion, deletion, diagonal)

    # return_path=True also returns the full table, which is needed to trace a path
    if return_path:
        return memoize, memoize[len(str1), len(str2)]
    else:
        return memoize[len(str1), len(str2)]

memoize, distance = edit_distance(str1, str2, return_path=True)

In [48]:
def trace_path(memoize, row, col):
    """Step back one cell along an optimal path and return the previous (row, col)."""
    current = memoize[row][col]

    # Only follow a neighbour whose value plus the cost of the move gives the current cell.
    # Picking the smallest neighbour is not enough: it can leave the optimal path.
    if row > 0 and memoize[row-1][col] + 1 == current:
        prev_row, prev_col = row-1, col       # deletion (came from the cell above)
    elif col > 0 and memoize[row][col-1] + 1 == current:
        prev_row, prev_col = row, col-1       # insertion (came from the cell to the left)
    else:
        prev_row, prev_col = row-1, col-1     # match or substitution (came diagonally)

    print(f"({prev_row}, {prev_col}) --> ({row}, {col})")
    return prev_row, prev_col

row_count, col_count = memoize.shape
curr_row, curr_col = row_count-1, col_count-1

# Keep going until the top-left cell, so moves along the first row or column are included
while (curr_row, curr_col) != (0, 0):
    curr_row, curr_col = trace_path(memoize, curr_row, curr_col)

(7, 7) --> (7, 8)
(6, 6) --> (7, 7)
(5, 5) --> (6, 6)
(4, 4) --> (5, 5)
(3, 3) --> (4, 4)
(3, 2) --> (3, 3)
(2, 1) --> (3, 2)
(1, 0) --> (2, 1)
(0, 0) --> (1, 0)
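
- A path through the table can be read back as an alignment: a move down consumes a letter of the first word only, a move right consumes a letter of the second word only, and a diagonal move consumes one of each. The sketch below is self-contained (it rebuilds the table with plain lists) and is only one way to do this; `align` is a hypothetical helper, `_` is assumed not to occur in either string, and ties between equally good moves are broken in the order deletion, insertion, diagonal.

```python
def align(s1, s2, gap='_'):
    """Return one optimal alignment of s1 and s2 as two gapped strings."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j]: edit distance from s1[:i] to s2[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 or j == 0:
                dist[i][j] = i + j                      # first row / first column
                continue
            diagonal = dist[i-1][j-1] + (s1[i-1] != s2[j-1])
            dist[i][j] = min(dist[i][j-1] + 1, dist[i-1][j] + 1, diagonal)

    # Walk back from the bottom-right cell, collecting one aligned column per move
    top, bottom = [], []
    i, j = len(s1), len(s2)
    while (i, j) != (0, 0):
        if i > 0 and dist[i-1][j] + 1 == dist[i][j]:    # deletion
            top.append(s1[i-1])
            bottom.append(gap)
            i -= 1
        elif j > 0 and dist[i][j-1] + 1 == dist[i][j]:  # insertion
            top.append(gap)
            bottom.append(s2[j-1])
            j -= 1
        else:                                           # match or substitution
            top.append(s1[i-1])
            bottom.append(s2[j-1])
            i -= 1
            j -= 1
    return ''.join(reversed(top)), ''.join(reversed(bottom))

top, bottom = align('EDITING', 'DISTANCE')
print(top)     # EDI_TING_
print(bottom)  # _DISTANCE
```

- For the running example this reproduces the alignment shown at the start of the section. Other tie-breaking orders can give different alignments with the same distance.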


### Approach 2

- So far we have filled the table starting from the empty string and iteratively extended the result to longer prefixes. This is known as a **bottom up** solution, since it works its way upwards from a null string as the base case 

- What would a **top down** approach look like in a DP solution? We make the same calls recursively, starting from the full strings and working down to shorter prefixes, caching each result so it is only computed once

- This is often less preferred, because every recursive call that is still in progress needs its own stack frame. For long strings the recursion gets deep, which can lead to memory problems (in Python, hitting the recursion limit)
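
- A minimal sketch of what the top down version could look like, assuming `functools.lru_cache` is used as the cache. The names `edit_distance_top_down` and `distance` are illustrative, and this is not necessarily how the unfinished cell below was meant to be completed.

```python
from functools import lru_cache

def edit_distance_top_down(s1, s2):
    """Edit distance via recursion on prefixes, with cached intermediate results."""
    @lru_cache(maxsize=None)
    def distance(i, j):
        # distance(i, j): edits needed to turn s1[:i] into s2[:j]
        if i == 0:
            return j                      # insert the remaining j letters
        if j == 0:
            return i                      # delete the remaining i letters
        substitution_cost = 0 if s1[i-1] == s2[j-1] else 1
        return min(
            distance(i, j-1) + 1,                     # insertion
            distance(i-1, j) + 1,                     # deletion
            distance(i-1, j-1) + substitution_cost,   # match or substitution
        )
    # Start from the full strings and recurse down to the base cases
    return distance(len(s1), len(s2))

print(edit_distance_top_down('EDITING', 'DISTANCE'))  # 5
```

- The recursion depth grows with `len(s1) + len(s2)`, so for strings of a few hundred characters or more this version may raise `RecursionError` unless the recursion limit is raised, whereas the bottom up table does not have that problem.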

In [1]:
# import numpy as np 
# str1 = 'EDITING'
# str2 = 'DISTANCE'

# def edit_distance(str1, str2):
