# Day 9 notebook

The objectives of this notebook are to practice running (by hand) the dynamic programming algorithms for:

* global alignment with linear gap penalty
* local alignment with linear gap penalty
* global alignment with affine gap penalty

## Sequences to align

In this activity, you will align the same pair of sequences multiple times, but with different alignment algorithms.  The two sequences to align are: `CAATATG` and `CATA`.

You may find the included [worksheet](day09_activity_worksheet.pdf) useful for running the dynamic programming algorithms.

### PROBLEM 1: Global alignment with linear gap penalty (3 POINTS)

Align the sequences by hand using the Needleman–Wunsch algorithm (global alignment with linear gap penalty).  Use the following scoring scheme:
* Match: +1
* Mismatch: -1
* Space: -2

To submit your solution, make the following variable assignments in the solution cell below:

* assign to the variable `global_linear_opt_score` the optimal alignment *score* 
* assign to the variable `global_linear_opt_alignments` a *list* of *all* alignments that achieve that optimal score
* assign to the variable `global_linear_last_row` a *list* representing the entries in the last row of the dynamic programming matrix.

Each alignment should be represented by a list of two strings, with the first string representing the first sequence, `CAATATG`.  For example, here is a list of (non-optimal) alignments:

In [105]:
# example of a list of alignments
[["CAATATG",
  "CATA---"],
 ["CAATATG",
  "--C-ATA"],
 ["CA-ATATG",
  "CATA----"]]

[['CAATATG', 'CATA---'], ['CAATATG', '--C-ATA'], ['CA-ATATG', 'CATA----']]
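
To make the representation concrete, here is a minimal sketch that checks whether a pair of strings is a well-formed alignment of the two sequences and scores it under the linear scheme above. The helper names `check_alignment` and `linear_score` are hypothetical (they are not part of the assignment), and the sketch assumes that a column pairing two spaces is not allowed.

```python
def check_alignment(alignment, x, y):
    """Return True if `alignment` is a well-formed alignment of x and y."""
    row_x, row_y = alignment
    return (
        len(row_x) == len(row_y)
        # removing the spaces must give back the original sequences
        and row_x.replace("-", "") == x
        and row_y.replace("-", "") == y
        # assumption: no column may pair a space with a space
        and all(a != "-" or b != "-" for a, b in zip(row_x, row_y))
    )


def linear_score(alignment, match=1, mismatch=-1, space=-2):
    """Score an alignment column by column with a linear gap penalty."""
    row_x, row_y = alignment
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


examples = [["CAATATG", "CATA---"],
            ["CAATATG", "--C-ATA"],
            ["CA-ATATG", "CATA----"]]
for aln in examples:
    print(aln, check_alignment(aln, "CAATATG", "CATA"), linear_score(aln))
```

Running a check like this on hand-derived alignments is a quick way to catch a dropped character or a miscounted space before submitting.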

In [208]:
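NEG_INF = float("-inf")  # stands in for "impossible" matrix entries
match_score = 1
mismatch_score = -1
g = -3  # gap penalty: charged once per gap (affine scheme only)
s = -2  # space penalty: charged for every space
x = "CAATATG"  # first sequence (rows of the DP matrix)
y = "CATA"     # second sequence (columns of the DP matrix)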

def memoize(func):
    outputs = dict()
    def memoized(arg1, arg2):
        if (arg1, arg2) in outputs:
            return outputs[(arg1, arg2)]
        outputs[(arg1, arg2)] = func(arg1, arg2)
        return outputs[(arg1, arg2)]
    return memoized

def S(i, j):
    if i > len(x) or j > len(y):
        return NEG_INF
    return match_score if x[i - 1] == y[j - 1] else mismatch_score

def print_matrix(F):
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            print(F(i, j), end="\t")
        print()
        
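def compute_score(stra, strb):
    # Score an alignment (two equal-length gapped strings) under the affine
    # scheme: s for every space, plus g once at the start of each run of spaces.
    assert len(stra) == len(strb), "alignment rows must have equal length"
    in_gap = False
    score = 0
    for a, b in zip(stra, strb):
        if a == '-' or b == '-':
            score += s
            if not in_gap:
                score += g
            in_gap = True
        else:
            # an aligned pair (match or mismatch) closes any open gap
            in_gap = False
            score += match_score if a == b else mismatch_score
    return score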

In [209]:
@memoize
def F(i, j):
    if i == 0 or j == 0:
        return s * (i + j)
    return max(
        F(i - 1, j - 1) + S(i, j),
        F(i - 1, j) + s,
        F(i, j - 1) + s,
    ) 

print_matrix(F)

0	-2	-4	-6	-8	
-2	1	-1	-3	-5	
-4	-1	2	0	-2	
-6	-3	0	1	1	
-8	-5	-2	1	0	
-10	-7	-4	-1	2	
-12	-9	-6	-3	0	
-14	-11	-8	-5	-2	
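
Finding *all* optimal alignments means following every tied predecessor during the traceback, not just one. The following is a self-contained sketch of that idea for the linear gap penalty. The function name `global_linear_alignments` is hypothetical, and the table is filled iteratively rather than with the memoized recursion used in the cell above.

```python
def global_linear_alignments(x, y, match=1, mismatch=-1, space=-2):
    """Return (optimal score, sorted list of all optimal global alignments)."""
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = space * i
    for j in range(1, m + 1):
        F[0][j] = space * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] + space,
                          F[i][j - 1] + space)

    def trace(i, j):
        """Yield every optimal alignment of x[:i] and y[:j] as (row_x, row_y)."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:      # diagonal move
                for ax, ay in trace(i - 1, j - 1):
                    yield ax + x[i - 1], ay + y[j - 1]
        if i > 0 and F[i][j] == F[i - 1][j] + space:  # space in y
            for ax, ay in trace(i - 1, j):
                yield ax + x[i - 1], ay + "-"
        if j > 0 and F[i][j] == F[i][j - 1] + space:  # space in x
            for ax, ay in trace(i, j - 1):
                yield ax + "-", ay + y[j - 1]

    return F[n][m], sorted([ax, ay] for ax, ay in trace(n, m))


score, alignments = global_linear_alignments("CAATATG", "CATA")
print(score)
for row_x, row_y in alignments:
    print(row_x, row_y, sep="\n", end="\n\n")
```

Each `if` in `trace` is checked independently, so a cell whose value is reached by two different moves contributes alignments along both paths.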


In [210]:
global_linear_opt_score = -2

global_linear_opt_alignments = [
    [
        "CAATATG",
        "CA-TA--"
    ],
    [
        "CAATATG",
        "C-ATA--"
    ]
]

global_linear_last_row = [-14, -11, -8, -5, -2]

In [211]:
# tests for global_linear_opt_score
assert isinstance(global_linear_opt_score, int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [212]:
# test for global_linear_opt_alignments
assert isinstance(global_linear_opt_alignments, list)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [213]:
# test for global_linear_last_row_entry_0
assert isinstance(global_linear_last_row[0], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [214]:
# test for global_linear_last_row_entry_1
assert isinstance(global_linear_last_row[1], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [215]:
# test for global_linear_last_row_entry_2
assert isinstance(global_linear_last_row[2], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [216]:
# test for global_linear_last_row_entry_3
assert isinstance(global_linear_last_row[3], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### PROBLEM 2: Local alignment with linear gap penalty (3 POINTS)

Align the sequences by hand using the Smith–Waterman algorithm (local alignment with linear gap penalty).  Use the following scoring scheme:
* Match: +1
* Mismatch: -1
* Space: -2

To submit your solution, make the following variable assignments in the solution cell below:

* assign to the variable `local_linear_opt_score` the optimal alignment *score* 
* assign to the variable `local_linear_opt_alignments` a *list* of *all* alignments that achieve that optimal score
* assign to the variable `local_linear_last_row` a *list* representing the entries in the last row of the dynamic programming matrix.

In [217]:
@memoize
def F(i, j):
    if i == 0 or j == 0:
        return 0
    return max(
        F(i - 1, j - 1) + S(i, j),
        F(i - 1, j) + s,
        F(i, j - 1) + s,
        0
    ) 

print_matrix(F)

0	0	0	0	0	
0	1	0	0	0	
0	0	2	0	1	
0	0	1	1	1	
0	0	0	2	0	
0	0	1	0	3	
0	0	0	2	1	
0	0	0	0	1	
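
For local alignment, the traceback starts at every cell holding the maximum value and stops as soon as it reaches a cell with value 0. Here is a self-contained sketch of that procedure under the scoring scheme of this problem. The function name `local_linear_alignments` is hypothetical, and the table is filled iteratively instead of with the memoized recursion above.

```python
def local_linear_alignments(x, y, match=1, mismatch=-1, space=-2):
    """Return (best local score, sorted list of all optimal local alignments)."""
    n, m = len(x), len(y)
    # F[i][j] = best score of a local alignment ending at x[i-1], y[j-1]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,
                          F[i - 1][j - 1] + sub,
                          F[i - 1][j] + space,
                          F[i][j - 1] + space)
    best = max(max(row) for row in F)

    def trace(i, j):
        """Yield every alignment ending at (i, j), walking back to a 0 cell."""
        if F[i][j] == 0:
            yield "", ""
            return
        # F[i][j] > 0 implies i >= 1 and j >= 1, so the indices below are safe
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + sub:
            for ax, ay in trace(i - 1, j - 1):
                yield ax + x[i - 1], ay + y[j - 1]
        if F[i][j] == F[i - 1][j] + space:
            for ax, ay in trace(i - 1, j):
                yield ax + x[i - 1], ay + "-"
        if F[i][j] == F[i][j - 1] + space:
            for ax, ay in trace(i, j - 1):
                yield ax + "-", ay + y[j - 1]

    found = set()
    if best > 0:
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if F[i][j] == best:
                    found.update(trace(i, j))
    return best, sorted(list(pair) for pair in found)


print(local_linear_alignments("CAATATG", "CATA"))
```

Collecting the results in a set removes duplicates when the same pair of substrings is reachable from more than one maximal cell.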


In [218]:
local_linear_opt_score = 3

local_linear_opt_alignments = [[
    "ATA",
    "ATA"
]]

local_linear_last_row = [0, 0, 0, 0, 1]

In [219]:
# tests for local_linear_opt_score
assert isinstance(local_linear_opt_score, int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [220]:
# test for local_linear_opt_alignments
assert isinstance(local_linear_opt_alignments, list)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [221]:
# test for local_linear_last_row_entry_0
assert isinstance(local_linear_last_row[0], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [222]:
# test for local_linear_last_row_entry_1
assert isinstance(local_linear_last_row[1], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [223]:
# test for local_linear_last_row_entry_2
assert isinstance(local_linear_last_row[2], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [224]:
# test for local_linear_last_row_entry_3
assert isinstance(local_linear_last_row[3], int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### PROBLEM 3: Global alignment with affine gap penalty (3 POINTS)

Align the sequences by hand using the dynamic programming algorithm for global alignment with an affine gap penalty.  Use the following scoring scheme:
* Match: +1
* Mismatch: -1
* Gap: -3 (charged once per run of consecutive spaces, in addition to the space penalty)
* Space: -2
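
As a worked illustration of this scheme, assume the gap penalty $g = -3$ is charged once per run of consecutive spaces and the space penalty $s = -2$ is charged for every space in the run. A gap of length $k$ then costs

$$
g + k \cdot s = -3 - 2k
$$

For the (non-optimal) alignment of `CAATATG` with `CATA---`, which has two matches, two mismatches, and a single gap of length 3, the score would be

$$
\underbrace{(+1) + (+1)}_{\text{matches}} + \underbrace{(-1) + (-1)}_{\text{mismatches}} + \underbrace{(-3) + 3(-2)}_{\text{one gap of length 3}} = -9
$$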

To submit your solution, make the following variable assignments in the solution cell below:

* assign to the variable `global_affine_opt_score` the optimal alignment *score* 
* assign to the variable `global_affine_opt_alignments` a *list* of *all* alignments that achieve that optimal score
* assign to the variable `global_affine_last_row` a *list* representing the entries in the last row of the dynamic programming matrix.

For the last row, we will imagine that the three matrices $M$, $I_x$, and $I_y$ have been collapsed into a single matrix, in which each cell holds the corresponding entries of the three matrices as a tuple.  That is, if $C$ is the collapsed matrix, then $C[i, j] = (M[i,j], I_x[i,j], I_y[i,j])$.
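
The collapsed representation can be sketched as follows. This is only an illustration: the function names `affine_matrices` and `collapse` are hypothetical, the tables are filled iteratively, and the boundary conventions are assumptions ($M[0,0] = 0$, a leading gap of length $k$ costs $g + k \cdot s$, and every other boundary entry is $-\infty$).

```python
NEG_INF = float("-inf")


def affine_matrices(x, y, match=1, mismatch=-1, gap=-3, space=-2):
    """Fill M, Ix, Iy for global alignment with an affine gap penalty."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = gap + space * i   # x[:i] aligned against i spaces
    for j in range(1, m + 1):
        Iy[0][j] = gap + space * j   # y[:j] aligned against j spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            # M: x[i-1] is aligned with y[j-1]
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            # Ix: x[i-1] is aligned with a space (open a new gap or extend one)
            Ix[i][j] = max(M[i - 1][j] + gap + space, Ix[i - 1][j] + space)
            # Iy: y[j-1] is aligned with a space (open a new gap or extend one)
            Iy[i][j] = max(M[i][j - 1] + gap + space, Iy[i][j - 1] + space)
    return M, Ix, Iy


def collapse(M, Ix, Iy):
    """Build C with C[i][j] = (M[i][j], Ix[i][j], Iy[i][j])."""
    return [list(zip(row_m, row_x, row_y)) for row_m, row_x, row_y in zip(M, Ix, Iy)]


C = collapse(*affine_matrices("CAATATG", "CATA"))
for cell in C[-1]:   # last row of the collapsed matrix
    print(cell)
```

The optimal global score is the largest of the three values in the bottom-right cell, `max(C[-1][-1])`.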

In [225]:
# Constant variable with the value of negative infinity to use in specifying entries of the last row
NEG_INF = float("-inf")

In [226]:
@memoize
def M(i, j):
    if i == 0 and j == 0:
        return 0
    if i == 0 or j == 0:
        return NEG_INF
    return max(
        M(i - 1, j - 1) + S(i, j),
        Ix(i - 1, j - 1) + S(i, j),
        Iy(i - 1, j - 1) + S(i, j),
    ) 

@memoize
def Ix(i, j):
    if j == 0:
        return g + s * i
    if i == 0:
        return NEG_INF
    return max(
        M(i - 1, j) + g + s,
        Ix(i - 1, j) + s,
    ) 

@memoize
def Iy(i, j):
    if i == 0:
        return g + s * j
    if j == 0:
        return NEG_INF
    return max(
        M(i, j - 1) + g + s,
        Iy(i, j - 1) + s,
    ) 

print_matrix(M)
print()
print_matrix(Ix)
print()
print_matrix(Iy)

0	-inf	-inf	-inf	-inf	
-inf	1	-6	-8	-10	
-inf	-6	2	-5	-5	
-inf	-8	-3	1	-2	
-inf	-10	-7	-2	0	
-inf	-12	-7	-6	-1	
-inf	-14	-11	-6	-7	
-inf	-16	-13	-10	-7	

-3	-inf	-inf	-inf	-inf	
-5	-inf	-inf	-inf	-inf	
-7	-4	-11	-13	-15	
-9	-6	-3	-10	-10	
-11	-8	-5	-4	-7	
-13	-10	-7	-6	-5	
-15	-12	-9	-8	-6	
-17	-14	-11	-10	-8	

-3	-5	-7	-9	-11	
-inf	-inf	-4	-6	-8	
-inf	-inf	-11	-3	-5	
-inf	-inf	-13	-8	-4	
-inf	-inf	-15	-12	-7	
-inf	-inf	-17	-12	-11	
-inf	-inf	-19	-16	-11	
-inf	-inf	-21	-18	-15	


In [227]:
global_affine_opt_score = -7

global_affine_opt_alignments = [
    [
        "CAATATG",
        "C---ATA"
    ],
    [
        "CAATATG", 
        "CA---TA"
    ]
]

global_affine_last_row = [
    (NEG_INF, -17, NEG_INF),
    (-16, -14, NEG_INF),
    (-13, -11, -21),
    (-10, -10, -18),
    (-7, -8, -15),
]

In [126]:
# tests for global_affine_opt_score
assert isinstance(global_affine_opt_score, int)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [77]:
# test for global_affine_opt_alignments
assert isinstance(global_affine_opt_alignments, list)
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [78]:
# test for global_affine_last_row_entry_0
assert isinstance(global_affine_last_row[0], tuple)
assert len(global_affine_last_row[0]) == 3
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [79]:
# test for global_affine_last_row_entry_1
assert isinstance(global_affine_last_row[1], tuple)
assert len(global_affine_last_row[1]) == 3
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [80]:
# test for global_affine_last_row_entry_2
assert isinstance(global_affine_last_row[2], tuple)
assert len(global_affine_last_row[2]) == 3
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [81]:
# test for global_affine_last_row_entry_3
assert isinstance(global_affine_last_row[3], tuple)
assert len(global_affine_last_row[3]) == 3
###
### AUTOGRADER TEST - DO NOT REMOVE
###
