# Pairwise Sequence Alignment

This workbook illustrates a simple implementation of Needleman-Wunsch global pairwise alignment. It is not the most efficient approach possible in Python (it relies on nested for loops), but it follows the pseudocode we discussed in class. 

See [Python for Bioinformatics](https://catalog.libraries.psu.edu/catalog/19170896) for more details. 

First, let's define the scoring scheme. We need a similarity matrix holding the match and mismatch scores, and we need a gap penalty. We will also define an index for each valid base character, so that a base can be mapped to its row or column in the similarity matrix later. We will assume a DNA alphabet (A, C, G, T) throughout. 

In [1]:
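# Import only the NumPy functions used below (a star import can shadow Python built-ins)
from numpy import arange, fill_diagonal, full, ones, zeros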

In [15]:
## Indices for each base
index = {
    'A': 0, 
    'C': 1, 
    'G': 2, 
    'T': 3}

## Set the gap penalty
gap = -2

## Match/mismatch similarity matrix (defined as numpy array).
matchscore = 3
mismatchscore = -1
sim = full((4, 4), mismatchscore)
fill_diagonal(sim, matchscore)

sim

array([[ 3, -1, -1, -1],
       [-1,  3, -1, -1],
       [-1, -1,  3, -1],
       [-1, -1, -1,  3]])
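
To make the role of `index` concrete, here is a minimal sketch of how a pair of bases is turned into a similarity score. The helper name `pair_score` is hypothetical (it is not used elsewhere in this workbook), and the scoring scheme is rebuilt inside the snippet so that it runs on its own.

```python
import numpy as np

# Same scoring scheme as above, rebuilt here so the snippet is self-contained
index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
sim = np.full((4, 4), -1)   # mismatch score everywhere...
np.fill_diagonal(sim, 3)    # ...except the diagonal, which holds the match score

def pair_score(a, b):
    """Hypothetical helper: similarity score for aligning base a with base b."""
    return int(sim[index[a], index[b]])

print(pair_score('A', 'A'))  # 3  (match)
print(pair_score('A', 'G'))  # -1 (mismatch)
```

Note that any character outside the DNA alphabet (for example `'N'` or a lowercase base) would raise a `KeyError` in this lookup.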

Now let's make a function that implements Needleman-Wunsch, returning the score matrix and the arrow matrix (which records the move that produced each score).

In [22]:
# Arguments: similarity matrix, gap penalty, sequence1, sequence2
# Note: this function relies on the global `index` dictionary defined above
def NeedlemanWunsch(sim_mat, gap_pen, seq1, seq2):
    l1, l2 = len(seq1), len(seq2)
    scores = zeros((l1+1, l2+1), int)
    arrows = zeros((l1+1, l2+1), int)

    # Define the scores for the first row and first column (gaps only)
    scores[0, :] = arange(l2+1) * gap_pen
    scores[:, 0] = arange(l1+1) * gap_pen
    arrows[0, :] = ones(l2+1)   # first row: horizontal gaps
    arrows[:, 0] = zeros(l1+1)  # first column: vertical gaps

    # Dynamic programming part - iterate over rows & columns
    for i in range(1, l1+1):
        for j in range(1, l2+1):
            f = zeros(3)  # f stores the three scoring options for the current cell

            f[0] = scores[i-1, j] + gap_pen  # Vertical gap option
            f[1] = scores[i, j-1] + gap_pen  # Horizontal gap option
            n1 = index[seq1[i-1]]
            n2 = index[seq2[j-1]]
            f[2] = scores[i-1, j-1] + sim_mat[n1, n2]  # Diagonal match/mismatch option

            # Keep the best option; on ties argmax returns the first maximum
            scores[i, j] = f.max()
            arrows[i, j] = f.argmax()

    return scores, arrows

Let's test the above function on a pair of sequences.

In [23]:
s, a = NeedlemanWunsch(sim, gap, 'ACT', 'ACGT')

print(s)
print(a)

[[ 0 -2 -4 -6 -8]
 [-2  3  1 -1 -3]
 [-4  1  6  4  2]
 [-6 -1  4  5  7]]
[[0 1 1 1 1]
 [0 2 1 1 1]
 [0 0 2 1 1]
 [0 0 0 2 2]]


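As you can see, the score matrix holds the best score for every pair of sequence prefixes, culminating in the overall alignment score in the bottom right corner (7 here). The arrow matrix records which option produced the score in each cell: 0 for a vertical gap, 1 for a horizontal gap, and 2 for a diagonal move. When two options tie, `argmax` returns the first maximum, so a vertical gap is preferred over a horizontal gap, which is preferred over a diagonal move. 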
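
As a worked example of how a single cell is filled, the sketch below recomputes the bottom right cell of the matrices printed above for 'ACT' vs 'ACGT'. The neighbouring scores are copied by hand from that output, and the variable names `up`, `left`, and `diag` are introduced here only for illustration.

```python
import numpy as np

# Neighbouring scores copied from the printed score matrix
up = 2     # scores[2, 4], the cell above
left = 5   # scores[3, 3], the cell to the left
diag = 4   # scores[2, 3], the cell diagonally up and left

gap = -2
match = 3  # the last bases of both sequences are 'T', so this is a match

# Same option order as in the function: vertical, horizontal, diagonal
options = np.array([up + gap, left + gap, diag + match])

print(options)           # [0 3 7]
print(options.max())     # 7 -> value stored in the score matrix
print(options.argmax())  # 2 -> value stored in the arrow matrix (diagonal)
```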

Next, we need to generate the actual alignment by performing a traceback operation from the bottom right corner to the top left. 

In [26]:
#arguments: arrow matrix, sequence1, sequence2
def Backtrace(arrows, seq1, seq2):
    #st1 and st2 will store the aligned sequence strings
    st1, st2 = '',''

    # v and h will be indices that start at the bottom right position
    v = arrows.shape[0]-1
    h = arrows.shape[1]-1

    while v > 0 or h > 0:
        if arrows[v,h] == 0:   #Vertical gap (insert gap in seq2)
            st1 += seq1[v-1]
            st2 += '-'
            v-=1
        elif arrows[v,h] == 1: #Horizontal gap (insert gap in seq1)
            st1 += '-'
            st2 += seq2[h-1]
            h-=1
        elif arrows[v,h] == 2: #Diagonal transition (aligned bases)
            st1 += seq1[v-1]
            st2 += seq2[h-1]
            v-=1
            h-=1
    
    #st1 and st2 now have the aligned sequences in reverse order
    st1 = st1[::-1]
    st2 = st2[::-1]

    return st1, st2

Let's try it on the arrow matrix that was computed above:

In [27]:
s1, s2 = Backtrace(a, 'ACT', 'ACGT')

print(s1)
print(s2)

AC-T
ACGT
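
To make the result easier to read, one option is to print a marker line between the two aligned strings. This is only a small sketch; the helper name `match_line` is hypothetical and not part of the workbook's functions.

```python
def match_line(a1, a2):
    """Hypothetical helper: '|' under identical bases, a space elsewhere."""
    return ''.join('|' if x == y and x != '-' else ' ' for x, y in zip(a1, a2))

s1, s2 = "AC-T", "ACGT"  # aligned strings from the traceback above
print(s1)
print(match_line(s1, s2))
print(s2)
```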


Finally, let's put it all together for a longer pair of sequences.

In [33]:
seq1 = "TGAACTCCTACTGTAAG"
seq2 = "TTGTTCTTACTGTCTAAG"

s, a = NeedlemanWunsch(sim, gap, seq1, seq2)
ascore = s[len(seq1), len(seq2)]

s1, s2 = Backtrace(a, seq1, seq2)

print("Alignment score = ", ascore)
print(s1)
print(s2)

Alignment score =  27
T-GAACTCCTACTGT--AAG
TTGTTCT--TACTGTCTAAG
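
As a sanity check, the alignment score can be recomputed directly from the two aligned strings by summing the score of each column. The sketch below assumes the same scoring scheme as above (match 3, mismatch -1, gap -2); the function name `score_alignment` is hypothetical.

```python
def score_alignment(a1, a2, match=3, mismatch=-1, gap=-2):
    """Hypothetical helper: sum the column scores of two aligned strings."""
    if len(a1) != len(a2):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for x, y in zip(a1, a2):
        if x == '-' or y == '-':
            total += gap        # gap column
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # different bases
    return total

print(score_alignment("AC-T", "ACGT"))                                  # 7
print(score_alignment("T-GAACTCCTACTGT--AAG", "TTGTTCT--TACTGTCTAAG"))  # 27
```

Both values agree with the bottom right corner of the corresponding score matrices.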
