# Global (end-to-end) alignment (Needleman-Wunsch)

Global alignment is very similar to (alphabet-weighted) edit distance. The key difference is that alignment is defined in terms of similarity, so we try to **maximize** the score. To retrieve the edit distance, on the other hand, we try to **minimize** the score (distance).

Global alignment can be transformed into the alphabet-weighted edit distance problem if a proper scoring matrix is used.
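As a minimal sketch of this transformation, assume a hypothetical scoring function `unit_cost_score` that gives 0 for a match and -1 for any mismatch or gap. Under that assumption, the negated maximum alignment score equals the unit-cost edit distance. The helper `max_alignment_score` is an illustrative, row-by-row version of the dynamic program and is not the implementation used later in this notebook.

```python
def unit_cost_score(a, b):
    # Hypothetical scoring function: 0 for a match, -1 for a mismatch or a gap ('_').
    return 0 if a == b else -1


def max_alignment_score(x, y, s):
    # Illustrative helper: global alignment score, keeping only one table row in memory.
    prev = [0]
    for cy in y:
        prev.append(prev[-1] + s('_', cy))
    for cx in x:
        cur = [prev[0] + s(cx, '_')]
        for j, cy in enumerate(y, start=1):
            cur.append(max(prev[j] + s(cx, '_'),      # gap in y
                           cur[j - 1] + s('_', cy),   # gap in x
                           prev[j - 1] + s(cx, cy)))  # match or mismatch
        prev = cur
    return prev[-1]


# Negating the maximum score gives the unit-cost edit distance.
edit_distance = -max_alignment_score('kitten', 'sitting', unit_cost_score)
print(edit_distance)  # 3
```

With a weighted scoring function the same negation would give an alphabet-weighted distance, provided the weights are chosen accordingly.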

### Scoring matrix

Global alignment is defined in terms of a pairwise scoring matrix. Therefore, we first need to define a **scoring function (matrix)**.

In practice, we set the values in the scoring matrix so that they are positive (or zero) for a match and negative for a mismatch. Then we try to maximize the alignment score (as stated at the beginning of this lecture).

We will define the scoring function in the following manner:
* the highest score should be for two matching characters - e.g. 1
* among mismatches, transitions (A↔G, C↔T) should have a higher score than transversions - e.g. -1
* the transversion score can then be -2
* gaps (insertions/deletions) should be heavily penalized - let's use -7 for that.


In [1]:
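def scoringMatrix(a, b):
    # match
    if a == b: return 1
    # gap (insertion/deletion)
    if a == '_' or b == '_' : return -7
    # order the pair so that each unordered pair is checked only once
    maxb, minb = max(a, b), min(a, b)
    # transitions: A<->G and C<->T
    if minb == 'A' and maxb == 'G': return -1
    if minb == 'C' and maxb == 'T': return -1
    # everything else is a transversion
    return -2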

In [2]:
scoringMatrix('C','C')

1

In [3]:
x='TACGTCAGC'
y='TATGTCATGC'

In [4]:
scoringMatrix(x[6],y[6])

1
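To see the whole scoring matrix at once, the function can be tabulated over the alphabet. This is only a sketch: the names `alphabet` and `table` are illustrative, and `scoringMatrix` is repeated so that the snippet runs on its own. Note that the ('_', '_') entry is never used by the alignment, since a gap is never aligned with a gap.

```python
def scoringMatrix(a, b):
    # Same scoring function as above, repeated to keep the snippet self-contained.
    if a == b: return 1
    if a == '_' or b == '_': return -7
    maxb, minb = max(a, b), min(a, b)
    if minb == 'A' and maxb == 'G': return -1
    if minb == 'C' and maxb == 'T': return -1
    return -2


alphabet = 'ACGT_'

# Nested dictionary: table[a][b] is the score for aligning a with b.
table = {a: {b: scoringMatrix(a, b) for b in alphabet} for a in alphabet}

# Print the matrix with a header row and one labelled row per character.
print('    ' + ' '.join(f'{b:>3}' for b in alphabet))
for a in alphabet:
    print(f'{a:>3} ' + ' '.join(f'{table[a][b]:>3}' for b in alphabet))

# The scoring function should not depend on the order of its arguments.
is_symmetric = all(table[a][b] == table[b][a] for a in alphabet for b in alphabet)
print(is_symmetric)
```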

### Tabular computation

Next, we need to define the global alignment function. How does it differ from the edit distance example we saw previously?

Both the base conditions and the recurrence relation are defined in terms of the scoring matrix (in edit distance, every change had a constant penalty of 1).

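Base conditions (with n = len(x), m = len(y) and V(0,0) = 0):
>V(0,j) = ∑ s("\_", y(k)) for k = 1..j, where 1 <= j <= m

>V(i,0) = ∑ s(x(k), "\_") for k = 1..i, where 1 <= i <= n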
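The base conditions are simply running sums of gap scores. A minimal sketch, assuming the gap penalty of -7 from the scoring function above (`GAP` and `gap_score` are illustrative stand-ins, not part of the notebook's code):

```python
from itertools import accumulate

GAP = -7  # gap penalty taken from the scoring function above


def gap_score(a, b):
    # Simplified stand-in: the base conditions only ever score a character against a gap.
    return GAP


x = 'TACGTCAGC'
y = 'TATGTCATGC'

# V(0,j): the first j characters of y aligned against gaps only.
first_row = [0] + list(accumulate(gap_score('_', c) for c in y))
# V(i,0): the first i characters of x aligned against gaps only.
first_col = [0] + list(accumulate(gap_score(c, '_') for c in x))

print(first_row)
print(first_col)
```

These lists correspond to the first row and the first column of the dynamic programming table.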


The recurrence relation is also defined in terms of the scoring matrix, and now we **maximize** over the three options:
>V(i,j) = max[V(i-1,j) + s(x(i), "\_"), V(i,j-1) + s("\_", y(j)), V(i-1,j-1) + s(x(i), y(j))]
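To make the recurrence concrete, here is a small worked example for cell (1,1) with x = 'TACGTCAGC' and y = 'TATGTCATGC'. The variable names are illustrative only; the numeric values come from the base conditions and the scoring function above.

```python
# Values already known from the base conditions (gap score -7).
V00, V01, V10 = 0, -7, -7

# x(1) = 'T' and y(1) = 'T', so the diagonal move is a match (+1);
# the vertical and horizontal moves each introduce a gap (-7).
s_match, s_gap = 1, -7

candidates = {
    'vertical': V01 + s_gap,     # V(i-1,j)   + s(x(i), '_')
    'horizontal': V10 + s_gap,   # V(i,j-1)   + s('_', y(j))
    'diagonal': V00 + s_match,   # V(i-1,j-1) + s(x(i), y(j))
}

V11 = max(candidates.values())
best_move = max(candidates, key=candidates.get)
print(V11, best_move)  # 1 diagonal
```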


Maximizing (instead of minimizing) the score and using the scoring function, here is the implementation:

In [5]:
import numpy


def globalAlignment(x, y, s):
    D = numpy.zeros((len(x) + 1, len(y) + 1), dtype=int)
    
    for i in range(1, len(x) + 1):
        D[i,0] = D[i-1,0] + s(x[i-1], '_')  
    for j in range(1, len(y)+1):
        D[0,j] = D[0,j-1] + s('_', y[j-1])
    
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i,j] = max(D[i-1,j]   + s(x[i-1], '_'),
                         D[i,j-1]   + s('_', y[j-1]), 
                         D[i-1,j-1] + s(x[i-1], y[j-1]))
            
    # function returns table and global alignment score
    #alignment score is in cell (n,m) of the matrix
    return D, D[len(x),len(y)] 

In [6]:
D, alignmentScore = globalAlignment('TACGTCAGC', 'TATGTCATGC', scoringMatrix)

In [7]:
print(alignmentScore)

0


In [8]:
print(D)

[[  0  -7 -14 -21 -28 -35 -42 -49 -56 -63 -70]
 [ -7   1  -6 -13 -20 -27 -34 -41 -48 -55 -62]
 [-14  -6   2  -5 -12 -19 -26 -33 -40 -47 -54]
 [-21 -13  -5   1  -6 -13 -18 -25 -32 -39 -46]
 [-28 -20 -12  -6   2  -5 -12 -19 -26 -31 -38]
 [-35 -27 -19 -11  -5   3  -4 -11 -18 -25 -32]
 [-42 -34 -26 -18 -12  -4   4  -3 -10 -17 -24]
 [-49 -41 -33 -25 -19 -11  -3   5  -2  -9 -16]
 [-56 -48 -40 -32 -24 -18 -10  -2   3  -1  -8]
 [-63 -55 -47 -39 -31 -25 -17  -9  -3   1   0]]


### Traceback

The traceback is implemented in the same way as in the edit distance example. For the current cell, we compute the scores of the three neighbouring cells from which we might have reached it. At each cell we ask - which predecessor gave the best score?

Scores are computed from the values of these cells and the type of movement (diagonal, vertical, horizontal). Then we move to the cell that gave the best score. In global alignment with such a scoring matrix, that is the maximum value. Ties are broken in favour of the diagonal move, then the vertical one.

In our implementation, while doing that, we record the edit operations on the input strings and the alignment between them.

In [9]:
def traceback(x,y,V,s):
    
    # initializing starting position cell(n,m)
    i=len(x)
    j=len(y)
    
    # initializing result strings: ax, ay - aligned x and y, am - match markers, tr - edit transcript
    ax, ay, am, tr = '', '', '', ''
    
    # exit condition is when we reach cell (0,0)
    while i > 0 or j > 0:
        
        # calculating diagonal, vertical and horizontal scores for current cell;
        # -inf marks a move that is not possible at the table border
        d, v, h = float('-inf'), float('-inf'), float('-inf')
        
        if i > 0 and j > 0:
            delta = 1 if x[i-1] == y[j-1] else 0  # 1 for a match, 0 for a mismatch
            d = V[i-1,j-1] + s(x[i-1], y[j-1])  # diagonal movement   
        if i > 0: v = V[i-1,j] + s(x[i-1], '_')  # vertical movement
        if j > 0: h = V[i,j-1] + s('_', y[j-1])  # horizontal movement
            
        # backtracing to next (previous) cell
        if d >= v and d >= h:
            ax += x[i-1]
            ay += y[j-1]
            if delta == 1:
                tr += 'M'
                am += '|'
            else:
                tr += 'R'
                am += ' '
            i -= 1
            j -= 1
        elif v >= h:
            ax += x[i-1]
            ay += '_'
            tr += 'D'
            am += ' '
            i -= 1
        else:
            ay += y[j-1]
            ax += '_'
            tr += 'I'
            am += ' '
            j -= 1
            
    alignment='\n'.join([ax[::-1], am[::-1], ay[::-1]])
    return alignment, tr[::-1]

In [10]:
x = 'TACGTCAGC'
y = 'TATGTCATGC'
D, alignmentScore = globalAlignment(x, y, scoringMatrix)

In [11]:
alignment, transcript = traceback(x, y, D, scoringMatrix)

In [12]:
print(alignment)
print(transcript)

TACGTCA_GC
|| |||| ||
TATGTCATGC
MMRMMMMIMM
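As a sanity check, the score of an alignment can be recomputed by summing the scores of its columns. The helper `alignment_score` below is a hypothetical addition for illustration; `scoringMatrix` is repeated so that the snippet runs on its own.

```python
def scoringMatrix(a, b):
    # Same scoring function as above, repeated to keep the snippet self-contained.
    if a == b: return 1
    if a == '_' or b == '_': return -7
    maxb, minb = max(a, b), min(a, b)
    if minb == 'A' and maxb == 'G': return -1
    if minb == 'C' and maxb == 'T': return -1
    return -2


def alignment_score(ax, ay, s):
    # Hypothetical helper: sum the score of every aligned column.
    return sum(s(a, b) for a, b in zip(ax, ay))


# 8 matches (+8), one C/T transition (-1) and one gap (-7) give 0.
print(alignment_score('TACGTCA_GC', 'TATGTCATGC', scoringMatrix))  # 0
```

The result agrees with the value in cell (n,m) of the table printed earlier.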


In [13]:
print(D)

[[  0  -7 -14 -21 -28 -35 -42 -49 -56 -63 -70]
 [ -7   1  -6 -13 -20 -27 -34 -41 -48 -55 -62]
 [-14  -6   2  -5 -12 -19 -26 -33 -40 -47 -54]
 [-21 -13  -5   1  -6 -13 -18 -25 -32 -39 -46]
 [-28 -20 -12  -6   2  -5 -12 -19 -26 -31 -38]
 [-35 -27 -19 -11  -5   3  -4 -11 -18 -25 -32]
 [-42 -34 -26 -18 -12  -4   4  -3 -10 -17 -24]
 [-49 -41 -33 -25 -19 -11  -3   5  -2  -9 -16]
 [-56 -48 -40 -32 -24 -18 -10  -2   3  -1  -8]
 [-63 -55 -47 -39 -31 -25 -17  -9  -3   1   0]]


### Test

In [14]:
x = 'ACTGACTGACTG'
y = 'ACTGAGTGTTTG'
alignment, transcript = traceback(x, y, globalAlignment(x, y, scoringMatrix)[0], scoringMatrix)
print(alignment)
print(transcript)

ACTGACTGACTG
||||| ||  ||
ACTGAGTGTTTG
MMMMMRMMRRMM
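The edit transcript alone is enough to rebuild the aligned strings, which gives another way to check the traceback output. This is a sketch under the assumption that the transcript uses the letters produced above (M - match, R - replace, D - delete, I - insert); `replay_transcript` is a hypothetical helper, not part of the notebook's code.

```python
def replay_transcript(x, y, transcript):
    # Hypothetical helper: rebuild the two aligned strings from an edit transcript.
    ax, ay = [], []
    i = j = 0
    for op in transcript:
        if op in ('M', 'R'):    # match or replace: consume one character of each string
            ax.append(x[i]); ay.append(y[j])
            i += 1; j += 1
        elif op == 'D':         # delete: character of x aligned with a gap
            ax.append(x[i]); ay.append('_')
            i += 1
        elif op == 'I':         # insert: character of y aligned with a gap
            ax.append('_'); ay.append(y[j])
            j += 1
        else:
            raise ValueError(f'unknown operation: {op}')
    if i != len(x) or j != len(y):
        raise ValueError('transcript does not cover both strings')
    return ''.join(ax), ''.join(ay)


ax, ay = replay_transcript('TACGTCAGC', 'TATGTCATGC', 'MMRMMMMIMM')
print(ax)  # TACGTCA_GC
print(ay)  # TATGTCATGC
```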
