# Local (Smith-Waterman) alignment

We use local alignment to detect regions of high similarity between two strings. We are not interested in an end-to-end comparison of the strings; instead, we want to distinguish areas of high similarity from background "noise".

What is local alignment?

In simple words: given strings x and y, what is the best global alignment value of a **substring of x** to a **substring of y**, taken over all pairs of substrings? This is **local alignment**.


The main difference from global alignment is that local alignment **must** be implemented as maximization of the objective function.

**Scoring scheme**

The scoring scheme is defined as follows:
* all identities have a score larger than 0, e.g. 2
* all mismatches and gaps have a score smaller than 0; we'll use -6 for indels and -4 for mismatches

In [1]:
def scoringMatrix(a,b):
    # check gaps first so that a gap paired with a gap is never scored as a match
    if a=='_' or b=='_': return -6 # indel
    if a==b: return 2 # match
    return -4 # mismatch

### Tabular computation

**Objective function and base conditions**

In local alignment, as mentioned before, we're trying to maximise the objective function. The most important difference is that, besides all the previous possibilities in the objective function (indels, match, mismatch), we've added **zero** as an option. This way, we don't have to go all the way back to cell (0,0), and we are able to do global alignment of substrings of x and y.

Here is the recurrence relation (objective function): 
> V(i,j) = max[ 0, V(i-1, j) + s(S1(i), '\_'), V(i, j-1) + s('\_', S2(j)), V(i-1, j-1) + s(S1(i), S2(j)) ]
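
To make the recurrence concrete, here is a minimal sketch that evaluates a single cell from its three neighbours. The names `score` and `cell_value` are hypothetical helpers introduced only for this illustration; `score` is assumed to follow the scoring scheme defined earlier (+2 match, -6 indel, -4 mismatch).

```python
def score(a, b):
    # assumed to mirror the scoring scheme above: +2 match, -6 indel, -4 mismatch
    if a == '_' or b == '_':
        return -6
    return 2 if a == b else -4

def cell_value(up, left, diag, a, b):
    """Value of V(i,j) given V(i-1,j)=up, V(i,j-1)=left, V(i-1,j-1)=diag,
    with a = S1(i) and b = S2(j)."""
    return max(0,
               up + score(a, '_'),    # gap in the second string
               left + score('_', b),  # gap in the first string
               diag + score(a, b))    # match or mismatch

print(cell_value(0, 0, 0, 'A', 'A'))   # match from an empty prefix: 2
print(cell_value(0, 0, 0, 'A', 'C'))   # every option is negative, so zero wins: 0
print(cell_value(4, 4, 10, 'T', 'G'))  # mismatch after a strong diagonal: 10 - 4 = 6
```

The second call shows the role of the zero option: a poor start is simply discarded instead of dragging the score below zero.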


**Base conditions**

For local alignment, the base conditions are all set to 0: every cell in the first column and in the first row is zero.

We write this as:
> V(i,0)=0

> V(0,j)=0, ∀ i, j


In order to find the maximum value of local alignment, we need to find the maximum in the matrix.

Below is the *localAlignment* function. It takes two strings and a scoring function as arguments and returns the filled matrix together with its maximum value (the value of the local alignment).
We initialize an (n+1) x (m+1) matrix of zeros, where n and m are the lengths of x and y (both the first column and the first row keep their zero values), and then fill the rest of the matrix using the objective function.

In [19]:
import numpy

def localAlignment(x,y, s):
    D = numpy.zeros((len(x)+1,len(y)+1), dtype=int)

    for i in range(1,len(x)+1):
        for j in range(1,len(y)+1):
            D[i,j]=max(0, 
                       D[i-1,j]  +s(x[i-1], '_'),
                       D[i,j-1]  +s('_',   y[j-1]), 
                       D[i-1,j-1]+s(x[i-1],y[j-1]))
    
    # find the cell in the table which has maximum value       
    localMax = D.max()
    return D, localMax

In [20]:
x = 'GGTATGCTGGCGCTA'
y = 'TATATGCGGCGTTT'
V, score = localAlignment(x, y, scoringMatrix)
print(V)

[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  2  0  2  2  0  2  0  0  0]
 [ 0  0  0  0  0  0  2  0  2  4  0  2  0  0  0]
 [ 0  2  0  2  0  2  0  0  0  0  0  0  4  2  2]
 [ 0  0  4  0  4  0  0  0  0  0  0  0  0  0  0]
 [ 0  2  0  6  0  6  0  0  0  0  0  0  2  2  2]
 [ 0  0  0  0  2  0  8  2  2  2  0  2  0  0  0]
 [ 0  0  0  0  0  0  2 10  4  0  4  0  0  0  0]
 [ 0  2  0  2  0  2  0  4  6  0  0  0  2  2  2]
 [ 0  0  0  0  0  0  4  0  6  8  2  2  0  0  0]
 [ 0  0  0  0  0  0  2  0  2  8  4  4  0  0  0]
 [ 0  0  0  0  0  0  0  4  0  2 10  4  0  0  0]
 [ 0  0  0  0  0  0  2  0  6  2  4 12  6  0  0]
 [ 0  0  0  0  0  0  0  4  0  2  4  6  8  2  0]
 [ 0  2  0  2  0  2  0  0  0  0  0  0  8 10  4]
 [ 0  0  4  0  4  0  0  0  0  0  0  0  2  4  6]]


In [21]:
print("Best score is: %d in cell %s" % (score, numpy.unravel_index(numpy.argmax(V), V.shape)))

Best score is: 12 in cell (12, 11)
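
For contrast, the sketch below fills the table twice for the same pair of strings: once with the zero option and zero borders (local), and once without them (global, where the borders accumulate gap penalties). It is an illustration under the scoring scheme above, not a replacement for `localAlignment`; `score` and `fill_table` are hypothetical names, and plain lists are used instead of numpy to keep the block self-contained.

```python
def score(a, b):
    # assumed to mirror the scoring scheme above: +2 match, -6 indel, -4 mismatch
    if a == '_' or b == '_':
        return -6
    return 2 if a == b else -4

def fill_table(x, y, local=True):
    """Fill the DP table; local=True adds the zero option and keeps zero borders."""
    n, m = len(x), len(y)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # global alignment: the borders accumulate gap penalties
        for i in range(1, n + 1):
            V[i][0] = V[i - 1][0] + score(x[i - 1], '_')
        for j in range(1, m + 1):
            V[0][j] = V[0][j - 1] + score('_', y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(V[i - 1][j] + score(x[i - 1], '_'),
                       V[i][j - 1] + score('_', y[j - 1]),
                       V[i - 1][j - 1] + score(x[i - 1], y[j - 1]))
            V[i][j] = max(0, best) if local else best
    return V

x = 'GGTATGCTGGCGCTA'
y = 'TATATGCGGCGTTT'
# local score: maximum anywhere in the table
local_score = max(max(row) for row in fill_table(x, y, local=True))
# global score: bottom-right cell
global_score = fill_table(x, y, local=False)[-1][-1]
print(local_score, global_score)
```

The local score can never be lower than the global one, because the full strings are themselves one of the substring pairs that local alignment considers.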


### Traceback

We can see here what the value of local alignment is, but we can't see clearly which regions have maximum local alignment score. 

In order to find the regions with the maximum local alignment score, we need to do the traceback. First, we find the maximum value of the local alignment (which is also the maximum value in the matrix). Second, we backtrace from that cell until we reach a cell with value 0 (which includes any cell in the first row or column).


We backtrace the same way as in all previous examples: to move back from one cell to the next, we take the three neighbouring cells and recalculate the value of moving from each of them to the current cell. Then we step back to the cell that gives the maximum value. Depending on whether the move is diagonal, vertical, or horizontal, we record a match/mismatch or insert a gap into one of the aligned strings.

In [22]:
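def tracebak(x,y,V, s):
    # start from the first cell that holds the maximum score
    maxValue=numpy.where(V==V.max())
    i=maxValue[0][0]
    j=maxValue[1][0]
    # ax, ay: aligned strings; am: match markers; tr: transcript (all built in reverse)
    ax,ay,am, tr = '','','',''
    # stop at the first zero cell; the first row and column are all zeros
    while (i>0 or j>0) and V[i,j]!=0: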
        if i>0 and j>0:
            sigma = 1 if x[i-1]==y[j-1] else 0
            d=V[i-1,j-1]+s(x[i-1],y[j-1]) # diagonal move
        if i>0: v=V[i-1,j]+s(x[i-1],'_')  # vertical move
        if j>0: h=V[i,j-1]+s('_',y[j-1])  # horizontal move
        
        # diagonal is the best
        if d>=v and d>=h:
            ax += x[i-1]
            ay += y[j-1]
            if sigma==1:
                tr+='M'
                am+='|'
            else:
                tr+='R'
                am+=' '
            i-=1
            j-=1
        # vertical is the best
        elif v>=h:
            ax+=x[i-1]
            ay+='_'
            tr+='I'
            am+=' '
            i-=1
        # horizontal is the best
        else:
            ay+=y[j-1]
            ax+='_'
            tr+='D'
            am+=' '
            j-=1
    alignment= '\n'.join([ax[::-1], am[::-1], ay[::-1]])
    return alignment, tr[::-1]

In [23]:
alignment, transcript = tracebak(x,y,V, scoringMatrix)

In [24]:
print(alignment)
print(transcript)

TATGCTGGCG
||||| ||||
TATGC_GGCG
MMMMMIMMMM
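
As a quick sanity check, the score of the printed alignment can be recomputed column by column. This is a small sketch; `score` and `alignment_score` are hypothetical helpers, with `score` assumed to follow the scoring scheme defined earlier.

```python
def score(a, b):
    # assumed to mirror the scoring scheme above: +2 match, -6 indel, -4 mismatch
    if a == '_' or b == '_':
        return -6
    return 2 if a == b else -4

def alignment_score(ax, ay):
    """Sum the column scores of two aligned strings of equal length."""
    if len(ax) != len(ay):
        raise ValueError("aligned strings must have the same length")
    return sum(score(a, b) for a, b in zip(ax, ay))

# 9 matches and 1 indel: 9 * 2 - 6 = 12, the maximum found in the matrix
print(alignment_score('TATGCTGGCG', 'TATGC_GGCG'))
```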


In [31]:
x = 'brontosaurus if the bravest dinosaurus'
y = 'a brave little toster'
V, score = localAlignment(x, y, scoringMatrix)
print("Best score is: %d in cell %s" % (score, numpy.unravel_index(numpy.argmax(V), V.shape)))
alignment, transcript = tracebak(x,y,V, scoringMatrix)
print("Alignment is:")
print(alignment)
print(transcript)

Best score is: 12 in cell (25, 7)
Alignment is:
 brave
||||||
 brave
MMMMMM
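
The best cell and the transcript together also tell us which substrings were aligned. The sketch below recovers their positions; `aligned_spans` is a hypothetical helper, and it assumes the transcript letters produced by the traceback above (M and R consume a character from both strings, I consumes one from x only, D consumes one from y only).

```python
def aligned_spans(end_i, end_j, transcript):
    """Return ((x_start, x_end), (y_start, y_end)) as half-open slice bounds.

    end_i, end_j are the row and column of the best cell in the matrix.
    """
    x_len = sum(1 for op in transcript if op in 'MRI')  # characters taken from x
    y_len = sum(1 for op in transcript if op in 'MRD')  # characters taken from y
    return (end_i - x_len, end_i), (end_j - y_len, end_j)

# first example: best cell (12, 11)
x = 'GGTATGCTGGCGCTA'
y = 'TATATGCGGCGTTT'
(xs, xe), (ys, ye) = aligned_spans(12, 11, 'MMMMMIMMMM')
print(x[xs:xe], y[ys:ye])  # TATGCTGGCG TATGCGGCG

# second example: best cell (25, 7)
x2 = 'brontosaurus if the bravest dinosaurus'
y2 = 'a brave little toster'
(xs2, xe2), (ys2, ye2) = aligned_spans(25, 7, 'MMMMMM')
print(repr(x2[xs2:xe2]), repr(y2[ys2:ye2]))  # ' brave' ' brave'
```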
