# Longest common subsequence

A subsequence of S is a sequence of characters from S in the same relative order as in S, but not necessarily consecutive.
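
The definition can be checked mechanically. The following is a minimal sketch; `is_subsequence` is a hypothetical helper name and is not part of the functions defined later in this section.

```python
def is_subsequence(sub, s):
    """Return True if `sub` can be obtained from `s` by deleting characters."""
    it = iter(s)
    # `ch in it` advances the iterator until it finds ch,
    # so the characters of `sub` must appear in `s` in the same order
    return all(ch in it for ch in sub)

print(is_subsequence('TCTA', 'ATCTGAT'))  # True
print(is_subsequence('TCTA', 'TGCATA'))   # True
print(is_subsequence('AAT', 'TGCATA'))    # False
```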


**Longest common subsequence problem:** given strings S1 and S2, find a longest common subsequence of the two.

As *Gusfield 1.16.2* states: With a scoring scheme that scores a one for each match and a zero for each mismatch or space, the matched characters in an alignment of maximum value form a longest common subsequence.

We also want to make sure that a mismatch is never preferred over an insertion or deletion. We can do this in two ways:

1. Use a scoring scheme that penalizes a mismatch the most. For LCS this can be one for a match, minus one for a mismatch, and zero for an insertion/deletion. This ensures that the mismatch (replacement) edit operation is never used.

2. Use a recurrence relation in which the mismatch operation is prohibited. We check whether S1(i) and S2(j) match. If they do, we compare the scores of the three possible edit operations (match, insertion, deletion). If they don't, we compare only the insertion and deletion scores, since mismatch is prohibited.
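
The two options can also be written as recurrences. This is only a restatement of the descriptions above, using the table name D from the code in this section and boundary values D(i,0) = D(0,j) = 0. The symbol s(i,j) is introduced here for illustration. The first option is:

$$
D(i,j) = \max \begin{cases} D(i-1,j-1) + s(i,j) \\ D(i-1,j) \\ D(i,j-1) \end{cases}
\qquad
s(i,j) = \begin{cases} 1 & \text{if } S_1(i) = S_2(j) \\ -1 & \text{otherwise} \end{cases}
$$

The second option is:

$$
D(i,j) = \begin{cases}
\max\{\, D(i-1,j-1) + 1,\; D(i-1,j),\; D(i,j-1) \,\} & \text{if } S_1(i) = S_2(j) \\
\max\{\, D(i-1,j),\; D(i,j-1) \,\} & \text{otherwise}
\end{cases}
$$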

"matched characters in alignment of maximum value form a longest common subsequence" - we're looking for all characters in the alignment that match. That is, during the backtrace we're interested only in the characters that match, and we store only those.

In [1]:
def traceLCS(x,y,D):
    i=len(x)
    j=len(y)
    traceLCS = ''
    while i>0 and j>0:
        d, v, h = 0, 0, 0

        if j>0: h = D[i,j-1] #needs checking (if j>0)
        if i>0: v = D[i-1,j] #needs checking (if i >0)
        if i>0 and j>0: 
            delt = 1 if x[i-1] == y[j-1] else -1
            d = D[i-1,j-1] + delt
        if d >= v and d >= h:
            # a diagonal move does not by itself mean the characters match,
            # so we append to the LCS only on a match; since mismatch is penalized
            # the most, a diagonal mismatch is never optimal and this check is a safeguard
            if delt == 1: traceLCS+=(x[i-1])
            i=i-1
            j=j-1
        elif v>=h:
            i=i-1
        else:
            j=j-1
    # the string was built from the end, so we reverse it
    return traceLCS[::-1]

As mentioned above, the scoring scheme gives one for each match and zero for each space. A mismatch is penalized with minus one, so it is never used as an edit operation (insertion/deletion is used instead). This means that the value in the lower right cell (n,m) is the LCS length. Because the scoring is this simple, we don't implement a separate scoring function; we just adjust it inside *findLCS*.
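
The claim that the lower right cell holds the LCS length can be checked on small inputs against an exhaustive search. The sketch below is only an illustration: `lcs_length` and `brute_force_lcs_length` are hypothetical names, and plain lists are used instead of numpy so that the block is self-contained.

```python
from itertools import combinations

def lcs_length(x, y):
    """Lower right cell of the table for scores +1 match, -1 mismatch, 0 space.

    Only two rows are kept, since each row depends on the previous one.
    """
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            sigma = 1 if x[i-1] == y[j-1] else -1
            cur[j] = max(prev[j-1] + sigma, prev[j], cur[j-1])
        prev = cur
    return prev[len(y)]

def brute_force_lcs_length(x, y):
    """Try subsequences of x from longest to shortest (exponential, tiny inputs only)."""
    def is_subseq(sub, s):
        it = iter(s)
        return all(ch in it for ch in sub)
    for k in range(min(len(x), len(y)), 0, -1):
        for idx in combinations(range(len(x)), k):
            if is_subseq([x[i] for i in idx], y):
                return k
    return 0

print(lcs_length('ATCTGAT', 'TGCATA'), brute_force_lcs_length('ATCTGAT', 'TGCATA'))  # 4 4
```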

We'll write a *findLCS* function that returns the table, the longest common subsequence, and its length.

In [2]:
import numpy

def findLCS(x,y, traceback):
    D = numpy.zeros((len(x)+1, len(y)+1), dtype=int)
    D[0,1:]=0;
    D[1:,0]=0;
    for i in range(1,len(x)+1):
        for j in range(1, len(y)+1):
            sigma = 1 if x[i-1]==y[j-1] else -1
            D[i,j]=max(D[i-1,j-1]+sigma, 
                       D[i-1,j], 
                       D[i, j-1])
    
    return D, traceback(x,y,D), D[len(x),len(y)]

In [3]:
D, lcs, lenLCS = findLCS('ATCTGAT','TGCATA', traceLCS)

In [4]:
print("Longest common subsequence has length %s and is %s." % (lenLCS, lcs))
print('Table is:\n', D)

Longest common subsequence has length 4 and is TCTA.
Table is:
 [[0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1]
 [0 1 1 1 1 2 2]
 [0 1 1 2 2 2 2]
 [0 1 1 2 2 3 3]
 [0 1 2 2 2 3 3]
 [0 1 2 2 3 3 4]
 [0 1 2 2 3 4 4]]


The second approach changes the recurrence relation (tabular computation) so that mismatch is prohibited. Example code is given below. As mentioned above, if the characters at a given position do not match, we compute only the insertion/deletion scores.

In [5]:
import numpy

def findLCS(x,y):
    D =  numpy.zeros((len(x)+1, len(y)+1), dtype=int)
    D[0,1:]=0;
    D[1:,0]=0;
    for i in range(1,len(x)+1):
        for j in range(1, len(y)+1):
            
            # check whether the characters at this position match;
            # if they do, compare the scores of all three edit operations,
            # otherwise, since mismatch is prohibited, compare only insertion/deletion
            if x[i-1]==y[j-1]:
                D[i,j]=max(D[i-1,j-1]+1, 
                           D[i-1,j], 
                           D[i, j-1])
            else:
                D[i,j]=max(D[i-1,j], 
                           D[i, j-1])
                
    
    return D, D[len(x),len(y)]

For this recurrence the traceback function should be adjusted accordingly: a diagonal move is allowed only where the characters match.
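
One possible adjustment is sketched here. It is not the author's traceback: `lcs_table` and `trace_lcs_no_mismatch` are hypothetical names, and the table is rebuilt with plain lists (same recurrence as above) so that the block runs on its own.

```python
def lcs_table(x, y):
    """Table for the recurrence in which a diagonal move requires a match."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i-1] == y[j-1]:
                D[i][j] = max(D[i-1][j-1] + 1, D[i-1][j], D[i][j-1])
            else:
                D[i][j] = max(D[i-1][j], D[i][j-1])
    return D

def trace_lcs_no_mismatch(x, y, D):
    """Walk back from the lower right cell; go diagonally only on a match."""
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i-1] == y[j-1]:
            # matched character: it belongs to the LCS
            out.append(x[i-1])
            i -= 1
            j -= 1
        elif D[i-1][j] >= D[i][j-1]:
            i -= 1      # vertical move
        else:
            j -= 1      # horizontal move
    return ''.join(reversed(out))

D = lcs_table('ATCTGAT', 'TGCATA')
print(trace_lcs_no_mismatch('ATCTGAT', 'TGCATA', D))  # TCTA
```

Ties between the vertical and horizontal moves are broken in favour of the vertical one here. A different choice can produce a different subsequence of the same length.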

In [6]:
D, lenLCS = findLCS('ATCTGAT','TGCATA')

In [7]:
print("Longest common subsequence has length %s." % lenLCS)
print('Table is:\n', D)

Longest common subsequence has length 4.
Table is:
 [[0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1]
 [0 1 1 1 1 2 2]
 [0 1 1 2 2 2 2]
 [0 1 1 2 2 3 3]
 [0 1 2 2 2 3 3]
 [0 1 2 2 3 3 4]
 [0 1 2 2 3 4 4]]


# Occurrence of P in T

Sometimes we want to know whether string T contains a substring that is similar enough to P. We can determine this using a similarity metric.

We say that there is an approximate occurrence of P in T ending at position j of T if and only if V(n,j) ≥ 𝝈, where n is the length of P and 𝝈 is the similarity threshold. Moreover, T[k..j] is an approximate occurrence of P in T if and only if V(n,j) ≥ 𝝈 and there is a path of backpointers from cell (n,j) to cell (0,k).

If T is on the horizontal axis and P is on the vertical axis, then the initial conditions are:
> V(0,j)=0, ∀j

This way we ensure that the P-like string can start at any position of T.

In the scoring scheme for this example, every mismatch and every space is penalized with -1. Since P is on the vertical axis, we look in the last row for a cell with a value of at least 𝝈. In this specific example we take only the maximum value in the last row and then check whether it reaches the threshold.
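
Instead of only the maximum, every column of the last row that reaches the threshold can be reported. This is a minimal sketch under the same scoring (+1 match, -1 mismatch or space). `approx_end_positions` is a hypothetical name, and plain lists are used instead of numpy.

```python
def approx_end_positions(p, t, threshold):
    """Return all j such that V(n, j) >= threshold, where n = len(p)."""
    n, m = len(p), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 stays 0: free start in t
    for i in range(1, n + 1):
        V[i][0] = -i                            # p[:i] aligned against spaces only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if p[i-1] == t[j-1] else -1
            V[i][j] = max(V[i-1][j-1] + s, V[i-1][j] - 1, V[i][j-1] - 1)
    return [j for j in range(m + 1) if V[n][j] >= threshold]

p, t = 'TACGTCAGC', 'AACCCTATGTCATGCCTTGGA'
print(approx_end_positions(p, t, 6))  # [15]
print(approx_end_positions(p, t, 5))  # [15, 16]
```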

Our P in T function receives two strings and a threshold as inputs, and returns the similarity table, the string recovered by the traceback, and the edit transcript.

In [8]:
import numpy
def occofPinT(p,t,treshold):
    D =  numpy.zeros((len(p)+1, len(t)+1), dtype=int)
    D[0,1:]=0;
    D[1:, 0]=[-i for i in range(1,len(p)+1)]
    for i in range(1,len(p)+1):
        for j in range(1, len(t)+1):
            sigma = 1 if p[i-1]==t[j-1] else -1
            D[i,j]=max(D[i-1,j-1]+sigma, D[i-1,j]-1, D[i, j-1]-1)
            
    index, maxVal = None, 0
    for j in range(0, len(t)+1):
        if D[len(p), j] > maxVal:
            index, maxVal = j, D[len(p), j]
    
    if maxVal < treshold:
        return D, 'There is no P in T.', 'There is no transcript.'
    else:
        # PinTtrace returns (string, transcript); pass the arguments p and t, not globals
        occ, transcript = PinTtrace(p, t[:index], D)
        return D, occ, transcript

In the traceback function, since the reference is on the horizontal axis (instead of the vertical one, as usual), each vertical move represents an insertion, while each horizontal move represents a deletion.

The traceback function returns two strings: the characters of x consumed on diagonal moves (matches and replacements) and the edit transcript. x labels the rows and y labels the columns.

In [9]:
def PinTtrace(x,y,D):
    i=len(x)
    j=len(y)
    trace = ''
    eT = ''
    while i>0 and j>0:
        d, v, h = 0, 0, 0

        if j>0: h = D[i,j-1]-1  # horizontal movement score
        if i>0: v = D[i-1,j]-1  # vertical movement score
        if i>0 and j>0:         # diagonal movement score
            delt = 1 if x[i-1] == y[j-1] else -1
            d = D[i-1,j-1] + delt

        # ties are resolved by choosing vertical over diagonal and diagonal over horizontal (Gusfield)
        if d > v and d >= h:
            trace+=(x[i-1])
            eT+='M' if delt == 1 else 'R'
            i=i-1
            j=j-1
        # in this case (since t is on the horizontal axis) a vertical move is an insertion
        elif v>=d and v>=h:
            i=i-1
            eT+='I'
        else:
            j=j-1
            eT+='D'
    
    return trace[::-1], eT[::-1]   

In [10]:
x, y = 'TACGTCAGC', 'AACCCTATGTCATGCCTTGGA'
V, occurence, traceback = occofPinT(x, y, len(x)-3)
print(occurence, traceback)
print(V)

TACGTCAGC MMRMMMMDMM
[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [-1 -1 -1 -1 -1 -1  1  0  1  0  1  0 -1  1  0 -1 -1  1  1  0 -1 -1]
 [-2  0  0 -1 -2 -2  0  2  1  0  0  0  1  0  0 -1 -2  0  0  0 -1  0]
 [-3 -1 -1  1  0 -1 -1  1  1  0 -1  1  0  0 -1  1  0 -1 -1 -1 -1 -1]
 [-4 -2 -2  0  0 -1 -2  0  0  2  1  0  0 -1  1  0  0 -1 -2  0  0 -1]
 [-5 -3 -3 -1 -1 -1  0 -1  1  1  3  2  1  1  0  0 -1  1  0 -1 -1 -1]
 [-6 -4 -4 -2  0  0 -1 -1  0  0  2  4  3  2  1  1  1  0  0 -1 -2 -2]
 [-7 -5 -3 -3 -1 -1 -1  0 -1 -1  1  3  5  4  3  2  1  0 -1 -1 -2 -1]
 [-8 -6 -4 -4 -2 -2 -2 -1 -1  0  0  2  4  4  5  4  3  2  1  0  0 -1]
 [-9 -7 -5 -3 -3 -1 -2 -2 -2 -1 -1  1  3  3  4  6  5  4  3  2  1  0]]
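
The string printed above is made of the characters of P consumed on diagonal moves. To report the matching substring of T itself, one option is to slice T between the column where the traceback ends and the column where it starts. The following is a minimal sketch, assuming the same scoring and the same tie-breaking (vertical over diagonal over horizontal). `approx_occurrence` is a hypothetical name, and plain lists replace numpy so that the block is self-contained.

```python
def approx_occurrence(p, t, threshold):
    """Return (start, end, t[start:end], transcript), or None if below threshold."""
    n, m = len(p), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = -i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if p[i-1] == t[j-1] else -1
            V[i][j] = max(V[i-1][j-1] + s, V[i-1][j] - 1, V[i][j-1] - 1)

    # first column holding the maximum of the last row
    end = max(range(m + 1), key=lambda j: V[n][j])
    if V[n][end] < threshold:
        return None

    # follow backpointers until row 0 is reached
    i, j, ops = n, end, []
    while i > 0:
        s = 1 if (j > 0 and p[i-1] == t[j-1]) else -1
        if V[i][j] == V[i-1][j] - 1:                  # vertical: insertion
            ops.append('I')
            i -= 1
        elif j > 0 and V[i][j] == V[i-1][j-1] + s:    # diagonal: match or replacement
            ops.append('M' if s == 1 else 'R')
            i -= 1
            j -= 1
        else:                                         # horizontal: deletion
            ops.append('D')
            j -= 1
    return j, end, t[j:end], ''.join(reversed(ops))

print(approx_occurrence('TACGTCAGC', 'AACCCTATGTCATGCCTTGGA', 6))
# (5, 15, 'TATGTCATGC', 'MMRMMMMDMM')
```

The bounds are 0-based slice indices, so the occurrence is `t[5:15]`; the transcript is the same as the one printed above.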
