# Language Model

A language model (LM) assigns probabilities to sequences of words.

Tasks include:
- compute the probability of a sentence $P(W)$
- compute the probability of a word given the words before it, $P(W_n | W_1 W_2 \dots W_{n-1})$

## Chain Rule

$P(W_1 W_2 \dots W_k) = P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) \dots P(W_k | W_1 W_2 \dots W_{k-1})$
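
As a worked instance (the sentence is only an illustration, not taken from any corpus), a three-word sentence expands as:

$P(\text{the cat sat}) = P(\text{the}) \, P(\text{cat} | \text{the}) \, P(\text{sat} | \text{the cat})$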

## Markov Assumption

The probability of a word depends only on a fixed number of preceding words (the previous $n-1$ words for an $n$-gram model), not on the whole history.

$P(W_i | W_1 \dots W_{i-1}) = P(W_i | W_{i-n+1} \dots W_{i-1})$

## N-gram model

### 1. Unigram Model

Words are assumed to be independent of one another, so the probability of a sentence is the product of the probabilities of its words.

$P(W)=\prod_{i}P(W_i)$
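
A minimal sketch of how unigram probabilities could be estimated and applied, assuming relative-frequency (maximum likelihood) estimates. The function names `train_unigram` and `unigram_sentence_prob` and the toy corpus are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

def train_unigram(corpus):
    """Estimate P(w) as count(w) / total number of tokens."""
    counts = Counter(w for sentence in corpus for w in sentence)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def unigram_sentence_prob(model, sentence):
    """Multiply the per-word probabilities; unseen words contribute 0."""
    p = 1.0
    for w in sentence:
        p *= model.get(w, 0.0)
    return p

# toy corpus (hypothetical data)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_unigram(corpus)
print(unigram_sentence_prob(model, ["the", "cat"]))  # (2/6) * (1/6)
```

In this sketch an unseen word drives the whole product to 0; practical models usually apply some form of smoothing to avoid that.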

### 2. Bigram Model

Each word depends only on the word immediately before it.

$P(W_i | W_1 W_2 \dots W_{i-1}) = P(W_i | W_{i-1})$
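
A minimal sketch of a bigram model, assuming the relative-frequency estimate $P(W_i | W_{i-1}) = \mathrm{count}(W_{i-1} W_i) / \mathrm{count}(W_{i-1})$. The start marker `<s>`, the function names, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(v | u) as count(u v) / count(u as a context)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + list(sentence)   # "<s>" marks the sentence start
        contexts.update(tokens[:-1])        # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return {(u, v): c / contexts[u] for (u, v), c in bigrams.items()}

def bigram_sentence_prob(model, sentence):
    """Chain rule with each factor conditioned on the previous word only."""
    tokens = ["<s>"] + list(sentence)
    p = 1.0
    for u, v in zip(tokens, tokens[1:]):
        p *= model.get((u, v), 0.0)         # unseen bigrams contribute 0
    return p

# toy corpus (hypothetical data)
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = train_bigram(corpus)
print(bigram_sentence_prob(model, ["the", "cat", "sat"]))  # 1.0 * 0.5 * 1.0
```

Unlike the unigram case, word order now matters: "the sat" gets probability 0 under this toy model because the bigram "the sat" never occurs in the corpus.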

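### 3. Trigram, 4-gram, 5-gram, ...

The same pattern extends to larger $n$: an $n$-gram model conditions each word on the previous $n-1$ words.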

## Flaws

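- Language has long-distance dependencies (for example, a verb may have to agree with a subject many words earlier), which a fixed window of $n-1$ words cannot capture.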

## Implementation

### Computing string edit distance

1. Smith-Waterman algorithm

Scores are clamped at zero (never negative), which makes the algorithm suitable for local alignment.

https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm

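$ H_{ij} = \max \left\{
  \begin{aligned}
    & H_{i-1,j-1} + S_{i,j} \\
    & \max_{k \ge 1} \{ H_{i-k,j} - W_k \} \\
    & \max_{l \ge 1} \{ H_{i,j-l} - W_l \} \\
    & 0
  \end{aligned} \right.$

Here $S_{i,j}$ is the substitution score for aligning element $i$ of the first sequence with element $j$ of the second, and $W_k$ is the penalty for a gap of length $k$.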
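
The recurrence only fills the score matrix; the alignment itself is recovered by tracing back from the highest-scoring cell along the moves that produced each score. A minimal sketch, assuming a linear gap penalty $W_k = k \cdot W$ (in which case the two inner maxima reduce to single-step moves); the function name `local_align` and the default scores are illustrative assumptions.

```python
def local_align(a, b, match=3, mismatch=-3, gap=2):
    """Smith-Waterman local alignment with a linear gap penalty (assumed)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, i, j = 0, 0, 0
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            s = match if a[r - 1] == b[c - 1] else mismatch
            H[r][c] = max(H[r - 1][c - 1] + s,  # align a[r-1] with b[c-1]
                          H[r - 1][c] - gap,    # gap in b
                          H[r][c - 1] - gap,    # gap in a
                          0)                    # start a new local alignment
            if H[r][c] > best:
                best, i, j = H[r][c], r, c

    # Traceback: follow the predecessor that actually produced each score.
    top, bottom = [], []
    while H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

print(local_align("GGTTGACTA", "TGTTACGGC"))  # ('GTTGAC', 'GTT-AC')
```

Checking each candidate predecessor against the stored score, rather than simply stepping to the largest neighbour, keeps the traceback consistent with the recurrence.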

2. Needleman-Wunsch algorithm

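Scores are not clamped at zero, so both sequences are aligned end to end (global alignment); $W$ is a constant penalty per gap position.

$ H_{ij} = \max \left\{
  \begin{aligned}
    & H_{i-1,j-1} + S_{i,j} \\
    & H_{i-1,j} - W \\
    & H_{i,j-1} - W
  \end{aligned} \right.$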
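
The connection to edit distance can be made explicit: with a substitution score of 0 for a match and -1 for a mismatch, and $W = 1$, the global alignment score is the negated Levenshtein distance. A minimal sketch of that distance, written as a minimisation over costs instead of a maximisation over scores; the function name `edit_distance` is an illustrative assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))            # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # deleting the first i items of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            prev[j] + 1,         # deletion from a
                            curr[j - 1] + 1))    # insertion into a
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, which is enough when the distance alone is needed and no traceback is required.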

import numpy as np

def smith_waterman(list_1, list_2):
    def substitution(a,b):
        if a == b:
            return 3
        else: 
            return -3

    def penalty(k):
        return 2*k

    def score(h, i, j):
        match = h[i-1,j-1]+substitution(list_1[i-1], list_2[j-1])
        # gap lengths run over k = 1..i and l = 1..j, as in the recurrence
        delete = max(h[i-k,j]-penalty(k) for k in range(1,i+1))
        insert = max(h[i,j-l]-penalty(l) for l in range(1,j+1))
        return max(match,delete,insert,0)
    
    # init
    n = len(list_1)
    m = len(list_2)
    H = np.zeros((n+1,m+1))  # first row and column stay 0

    # fill score matrix
    for i in range(1,n+1):
        for j in range(1,m+1):
            H[i,j] = score(H, i, j)
    print(H)
    
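    # backtrace: start at the highest score; flipping H first makes argmax
    # return the last such cell in row-major order when several cells tie
    H_flip = np.flip(np.flip(H, 0), 1)
    start_i, start_j = np.unravel_index(np.argmax(H_flip, axis=None), H_flip.shape)
    # map the flipped indices back to indices into H
    start_i = n - start_i
    start_j = m - start_j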
    
    def pair(i,j):
        return {'i': i,'j': j,'pair': list_1[i-1] + ' - ' + list_2[j-1]}
    
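    path = []
    # greedy walk towards the largest neighbouring score (ties prefer up, then
    # left); this approximates the traceback and can differ from the move
    # that score() actually used to produce a cell
    while H[start_i, start_j] > 0: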
        path.append(pair(start_i,start_j))
        v1 = H[start_i-1, start_j]
        v2 = H[start_i, start_j-1]
        v3 = H[start_i-1, start_j-1]
        if v1 >= v2 and v1 >= v3:
            start_i = start_i - 1
        elif v2 >= v1 and v2 >= v3:
            start_j = start_j - 1
        elif v3 >= v1 and v3 >= v2:
            start_i = start_i - 1
            start_j = start_j - 1
        
    path.reverse()
    return path


def needleman_wunsch(list_1,list_2):

    def substitution(a,b):
        if a == b:
            return 1
        else: 
            return -1
    def penalty(k):
        return k

    # unlike smith_waterman: gaps are scored one step at a time and scores are not clamped at 0
    def score(h, i, j):
        match = h[i-1,j-1]+substitution(list_1[i-1], list_2[j-1])
        delete = h[i-1,j]-penalty(1)
        insert = h[i,j-1]-penalty(1)
        return max(match,delete,insert)

    n = len(list_1)
    m = len(list_2)
    H = np.zeros((n+1,m+1),dtype=np.int32)
    H[:, 0] = range(0,-n-1,-1)
    H[0, :] = range(0,-m-1,-1)
    
    # fill score matrix
    for i in range(1,n+1):
        for j in range(1,m+1):
            H[i,j] = score(H, i, j)
    print(H)

    start_i = n
    start_j = m
    def pair(i,j):
        return {'i': i,'j': j,'pair': list_1[i-1] + ' - ' + list_2[j-1]}
    
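    path = []
    # walk back from the bottom-right corner to the origin, following the
    # move that produced each score; only diagonal moves record a pair
    while start_i > 0 or start_j > 0:
        here = H[start_i, start_j]
        if start_i > 0 and start_j > 0 and here == H[start_i-1, start_j-1] + substitution(list_1[start_i-1], list_2[start_j-1]):
            path.append(pair(start_i,start_j))
            start_i = start_i - 1
            start_j = start_j - 1
        elif start_i > 0 and here == H[start_i-1, start_j] - penalty(1):
            start_i = start_i - 1   # gap in list_2
        else:
            start_j = start_j - 1   # gap in list_1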
        
    path.reverse()
    return path    

sw = smith_waterman(["G","G","T","T","G","A","C","T","A"], ["T","G","T","T","A","C","G","G","C"])
print(sw)
nw = needleman_wunsch(["G","A","C","T","T","A","C"], ["C","G","T","G","A","A","T","T","C","A","T"])
print(nw)

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  3.  1.  0.  0.  0.  3.  3.  1.]
 [ 0.  0.  3.  1.  0.  0.  0.  3.  6.  4.]
 [ 0.  3.  1.  6.  4.  2.  0.  1.  4.  3.]
 [ 0.  3.  1.  4.  9.  7.  5.  3.  2.  1.]
 [ 0.  1.  6.  4.  7.  6.  4.  8.  6.  4.]
 [ 0.  0.  4.  3.  5. 10.  8.  6.  5.  3.]
 [ 0.  0.  2.  1.  3.  8. 13. 11.  9.  8.]
 [ 0.  3.  1.  5.  4.  6. 11. 10.  8.  6.]
 [ 0.  1.  0.  3.  2.  7.  9.  8.  7.  5.]]
[{'i': 1, 'j': 2, 'pair': 'G - G'}, {'i': 2, 'j': 2, 'pair': 'G - G'}, {'i': 3, 'j': 3, 'pair': 'T - T'}, {'i': 4, 'j': 4, 'pair': 'T - T'}, {'i': 5, 'j': 4, 'pair': 'G - T'}, {'i': 6, 'j': 5, 'pair': 'A - A'}, {'i': 7, 'j': 6, 'pair': 'C - C'}]
[[  0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11]
 [ -1  -1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9]
 [ -2  -2  -1  -1  -2  -1  -2  -3  -4  -5  -6  -7]
 [ -3  -1  -2  -2  -2  -2  -2  -3  -4  -3  -4  -5]
 [ -4  -2  -2  -1  -2  -3  -3  -1  -2  -3  -4  -3]
 [ -5  -3  -3  -1  -2  -3  -4  -2   0  -1  -2  -3]
 [ -6  -4  -4  -2