# Language Model

A language model (LM) assigns probabilities to sequences of words.

Tasks include:
- compute the probability of a sentence $P(W)$
- compute the probability of a word given the words before it $P(W_i | W_1 W_2 \dots W_{i-1})$

https://web.stanford.edu/~jurafsky/slp3/3.pdf

## Chain Rule

$P(W_1 W_2 \dots W_k) = P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) \dots P(W_k | W_1 W_2 \dots W_{k-1})$
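As a small illustration of the chain rule, the sketch below multiplies one conditional probability per word. The `cond_prob` table and its values are invented for the example and are not estimated from any corpus.

```python
# Hypothetical conditional probabilities P(word | history); the values are made up.
cond_prob = {
    ((), "i"): 0.2,
    (("i",), "like"): 0.1,
    (("i", "like"), "tea"): 0.05,
}

def sentence_probability(words):
    """Chain rule: multiply P(w_k | w_1 ... w_{k-1}) over all positions k."""
    prob = 1.0
    for k, word in enumerate(words):
        history = tuple(words[:k])  # every word before position k
        prob *= cond_prob[(history, word)]
    return prob

p = sentence_probability(["i", "like", "tea"])
print(p)  # roughly 0.2 * 0.1 * 0.05
```

The table needs one entry per full history, which is why the assumptions in the following sections are needed in practice.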

## Markov Assumption

The probability of a word depends only on the previous $n-1$ words, not on the full history.

$P(W_i | W_1 \dots W_{i-1}) \approx P(W_i | W_{i-n+1} \dots W_{i-1})$
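To make the truncated history concrete, here is a minimal sketch of how an n-gram model would pick its context. The helper name `markov_history` is hypothetical and only serves the illustration.

```python
def markov_history(words, i, n):
    """Return the (at most) n-1 words before position i (0-based).

    Under the Markov assumption this is the only context an n-gram model keeps.
    """
    return tuple(words[max(0, i - n + 1):i])

words = ["a", "b", "c", "d", "e"]
print(markov_history(words, 4, 3))  # context of "e" in a trigram model
print(markov_history(words, 4, 2))  # context of "e" in a bigram model
```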

## N-gram model

### 1. Unigram Model

The probability of a sentence is the product of the probabilities of its words.

Words are treated as independent of one another.

$P(W)=\prod_{i}P(W_i)$
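A minimal sketch of the unigram model, assuming maximum-likelihood estimates (relative frequencies) from a toy token list that is invented for the example:

```python
from collections import Counter

# Toy corpus, assumed only for illustration.
tokens = "a paragraph is a group of sentences".split()
counts = Counter(tokens)
total = sum(counts.values())

def unigram_prob(word):
    """Maximum-likelihood estimate: relative frequency of the word."""
    return counts[word] / total

def unigram_sentence_prob(words):
    """Product of the individual word probabilities."""
    prob = 1.0
    for w in words:
        prob *= unigram_prob(w)
    return prob

p = unigram_sentence_prob(["a", "group"])
print(p)
```

Because the words are independent, the order of the words does not change the result, and any unseen word makes the whole product zero.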

### 2. Bigram Model

Each word depends only on the word immediately before it.

$P(W_i | W_1 W_2 \dots W_{i-1}) \approx P(W_i | W_{i-1})$
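The bigram probability can be estimated by counting, $c(W_{i-1}, W_i) / c(W_{i-1})$. The sketch below assumes this maximum-likelihood estimate and a toy token list; both are illustrative choices rather than something the notes prescribe.

```python
from collections import Counter

# Toy corpus, assumed only for illustration.
tokens = "a paragraph is a group of sentences".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate c(prev, word) / c(prev); 0.0 if prev was never seen."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("a", "paragraph"))
print(bigram_prob("is", "a"))
```

Unseen bigrams get probability zero here, which is the problem smoothing addresses.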

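### 3. Trigram, 4-gram, 5-gram, ...

Higher-order models condition on more history: a trigram model uses the previous two words, and in general an n-gram model uses the previous $n-1$ words.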

## Flaws

- Language has long-distance dependencies, which a model that only sees the previous $n-1$ words cannot capture.

## Computing String Edit Distance

Smith-Waterman algorithm

Scores are floored at zero, so no entry is negative, which makes the algorithm suitable for local alignment.

$ H_{ij} = \max \left \{
  \begin{aligned}
    & H_{i-1,j-1} + S_{i,j} \\
    & \max_{k \geq 1} \{ H_{i-k,j} - W_k \} \\
    & \max_{l \geq 1} \{ H_{i,j-l} - W_l \} \\
    & 0
  \end{aligned} \right.$


In [32]:
import numpy as np
np.set_printoptions(precision=2)

In [33]:
def smith_waterman(list_1, list_2):
    GAP = '_'  # shown where a character is aligned to a gap

    def substitution(a, b):
        # similarity score S(a, b): reward a match, penalize a mismatch
        return 3 if a == b else -3

    def penalty(k):
        # linear gap penalty W_k for a gap of length k
        return 2 * k

    def score(h, i, j):
        match = h[i-1, j-1] + substitution(list_1[i-1], list_2[j-1])
        # gap lengths start at 1 (k = 0 would compare the cell with itself)
        delete = max(h[i-k, j] - penalty(k) for k in range(1, i+1))
        insert = max(h[i, j-l] - penalty(l) for l in range(1, j+1))
        return max(match, delete, insert, 0)

    # init: np.zeros already sets the first row and column to 0
    n = len(list_1)
    m = len(list_2)
    H = np.zeros((n+1, m+1))

    # fill score matrix
    for i in range(1, n+1):
        for j in range(1, m+1):
            H[i, j] = score(H, i, j)
    print(H)

    # traceback: start at the highest score (the last such cell on ties)
    H_flip = np.flip(np.flip(H, 0), 1)
    flip_i, flip_j = np.unravel_index(np.argmax(H_flip, axis=None), H_flip.shape)
    i = n - int(flip_i)
    j = m - int(flip_j)

    path = []
    # follow the move that produced each score until a 0 cell is reached
    while H[i, j] > 0:
        if H[i, j] == H[i-1, j-1] + substitution(list_1[i-1], list_2[j-1]):
            path.append({'i': i, 'j': j, 'pair': list_1[i-1] + ' - ' + list_2[j-1]})
            i, j = i - 1, j - 1
        elif H[i, j] == H[i-1, j] - penalty(1):
            # with a linear penalty, a long gap is a chain of length-1 gaps
            path.append({'i': i, 'j': j, 'pair': list_1[i-1] + ' - ' + GAP})
            i = i - 1
        else:
            path.append({'i': i, 'j': j, 'pair': GAP + ' - ' + list_2[j-1]})
            j = j - 1

    path.reverse()
    return path

In [34]:
sw = smith_waterman(["G","G","T","T","G","A","C","T","A"], ["T","G","T","T","A","C","G","G","C"])
print(sw,'\n')

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  3.  1.  0.  0.  0.  3.  3.  1.]
 [ 0.  0.  3.  1.  0.  0.  0.  3.  6.  4.]
 [ 0.  3.  1.  6.  4.  2.  0.  1.  4.  3.]
 [ 0.  3.  1.  4.  9.  7.  5.  3.  2.  1.]
 [ 0.  1.  6.  4.  7.  6.  4.  8.  6.  4.]
 [ 0.  0.  4.  3.  5. 10.  8.  6.  5.  3.]
 [ 0.  0.  2.  1.  3.  8. 13. 11.  9.  8.]
 [ 0.  3.  1.  5.  4.  6. 11. 10.  8.  6.]
 [ 0.  1.  0.  3.  2.  7.  9.  8.  7.  5.]]
[{'i': 2, 'j': 2, 'pair': 'G - G'}, {'i': 3, 'j': 3, 'pair': 'T - T'}, {'i': 4, 'j': 4, 'pair': 'T - T'}, {'i': 5, 'j': 4, 'pair': 'G - _'}, {'i': 6, 'j': 5, 'pair': 'A - A'}, {'i': 7, 'j': 6, 'pair': 'C - C'}] 



Needleman-Wunsch algorithm

$ H_{ij} = \max \left \{
  \begin{aligned}
    & H_{i-1,j-1} + S_{i,j} \\
    & H_{i-1,j} - W \\
    & H_{i,j-1} - W
  \end{aligned} \right.$


In [37]:
def needleman_wunsch(list_1, list_2):
    GAP = '_'  # shown where a character is aligned to a gap

    def substitution(a, b):
        # similarity score S(a, b)
        return 1 if a == b else -1

    def penalty(k):
        # linear gap penalty
        return k

    # unlike smith_waterman: gaps are taken one step at a time
    # and scores are not floored at 0
    def score(h, i, j):
        match = h[i-1, j-1] + substitution(list_1[i-1], list_2[j-1])
        delete = h[i-1, j] - penalty(1)  # list_1[i-1] aligned to a gap
        insert = h[i, j-1] - penalty(1)  # list_2[j-1] aligned to a gap
        return max(match, delete, insert)

    n = len(list_1)
    m = len(list_2)
    H = np.zeros((n+1, m+1), dtype=np.int32)
    # aligning a prefix against nothing costs one gap per character
    H[:, 0] = range(0, -n-1, -1)
    H[0, :] = range(0, -m-1, -1)

    for i in range(1, n+1):
        for j in range(1, m+1):
            H[i, j] = score(H, i, j)
    print(H)

    # traceback from the bottom-right corner back to (0, 0)
    # ties are broken diagonal first, then up, then left,
    # so other equally good alignments may exist
    i, j = n, m
    path = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and H[i, j] == H[i-1, j-1] + substitution(list_1[i-1], list_2[j-1]):
            path.append({'i': i, 'j': j, 'pair': list_1[i-1] + ' - ' + list_2[j-1]})
            i, j = i - 1, j - 1
        elif i > 0 and H[i, j] == H[i-1, j] - penalty(1):
            path.append({'i': i, 'j': j, 'pair': list_1[i-1] + ' - ' + GAP})
            i = i - 1
        else:
            path.append({'i': i, 'j': j, 'pair': GAP + ' - ' + list_2[j-1]})
            j = j - 1

    path.reverse()
    return path

In [39]:
nw = needleman_wunsch(["G","A","C","T","T","A","C"], ["C","G","T","G","A","A","T","T","C","A","T"])
print(nw,'\n')

[[  0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11]
 [ -1  -1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9]
 [ -2  -2  -1  -1  -2  -1  -2  -3  -4  -5  -6  -7]
 [ -3  -1  -2  -2  -2  -2  -2  -3  -4  -3  -4  -5]
 [ -4  -2  -2  -1  -2  -3  -3  -1  -2  -3  -4  -3]
 [ -5  -3  -3  -1  -2  -3  -4  -2   0  -1  -2  -3]
 [ -6  -4  -4  -2  -2  -1  -2  -3  -1  -1   0  -1]
 [ -7  -5  -5  -3  -3  -2  -2  -3  -2   0  -1  -1]]
[{'i': 0, 'j': 1, 'pair': '_ - C'}, {'i': 0, 'j': 2, 'pair': '_ - G'}, {'i': 0, 'j': 3, 'pair': '_ - T'}, {'i': 1, 'j': 4, 'pair': 'G - G'}, {'i': 2, 'j': 5, 'pair': 'A - A'}, {'i': 3, 'j': 6, 'pair': 'C - A'}, {'i': 4, 'j': 7, 'pair': 'T - T'}, {'i': 5, 'j': 8, 'pair': 'T - T'}, {'i': 5, 'j': 9, 'pair': '_ - C'}, {'i': 6, 'j': 10, 'pair': 'A - A'}, {'i': 7, 'j': 11, 'pair': 'C - T'}] 



N-gram algorithm

https://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf

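Three variants, which differ in how two n-grams are compared:

- binary: 0 if the two n-grams are identical, otherwise 1
- comprehensive: the unigram edit distance between the two n-grams, divided by n
- positional: the fraction of positions at which the two n-grams differ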
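To see how the three comparisons behave on a single pair of n-grams, here is a self-contained sketch. The function names (`edit_distance`, `binary_cost`, `comprehensive_cost`, `positional_cost`) are hypothetical stand-ins for this illustration and are separate from the notebook's own functions.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

def binary_cost(g1, g2):
    return 0 if g1 == g2 else 1

def comprehensive_cost(g1, g2):
    return edit_distance(g1, g2) / len(g1)

def positional_cost(g1, g2):
    # assumes both n-grams have the same length n
    return sum(x != y for x, y in zip(g1, g2)) / len(g1)

for g1, g2 in [("abc", "abd"), ("abc", "xab")]:
    print(g1, g2, binary_cost(g1, g2), comprehensive_cost(g1, g2), positional_cost(g1, g2))
```

For a shifted pair such as `"abc"` and `"xab"`, the comprehensive cost is lower than the positional cost, so the two variants are not equivalent in general even though they agree whenever the n-grams differ only by substitutions.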

In [40]:
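# unigram edit distance
# x, y are single characters (or None for padding)
# X, Y are lists
def d_single(x, y):
    # cost 0 for identical symbols, 1 otherwise;
    # a padding symbol (None) only matches another padding symbol
    if x is not None and y is None:
        return 1
    if x is None and y is not None:
        return 1
    if x == y:
        return 0
    else:
        return 1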
def d1(X,Y):    
    k = len(X)
    l = len(Y)
    gamma = np.zeros((k+1,l+1))
    gamma[:,0] = [i for i in range(k+1)]
    gamma[0,:] = [j for j in range(l+1)]
    for i in range(1,k+1):
        for j in range(1,l+1):
            gamma[i,j] = min(
                gamma[i-1,j]+1,
                gamma[i,j-1]+1,
                gamma[i-1,j-1] + d_single(X[i-1],Y[j-1])
            )
    return gamma[k,l]

# n-gram edit distance

# 3 variants
def positional(list_x, list_y):
    n = len(list_x)
    s = [d_single(list_x[i],list_y[i]) for i in range(n)]
    return sum(s) / n

def comprehensive(X, Y):
    n = len(X)
    return d1(X,Y) / n

def binary(X,Y):
    for i in range(len(X)):
        if X[i] != Y[i]:
            return 1
    return 0
 
def dn(X, Y, n, variant):
    k = len(X)
    l = len(Y)
    X = [None]*(n-1) + X
    Y = [None]*(n-1) + Y
    D = np.empty((k+1,l+1))
    D[:,0] = [i for i in range(k+1)]
    D[0,:] = [j for j in range(l+1)]
    for i in range(1,k+1):
        for j in range(1,l+1):
            D[i,j] = min(
                D[i-1,j]+1,
                D[i,j-1]+1,
                D[i-1,j-1] + variant(X[i-1:i+n-1],Y[j-1:j+n-1])
            )            
    print(D)
    return D[k,l] / max(k,l)

In [41]:
l1 = [l for l in "qbcdefarjn"]
l2 = [l for l in "abcdeaweqg"]
n = 4
print(l1,'\n')
print(l2,'\n')

bdn = dn(l1,l2,n,binary)
print('binary \n' ,bdn)

cdn = dn(l1,l2,n,comprehensive)
print('comprehensive \n', cdn)

pdn = dn(l1,l2,n,positional)
print('positional \n', pdn)

# HAS BUG
# cdn is always equal to pdn

['q', 'b', 'c', 'd', 'e', 'f', 'a', 'r', 'j', 'n'] 

['a', 'b', 'c', 'd', 'e', 'a', 'w', 'e', 'q', 'g'] 

[[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 1.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 2.  2.  2.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 3.  3.  3.  3.  4.  5.  6.  7.  8.  9. 10.]
 [ 4.  4.  4.  4.  4.  5.  6.  7.  8.  9. 10.]
 [ 5.  5.  5.  5.  5.  4.  5.  6.  7.  8.  9.]
 [ 6.  6.  6.  6.  6.  5.  5.  6.  7.  8.  9.]
 [ 7.  7.  7.  7.  7.  6.  6.  6.  7.  8.  9.]
 [ 8.  8.  8.  8.  8.  7.  7.  7.  7.  8.  9.]
 [ 9.  9.  9.  9.  9.  8.  8.  8.  8.  8.  9.]
 [10. 10. 10. 10. 10.  9.  9.  9.  9.  9.  9.]]
binary 
 0.9
[[ 0.    1.    2.    3.    4.    5.    6.    7.    8.    9.   10.  ]
 [ 1.    0.25  1.25  2.25  3.25  4.25  5.25  6.25  7.25  8.25  9.25]
 [ 2.    1.25  0.5   1.5   2.5   3.5   4.5   5.5   6.5   7.5   8.5 ]
 [ 3.    2.25  1.5   0.75  1.75  2.75  3.75  4.75  5.75  6.75  7.75]
 [ 4.    3.25  2.5   1.75  1.    2.    3.    4.    5.    6.    7.  ]
 [ 5.    4.25  3.5

## Smoothing

- a worked numerical example of Kneser-Ney (KN) smoothing: https://medium.com/@seccon/a-simple-numerical-example-for-kneser-ney-smoothing-nlp-4600addf38b8

- a KN library: https://github.com/smilli/kneser-ney

- a good paper covering all the smoothing methods: http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf

- a video on KN smoothing: https://www.youtube.com/watch?v=eNLUo3AIvcQ

- textbook: https://web.stanford.edu/~jurafsky/slp3/3.pdf



In [45]:
import re

SPLIT_PATTERN = r"([!?.,;])"
WORD_PATTERN = r"[a-zA-Z]+"

CORPUS = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc. In reality, though, the unity and coherence of ideas among sentences is what constitutes a paragraph. A paragraph is defined as “a group of sentences or a single sentence that forms a unit”. Length and appearance do not determine whether a section in a paper is a paragraph. For instance, in some styles of writing, particularly journalistic styles, a paragraph can be just one sentence long. Ultimately, a paragraph is a sentence or group of sentences that support one main idea. In this handout, we will refer to this as the “controlling idea,” because it controls what happens in the rest of the paragraph."

lines = [] 


# tokenize
for t in re.split(SPLIT_PATTERN, CORPUS):
    l = re.findall(WORD_PATTERN,t.lower())
    if len(l)>0:
        lines.append(l)

print('lines \n', lines, '\n')

lines 
 [['paragraphs', 'are', 'the', 'building', 'blocks', 'of', 'papers'], ['many', 'students', 'define', 'paragraphs', 'in', 'terms', 'of', 'length', 'a', 'paragraph', 'is', 'a', 'group', 'of', 'at', 'least', 'five', 'sentences'], ['a', 'paragraph', 'is', 'half', 'a', 'page', 'long'], ['etc'], ['in', 'reality'], ['though'], ['the', 'unity', 'and', 'coherence', 'of', 'ideas', 'among', 'sentences', 'is', 'what', 'constitutes', 'a', 'paragraph'], ['a', 'paragraph', 'is', 'defined', 'as', 'a', 'group', 'of', 'sentences', 'or', 'a', 'single', 'sentence', 'that', 'forms', 'a', 'unit'], ['length', 'and', 'appearance', 'do', 'not', 'determine', 'whether', 'a', 'section', 'in', 'a', 'paper', 'is', 'a', 'paragraph'], ['for', 'instance'], ['in', 'some', 'styles', 'of', 'writing'], ['particularly', 'journalistic', 'styles'], ['a', 'paragraph', 'can', 'be', 'just', 'one', 'sentence', 'long'], ['ultimately'], ['a', 'paragraph', 'is', 'a', 'sentence', 'or', 'group', 'of', 'sentences', 'that', 'sup

In [46]:
# build gram table
table1Gram = {}
table2Gram = {}
table3Gram = {}

def increment(dictionary, word):
    if word in dictionary:
        dictionary[word] = dictionary[word] + 1
    else:
        dictionary[word] = 1
        
# get ngram table
for line in lines:
    m = len(line)
    for index, word in enumerate(line):
        increment(table1Gram,word)
        if index < (m - 1):
            increment(table2Gram, word + " " + line[index+1])
        if index < (m - 2):
            increment(table3Gram, word + " " + line[index+1] + " " + line[index+2])
            
print('table1Gram \n', table1Gram , '\n')
print('table2Gram \n', table2Gram , '\n')
print('table3Gram \n', table3Gram , '\n')

table1Gram 
 {'paragraphs': 2, 'are': 1, 'the': 5, 'building': 1, 'blocks': 1, 'of': 8, 'papers': 1, 'many': 1, 'students': 1, 'define': 1, 'in': 6, 'terms': 1, 'length': 2, 'a': 15, 'paragraph': 8, 'is': 6, 'group': 3, 'at': 1, 'least': 1, 'five': 1, 'sentences': 4, 'half': 1, 'page': 1, 'long': 2, 'etc': 1, 'reality': 1, 'though': 1, 'unity': 1, 'and': 2, 'coherence': 1, 'ideas': 1, 'among': 1, 'what': 2, 'constitutes': 1, 'defined': 1, 'as': 2, 'or': 2, 'single': 1, 'sentence': 3, 'that': 2, 'forms': 1, 'unit': 1, 'appearance': 1, 'do': 1, 'not': 1, 'determine': 1, 'whether': 1, 'section': 1, 'paper': 1, 'for': 1, 'instance': 1, 'some': 1, 'styles': 2, 'writing': 1, 'particularly': 1, 'journalistic': 1, 'can': 1, 'be': 1, 'just': 1, 'one': 2, 'ultimately': 1, 'support': 1, 'main': 1, 'idea': 2, 'this': 2, 'handout': 1, 'we': 1, 'will': 1, 'refer': 1, 'to': 1, 'controlling': 1, 'because': 1, 'it': 1, 'controls': 1, 'happens': 1, 'rest': 1} 

table2Gram 
 {'paragraphs are': 1, 'are th

### Absolute Discounting Interpolation

$$
\begin{split}
P_{ad}(w_i | w_{i-1}) & = \frac{c(w_{i-1},w_i)-d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i) \\ 
& = \frac{discounted \ bigram \ count}{count \ of \ w_{i-1}} + interpolation \ weight \times unigram \ probability
\end{split}
$$

where the discount $d$ is chosen empirically; 0.75 is a common choice.
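A minimal sketch of absolute discounting interpolation on a toy corpus. Several choices here are assumptions made for the illustration: the token list is invented, the discounted count is clipped at zero with `max`, the weight $\lambda(w_{i-1})$ is taken to be $d$ times the number of distinct followers divided by the context count, and the context count is the number of bigrams that start with the word.

```python
from collections import Counter

# Toy corpus, assumed only for illustration.
tokens = "a paragraph is a group of sentences or a single sentence".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word starts a bigram
vocab = sorted(set(tokens))

def p_abs_discount(prev, word, d=0.75):
    """Discounted bigram estimate interpolated with the unigram probability."""
    c_prev = context_counts[prev]
    followers = sum(1 for (p, _) in bigram_counts if p == prev)
    lam = d * followers / c_prev  # probability mass freed by discounting
    discounted = max(bigram_counts[(prev, word)] - d, 0) / c_prev
    return discounted + lam * unigram_counts[word] / len(tokens)

print(p_abs_discount("a", "group"))
print(sum(p_abs_discount("a", w) for w in vocab))  # should be close to 1
```

With this choice of weight, the mass removed from the seen bigrams is exactly what gets redistributed through the unigram term, so the probabilities for a given context sum to one.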

### Kneser-Ney Smoothing

Change: replace **P(w)** ("how likely is w") with **P_continuation(w)** ("how likely is w to appear as a novel continuation").


#### bigram formulation

$$
P_{KN}(w_i | w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})}  + \lambda(w_{i-1}) P_{continuation}(w_i) 
$$

where

$
\begin{split}
\lambda(w_{i-1}) & = \frac{d}{c(w_{i-1})}|\{w: c(w_{i-1},w) > 0 \}| \\
& = the \ normalized \ discount \times the \ number \ of \ word \ types \ that \ follow \ w_{i-1} \\
& = low \ order \ weight \ or \ backoff \ weight
\end{split}
$

$
\begin{split}
P_{continuation}(w_i) & = \frac{|\{w: c(w,w_{i}) > 0 \}|}{|\{(w_{j-1},w_j): c(w_{j-1},w_j) > 0 \}|} \\
& = \frac{number \ of \ word \ types \ that \ precede \ w_{i}}{total \ number \ of \ bigram \ types} \\
& = low \ order \ probability
\end{split}
$


In [60]:
def p_kn_2(pre_word, current_word, d = 0.75):
    # bigram types that start with pre_word / end with current_word
    w_i_1_set = [key for key in table2Gram if key.split(" ")[0] == pre_word]
    w_i_set = [key for key in table2Gram if key.split(" ")[1] == current_word]

    # lambda(w_{i-1}): normalized discount times the number of distinct followers
    low_order_weight = d / table1Gram[pre_word] * len(w_i_1_set)
    # P_continuation(w_i): share of all bigram types that end with current_word
    p_continuation = len(w_i_set) / len(table2Gram)

    w_i_1_w_i = pre_word + " " + current_word
    c = table2Gram.get(w_i_1_w_i, 0)  # 0 for an unseen bigram
    return max(c-d, 0) / table1Gram[pre_word] + low_order_weight * p_continuation

print(p_kn_2('paragraphs','are'))
print(p_kn_2('paragraphs','of'))

0.13221153846153846
0.04326923076923077
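A useful sanity check for the bigram formulation is that the probabilities for a fixed context sum to one. The sketch below is self-contained and uses an invented toy corpus. It differs from `p_kn_2` in one deliberate way, which is an assumption of this example: the denominator is the number of bigrams that start with the context word, not its unigram count, because a word at the end of a line starts no bigram.

```python
from collections import Counter

# Toy corpus, assumed only for illustration.
tokens = "a paragraph is a group of sentences or a single sentence".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word starts a bigram
vocab = sorted(set(tokens))

def p_kn_bigram(prev, word, d=0.75):
    """Interpolated Kneser-Ney estimate for a bigram model."""
    c_prev = context_counts[prev]
    followers = sum(1 for (p, _) in bigram_counts if p == prev)
    preceders = sum(1 for (_, w) in bigram_counts if w == word)
    lam = d * followers / c_prev
    p_continuation = preceders / len(bigram_counts)
    return max(bigram_counts[(prev, word)] - d, 0) / c_prev + lam * p_continuation

print(p_kn_bigram("is", "a"))
print(sum(p_kn_bigram("a", w) for w in vocab))  # should be close to 1
```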


#### general recursive formulation

$$
P_{KN}(w_i | w_{i-n+1}^{i-1}) = \frac{\max(C_{KN}(w_{i-n+1}^{i}) - d, 0) }{C_{KN}(w_{i-n+1}^{i-1})}  + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i | w_{i-n+2}^{i-1}) 
$$

where

$ 
C_{KN}(\bullet) = \left \{
\begin{aligned}
    & count(\bullet) \ for \ the \ highest \ order \\
    & continuation \ count(\bullet) \ for \ lower \ orders
\end{aligned}
\right.
$

$
w_{i-n+1}^{i-1} \ denotes \ word \ w_{i-n+1}w_{i-n+2}\dots w_{i-1}
$
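The two kinds of count can be sketched as follows. This assumes the usual reading of the continuation count as the number of distinct words that precede an n-gram in the corpus; the helper names `ngrams` and `kn_counts` are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list, as tuples."""
    return list(zip(*(tokens[k:] for k in range(n))))

def kn_counts(tokens, order):
    """Return {n: Counter} holding C_KN for n = 1..order.

    The highest order keeps raw counts; each lower order counts how many
    distinct words precede the n-gram (its continuation count).
    """
    tables = {order: Counter(ngrams(tokens, order))}
    higher_types = set(ngrams(tokens, order))
    for n in range(order - 1, 0, -1):
        # each distinct (n+1)-gram adds one left context to its suffix
        tables[n] = Counter(gram[1:] for gram in higher_types)
        higher_types = set(ngrams(tokens, n))
    return tables

tables = kn_counts("a b a c a b".split(), order=2)
print(tables[2][("a", "b")])  # raw bigram count
print(tables[1][("a",)])      # number of distinct words seen before "a"
```

In this toy sequence `a` occurs three times but is preceded by only two distinct words, so its continuation count is lower than its raw count.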

#### modified Kneser-Ney smoothing

$$
P_{KN}(w_i | w_{i-n+1}^{i-1}) = 
\frac
    {C(w_{i-n+1}^{i}) - D(C(w_{i-n+1}^{i}))}
    {\sum_{w_i} C(w_{i-n+1}^i)}  + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i | w_{i-n+2}^{i-1}) 
$$

where
$$
D(c) = \left \{
\begin{aligned}
    & 0 \ & if \ c=0 \\
    & D_1 \ & if \ c=1 \\
    & D_2 \ & if \ c=2 \\
    & D_{3+} \ & if \ c\geq3 \\
\end{aligned}
\right.
$$

$$
\lambda(w_{i-n+1}^{i-1}) = 
\frac
    {D_1N_1(w_{i-n+1}^{i-1}\cdot)+D_2N_2(w_{i-n+1}^{i-1}\cdot)+D_{3+}N_{3+}(w_{i-n+1}^{i-1}\cdot)}
    {\sum_{w_i}C(w_{i-n+1}^i)}
$$

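where $N_1$, $N_2$ and $N_{3+}$ count the word types that follow the context exactly once, exactly twice, and three or more times, for example

$$
N_1(w_{i-n+1}^{i-1}\cdot) = |\{w_i : C(w_{i-n+1}^i) = 1\}|
$$

and the continuation count used for the lower orders is

$$
N_{1+}(\bullet w_{i-n+2}^i) = |\{w_{i-n+1} : C(w_{i-n+1}^i) > 0\}|
$$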
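The three discounts are usually estimated from counts of counts. The sketch below assumes the closed-form estimates given in the Chen and Goodman paper linked in the Smoothing list, with $Y = n_1 / (n_1 + 2 n_2)$, where $n_c$ is the number of n-gram types seen exactly $c$ times. The function name and the toy counts are made up for the illustration.

```python
from collections import Counter

def modified_kn_discounts(ngram_counts):
    """Estimate (D_1, D_2, D_3+) from a mapping n-gram -> count."""
    n = Counter(ngram_counts.values())  # n[c] = number of types seen exactly c times
    if min(n[1], n[2], n[3]) == 0:
        raise ValueError("need n-grams with counts 1, 2 and 3 to estimate the discounts")
    y = n[1] / (n[1] + 2 * n[2])
    d1 = 1 - 2 * y * n[2] / n[1]
    d2 = 2 - 3 * y * n[3] / n[2]
    d3_plus = 3 - 4 * y * n[4] / n[3]
    return d1, d2, d3_plus

# Toy counts: four types seen once, two seen twice, one seen 3 times, one seen 4 times.
toy_counts = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 2, "f": 2, "g": 3, "h": 4}
d1, d2, d3_plus = modified_kn_discounts(toy_counts)
print(d1, d2, d3_plus)
```

On small corpora the estimates can be unreliable or undefined, so a fixed discount such as 0.75 is a reasonable fallback.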
