# Language Model

## Resources

textbook https://web.stanford.edu/~jurafsky/slp3/3.pdf

Dan Jurafsky series of videos https://www.youtube.com/watch?v=Saq1QagC8KY&list=PLQiyVNMpDLKnZYBTUOlSI9mi9wAErFtFm&index=13&t=15s

## Basic

A language model (LM) is a model that assigns probabilities to sequences of words.

Tasks include:
- compute the probability of a sentence $P(W)$
- compute the probability of a word given the words that precede it $P(W_i|W_1W_2 \dots W_{i-1})$


## Chain Rule

$P(W_1 W_2 \dots W_k) = P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) \dots P(W_k | W_1 W_2 \dots W_{k-1})$
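
The decomposition can be checked numerically. The sketch below uses a hypothetical four-sentence corpus and estimates each conditional from sentence-prefix counts; the names `prefix_counts`, `conditional` and `chain_rule_probability` are illustrative and not part of the original notes.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is a tuple of words.
corpus = [
    ("i", "like", "tea"),
    ("i", "like", "coffee"),
    ("i", "drink", "tea"),
    ("you", "like", "tea"),
]

# Count how often each sentence prefix occurs, including the empty prefix ().
prefix_counts = Counter()
for sentence in corpus:
    for k in range(len(sentence) + 1):
        prefix_counts[sentence[:k]] += 1


def conditional(word, history):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    if prefix_counts[history] == 0:
        return 0.0  # unseen history: no estimate available
    return prefix_counts[history + (word,)] / prefix_counts[history]


def chain_rule_probability(sentence):
    """Multiply P(w_1) P(w_2 | w_1) ... P(w_k | w_1 ... w_{k-1})."""
    prob = 1.0
    for k, word in enumerate(sentence):
        prob *= conditional(word, sentence[:k])
    return prob


p = chain_rule_probability(("i", "like", "tea"))
print(p)
```

Here the product is 3/4 * 2/3 * 1/2, which matches the relative frequency of the full sentence in the toy corpus (1 out of 4).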

## Markov Assumption

The probability of a word depends only on the previous $n-1$ words, not on the entire history.

$P(W_i|W_1 \dots W_{i-1}) \approx P(W_i | W_{i-n+1} \dots W_{i-1})$
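
As a small illustration, truncating the history to the last $n-1$ words could look like the sketch below. The helper `markov_context` and the example sentence are hypothetical, and positions are 0-based.

```python
def markov_context(words, i, n):
    """Return the n-1 words before position i (0-based); fewer near the start."""
    start = max(0, i - (n - 1))
    return words[start:i]


words = "the quick brown fox jumps".split()
i = 4  # position of "jumps"

full_history = words[:i]                             # what the chain rule conditions on
trigram_context = markov_context(words, i, n=3)      # last two words
bigram_context = markov_context(words, i, n=2)       # last word only
unigram_context = markov_context(words, i, n=1)      # no context at all

print(full_history, trigram_context, bigram_context, unigram_context)
```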

## N-gram model

### 1. Unigram Model

The probability of a sentence is the product of probabilities of all words.

Words are independent. 

$P(w_1 w_2 \cdots w_n)=\prod_{i}P(w_i)$
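
A minimal sketch of the unigram model, assuming relative-frequency estimates on a hypothetical six-word corpus. Because the words are treated as independent, word order does not change the score.

```python
from collections import Counter

# Hypothetical toy corpus.
tokens = "the cat sat on the mat".split()
unigram_counts = Counter(tokens)
total = sum(unigram_counts.values())


def p_unigram(word):
    """Relative frequency of a single word."""
    return unigram_counts[word] / total


def p_sentence(words):
    """Product of the individual word probabilities."""
    prob = 1.0
    for w in words:
        prob *= p_unigram(w)
    return prob


print(p_sentence(["the", "cat"]))
print(p_sentence(["cat", "the"]))  # same value: order is ignored
```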

### 2. Bigram Model

Each word depends only on the word immediately before it.

$P(w_i | w_1 w_2 \dots w_{i-1}) \approx P( w_i |  w_{i-1})$

Higher orders follow the same pattern: trigram, 4-gram, 5-gram, and so on.

### Flaws

- Language has long-distance dependencies.

### Note

- Sum log probabilities instead of multiplying probabilities to avoid numerical underflow.
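
The underflow problem is easy to reproduce. The sketch below assumes a hypothetical sentence of 100 words, each with probability 1e-5, and compares the direct product with the sum of logs.

```python
import math

# Hypothetical per-word probabilities for a 100-word sentence.
probs = [1e-5] * 100

# Direct product: the true value (1e-500) is below the smallest float64.
product = 1.0
for p in probs:
    product *= p

# Sum of logs: stays in a comfortable numeric range.
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The product collapses to zero, while the log score remains a usable finite number that can still be compared across sentences.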

## Maximum Likelihood Estimate

- bigram form
$$
P_{MLE}(w_i | w_{i-1}) = 
\frac
    {C(w_{i-1},w_i)}
    {C(w_{i-1})}
$$

- general form

$$
P_{MLE}(w_i | w_{i-n+1}^{i-1}) = 
\frac
    {C(w_{i-n+1}^i)}
    {C(w_{i-n+1}^{i-1})}
=
\frac
    {C(w_{i-n+1}^i)}
    {\sum_{w_i} C(w_{i-n+1}^i)}
$$

where $w_{i-n+1}^{i-1}$ denotes the word sequence $w_{i-n+1}w_{i-n+2}\dots w_{i-1}$.

MLE fails when an n-gram is not in the training data: its count is zero, so it gets zero probability.

The solution is **Smoothing**.
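
The failure case can be seen in a small bigram MLE sketch. The corpus and the helper `p_mle` are hypothetical; the context count is taken as the sum of bigram counts that start with the context word, as in the second form of the general formula.

```python
from collections import Counter

# Hypothetical toy corpus.
sentences = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["you", "drink", "tea"],
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in sentences:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def p_mle(cur, prev):
    """P_MLE(cur | prev) = C(prev, cur) / C(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # context never seen: treated as zero in this sketch
    return bigram_counts[(prev, cur)] / context_counts[prev]


print(p_mle("tea", "like"))  # seen bigram
print(p_mle("tea", "i"))     # unseen bigram gets zero probability
```

Any sentence that contains the unseen bigram `("i", "tea")` is assigned probability zero, however plausible it is.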

## Smoothing

- an example of kn smoothing https://medium.com/@seccon/a-simple-numerical-example-for-kneser-ney-smoothing-nlp-4600addf38b8

- a kn implementation with bug https://github.com/smilli/kneser-ney and the blog http://smithamilli.com/blog/kneser-ney/

- paper gives overview of all smoothing algos http://u.cs.biu.ac.il/~yogo/courses/mt2013/papers/chen-goodman-99.pdf

- wiki https://en.wikipedia.org/wiki/Kneser%E2%80%93Ney_smoothing

### Absolute Discounting Interpolation

$$
\begin{split}
P_{ad}(w_i | w_{i-1}) & = \frac{c(w_{i-1},w_i)-d}{c(w_{i-1})} + \lambda(w_{i-1})P(w_i) \\ 
& = \frac{discounted \ bigram \ count}{count \ of \ w_{i-1}} + interpolation \ weight * unigram \ probability
\end{split}
$$

where $d$ is a fixed discount; 0.75 is a common empirical choice.
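
A minimal sketch of absolute discounting with interpolation on a hypothetical corpus. Two assumptions are made that the formula above leaves open: the discounted count is clipped at zero with `max`, and the interpolation weight is `d * (number of distinct followers) / c(prev)`, the same weight used in the Kneser-Ney section, so that the probabilities for a seen context sum to one.

```python
from collections import Counter

# Hypothetical toy corpus.
sentences = [
    ["i", "like", "tea"],
    ["i", "like", "coffee"],
    ["you", "drink", "tea"],
]

unigram_counts = Counter(w for s in sentences for w in s)
total_words = sum(unigram_counts.values())

bigram_counts = Counter()
context_counts = Counter()
for s in sentences:
    for prev, cur in zip(s, s[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

# Number of distinct word types that follow each context word.
follower_types = Counter(prev for (prev, _cur) in bigram_counts)


def p_ad(cur, prev, d=0.75):
    """Absolute discounting; `prev` must have been seen as a context."""
    discounted = max(bigram_counts[(prev, cur)] - d, 0) / context_counts[prev]
    lam = d * follower_types[prev] / context_counts[prev]  # assumed weight
    return discounted + lam * unigram_counts[cur] / total_words


vocab = list(unigram_counts)
total = sum(p_ad(w, "like") for w in vocab)
print(p_ad("tea", "like"), p_ad("tea", "i"), total)
```

The unseen bigram `("i", "tea")` now receives a small non-zero probability from the unigram term.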

### Kneser-Ney Smoothing

Change: **P(w)** "how likely is w" => **P_continuation(w)** "how likely is w to appear as a novel continuation"


#### 1. bigram formulation

$$
P_{KN}(w_i | w_{i-1}) = \frac{max(c(w_{i-1}, w_i) -d, 0)}{c(w_{i-1})}  + \lambda(w_{i-1}) P_{continuation}(w_i) 
$$

where

$$
\begin{split}
\lambda(w_{i-1}) & = \frac{d}{c(w_{i-1})}|\{w: c(w_{i-1},w) > 0 \}| \\
& = the \ normalized \ discount * the \ number \ of \ word \ types \ that \ follow \ w_{i-1} \\
& = low \ order \ weight \ or \ backoff \ weight
\end{split}
$$

$$
\begin{split}
P_{continuation}(w_i) & = \frac{|\{w: c(w,w_{i}) > 0 \}|}{|\{(w_{j-1},w_j): c(w_{j-1},w_j) > 0 \}|} \\
& = \frac{number \ of \ word \ types \ that \ precede \ w_{i}}{total \ number \ of \ bigram \ types} \\
& = low \ order \ probability
\end{split}
$$
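
To see why the continuation probability differs from the plain unigram probability, consider the sketch below. The corpus is hypothetical: "francisco" is frequent but only ever follows "san", while "glasses" is rarer but follows two different words.

```python
from collections import Counter

# Hypothetical toy corpus.
sentences = [
    ["san", "francisco"],
    ["san", "francisco"],
    ["san", "francisco"],
    ["reading", "glasses"],
    ["my", "glasses"],
]

unigram_counts = Counter(w for s in sentences for w in s)
total_words = sum(unigram_counts.values())

# Set of distinct bigram types (each counted once, whatever its frequency).
bigram_types = {(prev, cur) for s in sentences for prev, cur in zip(s, s[1:])}


def p_unigram(word):
    """How likely is the word overall."""
    return unigram_counts[word] / total_words


def p_continuation(word):
    """Distinct left contexts of the word / total number of bigram types."""
    preceding = {prev for (prev, cur) in bigram_types if cur == word}
    return len(preceding) / len(bigram_types)


for w in ("francisco", "glasses"):
    print(w, p_unigram(w), p_continuation(w))
```

The unigram estimate prefers "francisco", whereas the continuation estimate prefers "glasses" because it appears after more distinct words.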



In [283]:
# a simple bigram model

import re

SPLIT_PATTERN = r"([!?.,;])"
WORD_PATTERN = r"[a-zA-Z]+"

CORPUS = "Paragraphs are the building blocks of papers. Many students define paragraphs in terms of length: a paragraph is a group of at least five sentences, a paragraph is half a page long, etc. In reality, though, the unity and coherence of ideas among sentences is what constitutes a paragraph. A paragraph is defined as “a group of sentences or a single sentence that forms a unit”. Length and appearance do not determine whether a section in a paper is a paragraph. For instance, in some styles of writing, particularly journalistic styles, a paragraph can be just one sentence long. Ultimately, a paragraph is a sentence or group of sentences that support one main idea. In this handout, we will refer to this as the “controlling idea,” because it controls what happens in the rest of the paragraph."

lines = [] 

# tokenize
for t in re.split(SPLIT_PATTERN, CORPUS):
    l = re.findall(WORD_PATTERN,t.lower())
    if len(l)>0:
        lines.append(l)

# build gram table
table1Gram = {}
table2Gram = {}
table3Gram = {}

def increment(dictionary, word):
    if word in dictionary:
        dictionary[word] = dictionary[word] + 1
    else:
        dictionary[word] = 1
        
# get ngram table
for line in lines:
    m = len(line)
    for index, word in enumerate(line):
        increment(table1Gram,word)
        if index < (m - 1):
            increment(table2Gram, word + " " + line[index+1])
        if index < (m - 2):
            increment(table3Gram, word + " " + line[index+1] + " " + line[index+2])
            
print('table1Gram \n', table1Gram , '\n')
print('table2Gram \n', table2Gram , '\n')
print('table3Gram \n', table3Gram , '\n')

# bigram Kneser-Ney probability
def p_kn_2(pre_word, current_word, d = 0.75):

    # bigram types that start with pre_word / end with current_word
    w_i_1_set = [key for key in table2Gram if key.split(" ")[0] == pre_word]
    w_i_set = [key for key in table2Gram if key.split(" ")[1] == current_word]

    # lambda(w_{i-1}) and P_continuation(w_i) from the bigram formulation
    low_order_weight = d / table1Gram[pre_word] * len(w_i_1_set)
    p_continuation = len(w_i_set) / len(table2Gram)

    # discounted bigram term plus the weighted continuation probability
    w_i_1_w_i = pre_word + " " + current_word
    c = table2Gram.get(w_i_1_w_i, 0)
    return max(c-d, 0) / table1Gram[pre_word] + low_order_weight * p_continuation

print("score of ('paragraphs','are') is", p_kn_2('paragraphs','are'))
print("score of ('paragraphs','of') is ", p_kn_2('paragraphs','of'))

table1Gram 
 {'paragraphs': 2, 'are': 1, 'the': 5, 'building': 1, 'blocks': 1, 'of': 8, 'papers': 1, 'many': 1, 'students': 1, 'define': 1, 'in': 6, 'terms': 1, 'length': 2, 'a': 15, 'paragraph': 8, 'is': 6, 'group': 3, 'at': 1, 'least': 1, 'five': 1, 'sentences': 4, 'half': 1, 'page': 1, 'long': 2, 'etc': 1, 'reality': 1, 'though': 1, 'unity': 1, 'and': 2, 'coherence': 1, 'ideas': 1, 'among': 1, 'what': 2, 'constitutes': 1, 'defined': 1, 'as': 2, 'or': 2, 'single': 1, 'sentence': 3, 'that': 2, 'forms': 1, 'unit': 1, 'appearance': 1, 'do': 1, 'not': 1, 'determine': 1, 'whether': 1, 'section': 1, 'paper': 1, 'for': 1, 'instance': 1, 'some': 1, 'styles': 2, 'writing': 1, 'particularly': 1, 'journalistic': 1, 'can': 1, 'be': 1, 'just': 1, 'one': 2, 'ultimately': 1, 'support': 1, 'main': 1, 'idea': 2, 'this': 2, 'handout': 1, 'we': 1, 'will': 1, 'refer': 1, 'to': 1, 'controlling': 1, 'because': 1, 'it': 1, 'controls': 1, 'happens': 1, 'rest': 1} 

table2Gram 
 {'paragraphs are': 1, 'are th

#### 2. general recursive formulation

$$
P_{KN}(w_i | w_{i-n+1}^{i-1}) = \frac{max(C_{KN}(w_{i-n+1}^{i}) -d, 0) }{C_{KN}(w_{i-n+1}^{i-1})}  + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i | w_{i-n+2}^{i-1}) 
$$

where

$$
C_{KN}(\bullet) = \left \{
\begin{aligned}
    & count(\bullet) \ & for \ the \ highest \ order \\
    & continuation \ count(\bullet) \ & for \ lower \ orders \\
\end{aligned}
\right.
$$
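
The continuation count of a lower-order gram is the number of distinct words that precede it. One way to derive these counts from the next-higher-order table is sketched below; the function name `continuation_counts` and the trigram counts are hypothetical.

```python
from collections import Counter


def continuation_counts(higher_order_counts):
    """Map each (n-1)-gram to the number of distinct words that precede it.

    Every distinct n-gram type (w, rest...) with a positive count adds
    exactly one left context w to `rest`.
    """
    cont = Counter()
    for gram, count in higher_order_counts.items():
        if count > 0:
            cont[gram[1:]] += 1
    return cont


# Hypothetical trigram counts.
trigram_counts = Counter({
    ("a", "b", "c"): 5,
    ("x", "b", "c"): 1,
    ("a", "b", "d"): 2,
})

bigram_cont = continuation_counts(trigram_counts)
print(bigram_cont)
```

Note that `("b", "c")` gets a continuation count of 2 even though it occurs 6 times in total: only the number of distinct left contexts matters.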



### modified Kneser-Ney smoothing

$$
P_{KN}(w_i | w_{i-n+1}^{i-1}) = 
    \frac
        {C(w_{i-n+1}^{i}) - D(C(w_{i-n+1}^{i}))}
        {\sum_{w_i} C(w_{i-n+1}^i)}  + \lambda(w_{i-n+1}^{i-1}) P_{KN}(w_i | w_{i-n+2}^{i-1})
$$

$$
full \ gram \ probability = 
    \frac
        {full \ gram \ count - discount}
        {prefix \ sum}
    + backoff(prefix)* suffix \ probability
$$


where
$$
D(c) = \left \{
\begin{aligned}
    & 0 \ & if \ c=0 \\
    & D_1 \ & if \ c=1 \\
    & D_2 \ & if \ c=2 \\
    & D_{3+} \ & if \ c\geq3 \\
\end{aligned}
\right.
$$

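$$
\begin{aligned}
Y & = \frac{n_1}{n_1 + 2n_2} \\
D_1 & = 1 -2 Y \frac{n_2}{n_1} \\
D_2 & = 2 -3 Y \frac{n_3}{n_2} \\
D_{3+} & = 3 -4 Y \frac{n_4}{n_3}
\end{aligned}
$$

where $n_k$ is the number of n-grams that occur exactly $k$ times.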
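
As a worked example, the discounts can be computed directly from the counts of counts. The values of $n_1$ to $n_4$ below are illustrative only, and the function name `modified_kn_discounts` is hypothetical.

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Return (Y, D1, D2, D3+) from the counts of counts n1..n4."""
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return y, d1, d2, d3_plus


# Illustrative counts of counts (number of n-grams seen exactly k times).
y, d1, d2, d3_plus = modified_kn_discounts(n1=1481, n2=464, n3=219, n4=129)
print(y, d1, d2, d3_plus)
```

Substituting $Y$ into $D_1$ gives $1 - \frac{2n_2}{n_1 + 2n_2} = \frac{n_1}{n_1 + 2n_2}$, so $D_1$ always equals $Y$. The formulas divide by $n_1$, $n_2$ and $n_3$, so a very small corpus in which one of them is zero needs separate handling.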

and

$$
\lambda(w_{i-n+1}^{i-1}) = 
\frac
    {D_1N_1(w_{i-n+1}^{i-1}\cdot)+D_2N_2(w_{i-n+1}^{i-1}\cdot)+D_{3+}N_{3+}(w_{i-n+1}^{i-1}\cdot)}
    {\sum_{w_i}c(w_{i-n+1}^i)}
$$

$$
backoff(prefix) = 
\frac
    {discount * cardinality \ of \ gram \ set \ with \ prefix}
    {prefix \ sum }
$$

where 
$$
N_{1+}(\bullet w_{i-n+2}^i) = |\{w_{i-n+1} : C(w_{i-n+1}^i) > 0\}|
$$

$$
N_1(w_{i-n+1}^{i-1} \cdot) = |\{w_i : c(w_{i-n+1}^{i-1}w_i) = 1\}|
$$

$N_2$ and $N_{3+}$ are defined analogously for counts equal to 2 and counts of at least 3.
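
A small numeric check of the backoff weight, using hypothetical bigram counts and hypothetical discount values `D1`, `D2`, `D3_PLUS` (in practice they would come from the formulas for $D_1$, $D_2$, $D_{3+}$). It shows that the mass removed by discounting is exactly the mass handed to the lower-order model.

```python
from collections import Counter

# Hypothetical bigram counts; four grams share the prefix ("a",).
gram_counts = Counter({
    ("a", "b"): 1,
    ("a", "c"): 1,
    ("a", "d"): 2,
    ("a", "e"): 5,
    ("z", "b"): 4,
})

# Hypothetical discount values.
D1, D2, D3_PLUS = 0.6, 1.1, 1.5


def discount(c):
    """D(c) from the piecewise definition."""
    if c == 0:
        return 0.0
    if c == 1:
        return D1
    if c == 2:
        return D2
    return D3_PLUS


def backoff_weight(prefix):
    """lambda(prefix) = (D1*N1 + D2*N2 + D3+*N3+) / sum of counts with prefix."""
    counts = [c for gram, c in gram_counts.items() if gram[:-1] == prefix]
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    n3_plus = sum(1 for c in counts if c >= 3)
    return (D1 * n1 + D2 * n2 + D3_PLUS * n3_plus) / sum(counts)


prefix = ("a",)
prefix_total = sum(c for gram, c in gram_counts.items() if gram[:-1] == prefix)
lam = backoff_weight(prefix)
discounted_mass = sum(
    c - discount(c) for gram, c in gram_counts.items() if gram[:-1] == prefix
) / prefix_total

print(lam, discounted_mass, lam + discounted_mass)
```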


In [299]:
import math
from collections import Counter, defaultdict

# count k grams where 1<=k<=n
# return [1_gram_counter, 2_gram_counter, ...]
def gram_count(sentences, n, left_pad="<s>", right_pad="</s>"):
    counter = []
    for i in range(1,n+1):
        ng = Counter()
        for sentence in sentences:
            sentence_with_pad = [left_pad] * (i-1) + sentence + [right_pad] * (i-1)
            for j in range(len(sentence) + i-1):
                ng[tuple(sentence_with_pad[j:j+i])] +=1
        counter.append(ng)
    return counter


# total number of grams that have exact <key> count 
# return Counter({1: 1481, 2: 464, 3: 219, 4: 129})
def total_exact_count(gram_count):
    tec = Counter()
    for counter in gram_count:  
        for value in counter.values():
            tec[value] += 1
    return tec
    
    
# from equation D(c)
def discount(total_exact_count,c):
    n = total_exact_count
    Y = n[1] / (n[1]+2*n[2])
    D0 = 0
    D1 = 1-2*Y*n[2]/n[1]
    D2 = 2-3*Y*n[3]/n[2]
    D3 = 3-4*Y*n[4]/n[3]
    if c == 0:
        return D0
    elif c == 1:
        return D1
    elif c == 2:
        return D2
    elif c >= 3:
        return D3

# sum of the counts of all grams that share the same prefix (all words but the last)
def prefix_sum(gram_count):
    ps = defaultdict(int)
    for counter in gram_count:
        for gram in counter.keys():
            prefix = gram[:-1]
            ps[prefix] += counter[gram]
    return ps
        
# backoff weight per prefix: total discount over grams with that prefix / prefix sum
def backoff(gram_count, total_exact_count, prefix_sum):
    bo = defaultdict(int)
    for counter in gram_count:
        for key in counter.keys():
            value = counter[key]
            prefix = key[:-1]
            bo[prefix] += discount(total_exact_count, value)
    for prefix in bo.keys():
        bo[prefix] = bo[prefix] / prefix_sum[prefix]
    return bo

# build a dict from unigram to n-gram using the recursive formulation
# return a dict { gram: kn_prob,}
def language_model(gram_count,total_exact_count, prefix_sum, backoff):
    lm = defaultdict(int)
    for counter in gram_count:
        for key in counter.keys():
            if len(key) == 1:
                # unigram Kneser-Ney probability
                total_diff_precede = len([k for k in gram_count[1].keys() if k[1] == key[0]])
                total_bigram_type = len(gram_count[1].keys())
                lm[key] = total_diff_precede / total_bigram_type
            else:
                # bigram, trigram, ..
                c = counter[key]
                D = discount(total_exact_count,c)
                prefix = key[:-1]
                suffix = key[1:]
                
                # recursive formulation
                lm[key] = (c - D) / prefix_sum[prefix] + backoff[prefix] * lm[suffix]
    return lm


def p_kn(language_model, backoff, n, sentence):
    # score `sentence` (a tuple of words) as the sum of the log probabilities
    # of its n-gram windows

    # check lm for kn prob
    # if not found (first term is zero) and prefix is not in corpus (backoff weight undefined)
    # back off completely to the lower-order gram by dropping the leftmost word
    # see first edge case
    # in http://www.cs.cmu.edu/~tbergkir/11711fa16/recitation1.pdf
    def p(s):
        for i in range(len(s)):
            if s[i:] in language_model:
                return math.log(language_model[s[i:]])
        # no suffix of s is in the model (OOV last word): add nothing to the log score
        return 0

    prob = 0
    if len(sentence) < n:
        return prob
    for i in range(len(sentence)-n+1):
        partial = sentence[i:i+n]
        prob += p(partial)
    return prob


In [300]:
from nltk.corpus import gutenberg

# use gutenberg as corpus
sentences = [s for s in gutenberg.sents()][0:1000]

n = 3
gc = gram_count(sentences,n)
tec = total_exact_count(gc)
ps = prefix_sum(gc)
bo = backoff(gc,tec,ps)
lm = language_model(gc,tec,ps,bo)

print(p_kn(lm,bo,n,('Emma','by', 'is', 'Austen')))


[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], ['VOLUME', 'I'], ['CHAPTER', 'I'], ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.'], ['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'", 's', 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.'], ['Her', 'mother', 'had', 'died', 'too', 'long', 'ago', 'for', 'her', 'to', 'have', 'more', 'than', 'an', 'indistinct', 'remembrance', 'of', 'her', 'caresses', ';', 'and', 'her', 'place', 'had', 'been',

Page 61 Speech and Language Processing

...
The highest accuracy language models at the time of this writing make use of
**neural nets**. 
The problem with standard language models is that **the number of parameters
increases exponentially** as the n-gram order increases, and n-grams have no
way to generalize from training to test set. Neural networks instead project words
into a continuous space in which words with similar contexts have similar representations.
...
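
The exponential growth mentioned in the quote can be made concrete with a little arithmetic. The sketch below assumes a hypothetical vocabulary of 10,000 words and counts the number of possible n-grams, which is an upper bound on the size of an n-gram table.

```python
def max_ngram_types(vocabulary_size, n):
    """Number of distinct n-grams that can be formed: V ** n."""
    return vocabulary_size ** n


vocabulary_size = 10_000  # hypothetical vocabulary size

for n in range(1, 6):
    print(n, max_ngram_types(vocabulary_size, n))
```

Each extra word of context multiplies the number of possible n-grams by the vocabulary size, so most of them can never be observed in a finite training corpus.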