In [1]:
import nltk
from nltk.corpus import brown
from collections import Counter
brown_words = brown.words()

unigram_counts = Counter(brown_words)

bigrams = []
for sent in brown.sents():
    bigrams.extend(nltk.bigrams(sent, pad_left=True, pad_right=True))
bigram_counts = Counter(bigrams)

trigrams = []
for sent in brown.sents():
    trigrams.extend(nltk.trigrams(sent, pad_left=True, pad_right=True))
trigram_counts = Counter(trigrams)

def bigram_LM(sentence_x, smoothing=0.0):
    unique_words = len(unigram_counts.keys()) + 2 # For the None paddings
    x_bigrams = nltk.bigrams(sentence_x, pad_left=True, pad_right=True)
    prob_x = 1.0
    for bg in x_bigrams:
        if bg[0] == None:
            prob_bg = (bigram_counts[bg]+smoothing)/(len(brown.sents())+smoothing*unique_words)
        else:
            prob_bg = (bigram_counts[bg]+smoothing)/(unigram_counts[bg[0]]+smoothing*unique_words)
        prob_x = prob_x *prob_bg
        print(str(bg)+":"+str(prob_bg))
    return prob_x

LookupError: 
**********************************************************************
  Resource [93mbrown[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('brown')
  [0m
  Searched in:
    - '/home/pranava/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/home/pranava/miniconda3/envs/python3/nltk_data'
    - '/home/pranava/miniconda3/envs/python3/lib/nltk_data'
**********************************************************************


<center>
<h2>Advanced language modeling</h2>
<p style="text-align:center">
Natural Language Processing<br>
(COM4513/6513)<br>
<br>
<a href="http://andreasvlachos.github.io">Andreas Vlachos</a><br>
a.vlachos@sheffield.ac.uk<br>
<small>Department of Computer Science<br>
University of Sheffield
</small>
</p>
</center>

### So far

Basic language modeling concepts:
- N-gram language models
- add-1 smoothing
- evaluation with sentence completion

### Recap by T-Rex

<img style:"float:left; width=75%" src="images/dinosaur_ngrams.png"/>

<p>by the <a href="http://nlp.cs.berkeley.edu/comics.shtml">Berkeley NLP group</a>

### In this lecture

More advanced approaches to language modelling
- back-off and linear interpolation
- stupid back-off
- Kneser-Ney

More evaluation:
- perplexity
- applications (a.k.a. extrinsic evaluation)

<h3>Problem setup</h3>

Training data is a (large) set of sentences $\mathbf{x}^m$ with words $x_n$:

<p>
\begin{align}
D_{train} & = \{\mathbf{x}^1,...,\mathbf{x}^M\} \\
\mathbf{x}& = [x_1,... x_N]\\
\end{align}
</p>

<p class="fragment">
for example:
\begin{align}
\mathbf{x}=&[\text{None}, \text{The}, \text{water}, \text{is}, \text{clear}, \text{.}, \text{None}]
\end{align}
</p>

We want to learn a model that returns:
\begin{align}
\text{probability}\; P(\mathbf{x}), \mathbf{for} \; \forall \mathbf{x}\in V^{maxN}
\end{align}
$V$ is the vocabulary and $V^{maxN}$ all possible sentences

<h3>N-gram language models</h3>

$$P(\mathbf{x}) = P(x_1,...,x_N) = \prod_{n=1}^N P(x_n|x_1,... x_{n-1})  $$ 
<p>k-th order Markov <b>assumption</b>:
\begin{align}
P(x_n|x_{n-1},..., x_1) &\approx P(x_n|x_{n-1},..., x_{n-k})\\
& = P(x_n|k\_word\_context)\\
\end{align}
</p>

<p>Longer contexts are more informative:</p>
<p style="text-align: center"><i>dog bites ...</i> better than <i>bites ...</i></p>
<p>but only if they are frequent enough:</p>
<p style="text-align: center"><i> canid bites ...</i> better than <i>bites ...</i>?

Can we combine them?

<h3>Simple Linear Interpolation</h3>

a.k.a. weighted average:

$$f(x) = \lambda_1 f_1(x) + \lambda_2 f_2(x) + ...,  \qquad \lambda_i>0, \sum \lambda_i = 1$$

<p>For trigram bigram LM and unigram LM:</p>

\begin{align}
P_{SLI}(x_n|x_{n-1},x_{n-2})& = \lambda_3 P(x_n|x_{n-1},x_{n-2})\\
& + \lambda_2 P(x_n|x_{n-1})\\
& + \lambda_1P(x_n) \qquad \lambda_i>0, \sum \lambda_i = 1
\end{align}

### Simple Linear Interpolation in action

In [41]:
x = ["I", "spoke", "the", "truth", "."]
unig=unigram_counts[x[1]]/len(brown_words)
print(unig)

7.492301014819255e-05


In [42]:
big=bigram_counts[(x[0],x[1])]/unigram_counts[x[0]]
print(big)

0.0007750435962022863


In [43]:
trig = trigram_counts[(None, x[0], x[1])]/bigram_counts[(None, x[0])]
print(trig)

0.0


In [44]:
print(0.5*trig+0.3*big+0.2*unig)

0.0002474976808903244


### Back off

Start with order $k$ but if the counts are 0 then use $k-1$:

\begin{align}
{BO}&(x_n|x_{n-1} \ldots x_{n-k}) =\\
&\begin{cases}
P(x_n|x_{n-1}\ldots x_{n-k}), & \text{if $c(x_{n} \ldots x_{n-k})>0$}\\
BO(x_n|x_{n-1} \ldots x_{n-k+1}), & \text{otherwise}
\end{cases}
\end{align}

Is this a probability distribution?

**NO!** Must discount probabilities for contexts with
counts $P^\star$ and distribute the mass to the shorter context ones:

\begin{align}
P_{BO}&(x_n|x_{n-1} \ldots x_{n-k}) =\\
&\begin{cases}
P^{\star}(x_n|x_{n-1}\ldots x_{n-k}), & \text{if $c(x_{n} \ldots x_{n-k})>0$}\\
\alpha^{x_{n-1}\ldots x_{n-k}} P_{BO}(x_n|x_{n-1} \ldots x_{n-k+1}), & \text{otherwise}
\end{cases}
\end{align}

### Absolute Discounting

<!-- give the intuition first, then show it with live data -->

Using 22M words for train and held-out
<img style="width:300px; background:none; border:none; box-shadow:none;" src="images/bigram_train_test.png"/>
Can you predict the heldout set average count given the training?

Testing counts = training counts - 0.75 (absolute discount)

### Absolute discounting

$$P_{AbsDiscount}(x_n|x_{n-1}) = \frac{c(x_n,x_{n-1}) - d}{c(x_{n-1})} +\lambda_{x_{n-1}} P(x_n) $$

$d=0.75$, $\lambda$s tuned to ensure we have a valid probability distribution.


Component of the **Kneser-Ney** discounting:
Intuition: a word can be very frequent, but it only follows
very few contexts, e.g. <i>Francisco</i> is frequent but almost always follows <i>San</i>. The unigram probability in the context of the bigram should capture how likely $x_n$ is to be a novel continuation.

### Stupid Back off

Do we really need probabilities? Estimating the additional parameters
takes time for large corpora.

If scoring is enough, <b>stupid backoff</b> works adequately:

\begin{align}
{SBO}&(x_n|x_{n-1} \ldots x_{n-k}) =\\
&\begin{cases}
P(x_n|x_{n-1}\ldots x_{n-k}), & \text{if $c(x_{n} \ldots x_{n-k})>0$}\\
\lambda SBO(x_n|x_{n-1} \ldots x_{n-k+1}), & \text{otherwise}
\end{cases}
\end{align}

$\lambda=0.4$ works well

They called it stupid because they didn't expect it work well!

### Syntax-based language models

<img style:"float:left" src="images/depLM.png"/>

$$P(\text{binoculars}|\text{saw})$$

more informative than:

$$(\text{binoculars}| \text{strong}, \text{very}, \text{with}, \text{ship}, \text{the},\text{saw})$$

### Intrinsic evaluation

What does it mean to have a good probabilistic language model?

To have lower **perplexity**: be less surprised by sentences unseen in its training:

<!---start with the probability dist, explain how it is applied, then build up, give the bigram/unigram example live --->

\begin{align}
PPX(\mathbf{x})&=P(x_1, \ldots, x_n)^{1/N}\\
&= \sqrt[N]{\frac{1}{P(x_1, \ldots, x_n)}}\\
&= \sqrt[N]{\frac{1}{\prod_{n=1}^N P(x_n|x_1,... x_{n-1})}}
\end{align}

Why is a bigram language model likely to have lower perplexity than a unigram one?

### Parameter tuning
Model simplicity (mainly counting) and no labeled data can be misleading.

There are (always) parameters to learn and tune, thus:
- training (for parameter learning, i.e. counting)
- development (for parameter tuning)
- test (for performance reporting)

Which parameters do our models have for tuning?

### Applications/extrinsic evaluation

- Sentence completion
- Grammatical error correction: detecting "odd" sentences and propose alternatives
- Natural lanuage generation: prefer more "natural" sentences
- Speech recognition
- Machine translation

### The problem with perplexity
    
<img style:"width:20%; float:left" src="images/AccuracyVsPerplexity.png"/>

- doesn't always correlate with application performance
- can't evaluate non probabilistic LMs

### Last words: More data defeats smarter model!

<a href="http://www.aclweb.org/anthology/D07-1090.pdf"><img style="width:60%;" src="images/MT_LM.png"/></a>

### Bibliography
- Jurafsky & Martin [Chapter 4](https://web.stanford.edu/~jurafsky/slp3/4.pdf)
- Michael Collins's [notes](http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf)

### Coming up next

We have learned how to model word sequences using a Markov model

In the following lecture we will look at
- how to perform part-of-speech tagging using the **hidden** Markov model.