
# Going beyond n-gram language models with smoothing and neural networks

* The sparsity problem and Zipf's law mean that, for n-gram language models, longer and longer sequences are each less and less likely to have been observed
* Many linguistic events in test corpora never appear in the training corpus (say, if we switch genres)
* We do not want to assign zero probability to unseen events that really do occur
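
To make the zero-probability problem concrete, here is a minimal sketch of an unsmoothed bigram estimate. The toy corpus and the name `mle_bigram_prob` are made up for illustration.

```python
from collections import Counter

# Toy training corpus (made up for illustration).
train = "the cat sat on the mat".split()

unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))

def mle_bigram_prob(prev, word):
    """Unsmoothed estimate: C(prev, word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

seen = mle_bigram_prob("the", "cat")    # this bigram occurs in training
unseen = mle_bigram_prob("the", "dog")  # plausible English, but never observed
```

Any sentence containing "the dog" would receive probability zero under this model, however reasonable the rest of the sentence is.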

## Out-of-vocabulary symbols

* We could learn a special "OOV" (out-of-vocabulary) symbol that stands in for every word we see fewer than $n$ times
* Replace all rare words (defined by that threshold $n$) with "OOV"
* Re-compute probabilities and assign $p(w = \text{OOV})$ or $p(w_{i-1}\ \text{OOV})$ etc. to words outside the vocabulary
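
The replacement step above might be sketched as follows. The function names, the toy sentences, and the threshold of 2 are assumptions chosen for illustration.

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2, oov="OOV"):
    """Map every token seen fewer than min_count times to the OOV symbol."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else oov for tok in tokens]

def index_new_text(tokens, vocab, oov="OOV"):
    """At test time, map any token outside the training vocabulary to OOV."""
    return [tok if tok in vocab else oov for tok in tokens]

train = "the cat saw the dog and the cat ran".split()
train_oov = replace_rare_words(train, min_count=2)

# The vocabulary is whatever survives the replacement, including OOV itself.
vocab = set(train_oov)
test_oov = index_new_text("the bird saw the cat".split(), vocab)
```

After this step, n-gram counts are recomputed over `train_oov`, so "OOV" gets its own probabilities like any other word.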

## Smoothing | Discounting | Backoff

Assume we have a total vocabulary of size $|V|$; following set notation, $|V|$ is the size (cardinality) of the set of all the words we have seen. We can then adjust the probability of any event up or down, moving some probability mass from seen events to unseen ones.

* Laplace/Add-One smoothing
  * We assume we have seen everything one more time than we have actually seen it
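  * Each unseen event gets probability $\frac{1}{N + |V|}$, where $N$ is the original total count (of all tokens for unigrams, or of the context $w_{i-1}$ for bigrams)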
  * Affects low-probability events much more than high-probability events
* Add-k smoothing
  * Use a value $k$ smaller than 1 to allocate slightly less probability mass to unseen events
  * $\large P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}w_i) + k}{C(w_{i-1}) + k|V|}$, where $C(\cdot)$ is a count in the training corpus
  * Choose $k$ by looking at held-out corpora
  * Make sure the denominator is the original frequency $F$ (e.g., your total count) plus $k|V|$, so that the smoothed probabilities still sum to 1
  * Generally doesn't allocate discounts well
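* Backoff and interpolation
  * Katz backoff falls back to a smaller n-gram (e.g., from trigrams to bigrams to unigrams) only when the larger one was never seen
  * Linear interpolation instead learns a set of coefficients $\lambda$ (summing to 1) that tell us in what proportion to mix n-grams of each size, such as how much to "count" what we have seen from bigrams and unigrams
  * $\hat{P}(w_n \mid w_{n-2}w_{n-1}) = \lambda_{1}P(w_n) + \lambda_{2}P(w_n \mid w_{n-1}) + \lambda_{3}P(w_n \mid w_{n-2} w_{n-1})$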
* Kneser-Ney smoothing
  * Looks at the _histories_ of the words we are considering backing off to
  * Deprioritizes guesses of frequent words that occur in very few contexts (e.g., the _Kong_ in Hong Kong)
  * If those words occur in a variety of contexts, they are more likely to form a novel bigram
  * Advanced technique
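
As a rough sketch of two of the techniques above, the code below computes add-k bigram probabilities and a weighted $\lambda$ mixture of unigram and bigram estimates. The toy corpus, the function names, and the weights 0.3 and 0.7 are assumptions for illustration; in practice $k$ and the $\lambda$ values would be tuned on held-out data.

```python
from collections import Counter

train = "the cat sat on the mat".split()   # toy corpus, for illustration only
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
vocab = sorted(unigrams)
V = len(vocab)   # vocabulary size |V|
N = len(train)   # total token count

def add_k_prob(prev, word, k=1.0):
    """Add-k estimate: (C(prev, word) + k) / (C(prev) + k * |V|)."""
    return (bigrams[(prev, word)] + k) / (unigrams[prev] + k * V)

def interpolated_prob(prev, word, lam_uni=0.3, lam_bi=0.7):
    """Mix unigram and bigram MLE estimates; the weights must sum to 1."""
    p_uni = unigrams[word] / N
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam_uni * p_uni + lam_bi * p_bi

laplace_unseen = add_k_prob("the", "sat", k=1.0)   # unseen bigram with k = 1
add_k_unseen = add_k_prob("the", "sat", k=0.1)     # smaller k moves less mass
row_total = sum(add_k_prob("the", w, k=0.1) for w in vocab)
mixed_unseen = interpolated_prob("the", "sat")     # nonzero thanks to the unigram term
```

Here `row_total` comes out at 1 (up to floating-point error), which is the check that the $k|V|$ term in the denominator is doing its job.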


## What is a neural language model?

* Any model that uses neural network methods to estimate the probabilities of a text
* May use a variety of architectures
  * word2vec
  * recurrent neural networks (RNNs)
  * Transformers (e.g., BERT)
  * Convolutional neural networks (CNNs)
  * Generative adversarial networks (GANs)

## How do NLMs learn?

* We tokenize language into units that we ask the model to predict
* Tokens can vary in their size from single characters (CharRNNs) to words
* We _optimize_ models to predict language from the context
  * Next word prediction (e.g., GPT-2, RNNs)
  * Masked word prediction (masked language modeling, e.g., BERT)
* Models are penalized for making mistakes (loss)
* Typically try to minimize the _cross-entropy_ (which depends on the _probabilities_ models assign to outcomes)
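
To show how the loss depends on the probabilities a model assigns, here is a small sketch of cross-entropy computed by hand. The two sets of "model outputs" are hypothetical numbers, not results from any real model.

```python
import math

def cross_entropy(predicted_probs, targets):
    """Average negative log-probability assigned to the correct tokens."""
    losses = [-math.log(probs[target])
              for probs, target in zip(predicted_probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical outputs: one distribution over a 3-word vocabulary per position.
confident = [{"cat": 0.8, "dog": 0.1, "mat": 0.1},
             {"cat": 0.1, "dog": 0.1, "mat": 0.8}]
unsure = [{"cat": 0.4, "dog": 0.3, "mat": 0.3},
          {"cat": 0.3, "dog": 0.3, "mat": 0.4}]
targets = ["cat", "mat"]

loss_confident = cross_entropy(confident, targets)
loss_unsure = cross_entropy(unsure, targets)
```

Both hypothetical models rank the correct word first, but the one that puts more probability on it receives the lower loss.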

## What do NLMs learn?
* Long-distance co-occurrence statistics
  * Critical for idiomatic expressions
* Similarity in meaning between related words
  * Variants of the same word: is/was/were/been or go/went/gone
  * Words that appear in similar contexts: dog, cat, etc.
* Latent categories of words (nouns, verbs, adverbs, etc.)

## What does their ability to learn tell us about language?
* Language processing involves combining lots of _constraints_
* Predicting language is harder when we throw away statistical cues
* N-gram language models are not sufficient to characterize linguistic statistics
* (Probably neural language models also are not)

## Recurrent neural networks

Using a more complex neural structure that holds onto prior states in memory, we can learn contextual word representations.
* The model uses _recurrence_ (storing the previous state) and the current input to predict the next event -- e.g., a word. 
* The hidden state keeps track of everything up to the current word
* Can be thought of as a very sophisticated n-gram language model whose context is not limited to a fixed window
* State-of-the-art framework under the `seq2seq` banner until about 2017, when Transformers were introduced

<center><img src="images/elman_rnn.png" width=400></center>

* The hidden state and the current input are combined like so:
  * Hidden state: $1 \times n\_dim$
  * Recurrent units: $1 \times n\_dim$
  * Weight matrix: $(2 \cdot n\_dim) \times |V|$
  * Predictions are a $1 \times |V|$ vector where each dimension is the relative strength of that word
  * 🚨 MUST pass predictions through a `Softmax` function to obtain probabilities 🚨
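
The shapes above can be traced in a small pure-Python sketch of the prediction step only. It uses random, untrained weights and toy sizes, leaves out how the hidden state itself is updated, and the names `softmax` and `predict_next` are assumptions for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_next(hidden, current_input, weights):
    """Concatenate hidden state and input (length 2 * n_dim), multiply by the
    (2 * n_dim) x |V| weight matrix, and apply softmax."""
    combined = hidden + current_input            # list concatenation
    vocab_size = len(weights[0])
    scores = [sum(combined[i] * weights[i][j] for i in range(len(combined)))
              for j in range(vocab_size)]
    return softmax(scores)

random.seed(0)
n_dim, vocab_size = 4, 6                         # toy sizes, chosen arbitrarily
hidden = [random.uniform(-1, 1) for _ in range(n_dim)]
current_input = [random.uniform(-1, 1) for _ in range(n_dim)]
weights = [[random.uniform(-1, 1) for _ in range(vocab_size)]
           for _ in range(2 * n_dim)]

probs = predict_next(hidden, current_input, weights)
```

Before the softmax, the scores can be negative or larger than 1. Afterwards, `probs` has one non-negative entry per vocabulary word and sums to 1.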

### Masked language modeling objective in Transformers

* Generalizes the language modeling task using a much more complex neural network architecture.
  
* Prediction task similar to the RNN but everything is computed in parallel, rather than serially
  * **Prediction task**: The model's goal is to predict a hidden word (or some proportion of hidden words) given the context that surrounds those words.
  * The sentences it sees look roughly like this:
    * Masked sentence: Dr. Jacob's [MASK] Radish is very sweet and purrs very loudly.
    * Correct answer: [MASK] corresponds to the word "cat."
  * **Error-driven**: The model is trained by _backpropagation_: each prediction mistake is used to adjust the model's weights, so as training proceeds it makes fewer and fewer mistakes.
  
* _Attention_ mechanism allows every token to influence every other token

* This added complexity makes the model much better at guessing a word given the context, because it can incorporate _syntactic_, _semantic_, and _lexical_ knowledge, in addition to some propositional knowledge. 
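
The data side of the masked prediction task can be sketched without any neural network: hide some tokens and keep the answers. The helper names and the default masking rate of 0.15 are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
import random

MASK = "[MASK]"

def mask_at(tokens, position):
    """Hide one chosen token; return the masked sequence and the hidden word."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, target

def mask_random(tokens, mask_prob=0.15, rng=None):
    """Hide a random proportion of tokens; return the masked sequence and a
    {position: original token} dictionary of answers."""
    rng = rng or random.Random()
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token    # what the model must recover
        else:
            masked.append(token)
    return masked, targets

sentence = "Dr. Jacob's cat Radish is very sweet and purrs very loudly".split()
masked, target = mask_at(sentence, 2)
random_masked, random_targets = mask_random(sentence, rng=random.Random(0))
```

During training, the model would see `masked` as input and be penalized according to how little probability it assigns to `target`.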

### DistilBERT

* A smaller **Transformer** whose goal was to compress a much bigger Transformer (BERT): it is faster than BERT and about 40% smaller.
* Trained to "model" (imitate) the predictions of a larger neural language model

### The core components of Transformer models like BERT, GPT-2, XLM, etc.

* Word embeddings (hidden states within the models)
* Predicted probabilities in the outputs for each word in our vocabulary
* "Attention" values that represent the strength of a link between every word and every other word in the input
* Hidden states and attention values are produced at every single **layer** in the network; the predicted probabilities are computed from the final layer
* All of this is implemented as matrices -- which we will cover in more depth later
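
As a preview of the matrix view, the sketch below builds a table of attention weights with one row and one column per token. It is a deliberate simplification: real Transformers first pass the embeddings through learned query and key projections, whereas this version compares the embeddings directly. The 2-dimensional embeddings are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(embeddings):
    """Return an n x n matrix; row i holds how strongly token i attends to
    every token in the input (including itself)."""
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        # Scaled dot product between this token and every token.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights.append(softmax(scores))
    return weights

# Made-up 2-dimensional embeddings for a 3-token input.
tokens = ["the", "cat", "purrs"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
weights = attention_weights(embeddings)
```

Each row of `weights` sums to 1, and with these toy vectors "purrs" puts more weight on "cat" than on "the" because their embeddings point in similar directions.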