# L14a: Natural Language Models Before Transformers
In this lecture, we'll look at natural language models before the advent of transformers. In particular, we'll introduce [embedding models](https://en.wikipedia.org/wiki/Word_embedding), which are techniques used to represent words in a continuous vector space. These models are crucial for understanding the evolution of natural language processing (NLP) and the development of transformer architectures.

The key concepts of this lecture include:
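* _Tokenization_ converts a piece of text into a sequence of tokens (characters, sub-words, or full words) drawn from a vocabulary $\mathcal{V}$, where each token is assigned a unique index (token ID).
* _Embedding models_ represent words in a continuous vector space, where similar words are close together.
* _The CBOW and skip-gram models_ learn embeddings from context: CBOW predicts a target word from its context, while skip-gram predicts the context from a target word.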

The sources for this lecture were:
* [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)
* [Vaswani, Ashish, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. “Attention is All you Need.” Neural Information Processing Systems (2017).](https://arxiv.org/abs/1706.03762)
* [Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G.K., Greiff, V., Kreil, D.P., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2020). Hopfield Networks is All You Need. ArXiv, abs/2008.02217.](https://arxiv.org/abs/2008.02217)
* [Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. ArXiv, abs/2207.09238.](https://arxiv.org/abs/2207.09238)

___

## Embedding Models
The overall goal of embedding models is to represent words in a continuous vector space, where similar words are _close together_ in the embedding space. Let's take a look at two of the most popular embedding models: the continuous bag of words (CBOW) and skip-gram models.
* _Key idea_: The CBOW and skip-gram models are based on the idea that words that appear in similar contexts tend to have similar meanings. The CBOW model predicts a target word based on its context, while the skip-gram model does the opposite: it predicts the context given a target word.

Before we dive into the details of these models, let's first introduce some key concepts, terminology and notation that will be used throughout this lecture.

### Vocabulary, Tokens and Tokenization
Let $\mathcal{V}$ be the vocabulary of tokens (characters, sub-words, full words, documents, etc.) in our [corpus](https://en.wikipedia.org/wiki/Corpus), and let $N_{\mathcal{V}} = |\mathcal{V}|$ be the size of the vocabulary. Let $\mathbf{x}\equiv (x_1, x_2, \ldots, x_n)$, with each $x_i\in\mathcal{V}$, be a sequence of tokens in the corpus, i.e., a sentence or document, where $n$ is the length of the sequence and $x_i$ is the $i$-th token in the sequence.

Let's consider a simple example: `My grandma makes the best apple pie.`

Tokens are the basic units of text that we will be working with. In this space, tokens can be characters, sub-words, full words, or even entire documents. The process of converting a sequence of text into tokens is called _tokenization_.
* _Character-level tokenization_. Given the example above, one possible choice is to let the vocabulary $\mathcal{V}$ be the (English) alphabet (plus punctuation). Thus, we'd get a sequence $\mathbf{x}$ of length 36: `['M', 'y', ' ', ..., '.']`. Character-level tokenization tends to yield _very long sequences_.
* _Word-level tokenization_. Another possible choice is to let the vocabulary $\mathcal{V}$ be the set of all words in the corpus. Thus, we'd get a sequence $\mathbf{x}$ of length 8: `['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']`. Word-level tokenization tends to yield _shorter sequences_; however, it tends to require a very large vocabulary and cannot deal with new words at test time.
* _Sub-word tokenization_. A third possible choice is to let the vocabulary $\mathcal{V}$ be the set of commonly occurring word segments like `cious`, `ing`, `pre`. Common words like `is` are often a separate token, and single characters are also included in the vocabulary $\mathcal{V}$ to ensure all words are expressible.
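
The three tokenization choices above can be sketched in a few lines of Python. This is only an illustration: the regular expression used for word-level splitting, the tiny sub-word vocabulary `SUBWORD_VOCAB`, and the greedy longest-match rule in `greedy_subword` are assumptions made for this example, not the scheme used by any particular tokenizer (practical sub-word tokenizers learn their vocabulary from the corpus).

```python
import re
import string

text = "My grandma makes the best apple pie."

# Character-level: every character (letters, spaces, punctuation) is a token.
char_tokens = list(text)

# Word-level: words are tokens, and punctuation marks are kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sub-word level: a hypothetical vocabulary of common segments plus all single
# letters, so that every word can be expressed.
SUBWORD_VOCAB = {"cious", "ing", "pre", "is"} | set(string.ascii_letters)

def greedy_subword(word, vocab=SUBWORD_VOCAB):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {word!r} at position {start}")
    return tokens

print(len(char_tokens))            # 36
print(word_tokens)                 # ['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']
print(greedy_subword("precious"))  # ['pre', 'cious']
print(greedy_subword("baking"))    # ['b', 'a', 'k', 'ing']
```

Note how the single-letter entries act as a fallback: `baking` has no multi-character prefix in the toy vocabulary, so it starts with individual characters before the segment `ing` matches.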

Given a choice of tokenization / vocabulary, each vocabulary element is assigned a unique index in $\left\{1, 2,\dots,N_{\mathcal{V}}-3\right\}$. A number of special (control) tokens are then added to the vocabulary; we'll use three here, but there could be more:
* $\texttt{mask} \rightarrow N_{\mathcal{V}} - 2$: the `mask` token is used to mask out a token in the input sequence. This is used in training to predict the masked word.
* $\texttt{bos} \rightarrow N_{\mathcal{V}} - 1$: the beginning of sequence (bos) token is used to indicate the start of a sequence.
* $\texttt{eos} \rightarrow N_{\mathcal{V}}$: the end of sequence (eos) token is used to indicate the end of a sequence.

A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by the $\texttt{bos}$ token and followed by the $\texttt{eos}$ token.
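
As a minimal sketch of this indexing scheme, the snippet below builds a vocabulary from the word-level tokens of the example sentence and encodes the sentence as token IDs. The helper names `build_vocab` and `encode`, the spellings `<mask>`, `<bos>`, `<eos>`, and the choice to index regular tokens in sorted order are all assumptions for illustration; any fixed one-to-one assignment would do.

```python
def build_vocab(tokens):
    """Index distinct tokens 1..N-3, then append the mask, bos and eos tokens."""
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)), start=1)}
    n_vocab = len(vocab) + 3          # vocabulary size including control tokens
    vocab["<mask>"] = n_vocab - 2
    vocab["<bos>"] = n_vocab - 1
    vocab["<eos>"] = n_vocab
    return vocab

def encode(tokens, vocab):
    """Convert tokens to token IDs, wrapped in the bos and eos tokens."""
    return [vocab["<bos>"]] + [vocab[tok] for tok in tokens] + [vocab["<eos>"]]

tokens = ['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']
vocab = build_vocab(tokens)
token_ids = encode(tokens, vocab)

print(len(vocab))   # 11
print(token_ids)    # [10, 2, 5, 6, 8, 4, 3, 7, 1, 11]
```

Here the eight regular tokens receive the indices 1 through 8, and the control tokens `mask`, `bos` and `eos` receive 9, 10 and 11, respectively.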

### Continuous Bag of Words (CBOW)
The Continuous Bag of Words (CBOW) model is a neural network architecture used for learning word embeddings that was popularized by the [word2vec algorithm](https://arxiv.org/abs/1301.3781). 

* _What is it?_ The CBOW model predicts the probability of a _target word_ based on its surrounding _context words_. The CBOW model is a feedforward neural network with a single hidden layer. The input (context) vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ encodes the _context words_ (note that $\mathbf{x}$ now denotes this input vector rather than a token sequence): for a single context word it is a [one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot), and for several context words a common choice (used in Rong, 2014) is the average of their one-hot vectors. The output is a _softmax layer_ that computes the probability of the target word given the context.

In the simplest case, the hidden layer (dimension $h$) is a linear layer with no activation function. Thus, the output of the hidden layer is given by:
$$
\begin{align*}
\mathbf{h} &= \mathbf{W}_{\text{cbow}} \cdot \mathbf{x} \\
\end{align*}
$$
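where $\mathbf{W}_{\text{cbow}}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is the weight matrix of the hidden layer, and $\mathbf{x}$ is the one-hot encoded vector of context word(s). When $\mathbf{x}$ is one-hot, $\mathbf{h}$ is simply the column of $\mathbf{W}_{\text{cbow}}$ for that context word, which serves as the word's embedding. The hidden vector $\mathbf{h}$ is then mapped through another linear layer: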
$$
\begin{align*}
\mathbf{u} &= \mathbf{W}_{\text{out}} \cdot \mathbf{h} \\
\end{align*}
$$
where $\mathbf{W}_{\text{out}}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is the weight matrix of the output layer. The vector $\mathbf{u}$ is then passed through a softmax activation function to obtain the probability distribution over the vocabulary. The $i$-th element of the output vector $\mathbf{y}$ is given by:
$$
\begin{align*}
p(w_{i} | \mathbf{x}) = y_i &= \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \\
\end{align*}
$$
where $p(w_{i} | \mathbf{x})$ is the probability of observing the $i$-th word in the vocabulary as the output (target) given the context vector $\mathbf{x}$, $N_{\mathcal{V}}$ is the size of the vocabulary, and $e^{\mathbf{u}_i}$ is the exponential function applied to the $i$-th element of the vector $\mathbf{u}$.
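
The three equations above can be traced numerically with a small sketch in plain Python. The weight values, the dimensions ($h = 2$, $N_{\mathcal{V}} = 4$), and the helper names `one_hot`, `matvec`, `softmax` and `cbow_forward` are made up for illustration. Two implementation details are assumptions of this sketch rather than part of the equations: Python indices are 0-based (the lecture's are 1-based), and the softmax subtracts $\max_j u_j$ before exponentiating, which leaves the result unchanged but avoids overflow.

```python
import math

def one_hot(index, size):
    """Length-`size` vector with a 1.0 at the 0-based position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(W, x):
    """Matrix-vector product, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(u):
    """Softmax of u; subtracting max(u) does not change the result."""
    m = max(u)
    exps = [math.exp(ui - m) for ui in u]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(W_cbow, W_out, x):
    """Return (h, u, y) for the two linear layers followed by a softmax."""
    h = matvec(W_cbow, x)   # hidden layer: h = W_cbow x
    u = matvec(W_out, h)    # output layer: u = W_out h
    y = softmax(u)          # y_i = p(w_i | x)
    return h, u, y

# Illustrative weights: W_cbow is h x N_V (2 x 4), W_out is N_V x h (4 x 2).
W_cbow = [[0.1, 0.2, 0.3, 0.4],
          [0.5, 0.6, 0.7, 0.8]]
W_out = [[0.2, -0.1],
         [0.0, 0.3],
         [-0.4, 0.1],
         [0.5, 0.5]]

x = one_hot(2, 4)                  # context word at 0-based index 2
h, u, y = cbow_forward(W_cbow, W_out, x)

print(h)        # [0.3, 0.7], i.e. the column of W_cbow at index 2
print(y)        # a probability distribution over the 4 vocabulary entries
```

Because $\mathbf{x}$ is one-hot, the hidden vector is just a column lookup in $\mathbf{W}_{\text{cbow}}$, which is why implementations typically store the embeddings as a table rather than performing the full matrix product.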

#### Training
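
One common way to train the model, following the derivation in Rong (2014), is to minimize the negative log-likelihood of the target word, $E = -\log p(w_{t} | \mathbf{x})$, by stochastic gradient descent. For this loss, the gradient with respect to $\mathbf{u}$ is $\mathbf{y} - \mathbf{t}$, where $\mathbf{t}$ is the one-hot vector of the target word. The sketch below is a toy implementation under stated assumptions: the function name `train_cbow`, the learning rate, the hidden dimension, the random initialization, the averaging of context one-hot vectors, and the three training pairs are all illustrative choices, and the full softmax is computed at every step (practical implementations use cheaper approximations).

```python
import math
import random

def softmax(u):
    m = max(u)
    exps = [math.exp(ui - m) for ui in u]
    total = sum(exps)
    return [e / total for e in exps]

def train_cbow(pairs, n_vocab, h_dim=3, lr=0.2, epochs=300, seed=0):
    """Train on (context_indices, target_index) pairs; indices are 0-based.

    Returns the two weight matrices and the mean loss recorded per epoch.
    """
    rng = random.Random(seed)
    W_cbow = [[rng.uniform(-0.5, 0.5) for _ in range(n_vocab)] for _ in range(h_dim)]
    W_out = [[rng.uniform(-0.5, 0.5) for _ in range(h_dim)] for _ in range(n_vocab)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in pairs:
            # Input: average of the one-hot vectors of the context words.
            x = [0.0] * n_vocab
            for c in context:
                x[c] += 1.0 / len(context)
            # Forward pass: h = W_cbow x, u = W_out h, y = softmax(u).
            h = [sum(W_cbow[k][i] * x[i] for i in range(n_vocab)) for k in range(h_dim)]
            u = [sum(W_out[j][k] * h[k] for k in range(h_dim)) for j in range(n_vocab)]
            y = softmax(u)
            total += -math.log(y[target])
            # Backward pass: dE/du = y - t, then the chain rule through both layers.
            err = [y[j] - (1.0 if j == target else 0.0) for j in range(n_vocab)]
            grad_h = [sum(err[j] * W_out[j][k] for j in range(n_vocab)) for k in range(h_dim)]
            for j in range(n_vocab):
                for k in range(h_dim):
                    W_out[j][k] -= lr * err[j] * h[k]
            for k in range(h_dim):
                for i in range(n_vocab):
                    W_cbow[k][i] -= lr * grad_h[k] * x[i]
        losses.append(total / len(pairs))
    return W_cbow, W_out, losses

# Toy data: predict the middle token from its two neighbors.
pairs = [([0, 2], 1), ([1, 3], 2), ([2, 4], 3)]
W_cbow, W_out, losses = train_cbow(pairs, n_vocab=5)
print(losses[0], losses[-1])   # the mean loss should decrease over training
```

After training, the columns of `W_cbow` are the learned embeddings, one per vocabulary entry.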


### The Skip-Gram Model
The skip-gram model is a neural network-based approach in natural language processing designed to learn word embeddings by predicting the surrounding context words given a target word within a fixed window in a text corpus.
* _What is it?_ A skip-gram model consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. This method effectively captures semantic relationships and contextual similarity between words, making it foundational for many downstream NLP tasks.
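
To make the difference between the two prediction tasks concrete, the sketch below generates training examples from the example sentence using a fixed window. The function names `skipgram_pairs` and `cbow_pairs`, and the window size of 1, are assumptions for illustration: skip-gram produces one (target, context word) pair per neighbor, while CBOW groups all neighbors into a single (context words, target) example.

```python
def window_indices(i, n, window):
    """0-based indices within `window` positions of i, excluding i itself."""
    return [j for j in range(max(0, i - window), min(n, i + window + 1)) if j != i]

def skipgram_pairs(tokens, window=1):
    """One (target, context_word) pair for each neighbor of each token."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in window_indices(i, len(tokens), window)]

def cbow_pairs(tokens, window=1):
    """One (context_words, target) example for each token."""
    return [([tokens[j] for j in window_indices(i, len(tokens), window)], tokens[i])
            for i in range(len(tokens))]

tokens = ['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']

print(skipgram_pairs(tokens)[:3])
# [('My', 'grandma'), ('grandma', 'My'), ('grandma', 'makes')]
print(cbow_pairs(tokens)[1])
# (['My', 'makes'], 'grandma')
```

With a window of 1, the eight tokens yield 14 skip-gram pairs (two per interior token, one for each end) but only 8 CBOW examples.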