# L14a: Natural Language Embedding Models
In this lecture, we'll examine natural language models before the advent of transformers. We'll introduce [embedding models](https://en.wikipedia.org/wiki/Word_embedding), which are techniques for representing words in a continuous vector space. These models are crucial for understanding the evolution of natural language processing (NLP) and the development of transformer architectures.

The key concepts of this lecture include:
* __Embedding Models__: These models represent words as vectors in a continuous space, allowing for the capture of semantic relationships between words. We'll discuss how these embeddings are learned and their applications in various NLP tasks.
* __Word2Vec__: A popular embedding model that uses shallow neural networks to learn word representations. We'll explore the two main architectures of Word2Vec: Continuous Bag of Words (CBOW) and Skip-Gram (in L14b).
* __Continuous Bag of Words (CBOW)__: This architecture predicts the target word based on its context words. It uses a shallow neural network to learn the embeddings of words in a given context. No positional information is used, and the model is trained to minimize the loss between the predicted and actual target word.

The sources for this lecture include:
* [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)
* [Vaswani, Ashish, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. “Attention is All You Need.” Neural Information Processing Systems (2017).](https://arxiv.org/abs/1706.03762)
* [Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G.K., Greiff, V., Kreil, D.P., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2020). Hopfield Networks is All You Need. ArXiv, abs/2008.02217.](https://arxiv.org/abs/2008.02217)
* [Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. ArXiv, abs/2207.09238.](https://arxiv.org/abs/2207.09238)

___

## Embedding Models
The overall goal of embedding models is to represent language sequences, e.g., characters, words, documents, etc., in a continuous vector space, where similar words are _close together_ in the embedding space. Let's look at some of the most popular embedding models: the continuous bag-of-words (CBOW) and skip-gram models. 
* _Key idea_: Both models are based on the idea that words that appear in similar contexts tend to have similar meanings. The CBOW model predicts a target word based on its context, while the skip-gram model does the opposite: it predicts the context given a target word.
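
To make "close together" concrete, one common (though not the only) measure of closeness is cosine similarity. The sketch below uses made-up three-dimensional vectors; the `embedding` dictionary and its values are hypothetical and are not the output of any trained model.

```julia
using LinearAlgebra # stdlib: dot, norm

# cosine similarity: 1 for parallel vectors, 0 for orthogonal vectors
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

# hypothetical 3-dimensional embeddings, made up for illustration only
embedding = Dict(
    "pie"   => [0.9, 0.1, 0.0],
    "cake"  => [0.8, 0.2, 0.1],
    "clock" => [0.0, 0.1, 0.9],
)

s_near = cosine_similarity(embedding["pie"], embedding["cake"])
s_far  = cosine_similarity(embedding["pie"], embedding["clock"])
println("pie vs cake:  ", round(s_near, digits=3))
println("pie vs clock: ", round(s_far, digits=3))
```

In a useful embedding, we would hope that words used in similar contexts score higher than unrelated words, as the toy vectors here were chosen to do.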

Before we discuss the details of these models, let's introduce some key concepts, terminology, and notation that will be used throughout this lecture.

### Vocabulary, Tokens, and Tokenization
Let $\mathcal{V}$ be the vocabulary of tokens (characters, sub-words, whole words, documents, etc.) in our [corpus](https://en.wikipedia.org/wiki/Corpus), and let $N_{\mathcal{V}} = |\mathcal{V}|$ be the size of the vocabulary. Let $\mathbf{x}\equiv (x_1, x_2, \ldots, x_n)$ with $x_i\in\mathcal{V}$ be a sequence of tokens in the corpus, i.e., a sentence or document, where $n$ is the length of the sequence and $x_i$ is the $i$-th token in the sequence.

Let's consider a simple example: `My grandma makes the best apple pie.`

Tokens are the basic units of text that we will be working with. Tokens can be characters, sub-words, whole words, or documents. Converting a sequence of text into tokens is called _tokenization_.
* _Character-level tokenization_. Given the example above, one possible choice is to let the vocabulary $\mathcal{V}$ be the (English) alphabet (plus punctuation and the space character). Thus, we’d get a sequence $\mathbf{x}$ of length 36: `['M', 'y', ' ', ..., '.']`. Character-level tokenization tends to yield _very long sequences_.
* _Word-level tokenization_. Another possible choice is to let the vocabulary $\mathcal{V}$ be the set of all words in the corpus (plus punctuation). Thus, we’d get a sequence $\mathbf{x}$ of length 8: `['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']`. Word-level tokenization tends to yield _shorter sequences_; however, it tends to require an extensive vocabulary and cannot deal with new words at test time.
* _Sub-word tokenization_. A third possible choice is to let the vocabulary $\mathcal{V}$ be the set of commonly occurring word segments like `cious`, `ing`, `pre`. Common words like `is` are often a separate token, and single characters are also included in the vocabulary $\mathcal{V}$ to ensure all words are expressible.
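
The first two schemes can be sketched in a few lines. This is a minimal illustration, assuming that word-level tokens are runs of word characters and that each punctuation mark is its own token; the regular expression is our choice for this example, not a standard tokenizer.

```julia
sentence = "My grandma makes the best apple pie."

# character-level: every character (including spaces and the period) is a token
char_tokens = collect(sentence)

# word-level: runs of word characters, with each punctuation mark as its own token
word_tokens = [String(m.match) for m in eachmatch(r"\w+|[^\w\s]", sentence)]

println(length(char_tokens), " character tokens")
println(length(word_tokens), " word tokens: ", word_tokens)
```

The two lengths reproduce the counts quoted in the list above (36 characters versus 8 words).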

Given a choice of tokenization/vocabulary, each vocabulary element is assigned a unique index in $\left\{1, 2,\dots, N_{\mathcal{V}}-3\right\}$. Several special (control) tokens are then added to the vocabulary; we use three here, but there could be more:
* $\texttt{mask} \rightarrow N_{\mathcal{V}} - 2$: the `mask` token that is used to mask out a token in the input sequence. This is used in training to predict the masked word.
* $\texttt{bos} \rightarrow N_{\mathcal{V}} - 1$: the beginning of the sequence (bos) token is used to indicate the start of a sequence. 
* $\texttt{eos} \rightarrow N_{\mathcal{V}}$: the end of sequence (eos) token is used to indicate the end of a sequence.

A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by $\texttt{bos}$-token and followed by the $\texttt{eos}$-token.
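
As a rough sketch of this indexing scheme, the snippet below builds a tiny vocabulary from the example sentence, appends the three control tokens at the last three indices, and encodes the sentence as token IDs. The token spellings `<mask>`, `<bos>`, and `<eos>` and the alphabetical ordering of the base vocabulary are assumptions made for illustration.

```julia
words = ["My", "grandma", "makes", "the", "best", "apple", "pie", "."]

# ordinary tokens get the indices 1, 2, ..., N - 3 (sorted order is an arbitrary choice)
base = sort(unique(words))
vocab = Dict(w => i for (i, w) in enumerate(base))

# the three control tokens take the last three indices
N = length(base) + 3
vocab["<mask>"] = N - 2
vocab["<bos>"]  = N - 1
vocab["<eos>"]  = N

# token IDs: bos, then the words of the sentence, then eos
token_ids = [vocab["<bos>"]; [vocab[w] for w in words]; vocab["<eos>"]]
println(token_ids)
```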

In [3]:
let
    sentence = "My grandma makes the best apple pie.";
    sentence |> s-> split(s, " ")
end

7-element Vector{SubString{String}}:
 "My"
 "grandma"
 "makes"
 "the"
 "best"
 "apple"
 "pie."

### Contextual Continuous Bag of Words (CBOW)
The Continuous Bag of Words (CBOW) model is a neural network architecture used for learning word embeddings that was popularized by the [word2vec algorithm](https://arxiv.org/abs/1301.3781). 

* _What is it?_ The CBOW model predicts the probability of a _target word_ based on its surrounding _context words_. The CBOW model is encoded as a feedforward neural network with a single hidden layer. The input (context) vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is the [one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot) of a single context word or, for several context words, the average of their one-hot vectors. The output is a _softmax layer_ that computes the probability of each candidate target word given the context.
* See: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)

In the simplest case, the input context vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is connected to a hidden layer $\mathbf{h}\in\mathbb{R}^{h}$, which is computed using a linear transformation with an identity activation, i.e., with no nonlinear activation function:
$$
\begin{align*}
\mathbf{h} &= \mathbf{W}_{1} \cdot \mathbf{x} \\
\end{align*}
$$
where $\mathbf{W}_{1}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is the (unknown) weight matrix of the hidden layer, and $\mathbf{x}$ is [the one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot) of the context word(s) (the input vector). The hidden layer is then mapped through another linear layer:
$$
\begin{align*}
\mathbf{u} &= \mathbf{W}_{2} \cdot \mathbf{h} \\
\end{align*}
$$
which produces the $\mathbf{u}\in\mathbb{R}^{N_{\mathcal{V}}}$ vector, where $\mathbf{W}_{2}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is the (unknown) weight matrix for the output layer. The output layer is then passed through a softmax activation function to obtain the probability distribution over the vocabulary:
$$
\begin{align*}
p(w_{i} | \mathbf{x}) = y_i &= \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \\
\end{align*}
$$
where $p(w_{i} | \mathbf{x})$ is the probability of observing the $i$-th token, e.g., character, sub-word, word, document, etc in the vocabulary as the output (target) given the context vector $\mathbf{x}$, the term $N_{\mathcal{V}}$ is the size of the vocabulary, and $e^{\mathbf{u}_i}$ is the exponential function applied to the $i$-th element of the vector $\mathbf{u}$.
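
The three equations above can be sketched directly in plain Julia. This is a minimal illustration with a toy vocabulary of 8 tokens and 3 hidden units and random (untrained) weights; the function names `softmax` and `cbow_forward` are ours, and the softmax subtracts the maximum score only for numerical stability.

```julia
# softmax over a vector of scores (max subtracted for numerical stability)
function softmax(u)
    e = exp.(u .- maximum(u))
    return e ./ sum(e)
end

# CBOW forward pass: two linear maps followed by a softmax
function cbow_forward(W1, W2, x)
    h = W1 * x          # hidden layer, no activation function
    u = W2 * h          # one score per vocabulary entry
    return softmax(u)   # probability distribution over the vocabulary
end

N, hdim = 8, 3              # toy vocabulary size and hidden dimension
W1 = randn(hdim, N)         # untrained weights, for illustration only
W2 = randn(N, hdim)

x = zeros(N)
x[[2, 5]] .= 1 / 2          # average of the one-hot vectors of tokens 2 and 5

p = cbow_forward(W1, W2, x)
println("sum of probabilities = ", sum(p))
```

Because $\mathbf{x}$ is an average of one-hot vectors, $\mathbf{W}_{1}\cdot\mathbf{x}$ is simply the average of the corresponding columns of $\mathbf{W}_{1}$, which is why those columns can be read as word embeddings.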

#### Training
The training objective of the CBOW model is to _maximize_ the likelihood of the target word(s) given the context words. This is done by _minimizing_ the negative log-likelihood loss function (in this case, a cross-entropy loss) over the training data. For a single training example, the loss function is defined as:
$$
\begin{align*}
\min\mathcal{L} &= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log p(w_{i} | \mathbf{x}) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log \left( \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \right) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left( \mathbf{u}_i - \log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) \right) \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) -  \mathbf{u}_i\right)\quad\text{substitute}~u_{i} = \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\langle \mathbf{w}_{2}^{(j)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle} \right) -  \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle\right)\blacksquare\\
\end{align*}
$$
where $\mathcal{L}$ is the loss function, $y_{i}$ is the $i$-th element of the one-hot encoded vector of the target word(s), $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the weight matrices of the hidden and output layers, respectively, and $\langle \cdot,\cdot\rangle$ is the inner product. Finally, the term $\mathbf{w}_{2}^{(i)}$ is the $i$-th row of the weight matrix $\mathbf{W}_{2}$, which corresponds to the target word $w_{i}$.
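
As a quick numerical check of the derivation, the sketch below evaluates the log-sum-exp form of the loss for a toy score vector and compares it with $-\log p(w_{i} | \mathbf{x})$ for the target word. The names `logsumexp` and `cbow_loss` and the example scores are assumptions for illustration.

```julia
# log(sum(exp(u))) computed with the maximum subtracted for numerical stability
function logsumexp(u)
    m = maximum(u)
    return m + log(sum(exp.(u .- m)))
end

# loss in the log-sum-exp form: sum_i y_i * (logsumexp(u) - u_i)
cbow_loss(u, y) = sum(y .* (logsumexp(u) .- u))

u = [1.0, 2.0, 3.0]     # toy scores for a 3-token vocabulary
y = [0.0, 0.0, 1.0]     # one-hot target: token 3

p = exp.(u) ./ sum(exp.(u))     # softmax probabilities
println("log-sum-exp form: ", cbow_loss(u, y))
println("-log p(target):   ", -log(p[3]))
```

With a one-hot target, only the term for the target word survives the sum, so the two printed values should agree up to floating-point error.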

A variety of optimization algorithms can be used to minimize the loss function. Let's implement the CBOW model and mess around with the inputs, hyperparameters, etc, to see how they affect its performance.

___

## Example: CBOW Model of Sarcasm Headlines
Let's set up the computational environment for our example, e.g., importing the necessary libraries (and codes), etc, by including the `Include.jl` file:

In [6]:
include("Include.jl");

### Sarcasm Data
We'll load a public dataset of headlines curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. [Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).](https://www.sciencedirect.com/science/article/pii/S2666651023000013?via%3Dihub)
2. [Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).](https://rishabhmisra.github.io/Sculpting_Data_for_ML.pdf)

The sarcasm data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`.
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We'll load the saved data file that we generated in `L13b`.

In [8]:
corpusmodel = let

    # setup path -
    path_to_saved_corpus_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2");
    saveddata = load(path_to_saved_corpus_file);

    # get items from the saveddata -
    corpusmodel = saveddata["corpus"];

    # return 
    corpusmodel
end;

In [9]:
corpusmodel.records[1]

MySarcasmRecordModel(true, "thirtysomething scientists unveil doomsday clock of hair loss", "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205")

__Constants__. Next, we'll set some constants that will be used throughout the code. See the comment next to each constant for a description of its purpose, permissible values, etc.

In [11]:
θ = 0.10; # fraction of the records used for training
number_of_records = length(corpusmodel.records)
number_of_training_samples = Int64(round(θ*number_of_records)); # θ of the data will be used for training
number_of_test_samples = number_of_records - number_of_training_samples; # the rest will be used for testing
vocabulary = corpusmodel.tokens; # vocabulary for the corpus
inverse_vocabulary = corpusmodel.inverse; # inverse vocabulary for the corpus
N = length(vocabulary); # number of tokens in the vocabulary
number_of_hidden_states = 100; # number of hidden states
number_of_epochs = 20; # number of epochs that we'll use for training
array_of_token_ids = range(1, step=1, length=N) |> collect; # array of token ids

In [12]:
corpusmodel.records[600].headline

"why these people of faith are marching for women this weekend"

__Select a context__. Let's set up a context to train the model. Select a headline for training:

In [14]:
corpusmodel.records[4].headline # used 1 and 4

"inclement weather prevents liar from getting to work"

Next, let's build the training dataset. This will be contained in the `training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` variable.

In [16]:
training_dataset = let

    # specify the context, and the target -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    
    # build list of training examples -
    for i ∈ 1:number_of_training_samples # use the first number_of_training_samples for records
        words = corpusmodel.records[i].headline |> s-> split(s, " ");

        # How many words are in this sentence?
        number_of_words = length(words);
        idx_target_word = rand(1:number_of_words); # select a random word
        idx_context = 1:number_of_words |> collect |> v-> setdiff(v,idx_target_word); # get all the words, excluding the random word

        # What is the token_id for the target word?
        target_word_token_id = vocabulary[words[idx_target_word]]; # this is the target word
        if target_word_token_id == 0 # hack:
            target_word_token_id = 1;
        end
        target_one_hot = onehot(target_word_token_id, array_of_token_ids);
       
        # build the context vector x
        C = length(idx_context);
        tmp = zeros(Float32, N, C); # temporary vector for the context words
        for j ∈ 1:C
            word = words[idx_context[j]]; # get the context word
            context_word_id = vocabulary[word]; # get the context word id

            # guard: a token id of 0 cannot index a 1-based array, so map it to 1
            if (context_word_id == 0)
                context_word_id = 1;
            end
            
            tmp[context_word_id, j] = 1.0 |> Float32; # one-hot column for context word j
        end
        context_one_hot = (1/C)*sum(tmp, dims=2) |> vec; # average the one-hot context columns
        
        D = (context_one_hot, target_one_hot);
        push!(training_dataset, D);
    end
    
    # return the training dataset -
    training_dataset
end;
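
To see what a single training pair looks like at a small scale, here is a sketch that builds one (context, target) pair from the headline selected earlier, using a toy vocabulary built from that headline alone. The helper `make_training_pair` and the toy vocabulary are hypothetical; plain vectors stand in for the `OneHotVector` type used in the cell above.

```julia
# build one (averaged context vector, one-hot target) pair from a list of words
function make_training_pair(words, vocabulary, idx_target)
    N = length(vocabulary)
    idx_context = setdiff(1:length(words), idx_target) # all positions except the target

    x = zeros(Float32, N)
    for j in idx_context
        x[vocabulary[words[j]]] += 1    # accumulate the one-hot context vectors
    end
    x ./= length(idx_context)           # average over the context words

    y = zeros(Float32, N)
    y[vocabulary[words[idx_target]]] = 1 # one-hot target
    return x, y
end

words = String.(split("inclement weather prevents liar from getting to work", " "))
vocabulary = Dict(w => i for (i, w) in enumerate(sort(unique(words)))) # toy vocabulary
x, y = make_training_pair(words, vocabulary, 4) # position 4 is "liar"

println("nonzero context entries: ", count(!iszero, x))
println("context vector sums to:  ", sum(x))
```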

In [69]:
training_dataset

2862-element Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}:
 ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 ⋮

In [17]:
findall(x-> x!= 0.0, training_dataset[1][1]) # context words vector

7-element Vector{Int64}:
  5553
  8295
 12047
 15828
 18533
 23295
 27980

In [18]:
inverse_vocabulary[15828]

"loss"

__Setup model__: We will use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to encode the CBOW model. The model is a simple feedforward neural network with a single hidden layer. The input layer is a one-hot encoded vector of the context words (average for multiple words), and the output layer is a softmax layer that computes the probability of the target word given the context.

In [20]:
# TODO: Uncomment the code below to build the model!
Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # create a "namespaced" of sorts
MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
    Chain(
        input = Dense(N, number_of_hidden_states, identity),  # layer 1
        hidden = Dense(number_of_hidden_states, N, identity), # layer 2
        output = NNlib.softmax) # layer 3 (output layer)
);
model = MyModel().chain;

In [21]:
inverse_vocabulary[67]

"00003"

What does the untrained model give?

In [23]:
let
    x = onehot(67, array_of_token_ids); # give some random word
    y = model(x) |> v-> argmax(v) |> i-> inverse_vocabulary[i]
end

"satin"

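__Loss__: The loss function is [the logit cross entropy function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy), which applies a log-softmax to its first argument internally. Because the model above already ends in a softmax layer, the softmax is effectively applied twice during training; `Flux.Losses.crossentropy` is the variant that expects probabilities as input.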

In [25]:
loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for multiclass classification; agg = mean averages the per-sample losses

__Train__: We'll train the model using the gradient descent with momentum optimizer. 

In [27]:
trainedmodel = let

    localmodel = model; # alias, not a copy: training updates `model` in place

    # setup the optimizer
    λ = 0.64; # TODO: maybe change the learning rate (default: 0.61)?
    β = 0.10; # TODO: maybe change the momentum parameter (default: 0.10)?
    opt_state = Flux.setup(Momentum(λ,β), model);

    # training loop -
    for i ∈ 1:number_of_epochs
        # train the model - check out the do block notion: https://docs.julialang.org/en/v1/base/base/#do
        Flux.train!(localmodel, training_dataset, opt_state) do m, x, y
            loss(m(x), y) # loss function
        end

        if (rem(i,10) == 0)
            @show "Epoch $i of $number_of_epochs completed" # print the epoch number
        end
    end

    # return the trained model -
    localmodel;
end

"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 10 of 20 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 20 of 20 completed"


Chain(
  input = Dense(29664 => 100),          # 2_966_500 parameters
  hidden = Dense(100 => 29664),         # 2_996_064 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 5_962_564 parameters, 22.746 MiB.
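
For intuition about the optimizer used in the training cell above, here is a sketch of one common form of the gradient-descent-with-momentum update, applied to a toy quadratic rather than to the CBOW loss. The function `momentum_step!` is hypothetical and written from scratch; it mirrors the roles of `λ` (learning rate) and `β` (momentum) but is not the `Flux.jl` implementation.

```julia
# one momentum step: update the velocity in place, then move against it
function momentum_step!(w, v, g; λ=0.64, β=0.10)
    @. v = β * v + λ * g    # blend the previous velocity with the new gradient
    @. w = w - v            # take the step
    return w
end

target = [1.0, -2.0, 0.5]
∇f(w) = w .- target         # gradient of f(w) = ½‖w - target‖²

w = zeros(3)                # parameters
v = zeros(3)                # velocity (same shape as the parameters)
for _ in 1:50
    momentum_step!(w, v, ∇f(w))
end
println("distance to the minimizer: ", maximum(abs.(w .- target)))
```

On this simple problem the iterates settle at the minimizer; on the CBOW loss, how well the same settings work depends on the data and the loss surface.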

__Check__: If we give the context vector to the trained model, and the training was good, we should get the target word back. Let's check this by passing the context vector to the model and checking the output. The output should be a probability distribution over the vocabulary, with the target word having the highest probability.

In [67]:
ŷ = trainedmodel(training_dataset[1000][1]); # predicted probability distribution over the vocabulary
ŷ |> v-> argmax(v) |> i-> inverse_vocabulary[i] # get the predicted word

"to"

__Hmmm__. What happens if we change the context? Let's try a few different contexts and see how the model performs.

In [57]:
test_context_example = let

    # initialize -
    context_words_vector = zeros(Float32, N); # context words vector
    list_of_context_words = ["hair", "loss", "unveil"];
    C = length(list_of_context_words); # number of context words

    tmp = zeros(Float32, N, C); # temporary vector for the context words
    for i ∈ eachindex(list_of_context_words)
        word = list_of_context_words[i]; # get the context word
        context_word_id = vocabulary[word]; # get the context word id
        tmp[context_word_id, i] = 1.0; # set the context word id to 1.0        
    end
    context_one_hot = (1/C)*sum(tmp, dims=2) |> vec .|> Float32; # average the one-hot context columns

    # return -
    context_one_hot;
end;

What are we going to see?

In [71]:
let
    ŷ = trainedmodel(test_context_example); # predicted probability distribution over the vocabulary
    ŷ |> v-> argmax(v) |> i-> inverse_vocabulary[i] # get the predicted word
end

"to"
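
Both contexts return the same single most probable word, so it can help to look past the `argmax` at the few most probable tokens. The sketch below shows one way to do that on a made-up probability vector; the helper `topk`, the probabilities, and the word list are hypothetical and are not output from the trained model.

```julia
# indices of the k largest entries of p, most probable first
topk(p, k) = partialsortperm(p, 1:k, rev=true)

# made-up distribution over a 5-token vocabulary, for illustration only
p = [0.05, 0.40, 0.10, 0.30, 0.15]
toy_words = ["hair", "to", "clock", "of", "loss"]

for i in topk(p, 3)
    println(toy_words[i], " => ", p[i])
end
```

Applied to the model output, the same helper could be called as `topk(ŷ, 5)` and the indices mapped through `inverse_vocabulary`.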

___

## Lab: The Skip-Gram Model
In lab `L14b`, we'll examine the skip-gram model, a neural network-based approach to natural language processing designed to learn word embeddings by predicting the _surrounding context words_ given a _target word_ within a fixed window in a text corpus. 
* _What is it?_ A skip-gram model consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. Imagine you're reading a sentence and can guess the words that come before and after a particular word.
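
As a preview, the training pairs for a skip-gram model can be sketched as follows: each word in turn is the target, and every other word within a fixed window around it becomes a context word. The function `skipgram_pairs` and the window size of 2 are assumptions for illustration; the lab may construct its data differently.

```julia
# all (target, context) pairs within `window` positions of each target word
function skipgram_pairs(words, window)
    pairs = Tuple{String,String}[]
    for (i, target) in enumerate(words)
        for j in max(1, i - window):min(length(words), i + window)
            j == i && continue              # a word is not its own context
            push!(pairs, (target, words[j]))
        end
    end
    return pairs
end

words = String.(split("inclement weather prevents liar from getting to work", " "))
pairs = skipgram_pairs(words, 2)
println(length(pairs), " pairs; the first few are ", pairs[1:3])
```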

# Today?
That's a wrap! What are some of the interesting things we discussed today?
