# L14a: Natural Language Embedding Models
In this lecture, we'll look at natural language models before the advent of transformers. In particular, we'll introduce [embedding models](https://en.wikipedia.org/wiki/Word_embedding), which are techniques used to represent words in a continuous vector space. These models are crucial for understanding the evolution of natural language processing (NLP) and the development of transformer architectures.

The key concepts of this lecture include:
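* _Embedding models_ represent tokens (characters, sub-words, words, or documents) as vectors in a continuous space, where tokens that appear in similar contexts lie close together.
* _Tokenization_ converts text into a sequence of tokens drawn from a vocabulary, where each token is assigned an integer token ID.
* _The Continuous Bag of Words (CBOW) model_ is a feedforward network with a single hidden layer that predicts a target word from its context words, and is trained by minimizing the negative log-likelihood.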

The sources for this lecture were:
* [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)
* [Vaswani, Ashish, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. “Attention is All you Need.” Neural Information Processing Systems (2017).](https://arxiv.org/abs/1706.03762)
* [Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G.K., Greiff, V., Kreil, D.P., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2020). Hopfield Networks is All You Need. ArXiv, abs/2008.02217.](https://arxiv.org/abs/2008.02217)
* [Phuong, M., & Hutter, M. (2022). Formal Algorithms for Transformers. ArXiv, abs/2207.09238.](https://arxiv.org/abs/2207.09238)

___

## Embedding Models
The overall goal of embedding models is to represent language sequences (e.g., characters, words, or documents) in a continuous vector space, where similar items are _close together_ in the embedding space. Let's take a look at two of the most popular embedding models: the continuous bag of words (CBOW) and skip-gram models.
* _Key idea_: The CBOW and skip-gram models are based on the idea that words that appear in similar contexts tend to have similar meanings. The CBOW model predicts a target word based on its context, while the skip-gram model does the opposite: it predicts the context given a target word.
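
To make the two prediction directions concrete, here is a minimal sketch of how training pairs could be generated from a tokenized sentence. The helper `context_target_pairs` and its `window` parameter are hypothetical names introduced for illustration; they are not part of the lecture code.

```julia
# Hypothetical helper (not from the lecture): build CBOW and skip-gram training
# pairs from a tokenized sentence using a symmetric context window.
function context_target_pairs(tokens::Vector{String}; window::Int = 2)
    cbow = Vector{Tuple{Vector{String}, String}}()   # (context words, target word)
    skipgram = Vector{Tuple{String, String}}()       # (target word, one context word)
    n = length(tokens)
    for i in 1:n
        lo = max(1, i - window)
        hi = min(n, i + window)
        context = [tokens[j] for j in lo:hi if j != i]
        push!(cbow, (context, tokens[i]))            # CBOW: context -> target
        for c in context
            push!(skipgram, (tokens[i], c))          # skip-gram: target -> context
        end
    end
    return cbow, skipgram
end

tokens = ["my", "grandma", "makes", "the", "best", "apple", "pie"]
cbow, skipgram = context_target_pairs(tokens; window = 1)
println(cbow[2])      # (["my", "makes"], "grandma")
println(skipgram[1])  # ("my", "grandma")
```

The same sentence yields both kinds of pairs; only the direction of the prediction changes.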

Before we dive into the details of these models, let's first introduce some key concepts, terminology and notation that will be used throughout this lecture.

### Vocabulary, Tokens and Tokenization
Let $\mathcal{V}$ be the vocabulary of tokens (characters, sub-words, full words, documents, etc.) in our [corpus](https://en.wikipedia.org/wiki/Corpus), and let $N_{\mathcal{V}} = |\mathcal{V}|$ be the size of the vocabulary. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ with $x_i\in\mathcal{V}$ be a sequence of tokens in the corpus, i.e., a sentence or document, where $n$ is the length of the sequence and $x_i$ is the $i$-th token in the sequence.

Let's consider a simple example: `My grandma makes the best apple pie.`

Tokens are the basic units of text that we will be working with. In this space, tokens can be characters, sub-words, full words, or even entire documents. The process of converting a sequence of text into tokens is called _tokenization_.
* _Character-level tokenization_. Given the example above, one possible choice is to let the vocabulary $\mathcal{V}$ be the (English) alphabet (plus punctuation and the space character). Thus, we'd get a sequence $\mathbf{x}$ of length 36: `['M', 'y', ' ', ..., '.']`. Character-level tokenization tends to yield _very long sequences_.
* _Word-level tokenization_. Another possible choice is to let the vocabulary $\mathcal{V}$ be the set of all words in the corpus. Thus, we'd get a sequence $\mathbf{x}$ of length 8: `['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']`. Word-level tokenization tends to yield _shorter sequences_. However, it tends to require a very large vocabulary and cannot deal with new words at test time.
* _Sub-word tokenization_. A third possible choice is to let the vocabulary $\mathcal{V}$ be the set of commonly occurring word segments like `cious`, `ing`, `pre`. Common words like `is` are often a separate token, and single characters are also included in the vocabulary $\mathcal{V}$ to ensure all words are expressible.

Given a choice of tokenization / vocabulary, each regular vocabulary element is assigned a unique index in $\left\{1, 2,\dots,N_{\mathcal{V}}-3\right\}$. A number of special (control) tokens are then added to the vocabulary; we'll use three, but there could be more:
* $\texttt{mask} \rightarrow N_{\mathcal{V}} - 2$: the `mask` token is used to mask out a token in the input sequence. This is used in training to predict the masked word.
* $\texttt{bos} \rightarrow N_{\mathcal{V}} - 1$: the beginning of sequence (bos) token is used to indicate the start of a sequence.
* $\texttt{eos} \rightarrow N_{\mathcal{V}}$: the end of sequence (eos) token is used to indicate the end of a sequence.

A piece of text is represented as a sequence of indices (called token IDs) corresponding to its (sub)words, preceded by the $\texttt{bos}$ token and followed by the $\texttt{eos}$ token.
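
As a rough illustration of the ideas above, the sketch below tokenizes the example sentence at the character and word level, builds a small word-level vocabulary with the three control tokens placed at the end, and encodes the sentence as token IDs. The splitting rule and the control token strings `<mask>`, `<bos>`, and `<eos>` are assumptions made for this example only; real tokenizers are considerably more involved.

```julia
sentence = "My grandma makes the best apple pie."

# character-level tokenization: every character (including spaces) is a token
char_tokens = [string(c) for c in sentence]

# word-level tokenization: split on spaces and treat the final period as its own token
# (a deliberately naive rule that only handles this example)
word_tokens = String.(split(replace(sentence, "." => " ."), " "))

# word-level vocabulary: regular tokens get indices 1..N-3, then the control tokens
regular = sort(unique(word_tokens))
vocabulary = Dict(token => i for (i, token) in enumerate(regular))
for control in ["<mask>", "<bos>", "<eos>"]   # assumed control token strings
    vocabulary[control] = length(vocabulary) + 1
end
N = length(vocabulary)

# encode the sentence as token IDs, wrapped in bos ... eos
token_ids = [vocabulary["<bos>"]; [vocabulary[t] for t in word_tokens]; vocabulary["<eos>"]]

println(length(char_tokens))  # 36
println(length(word_tokens))  # 8
println(N)                    # 11
```

With this ordering, `<mask>`, `<bos>`, and `<eos>` receive the indices `N-2`, `N-1`, and `N`, matching the convention described above.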

### Continuous Bag of Words (CBOW)
The Continuous Bag of Words (CBOW) model is a neural network architecture used for learning word embeddings that was popularized by the [word2vec algorithm](https://arxiv.org/abs/1301.3781). 

* _What is it?_ The CBOW model predicts the probability of a _target word_ based on its surrounding _context words_. The CBOW model is encoded as a feedforward neural network with a single hidden layer. The input (context) vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is a [one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot) representing the _context words_, while the output is a _softmax layer_ that computes the probability of the target word given the context.

In the simplest case, the hidden layer $\mathbf{h}\in\mathbb{R}^{h}$ is computed using a linear layer with no activation function:
$$
\begin{align*}
\mathbf{h} &= \mathbf{W}_{1} \cdot \mathbf{x} \\
\end{align*}
$$
where $\mathbf{W}_{1}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is the (unknown) weight matrix of the hidden layer, and $\mathbf{x}$ is [the one-hot encoded vector](https://en.wikipedia.org/wiki/One-hot) of context word(s). The hidden layer is then mapped through another linear layer:
$$
\begin{align*}
\mathbf{u} &= \mathbf{W}_{2} \cdot \mathbf{h} \\
\end{align*}
$$
which produces the $\mathbf{u}\in\mathbb{R}^{N_{\mathcal{V}}}$ vector, where $\mathbf{W}_{2}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is the (unknown) weight matrix for the output layer. The output layer is then passed through a softmax activation function to obtain the probability distribution over the vocabulary:
$$
\begin{align*}
p(w_{i} | \mathbf{x}) = \hat{y}_i &= \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \\
\end{align*}
$$
where $p(w_{i} | \mathbf{x})$ is the probability of observing the $i$-th word in the vocabulary as the output (target) given the context vector $\mathbf{x}$, $N_{\mathcal{V}}$ is the size of the vocabulary, and $e^{\mathbf{u}_i}$ is the exponential function applied to the $i$-th element of the vector $\mathbf{u}$. We write $\hat{y}_i$ for the model output to distinguish it from the one-hot target $y_i$ that appears in the training loss.
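
The forward pass described above can be sketched in a few lines of plain Julia. The vocabulary size, hidden dimension, random weights, and context index below are toy values assumed for illustration; they do not come from the lecture data.

```julia
using Random

# numerically stable softmax: subtracting the maximum does not change the result
function softmax(u)
    e = exp.(u .- maximum(u))
    return e ./ sum(e)
end

Random.seed!(1234)            # reproducibility of this toy example only
N = 8                         # toy vocabulary size (assumption)
h_dim = 3                     # toy hidden (embedding) dimension (assumption)
W₁ = 0.1 * randn(h_dim, N)    # hidden-layer weights, h × N
W₂ = 0.1 * randn(N, h_dim)    # output-layer weights, N × h

k = 5                         # index of the context word (assumption)
x = zeros(N); x[k] = 1.0      # one-hot context vector

h = W₁ * x                    # hidden layer: linear, no activation
u = W₂ * h                    # scores, one per vocabulary word
ŷ = softmax(u)                # p(wᵢ | x) for every word wᵢ

println(h ≈ W₁[:, k])         # true: a one-hot input picks out column k of W₁
println(sum(ŷ) ≈ 1.0)         # true: the outputs form a probability distribution
```

Because the input is one-hot, the hidden vector is just a column of $\mathbf{W}_{1}$, which is why the columns of $\mathbf{W}_{1}$ can be read as word embeddings.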

#### Training
The training objective of the CBOW model is to maximize the likelihood of target word(s) given the context words. This is done by minimizing the negative log-likelihood loss function:
$$
\begin{align*}
\min\mathcal{L} &= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log p(w_{i} | \mathbf{x}) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log \left( \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \right) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left( \mathbf{u}_i - \log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) \right) \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) -  \mathbf{u}_i\right)\quad\text{substitute}~u_{i} = \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\langle \mathbf{w}_{2}^{(j)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle} \right) -  \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle\right)\blacksquare\\
\end{align*}
$$
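where $\mathcal{L}$ is the loss function, $y_{i}$ is the $i$-th component of the one-hot encoded target vector (equal to 1 for the target word and 0 otherwise), $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the weight matrices of the hidden and output layers, respectively, and $\langle \cdot,\cdot\rangle$ is the inner product. Finally, the term $\mathbf{w}_{2}^{(i)}$ is the $i$-th row of the weight matrix $\mathbf{W}_{2}$ (since $\mathbf{u} = \mathbf{W}_{2}\cdot\mathbf{h}$), which corresponds to the word $w_{i}$. Because the target is one-hot, only the target term survives the sum: for target index $t$, the loss reduces to $\mathcal{L} = \log\left(\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}\right) - \mathbf{u}_t$.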
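
As a quick numerical check of the derivation, the sketch below evaluates the last form of the loss on a toy score vector and compares it with the negative log of the softmax probability of the target. The scores and target index are made-up values, and `logsumexp` and `cbow_loss` are hypothetical helper names.

```julia
# log-sum-exp with the usual max-shift for numerical stability
function logsumexp(u)
    m = maximum(u)
    return m + log(sum(exp.(u .- m)))
end

# loss in the final form of the derivation: Σᵢ yᵢ (log Σⱼ exp(uⱼ) - uᵢ)
cbow_loss(u, y) = sum(y .* (logsumexp(u) .- u))

u = [2.0, -1.0, 0.5, 0.0]      # toy scores (assumption)
y = [0.0, 0.0, 1.0, 0.0]       # one-hot target: word 3 (assumption)

p = exp.(u) ./ sum(exp.(u))    # softmax probabilities

println(cbow_loss(u, y) ≈ -log(p[3]))            # true: matches the negative log-likelihood
println(cbow_loss(u, y) ≈ logsumexp(u) - u[3])   # true: only the target term survives
```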

A variety of optimization algorithms can be used to minimize the loss function. Let's implement the CBOW model, and mess around with the hyperparameters to see how they affect the model's performance.
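
Before turning to Flux, here is a minimal sketch of plain gradient descent on the CBOW loss for a single (context, target) pair. It uses the standard gradients for a softmax output with a negative log-likelihood loss, where $\partial\mathcal{L}/\partial\mathbf{u} = \hat{\mathbf{y}} - \mathbf{y}$ (see Rong, 2014). The function name `gradient_step!`, the learning rate `η`, and all sizes are assumptions for this toy example; this is not the optimizer used in the example that follows.

```julia
using Random

function softmax(u)
    e = exp.(u .- maximum(u))
    return e ./ sum(e)
end

# one gradient descent step on the CBOW loss for a single (context, target) pair
function gradient_step!(W₁, W₂, x, y; η = 0.1)
    h = W₁ * x
    ŷ = softmax(W₂ * h)
    e = ŷ .- y                  # ∂L/∂u: prediction error
    ∇W₂ = e * h'                # ∂L/∂W₂, size N × h
    ∇W₁ = (W₂' * e) * x'        # ∂L/∂W₁, size h × N (uses W₂ before its update)
    W₂ .-= η .* ∇W₂
    W₁ .-= η .* ∇W₁
    return -log(ŷ[argmax(y)])   # loss evaluated before this update
end

Random.seed!(1234)              # reproducibility of this toy example only
N, h_dim = 8, 3                 # toy sizes (assumption)
W₁ = 0.1 * randn(h_dim, N)
W₂ = 0.1 * randn(N, h_dim)
x = zeros(N); x[2] = 1.0        # context word 2 (assumption)
y = zeros(N); y[5] = 1.0        # target word 5 (assumption)

losses = [gradient_step!(W₁, W₂, x, y) for _ in 1:200]
println("first loss = ", losses[1], ", last loss = ", losses[end])
```

Note that only column 2 of `W₁` changes here, since the gradient with respect to `W₁` is an outer product with the one-hot context vector.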

___

## Example: CBOW Model of Sarcasm Headlines
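In this example, we build and train a small CBOW model on headlines from a sarcasm detection dataset. We start by setting up the environment.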

In [1]:
include("Include.jl");

### Sarcasm Data
We'll load a public dataset of headlines curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. [Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).](https://www.sciencedirect.com/science/article/pii/S2666651023000013?via%3Dihub)
2. [Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).](https://rishabhmisra.github.io/Sculpting_Data_for_ML.pdf)

The sarcasm data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`.
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We'll load the saved data file that we generated in `L13b`.

In [2]:
corpusmodel = let

    # setup path -
    path_to_saved_corpus_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2");
    saveddata = load(path_to_saved_corpus_file);

    # get items from the saveddata -
    corpusmodel = saveddata["corpus"];

    # return 
    corpusmodel
end;

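Next, we get the vocabulary and the inverse vocabulary from the corpus model, and set a few constants: the vocabulary size `N`, the number of hidden states, and the array of token IDs.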

In [3]:
vocabulary = corpusmodel.tokens; # vocabulary for the corpus
inverse_vocabulary = corpusmodel.inverse; # inverse vocabulary for the corpus
N = length(vocabulary); # number of tokens in the vocabulary
number_of_hidden_states = 100; # number of hidden states
array_of_token_ids = range(1, step=1, length=N) |> collect; # array of token ids

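Let's look at the headline of the first record in the corpus. We'll use its first two words to build a single (context, target) training pair.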

In [4]:
corpusmodel.records[1].headline

"thirtysomething scientists unveil doomsday clock of hair loss"

In [None]:
training_dataset = let

    # specify the context, and the target -
    # note: the context is stored as an N × 1 Matrix{Float32}, so the element type must say Matrix
    training_dataset = Vector{Tuple{Matrix{Float32}, OneHotVector{UInt32}}}();
    context_word_id = vocabulary["thirtysomething"];
    target_word_id = vocabulary["scientists"];
    
    # compute the one-hot encoding for the context and target -
    context_one_hot = zeros(Float32, N, 1);
    context_one_hot[context_word_id] = 1.0f0;
    target_one_hot = onehot(target_word_id, array_of_token_ids);

    # store the (context, target) pair -
    D = (context_one_hot, target_one_hot);
    push!(training_dataset, D);

    # return the training dataset -
    training_dataset
end;
x₁, y₁ = training_dataset[1]; # unpack the first (context, target) pair, used in the cells below

In [6]:
x₁

29664×1 Matrix{Float32}:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 ⋮
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0

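Next, we build the model: a Flux `Chain` with an input layer that maps the `N`-dimensional one-hot context vector to `number_of_hidden_states` hidden units, a second layer that maps back to `N` scores, and a softmax output layer.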

In [7]:
# Build the model: two dense layers followed by a softmax output layer
Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, middle, hidden); # create a "namespaced" of sorts
MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
    Chain(
        input = Dense(N, number_of_hidden_states, tanh_fast),  # layer 1
        hidden = Dense(number_of_hidden_states, N, tanh_fast), # layer 2
        output = NNlib.softmax) # layer 3 (output layer)
);
model = MyModel().chain;

In [8]:
y₁ |> v-> argmax(v)

23295

In [9]:
inverse_vocabulary[23295]

"scientists"

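Before training, let's evaluate the untrained model on the context vector `x₁`. The output is nearly uniform over the vocabulary: every entry is close to $1/N \approx 3.37\times 10^{-5}$.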

In [10]:
model(x₁)

29664×1 Matrix{Float32}:
 3.373676f-5
 3.371541f-5
 3.371067f-5
 3.372976f-5
 3.376852f-5
 3.3670567f-5
 3.37205f-5
 3.3723845f-5
 3.368726f-5
 3.3717908f-5
 ⋮
 3.369874f-5
 3.371874f-5
 3.3690172f-5
 3.3718115f-5
 3.371758f-5
 3.3685053f-5
 3.3680633f-5
 3.3687964f-5
 3.3688557f-5

In [11]:
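# cross-entropy loss for multiclass classification. The model already ends in a softmax, so it
# outputs probabilities; we use crossentropy (logitcrossentropy expects raw scores and would apply
# softmax a second time). The agg argument sets how per-sample losses are aggregated (here: mean).
loss(ŷ, y) = Flux.Losses.crossentropy(ŷ, y; agg = mean);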

Finally, we train the model on the training dataset using gradient descent with momentum.

In [None]:
trained_model = let

    localmodel = model; # alias of the model (not a copy): training updates `model` in place

    # setup the optimizer -
    λ = 0.61; # TODO: maybe change the learning rate (default: 0.61)?
    β = 0.10; # TODO: maybe change the momentum parameter (default: 0.10)?
    opt_state = Flux.setup(Momentum(λ,β), localmodel);

    # train the model - check out the do block notation: https://docs.julialang.org/en/v1/base/base/#do
    # each element of training_dataset must be an (x, y) tuple, so that the do block receives (m, x, y)
    Flux.train!(localmodel, training_dataset, opt_state) do m, x, y
       loss(m(x), y) # loss function
    end

    # return the trained model -
    localmodel
end;


___

## Lab: The Skip-Gram Model
In lab `L14b`, we'll look at the skip-gram model, which is a neural network-based approach in natural language processing designed to learn word embeddings by predicting the surrounding context words given a target word within a fixed window in a text corpus. 
* _What is it?_ A skip-gram model consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. This method effectively captures semantic relationships and contextual similarity between words, making it foundational for many downstream NLP tasks.

# Today?
That's a wrap! What are some of the interesting things we discussed today?
