# L14b: Fun with Text Embedding Models
In this lab, we'll examine two text embedding models: the Continuous Bag of Words (CBOW) and the Skip-Gram model, which are neural network models for learning word embeddings. 
* __Continuous Bag of Words (CBOW)__: This architecture predicts the target word based on its context words. It uses a shallow neural network to learn the embeddings of words in a given context. No positional information is used, and the model is trained to minimize the loss between the predicted and actual target word.
* __Skip-Gram__: This architecture inverts CBOW: given a target word, it predicts the context words around it. A skip-gram model consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. Think of reading a single word in a sentence and guessing the words that come before and after it.

See sections 2 and 3: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you hit a code issue, raise your hand, and let's get it fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training datasets for the CBOW and skip-gram models. 
* __Task 2: Build and Train a CBOW model instance (20 min)__: In this task, we build and train a Continuous Bag of Words (CBOW) model instance on a sample input sequence. We start by creating a model instance, and then we train this instance for a few epochs, and finally, we see how the model performs.
* __Task 3: Build and train a skip-gram model instance (20 min)__: In this task, we will build and train a skip-gram model instance on the sample input sequence we selected above. We start by creating a model instance, then train it for a few epochs and see how it performs.

Let's get started!
___

## Task 1: Setup, Data, Prerequisites
In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training datasets for the CBOW and skip-gram models. 

Let's start by setting up the environment, e.g., loading the required libraries and code, loading the data, and preparing it for training.

In [3]:
include("Include.jl")

Next, let's specify an example sentence, tokenize it, create a vocabulary, and an inverse vocabulary. 
* _What sentence to use?_ The example sentence we will work with can be whatever you want, as long as it consists of simple English words, no punctuation, and no control tokens.

In the code below, we chop up the `sample_sentence::String` using [the `split(...)` method](https://docs.julialang.org/en/v1/base/strings/#Base.split), which tokenizes around a specified character, in this case the `space` character. We return the `words::Array{String,1}` array, the `vocabulary::Dict{String, Int64}` and the `inverse_vocabulary::Dict{Int64, String}` dictionaries.

In [5]:
words, vocabulary, inverse_vocabulary = let 
    
    # initialize -
    vocabulary = Dict{String, Int}();
    inverse_vocabulary = Dict{Int, String}();

    # TODO: specify a sample sentence -
    sample_sentence = "The quick brown fox jumps over the lazy dog"; # Classical pangram!

    # split -
    words = split(sample_sentence, ' ') .|> String; # split on spaces; convert each SubString to a String

    # build the vocabulary -
    for (i, word) in enumerate(words)
        vocabulary[word] = i;
        inverse_vocabulary[i] = word;
    end

    # return -
    words, vocabulary, inverse_vocabulary
end;

In [6]:
inverse_vocabulary

Dict{Int64, String} with 9 entries:
  5 => "jumps"
  4 => "fox"
  6 => "over"
  7 => "the"
  2 => "quick"
  9 => "dog"
  8 => "lazy"
  3 => "brown"
  1 => "The"
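
The loop above assigns index `i` to the `i`-th token, which works here because no token repeats in the pangram (`The` and `the` differ by case, so they get separate entries). A minimal sketch of one way to handle repeated tokens, assuming we want one index per distinct token in order of first appearance; the helper name `build_vocabulary` is hypothetical and is not part of `Include.jl`:

```julia
# Hypothetical helper (not from Include.jl): one index per distinct token,
# assigned in order of first appearance.
function build_vocabulary(words::Vector{String})
    vocabulary = Dict{String, Int}()
    inverse_vocabulary = Dict{Int, String}()
    for word in words
        haskey(vocabulary, word) && continue # skip tokens that already have an index
        index = length(vocabulary) + 1
        vocabulary[word] = index
        inverse_vocabulary[index] = word
    end
    return vocabulary, inverse_vocabulary
end

tokens = String.(split("the cat saw the dog", ' ')) # "the" appears twice
vocab, inverse_vocab = build_vocabulary(tokens)
```

With this variant, the vocabulary size is `length(vocab)`, which can be smaller than `length(tokens)`.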

__Constants__: Let's set up some constants for the model. These constants will be used throughout the example codes below. See the comments in the code for more details.

In [8]:
N = length(words); # number of tokens; equals the vocabulary size here because no token repeats
windowsize = 3; # size of the context window (must be odd); also reused as the hidden-layer dimension
number_of_epochs = 10000; # number of epochs
number_digit_array = range(1, stop=N, step=1) |> collect; # list of numbers from 1 to N

__CBOW training dataset__: The `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array contains the context (input) and target word (output) pairs for the `sample_sentence::String`, generated by sliding a window of `windowsize::Int64` words along the sentence. The first element of each [`Tuple`](https://docs.julialang.org/en/v1/base/base/#Core.Tuple) is the context, and the second element is the target word. The target word is [one-hot](https://en.wikipedia.org/wiki/One-hot) encoded; the context is the sum of the one-hot vectors of the flanking words, i.e., a vector with a `1` at the index of each context word.
* _Example_: The context words (input) are the words flanking the target word. With `windowsize = 3` and `sample_sentence` = `The quick brown ...`, the first training sample has context words `The` and `brown`, with `quick` as the target word.
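
The sliding-window idea in the example can be sketched in a few lines of plain Julia. This is an illustration only: the function name `cbow_pairs` is hypothetical, and it generalizes to any odd window size, whereas the notebook cell handles the one-word-per-side case.

```julia
# Sketch: enumerate (context, target) pairs for an odd window size.
# `cbow_pairs` is a hypothetical name used only for this illustration.
function cbow_pairs(words::Vector{String}, windowsize::Int)
    isodd(windowsize) || throw(ArgumentError("windowsize must be odd"))
    half = (windowsize - 1) ÷ 2 # number of context words on each side of the target
    pairs = Vector{Tuple{Vector{String}, String}}()
    for i in (1 + half):(length(words) - half)
        context = vcat(words[(i - half):(i - 1)], words[(i + 1):(i + half)])
        push!(pairs, (context, words[i]))
    end
    return pairs
end

sentence = String.(split("The quick brown fox jumps over the lazy dog", ' '))
pairs = cbow_pairs(sentence, 3)
first(pairs) # (["The", "brown"], "quick")
```

The first and last words never appear as targets, so a nine-word sentence with `windowsize = 3` yields seven training pairs.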

In [10]:
cbow_training_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    C = windowsize - 1; # number of context words

    # build the training data -
    for i ∈ 2:(N-1)
        
        targetword = words[i]; # target word
        contextwords = words[(i-1):(i+1)] |> v-> [v[1], v[end]] # context words
        
        # process the target word -
        targetword_index = vocabulary[targetword]; # index of the target word
        y = onehot(targetword_index, number_digit_array); # one-hot encoding of the target word

        # process the context words -
        tmp = Array{Float32,2}(undef, N, C); # temporary array
        for (j,word) in enumerate(contextwords)
            contextword_index = vocabulary[word]; # index of the context word
            z = onehot(contextword_index, number_digit_array) .|> Float32; # one-hot encoding of the context word
            tmp[:, j] .= z; # store the context word
        end
        x = sum(tmp, dims=2) |> vec .|> Float32; # sum (not average) of the one-hot context vectors: a multi-hot input
        
        # store the training data -
        push!(training_dataset, (x, y)); # store the training data
    end

    # return -
    training_dataset;
end;

__Skip-gram training dataset__: The `skip_gram_training_dataset::Vector{Tuple{Vector{Float32}, Vector{Float32}}}` array is the _inverse_ of the CBOW training data. We give the network a target word (the first element of each [`Tuple`](https://docs.julialang.org/en/v1/base/base/#Core.Tuple)), and it predicts the context words flanking that target word (the second element of each `Tuple`, encoded as the sum of the flanking words' one-hot vectors).

In [12]:
skip_gram_training_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, Vector{Float32}}}();
    C = windowsize - 1; # number of context words

    # build the training data -
    for i ∈ 2:(N-1)
        
        contextword = words[i]; # center word: the network input (the CBOW target word)
        targetwords = words[(i-1):(i+1)] |> v-> [v[1], v[3]] # flanking words: what the network predicts (the CBOW context words)

        # process the center word -
        contextword_index = vocabulary[contextword]; # index of the center word
        x = onehot(contextword_index, number_digit_array) .|> Float32; # one-hot encoding of the center word

        # @show contextword, contextword_index, targetwords, x; # show the context word and the target words

        # process the flanking words -
        tmp = Array{Float32,2}(undef, N, C); # temporary array: one column per flanking word
        for (j,word) in enumerate(targetwords)
            targetword_index = vocabulary[word]; # index of the flanking word
            z = onehot(targetword_index, number_digit_array) .|> Float32; # one-hot encoding of the flanking word
            tmp[:, j] .= z; # store the encoding in column j
        end
        y = sum(tmp, dims=2) |> vec .|> Float32; # sum (not average) of the one-hot vectors: a multi-hot target
        
        # store the training data -
        push!(training_dataset, (x, y)); # store the training data
    end

    # return -
    training_dataset;
end;

In [13]:
skip_gram_training_dataset[1]

(Float32[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
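
The pair shown above can be reproduced by hand, which makes the encoding explicit: the input is the one-hot vector for `quick` (index 2), and the target has a `1` at the indices of `The` (1) and `brown` (3). This sketch uses a hand-rolled `onehot_vector` helper (a hypothetical stand-in for the `onehot` call in the notebook) so that it runs without any packages.

```julia
# Hand-rolled one-hot encoder (a stand-in for the `onehot` call used in the notebook).
onehot_vector(index::Int, n::Int) = Float32[j == index ? 1 : 0 for j in 1:n]

n = 9                     # nine tokens in the sample sentence
center_index = 2          # "quick"
flanking_indices = [1, 3] # "The" and "brown"

x = onehot_vector(center_index, n)                     # network input
y = sum(onehot_vector(k, n) for k in flanking_indices) # multi-hot target (a sum, not an average)
```

Because `y` sums to 2 rather than 1, it is not a probability distribution; a softmax output that matches it as closely as possible puts about 0.5 on each flanking word.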

## Task 2: Build and Train a CBOW model instance
In this task, we will build and train a CBOW model instance on the sample input sequence we specified above. We start by creating a model instance, then train it for a few epochs, and finally, we see how the model performs.

Let's start with building the `cbow_model::Chain` instance. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to build (and train) the model. The input layer maps a vector of vocabulary size $N_{\mathcal{V}}$ to the hidden layer, whose dimension we set equal to `windowsize::Int64`, i.e., $N_{\mathcal{V}}$ $\rightarrow$ `windowsize::Int64`. The output layer (which we run through a softmax) maps `windowsize::Int64` $\rightarrow$ $N_{\mathcal{V}}$. In both cases, we use the identity activation function, i.e., the transformations do not involve a nonlinear activation function.
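
To make the two linear layers concrete, here is a sketch of the forward pass with plain arrays: $\mathbf{h} = \mathbf{W}_{1}\mathbf{x} + \mathbf{b}_{1}$ followed by $\mathbf{p} = \text{softmax}(\mathbf{W}_{2}\mathbf{h} + \mathbf{b}_{2})$. The weights below are random placeholders (not trained Flux parameters), and `softmax_vector` is a hypothetical helper written out so the example needs no packages beyond the standard library.

```julia
using Random

# Numerically stable softmax for a single vector (illustrative helper).
function softmax_vector(z::AbstractVector)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

rng = MersenneTwister(42)
n_vocab, n_hidden = 9, 3
W1, b1 = randn(rng, Float32, n_hidden, n_vocab), zeros(Float32, n_hidden) # placeholder input layer
W2, b2 = randn(rng, Float32, n_vocab, n_hidden), zeros(Float32, n_vocab)  # placeholder output layer

x = zeros(Float32, n_vocab); x[[1, 3]] .= 1 # multi-hot context: "The" and "brown"
h = W1 * x + b1                  # hidden state (the embedding), length 3
p = softmax_vector(W2 * h + b2)  # one probability per vocabulary word, length 9
```

Since the activation is the identity and `x` is multi-hot, `h` is just the sum of the columns of `W1` selected by the context words, plus the bias.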

We save the (initially untrained) CBOW model in the `cbow_model::Chain` variable:

In [15]:
cbow_model = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # register the wrapper type with Flux (a "namespace" of sorts)
    MyModel() = MyFluxNeuralNetworkModel( # zero-argument constructor that wraps the Chain
        Chain(
            input = Dense(N, windowsize, identity),  # layer 1. Notice: identity activation function
            hidden = Dense(windowsize, N, identity), # layer 2. Notice: identity activation function
            output = NNlib.softmax) # layer 3 (output layer)
    );
    cbow_model = MyModel().chain;
end

Chain(
  input = Dense(9 => 3),                # 30 parameters
  hidden = Dense(3 => 9),               # 36 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 66 parameters, 472 bytes.

In [16]:
fieldnames(typeof(cbow_model.layers[:input]))

(:weight, :bias, :σ)

__Training__: In the code block below, we train the CBOW model instance using the `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array. We'll _minimize_ the [logitcrossentropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy) using [the `Momentum` optimizer](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers.Momentum) (all of which are exported by [the `Flux.jl` package](https://github.com/FluxML/Flux.jl)).
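
For reference, here is a sketch of what this loss computes for a single sample. Per the Flux documentation, `logitcrossentropy(ŷ, y)` is mathematically equivalent to `crossentropy(softmax(ŷ), y)`, so it expects raw scores (logits) as its first argument. The helper names below are illustrative, not Flux functions.

```julia
# Log-softmax computed stably via the log-sum-exp trick.
function logsoftmax_vector(z::AbstractVector)
    shifted = z .- maximum(z)
    return shifted .- log(sum(exp.(shifted)))
end

# Cross-entropy on logits for one sample: -Σᵢ yᵢ log(softmax(z)ᵢ)
logit_cross_entropy(z, y) = -sum(y .* logsoftmax_vector(z))

z = Float32[2.0, 0.5, -1.0] # raw scores (logits) for a 3-word toy vocabulary
y = Float32[1.0, 0.0, 0.0]  # one-hot target
loss_value = logit_cross_entropy(z, y)
```

With `agg = mean`, Flux averages this per-sample quantity over the samples in a batch.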

In [18]:
trained_cbow_model = let

    localmodel = deepcopy(cbow_model); # make a local copy of the model

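    # setup the loss function -
    # agg = mean averages the loss over the samples in a batch.
    # NOTE: logitcrossentropy expects raw scores and applies a softmax internally; since the
    # model already ends in NNlib.softmax, the softmax is effectively applied twice here.
    loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers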

    # setup the optimizer
    λ = 0.64; # TODO: maybe change the learning rate (default: 0.61)?
    β = 0.10; # TODO: maybe change the momentum parameter (default: 0.10)?
    opt_state = Flux.setup(Momentum(λ,β), localmodel);

    # training loop -
    for i ∈ 1:number_of_epochs
        # train the model - check out the do block notion: https://docs.julialang.org/en/v1/base/base/#do
        Flux.train!(localmodel, cbow_training_dataset, opt_state) do m, x, y
            loss(m(x), y) # loss function
        end

        # output for the user -
        if (rem(i,1000) == 0)
            println("Epoch $i of $number_of_epochs completed") # print the epoch number
        end
    end

    # return the trained model -
    localmodel;
end

Epoch 1000 of 10000 completed
Epoch 2000 of 10000 completed
Epoch 3000 of 10000 completed
Epoch 4000 of 10000 completed
Epoch 5000 of 10000 completed
Epoch 6000 of 10000 completed
Epoch 7000 of 10000 completed
Epoch 8000 of 10000 completed
Epoch 9000 of 10000 completed
Epoch 10000 of 10000 completed


Chain(
  input = Dense(9 => 3),                # 30 parameters
  hidden = Dense(3 => 9),               # 36 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 66 parameters, 472 bytes.

Let's give the CBOW model a few inputs and see what it predicts. If we give it the original context, it should return the original target word. 
* _What gets returned?_ The network will return $p(w_{i}|\mathbf{x})$, the probability of each word in the vocabulary being the target word.

In [20]:
(x,y,word) = let
    
    x = cbow_training_dataset[1][1]; # context vector of the first training example
    y = trained_cbow_model(x); # predicted probability of each vocabulary word
    word = y |> argmax |> i-> inverse_vocabulary[i]; # most probable word (index -> word lookup)

    (x,y,word) # return the values
end;

In [21]:
x |> x-> findall(x-> x!= 0.0, x) .|> i-> inverse_vocabulary[i] # find the words in the context

2-element Vector{String}:
 "The"
 "brown"

In [22]:
word

"quick"

__What does the embedding look like?__ Let's compute the hidden state $\mathbf{h}$ (a low-dimensional embedded representation corresponding to our target word) for each target word in our sample sentence. We'll store these in the `CBOW_embedding_dictionary::Dict{String, Array{Float32,1}}` dictionary.
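
Once computed, embedding vectors are commonly compared using cosine similarity. A minimal sketch, using three of the vectors from our run of this notebook (your values will differ, since they depend on the random initialization of the weights); `cosine_similarity` is an illustrative helper, not something provided by `Include.jl`:

```julia
using LinearAlgebra

# Three embeddings copied from one run of this notebook (values vary between runs).
embeddings = Dict(
    "brown" => Float32[-4.12924, -0.591874, -2.55188],
    "quick" => Float32[4.82387, 2.12251, 2.98309],
    "fox"   => Float32[4.56134, -2.00698, -1.09246],
)

# Cosine of the angle between two vectors: 1 = same direction, -1 = opposite direction.
cosine_similarity(a, b) = dot(a, b) / (norm(a) * norm(b))

sim_quick_fox = cosine_similarity(embeddings["quick"], embeddings["fox"])
sim_quick_brown = cosine_similarity(embeddings["quick"], embeddings["brown"])
```

With a single nine-word training sentence, these similarities reflect how the network separated the seven training examples rather than any real semantic relationship between the words.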

In [24]:
CBOW_embedding_dictionary = let

    # initialize -
    embedding_dictionary = Dict{String, Array{Float32,1}}();
    number_of_training_examples = length(cbow_training_dataset);

    # get the parameters from the trained model -
    W₁ = trained_cbow_model.layers.input.weight
    b₁ = trained_cbow_model.layers.input.bias
    W₂ = trained_cbow_model.layers.hidden.weight
    b₂ = trained_cbow_model.layers.hidden.bias

    # let's compute all the embeddings -
    for i ∈ 1:number_of_training_examples

        # what is h?
        x = cbow_training_dataset[i][1]; # context vector of the i-th training example
        h = W₁*x + b₁;

        # what is the key (target word) ?
        pᵢ = W₂*h + b₂ |> u -> NNlib.softmax(u);
        key = pᵢ |> argmax |> j -> inverse_vocabulary[j];

        embedding_dictionary[key] = h;
    end
        
    # return -
    embedding_dictionary;
end;

In [25]:
CBOW_embedding_dictionary

Dict{String, Vector{Float32}} with 7 entries:
  "brown" => [-4.12924, -0.591874, -2.55188]
  "jumps" => [-2.11527, 4.3435, -3.82844]
  "lazy"  => [-3.41541, -1.97225, 4.19899]
  "the"   => [-1.42853, 3.7855, 1.36336]
  "quick" => [4.82387, 2.12251, 2.98309]
  "fox"   => [4.56134, -2.00698, -1.09246]
  "over"  => [-0.0999005, -4.99337, 0.041354]

## Task 3: Build and train a skip-gram model instance
In this task, we will build and train a skip-gram model instance on the sample input sequence we selected above. We start by creating a model instance, and then we train this instance for a few epochs, and see how the model performs. 

Let's start with building the `skip_gram_model::Chain` instance. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to build (and train) the model. The input layer maps a vector of vocabulary size $N_{\mathcal{V}}$ to the hidden layer, whose dimension we set equal to `windowsize::Int64`, i.e., $N_{\mathcal{V}}$ $\rightarrow$ `windowsize::Int64`. The output layer (which we run through a softmax) maps `windowsize::Int64` $\rightarrow$ $N_{\mathcal{V}}$. In both cases, we use the identity activation function, i.e., the transformations do not involve a nonlinear activation function.

We save the (initially untrained) skip gram model in the `skip_gram_model::Chain` variable:

In [27]:
skip_gram_model = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # register the wrapper type with Flux (a "namespace" of sorts)
    MyModel() = MyFluxNeuralNetworkModel( # zero-argument constructor that wraps the Chain
        Chain(
            input = Dense(N, windowsize, identity),  # layer 1
            hidden = Dense(windowsize, N, identity), # layer 2
            output = NNlib.softmax) # layer 3 (output layer)
    );
    skip_gram_model = MyModel().chain;
end

Chain(
  input = Dense(9 => 3),                # 30 parameters
  hidden = Dense(3 => 9),               # 36 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 66 parameters, 472 bytes.

__Training__: In the code block below, we train the skip-gram model instance using the `skip_gram_training_dataset::Vector{Tuple{Vector{Float32}, Vector{Float32}}}` array. We'll _minimize_ the [logitcrossentropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy) using [the `AdaBelief` optimizer](https://arxiv.org/abs/2010.07468) in place of the `Momentum` optimizer we used for the CBOW model (all of which are exported by [the `Flux.jl` package](https://github.com/FluxML/Flux.jl)).

In [95]:
trained_skip_gram_model = let

    localmodel = deepcopy(skip_gram_model); # make a local copy of the model

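    # setup the loss function -
    # agg = mean averages the loss over the samples in a batch.
    # NOTE: logitcrossentropy expects raw scores and applies a softmax internally; since the
    # model already ends in NNlib.softmax, the softmax is effectively applied twice here.
    loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers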

    # setup the optimizer
    λ = 0.61; # learning rate: not used below, since AdaBelief() runs with its default settings
    β = 0.10; # momentum parameter: not used below (TODO: to experiment, try Momentum(λ,β) instead)
    opt_state = Flux.setup(AdaBelief(), localmodel); # changed Momentum to AdaBelief: https://arxiv.org/abs/2010.07468

    # training loop -
    for i ∈ 1:number_of_epochs
        # train the model - check out the do block notion: https://docs.julialang.org/en/v1/base/base/#do
        Flux.train!(localmodel, skip_gram_training_dataset, opt_state) do m, x, y
            loss(m(x), y) # loss function
        end

        # some output for the user .... short attention span ...
        if (rem(i,1000) == 0)
            println("Epoch $i of $number_of_epochs completed") # print the epoch number
        end
    end

    # return the trained model -
    localmodel;
end

Epoch 1000 of 10000 completed
Epoch 2000 of 10000 completed
Epoch 3000 of 10000 completed
Epoch 4000 of 10000 completed
Epoch 5000 of 10000 completed
Epoch 6000 of 10000 completed
Epoch 7000 of 10000 completed
Epoch 8000 of 10000 completed
Epoch 9000 of 10000 completed
Epoch 10000 of 10000 completed


Chain(
  input = Dense(9 => 3),                # 30 parameters
  hidden = Dense(3 => 9),               # 36 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 66 parameters, 472 bytes.

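Let's give the skip-gram model an input and see what it predicts. If we give it a center word from the training data, it should put roughly equal probability (about 0.5 each) on the two words that flank it in the sentence. In the matrix below, the columns are the input, the expected output, and the model output.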

In [109]:
(x₂,y₂,ŷ₂) = let
    
    example_index = 4;
    skip_gram_context = skip_gram_training_dataset[example_index][1]; # input: the one-hot center word of training example 4
    skip_gram_target_actual = skip_gram_training_dataset[example_index][2]; # what we *should* see: the multi-hot flanking words
    skip_gram_target_model = trained_skip_gram_model(skip_gram_context); # what the model actually predicts

    (skip_gram_context, skip_gram_target_actual, skip_gram_target_model)
end;

In [111]:
[x₂ y₂ ŷ₂]

9×3 Matrix{Float32}:
 0.0  0.0  9.39936f-14
 0.0  0.0  6.20029f-7
 0.0  0.0  2.60902f-13
 0.0  1.0  0.49972
 1.0  0.0  8.35616f-8
 0.0  1.0  0.500279
 0.0  0.0  2.58987f-9
 0.0  0.0  8.15599f-7
 0.0  0.0  8.11161f-9
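
To turn the probabilities in the third column back into words, we can take the indices of the two largest entries and look them up in the inverse vocabulary. A sketch using values copied from the output above, so that it runs on its own:

```julia
# Model output copied from the third column above (values vary between runs).
ŷ = Float32[9.39936f-14, 6.20029f-7, 2.60902f-13, 0.49972, 8.35616f-8,
            0.500279, 2.58987f-9, 8.15599f-7, 8.11161f-9]

# Inverse vocabulary for the sample sentence (index -> word).
inverse_vocabulary = Dict(1 => "The", 2 => "quick", 3 => "brown", 4 => "fox", 5 => "jumps",
                          6 => "over", 7 => "the", 8 => "lazy", 9 => "dog")

top_two = partialsortperm(ŷ, 1:2; rev = true) # indices of the two largest probabilities
predicted_flanking_words = [inverse_vocabulary[i] for i in top_two]
```

The input was word 5 (`jumps`), and the two most probable outputs are `over` and `fox`, the words on either side of it in the sentence.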

## What's coming up next time?
In lecture `L14c`, we'll look at [the attention mechanism](https://en.wikipedia.org/wiki/Attention_(machine_learning)) that underpins most [large language models (LLMs)](https://en.wikipedia.org/wiki/Large_language_model). Want to get ahead?
* __Check out__: [Vaswani, Ashish, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin. “Attention is All You Need.” Neural Information Processing Systems (2017).](https://arxiv.org/abs/1706.03762)