# L14d: Understanding the Self-Attention Mechanism
In this lab, we build toward the self-attention mechanism. We first learn word embeddings with a Continuous Bag of Words (CBOW) model, and then use those embeddings to look at attention in a modern Hopfield network and in scaled dot-product attention. CBOW and its sibling, the Skip-Gram model, are shallow neural network models for learning word embeddings:
* __Continuous Bag of Words (CBOW)__: This architecture predicts the target word from its context words. It uses a shallow neural network to learn the embeddings of words in a given context. No positional information is used, and the model is trained to minimize the loss between the predicted and actual target word.
* __Skip-Gram__: A skip-gram model reverses the CBOW task: given a target word, it predicts the context words that come before and after it. It consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. We describe it here for context; the lab itself trains only the CBOW model.

See sections 2 and 3: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)
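
To make the difference between the two models concrete, here is a minimal sketch of the training pairs each one would see. The toy sentence and the names `tokens`, `cbow_pairs`, and `skipgram_pairs` are assumptions for illustration only; they are not part of the lab code.

```julia
# Toy sentence (hypothetical), already tokenized
tokens = ["the", "quick", "brown", "fox", "jumps"];
Δ = 1; # half-width of the context window (window size = 2Δ + 1 = 3)

# CBOW: (context words) -> target word, one pair per target position
cbow_pairs = [(vcat(tokens[i-Δ:i-1], tokens[i+1:i+Δ]), tokens[i])
    for i ∈ (1+Δ):(length(tokens)-Δ)];

# Skip-gram: target word -> context word, one pair per (target, context) combination
skipgram_pairs = [(tokens[i], tokens[j])
    for i ∈ (1+Δ):(length(tokens)-Δ) for j ∈ (i-Δ):(i+Δ) if j != i];

# First CBOW pair: context ["the", "brown"], target "quick"
# First two skip-gram pairs: ("quick", "the") and ("quick", "brown")
```

With a window of three words, CBOW produces one training example per target word, while skip-gram produces two.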

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you hit any issues, raise your hand and we'll get them fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training dataset for a CBOW model, which we then train to compute an embedding for each word.
* __Task 2: Look at Attention in a Modern Hopfield Network (20 min)__: In this task, we store the word embeddings as memories in a modern Hopfield network and try to recover a memory starting from one of the embedded words.
* __Task 3: Scaled Dot-Product Attention (20 min)__: In this task, we implement single query scaled dot-product attention over the word embeddings and compute the attention weights and the weighted sum of the value vectors.

Let's get started!
___

## Task 1: Setup, Data, Prerequisites
In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training dataset for the CBOW model.

Let's start by setting up the environment, e.g., loading the required libraries and code, loading the data, and preparing it for training. 

In [1]:
include("Include.jl")

Next, let's specify an example sentence, tokenize it, create a vocabulary, and an inverse vocabulary. 
* _What sentence to use?_ The example sentence we will work with can be whatever you want, as long as it consists of simple English words, no punctuation, and no control tokens.

In the code below, we chop up the `sample_sentence::String` using [the `split(...)` method](https://docs.julialang.org/en/v1/base/strings/#Base.split), which tokenizes around a specified character, in this case the `space` character. We return the `words::Array{String,1}` array, the `vocabulary::Dict{String, Int64}` and the `inverse_vocabulary::Dict{Int64, String}` dictionaries.

In [2]:
words, vocabulary, inverse_vocabulary = let 
    
    # initialize -
    vocabulary = Dict{String, Int}();
    inverse_vocabulary = Dict{Int, String}();

    # TODO: specify a sample sentence -
    sample_sentence = "<bos> The quick brown fox jumps over the lazy dog my sample sentence and other stuff goes here <eos>"; # classic pangram plus filler words; every token is distinct ("The" and "the" differ by case)

    # split -
    words = split(sample_sentence, ' ') .|> String; # no external ordering

    # build the vocabulary -
    for (i, word) in enumerate(words)
        vocabulary[word] = i;
        inverse_vocabulary[i] = word;
    end

    # return -
    words, vocabulary, inverse_vocabulary
end;
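
The loop above assigns each word the index of its position in the sentence, which works because every token in the sample sentence is distinct. If a word were repeated, its later position would overwrite the earlier index and the vocabulary would have fewer entries than `length(words)`. A minimal sketch of one way to handle repeats, using a hypothetical sentence and hypothetical variable names (`tokens`, `vocab`, `inverse_vocab`, `token_ids`):

```julia
tokens = String.(split("the cat saw the dog", ' ')); # hypothetical sentence with a repeated word

vocab = Dict{String, Int}();
inverse_vocab = Dict{Int, String}();
for word ∈ unique(tokens)          # unique(...) keeps the first-occurrence order
    index = length(vocab) + 1;     # next free index
    vocab[word] = index;
    inverse_vocab[index] = word;
end

token_ids = [vocab[w] for w ∈ tokens]; # the sentence as a sequence of vocabulary indices
```

Here the vocabulary has four entries for a five-word sentence, and both occurrences of `the` map to index `1`.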

In [3]:
words

19-element Vector{String}:
 "<bos>"
 "The"
 "quick"
 "brown"
 "fox"
 "jumps"
 "over"
 "the"
 "lazy"
 "dog"
 "my"
 "sample"
 "sentence"
 "and"
 "other"
 "stuff"
 "goes"
 "here"
 "<eos>"

__Constants__: Let's set up some constants for the model. These constants will be used throughout the example codes below. See the comments in the code for more details.

In [None]:
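N = length(words); # size of the vocabulary (valid because every word in the sentence is distinct)
windowsize = 3; # size of the context window (must be odd); also used as the hidden (embedding) dimension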
number_of_epochs = 10000; # number of epochs
number_digit_array = range(1, stop=N, step=1) |> collect; # list of numbers from 1 to N
β = 0.9; # Inverse temperature of the system. 


### Compute the CBOW embedding
In this task, we will build and train a CBOW model instance on the sample input sequence we specified above. We start by creating a model instance, then train it for a few epochs, and finally, we see how the model performs.

Let's start with building the `cbow_model::Chain` instance. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to build (and train) the model. The input layer maps the vocabulary size $N_{\mathcal{V}}$ $\rightarrow$ `windowsize::Int64`, which we reuse as the hidden layer dimension. The output layer (which we run through a softmax) maps `windowsize::Int64` $\rightarrow$ $N_{\mathcal{V}}$. In both cases, we use the identity activation function, i.e., the transformations do not involve a nonlinear activation function.

We save the (initially untrained) CBOW model in the `cbow_model::Chain` variable:

In [5]:
cbow_model = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # create a "namespace" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(N, windowsize, identity),  # layer 1. Notice: identity activation function
            hidden = Dense(windowsize, N, identity), # layer 2. Notice: identity activation function
            output = NNlib.softmax) # layer 3 (output layer)
    );
    cbow_model = MyModel().chain;
end

Chain(
  input = Dense(19 => 3),               # 60 parameters
  hidden = Dense(3 => 19),              # 76 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 136 parameters, 752 bytes.

In [6]:
fieldnames(typeof(cbow_model.layers[:hidden]))

(:weight, :bias, :σ)
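
To see what the two `Dense` layers and the softmax compute, here is a dependency-free sketch of the forward pass. The sizes, the random weights, and the names `mysoftmax`, `N_V`, and `d` are assumptions for illustration; this is not how `Flux.jl` implements the layers internally. The parameter counts in the output above follow the same shapes: $19\times 3 + 3 = 60$ for the input layer and $3\times 19 + 19 = 76$ for the hidden layer.

```julia
# numerically stable softmax, written out so the sketch needs no packages
function mysoftmax(u)
    e = exp.(u .- maximum(u));
    return e ./ sum(e)
end

N_V, d = 5, 3;                                         # hypothetical vocabulary size and hidden dimension
W₁, b₁ = randn(Float32, d, N_V), zeros(Float32, d);    # input layer: N_V -> d
W₂, b₂ = randn(Float32, N_V, d), zeros(Float32, N_V);  # hidden layer: d -> N_V

x = Float32[1, 0, 1, 0, 0];   # context: sum of the one-hot vectors for words 1 and 3
h = W₁ * x + b₁;              # hidden state (the embedding), identity activation
p = mysoftmax(W₂ * h + b₂);   # probability of each vocabulary word being the target
```

Because `x` is a sum of one-hot vectors, `W₁ * x` simply adds the columns of `W₁` that belong to the context words, so the hidden state is the sum of the context-word embeddings plus the bias.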

__CBOW training dataset__: The `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array contains the context (input) and target word (output) for the `sample_sentence::String`, where we slide a window of `windowsize::Int64` words along the sample string. The first element of [the `Tuple`](https://docs.julialang.org/en/v1/base/base/#Core.Tuple) stored in the training data holds the context words, while the second element holds the target word. The target word is encoded in [one-hot](https://en.wikipedia.org/wiki/One-hot) format, and the context is the sum of the one-hot vectors of the context words.
* _Example_: The context words (input) are the words flanking the target word. Suppose `windowsize = 3` and the `sample_sentence` is `The quick brown ...`. Then the first training sample has the context words `The` and `brown`, with `quick` as the target word.

In [7]:
cbow_training_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    C = windowsize - 1; # number of context words
    startindex = Int(C/2) + 1; # start index
    endindex = N - Int(C/2); # end index
    Δ = Int(C/2); # half the context window size

    # build the training data -
    for i ∈ startindex:endindex

        # get the target word -
        target_word = words[i]; # target word
        target_word_index = vocabulary[target_word]; # target word index
        y = onehot(target_word_index, number_digit_array); # target word one-hot vector

        # Build the context list -
        context_words_list = Vector{Vector{Float32}}();
        context_index_array = range(i - Δ, stop=i + Δ, step=1) |> collect; # context index array
        for j ∈ context_index_array

            # get the context word -
            if j == i
                continue; # skip the target word
            end

            # get the context word -
            context_word = words[j]; # context word
            context_word_index = vocabulary[context_word]; # context word index
            context_word_onehot = onehot(context_word_index, number_digit_array); # context word one-hot vector
            push!(context_words_list, context_word_onehot); # add to the list of context words

        end
        x = sum(context_words_list, dims=1) |> vec; # sum (not concatenate) the one-hot context vectors; the result is wrapped in a 1-element array
        
        data_tuple = (x[1], y); # x[1] is the summed context vector
        push!(training_dataset, data_tuple); # add to the training dataset
    end

    # return -
    training_dataset;
end;
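
The same sliding-window construction can be sketched on a toy sentence using plain vectors instead of the `onehot(...)` helper. Everything here (`toy_tokens`, `toy_windowsize`, `onehot_vector`, `toy_dataset`) is a hypothetical stand-in chosen so that the example runs without any packages.

```julia
toy_tokens = ["a", "b", "c", "d", "e"];  # hypothetical 5-word sentence, all words distinct
n = length(toy_tokens);
toy_windowsize = 3;                       # must be odd
Δ = (toy_windowsize - 1) ÷ 2;             # number of context words on each side

# one-hot encoding as a plain Float32 vector
onehot_vector(i, n) = Float32[j == i ? 1 : 0 for j ∈ 1:n];

toy_dataset = Tuple{Vector{Float32}, Vector{Float32}}[];
for i ∈ (1+Δ):(n-Δ)
    context = sum(onehot_vector(j, n) for j ∈ (i-Δ):(i+Δ) if j != i); # summed one-hot context
    target = onehot_vector(i, n);                                      # one-hot target
    push!(toy_dataset, (context, target));
end
```

The first sample has context `[1, 0, 1, 0, 0]` (words `a` and `c`) and target `[0, 1, 0, 0, 0]` (word `b`). The first and last words never appear as targets, which is why a sentence of `n` words yields `n - 2` samples when the window size is three.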

__Training__: In the code block below, we train the CBOW model instance using the `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array. We'll _minimize_ the [logitcrossentropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy) using [the `Momentum` optimizer](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers.Momentum) (all of which are exported by [the `Flux.jl` package](https://github.com/FluxML/Flux.jl)).

In [8]:
trained_cbow_model = let

    localmodel = deepcopy(cbow_model); # make a local copy of the model

    # setup the loss function -
    loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers, what is the agg?

    # setup the optimizer
    λ = 0.64; # TODO: maybe change the learning rate (default: 0.61)?
    β = 0.10; # TODO: maybe change the momentum parameter (default: 0.10)?
    opt_state = Flux.setup(Momentum(λ,β), localmodel);

    # training loop -
    for i ∈ 1:number_of_epochs
        # train the model - check out the do block notion: https://docs.julialang.org/en/v1/base/base/#do
        Flux.train!(localmodel, cbow_training_dataset, opt_state) do m, x, y
            loss(m(x), y) # loss function
        end

        # output for the user -
        if (rem(i,1000) == 0)
            @show "Epoch $i of $number_of_epochs completed" # print the epoch number
        end
    end

    # return the trained model -
    localmodel;
end

"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 1000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 2000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 3000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 4000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 5000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 6000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 7000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 8000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 9000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 10000 of 10000 completed"


Chain(
  input = Dense(19 => 3),               # 60 parameters
  hidden = Dense(3 => 19),              # 76 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 136 parameters, 752 bytes.
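
One detail of the training setup is worth a closer look. `logitcrossentropy` expects raw logits and applies the softmax itself, while our model already ends in a softmax layer, so the softmax is effectively applied twice during training. The sketch below uses hand-written stand-ins (`stable_softmax`, `crossentropy_from_probs`, `crossentropy_from_logits` are hypothetical names, not the `Flux.jl` functions) to show the effect on the loss value.

```julia
function stable_softmax(u)
    e = exp.(u .- maximum(u)); # subtract the max for numerical stability
    return e ./ sum(e)
end

crossentropy_from_probs(p, y) = -sum(y .* log.(p));                              # expects probabilities
crossentropy_from_logits(z, y) = crossentropy_from_probs(stable_softmax(z), y);  # expects logits

z = Float32[4.0, 1.0, -2.0];  # hypothetical logits (pre-softmax scores)
y = Float32[1.0, 0.0, 0.0];   # one-hot target
p = stable_softmax(z);        # what a model ending in softmax would output

loss_once  = crossentropy_from_logits(z, y); # softmax applied once
loss_twice = crossentropy_from_logits(p, y); # softmax applied twice: p is treated as logits
```

The softmax preserves the ordering of its inputs, so the predicted word (the `argmax`) is unchanged. However, probabilities lie in $[0,1]$, so the second softmax can never become sharply peaked and the doubled loss has a nonzero floor: $\log(1 + 2/e)\approx 0.55$ for three classes. Removing the final softmax layer during training, or switching to a loss that expects probabilities, would avoid this, at the cost of changing the numbers shown in this notebook.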

Let's give the CBOW model a few inputs and see what it predicts. If we give it the original context, it should return the original target word. 
* _What gets returned?_ The network will return $p(w_{i}|\mathbf{x})$, the probability of each word in the vocabulary being the target word. 

In [9]:
(x,y,word) = let
    
    x = cbow_training_dataset[1][1]; # first training data
    y = trained_cbow_model(x);
    word = y |> argmax |> i-> inverse_vocabulary[i]; # index of the word

    (x,y,word) # return the values
end;

In [10]:
x |> x-> findall(x-> x!= 0.0, x) .|> i-> inverse_vocabulary[i] # find the words in the context

2-element Vector{String}:
 "<bos>"
 "quick"

In [11]:
word

"The"

__What does the embedding look like?__ Let's compute the hidden state $\mathbf{h}$ (a low-dimensional embedded representation corresponding to our target word) for each target word in our sample sentence. We'll store these in the `CBOW_embedding_dictionary::Dict{String, Array{Float32,1}}` dictionary.

In [12]:
CBOW_embedding_dictionary = let

    # initialize -
    embedding_dictionary = Dict{String, Array{Float32,1}}();
    number_of_training_examples = length(cbow_training_dataset);

    # get the parameters from the trained model -
    W₁ = trained_cbow_model.layers.input.weight
    b₁ = trained_cbow_model.layers.input.bias
    W₂ = trained_cbow_model.layers.hidden.weight
    b₂ = trained_cbow_model.layers.hidden.bias

    # let's compute all the embeddings -
    for i ∈ 1:number_of_training_examples

        # what is h?
        x = cbow_training_dataset[i][1]; # first training data
        h = W₁*x + b₁;

        # what is the key (target word) ?
        pᵢ = W₂*h + b₂ |> u -> NNlib.softmax(u);
        key = pᵢ |> argmax |> j -> inverse_vocabulary[j];

        embedding_dictionary[key] = h;
    end
        
    # return -
    embedding_dictionary;
end;

In [13]:
CBOW_embedding_dictionary

Dict{String, Vector{Float32}} with 17 entries:
  "quick"    => [2.42246, -3.12691, 3.06027]
  "my"       => [-0.872877, 3.0572, 8.28969]
  "over"     => [-10.1697, 3.05824, 0.111518]
  "and"      => [5.08242, 6.8755, 3.15755]
  "here"     => [8.19845, -3.7553, -0.896546]
  "sample"   => [0.30528, 1.30486, 0.801785]
  "stuff"    => [10.9274, 2.12803, 2.49473]
  "jumps"    => [-2.36542, -7.34619, -2.66204]
  "the"      => [-0.548648, -2.67539, -9.34289]
  "fox"      => [-5.1697, -2.0258, 2.63307]
  "dog"      => [1.96168, -2.99209, -3.7056]
  "brown"    => [-2.98576, -8.92488, 3.29007]
  "The"      => [-4.23359, -1.91239, -2.62269]
  "goes"     => [4.06902, 2.58961, -3.30548]
  "lazy"     => [-4.83329, 3.76046, 5.12293]
  "other"    => [-0.2255, 6.22819, -7.1791]
  "sentence" => [-2.64456, 6.17913, -0.985498]

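__Embedding matrix__: Next, we collect the embeddings into the `X::Array{Float32,2}` matrix, one word per row, in sentence order. We skip the `<bos>` and `<eos>` tokens, which never appear as target words when `windowsize = 3`, and store the corresponding words in the `embedded_words_list::Vector{String}` array.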

In [62]:
X, embedded_words_list = let
    
    # initialize -
    X = zeros(Float32, N-2, windowsize); # matrix of zeros
    embedded_words_list = Vector{String}();

    # iterate over the words -
    linearindex = 1;
    for word ∈ words[2:end-1]
        # get the embedding -
        embedding = CBOW_embedding_dictionary[word]; # get the embedding
        X[linearindex, :] = embedding; # add to the matrix
        linearindex += 1; # increment the index

        push!(embedded_words_list, word); # add to the list of embedded words
    end

    # return -
    X, embedded_words_list;
end;

In [63]:
embedded_words_list

17-element Vector{String}:
 "The"
 "quick"
 "brown"
 "fox"
 "jumps"
 "over"
 "the"
 "lazy"
 "dog"
 "my"
 "sample"
 "sentence"
 "and"
 "other"
 "stuff"
 "goes"
 "here"

## Task 2: Look at Attention in a Modern Hopfield Network
In this task, we revisit the Hopfield network, and in particular its modern variant, which is a recurrent neural network (RNN) that can be used as an associative memory. 

A modern Hopfield network addresses many of the perceived limitations of the original Hopfield network. The original Hopfield network was limited to binary values and could only store a limited number of patterns. The modern Hopfield network uses continuous values and can store a large number of patterns.
* For a detailed discussion of the key milestones in the development of modern Hopfield networks, check out [Hopfield Networks is All You Need Blog, GitHub.io](https://ml-jku.github.io/hopfield-layers/)

### Algorithm
The user provides a set of memory vectors $\mathbf{X} = \left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{m}\right\}$, where $\mathbf{x}_{i} \in \mathbb{R}^{n}$ is a memory vector of size $n$ and $m$ is the number of memory vectors. Further, the user provides an initial _partial memory_ $\mathbf{s}_{\circ} \in \mathbb{R}^{n}$, a vector of size $n$ that is a partial version of one of the memory vectors, and specifies the _inverse temperature_ $\beta$ of the system.

__Initialize__ the network with the memory vectors $\mathbf{X}$, and the inverse temperature $\beta$. Set current state to the initial state $\mathbf{s} \gets \mathbf{s}_{\circ}$

Until convergence __do__:
   1. Compute the _current_ probability vector defined as $\mathbf{p} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s})$ where $\mathbf{s}$ is the _current_ state vector, and $\mathbf{X}^{\top}$ is the transpose of the memory matrix $\mathbf{X}$.
   2. Compute the _next_ state vector $\mathbf{s}^{\prime} = \mathbf{X}\mathbf{p}$ and the _next_ probability vector $\mathbf{p}^{\prime} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s}^{\prime})$.
   3. If $\mathbf{p}^{\prime}$ is _close_ to $\mathbf{p}$ or we run out of iterations, then __stop__. For example, $\lVert \mathbf{p}^{\prime} - \mathbf{p}\rVert_{2}^{2} \leq \epsilon$ for some small $\epsilon > 0$.
   4. Otherwise, update the state $\mathbf{s} \gets\mathbf{s}^{\prime}$, and __go back to__ step 1.

   
This algorithm is implemented in [the `recover(...)` method](src/Compute.jl).
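
The steps above can be sketched in a few lines of plain Julia. This is a minimal illustration under stated assumptions, not the course's `recover(...)` implementation: the function name `recover_sketch`, its signature, the toy memories, and the value of `β` are all hypothetical.

```julia
using LinearAlgebra

function stable_softmax(u)
    e = exp.(u .- maximum(u)); # subtract the max for numerical stability
    return e ./ sum(e)
end

# Hypothetical helper: X holds one memory per column (n × m), s₀ is the initial state
function recover_sketch(X::Matrix{Float64}, s₀::Vector{Float64};
    β::Float64 = 1.0, maxiterations::Int = 1000, ϵ::Float64 = 1e-10)

    s = copy(s₀);
    p = stable_softmax(β * (transpose(X) * s));            # step 1: current probability vector
    for _ ∈ 1:maxiterations
        s_next = X * p;                                     # step 2: next state ...
        p_next = stable_softmax(β * (transpose(X) * s_next)); # ... and next probability vector
        converged = norm(p_next - p)^2 ≤ ϵ;                 # step 3: stopping test
        s, p = s_next, p_next;                              # step 4: update the state
        converged && break;
    end
    return s, p
end

memories = [1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0; 1.0 1.0 0.0]; # three toy memories in R⁴ (columns)
s₀ = [0.9, 0.1, 0.0, 0.8];                                        # noisy version of memory 1
s, p = recover_sketch(memories, s₀, β = 8.0);
```

With these toy memories and a fairly large `β`, the probability mass concentrates on the first memory and the state `s` ends up close to it. Smaller values of `β` flatten the softmax, so the recovered state tends to be a blend of several memories rather than a single one.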

### Implementation
Let's start by creating a model of a modern Hopfield network. 
* We'll construct [a `MyModernHopfieldNetworkModel` instance](src/Types.jl) using a custom [`build(...)` function](src/Factory.jl). The [`build(...)` method](src/Factory.jl) takes the type of thing we want to build, the memories we want to encode (here, the word embeddings, one per column), and the (inverse) system temperature $\beta$ as inputs.
* The [`build(...)` function](src/Factory.jl) returns a `MyModernHopfieldNetworkModel` instance, where the memories are stored in the `X::Array{Float32,2}` field, and the inverse temperature is stored in the `β::Float64` field.

We'll store the problem instance in the `model::MyModernHopfieldNetworkModel` variable.

In [52]:
model = let

    # initialize -
    number_of_words_to_learn = size(X,1);
    linearwordcollection = Array{Float32,2}(undef, windowsize, number_of_words_to_learn); # words on columns
    index_vector = range(1,stop=number_of_words_to_learn, step=1) |> collect; # word indices, processed in order

    # populate the data array that we give to the model
    for k ∈ eachindex(index_vector)
        j = index_vector[k]; # which word will we load?
        sₖ = X[j,:]; # embedding of word j
        
        for i ∈ 1:windowsize
            linearwordcollection[i,k] = sₖ[i];  # fill column k of the array with the embedding
        end
    end
    
    # build model -
    model = build(MyModernHopfieldNetworkModel, (
            memories = linearwordcollection, # this is the data we want to memorize. Word embeddings on columns
            β = β, # Inverse temperature of the system. A big beta means we are more likely to get the right answer
    ));

    model; # return the model to the calling scope
end;

In [53]:
model.β

0.9

We implemented the modern Hopfield recovery algorithm above in [the `recover(...)` method](src/Compute.jl). This method takes our `model::MyModernHopfieldNetworkModel` instance, the initial configuration vector `sₒ::Array{Float32,1}`, the maximum number of iterations `maxiterations::Int64`, and the iteration tolerance parameter `ϵ::Float64`. 
* [The `recover(...)` method](src/Compute.jl) returns the recovered word in the `s₁::Array{Float32,1}` variable, the word at each iteration in the `f::Dict{Int, Array{Float32,2}}` dictionary, and the probability of the word at each iteration in the `p::Dict{Int, Array{Float32,2}}` variable. The frames and probability dictionaries are indexed from `0`.

In [87]:
starting_word_index, s₁, f, p = let 
    
    starting_word_index = 5; # index of the starting word
    sₒ = X[starting_word_index,:]; # initial state
    (s₁,f,p) = recover(model, sₒ, maxiterations = 10000, ϵ = 1e-16); # iterate until we hit stop condition

    # return -
    starting_word_index, s₁,f,p
end;

In [88]:
println("How many iterations: $(length(f))") # how many iterations did we need to converge?

How many iterations: 6


__Check__: Let's check to see if the recovered word is identical to the original word (not guaranteed). We can do this by checking the `s₁::Array{Float32,1}` variable against the original word.

In [89]:
let
    n = length(f) - 1; # number of iterations (last index)
    recovered_memory = p[n] |> argmax |> i-> embedded_words_list[i] # index of the most probable word
    starting_word = embedded_words_list[starting_word_index]; # starting word
    println("Starting word: $starting_word and recovered word: $recovered_memory");
end

Starting word: jumps and recovered word: brown


## Task 3: Scaled Dot-Product Attention
In this task, we explore the scaled dot-product attention mechanism. The scaled dot-product attention is a key component of the transformer architecture, which has revolutionized natural language processing and other fields.

### Single Query Attention
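Now, let's implement the single query attention mechanism. Given the embedding $\mathbf{w}$ of an example word, we compute a query $\mathbf{q} = \mathbf{W}_{q}\mathbf{w} + \mathbf{b}_{q}$ and, for each embedded word $\mathbf{x}_{i}$ (a row of `X`), a key $\mathbf{k}_{i} = \mathbf{W}_{k}\mathbf{x}_{i} + \mathbf{b}_{k}$ and a value $\mathbf{v}_{i} = \mathbf{W}_{v}\mathbf{x}_{i} + \mathbf{b}_{v}$. The attention scores are the scaled dot products $s_{i} = \mathbf{q}^{\top}\mathbf{k}_{i}/\sqrt{d_{\text{attn}}}$, the attention weights are $\alpha = \texttt{softmax}(\mathbf{s})$, and the output is the weighted sum $\sum_{i}\alpha_{i}\mathbf{v}_{i}$. In the code below, the weight matrices are identity matrices, and only the query bias $\mathbf{b}_{q}$ is random.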

In [17]:
q, k, v, α = let

    # initialize -
    d_in = windowsize; # input dimension
    d_attn = 3; # attention dimension 
    d_out = windowsize; # output dimension
    number_of_key_vectors = size(X,1); # number of key vectors 
    example_word_index = 4; # index of the example word
    w = X[example_word_index,:]; # raw embedded vector
    W = zeros(Float32, number_of_key_vectors, d_in); # initialize the embedding matrix
    k = Vector{Vector{Float32}}(undef, number_of_key_vectors); # initialize the key vectors
    v = Vector{Vector{Float32}}(undef, number_of_key_vectors); # initialize the value vector

    # populate the W - matrix
    for i ∈ 1:number_of_key_vectors
        W[i, :] = X[i,:]; # get the embedding
    end

    # set up the weights and biases: identity weights, only the query bias b_q is random -
    W₁ = Matrix{Float32}(I,d_attn,d_in); # weights W_q
    b₁ = randn(Float32,d_attn); # bias b_q
    W₂ = Matrix{Float32}(I,d_attn,d_in); # weights W_k
    b₂ = zeros(Float32,d_attn); # bias b_k
    W₃ = Matrix{Float32}(I,d_out,d_in); # weights W_v
    b₃ = zeros(Float32,d_out); # bias b_v

    # compute the query, key and value -
    q = W₁ * w + b₁; # query vector
    for i ∈ 1:number_of_key_vectors
        k[i] = W₂ * W[i, :] + b₂; # key
        v[i] = W₃ * W[i, :] + b₃; # value
    end

    # compute the attention scores -
    s = zeros(Float32, number_of_key_vectors); # initialize the attention scores
    for i ∈ 1:number_of_key_vectors
        s[i] = (1/sqrt(d_attn))*dot(q, k[i]); # attention score
    end
    α = NNlib.softmax(s); # attention weights
   
    (q, k, v, α) # return the value and attention weights
end;

In [18]:
q

3-element Vector{Float32}:
 -2.752718
 -1.4726125
  2.618626

In [19]:
transpose(v)*α # weighted sum of the value vectors

1×3 transpose(::Vector{Float32}) with eltype Float32:
 -3.24818  -8.29206  3.2086
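
The loops above can also be written in matrix form, which makes the link to Task 2 visible. The sketch below is an illustration under stated assumptions: the function name `single_query_attention`, the toy keys, and the toy query are hypothetical, and the values are set equal to the keys to mirror the identity projections used in the cell above.

```julia
using LinearAlgebra

function stable_softmax(u)
    e = exp.(u .- maximum(u)); # subtract the max for numerical stability
    return e ./ sum(e)
end

# Hypothetical helper: keys on the columns of K (d_attn × m), values on the columns of V (d_out × m)
function single_query_attention(q::Vector{Float64}, K::Matrix{Float64}, V::Matrix{Float64})
    d_attn = length(q);
    scores = (transpose(K) * q) ./ sqrt(d_attn);  # scaled dot products qᵀkᵢ/√d_attn
    α = stable_softmax(scores);                   # attention weights, sum to one
    return V * α, α                               # weighted sum of the values, and the weights
end

K = [1.0 0.0 2.0; 0.0 1.0 2.0];  # three toy keys in R² (columns)
V = K;                            # values equal to the keys
q = [0.0, 3.0];                   # toy query
output, α = single_query_attention(q, K, V);

# One modern Hopfield update with the keys as memories and β = 1/√d_attn gives the same vector
β_attn = 1 / sqrt(length(q));
hopfield_step = K * stable_softmax(β_attn * (transpose(K) * q));
```

In other words, when the keys and values are the stored memories, a single attention read-out is one update step of the modern Hopfield network with inverse temperature $\beta = 1/\sqrt{d_{\text{attn}}}$.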

## What's coming up?
In lecture `L15a`, we'll look at alternatives to [the attention mechanism](https://en.wikipedia.org/wiki/Attention_(machine_learning)) that underpins most [large language models (LLMs)](https://en.wikipedia.org/wiki/Large_language_model). What comes after attention and transformers?