# L14d: Understanding Self-Attention Mechanisms
In this lab, we'll revisit the update rule of [modern Hopfield networks](https://arxiv.org/abs/2008.02217) and show that it is an example of self-attention, namely the scaled dot product attention mechanism.

* _What is self-attention?_ [Attention mechanisms and self-attention](https://slds-lmu.github.io/seminar_nlp_ss20/attention-and-self-attention-for-nlp.html) in a [transformer](https://arxiv.org/abs/1706.03762) compute a representation of each element in an input sequence by weighing and aggregating information from all other elements in the same sequence, using _learned_ query, key, and value projections to dynamically determine the relevance of each element based on their contextual relationships.
* _What is scaled dot-product attention?_ Scaled dot-product attention in transformers calculates the relevance of each element in a sequence by taking the dot product of query and key vectors, scaling by the inverse square root of their dimensionality to stabilize gradients, applying softmax to obtain attention weights, and then aggregating value vectors through a weighted sum.

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If there are code issues, raise your hand, and let's get those fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training dataset for the CBOW model. Finally, we'll train a Continuous Bag of Words (CBOW) model instance on the sample input sequence.
* __Task 2: Does a Modern Hopfield Network use Attention (20 min)?__ In this task, we revisit the update rule of the [modern Hopfield network](https://arxiv.org/abs/2008.02217) that we implemented in week 9 using our natural language problem. We'll explore the update rule in depth and gain some intuition about how it works.
* __Task 3: Single Query Scaled Dot-Product Attention (20 min)__ In this task, we explore the single query scaled dot-product attention mechanism. This is a simplified version of the attention mechanism used in the transformer architecture. 

Let's go!
___

## Task 1: Setup, Data, Prerequisites
In this task, we set up the computational environment and then specify a simple text sequence, e.g., a sentence without punctuation. From this sequence, we'll build a vocabulary, an inverse vocabulary, and the training datasets for the CBOW model that we'll use to compute the embedding. 

Let's start by setting up the environment, e.g., loading the required libraries and code, loading the data, and preparing it for training.

In [3]:
include("Include.jl")

Next, let's specify an example sentence, tokenize it, create a vocabulary, and an inverse vocabulary. 
* _What sentence to use?_ The example sentence we will work with can be whatever you want, as long as it consists of simple English words, no punctuation, and no control tokens.

In the code below, we chop up the `sample_sentence::String` using [the `split(...)` method](https://docs.julialang.org/en/v1/base/strings/#Base.split), which tokenizes around a specified character, in this case the `space` character. We return the `words::Array{String,1}` array, the `vocabulary::Dict{String, Int64}` and the `inverse_vocabulary::Dict{Int64, String}` dictionaries.

In [5]:
words, vocabulary, inverse_vocabulary = let 
    
    # initialize -
    vocabulary = Dict{String, Int}();
    inverse_vocabulary = Dict{Int, String}();

    # TODO: specify a sample sentence -
    sample_sentence = "<bos> The quick brown fox jumps over the lazy dog my sample sentence and other stuff goes here <eos>"; # Classical pangram!

    # split -
    words = split(sample_sentence, ' ') .|> String; # no external ordering

    # build the vocabulary -
    for (i, word) in enumerate(words)
        vocabulary[word] = i;
        inverse_vocabulary[i] = word;
    end

    # return -
    words, vocabulary, inverse_vocabulary
end;
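
As a quick sanity check of the vocabulary logic, the sketch below rebuilds the two dictionaries for a short hypothetical sentence and confirms that mapping a word to its index and back returns the same word. All `toy_*` names are illustrative and are not part of the lab code. Because indices are assigned by position in the sentence, a repeated word would keep the index of its last occurrence.

```julia
# Hypothetical example sentence (not the lab's sample sentence)
toy_sentence = "the quick brown fox";

# tokenize around the space character and convert SubString -> String
toy_words = split(toy_sentence, ' ') .|> String;

# word -> index and index -> word lookups, indexed by position in the sentence
toy_vocabulary = Dict(word => i for (i, word) in enumerate(toy_words));
toy_inverse_vocabulary = Dict(i => word for (i, word) in enumerate(toy_words));

# round trip: word -> index -> word
toy_fox_index = toy_vocabulary["fox"];
println("fox has index $(toy_fox_index) and maps back to $(toy_inverse_vocabulary[toy_fox_index])");
```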

In [6]:
words

19-element Vector{String}:
 "<bos>"
 "The"
 "quick"
 "brown"
 "fox"
 "jumps"
 "over"
 "the"
 "lazy"
 "dog"
 "my"
 "sample"
 "sentence"
 "and"
 "other"
 "stuff"
 "goes"
 "here"
 "<eos>"

__Constants__: Let's set up some constants for the model. These constants will be used throughout the example code below. See the comments in the code for more details.

In [8]:
N = length(words); # size of the vocabulary
windowsize = 5; # size of the context window # must be odd
number_of_epochs = 10000; # number of epochs
number_digit_array = range(1, stop=N, step=1) |> collect; # list of numbers from 1 to N
β = 0.9; # Inverse temperature of the system (for the Hopfield model)

### Compute the CBOW embedding
Next, we build and train a CBOW model instance on the sample input sequence we specified above. We start by creating a model instance, then train it for a few epochs, and finally, we see how the model performs.

Let's start with building the `cbow_model::Chain` instance. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to build (and train) the model. The input layer maps the vocabulary size $N_{\mathcal{V}}$ $\rightarrow$ `windowsize::Int64` (which we use as the hidden layer dimension). The output layer (which we run through a softmax) maps `windowsize::Int64` $\rightarrow$ $N_{\mathcal{V}}$. In both cases, we use the identity activation function, i.e., the transformations do not involve a nonlinear activation function.

We save the (initially untrained) CBOW model in the `cbow_model::Chain` variable:

In [10]:
cbow_model = let

    # TODO: Uncomment the code below to build the model!
    Flux.@layer MyFluxNeuralNetworkModel  trainable=(input, hidden); # create a "namespaced" of sorts
    MyModel() = MyFluxNeuralNetworkModel( # a strange type of constructor
        Chain(
            input = Dense(N, windowsize, identity),  # layer 1. Notice: identity activation function
            hidden = Dense(windowsize, N, identity), # layer 2. Notice: identity activation function
            output = NNlib.softmax) # layer 3 (output layer)
    );
    cbow_model = MyModel().chain;
end

Chain(
  input = Dense(19 => 5),               # 100 parameters
  hidden = Dense(5 => 19),              # 114 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 214 parameters, 1.039 KiB.
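
To see where the parameter counts in the printout come from, here is a minimal sketch of the same two-layer forward pass written with plain arrays instead of `Flux.jl`. The names `demo_softmax`, `demo_W1`, `demo_b1`, `demo_W2`, `demo_b2`, and the random initialization are assumptions made for illustration; this is not how `Flux.jl` initializes or stores its layers.

```julia
using Random

Random.seed!(1234); # reproducible random weights

demo_N = 19;       # vocabulary size (matches the printed model)
demo_hidden = 5;   # hidden (embedding) dimension, i.e., the windowsize

# hand-rolled softmax; subtracting the maximum avoids overflow in exp
demo_softmax(u) = (e = exp.(u .- maximum(u)); e ./ sum(e));

# untrained weights and zero biases for the two dense layers
demo_W1 = randn(Float32, demo_hidden, demo_N); demo_b1 = zeros(Float32, demo_hidden);
demo_W2 = randn(Float32, demo_N, demo_hidden); demo_b2 = zeros(Float32, demo_N);

# forward pass for a hypothetical context vector with two active words
demo_x = zeros(Float32, demo_N); demo_x[[2, 4]] .= 1.0f0;
demo_h = demo_W1 * demo_x + demo_b1;                  # hidden state (identity activation)
demo_yhat = demo_softmax(demo_W2 * demo_h + demo_b2); # distribution over the vocabulary

# 19*5 + 5 = 100 parameters in layer 1, 5*19 + 19 = 114 in layer 2
demo_parameter_count = length(demo_W1) + length(demo_b1) + length(demo_W2) + length(demo_b2);
println("Total parameters: $(demo_parameter_count)");
```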

__CBOW training dataset__: The `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array contains the context (input) and target word (output) for the `sample_sentence::String` where we slide a `windowsize::Int64` window along the sample string. The first element of [the `Tuple`](https://docs.julialang.org/en/v1/base/base/#Core.Tuple) stored in the training data will be the context words, while the second element will be the target word. All will be encoded in [one-hot](https://en.wikipedia.org/wiki/One-hot) format. 
* _Example_: The context words (input) are the flanking words around the target word. Suppose `windowsize = 3` and `sample_sentence` = `The quick brown ...`; then the first training sample will have context words `The` and `brown`, with `quick` being the target word.
* __New and improved!__ I (think) I've updated the training data generation logic so we can use different values for the `windowsize::Int` parameter - just one of the many ways the teaching team strives for teaching excellence!
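
The sliding-window example above can be sketched on a four-word toy sequence, using plain strings instead of one-hot vectors so the pairs are easy to read. The names `pair_words`, `pair_windowsize`, `pair_halfwidth`, and `pair_list` are hypothetical and only serve this illustration.

```julia
pair_words = ["The", "quick", "brown", "fox"]; # toy sequence
pair_windowsize = 3;                           # must be odd
pair_halfwidth = (pair_windowsize - 1) ÷ 2;    # context words on each side of the target

# every position with a full set of neighbors on both sides becomes a target
pair_list = Vector{Tuple{Vector{String}, String}}();
for i ∈ (1 + pair_halfwidth):(length(pair_words) - pair_halfwidth)
    context = [pair_words[j] for j ∈ (i - pair_halfwidth):(i + pair_halfwidth) if j != i];
    push!(pair_list, (context, pair_words[i]));
end

for (context, target) ∈ pair_list
    println("context: $(context) -> target: $(target)");
end
```

In the lab code, the context words are additionally one-hot encoded and summed into a single input vector.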

In [12]:
cbow_training_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    C = windowsize - 1; # number of context words
    startindex = Int(C/2) + 1; # start index
    endindex = N - Int(C/2); # end index
    Δ = Int(C/2); # half the context window size

    # build the training data -
    for i ∈ startindex:endindex

        # get the target word -
        target_word = words[i]; # target word
        target_word_index = vocabulary[target_word]; # target word index
        y = onehot(target_word_index, number_digit_array); # target word one-hot vector

        # Build the context list -
        context_words_list = Vector{Vector{Float32}}();
        context_index_array = range(i - Δ, stop=i + Δ, step=1) |> collect; # context index array
        for j ∈ context_index_array

            # get the context word -
            if j == i
                continue; # skip the target word
            end

            # get the context word -
            context_word = words[j]; # context word
            context_word_index = vocabulary[context_word]; # context word index
            context_word_onehot = onehot(context_word_index, number_digit_array); # context word one-hot vector
            push!(context_words_list, context_word_onehot); # add to the list of context words

        end
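        x = sum(context_words_list, dims=1) |> vec; # sum (not concatenate) the context one-hot vectors; gives a 1-element array holding the sum
        data_tuple = (x[1], y); # data tuple: (summed context vector, target one-hot vector)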
        push!(training_dataset, data_tuple); # add to the training dataset
    end
    training_dataset; # return: training tuples
end;

__Training__: In the code block below, we train the CBOW model instance using the `cbow_training_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` array. We'll _minimize_ the [logitcrossentropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.logitcrossentropy) using [the `Momentum` optimizer](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers.Momentum) (all of which are exported by [the `Flux.jl` package](https://github.com/FluxML/Flux.jl)).

In [14]:
trained_cbow_model = let

    localmodel = deepcopy(cbow_model); # make a local copy of the model

    # setup the loss function -
    loss(ŷ, y) = Flux.Losses.logitcrossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers, what is the agg? Note: logitcrossentropy expects raw logits, while our model already ends in a softmax layer

    # setup the optimizer
    λ = 0.64; # TODO: maybe change the learning rate (default: 0.61)?
    β = 0.10; # TODO: maybe change the momentum parameter (default: 0.10)?
    opt_state = Flux.setup(Momentum(λ,β), localmodel);

    # training loop -
    for i ∈ 1:number_of_epochs
        # train the model - check out the do block notation: https://docs.julialang.org/en/v1/base/base/#do
        Flux.train!(localmodel, cbow_training_dataset, opt_state) do m, x, y
            loss(m(x), y) # loss function
        end

        # output for the user -
        if (rem(i,1000) == 0)
            @show "Epoch $i of $number_of_epochs completed" # print the epoch number
        end
    end

    # return the trained model -
    localmodel;
end

"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 1000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 2000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 3000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 4000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 5000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 6000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 7000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 8000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 9000 of 10000 completed"
"Epoch $(i) of $(number_of_epochs) completed" = "Epoch 10000 of 10000 completed"


Chain(
  input = Dense(19 => 5),               # 100 parameters
  hidden = Dense(5 => 19),              # 114 parameters
  output = NNlib.softmax,
)                   # Total: 4 arrays, 214 parameters, 1.039 KiB.
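
For intuition about the loss used in the training cell above, the sketch below computes a logit cross-entropy by hand for one hypothetical example. The names `demo_logsoftmax`, `demo_logitcrossentropy`, `demo_logits`, and `demo_target` are illustrative and are not part of `Flux.jl`; the formula follows the single-sample form `-sum(y .* logsoftmax(ŷ))`.

```julia
# numerically stable log-softmax: u - logsumexp(u)
demo_logsoftmax(u) = u .- (maximum(u) + log(sum(exp.(u .- maximum(u)))));

# cross-entropy computed from raw scores (logits) and a one-hot target
demo_logitcrossentropy(u, y) = -sum(y .* demo_logsoftmax(u));

demo_logits = Float32[2.0, 0.5, -1.0]; # hypothetical raw scores for a 3-word vocabulary
demo_target = Float32[1.0, 0.0, 0.0];  # one-hot target: word 1

demo_loss = demo_logitcrossentropy(demo_logits, demo_target);
demo_probabilities = exp.(demo_logsoftmax(demo_logits));

# for a one-hot target, the loss is the negative log-probability of the target word
println("loss = $(demo_loss), -log(p[1]) = $(-log(demo_probabilities[1]))");
```

This form expects raw scores. If the inputs have already passed through a softmax, the plain cross-entropy `-sum(y .* log.(p))` is the matching form.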

__What does the embedding look like?__ Let's compute the hidden state $\mathbf{h}$ (a low-dimensional embedded representation corresponding to our target word) for each target word in our sample sentence. We'll store these in the `CBOW_embedding_dictionary::Dict{String, Array{Float32,1}}` dictionary.

In [16]:
CBOW_embedding_dictionary = let

    # initialize -
    embedding_dictionary = Dict{String, Array{Float32,1}}();
    number_of_training_examples = length(cbow_training_dataset);

    # get the parameters from the trained model -
    W₁ = trained_cbow_model.layers.input.weight
    b₁ = trained_cbow_model.layers.input.bias
    W₂ = trained_cbow_model.layers.hidden.weight
    b₂ = trained_cbow_model.layers.hidden.bias

    # let's compute all the embeddings -
    for i ∈ 1:number_of_training_examples

        # what is h?
        x = cbow_training_dataset[i][1]; # context (input) vector of training example i
        h = W₁*x + b₁;

        # what is the key (target word) ?
        pᵢ = W₂*h + b₂ |> u -> NNlib.softmax(u);
        key = pᵢ |> argmax |> j -> inverse_vocabulary[j];

        embedding_dictionary[key] = h;
    end
        
    # return -
    embedding_dictionary;
end;

In [17]:
CBOW_embedding_dictionary

Dict{String, Vector{Float32}} with 15 entries:
  "quick"    => [0.732705, -4.21973, 4.64067, -0.815573, 0.944938]
  "my"       => [-7.29506, -2.87861, -2.01533, 0.829007, 1.60731]
  "over"     => [0.228965, -4.99912, -1.96413, 4.33702, 1.55493]
  "and"      => [-1.72747, 1.27735, 0.251753, 1.89585, 5.48616]
  "sample"   => [-5.38171, 3.45946, -0.29314, -0.678659, -3.71791]
  "stuff"    => [3.36234, 4.35206, 3.92659, -2.31671, 2.44103]
  "jumps"    => [3.04484, -5.20193, -2.19688, -2.52884, -0.710544]
  "the"      => [-1.76475, -4.25966, -2.03461, 1.68388, -4.55532]
  "fox"      => [4.65445, -1.99659, 0.608372, 0.936876, -4.82589]
  "dog"      => [-1.34461, -0.428428, -7.25014, 0.0266408, 1.94266]
  "brown"    => [1.21374, -1.51093, 4.17144, 5.12981, -1.695]
  "goes"     => [4.71384, -0.295659, 1.26102, 1.15934, 3.70097]
  "lazy"     => [-0.605802, -0.333771, -6.06893, -2.93532, -6.00324]
  "other"    => [-3.83705, 3.35079, 5.26944, -0.220615, 1.3604]
  "sentence" => [1.03618, 5.81997, -2.79991, -0.0885359, 2.15014]

Finally, let's represent the words in our embedding in a matrix $\mathbf{X}$, where the words will be in the rows and the embedded coordinates will be stored in the columns. We store this data in the `X::Array{Float32,2}` array:

In [19]:
X, embedded_words_list = let
    
    # initialize -
    embedded_words_list = Vector{String}();
    list_of_words = keys(CBOW_embedding_dictionary) |> collect |> sort # get the keys of the dictionary
    N = length(list_of_words); # size of the vocabulary
    X = zeros(Float32, N, windowsize); # matrix of zeros

    # iterate over the words -
    linearindex = 1;
    for word ∈ list_of_words
        # get the embedding -
        embedding = CBOW_embedding_dictionary[word]; # get the embedding
        X[linearindex, :] = embedding; # add to the matrix
        linearindex += 1; # increment the index

        push!(embedded_words_list, word); # add to the list of embedded words
    end

    # return -
    X, embedded_words_list;
end;

In [20]:
X

15×5 Matrix{Float32}:
 -1.72747    1.27735    0.251753   1.89585     5.48616
  1.21374   -1.51093    4.17144    5.12981    -1.695
 -1.34461   -0.428428  -7.25014    0.0266408   1.94266
  4.65445   -1.99659    0.608372   0.936876   -4.82589
  4.71384   -0.295659   1.26102    1.15934     3.70097
  3.04484   -5.20193   -2.19688   -2.52884    -0.710544
 -0.605802  -0.333771  -6.06893   -2.93532    -6.00324
 -7.29506   -2.87861   -2.01533    0.829007    1.60731
 -3.83705    3.35079    5.26944   -0.220615    1.3604
  0.228965  -4.99912   -1.96413    4.33702     1.55493
  0.732705  -4.21973    4.64067   -0.815573    0.944938
 -5.38171    3.45946   -0.29314   -0.678659   -3.71791
  1.03618    5.81997   -2.79991   -0.0885359   2.15014
  3.36234    4.35206    3.92659   -2.31671     2.44103
 -1.76475   -4.25966   -2.03461    1.68388    -4.55532

## Task 2: Does a Modern Hopfield Network use Attention?
In this task, we revisit the update rule of the [modern Hopfield network](https://arxiv.org/abs/2008.02217) that we implemented in week 9 using our natural language problem. We'll explore the update rule in depth and gain some intuition about how it works.

* _What is it?_ A [modern Hopfield network](https://arxiv.org/abs/2008.02217) addresses many of the _perceived limitations_ of the original Hopfield network. The original Hopfield network was limited to binary values and could only store a limited number of patterns. The modern Hopfield network uses continuous values and can store a large number of patterns.

For a detailed discussion of the key milestones in the development of modern Hopfield networks, check out the [Hopfield Networks is All You Need blog on GitHub.io](https://ml-jku.github.io/hopfield-layers/).

### Algorithm
The user provides a set of memory (word) vectors $\mathbf{X} = \left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{m}\right\}$, where $\mathbf{x}_{i} \in \mathbb{R}^{n}$ is a memory vector of size $n$ and $m$ is the number of memory vectors; we arrange these as the columns of the memory matrix $\mathbf{X}\in\mathbb{R}^{n\times m}$. Further, the user provides an initial _partial memory_ $\mathbf{s}_{\circ} \in \mathbb{R}^{n}$, a vector of size $n$ that is a partial version of one of the memory vectors, and specifies the _inverse temperature_ $\beta$ of the system.

__Initialize__ the network with the memory vectors $\mathbf{X}$, and the inverse temperature $\beta$. Set the current state to the initial state $\mathbf{s} \gets \mathbf{s}_{\circ}$.

Until convergence __do__:
   1. Compute the _current_ probability vector defined as $\mathbf{p} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s})$ where $\mathbf{s}$ is the _current_ state vector, and $\mathbf{X}^{\top}$ is the transpose of the memory matrix $\mathbf{X}$.
   2. Compute the _next_ state vector $\mathbf{s}^{\prime} = \mathbf{X}\mathbf{p}$ and the _next_ probability vector $\mathbf{p}^{\prime} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s}^{\prime})$.
   3. If $\mathbf{p}^{\prime}$ is _close_ to $\mathbf{p}$ or we run out of iterations, then __stop__. For example, if $\lVert \mathbf{p}^{\prime} - \mathbf{p}\rVert_{2}^{2} \leq \epsilon$ for some small $\epsilon > 0$, then we __stop__.
   4. Otherwise, update the state $\mathbf{s} \gets\mathbf{s}^{\prime}$, and __go back to__ step 1.

   
This algorithm is implemented in [the `recover(...)` method](src/Compute.jl).
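
For readers who want to see the loop spelled out, the sketch below follows the four steps above with plain arrays. The function `demo_recover`, the helper `demo_softmax`, and the toy memories are hypothetical; this is not the lab's `recover(...)` implementation, whose signature and return values may differ.

```julia
using LinearAlgebra

# hand-rolled softmax; subtracting the maximum avoids overflow in exp
demo_softmax(u) = (e = exp.(u .- maximum(u)); e ./ sum(e));

# Hypothetical sketch of the update loop. X holds the memories on its columns (n × m),
# s0 is the initial state (length n), and β is the inverse temperature.
function demo_recover(X::Matrix{Float64}, s0::Vector{Float64}, β::Float64;
        maxiterations::Int = 1000, ϵ::Float64 = 1e-10)

    s = copy(s0);
    p = demo_softmax(β * (transpose(X) * s)); # step 1: current probability vector
    iterations = 0;
    while iterations < maxiterations
        s_next = X * p;                                     # step 2: next state
        p_next = demo_softmax(β * (transpose(X) * s_next)); # step 2: next probability vector
        iterations += 1;
        converged = norm(p_next - p)^2 ≤ ϵ;                 # step 3: stopping test
        s, p = s_next, p_next;                              # step 4: update
        converged && break;
    end
    return s, p, iterations # most recent state, probability vector, and iteration count
end

# three toy memories on the columns; start from a noisy copy of memory 2
hopfield_X = [4.0 0.0 0.0; 0.0 4.0 0.0; 0.0 0.0 4.0];
hopfield_s, hopfield_p, hopfield_k = demo_recover(hopfield_X, [0.5, 3.0, 0.2], 0.9);
println("most probable memory: $(argmax(hopfield_p)) after $(hopfield_k) iterations");
```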

#### Implementation
Let's start by creating a model of a modern Hopfield network. 
* We'll construct [a `MyModernHopfieldNetworkModel` instance](src/Types.jl) using a custom [`build(...)` function](src/Factory.jl). The [`build(...)` method](src/Factory.jl) takes the type of thing we want to build, the memory library we want to encode (here, the word embeddings, stored along the columns), and the inverse temperature $\beta$ as inputs.
* The [`build(...)` function](src/Factory.jl) returns a `MyModernHopfieldNetworkModel` instance, where the memory library is stored in the `X::Array{Float32,2}` field, and the inverse temperature is stored in the `β::Float64` field.

We'll store the problem instance in the `model::MyModernHopfieldNetworkModel` variable.

In [23]:
model = let
    
    # build model -
    model = build(MyModernHopfieldNetworkModel, (
            memories = transpose(X) |> Matrix, # this is the data we want to memorize. Words on the columns
            β = β, # Inverse temperature of the system. A larger β sharpens the softmax, so retrieval concentrates on a single memory
    ));

    model; # return the model to the calling scope
end;

In [24]:
model.X

5×15 Matrix{Float32}:
 -1.72747    1.21374  -1.34461    …   1.03618     3.36234  -1.76475
  1.27735   -1.51093  -0.428428       5.81997     4.35206  -4.25966
  0.251753   4.17144  -7.25014       -2.79991     3.92659  -2.03461
  1.89585    5.12981   0.0266408     -0.0885359  -2.31671   1.68388
  5.48616   -1.695     1.94266        2.15014     2.44103  -4.55532

We implemented the modern Hopfield recovery algorithm above in [the `recover(...)` method](src/Compute.jl). 
* This method takes our `model::MyModernHopfieldNetworkModel` instance, the initial configuration vector `sₒ::Array{Float32,1}`, the maximum number of iterations `maxiterations::Int64`, and the iteration tolerance parameter `ϵ::Float64`.
* This method returns the recovered word in the `s₁::Array{Float32,1}` variable, the word at each iteration in the `f::Dict{Int, Array{Float32,2}}` dictionary, and the probability of the word at each iteration in the `p::Dict{Int, Array{Float32,2}}` variable. The frames and probability dictionaries are indexed from `0`.

In [26]:
starting_word_index, s₁, f, p = let 
    
    starting_word_index = 10; # index of the starting word
    sₒ = X[starting_word_index,:]; # initial state
    (s₁,f,p) = recover(model, sₒ, maxiterations = 10000, ϵ = 1e-16); # iterate until we hit stop condition

    # return -
    starting_word_index, s₁,f,p
end;

In [27]:
println("How many iterations: $(length(f))") # how many iterations did we need to converge?

How many iterations: 3


__Check__: Let's check to see if the recovered word is identical to the original word (not guaranteed). We can do this by checking the `s₁::Array{Float32,1}` variable against the original word.
* _How?_ We find the index of the maximum probability in the _last_ entry of the `p::Dict{Int, Array{Float32,2}}` dictionary [using the `argmax(...)` method](https://docs.julialang.org/en/v1/base/collections/#Base.argmax), extract the word at that index from the `embedded_words_list::Array{String,1}` array, and assign the most probable word to the `recovered_word::String` variable.

In [29]:
let
    n = length(f) - 1; # number of iterations (last index)
    recovered_word = p[n] |> argmax |> i-> embedded_words_list[i] # index of the most probable word
    starting_word = embedded_words_list[starting_word_index]; # starting word
    println("Starting word: $starting_word and recovered word: $recovered_word");
end

Starting word: over and recovered word: over


## Task 3: Single Query Scaled Dot-Product Attention
In this task, we explore the single query scaled dot-product attention mechanism. This is a simplified version of the attention mechanism used in the transformer architecture. 

This mechanism looks a lot like the modern Hopfield network, but it is not a recurrent neural network (RNN). There is no iteration, and it has adjustable weights (and biases) that must be learned. 

There are three key concepts in the scaled dot-product attention mechanism:
* __Query__: The query vector $\mathbf{q} \in \mathbb{R}^{d_{a}}$, a vector of size $d_{a}$ that specifies the _query_. The query vector is computed as: $\mathbf{q} = \mathbf{W}_{Q}\mathbf{x}_{\circ} + \mathbf{b}_{Q}$, where $\mathbf{x}_{\circ}\in\mathbb{R}^{n}$ is the input (word) vector, $\mathbf{W}_{Q}\in\mathbb{R}^{d_{a}\times n}$ is the (learned) query weight matrix, and $\mathbf{b}_{Q}\in\mathbb{R}^{d_{a}}$ is the query bias vector.
* __Keys__: The key vector(s) $\mathbf{k}_{i} \in \mathbb{R}^{d_{a}}$, each a vector of size $d_{a}$, specify the _keys_. At most there are $m$ key vectors, one for each memory vector $\mathbf{x}_{i}\in\mathbf{X}$. The key vector(s) are computed as: $\mathbf{k}_{i} = \mathbf{W}_{K}\mathbf{x}_{i} + \mathbf{b}_{K}$, where $\mathbf{x}_{i}\in\mathbb{R}^{n}$ is the input (word) vector, $\mathbf{W}_{K}\in\mathbb{R}^{d_{a}\times n}$ is the (learned) key weight matrix, and $\mathbf{b}_{K}\in\mathbb{R}^{d_{a}}$ is the key bias vector. The key matrix is $\mathbf{K} \in \mathbb{R}^{d_{a} \times m}$, where $m$ is the number of memory vectors (number of embedded words), and $d_{a}$ is the attention dimension.
    * We'll store the key vectors in the `K::Array{Float32,2}` variable, where $\mathbf{K} = \left[\mathbf{k}_{1}, \mathbf{k}_{2}, \ldots, \mathbf{k}_{m}\right]$ (keys on the columns).
* __Values__: The value vector(s) $\mathbf{v}_{i} \in \mathbb{R}^{n}$, a vector of size $n$ that specifies the _values_. At most there are $m$ value vectors, one for each memory vector $\mathbf{x}_{i}\in\mathbf{X}$. The value vector(s) are computed as: $\mathbf{v}_{i} = \mathbf{W}_{V}\mathbf{x}_{i} + \mathbf{b}_{V}$, where $\mathbf{x}_{i}\in\mathbb{R}^{n}$ is the input (word) vector, $\mathbf{W}_{V}\in\mathbb{R}^{n \times n}$ is the (learned) value weight matrix, and $\mathbf{b}_{V}\in\mathbb{R}^{n}$ is the value bias vector. The value matrix is $\mathbf{V} \in \mathbb{R}^{ n\times m}$, where $m$ is the number of memory vectors (number of embedded words), and $n$ is the embedding dimension.
    * We'll store the value vectors in the `V::Array{Float32,2}` variable, where $\mathbf{V} = \left[\mathbf{v}_{1}, \mathbf{v}_{2}, \ldots, \mathbf{v}_{m}\right]$ (values on the columns).

__Hypothesis__: If Hopfield and the single query scaled dot-product attention are the same, then we should recover the query word if all the transformations are identity, i.e., all the $\mathbf{W}_{\star} = \mathbf{I}$ and $\mathbf{b}_{\star} = \mathbf{0}$.
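
To see why this hypothesis is plausible, substitute identity weights and zero biases into the definitions above. The sketch assumes $d_{a} = n$, so that the identity matrices are square, and writes $\mathbf{X}\in\mathbb{R}^{n\times m}$ for the memory matrix with the memory vectors on its columns:

$$
\begin{aligned}
\mathbf{q} &= \mathbf{x}_{\circ}, \qquad \mathbf{K} = \mathbf{V} = \mathbf{X} \\
\mathbf{y} &= \mathbf{V}\,\texttt{softmax}\left(\frac{1}{\sqrt{d_{a}}}\,\mathbf{K}^{\top}\mathbf{q}\right) = \mathbf{X}\,\texttt{softmax}\left(\frac{1}{\sqrt{d_{a}}}\,\mathbf{X}^{\top}\mathbf{x}_{\circ}\right)
\end{aligned}
$$

Under these assumptions, the attention output is a single pass of the Hopfield update $\mathbf{s}^{\prime} = \mathbf{X}\,\texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s})$ with $\mathbf{s} = \mathbf{x}_{\circ}$ and $\beta = 1/\sqrt{d_{a}}$.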

### Algorithm
Let's build a single query scaled dot-product attention mechanism; our approach is based on [Algorithm 3 of Phuong and Hutter (2022)](https://arxiv.org/abs/2207.09238).

__Initialize__ with $\mathbf{X} = \left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{m}\right\}$, where $\mathbf{x}_{i} \in \mathbb{R}^{n}$ is a memory vector of size $n$ (embedding dimension) and $m$ is the number of memory vectors (number of embedded words). Provide the attention parameters $\mathbf{W}_{Q}$, $\mathbf{W}_{K}$, $\mathbf{W}_{V}$, $\mathbf{b}_{Q}$, $\mathbf{b}_{K}$, and $\mathbf{b}_{V}$, and specify an initial query word $\mathbf{x}_{\circ}$.

1. Compute the _query_ vector $\mathbf{q} \gets \mathbf{W}_{Q}\mathbf{x}_{\circ} + \mathbf{b}_{Q}$, where $\mathbf{x}_{\circ}\in\mathbb{R}^{n}$ is an input (word) vector, $\mathbf{W}_{Q}\in\mathbb{R}^{d_{a}\times n}$ is a (learned) query weight matrix, and $\mathbf{b}_{Q}\in\mathbb{R}^{d_{a}}$ is the query bias vector. The query vector is a vector of size $d_{a}$, which is the attention dimension.
2. Compute the _key_ matrix $\mathbf{K}\in\mathbb{R}^{d_{a} \times m}$. The columns $i=1,2,\dots,m$ are given by $\mathbf{k}_{i} \gets \mathbf{W}_{K}\mathbf{x}_{i} + \mathbf{b}_{K}$, where $\mathbf{x}_{i}\in\mathbb{R}^{n}$ is an input (word) vector, $\mathbf{W}_{K}\in\mathbb{R}^{d_{a}\times n}$ is the (learned) key weight matrix, and $\mathbf{b}_{K}\in\mathbb{R}^{d_{a}}$ is the key bias vector. 
3. Compute the _value_ matrix $\mathbf{V}\in\mathbb{R}^{n\times{m}}$. The columns $i=1,2,\dots,m$ are given by $\mathbf{v}_{i} \gets \mathbf{W}_{V}\mathbf{x}_{i} + \mathbf{b}_{V}$, where $\mathbf{x}_{i}\in\mathbb{R}^{n}$ is an input (word) vector, $\mathbf{W}_{V}\in\mathbb{R}^{n \times n}$ is the (learned) value weight matrix, and $\mathbf{b}_{V}\in\mathbb{R}^{n}$ is the value bias vector.
4. Compute the scaled score vector $\mathbf{s} \gets \left(1/\sqrt{d_{a}}\right)\cdot\mathbf{K}^{\top}\mathbf{q}$, where $\mathbf{K}^{\top}$ is the transpose of the key matrix $\mathbf{K}$. The score vector is a vector of size $m$, where $m$ is the number of memory vectors (number of embedded words).
5. Compute the attention vector $\mathbf{p} \gets \texttt{softmax}(\mathbf{s})$. The attention vector is a vector of size $m$, where $m$ is the number of memory vectors (number of embedded words). It is a probability vector, i.e., the sum of the elements in the attention vector is equal to 1.
6. Compute the output vector $\mathbf{y} \gets \mathbf{V}\mathbf{p}$, where $\mathbf{V}\in\mathbb{R}^{n\times{m}}$ is the value matrix. The output vector is a vector of size $n$, where $n$ is the embedding dimension.
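
The six steps above can be collected into one small function. This is a minimal sketch with plain arrays; `demo_single_query_attention`, `demo_softmax`, and the toy inputs are hypothetical names introduced for illustration, and the memory vectors are assumed to sit on the columns of the input matrix.

```julia
using LinearAlgebra

# hand-rolled softmax; subtracting the maximum avoids overflow in exp
demo_softmax(u) = (e = exp.(u .- maximum(u)); e ./ sum(e));

# Hypothetical helper: steps 1-6 for a single query.
# X holds the memory vectors on its columns (n × m); x0 is the query word (length n).
function demo_single_query_attention(X, x0, W_Q, b_Q, W_K, b_K, W_V, b_V)
    d_attn = size(W_Q, 1);                  # attention dimension
    q = W_Q * x0 + b_Q;                     # step 1: query vector
    K = W_K * X .+ b_K;                     # step 2: keys on the columns (d_attn × m)
    V = W_V * X .+ b_V;                     # step 3: values on the columns
    s = (transpose(K) * q) ./ sqrt(d_attn); # step 4: scaled score vector
    p = demo_softmax(s);                    # step 5: attention (probability) vector
    y = V * p;                              # step 6: output vector
    return y, p
end

# toy check: identity weights and zero biases, query equal to memory 2
attn_X = [4.0 0.0 0.0; 0.0 4.0 0.0; 0.0 0.0 4.0];
attn_I = Matrix{Float64}(I, 3, 3);
attn_zero = zeros(3);
attn_y, attn_p = demo_single_query_attention(attn_X, attn_X[:, 2],
    attn_I, attn_zero, attn_I, attn_zero, attn_I, attn_zero);
println("most probable memory: $(argmax(attn_p))");
```

With identity weights and zero biases, the output of this sketch matches a single Hopfield update with $\beta = 1/\sqrt{d_{a}}$.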

___

Let's implement our single query scaled dot-product attention mechanism. We'll use identity weight matrices and zero key and value biases, but a large, randomly generated query bias, and see what happens.

In [32]:
(x, y, p, example_word_index) = let

    # initialize -
    n = size(X,2); # input dimension (embedding dimension)
    m = size(X,1); # number of key vectors (number of embedded words)
    d_attn = windowsize; # attention dimension (we can change this)
    d_out = n; # output dimension (we want to get the same dimension as the input)
    example_word_index = 4; # index of the example word that we pass to the model
    xₒ = X[example_word_index,:]; # initial word that we use to construct the query vector
    K = Array{Float32,2}(undef, d_attn, m); # initialize the key matrix
    V = Array{Float32,2}(undef, n, m); # initialize the value matrix
    XT = transpose(X) |> Matrix; # transpose the input matrix

    # identity weights, zero key and value biases, and a random query bias -
    W₁ = Matrix{Float32}(I,d_attn,n); # weights W_q
    b₁ = 100*randn(Float32,d_attn); # bias b_q
    W₂ = Matrix{Float32}(I,d_attn,n); # weights W_k
    b₂ = zeros(Float32,d_attn); # bias b_k
    W₃ = Matrix{Float32}(I,d_out,n); # weights W_v
    b₃ = zeros(Float32,d_out); # bias b_v
    
    # compute the q vector, K and V matrices 
    q = W₁*xₒ + b₁; # query vector
    for i ∈ 1:m
        K[:,i] = W₂*XT[:,i] + b₂; # key vector
        V[:,i] = W₃*XT[:,i] + b₃; # value vector
    end
    
    # compute the attention score vector s
    s = transpose(K) * q; # attention score vector
    s = s ./ sqrt(d_attn); # scale the attention score vector
    p = softmax(s); # apply softmax to the attention score vector
    y = V * p; # output vector

    xₒ, y, p, example_word_index # return the input and output vectors
end;

What does the attention mechanism return?

In [34]:
let
    query_word = embedded_words_list[example_word_index]; # query word
    recovered_word = p |> argmax |> i-> embedded_words_list[i]; # index of the most probable word
    println("Query word: $query_word and recovered word: $recovered_word");
end

Query word: fox and recovered word: over


## What's coming up?
In lecture `L15a`, we'll look at what comes after the attention and transformer technologies that underpin most [large language models (LLMs)](https://en.wikipedia.org/wiki/Large_language_model).
* Want to get ahead? Check out: [Schneider, J. (2024). What comes after transformers? - A selective survey connecting ideas in deep learning. ArXiv, abs/2408.00386.](https://arxiv.org/abs/2408.00386v1)