# Practicum: Modern Hopfield Network Spell Checker
In a lecture, an idea popped into my head in passing: "Could we use a modern Hopfield Network to do spell checking and word recommendation?" Well, you are in luck! In this practicum, we will implement a modern Hopfield Network and do a proof-of-concept test of its capabilities for these tasks. 

* __Hypothesis__: Suppose we have words embedded in some low-dimensional space, i.e., we have the vector representation of each word in a large text corpus. Then, we should be able to load these vectors into a modern Hofpfield network (which memorizes vectors), and let its self-attention-based update mechanism find the correct true word given a corrupted version of that word.

Thus, the modern Hopfield network could serve as a spell checker that returns a correctly spelled word given a misspelled variation of that word.

## Tasks
Before we get started, execute the `Run All Cells` command to check if you have any code or setup issues. Code issues, post a question on EdDiscussion.
* __Task 1: Setup, Data, Constants__: In this task, we set up the computational environment by including [the `Include.jl` file](Include.jl), loading any needed resources, such as sample datasets, and setting up any required constants.
*  __Task 2: Can we recover an uncorrupted memory?__ In this task, we'll create a modern Hopfield network model, load the test data, and then check if we can recover an _uncorrupted_ memory (word) from the network. 
* __Task 3: Retrieve a corrupted memory from the network__: In this task, we'll repeat the process above, but this time we'll start from a corrupted memory. We'll do this by cutting off a fraction of the word and then see if the model recovers the correct memory given the corrupted starting memory. 

Let's get started! (Don't forget to answer the discussion questions!)

___

## Background
A modern Hopfield network addresses many of the perceived limitations of the original Hopfield network. The original Hopfield network was limited to binary values and could only store a limited number of patterns. The modern Hopfield network uses continuous values and can store a large number of patterns.
 
* We'll used the following paper to guide our implementation and analysis: [Ramsauer, H., Schafl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., Pavlovi'c, M., Sandve, G.K., Greiff, V., Kreil, D.P., Kopp, M., Klambauer, G., Brandstetter, J., & Hochreiter, S. (2020). Hopfield Networks is All You Need. ArXiv, abs/2008.02217.](https://arxiv.org/abs/2008.02217)
* In addition, for a detailed discussion of the key milestones in the development of modern Hopfield networks, check out [Hopfield Networks is All You Need Blog, GitHub.io](https://ml-jku.github.io/hopfield-layers/).

### Algorithm
The user provides a set of memory vectors $\mathbf{X} = \left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{m}\right\}$, where $\mathbf{x}_{i} \in \mathbb{R}^{n}$ is a memory vector of size $n$ and $m$ is the number of memory vectors. Further, the user provides an initial _partial memory_ $\mathbf{s}_{\circ} \in \mathbb{R}^{n}$, which is a vector of size $n$ that is a partial version of one of the memory vectors and specifies the _temperature_ $\beta$ of the system.

__Initialize__ the network with the memory vectors $\mathbf{X}$, and the inverse temperature $\beta$. Set current state to the initial state $\mathbf{s} \gets \mathbf{s}_{\circ}$

 - Until convergence __do__:
      1. Compute the _current_ probability vector defined as $\mathbf{p} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s})$ where $\mathbf{s}$ is the _current_ state vector, and $\mathbf{X}^{\top}$ is the transpose of the memory matrix $\mathbf{X}$.
      2. Compute the _next_ state vector $\mathbf{s}^{\prime} = \mathbf{X}\mathbf{p}$ and the _next_ probability vector $\mathbf{p}^{\prime} = \texttt{softmax}(\beta\cdot\mathbf{X}^{\top}\mathbf{s}^{\prime})$.
      3. If $\mathbf{p}^{\prime}$ is _close_ to $\mathbf{p}$ or we run out of iterations, then __stop__. For example, $\lVert \mathbf{p}^{\prime} - \mathbf{p}\rVert_{2}^{2} \leq \epsilon$ for some small $\epsilon > 0$.
      4. Otherwise, update the state $\mathbf{s} \gets\mathbf{s}^{\prime}$, and __go back to__ step 1.
- End __do__ loop
   
This algorithm is implemented in [the `recover(...)` method](src/Compute.jl) provided in [the `src/Compute.jl` file](src/Compute.jl) of this repository. 

## Task 1: Setup, Data, and Prerequisites
In this task, we set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Other packages are downloaded and then loaded.

In [4]:
include("Include.jl"); # load a bunch of libs, including the ones we need to work with images

### Constants
Before we load the data, let's set up some constants that we will use in the exercise. Please enter the number of words (memories) you want your Hopefield network to memorize, the embedding dimension and the inverse temperature $\beta$.

* The `number_of_words_to_memorize::Int` should be less than or equal to the number of words in the dataset that we load below.
* The `number_of_embedding_dimesions::Int` depends upon the pretrained embedding dataset that we will use. In this case, we will use the [GloVe](https://nlp.stanford.edu/projects/glove/) dataset, with `50` embedding dimensions.
* The inverse temperature $\beta > 0$ is a hyperparameter that controls the sharpness of the softmax distribution. A higher value of $\beta$ will make the distribution sharper, while a lower value will make it smoother. Set an initial value of `1.5` for $\beta$.

In [6]:
number_of_words_to_memorize = 2^5; # TODO: Enter how many memories we want to memorize (Int ≥ 0)
number_of_embedding_dimesions = 50; # TODO: Enter the number of embedding dimensions for each word (Int = 50)
β = 1.5; # TODO: Enter an inverse temperature of the system (high T -> low β) Float64

### Data
In this section, let's load a vector embedding from words. We'll use the [GloVe pretrained word embedding dataset](https://nlp.stanford.edu/projects/glove/) as the memories for our Hopfield model. The **GloVe (Global Vectors for Word Representation)** dataset is a widely used pre-trained word embedding resource developed at Stanford University by Pennington, Socher, and Manning. 
* _What is it?_ It constructs vector representations of words by aggregating global word co-occurrence statistics from a corpus, enabling semantic relationships to be captured in vector space. GloVe embeddings have been trained on large datasets such as Wikipedia, Gigaword, and Common Crawl, offering dimensionalities typically ranging from 50 to 300. These embeddings are foundational for many NLP tasks, including text classification, sentiment analysis, and machine translation.
* See: [Pennington et al., EMNLP 2014, GloVe: Global Vectors for Word Representation](https://aclanthology.org/D14-1162/) for more details on this dataset.

__IMPORTANT READ THIS!!__ The word embedding dataset is too large to check into the repository. Instead, you'll __must__ download it from [here]("https://drive.google.com/file/d/1tP9W4R1Ap7vp2AAmgoVEJ5lMfQAk5Fym/view?usp=share_link"). Once the embedding dataset is downloaded, put it into the `data` directory of this repository, and then run the code block below to load the data.

In [8]:
word2vec, vec2word = let

    # Do we have the embeddings downloaded?
    data = nothing;
    if (isfile(joinpath(_PATH_TO_DATA, "glove_6B_50d.jld2")) == false) 
        throw("Oooops! The glove_6B_50d.jld2 file was not found in the /data directory. Did you download and move it into the /data directory?")
    end
    data = JLD2.load(joinpath(_PATH_TO_DATA, "glove_6B_50d.jld2"));

    # load the embeddings
    word2vec = data["word2vec"] # this is a Dict{String, NTuple{50,Float64}}: word -> embedding (50d)
    vec2word = data["vec2word"] # this is a Dict{NTuple{50, Float64}, String} embedding (50d) -> word

    # return -
    (word2vec, vec2word)
end;

__Test data__: Now that we have loaded the [pretrained GloVe dataset](https://nlp.stanford.edu/projects/glove/), let's select a (random) subset of length `number_of_words_to_memorize` from the dataset to encode into the modern Hopfield network. 
* This code block returns the `test_words::Array{String,1}` array of words that we will use to test the Hopfield network. The embedding vectors corresponding to these words are stored in the `test_vocabulary::Array{Float64,2}` array, where the vector representation of each word is stored in the columns of this array.

In [10]:
test_words, test_vocabulary = let

    # initialize -
    vocabulary = Array{Float64,2}(undef, number_of_words_to_memorize, number_of_embedding_dimesions); # this is a matrix of Float64
    test_words = Array{String,1}(undef, number_of_words_to_memorize); # this is a vector of strings
    total_number_of_words = length(word2vec); # this is the total number of words in the dataset
    index_of_words_to_learn = randperm(total_number_of_words)[1:number_of_words_to_memorize]; # this is the random index of words to learn

    # get the keys of word2vec -
    words = keys(word2vec) |> collect; # this is a vector of strings

    # loop over the words to learn
    for i ∈ eachindex(index_of_words_to_learn)
        
        j = index_of_words_to_learn[i]; # this is the index of the word we want to learn
        wⱼ = words[j]; # this is the word we want to learn
        test_words[i] = wⱼ; # this is the word we want to learn 
        embedding = word2vec[wⱼ]; # this is the embedding of the word we want to learn

        for j ∈ 1:number_of_embedding_dimesions
            vocabulary[i,j] = embedding[j]; # this is the embedding of the word we want to learn
        end
    end

    test_vocabulary = vocabulary |> transpose |> Matrix; # this is the vocabulary we want to learn

    # return -
    (test_words, test_vocabulary);
end;

In [11]:
test_words # What are our words?

32-element Vector{String}:
 "américo"
 "histologic"
 "2in"
 "rhadamanthus"
 "abū"
 "transcorp"
 "jayewardene"
 "brassicas"
 "webshop"
 "harf"
 "faysal"
 "milićević"
 "3,540"
 ⋮
 "alveolus"
 "trade-related"
 "washout"
 "14-footer"
 "coss"
 "perugia"
 "8,180"
 "especialista"
 "finalisation"
 "reproductively"
 "evdokia"
 "lakhi"

__Check__: Let's check that the `test_words` and `test_vocabulary` arrays are of the correct size. 
* The `test_words` array should be of size `number_of_words_to_memorize`. The `test_vocabulary` array should be of size `number_of_embedding_dimensions` $\times$ `number_of_words_to_memorize`, i.e., the memorized words should be on the columns of the `test_vocabulary` array.
* We'll use the [@assert macro](hhttps://docs.julialang.org/en/v1/base/base/#Base.@assert) to check that the arrays are of the correct size. If the assertion fails, an error will be raised and the program will stop. If no error is raised, the program will continue (everything is correctly sized).

In [13]:
let
    @assert length(test_words) == number_of_words_to_memorize;
    @assert size(test_vocabulary, 1) == number_of_embedding_dimesions;
    @assert size(test_vocabulary, 2) == number_of_words_to_memorize;
end

## Task 2: Can we recover an uncorrupted memory?
In this task, we'll create a modern Hopfield network model, load the test data, and then check if we can recover an _uncorrupted_ memory from the network. We'll do this by starting from a state vector $\mathbf{s}_{\circ}$ that is a column from the data loaded into the model.

* __Expectation__: The network will converge to a local minimum, but we are not guaranteed that the local minimum corresponds to the original _uncorrupted_ memory. Given an _uncorrupted_ memory as a starting point, we expect to recover that _uncorrupted_ memory with a small number of mistakes when the system temperature is cold, i.e., $\beta > \beta^{\star}$, where $\beta^{\star}$ is a (unknown) threshold temperature that (potentially) depends on the number of memories, and other factors. As the temperature increases, we expect the error probability to increase. 
 
Let's start by creating a model of a modern Hopfield network. 
* We'll construct [a `MyModernHopfieldNetworkModel` instance](src/Types.jl) using a custom [`build(...)` function](src/Factory.jl). The [`build(...)` method](src/Factory.jl) takes the type of thing we want to build, the (linearized) image library we want to encode, and the (inverse) system temperature $\beta$ as inputs — images along the columns.
* The [`build(...)` function](src/Factory.jl) returns a `MyModernHopfieldNetworkModel` instance, where the image library is stored in the `X::Array{Float64,2}` field, and the system temperature is stored in the `β::Float64` field.

We'll store the Hopfield network instance in the `mymodel::MyModernHopfieldNetworkModel` variable.

In [15]:
mymodel = let

    # initialize -
    model = nothing; # this is the model we want to build
    memorycollection =test_vocabulary; # words (memories) on columns
    index_vector = 1:number_of_words_to_memorize |> collect; # this is the index of the words we want to learn
    words = keys(test_vocabulary) |> collect; # this is a vector of strings
    
    # build model -
    model = build(MyModernHopfieldNetworkModel, (
            memories = memorycollection, # this is the data we want to memorize. Images on columns
            β = β, # Inverse temperature of the system. A big beta means we are more likely to get the right answer
    ));

    model; # return the model to the calling scope
end;

__Check__: Let's do a quick check to make sure we are doing what we think we are doing when we loaded the memories into the model. The columns of the `model.X` field should be the words that we are encoding into the Hopfield network. Thus, we should be able to grab a column from `model.X` and look it up in the original `vec2word` dictionary.
* We'll use the [@assert macro](hhttps://docs.julialang.org/en/v1/base/base/#Base.@assert) to check that the true word, and the word encoded in the model are the same. If the assertion fails, an error will be raised and the program will stop. If no error is raised, the program will continue (everything is correct).

In [17]:
let
    
    # initialize -
    X = mymodel.X; # get the training data in the model
    index_to_check = rand(1:number_of_words_to_memorize); # what index do we want to check? (random)
    
    # Get the true word, and the word we think we learned -
    eᵢ = X[:,index_to_check] |> Tuple # this is the embedding of the word we want to learn
    wᵢ = test_words[index_to_check]; # this is the word we want to learn
    ŵᵢ = vec2word[eᵢ]; # this is the word we think we learned

    # Compare the two words -
    @assert wᵢ == ŵᵢ; # this is the word we want to learn
end

__Retrieve a memory from the network__: Next, we'll test if we can recover uncorrupted and corrupted memories from the Hopfield network.
Let's start by specifying which memory we are trying to recover in the `memoryindextorecover::Int` variable.

In [19]:
memoryindextorecover = 21; # TODO: Specify which memory vector will we choose (must be between 1 and number_of_words_to_memorize)

Next, let's build an uncorrupted and corrupted initial condition vector using the true word emebedding vector. We'll store 
the uncorrupted word in the `sₒ::Array{Float64,1}` variable, while the corrupted word will be stored in the `s₁::Array{Float64,1}` variable. Let's start with the uncorrupted memory.

In [21]:
sₒ = mymodel.X[:,memoryindextorecover]; # this is the memory vector we want to recover

What word does `memoryindextorecover::Int64` point to? Let's do a quick check to make sure $\mathbf{s}_{\circ}$ is what we think it is.

In [23]:
let 
    X = mymodel.X; # get the training data in the model
    p = β*(transpose(X) * sₒ) |> s-> NNlib.softmax(s) # this is the probability of the word we want to learn
    ŵ = argmax(p) |> i-> test_words[i]; # this is the index of the word we think we learned
    w = test_words[memoryindextorecover]; # this is the word we want to learn
    println("The word at index $(memoryindextorecover) that we (think) we encoded is: $(ŵ). Check: the true word is: ", w);
end

The word at index 21 that we (think) we encoded is: alveolus. Check: the true word is: alveolus


Now that we have a starting memory encoded in the state vector $\mathbf{s}_{\circ}$, can we recover the original uncorrupted word? We are guaranteed a word, but maybe _not_ the correct one.
* __Implementation__: We implemented the modern Hopfield recovery algorithm above in [the `recover(...)` method](src/Compute.jl). This method takes our `model::MyModernHopfieldNetworkModel` instance, the initial configuration vector `sₒ::Array{Float64,1}`, the maximum number `maxiterations::Int64`, and an iteration tolerance parameter `ϵ::Float64`. This method will continue to iterate until the probability vector converges, or we run out of iterations.
* [The `recover(...)` method](src/Compute.jl) returns the recovered word vector in the `ŝₒ::Array{Float32,1}` variable, the word vectors at each iteration are stored in the `fₒ::Dict{Int, Array{Float64,2}}` dictionary, and the probability of the words at each iteration in the `pₒ::Dict{Int, Array{Float64,2}}` variable. The dictionaries are indexed from `0`.

In [25]:
(ŝₒ,fₒ,pₒ) = recover(mymodel, sₒ, maxiterations = 10000, ϵ = 1e-16); # iterate until we hit stop condition

How many iterations did it take to converge? (this will be the length, i.e.,. the numbr of keys of the `fₒ` dictionary).

In [27]:
println("How many iterations: $(length(fₒ))") # how many iterations did we need to converge?

How many iterations: 5


__Which word did we recover?__ We can check if the recovered word is what we expected by looking at the probability of the recovered words stored in the `pₒ::Dict{Int, Array{Float64,2}}` dictionary.

In [29]:
recovered_word_uncorrupted = let 
    
    # initialize -
    number_of_iterations = length(fₒ); # how many iterations did we need to converge? 
    p = pₒ[number_of_iterations - 1]; # this is the probability of the word we want to learn  (0 based)
    ŵ = argmax(p) |> i-> test_words[i]; # this is the index of the word we think we learned
    
    ŵ; # return the word we *think* we learned
end;

In [30]:
println("The recovered_word_uncorrupted = $(recovered_word_uncorrupted)")

The recovered_word_uncorrupted = alveolus


__Check__: Let's check to see if the recovered word is identical to the original word (not guaranteed). We can do this by checking the `s₁::Array{Float32,1}` variable against the original image.

In [32]:
let
    true_word = test_words[memoryindextorecover]; # this is the word we want to learn
    @assert recovered_word_uncorrupted == true_word
end

### Discussion
1. We hypothesized that the network's retrieval error frequency was directly proportional to the system temperature; i.e., as the system gets hotter (smaller $\beta$), we should see more mistakes (the `@assert` statement above should fail when the network makes a mistake).
    - Run the Task 2 logic a few times for different values of $\beta$, i.e., $\beta = 1.5$ (base), $\beta = 1.0$, and $\beta = 0.15$. Does the network behave like we expect (is our intuition consistent with what you see)?
    - Explain what you think is going on with the role of $\beta$.

In [34]:
# -- Put DQ answer here -- #

In [35]:
did_I_answer_DQ1 = false; # TODO: update the flag value {true | false} 

## Task 3: Retrieve a corrupted memory from the network
In this task, we'll repeat the process above, but this time we'll start from a corrupted memory. We'll do this by cutting off a fraction of the word and then see if the model recovers the correct memory given the corrupted starting point. 

Let's get started by building a corrupted memory. We'll iterate through each embedding dimension from the uncorrupted word; sometimes, we'll make a mistake and replace the correct embedding value with an incorrect value. We'll control how oftern we make a mistake using the hyperparameter $\theta$.
* _What is the $\theta$ parameter?_ The $\theta$ hyperparameter controls how often we make mistakes. Its interpretation depends upon our _mistake_ model. For example, if we are cutting off some fraction of the embedding dimension, then $\theta$ describes the fraction of the image we are cutting off. 

Whichever mistake model we use, the $\theta\in[0,1]$. We store the corrupted word in the `s₁::Array{Float64,1}` variable.

In [37]:
s₁ = let

    # initialize -
    sₒ = mymodel.X[:,memoryindextorecover]; # this is the memory vector we want to recover (uncorrupted)
    s₁ = Array{Float32,1}(undef, number_of_embedding_dimesions); # initialize some space to store the corrupted word
    θ = 0.40; # TODO: set a mistake threshold (1 - θ is the fraction the original memory that we retain)

    # Corruption model: Cutoff part of the memory
    cutoff = (1-θ)*number_of_embedding_dimesions |> x-> round(Int,x);
    for i ∈ 1:number_of_embedding_dimesions
        eᵢ =  sₒ[i]; # We have some gray-scale values in the original vector, need to perturb
        if (i ≤ cutoff)
            s₁[i] = eᵢ;
        else
            s₁[i] = β*randn(); # add some random noise (proportional to β)
        end
    end
    
    s₁ # return corrupted data to the calling scope
end;

__What is the closest word to the corrupted word__? Using the self-attention mechanism as our similarity measure, we can compute the most probable memory given the corrupted memory. `Unhide` the code block below, to see how we computed and printed a table holding the probability of observing `number_of_top_words::Int` words using [the `pretty_table(...)` method exported by the `PrettyTables.jl` package](https://github.com/ronisbr/PrettyTables.jl).
* _How should we interpret this table_? Think of this table as the network's first guess. When we mutate a word, we are directly changing the embedding. Thus, there is no guarantee that the mutated word will be in our vocabulary (highly unlikely given this type of mutation model). However, we can use the self-attention mechanism of the modern Hopfield model to compute the probability that the mutated word corresponds to a word in our vocabulary. Pretty neat! That's what we have in the table.

Depending upon the $\theta$ parameter, the correct word may have only a small probability of being closest word.

In [39]:
let
    # initialize -
    number_of_top_words = 5; # TODO: You can see how many words are the closest to the corrupted memory by changing this number
    X = mymodel.X; # get the training data in the model
    p = β*(transpose(X) * s₁) |> s-> NNlib.softmax(s) # this is the probability of the word we want to learn
    ŵ = argmax(p) |> i-> test_words[i]; # this is the index of the word we think we learned

    # make a table -
    df = DataFrame();
    sorted_indices = sortperm(p, rev=true); # sort the indices of the probabilities
    for i ∈ 1:number_of_top_words
        index = sorted_indices[i]; # this is the index of the word we think we learned
        ŵ = test_words[index]; # this is the word we think we learned
        p̂ = p[index]; # this is the probability of the word we think we learned
        push!(df, (word=ŵ, index = index, probability=p̂)); # add the word and its probability to the table
    end
    pretty_table(df, tf = tf_simple)
end

 [1m      word [0m [1m index [0m [1m probability [0m
 [90m    String [0m [90m Int64 [0m [90m     Float64 [0m
   alveolus      21       0.99789
     cimade      15    0.00104567
       cjnt      14   0.000240341
  brassicas       8   0.000170976
     repute      16    0.00016555


__Do we converge to the correct word starting from a corrupted word__? Now that we have a starting (corrupted) memory encoded in the $\mathbf{s}_{1}$ state vector, can we recover the original uncorrupted memory, i.e., the uncorrupted word? We are guaranteed to converge to a word, but maybe _not_ the correct one.
* __Implementation__: We implemented the modern Hopfield recovery algorithm above in [the `recover(...)` method](src/Compute.jl). This method takes our `model::MyModernHopfieldNetworkModel` instance, the initial configuration vector `s₁::Array{Float64,1}`, the maximum number `maxiterations::Int64`, and the iteration tolerance parameter `ϵ::Float64`. 
* [The `recover(...)` method](src/Compute.jl) returns the recovered image in the `ŝ₁::Array{Float64,1}` variable, the word embeddings at each iteration in the `f₁::Dict{Int, Array{Float64,2}}` dictionary, and the probability of the image at each iteration in the `p₁::Dict{Int, Array{Float64,2}}` variable. The dictionaries are indexed from `0`.

In [41]:
(ŝ₁,f₁,p₁) = recover(mymodel, s₁, maxiterations = 10000, ϵ = 1e-16); # iterate until we hit stop condition

How many iterations did it take to converge? 

In [43]:
println("The network converged to an answer (staring from a corrupted word) in: $(length(f₁)) iterations.") # how many iterations did we need to converge?

The network converged to an answer (staring from a corrupted word) in: 6 iterations.


In [44]:
recovered_word_corrupted = let 
    
    # initialize -
    number_of_iterations = length(f₁); # how many iterations did we need to converge? 
    p = p₁[number_of_iterations - 1]; # this is the probability of the word we want to learn  (0 based)
    ŵ = argmax(p) |> i-> test_words[i]; # this is the index of the word we think we learned
    
    ŵ; # return the word we *think* we recovered
end

"alveolus"

### Discussion
1. Depending upon the $\theta$ parameter, the original word and the system inverse temperature $\beta$, sketch when belive the we _should_ recover the corrupted word.
   - Run the Task 3 logic on a few samples with different combinations of the uncorrected word, a range of values for the $\theta$ parameter, and the system inverse temperature $\beta$, and sketch when you think is going on. We expect to be able to get the correct word as long as the starting input is not _too_ corrupted, and the system temperature is not _too_ hot.
   - Can you confirm?

In [46]:
# -- DQ answer goes here -- #

In [47]:
did_I_answer_DQ2 = false; # TODO: update the flag value {true | false} 

## Fun (totally optional) directions we could go with this idea in the future
This was a proof-of-concept exploration of modern Hopfield networks in a simple text application, namely a spell checker. In addition to more deeply exploring the probability of mistakes, the role of $\beta$, etc., we could go a zillion different directions. Here are a few ideas below.

### Maybe different embeddings?
While combining GloVe word embeddings with a modern Hopfield network is a promising approach for spell checking, there are several important caveats to consider (that we could explore):
* __Out-of-Vocabulary (OOV) Issues__: GloVe is a word-level embedding model that only provides vectors for words in its training corpus. If a misspelled word does not exist in the GloVe vocabulary, it will lack a pre-trained embedding, making it impossible to map the misspelling into the embedding space directly. This is a critical limitation for handling typos, especially those that significantly distort the word.
* __Lack of Subword Information__: Unlike other embeddings, GloVe embeddings do not incorporate subword (character n-gram) information. GloVe is less robust to spelling variations that retain obvious subword patterns. As a result, misspelled words that share common roots or morphological components with correct words won't necessarily be mapped close together in the embedding space.
* __Semantic Rather Than Orthographic Proximity__: GloVe embeddings are designed to capture semantic similarity, not orthographic (spelling-based) similarity. Two words that are close in spelling but semantically unrelated (e.g., form vs. from) __may not__ be near in the GloVe space. This could limit the system's ability to correct specific errors where semantic context is weak or absent.
* __Lexicon Size and Memory Constraints__: A modern Hopfield network stores patterns (in this case, correct word embeddings) as attractors in its memory. To be effective, the network must store a sufficiently large lexicon to cover the domain of interest. However, as the lexicon grows, the network's memory requirements and retrieval complexity increase, which may pose practical challenges for large-scale dictionaries.

### Can we do phrases, or just single words?
Using multihead attention for phrases can help tackle limitations of word-level models by focusing on contextual relationships between words—here’s why this idea is strong and a few things to think about:
* __Captures contextual cues__: Single words (especially misspelled ones) can be ambiguous, but phrases or sentences provide rich context. Multihead attention allows our model to focus on different aspects of the phrase (e.g., syntax, semantics), which can disambiguate corrections.
*__Deals with word dependencies__: Many spelling errors are clearer when you consider surrounding words (e.g., “Their going to the store” vs “They’re going to the store”). Attention lets your model weigh relevant parts of the phrase to guide correction.

However, we must consider extending our single attention mechanism to a multi-head attention mechanism. That sounds like another cool question!

## Tests
In the code block below, we check some values in your notebook and give you feedback on which items are correct or different. `Unhide` the code block below (if you are curious) about how we implemented the tests and what we are testing.

In [49]:
let 
    @testset verbose = true "CHEME 5820 Practicum S2025" begin

        @testset "Task 1: Setup, Prerequisites and Data" begin
            @test _DID_INCLUDE_FILE_GET_CALLED == true
            @test isnothing(number_of_words_to_memorize) == false
            @test isnothing(number_of_embedding_dimesions) == false
            @test isnothing(β) == false
            @test isnothing(word2vec) == false
            @test length(test_words) == number_of_words_to_memorize;
            @test size(test_vocabulary, 1) == number_of_embedding_dimesions;
            @test size(test_vocabulary, 2) == number_of_words_to_memorize;
        end

        @testset "Task 2: Recovering a word from an uncorrupted memory vector" begin
            @test isnothing(mymodel) == false
            @test size(mymodel.X, 1) == number_of_embedding_dimesions
            @test size(mymodel.X, 2) == number_of_words_to_memorize
            @test isnothing(sₒ) == false
            @test length(sₒ) == number_of_embedding_dimesions
            @test isnothing(memoryindextorecover) == false
            @test length(fₒ) > 0
            @test length(pₒ) > 0
            @test did_I_answer_DQ1 == true;
        end

        @testset "Task 3: Recovering a word from a corrupted memory vector" begin
            @test isnothing(s₁) == false
            @test length(s₁) == number_of_embedding_dimesions
            @test length(f₁) > 0
            @test length(p₁) > 0
            @test did_I_answer_DQ2 == true;
        end
    end
end;

[0m[1mTest Summary:                                                 | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
CHEME 5820 Practicum S2025                                    | [32m  22  [39m[36m   22  [39m[0m0.3s
  Task 1: Setup, Prerequisites and Data                       | [32m   8  [39m[36m    8  [39m[0m0.3s
  Task 2: Recovering a word from an uncorrupted memory vector | [32m   9  [39m[36m    9  [39m[0m0.0s
  Task 3: Recovering a word from a corrupted memory vector    | [32m   5  [39m[36m    5  [39m[0m0.0s
