# L14b: The Skip-Gram Embedding Model
In this lab, we'll look at the Skip-Gram model, a neural network model for learning word embeddings. This is the second text embedding model we'll cover in this course. The two architectures are:
* __Continuous Bag of Words (CBOW)__: This architecture predicts the target word based on its context words. It uses a shallow neural network to learn the embeddings of words in a given context. No positional information is used, and the model is trained to minimize the loss between the predicted and actual target word.
* __Skip-Gram__: This architecture predicts the context words from a target word, the reverse of CBOW. A skip-gram model consists of a single hidden layer that transforms a one-hot encoded input word into a dense vector representation, optimizing the embedding so that words appearing in similar contexts have similar vector representations. Imagine reading a sentence and guessing the words that come before and after a particular word.

See section 2: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)
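
To make the difference between the two architectures concrete, the sketch below builds training pairs for both from a short token list. The names `tokens`, `radius`, `cbow_pairs`, and `skipgram_pairs` are illustrative assumptions and not part of the lab code; the window is assumed to be one word on each side of the target.

```julia
# Illustrative sketch: how CBOW and Skip-Gram pair up the same words
tokens = ["the", "quick", "brown", "fox"];
radius = 1; # assumption: one context word on each side of the target

cbow_pairs = Vector{Tuple{Vector{String}, String}}();   # (context words, target)
skipgram_pairs = Vector{Tuple{String, String}}();       # (target, one context word)

for i ∈ (1 + radius):(length(tokens) - radius)
    target = tokens[i];
    context = [tokens[j] for j ∈ (i - radius):(i + radius) if j != i];

    # CBOW: all context words together predict the target
    push!(cbow_pairs, (context, target));

    # Skip-Gram: the target predicts each context word separately
    for c ∈ context
        push!(skipgram_pairs, (target, c));
    end
end

println(cbow_pairs);
println(skipgram_pairs);
```

With this window, each interior word yields one CBOW pair and two Skip-Gram pairs.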

### Tasks
Before we start, execute the `Run All Cells` command to check whether you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand and let's get them fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we'll load a public dataset of headlines curated as either sarcastic or not sarcastic. Our dataset is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). After loading the data, we'll tokenize the data (convert text strings to numerical arrays).
* __Task 2: Build and Train a CBOW model instance (15 min)__: In this task, we will build and train a CBOW model instance on the sample input sequence we selected above. We start by creating a model instance, and then we train this instance for a few epochs.
* __Task 3: Does the S4 model generalize? (25 min)__: In this task, we'll explore how the S4-LegS model performs when we give input sequences that are _similar_ but not the same as the training data. We'll take the training data, perturb some words, and feed the perturbed sequence into the model.

Let's get started!
___

## Task 1: Setup, Data, Prerequisites
In this task, we'll set up the environment (installing and loading the required libraries), then load the data and prepare it for training.

In [13]:
include("Include.jl")

Next, let's specify an example sentence, tokenize it and create a vocabulary. We'll also create a mapping from words to indices and vice versa. This will help us convert the text data into numerical arrays that can be fed into the model.

In [None]:
words, vocabulary, inverse_vocabulary = let 
    
    # initialize -
    vocabulary = Dict{String, Int}();
    inverse_vocabulary = Dict{Int, String}();

    # TODO: specify a sample sentence -
    sample_sentence = "The quick brown fox jumps over the lazy dog"; # Classical pangram!

    # split -
    words = split(sample_sentence, " ") .|> lowercase |> unique; # lowercase, then drop duplicates (first-occurrence order is kept)

    # build the vocabulary -
    for (i, word) in enumerate(words)
        vocabulary[word] = i;
        inverse_vocabulary[i] = word;
    end

    # return -
    words, vocabulary, inverse_vocabulary
end;

In [17]:
words

8-element Vector{String}:
 "the"
 "quick"
 "brown"
 "fox"
 "jumps"
 "over"
 "lazy"
 "dog"
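
As a quick check of the two mappings, the sketch below encodes the full sentence into indices and decodes it back. It rebuilds the dictionaries under the hypothetical names `vocab` and `inverse_vocab` so that it runs on its own; it is an illustration and not part of the lab code.

```julia
# Illustrative round trip: words -> indices -> words
sentence = "The quick brown fox jumps over the lazy dog";
tokens = split(sentence, " ") .|> lowercase;   # all 9 tokens, duplicates kept
uniquewords = unique(tokens);                  # 8 distinct words, first-occurrence order

# hypothetical names, mirroring `vocabulary` and `inverse_vocabulary` above
vocab = Dict(w => i for (i, w) in enumerate(uniquewords));
inverse_vocab = Dict(i => w for (w, i) in vocab);

encoded = [vocab[t] for t in tokens];           # text -> numerical array
decoded = [inverse_vocab[i] for i in encoded];  # numerical array -> text

println(encoded);
println(join(decoded, " "));
```

Both occurrences of "the" map to index 1, which is why the vocabulary has 8 entries for a 9-word sentence.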

__Constants__: Let's set up some constants for the model. These constants will be used throughout the code examples below. See the comments in the code for more details.

In [None]:
N = length(words); # size of the vocabulary
windowsize = 3; # size of the context window
number_of_epochs = 1000; # number of epochs
number_digit_array = range(1, stop=N, step=1) |> collect; # list of numbers from 1 to N

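__Training data__: Next, we build the training dataset. For each interior word in `words`, the target is the one-hot encoding of that word, and the input is the average of the one-hot encodings of its two neighbors (the context words).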

In [None]:
training_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}();
    C = windowsize - 1; # number of context words

    # build the training data -
    for i ∈ 2:(N-1)
        
        targetword = words[i]; # target word
        contextwords = words[(i-1):(i+1)] |> v-> [v[1], v[3]] # context words
        
        # process the target word -
        targetword_index = vocabulary[targetword]; # index of the target word
        y = onehot(targetword_index, number_digit_array); # one-hot encoding of the target word

        # process the context words -
        tmp = Array{Float32,2}(undef, N, C); # temporary array
        for (j,word) in enumerate(contextwords)
            contextword_index = vocabulary[word]; # index of the context word
            x = onehot(contextword_index, number_digit_array) .|> Float32; # one-hot encoding of the context word
            tmp[:, j] .= x; # store the context word
        end
        x = vec(sum(tmp, dims=2)) ./ Float32(C); # average of the context words
        
        # store the training data -
        push!(training_dataset, (x, y)); # store the training data
    end

    # return -
    training_dataset;
end;

In [27]:
training_dataset[1]

(Float32[0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0], Bool[0, 1, 0, 0, 0, 0, 0, 0])
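
The input vector above can be reproduced in plain Julia. The sketch below is a minimal illustration that assumes a hypothetical helper `onehot_f32` in place of the notebook's `onehot`, and an arbitrary embedding matrix `W` chosen only for demonstration.

```julia
# Hypothetical stand-in for `onehot`: a dense Float32 one-hot vector of length n
onehot_f32(index::Int, n::Int) = Float32[k == index ? 1 : 0 for k in 1:n];

n = 8;                      # vocabulary size, as in the notebook
context_indices = [1, 3];   # "the" and "brown", the neighbors of "quick"

# average of the one-hot context vectors
x = sum(onehot_f32(i, n) for i in context_indices) ./ length(context_indices);
println(x);

# arbitrary 2 x 8 matrix, for illustration only (not a trained embedding)
W = reshape(Float32.(1:16), 2, 8);
h = W * x;                  # equals the average of columns 1 and 3 of W
println(h);
```

Dividing by the number of context words is what yields the 0.5 entries; a plain sum would give 1.0 instead. Multiplying a weight matrix by this averaged vector selects the average of the context words' columns.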

## Task 2: Build and Train a CBOW model instance
In this task, we will build and train a CBOW model instance on the sample input sequence we selected above. We start by creating a model instance, and then we train this instance for a few epochs.
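
Before using a library model, it can help to see the moving parts. The following is a minimal plain-Julia sketch of a CBOW-style network trained on a single example with gradient descent. It is an assumption-laden illustration, not the lab's implementation: the names `softmax_vec`, `W1`, `W2`, `d`, and `η`, the embedding dimension, the deterministic initialization, and the learning rate are all hypothetical choices.

```julia
# Minimal CBOW-style sketch: linear hidden layer, softmax output, cross-entropy loss
softmax_vec(u) = exp.(u .- maximum(u)) ./ sum(exp.(u .- maximum(u)));

N, d = 8, 3;  # vocabulary size; d is an assumed embedding dimension

# deterministic small initial weights (assumption, avoids a random seed)
W1 = 0.1 .* sin.(reshape(collect(1.0:(d * N)), d, N));  # input -> hidden (embeddings)
W2 = 0.1 .* cos.(reshape(collect(1.0:(N * d)), N, d));  # hidden -> output scores

x = zeros(N); x[[1, 3]] .= 0.5;  # averaged one-hot context ("the", "brown")
t = 2;                           # index of the target word ("quick")
y = zeros(N); y[t] = 1.0;        # one-hot target

# cross-entropy loss for this single example
loss(A, B) = -log(softmax_vec(B * (A * x))[t]);
initial_loss = loss(W1, W2);

η = 0.1;  # assumed learning rate
for _ ∈ 1:200
    h = W1 * x;                 # hidden representation (average context embedding)
    p = softmax_vec(W2 * h);    # predicted distribution over the vocabulary
    δ = p .- y;                 # gradient of the loss w.r.t. the output scores
    ∇W2 = δ * h';               # gradient w.r.t. W2
    ∇W1 = (W2' * δ) * x';       # gradient w.r.t. W1
    W1 .-= η .* ∇W1;            # in-place gradient descent updates
    W2 .-= η .* ∇W2;
end
final_loss = loss(W1, W2);

println("initial loss = ", initial_loss);
println("final loss   = ", final_loss);
```

The gradient expressions follow the standard softmax cross-entropy derivation discussed in the Rong (2014) reference above; a full training run would loop over every pair in `training_dataset` rather than a single example.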