# L13b: Long Short-Term Memory (LSTM) Model for Natural Language Text
In this lab, we'll compare the binary classification performance of a traditional feedforward neural network (FNN) and a recurrent neural network (RNN) using long short-term memory (LSTM) cells on a sarcasm detection task. The dataset is a set of approximately 28,000 news headlines labeled as sarcastic or not sarcastic. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection).

__This is a _super hard_ problem__. Classification of sarcasm text is challenging because sarcasm often relies on subtle contextual cues, tone, and cultural knowledge that are difficult for algorithms to detect and interpret accurately.

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. Code issues, then raise your hands - and let's get those fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we'll load a public dataset of headlines curated as either sarcastic or not sarcastic. Our dataset is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). After loading the data, we'll tokenize the data (convert text strings to numerical arrays).
* __Task 2: Construct, Train, and Analyze a Sarcasm FFN (15 min)__: In this task, we'll construct a feedforward neural network (FNN) model to classify (binary classification task) the sarcasm dataset. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to construct and train the model. Does an FNN model beat an RNN for this task?
* __Task 3: Construct, Train, and Analyze a Sarchasim LSTM (15 min)__: In this task, we'll construct an LSTM with a dense output layer and train it using a collection of labeled headlines. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to construct and train the model.

Let's get started!
___

## Task 1: Setup, Data and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Other packages are downloaded and then loaded.

In [3]:
include("Include.jl");

### Sarcasm Data
We'll load a public dataset of headlines that have been curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

The data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0.`
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We've developed a parser to read the sarcasm data file. The [`corpus(...)` method](src/Files.jl) takes the `path::String` argument (the path to the data file) and returns a [`MySarcasmRecordCorpusModel` instance](src/Types.jl) that holds the data. 

In [5]:
corpusmodel = joinpath(_PATH_TO_DATA, "Sarcasm_Headlines_Dataset_v2.txt") |> corpus;

The [`MySarcasmRecordCorpusModel` instance](src/Types.jl) has the fields that are populated when we read the file:
* The `records::Dict{Int, MySarcasmRecordModel}` field holds the original records data as a dictionary, where the keys of the dictionary correspond to the headline index, and the values are [instances of the `MySarcasmRecordModel` type](src/Types.jl).
* The `tokens::Dict{String, Int64}` field holds the vocabulary computed over the dataset as a dictionary, where the dictionary's keys are the words (called tokens) and the values of the index of the word. We assemble the `tokens` dictionary in alphabetical order. This is initially undefined.
* The `inverse::Dict{Int64, String}` field is the inverse of the `tokens` dictionary, where the keys are the token indexes and the values are the tokens (words).

Each [`MySarcasmRecordModel` instance](src/Types.jl) has the three fields in the original data records: an `issarcastic::Bool` field holding the label for this record, the `headline::String` field holding the headline and the `article::String` field holding a link to the original article.

In [8]:
corpusmodel.records[5].headline

"mother comes pretty close to using word streaming correctly"

### Tokenize the headline records
In this task, we'll use the corpus model, particularly the `tokens::Dict{String, Int64}` dictionary, to tokenize headlines in our dataset, i.e., convert a text representation into a numerical vector representation. 

To better understand how this works, let's first examine a single (random) record and tokenize it.  We'll select a random record from the `number_of_records::Int64` possible records [using the built-in `rand(...)` method](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand), and store it in the `random_test_record::MySarcasmRecordModel` variable

In [10]:
number_of_records = corpusmodel.records |> length; # what is going on here?
random_test_record = rand(1:number_of_records) |> i -> corpusmodel.records[i]

MySarcasmRecordModel(false, "palestinians suspicious of alaqsa surveillance promoted by kerry", "https://www.huffingtonpost.com/entry/palestinians-suspicious-of-al-aqsa-surveillance-promoted-by-kerry_us_562d5454e4b0443bb564547a")

Next, let's call [the `tokenize(...)` method](src/Compute.jl), which takes the `headline::String` that we want to tokenize, and our vocabulary stored in the `tokens::Dict{String, Int64}` dictionary and returns a token vector

In [12]:
tv = tokenize(random_test_record.headline, corpusmodel.tokens)

8-element Vector{Int64}:
 19193
 25945
 18533
  1421
 25920
 20861
  4362
 14698

### Hmmm. What happens if a token is not in the dataset?
We have created the vocabulary in the `tokens::Dict{String, Int64}` dictionary by analyzing the entire dataset, but suppose we have new samples that aren't in the dataset; what happens then? We've added the `<OOV>` token to our dataset; let's see if that works. 
* Let's take the headline from the `random_test_record::MySarcasmRecordModel` instance and add something to the end, e.g., `#ilovemyroomba`. we should get the `<OOV>` token at the end of the token vector.

In [14]:
words = corpusmodel.tokens |> keys |> collect; # what?? We are getting keys (words) and turning into an array
"#ilovemyroomba" ∈ words # fancy way of checking if item is in array

false

Create a new headline by appending `#ilovemyroomba` to the old headline. String append operations in Julia use [the `*` method](https://docs.julialang.org/en/v1/manual/strings/)

In [16]:
new_test_headline = random_test_record.headline * " " * "#ilovemyroomba"

"palestinians suspicious of alaqsa surveillance promoted by kerry #ilovemyroomba"

Tokenize the `new_test_headline::String`, and let's see what happens:

In [18]:
tv = tokenize(new_test_headline, corpusmodel.tokens)

9-element Vector{Int64}:
 19193
 25945
 18533
  1421
 25920
 20861
  4362
 14698
   912

### Compute the maximum pad length
Not every headline has the same length, but we want the token vectors to have the same size. Thus, we'll find the longest vectors in the dataset and pad the token vectors to that length. To do that, let's iterate through each headline, compute its size, and then save this length if it is longer than we've seen before.

In [20]:
max_pad_length = let

    max_pad_length = 0; # initialize: we have 0 length
    for i ∈ 1:number_of_records
        test_record_length = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens) |> length; # tokenize, and calc the number of tokens
        if (test_record_length > max_pad_length)
            max_pad_length = test_record_length; # we've found a new longest headline!
        end
    end
    max_pad_length
end

151

### Compute the vector representation of all headline samples
Finally, now that we have found the `max_pad_length::Int64`, we can tokenize all records using the `max_pad_length::Int64` value as the `pad` value in [the `tokenize(...)` method](src/Compute.jl). 
* We'll use `right-padding` and will store the tokenized records for each headline in the `token_record_dictionary::Dict{Int64, Array{Int64,1}}` dictionary, where the keys of this dictionary are the record indexes, and the values of the tokenized records (which are of type `Array{Int64,1}.`)

In [22]:
token_record_dictionary, labels = let

    # initialize -
    token_record_dictionary = Dict{Int64, Array{Float32,1}}();
    labels = Dict{Int64, Float32}();
    
    for i ∈ 1:number_of_records
        v = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens, 
                pad = max_pad_length); 
        l = corpusmodel.records[i].issarcastic; # 1 for sarcastic, 0 for not sarcastic
        token_record_dictionary[i] = v .|> Float32; # convert to float32
        labels[i] = l .|> Float32; # convert to float32
    end

    # return -
    token_record_dictionary, labels
end;

### Save tokenized data and labels to disk
We did a bunch of stuff in this example, and we don't want to have to recompute the corpus, token dictionary, etc. So let's save it [in an HDF5 encoded binary file](https://en.wikipedia.org/wiki/Hierarchical_Data_Format). 
* _Details_: To start, we specify a path. We'll then write data to disk as a `jld2` (binary) saved file using [the `save(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). This will save the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict). The save file is [an HDF5 encoded file format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), which is small (compressed), which is excellent! 

In [24]:
let
    # initialize -
    path_to_save_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2"); 
    save(path_to_save_file, Dict("corpus" => corpusmodel, 
        "number_of_records" => number_of_records, 
        "tokenrecorddictionary" => token_record_dictionary, 
        "labeldictionary" => labels)); # encode, and write
end

__Constants__: Let's set up some constants we will use in the exercise. Check the comment next to the value for a description of its meaning, permissible values, etc.

In [26]:
number_of_training_examples = 10000; # how many training examples?
number_of_inputs = max_pad_length; # dimension of the input
number_of_hidden_states = 2^10; # dimension of hidden state memory
σ₂ = NNlib.tanh_fast; # activation function
number_of_epochs = 50; # TODO: update how many epochs we want to train for
number_digit_array = range(0,length=2,step=1) |> collect; # numbers 0 ... 1

__Build a training dataset__. The training data will consist of a vector of tuples, where the first element is the tokenized headline and the second is the label [in OneHot format](https://en.wikipedia.org/wiki/One-hot). 
* We save the training data in the `training_headlines_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` variable. This vector will have `number_of_training_examples::Int` elements.

In [28]:
training_headlines_dataset = let

    # generate random index set -
    random_training_index_set = Set{Int64}();
    
    # Uncomment me for random selection -
    # should_stop_loop = false;
    # counter = 0;
    # while (should_stop_loop == false)
    #     i = rand(1:number_of_records);
    #     push!(random_training_index_set, i);

    #     if (length(random_training_index_set) ≥ number_of_training_examples)
    #         should_stop_loop = true; # ok to stop the loop
    #     else
    #         counter += 1;
    #     end
    # end
    # random_training_index_array = random_training_index_set |> collect |> sort;

    # Uncomment me for sequential selection -
    random_training_index_array = range(1,
        length=number_of_training_examples, step=1) |> collect; # sequential selection
    
    
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}()
    for index ∈ eachindex(random_training_index_array)
        i = random_training_index_array[index];
        token_record = token_record_dictionary[i] .|> Float32; # get the tokenized headline
        one_hot_label = onehot(labels[i],number_digit_array); # get the label
        push!(training_dataset, (token_record, one_hot_label)); # add to the dataset
    end
    
    training_dataset;
end;

## Task 2: Construct, Train, and Analyze a Sarcasm FFN 
In this task, we'll construct a feedforward neural network (FNN) model to classify the sarcasm dataset. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to construct and train the model. 

Let's start by building the model, which we'll store in the `fnnmodel::Chain` variable, where the [`Chain` type is exported by the `Flux.jl` package](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.Chain). One of the nice things about this formulation is that it is easy to abb (or subtract) layers from the FNN model.

In [30]:
Flux.@layer MyFluxFeedForwardNeuralNetworkModel trainable=(input, output); # create a "namespaced" of sorts
MyFNNModel() = MyFluxFeedForwardNeuralNetworkModel( # a strange type of constructor
    Flux.Chain(
        input = Flux.Dense(number_of_inputs => number_of_hidden_states, σ₂),  # hidden layer
        # middle = Flux.Dense(number_of_hidden_states => number_of_hidden_states, σ₂), # output layer
        # final = Flux.Dense(number_of_hidden_states => number_of_hidden_states, σ₂), # output layer
        output = Flux.Dense(number_of_hidden_states => 2, σ₂), # output layer
        softmax = NNlib.softmax # softmax layer
    )
);
fnnmodel = MyFNNModel().chain; # Hmmm. fnnmodel is callable? (Yes, because of a cool Julia syntax quirk)

_Which FNN optimizer do we use_? The [`Flux.jl` library supports _many_ optimizers](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers-Reference) which are all some version of gradient descent. We'll use [Gradient descent with momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) where the `λ` parameter denotes the `learning rate` and `β` denotes the momentum parameter. 
* We save information about the optimizer in the `opt_fnn` variable, which will eventually be passed to the feedforward network training loop. The `opt_fnn` variable is a complex data structure composed [of `NamedTuples` instances](https://docs.julialang.org/en/v1/base/base/#Core.NamedTuple)

In [32]:
opt_fnn = let

    λ = 0.50; # TODO: update the learning rate
    β = 0.10; # TODO: update the momentum parameter
    opt_state = Flux.setup(Momentum(λ, β), fnnmodel); # opt_state has all the details of the optimizer

    # return -
    opt_state;
end;

__Training loop__. The training loop for an FNN is simpler than that of an RNN. If the `should_we_train::Bool` flag is `true,` then we process all the batches in the `training_headlines_dataset::Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}` dataset `number_of_epochs::Int` times. The updated model instance is returned in the `trained_fnn_model::Chain` variable.
* _What happens if the training flag is false?_ If the `should_we_train::Bool` flag is set to `false,` we load a previously saved model state and use that for computation. If we change the model, then this previous state is no longer valid.

In [34]:
trained_fnn_model = let 
   
    should_we_train = true; # TODO: set this flag to {true | false}
    model = fnnmodel;
    if (should_we_train == true)
        for i = 1:number_of_epochs
        
            # train the model -
            Flux.train!(model, training_headlines_dataset, opt_fnn) do m, x, y
                Flux.Losses.logitcrossentropy(m(x), y; agg = mean); # loss for training multiclass classifiers, what is the agg?
            end
    
            if (rem(i,10) == 0)
                @info "Epoch $i of $number_of_epochs completed" # print the epoch number
            end
    
            # save the state of the model, in case something happens. We can reload from this state
            jldsave(joinpath(_PATH_TO_DATA, "tmp-model-training-checkpoint.jld2"), model_state = Flux.state(model))  
        end
    else

        # if we don't train: load up a previous model
        model_state = JLD2.load(joinpath(_PATH_TO_DATA, "tmp-model-training-checkpoint.jld2"), "model_state");
        Flux.loadmodel!(model, model_state);
    end

    # return -
    model;
end

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mEpoch 10 of 50 completed
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mEpoch 20 of 50 completed
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mEpoch 30 of 50 completed
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mEpoch 40 of 50 completed
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mEpoch 50 of 50 completed


Chain(
  input = Dense(151 => 1024, tanh_fast),  [90m# 155_648 parameters[39m
  output = Dense(1024 => 2, tanh_fast),  [90m# 2_050 parameters[39m
  softmax = NNlib.softmax,
) [90m                  # Total: 4 arrays, [39m157_698 parameters, 616.211 KiB.

### How well does the FNN classify the `training` dataset?
In the code block below, we pass the headline training dataset into the `fnnmodel::Chain` instance, compute the predicted label `ŷ,` and compare the predicted and actual labels for the `training_headlines_dataset` dataset.
* __Logic__: If the prediction and the actual label agree, we update the `S` variable (a running count of the number of correct predictions). Finally, we compute the fraction of _correct_ classifications by dividing the number of correct predictions by the total number of images in the `training_headlines_dataset` dataset.

In [36]:
let
    S_training = 0;
    for i ∈ eachindex(training_headlines_dataset)
    
        x = training_headlines_dataset[i][1];
        y = training_headlines_dataset[i][2];
        ŷ = trained_fnn_model(x) |> z-> argmax(z) |> z-> number_digit_array[z] |> z-> onehot(z,[0,1])
        y == ŷ ? S_training +=1 : nothing
    end
    correct_prediction_training = (S_training/length(training_headlines_dataset))*100;
    println("Correct prediction % on the training data: $(correct_prediction_training)%");
end

Correct prediction % on the training data: 51.81%


## Task 3: Construct, Train, and Analyze a Sarchasim LSTM
In this task, we'll construct an LSTM with a dense output layer and train it using a collection of labeled headlines. We'll use [the `Flux.jl` package](https://github.com/FluxML/Flux.jl) to construct and train the model.

Let's start by building the LSTM model. This follows much the same structure as the FNN model above, but now we have (as our first layer) an [`LSTM` block type exported by `Flux.jl`](https://fluxml.ai/Flux.jl/stable/reference/models/layers/#Flux.LSTM). This layer takes a text vector $\mathbf{x}$ and returns the _hidden_ state vector $\mathbf{h}_{t}$. We run this through a dense output layer and then [a `softmax(...)` method](https://fluxml.ai/NNlib.jl/dev/reference/#Softmax) to compute the probability of the binary label (sarcastic, or not sarcastic). 

We store the LSTM model in the `lstmmodel::Chain` variable.

In [38]:
Flux.@layer MyFluxLSTMNeuralNetworkModel trainable=(lstm, output); # create a "namespaced" of sorts
MyLSTMRNNModel() = MyFluxLSTMNeuralNetworkModel( # a strange type of constructor
    Flux.Chain(
        lstm = Flux.LSTM(number_of_inputs => number_of_hidden_states),  # hidden layer
        output = Flux.Dense(number_of_hidden_states => 2, σ₂), # output layer
        softmax = NNlib.softmax # softmax layer
    )
);
lstmmodel = MyLSTMRNNModel().chain; # Hmmm. lstmmodel is callable? (Yes, because of a cool Julia syntax quirk)

_Which LSTM optimizer_? The [`Flux.jl` library supports _many_ optimizers](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers-Reference) which are all some version of gradient descent. 
* We'll use [Gradient descent with momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) where the `λ` parameter denotes the `learning rate` and `β` denotes the momentum parameter. We save information about the optimizer in the `opt_lstm` variable, which will eventually get passed to the training loop.

In [40]:
opt_lstm = let

    λ = 0.50; # TODO: update the learning rate
    β = 0.10; # TODO: update the momentum parameter
    opt_state = Flux.setup(Momentum(λ, β), lstmmodel); # opt_state has all the details of the optimizer

    # return -
    opt_state;
end;

__Training loop__. `Unhide` the code block below to see the training loop for our Elman RNN. In the training loop, we process the training data for `number_of_epochs::Int` epochs (each epoch is one complete pass through all the training data). The implementation below uses [a few interesting `Flux.jl` specific features](https://github.com/FluxML/Flux.jl). 
* _Automatic gradient?_: The [`Flux.jl` package](https://fluxml.ai/Flux.jl/stable/) has [the `gradient(...)` method](https://fluxml.ai/Flux.jl/stable/guide/models/basics/#man-taking-gradients) which [uses automatic differentiation](https://arxiv.org/abs/1502.05767) to compute _exact_ gradient values. This is a super interesting feature that removes much of the headache associated with computing the gradient of neural networks.
* _Update!?_ The [`update!(...)` method](https://fluxml.ai/Flux.jl/stable/reference/training/reference/#Optimisers.update!) is a [mutating method](https://docs.julialang.org/en/v1/manual/functions/#man-functions), i.e., changes made in the method are visible in the calling scope. In this case, the [`update!(...)` method](https://fluxml.ai/Flux.jl/stable/reference/training/reference/#Optimisers.update!) using the gradient and the optimizer to update the model parameters stored in the model instance. It also updates the `opt_state` data, although what it is doing is not clear.

In [42]:
trained_lstm_model = let
    
    # put the training data in the right format -
    x = Array{Float32, 3}(undef, (number_of_inputs, 1, number_of_training_examples)); # initialize
    y = Array{Any, 3}(undef, (2, 1, number_of_training_examples)); # initialize

    # package the data up
    for i ∈ 1:number_of_training_examples
        x[:, 1, i] = training_headlines_dataset[i][1]; # get the tokenized headline
        y[:, 1, i] = training_headlines_dataset[i][2]; # get the label
    end

    # training loop: Notice this is a hassle compared to the FNN loop. 
    model = lstmmodel; # This is the model we want to train (with default parameters initially)
    tree = opt_lstm; # details of the optimizer that we'll use
    for i ∈ 1:number_of_epochs
        
        g = gradient(m -> Flux.logitcrossentropy(m(x), y), model); # Hmmm. This uses automatic differentiation, cool!
        (newtree, newmodel) = Flux.update!(tree, model, g[1]) # run the model to convergence(?) - not sure. Docs are bad. Come on Flux.jl!!
        
        model = newmodel; # reset the model to the new *updated* instance
        tree = newtree; # reset the opt tree to the new *updated* instance (not sure what is going on here, Docs bad! Get it together Flux.jl!)
    end
    model
end

Chain(
  lstm = LSTM(151 => 1024),             [90m# 4_816_896 parameters[39m
  output = Dense(1024 => 2, tanh_fast),  [90m# 2_050 parameters[39m
  softmax = NNlib.softmax,
) [90m                  # Total: 5 arrays, [39m4_818_946 parameters, 18.383 MiB.

### How well does the LSTM classify the `training` dataset?
In the code block below, we pass the headline training dataset into the `lstmmodel::Chain` instance, compute the predicted label `ŷ,` and compare the predicted and actual labels for the `training_headlines_dataset` dataset.
* __Logic__: If the prediction and the actual label agree, we update the `S` variable (a running count of correct predictions). Finally, we compute the fraction of _correct_ classifications by dividing the number of correct predictions by the total number of images in the `training_headlines_dataset` dataset.

The LSTM model has _weird_ data needs, so let's first compute the `ŷ` vector, i.e., the _predicted_ labels for the headlines.

In [44]:
ŷ,Ŷ = let 

    # put the training data in the right format -
    x = Array{Float32, 3}(undef, (number_of_inputs, 1, number_of_training_examples)); # initialize
    ŷ = Vector{OneHotVector{UInt32}}(undef, number_of_training_examples); # initialize

    # package the tokenized headlines
    for i ∈ 1:number_of_training_examples
        x[:, 1, i] = training_headlines_dataset[i][1]; # get the tokenized headline
    end

    # output
    Ŷ = trained_lstm_model(x); # compute the output from evaluating the input training data 
    for i ∈ 1:number_of_training_examples
        yᵢ = Ŷ[:,:,i] |> vec # get the i-th output
        choice = argmax(yᵢ) |> z-> number_digit_array[z] |> z-> onehot(z,[0,1])
        ŷ[i] = choice; # add to the output 
    end
    
    ŷ, Ŷ # ŷ: vector of one-hot vectors for prediction, Ŷ is the original output from the LSTM
end;

Compute the performance on the training data for the LSTM.

In [46]:
let
    S_training = 0;
    for i ∈ eachindex(training_headlines_dataset)
        
        y = training_headlines_dataset[i][2];
        y == ŷ[i] ? S_training +=1 : nothing
    end
    correct_prediction_training = (S_training/length(training_headlines_dataset))*100;
    println("Correct prediction % on the training data: $(correct_prediction_training)%");
end

Correct prediction % on the training data: 53.73%


## Next time
In the lecture `L13c` (and associated lab), we'll introduce another (more advanced) approach for modeling long sequences based on [Time Invariant Linear State Space models](https://en.wikipedia.org/wiki/State-space_representation). We'll consider two different approaches to modeling long sequences:

* [Gu, A., Goel, K., & Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. ArXiv, abs/2111.00396.](https://arxiv.org/abs/2111.00396)
* [Gu, A., Johnson, I., Timalsina, A., Rudra, A., & Ré, C. (2022). How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections. ArXiv, abs/2206.12037.](https://arxiv.org/abs/2206.12037)
* [Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. ArXiv, abs/2312.00752.](https://arxiv.org/abs/2312.00752)