# L13b: Long Short-Term Memory (LSTM) Model for Natural Language Text
Fill me in

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand, and let's get those fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: Let's take a few minutes to load and analyze a dataset of news headlines, labeled as sarcastic or not sarcastic, downloaded from [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). Once we load the data, we'll do some data wrangling (tokenization and padding).
* __Task 2: Setup the model structure and training (15 min)__: In this task, we'll construct and train a feedforward neural network (FNN) model, i.e., we'll learn the model parameters, using [the gradient descent with momentum algorithm](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) to minimize [a cross-entropy loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/).
* __Task 3: Play around with the model structure and parameters (20 min)__: In this task, we'll change the model structure, e.g., how many hidden states we have, and include other layers. We'll also change the learning rate and other hyperparameters and look at their effect on the model performance. We'll also look at the effect of changing the number of training epochs and the batch size.

Let's get started!
___

## Task 1: Setup, Data and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, the packages are loaded. Otherwise, the packages are downloaded and then loaded.

In [1]:
include("Include.jl");

### Text Data
We'll load a public dataset of headlines that have been curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

The data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We've developed a parser to read the sarcasm data file. The [`corpus(...)` method](src/Files.jl) takes the `path::String` argument (the path to the datafile) and returns a [`MySarcasmRecordCorpusModel` instance](src/Types.jl) which holds the data. 

In [2]:
corpusmodel = joinpath(_PATH_TO_DATA, "Sarcasm_Headlines_Dataset_v2.txt") |> corpus;

The [`MySarcasmRecordCorpusModel` instance](src/Types.jl) has the following fields, which are populated when we read the file:
* The `records::Dict{Int, MySarcasmRecordModel}` field holds the original records data as a dictionary, where the keys of the dictionary correspond to the headline index, and the values are [instances of the `MySarcasmRecordModel` type](src/Types.jl).
* The `tokens::Dict{String, Int64}` field holds the vocabulary computed over the dataset as a dictionary, where the dictionary's keys are the words (called tokens) and the values are the indexes of the words. We assemble the `tokens` dictionary in alphabetical order. This is initially undefined.
* The `inverse::Dict{Int64, String}` field is the inverse of the `tokens` dictionary, where the keys are the token indexes and the values are the tokens (words).
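
To make the relationship between the `tokens` and `inverse` dictionaries concrete, here is a minimal sketch on a tiny made-up corpus. It is not the implementation behind the `corpus(...)` method; the `headlines` and `words` names are hypothetical, and we assume that words are separated by whitespace.

```julia
# Hypothetical mini-corpus; the real vocabulary comes from the `corpus(...)` method.
headlines = ["mother comes pretty close", "news website likes news"];

# Collect the unique words, then sort so that the indexes follow alphabetical order.
words = sort(unique(vcat(split.(headlines)...)));

# tokens: word -> index, inverse: index -> word
tokens = Dict{String, Int64}(String(w) => i for (i, w) in enumerate(words));
inverse = Dict{Int64, String}(i => w for (w, i) in tokens);

tokens["news"]  # 5, because "news" is the fifth word alphabetically
inverse[5]      # "news"
```

Looking up `inverse[tokens[w]]` returns `w` for every word in the vocabulary, which is handy when we want to turn a token vector back into text.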

In [3]:
corpusmodel.records |> length

28619

Each [`MySarcasmRecordModel` instance](src/Types.jl) has the three fields in the original data records: an `issarcastic::Bool` field holding the label for this record, the `headline::String` field holding the headline and the `article::String` field holding a link to the original article.

In [4]:
corpusmodel.records[5].headline

"mother comes pretty close to using word streaming correctly"

### Tokenize the headline records
In this task, we'll use the corpus model, particularly the `tokens::Dict{String, Int64}` dictionary, to tokenize headlines in our dataset, i.e., convert a text representation into a numerical vector representation. 

To better understand how this works, let's first examine a single (random) record and tokenize it. We'll select a random record from the `number_of_records::Int64` possible records [using the built-in `rand(...)` method](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand), and store it in the `random_test_record::MySarcasmRecordModel` variable.

In [5]:
number_of_records = corpusmodel.records |> length; # pipe the records dictionary into length to count the records
random_test_record = rand(1:number_of_records) |> i -> corpusmodel.records[i]

MySarcasmRecordModel(true, "news website likes to set aside a little ad space to promote own articles", "https://local.theonion.com/news-website-likes-to-set-aside-a-little-ad-space-to-pr-1819579398")

Next, let's call [the `tokenize(...)` method](src/Compute.jl), which takes the `headline::String` that we want to tokenize and our vocabulary stored in the `tokens::Dict{String, Int64}` dictionary, and returns a token vector.

In [6]:
tv = tokenize(random_test_record.headline, corpusmodel.tokens)

14-element Vector{Int64}:
 17990
 28821
 15543
 26826
 23707
  2224
   914
 15641
  1115
 24839
 26826
 20860
 19081
  2176

### Hmmm. What happens if a token is not in the dataset?
We have created the vocabulary in the `tokens::Dict{String, Int64}` dictionary by analyzing the entire dataset, but suppose we have new samples that contain words that aren't in the dataset; what happens then? We've added the `<OOV>` (out-of-vocabulary) token to our vocabulary; let's see if that works. 
* Let's take the headline from the `random_test_record::MySarcasmRecordModel` instance and add something to the end, e.g., `#ilovemyroomba`. We should get the `<OOV>` token at the end of the token vector.

In [7]:
words = corpusmodel.tokens |> keys |> collect; # get the keys (words) of the vocabulary and collect them into an array
"#ilovemyroomba" ∈ words # fancy way of checking if an item is in the array

false

Create a new headline by appending `#ilovemyroomba` to the old headline. String concatenation in Julia uses [the `*` operator](https://docs.julialang.org/en/v1/manual/strings/).

In [8]:
new_test_headline = random_test_record.headline * " " * "#ilovemyroomba"

"news website likes to set aside a little ad space to promote own articles #ilovemyroomba"

Tokenize the `new_test_headline::String`, and let's see what happens:

In [9]:
tv = tokenize(new_test_headline, corpusmodel.tokens)

15-element Vector{Int64}:
 17990
 28821
 15543
 26826
 23707
  2224
   914
 15641
  1115
 24839
 26826
 20860
 19081
  2176
   912
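
The last entry of the token vector is the index used for the unknown word. A minimal sketch of how this kind of fallback could work is shown below. The `sketch_tokenize` function and the tiny `vocab` dictionary are hypothetical (they are not the course's `tokenize(...)` method), and the indexes are made up for illustration.

```julia
# Hypothetical vocabulary with a reserved out-of-vocabulary entry.
vocab = Dict{String, Int64}("<OOV>" => 1, "likes" => 2, "news" => 3, "website" => 4);

# Sketch of a tokenizer: look up each word, fall back to the <OOV> index if it is missing.
function sketch_tokenize(headline::String, tokens::Dict{String, Int64})::Vector{Int64}
    oov = tokens["<OOV>"];
    return [get(tokens, String(word), oov) for word in split(headline)];
end

sketch_tokenize("news website likes #ilovemyroomba", vocab)  # [3, 4, 2, 1]
```

The `get(...)` call with a default value means that unknown words never throw a `KeyError`; they all collapse onto the same `<OOV>` index.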

### Compute the maximum pad length
Not every headline has the same length, but we want the token vectors to have the same size. Thus, we'll find the longest token vector in the dataset and pad all the token vectors to that length. To do that, let's iterate through each headline, compute its length, and save this length if it is longer than any we've seen before.

In [10]:
max_pad_length = let

    max_pad_length = 0; # initialize: we have 0 length
    for i ∈ 1:number_of_records
        test_record_length = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens) |> length; # tokenize, and calc the number of tokens
        if (test_record_length > max_pad_length)
            max_pad_length = test_record_length; # we've found a new longest headline!
        end
    end
    max_pad_length
end

151

### Compute the vector representation of all headline samples
Finally, now that we have found the `max_pad_length::Int64`, we can tokenize all records using the `max_pad_length::Int64` value as the `pad` value in [the `tokenize(...)` method](src/Compute.jl). 
* We'll use `right-padding` and will store the tokenized records for each headline in the `token_record_dictionary::Dict{Int64, Array{Float32,1}}` dictionary, where the keys of this dictionary are the record indexes, and the values are the tokenized records (which are of type `Array{Float32,1}`). We convert the tokens and labels to `Float32` because that is what the models expect later on.

In [11]:
token_record_dictionary, labels = let

    # initialize -
    token_record_dictionary = Dict{Int64, Array{Float32,1}}();
    labels = Dict{Int64, Float32}();
    
    for i ∈ 1:number_of_records
        v = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens, 
                pad = max_pad_length); 
        l = corpusmodel.records[i].issarcastic; # 1 for sarcastic, 0 for not sarcastic
        token_record_dictionary[i] = v .|> Float32; # convert to float32
        labels[i] = l .|> Float32; # convert to float32
    end

    # return -
    token_record_dictionary, labels
end

(Dict{Int64, Vector{Float32}}(24824 => [25877.0, 6523.0, 16124.0, 24452.0, 13458.0, 7184.0, 19562.0, 4737.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], 25754 => [20180.0, 17482.0, 12832.0, 18535.0, 19766.0, 25507.0, 3017.0, 20259.0, 28438.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], 11950 => [2446.0, 8040.0, 4362.0, 1645.0, 6930.0, 18873.0, 21117.0, 913.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], 1703 => [8236.0, 6707.0, 23707.0, 26826.0, 29323.0, 16580.0, 18615.0, 4068.0, 23172.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], 12427 => [17647.0, 22223.0, 26826.0, 12327.0, 20945.0, 29192.0, 28514.0, 21852.0, 8483.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], 7685 => [26618.0, 26363.0, 27362.0, 26826.0, 16117.0, 26534.0, 22568.0, 22967.0, 29484.0, 3017.0  …  913.0, 913.0, 913.0, 913.0, 9
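
In the output above, every vector ends with a run of the same value, which is the padding. Right-padding itself can be sketched in a few lines. The `right_pad` helper, the `token_vectors` data, and the `pad_index` value below are hypothetical; the course's `tokenize(...)` method handles this through its `pad` argument.

```julia
# Hypothetical token vectors of unequal length, and an assumed pad index.
token_vectors = [[3, 4, 2], [5], [7, 8, 9, 6]];
pad_index = 0;  # assumption for illustration only

# The longest vector sets the pad length.
pad_length = maximum(length, token_vectors);

# Right-padding: keep the tokens first, then fill the tail with the pad index.
right_pad(v, n, p) = vcat(v, fill(p, n - length(v)));
padded = [right_pad(v, pad_length, pad_index) for v in token_vectors];
# padded == [[3, 4, 2, 0], [5, 0, 0, 0], [7, 8, 9, 6]]
```

Note that `maximum(length, token_vectors)` computes the same quantity as the explicit loop we used to find `max_pad_length`.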

### Save tokenized data and labels to disk
We did a bunch of stuff in this example, and we don't want to have to recompute the corpus, token dictionary, etc. So let's save it [in an HDF5 encoded binary file](https://en.wikipedia.org/wiki/Hierarchical_Data_Format). 

To start, we specify a path. We'll then write the data to disk as a `jld2` (binary) file using [the `save(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). The data is saved as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict), so we can look up each item by its key when we reload the file.

In [12]:
let
    # initialize -
    path_to_save_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2"); 
    save(path_to_save_file, Dict("corpus" => corpusmodel, 
        "number_of_records" => number_of_records, 
        "tokenrecorddictionary" => token_record_dictionary, 
        "labeldictionary" => labels)); # encode, and write
end

__Constants__: Let's set up some constants that we will use in the exercise. Check the comment next to the value for a description of its meaning, permissible values, etc.

In [13]:
number_of_training_examples = 20000; # number of headlines used for training
number_of_inputs = max_pad_length; # dimension of the input
number_of_hidden_states = 2^10; # number of hidden states (units) in the hidden layer
σ₂ = NNlib.tanh_fast; # activation function
number_of_epochs = 1000; # TODO: update how many epochs we want to train for

Fill me in

In [14]:
training_headlines_dataset = let

    # initialize -
    training_dataset = Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}()
    for i ∈ 1:number_of_training_examples
       
        token_record = token_record_dictionary[i] .|> Float32; # get the tokenized headline
        one_hot_label = onehot(labels[i],[0,1]); # get the label
        push!(training_dataset, (token_record, one_hot_label)); # add to the dataset
    end
    
    training_dataset;
end

20000-element Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}:
 ([26617.0, 23295.0, 27980.0, 8295.0, 5553.0, 18533.0, 12047.0, 15828.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([7439.0, 22069.0, 26972.0, 17722.0, 29031.0, 6091.0, 14100.0, 9853.0, 23998.0, 18652.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [1, 0])
 ([8743.0, 29545.0, 28233.0, 869.0, 7418.0, 7808.0, 21629.0, 913.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [1, 0])
 ([13505.0, 28804.0, 20665.0, 15447.0, 10890.0, 11322.0, 26826.0, 29282.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([17429.0, 5812.0, 20655.0, 5563.0, 26826.0, 28097.0, 29278.0, 25501.0, 6408.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([17677.0, 28994.0, 13711.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0  …  913.
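
Each label in the training dataset is a one-hot vector over the classes `[0, 1]`. As a rough sketch of what the encoding does (the `sketch_onehot` function is hypothetical and returns a plain `Vector{Bool}`, not the `OneHotVector` type used above):

```julia
# Sketch of one-hot encoding: a true entry where the label matches a class.
sketch_onehot(label, classes) = [label == c for c in classes];

sketch_onehot(0.0f0, [0, 1])  # [true, false], i.e., not sarcastic
sketch_onehot(1.0f0, [0, 1])  # [false, true], i.e., sarcastic
```

So a sarcastic headline, which has the label `1`, shows up as `[0, 1]` in the dataset printout, while a non-sarcastic headline shows up as `[1, 0]`.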

## Task 2: Construct, Train and Analyze a Sarcasm FFN 
Fill me in

In [15]:
Flux.@layer MyFluxFeedForwardNeuralNetworkModel trainable=(input, output); # create a "namespace" of sorts
MyFNNModel() = MyFluxFeedForwardNeuralNetworkModel( # a strange type of constructor
    Flux.Chain(
        input = Flux.Dense(number_of_inputs => number_of_hidden_states, σ₂),  # hidden layer
        output = Flux.Dense(number_of_hidden_states => 2, σ₂), # output layer
        softmax = NNlib.softmax # softmax layer
    )
);
fnnmodel = MyFNNModel().chain; # Hmmm. fnnmodel is callable? (Yes, because of a cool Julia syntax quirk)

Fill me in

In [16]:
# Set up the loss function. The model already ends in a softmax layer, so we use crossentropy, which expects probabilities
fnnloss(ŷ, y) = Flux.Losses.crossentropy(ŷ, y; agg = mean); # loss for training multiclass classifiers; agg = mean averages the loss over the examples
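
Flux offers both a `crossentropy` loss, which expects probabilities, and a `logitcrossentropy` loss, which expects raw scores (logits) and applies the softmax internally. The sketch below shows the relationship between the two, and what a mean aggregation does, using plain Julia. All the `sketch_*` functions are hypothetical illustrations, not the Flux implementations.

```julia
# Numerically stable softmax over a vector of raw scores (logits).
function sketch_softmax(z)
    e = exp.(z .- maximum(z));
    return e ./ sum(e);
end

# Cross-entropy for one example: expects probabilities p and a one-hot label y.
sketch_crossentropy(p, y) = -sum(y .* log.(p));

# Logit cross-entropy: applies the softmax internally, so it expects raw scores.
sketch_logitcrossentropy(z, y) = sketch_crossentropy(sketch_softmax(z), y);

z = [2.0, 0.5]; # hypothetical raw scores for the two classes
y = [1.0, 0.0]; # one-hot label: the first class is correct

# Two hypothetical examples: a confident correct guess and a confident wrong guess.
batch_losses = [sketch_logitcrossentropy(z, y), sketch_logitcrossentropy(reverse(z), y)];
mean_loss = sum(batch_losses) / length(batch_losses); # a mean aggregation averages over the examples
```

The practical takeaway: if the model output has already been passed through a softmax, it should be paired with the probability version of the loss; otherwise, the softmax is applied twice.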

_Which FNN optimizer_? The [`Flux.jl` library supports _many_ optimizers](https://fluxml.ai/Flux.jl/stable/reference/training/optimisers/#Optimisers-Reference), which are all some version of gradient descent. We'll use [Gradient descent with momentum](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) where the `λ` parameter denotes the `learning rate` and `β` denotes the momentum parameter. We save information about the optimizer in the `opt_state` variable, which is returned as `opt_fnn` and will eventually get passed to the training method.

In [17]:
opt_fnn = let

    λ = 0.50; # TODO: update the learning rate
    β = 0.10; # TODO: update the momentum parameter
    opt_state = Flux.setup(Momentum(λ, β), fnnmodel); # opt_state has all the details of the optimizer

    # return -
    opt_state;
end;
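
To build some intuition for what the momentum optimizer does with `λ` and `β`, here is a minimal sketch of the update rule on a one-dimensional toy problem. The objective `f(θ) = (θ - 3)^2`, the `sketch_momentum` function, and the `steps` argument are hypothetical; we assume the common form of the rule, where a velocity accumulates a decayed sum of past gradients.

```julia
# Gradient of the toy objective f(θ) = (θ - 3)^2, which has its minimum at θ = 3.
∇f(θ) = 2 * (θ - 3);

function sketch_momentum(θ; λ = 0.50, β = 0.10, steps = 100)
    v = 0.0; # velocity: a running, decayed sum of past gradients
    for _ in 1:steps
        v = β * v + λ * ∇f(θ); # blend the previous velocity with the new gradient
        θ = θ - v;             # step along the velocity
    end
    return θ
end

θ_min = sketch_momentum(0.0) # approaches the minimizer θ = 3
```

Setting `β = 0` recovers plain gradient descent, while larger values of `β` let earlier gradients keep influencing the current step.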

Fill me in

In [None]:
trained_fnn_model = let 
   
    for i = 1:number_of_epochs
        
        # train the model -
        Flux.train!(fnnmodel, training_headlines_dataset, opt_fnn) do m, x, y
            fnnloss(m(x), y)
        end

        @info "Epoch $i of $number_of_epochs completed" # print the epoch number

        # save the state of the model, in case something happens. We can reload from this state
        jldsave(joinpath(_PATH_TO_DATA, "tmp-model-training-checkpoint.jld2"), model_state = Flux.state(fnnmodel))  
    end

    fnnmodel;
end

┌ Info: Epoch 1 of 1000 completed
└ @ Main /Users/jeffreyvarner/Desktop/julia_work/CHEME-5820-SP25/CHEME-5820-Labs-Spring-2025/labs/week-13/L13b/jl_notebook_cell_df34fa98e69747e1a8f8a730347b8e2f_X51sZmlsZQ==.jl:10


UndefVarError: UndefVarError: `model` not defined in `Main`
Suggestion: check for spelling errors or missing imports.

## Task 3: Construct, Train and Analyze a Sarcasm LSTM
Fill me in

In [19]:
Flux.@layer MyFluxLSTMNeuralNetworkModel trainable=(lstm, output); # create a "namespace" of sorts
MyLSTMRNNModel() = MyFluxLSTMNeuralNetworkModel( # a strange type of constructor
    Flux.Chain(
        lstm = Flux.LSTM(number_of_inputs => number_of_hidden_states),  # LSTM (recurrent) layer
        output = Flux.Dense(number_of_hidden_states => 2, σ₂), # output layer
        softmax = NNlib.softmax # softmax layer
    )
);
lstmmodel = MyLSTMRNNModel().chain; # Hmmm. lstmmodel is callable? (Yes, because of a cool Julia syntax quirk)

### Training
Next, let's set up the model training. One of the shortcomings of [the `Flux.jl` package](https://fluxml.ai/Flux.jl/stable/) is the generally opaque nature of model training. It's a headache, but we've figured it out (maybe). On the other hand, [the `Flux.jl` package](https://fluxml.ai/Flux.jl/stable/) does handle the model unrolling step for us, so the training works like a feedforward model.
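
To see what "unrolling" means, here is a minimal sketch of a standard LSTM cell applied step by step across a sequence, in plain Julia. The `sketch_lstm_step` and `sketch_unroll` functions, the stacked parameter layout, and the gate ordering are assumptions for illustration; they are not taken from the `Flux.jl` internals.

```julia
sigmoid(x) = 1 / (1 + exp(-x));

# One LSTM step. W (4n×d), U (4n×n), and b (4n) stack the input, forget,
# candidate, and output blocks, where n is the number of hidden states.
function sketch_lstm_step(W, U, b, h, c, x)
    n = length(h);
    z = W * x + U * h + b;
    i = sigmoid.(z[1:n]);        # input gate
    f = sigmoid.(z[n+1:2n]);     # forget gate
    g = tanh.(z[2n+1:3n]);       # candidate cell update
    o = sigmoid.(z[3n+1:4n]);    # output gate
    c_new = f .* c .+ i .* g;    # cell (long-term) state
    h_new = o .* tanh.(c_new);   # hidden (short-term) state
    return h_new, c_new
end

# Unrolling: apply the same step, with shared parameters, across the sequence.
function sketch_unroll(W, U, b, xs, n)
    h = zeros(n); c = zeros(n);
    for x in xs
        h, c = sketch_lstm_step(W, U, b, h, c, x);
    end
    return h
end

d, n = 3, 2;                  # hypothetical input and hidden dimensions
W = fill(0.1, 4n, d); U = fill(0.1, 4n, n); b = zeros(4n);
xs = [ones(d) for _ in 1:5];  # a hypothetical sequence of 5 time steps
h_final = sketch_unroll(W, U, b, xs, n)
```

The same `W`, `U`, and `b` are reused at every time step, which is why an unrolled recurrent model can be trained much like a (deep) feedforward model.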


__Training data__. In the code block below, we look at the training data for our LSTM. To simplify our life, we use the first `number_of_training_examples::Int` tokenized headlines, each padded to `number_of_inputs::Int` tokens.
* _What?_ Each training example is a tuple that holds a padded token vector, e.g., a vector with `151` entries, and a one-hot encoded label. Thus, we are training the model on `20000` of the `28619` headlines in the dataset.

In [21]:
token_record_dictionary[1] .|> Float32

151-element Vector{Float32}:
 26617.0
 23295.0
 27980.0
  8295.0
  5553.0
 18533.0
 12047.0
 15828.0
   913.0
   913.0
     ⋮
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0

In [23]:
training_headlines_dataset

20000-element Vector{Tuple{Vector{Float32}, OneHotVector{UInt32}}}:
 ([26617.0, 23295.0, 27980.0, 8295.0, 5553.0, 18533.0, 12047.0, 15828.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([7439.0, 22069.0, 26972.0, 17722.0, 29031.0, 6091.0, 14100.0, 9853.0, 23998.0, 18652.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [1, 0])
 ([8743.0, 29545.0, 28233.0, 869.0, 7418.0, 7808.0, 21629.0, 913.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [1, 0])
 ([13505.0, 28804.0, 20665.0, 15447.0, 10890.0, 11322.0, 26826.0, 29282.0, 913.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([17429.0, 5812.0, 20655.0, 5563.0, 26826.0, 28097.0, 29278.0, 25501.0, 6408.0, 913.0  …  913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0], [0, 1])
 ([17677.0, 28994.0, 13711.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0, 913.0  …  913.

__Loss function__: Fill me in

In [24]:
training_headlines_dataset[1][2]

2-element OneHotVector(::UInt32) with eltype Bool:
 ⋅
 1

__Training loop__. `Unhide` the code block below to see the training loop for our LSTM. In the training loop, we process the training data for `number_of_epochs::Int` epochs (each epoch is one complete pass through all the training data). The implementation below uses [a few interesting `Flux.jl` specific features](https://github.com/FluxML/Flux.jl). 
* _Automatic gradient?_: The [`Flux.jl` package](https://fluxml.ai/Flux.jl/stable/) has [the `gradient(...)` method](https://fluxml.ai/Flux.jl/stable/guide/models/basics/#man-taking-gradients) which [uses automatic differentiation](https://arxiv.org/abs/1502.05767) to compute _exact_ gradient values. This is a super interesting feature that removes much of the headache associated with computing the gradient of neural networks.
* _Update!?_ The [`update!(...)` method](https://fluxml.ai/Flux.jl/stable/reference/training/reference/#Optimisers.update!) is a [mutating method](https://docs.julialang.org/en/v1/manual/functions/#man-functions), i.e., changes made in the method are visible in the calling scope. In this case, the [`update!(...)` method](https://fluxml.ai/Flux.jl/stable/reference/training/reference/#Optimisers.update!) uses the gradient and the optimizer to update the model parameters stored in the model instance. It also updates the optimizer state stored in the `opt_state` data, e.g., the velocity terms used by the momentum rule.

In [25]:
trainedmodel = let
    
    # put the training data in the right format -
    x = Array{Float32, 3}(undef, (number_of_inputs, 1, number_of_training_examples)); # initialize the inputs
    y = Array{Float32, 3}(undef, (2, 1, number_of_training_examples)); # initialize the one-hot labels (2 classes)

    for i ∈ 1:number_of_training_examples
        x[:, 1, i] = training_headlines_dataset[i][1]; # get the tokenized headline
        y[:, 1, i] = training_headlines_dataset[i][2]; # get the one-hot label
    end

    model = lstmmodel; # this is the model we want to train (with default parameters initially)
    tree = Flux.setup(Momentum(0.50, 0.10), model); # optimizer state. Assumption: we reuse the FNN settings (λ = 0.50, β = 0.10)
    for i ∈ 1:number_of_epochs
        
        # the model already ends in a softmax layer, so we use crossentropy, which expects probabilities
        g = gradient(m -> Flux.crossentropy(m(x), y), model); # Hmmm. This uses automatic differentiation, cool!
        (newtree, newmodel) = Flux.update!(tree, model, g[1]); # take one optimizer step; returns the updated optimizer state and model
        
        model = newmodel; # reset the model to the new *updated* instance
        tree = newtree; # reset the optimizer state (e.g., the momentum velocities) to the new *updated* instance
    end
    model
end

## Task 3: Analyze the model
Fill me in.