# L13b: Long Short Term Memory (LSTM) Model for Natural Language Text

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand and let's get those fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: Let's take 5 minutes to load and analyze a dataset of news headlines labeled as sarcastic or not sarcastic, downloaded from [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). Once we load the data, we'll do some data wrangling (tokenization and padding).
* __Task 2: Set up the model structure and training (15 min)__: In this task, we'll construct and train the RNN model, i.e., we'll learn the model parameters, using [the gradient descent with momentum algorithm](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum) to minimize [the mean-squared error (mse) loss function](https://fluxml.ai/Flux.jl/stable/reference/models/losses/#Flux.Losses.mse).
* __Task 3: Play around with the model structure and parameters (20 min)__: In this task, we'll change the model structure, e.g., how many hidden states we have, and include other layers. We'll also change the learning rate and other hyperparameters and look at their effect on the model performance. We'll also look at the effect of changing the number of training epochs and the batch size.

Let's get started!
___

## Task 1: Setup, Data and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Otherwise, packages are downloaded and then loaded.

In [1]:
include("Include.jl");

### Text Data
We'll load a public dataset of headlines that have been curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

The data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`.
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We've developed a parser to read the sarcasm data file. The [`corpus(...)` method](src/Files.jl) takes the `path::String` argument (the path to the datafile) and returns a [`MySarcasmRecordCorpusModel` instance](src/Types.jl) which holds the data. 

In [2]:
corpusmodel = joinpath(_PATH_TO_DATA, "Sarcasm_Headlines_Dataset_v2.txt") |> corpus;

The [`MySarcasmRecordCorpusModel` instance](src/Types.jl) has the fields that are populated when we read the file:
* The `records::Dict{Int, MySarcasmRecordModel}` field holds the original records data as a dictionary, where the keys of the dictionary correspond to the headline index, and the values are [instances of the `MySarcasmRecordModel` type](src/Types.jl).
* The `tokens::Dict{String, Int64}` field holds the vocabulary computed over the dataset as a dictionary, where the keys are the words (called tokens) and the values are the word indexes. We assemble the `tokens` dictionary in alphabetical order. This field is initially undefined.
* The `inverse::Dict{Int64, String}` field is the inverse of the `tokens` dictionary, where the keys are the token indexes and the values are the tokens (words).
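
To make the `tokens` and `inverse` fields concrete, here is a minimal sketch of how such a pair of dictionaries could be assembled from a toy set of headlines. This is not the code in `src/Files.jl`: the toy headlines, the whitespace split, and the `<PAD>` token name are assumptions made for illustration.

```julia
# Toy headlines standing in for the real corpus (hypothetical data)
toy_headlines = ["oat farmer thinking about barley", "farmer buys barley"];

# Gather every word from every headline
words = String[];
for headline ∈ toy_headlines
    append!(words, split(headline, " "));
end

# Add the special tokens (the "<PAD>" name is an assumption), drop duplicates, sort alphabetically
vocabulary = vcat(words, ["<OOV>", "<PAD>"]) |> unique |> sort;

# tokens: word -> index, inverse: index -> word
tokens = Dict{String, Int64}(word => i for (i, word) ∈ enumerate(vocabulary));
inverse = Dict{Int64, String}(i => word for (word, i) ∈ tokens);
```

Because the two dictionaries mirror each other, `inverse[tokens["farmer"]]` gives back `"farmer"`.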

In [20]:
corpusmodel.records |> length

28619

Each [`MySarcasmRecordModel` instance](src/Types.jl) has the three fields in the original data records: an `issarcastic::Bool` field holding the label for this record, the `headline::String` field holding the headline and the `article::String` field holding a link to the original article.
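
As a rough picture of this data model, the sketch below defines a stand-in record type with the same three fields and stores two made-up records in a dictionary keyed by headline index. The `ToySarcasmRecord` type, the headlines, and the URLs are hypothetical; the actual type definition lives in `src/Types.jl` and may differ (e.g., it could be mutable).

```julia
# Hypothetical stand-in for the record type; field names follow the description above
struct ToySarcasmRecord
    issarcastic::Bool   # label: true if the headline is sarcastic
    headline::String    # headline text
    article::String     # link to the original article
end

# Records keyed by headline index, mirroring the layout of the `records` dictionary
toy_records = Dict{Int64, ToySarcasmRecord}(
    1 => ToySarcasmRecord(true, "farmer buys barley", "https://example.com/1"),
    2 => ToySarcasmRecord(false, "city council approves budget", "https://example.com/2")
);

toy_records[1].headline # access a field of the first record
```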

In [4]:
corpusmodel.records[5].headline

"mother comes pretty close to using word streaming correctly"

### Tokenize the headline records
In this task, we'll use the corpus model, particularly the `tokens::Dict{String, Int64}` dictionary, to tokenize headlines in our dataset, i.e., convert a text representation into a numerical vector representation. 

To better understand how this works, let's first examine a single (random) record and tokenize it. We'll select a random record from the `number_of_records::Int64` possible records [using the built-in `rand(...)` method](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand), and store it in the `random_test_record::MySarcasmRecordModel` variable.

In [26]:
number_of_records = corpusmodel.records |> length; # pipe the records dictionary into length(...) to count the records
random_test_record = rand(1:number_of_records) |> i -> corpusmodel.records[i]

MySarcasmRecordModel(true, "oat farmer seriously thinking about getting into barley", "https://local.theonion.com/oat-farmer-seriously-thinking-about-getting-into-barley-1825109075")
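
The `|>` operator used in the cell above pipes the value on its left into the function on its right, so `x |> f` is the same as `f(x)`. Here is a minimal sketch with toy data (the names `toy_records`, `n_direct`, `n_piped`, and `picked` are illustrative only):

```julia
toy_records = Dict(1 => "first headline", 2 => "second headline"); # toy data

n_direct = length(toy_records);   # ordinary function call
n_piped = toy_records |> length;  # the same call, written with the pipe operator

# An anonymous function on the right-hand side receives the piped value as its argument
picked = rand(1:n_piped) |> i -> toy_records[i];
```

Both `n_direct` and `n_piped` hold the same count, and `picked` is one of the two toy headlines, chosen at random.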

Next, let's call [the `tokenize(...)` method](src/Compute.jl), which takes the `headline::String` that we want to tokenize and our vocabulary stored in the `tokens::Dict{String, Int64}` dictionary, and returns a token vector:

In [6]:
tv = tokenize(random_test_record.headline, corpusmodel.tokens)

6-element Vector{Int64}:
  5225
 10571
  9020
  5225
 10571
  9020

### Hmmm. What happens if a token is not in the dataset?
We have created the vocabulary in the `tokens::Dict{String, Int64}` dictionary by analyzing the entire dataset, but suppose a new sample contains words that aren't in the dataset; what happens then? We've added the `<OOV>` (out-of-vocabulary) token to our vocabulary; let's see if that works.
* Let's take the headline from the `random_test_record::MySarcasmRecordModel` instance and add something to the end, e.g., `#ilovemyroomba`. We should get the `<OOV>` token at the end of the token vector.

In [7]:
words = corpusmodel.tokens |> keys |> collect; # get the keys (words) of the vocabulary and collect them into an array
"#ilovemyroomba" ∈ words # check whether the item is in the array

false

Create a new headline by appending `#ilovemyroomba` to the old headline. String concatenation in Julia uses [the `*` operator](https://docs.julialang.org/en/v1/manual/strings/):

In [9]:
new_test_headline = random_test_record.headline * " " * "#ilovemyroomba"

"chinese food emojis chinese food emojis #ilovemyroomba"

Tokenize the `new_test_headline::String`, and let's see what happens:

In [10]:
tv = tokenize(new_test_headline, corpusmodel.tokens)

7-element Vector{Int64}:
  5225
 10571
  9020
  5225
 10571
  9020
   912
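
The fallback behavior seen above, where the unknown word maps to a single reserved index, can be sketched as follows. This is not the implementation in `src/Compute.jl`: the `toy_tokenize` function, the whitespace split, and the toy vocabulary are assumptions for illustration.

```julia
# Hypothetical tokenizer: split on spaces, look up each word, fall back to <OOV>
function toy_tokenize(headline::String, tokens::Dict{String, Int64})::Vector{Int64}
    oov = tokens["<OOV>"]; # index reserved for out-of-vocabulary words
    return [get(tokens, String(word), oov) for word ∈ split(headline, " ")];
end

# Toy vocabulary (hypothetical indexes, assigned in alphabetical order)
toy_tokens = Dict{String, Int64}("<OOV>" => 1, "chinese" => 2, "emojis" => 3, "food" => 4);

toy_tokenize("chinese food emojis #ilovemyroomba", toy_tokens) # → [2, 4, 3, 1]
```

The three-argument `get(...)` call returns the default value when the key is missing, so unknown words never raise a `KeyError`.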

### Compute the maximum pad length
Not every headline has the same length, but we want the token vectors to have the same size. Thus, we'll find the longest token vector in the dataset and pad all token vectors to that length. To do that, let's iterate through each headline, compute its number of tokens, and then save this length if it is longer than any we've seen before.

In [27]:
max_pad_length = let

    max_pad_length = 0; # initialize: we have 0 length
    for i ∈ 1:number_of_records
        test_record_length = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens) |> length; # tokenize, and calc the number of tokens
        if (test_record_length > max_pad_length)
            max_pad_length = test_record_length; # we've found a new longest headline!
        end
    end
    max_pad_length
end

151
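
The same search can be written more compactly with the built-in `maximum(...)` function. The sketch below uses toy headlines and a plain whitespace split as a stand-in for the `tokenize(...)` call, so the names and data are illustrative only.

```julia
toy_headlines = ["farmer buys barley", "oat farmer seriously thinking about barley", "barley"];

# Number of tokens in each headline (the whitespace split stands in for tokenize)
lengths = [length(split(h, " ")) for h ∈ toy_headlines];

toy_max_pad_length = maximum(lengths); # → 6
```

The explicit loop in the cell above does the same thing one record at a time, which makes the bookkeeping easier to follow.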

### Compute the vector representation of all headline samples
Finally, now that we have found the `max_pad_length::Int64`, we can tokenize all records using the `max_pad_length::Int64` value as the `pad` value in [the `tokenize(...)` method](src/Compute.jl). 
* We'll use `right-padding` and will store the tokenized records for each headline in the `token_record_dictionary::Dict{Int64, Array{Int64,1}}` dictionary, where the keys of this dictionary are the record indexes, and the values are the tokenized records (which are of type `Array{Int64,1}`).

In [None]:
token_record_dictionary, labels = let

    # initialize -
    token_record_dictionary = Dict{Int64, Array{Int64,1}}();
    labels = Dict{Int64, Int64}();
    
    for i ∈ 1:number_of_records
        v = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens, 
                pad = max_pad_length); 
        l = corpusmodel.records[i].issarcastic; # 1 for sarcastic, 0 for not sarcastic
        token_record_dictionary[i] = v;
        labels[i] = l;
    end

    # return -
    token_record_dictionary, labels
end

(Dict(24824 => [25877, 6523, 16124, 24452, 13458, 7184, 19562, 4737, 913, 913  …  913, 913, 913, 913, 913, 913, 913, 913, 913, 913], 25754 => [20180, 17482, 12832, 18535, 19766, 25507, 3017, 20259, 28438, 913  …  913, 913, 913, 913, 913, 913, 913, 913, 913, 913], …
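
In the output above, every vector ends in a run of `913` values, which appears to be the padding index. Right-padding itself can be sketched as below. The `toy_right_pad` function is hypothetical and is not the `pad` logic in `src/Compute.jl`; in particular, truncating vectors that are already too long is an assumption.

```julia
# Hypothetical right-padding helper: append the pad token until the vector has pad_length entries
function toy_right_pad(v::Vector{Int64}, pad_length::Int64, pad_token::Int64)::Vector{Int64}
    n = length(v);
    n ≥ pad_length && return v[1:pad_length]; # already long enough: truncate (an assumption)
    return vcat(v, fill(pad_token, pad_length - n));
end

toy_right_pad([5225, 10571, 9020], 6, 913) # → [5225, 10571, 9020, 913, 913, 913]
```

Padding on the right keeps the real tokens at the start of every vector, so all samples share the same length without reordering the words.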

### Save tokenized data and labels to disk
We did a lot of work in this example, and we don't want to recompute the corpus, token dictionary, etc., so let's save the results [in an HDF5 encoded binary file](https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

To start, we specify a path. We then write the data to disk as a `jld2` (binary) file using [the `save(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). This will save the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict). The `jld2` file is a binary file in an HDF5-based format, which is more compact than a text encoding.

In [16]:
let
    # initialize -
    path_to_save_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2"); 
    save(path_to_save_file, Dict("corpus" => corpusmodel, 
        "number_of_records" => number_of_records, 
        "tokenrecorddictionary" => token_record_dictionary, 
        "labeldictionary" => labels)); # encode, and write
end