# L13d: S4-LegS Headline Generator
We'll use the S4-LegS recurrent model in this lab to solve the next token problem. 
* __Problem__: The next token problem in natural language processing involves predicting the most probable next word or token in a sequence based on preceding context. This fundamental task enables applications like text generation, autocompletion, and machine translation, allowing models to generate coherent, contextually relevant text one token at a time.
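One common way to write this down, using notation that is not in the original lab ($\mathcal{V}$ for the vocabulary, $x_t$ for the token at position $t$), is as a greedy choice over a conditional distribution:

$$
\hat{x}_{t+1} = \arg\max_{v \in \mathcal{V}} \; P\left(x_{t+1} = v \mid x_{1}, \dots, x_{t}\right)
$$

The S4-LegS model in this lab takes a simpler route: it treats token ids as a real-valued signal and produces a real-valued output that is rounded to the nearest token id, rather than producing a probability for every word in $\mathcal{V}$.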

We've constructed [the `VLS4ModelingKit.jl` package](https://github.com/varnerlab/VLS4ModelingKit.jl) that we'll use today for our model implementation to explore the headline generation problem.

### Tasks
Before we start, execute the `Run All Cells` command to check if you (or your neighbor) have any code or setup issues. If you have code issues, raise your hand, and let's get those fixed!
* __Task 1: Setup, Data, Prerequisites (10 min)__: In this task, we'll load a public dataset of headlines curated as either sarcastic or not sarcastic. Our dataset is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection). After loading the data, we'll tokenize the data (convert text strings to numerical arrays).
* __Task 2: Build and Train a HiPPO-LegS model instance (15 min)__: In this task, we will build and train a HiPPO-S4-LegS model instance on the sample input sequence selected in Task 1. We start by creating a model instance, and then we train this instance for different hidden state sizes.
* __Task 3: Does the S4 model generalize? (25 min)__: In this task, we'll explore how the S4-LegS model performs when we give input sequences that are _similar_ but not the same as the training data. We'll take the training data, perturb some words, and feed the perturbed sequence into the model.

Let's get started!
___

## Task 1: Setup, Data and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants. 
* The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, packages are loaded. Otherwise, packages are downloaded and then loaded.

In [3]:
include("Include.jl");

### Sarcasm Data
We'll load a public dataset of headlines curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. [Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).](https://www.sciencedirect.com/science/article/pii/S2666651023000013?via%3Dihub)
2. [Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).](https://rishabhmisra.github.io/Sculpting_Data_for_ML.pdf)

The sarcasm data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`.
* `headline`: the headline of the article, unstructured text
* `article_link`: link to the original news article. Useful in collecting supplementary data

We'll load the data file that we generated in `L13b`.

In [5]:
corpusmodel, token_record_dictionary = let

    # setup path -
    path_to_saved_corpus_file = joinpath(_PATH_TO_DATA, "L13b-SarcasmSamplesTokenizer-SavedData.jld2");
    saveddata = load(path_to_saved_corpus_file);

    # get items from the saveddata -
    corpusmodel = saveddata["corpus"];
    tokenrecorddictionary = saveddata["tokenrecorddictionary"];

    # return 
    (corpusmodel, tokenrecorddictionary)
end;

What's in the data that we just loaded?

In [57]:
corpusmodel.inverse[913]

"<PAD>"

__Input sequence__. Let's select an input sequence from the sarcasm dataset. The input sequence is a tokenized form of the headline. The tokenized form is a numerical array of integers, where each integer represents a word in the headline. We'll store the input sequence in the `inputsignal::Array{Float64}` array. 
* _Why Float64?_ Our implementation of the S4-LegS model assumes that the input signal is a `Float64` array (since we are typically interested in regression tasks). We convert the tokenized form of the headline to a `Float64` array by applying the [`Float64(...)` function](https://docs.julialang.org/en/v1/base/numbers/#Base.Float64) to each token with the broadcast form (`.|>`) of the [`|>` pipe operator](https://docs.julialang.org/en/v1/manual/functions/#Function-composition-and-piping).
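To make the round trip concrete, here is a minimal sketch with a made-up five-word vocabulary. The names `toy_vocabulary`, `toy_encode`, and `toy_decode` are hypothetical and are not part of `VLS4ModelingKit.jl`; the lab's own `decode(...)` may behave differently. The sketch pads a headline to a fixed length, converts the integer tokens with `.|> Float64`, and then rounds a noisy real-valued signal back to token ids before decoding.

```julia
# Hypothetical toy vocabulary (the lab's real vocabulary lives in `corpusmodel`)
toy_vocabulary = Dict("<PAD>" => 1, "<OOV>" => 2, "clock" => 3, "doomsday" => 4, "scientists" => 5);
toy_inverse = Dict(id => word for (word, id) in toy_vocabulary);

# Encode a headline as integer tokens, padded on the right to `maxlength`
function toy_encode(headline::String, vocabulary::Dict{String,Int}; maxlength::Int = 6)
    words = split(headline) .|> String;
    tokens = [get(vocabulary, word, vocabulary["<OOV>"]) for word in words];
    padding = fill(vocabulary["<PAD>"], max(0, maxlength - length(tokens)));
    return vcat(tokens, padding);
end

# Map a real-valued signal back to words: round, take abs, then look up (unknown ids become <OOV>)
function toy_decode(signal::Vector{Float64}, inverse::Dict{Int,String})
    ids = signal .|> x -> abs(round(Int, x));
    return [get(inverse, id, "<OOV>") for id in ids];
end

tokens = toy_encode("scientists unveil doomsday clock", toy_vocabulary);
signal = tokens .|> Float64;                        # integer tokens -> Float64 input signal
noisy = signal .+ [0.2, -0.3, 0.1, 0.4, 0.0, 0.0];  # stand-in for a model's real-valued output
words = toy_decode(noisy, toy_inverse);
```

The word "unveil" is not in the toy vocabulary, so it is encoded as `<OOV>`. Decoding recovers each token only as long as the noise stays below 0.5 in magnitude.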

In [63]:
inputsignal, stop, headlineindex = let
   
    # initialize -
    headlineindex = 1; # TODO: select an inputsignal
    padcode = 913;
    record = token_record_dictionary[headlineindex]; 

    # how many time steps for this input signal?
    stop = 0;
    for token ∈ record
        if (token != padcode)
            stop += 1;
        else
            break; # stop
        end
    end

    # convert the integer tokens to Float64 -
    inputsignal = record .|> Float64; # Why?

    # return -
    (inputsignal, stop, headlineindex)
end;

In [59]:
inputsignal

151-element Vector{Float64}:
 26617.0
 23295.0
 27980.0
  8295.0
  5553.0
 18533.0
 12047.0
 15828.0
   913.0
   913.0
   913.0
   913.0
   913.0
     ⋮
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0
   913.0

What is in our input sequence?

In [65]:
let
    words = inputsignal |> s-> decode(s, corpusmodel.inverse);
    headline = "";
    [headline *= word * " " for word ∈ words];
    println("Headline: ", headline);
end

Headline: thirtysomething scientists unveil doomsday clock of hair loss <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> 


In [67]:
stop

8
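The loop above counts tokens until it reaches the first pad code. An equivalent sketch uses `findfirst` from Base Julia. The name `count_real_tokens` is a hypothetical helper, and the fallback branch handles a record with no padding at all.

```julia
# Number of tokens before the first pad code (all tokens if there is no padding)
function count_real_tokens(record::AbstractVector{<:Integer}, padcode::Integer)
    firstpad = findfirst(==(padcode), record); # index of the first pad, or nothing
    return isnothing(firstpad) ? length(record) : firstpad - 1;
end

n_tokens = count_real_tokens([26617, 23295, 27980, 913, 913], 913); # three real tokens, then padding
```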

__Constants__: Let's set up some constants we will use in the exercise. Check the comment next to the value for a description of its meaning, permissible values, etc.

In [196]:
number_of_epochs = 10; # TODO: Update this value, how many epochs?
number_of_hidden_states = 2^3; # TODO: Update this value, what is the dimension of hidden state memory
Δt = 1.0; # what is the time step size for a text example?
tspan = (start = 0.0, stop = stop, step = Δt) # why?
L = range(tspan.start, stop=tspan.stop, step = tspan.step) |> collect |> length;

In [197]:
tspan

(start = 0.0, stop = 8, step = 1.0)

In [198]:
L

9
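Why is `L` equal to `9` when `stop` is `8`? The time grid starts at `0.0` and includes both endpoints. A quick check using only Base Julia (the variable `stop_index` is introduced here just to keep the sketch self-contained):

```julia
stop_index = 8; # number of non-pad tokens in the selected headline
grid = range(0.0, stop = stop_index, step = 1.0);
points = collect(grid);      # 0.0, 1.0, ..., 8.0
n_points = length(points);   # stop_index + 1, because t = 0 is included
```

As a consequence, `inputsignal[1:L]` covers the eight words of the headline plus the first `<PAD>` token.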

## Task 2: Build and Train a HiPPO-LegS model instance
In this task, we will build and train a HiPPO-LegS model instance on the sample input sequence that we selected above. Let's start by creating a model instance.

To build the [`MySISOLegSHiPPOModel` instance](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.MySisoLegSHippoModel), which holds data about this model, we use a specialized [build function](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.build-Tuple{Type{MySisoLegSHippoModel},%20NamedTuple}). In particular, we pass the `number_of_hidden_states` variable, the time step `Δt`, the initial signal value `uₒ`, and an initial guess of the `C` matrix to the [build function](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.build-Tuple{Type{MySisoLegSHippoModel},%20NamedTuple}), and it returns a populated [`MySISOLegSHiPPOModel` instance](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.MySisoLegSHippoModel):
* The [build function](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.build-Tuple{Type{MySisoLegSHippoModel},%20NamedTuple}) populates the $\mathbf{A}$ and $\mathbf{B}$ matrices according to the [HiPPO LegS parameterization](https://arxiv.org/abs/2008.07669) and uses a bilinear discretization scheme to compute the discrete $\mathbf{\bar{A}}$ and $\mathbf{\bar{B}}$ matrices. The discrete matrix $\mathbf{\bar{C}}$ is estimated from data using the [`learn(...)` method](https://varnerlab.github.io/VLQuantitativeFinancePackage.jl/dev/equity/#VLQuantitativeFinancePackage.estimate_hippo_parameters-Tuple{MySisoLegSHippoModel,%20NamedTuple,%20Array{Float64}}), see below for further discussion of model identification.
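As a rough illustration of what such a build step involves, the sketch below constructs the HiPPO-LegS $\mathbf{A}$ and $\mathbf{B}$ matrices (in the negative sign convention used by the S4 paper), applies the bilinear transform, and rolls the discrete system forward. This is an assumption about the general technique, not the package's source: `legs_matrices`, `bilinear`, and `rollout` are hypothetical names, and `VLS4ModelingKit.jl` may differ in sign conventions, scaling, or how it treats the initial state.

```julia
using LinearAlgebra

# HiPPO-LegS matrices; n and k are the zero-based indices used in the published formula
function legs_matrices(N::Int)
    A = zeros(N, N);
    B = zeros(N);
    for i in 1:N
        n = i - 1;
        B[i] = sqrt(2n + 1);
        for j in 1:N
            k = j - 1;
            if n > k
                A[i, j] = -sqrt(2n + 1) * sqrt(2k + 1);
            elseif n == k
                A[i, j] = -(n + 1);
            end
        end
    end
    return A, B;
end

# Bilinear discretization with step size Δt
function bilinear(A::Matrix{Float64}, B::Vector{Float64}, Δt::Float64)
    M = inv(I - (Δt / 2) * A);
    Abar = M * (I + (Δt / 2) * A);
    Bbar = M * (Δt * B);
    return Abar, Bbar;
end

# Roll the discrete system forward: x <- Abar*x + Bbar*u[k], y[k] = dot(C, x)
function rollout(Abar, Bbar, C, u)
    x = zeros(length(Bbar));
    y = zeros(length(u));
    for k in eachindex(u)
        x = Abar * x + Bbar * u[k];
        y[k] = dot(C, x);
    end
    return y;
end

A, B = legs_matrices(4);
Abar, Bbar = bilinear(A, B, 1.0);
y = rollout(Abar, Bbar, ones(4), [1.0, 0.0, 0.0]);
```

Because $\mathbf{A}$ is lower triangular with diagonal entries $-1, -2, \dots, -N$, the bilinear map sends each eigenvalue $\lambda$ to $(1 + \lambda\Delta t/2)/(1 - \lambda\Delta t/2)$, which lies strictly inside the unit circle for $\Delta t > 0$, so the discrete recurrence is stable.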

In [200]:
model = VLS4ModelingKit.build(MySISOLegSHiPPOModel, (
    number_of_hidden_states = number_of_hidden_states,
    Δt = Δt,
    uₒ = inputsignal[1],
    C = randn(number_of_hidden_states) # TODO: Does this change anything? 
));

__Run the untrained model__. When we run the untrained model using the `solve(...)` method, we expect to get a random output sequence (given that our initial default value of the $\mathbf{\bar{C}}$ matrix is random).

In [202]:
let
    (T1,X1,Y1) = VLS4ModelingKit.solve(model, tspan, inputsignal)
    z = Y1 .|> x-> round(Int64,x) .|> x-> abs(x)
    decode(z, corpusmodel.inverse)
end

9-element Vector{String}:
 "thirtysomething"
 "womanowned"
 "<OOV>"
 "<OOV>"
 "video"
 "chapecoense"
 "<OOV>"
 "duchamp"
 "<OOV>"

__Training loop__. The training loop is implemented in the `learn(...)` method. 

In [204]:
trainedmodel = let

    # initialize -
    should_we_train = false; # TODO: set this to true to train; false loads a saved trained model from disk
    if (should_we_train == true)
        
        localmodel = model;
        for i in 1:number_of_epochs
            localmodel.C̄ = VLS4ModelingKit.learn(localmodel, tspan, inputsignal[1:L], 
                method = Optim.GradientDescent());
        end

        # save the model -
        path_to_saved_model_file = joinpath(_PATH_TO_DATA, "L13d-H$(headlineindex)-H$(number_of_hidden_states)-TrainedModel.jld2");
        save(path_to_saved_model_file, Dict("model" => localmodel)); # encode, and write
    else
        # load a trained model from disk -
        path_to_saved_model_file = joinpath(_PATH_TO_DATA, "L13d-H$(headlineindex)-H$(number_of_hidden_states)-TrainedModel.jld2");
        savedmodel = load(path_to_saved_model_file);
        localmodel = savedmodel["model"];
    end

    # return -
    localmodel; # this is a *trained* model 
end;
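If the model output is the standard state-space readout $y_k = \mathbf{\bar{C}}x_k$ (assumed here), the output is linear in $\mathbf{\bar{C}}$ once the hidden states are fixed, so fitting $\mathbf{\bar{C}}$ is a linear least-squares problem. The sketch below is not the package's `learn(...)` implementation. It is a hand-rolled gradient descent on the mean squared error, with a hypothetical state matrix `X` (one row per time step), hypothetical targets `y`, and a made-up step size `η`.

```julia
using LinearAlgebra

# Gradient descent on the mean squared error (1/L)*norm(X*C - y)^2 with respect to C
function fit_readout(X::Matrix{Float64}, y::Vector{Float64}; η::Float64 = 0.1, iterations::Int = 500)
    L = size(X, 1);
    C = zeros(size(X, 2)); # initial guess
    for _ in 1:iterations
        residual = X * C - y;
        gradient = (2 / L) * (X' * residual);
        C -= η * gradient;
    end
    return C;
end

X = [1.0 0.0; 0.0 2.0; 1.0 1.0]; # hypothetical hidden states, one row per time step
Ctrue = [3.0, -1.0];
y = X * Ctrue;                    # targets generated from a known readout
Cfit = fit_readout(X, y);
loss = sum((X * Cfit - y) .^ 2) / size(X, 1);
```

For a small, well-conditioned problem like this one, a direct least-squares solve such as `X \ y` gives the same readout.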

__Curious__: What was the training loss? If training went well, we expect this value to be _small_, i.e., $\ll 1$.
* __Hmmm__: Suppose we get a training loss that is _not_ small, what can we do?

In [231]:
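Y2 = let
    # get the model -
    localmodel = trainedmodel;

    # solve the model -
    (T2,X2,Y2) = VLS4ModelingKit.solve(localmodel, tspan, inputsignal);
    
    # compute the loss (mean squared error over the first L steps) -
    loss = (Y2 - inputsignal[1:L]).^2 |> x-> (1/L)*sum(x);
    println("Training loss: ", loss);

    # return the model output (otherwise the block returns `nothing`, the value of println) -
    Y2
end;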

Training loss: 803478.2888227364
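The loss is a mean squared error measured in token-id units, so its square root gives a rough per-token error. Decoding rounds each output to the nearest integer, which means an output recovers its token only when its error is below 0.5. A quick calculation with the value printed above (a rough indicator only, since the error is not spread evenly across tokens):

```julia
training_loss = 803478.2888227364; # value printed by the cell above
rmse = sqrt(training_loss);        # rough per-token error, in token-id units
recovers_tokens = rmse < 0.5;      # rounding recovers a token only when its error is under 0.5
println("RMSE: ", round(rmse, digits = 1), " token ids; small enough to echo? ", recovers_tokens);
```

An error of roughly 896 token ids per position is far from the $\ll 1$ target, so we should not expect this trained model to echo the headline.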


In [237]:
(T2,X2,Y2) = VLS4ModelingKit.solve(model, tspan, inputsignal)

([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [23050.99817253041 5951.742135711186 … 1.144497804594944e-12 -4.013119622286175e-12; 23050.998172530413 -13093.83269856463 … -32.72609451618873 15.836176397840745; … ; 10433.008039391136 -9871.70743417262 … 2144.487085340459 -19.59506462249167; 13707.4500911001 -2567.7833958608035 … 2641.849661431802 -3133.816432473654], [26617.0, 29238.652250808, 38555.09006528883, 41821.64442924043, -28348.095332063418, 5008.919601429142, 34747.1755949328, -8551.59345320389, 41448.5475413401])

In [241]:
Y2 .|> x-> round(Int, x) |> x-> abs(x)

9-element Vector{Int64}:
 26617
 29239
 38555
 41822
 28348
  5009
 34747
  8552
 41449

## Task 3: Does the S4 model generalize?
In this task, we'll explore how the S4 model performs when we give it input sequences that are _similar_ but not the same as the training data.
* _How will we do this_? We will use the `generate(...)` method to generate a sequence of tokens. The `generate(...)` method takes a `MySISOLegSHiPPOModel` instance, a `tspan::NamedTuple` (the time grid, with one step per token in the sequence), and an input sequence. 

However, before we give the trained model an _unseen_ input sequence, let's quickly check the training quality. If the training was successful, the model should _echo_, i.e., return the input sequence. Let's check this by passing the input sequence to the `generate(...)` method.

In [227]:
echo_sequence = let
    
    # compute raw output -
    (T3,X3,Y3) = VLS4ModelingKit.generate(trainedmodel, tspan, inputsignal, S=L);
    
    z = Y3 .|> x-> round(Int64,x) .|> x-> abs(x); # round to the nearest integer token id, then take the absolute value
    echo_sequence = decode(z, corpusmodel.inverse); # decode the output

    # return -
    echo_sequence
end

9-element Vector{String}:
 "thirtysomething"
 "repairing"
 "twocassette"
 "enjoyed"
 "closes"
 "mocked"
 "gown"
 "mingles"
 "archive"

Next, let's generate a test sequence of tokens that is _similar_ but not the same as the input sequence. We'll save this sequence in the `testsequence::Array{Float64}` variable.

In [210]:
testsequence = let

    # initialize -
    how_many_flips = 3; # TODO: You can change this value -
    flip_indices = rand(1:L, how_many_flips); # integer indices, sampled with replacement (an index may repeat)
    testsequence = copy(inputsignal); # make a copy of the input signal

    for i ∈ flip_indices
        testsequence[i] = testsequence[i] - 1; # subtract one from the token id
    end

    # return -
    testsequence;
end;
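Note that `rand(1:L, how_many_flips)` samples with replacement and can land on the `<PAD>` position at index `L`, so the number of changed words may be smaller than `how_many_flips`. A possible variant, sketched here with a hypothetical helper `perturb`, draws distinct positions from the real tokens only, using `randperm` from the `Random` standard library. It assumes `how_many_flips` does not exceed the number of real tokens.

```julia
using Random

# Subtract one from `how_many_flips` distinct token ids among the first `number_of_tokens` entries
function perturb(signal::Vector{Float64}, number_of_tokens::Int, how_many_flips::Int;
        rng::AbstractRNG = Random.default_rng())
    flip_indices = randperm(rng, number_of_tokens)[1:how_many_flips]; # distinct positions
    perturbed = copy(signal); # leave the original untouched
    for i in flip_indices
        perturbed[i] -= 1.0;
    end
    return perturbed;
end

signal = [26617.0, 23295.0, 27980.0, 8295.0, 913.0, 913.0]; # four tokens followed by padding
perturbed = perturb(signal, 4, 3; rng = MersenneTwister(1234));
changed = findall(perturbed .!= signal); # the positions that were flipped
```

Passing an explicit `rng` makes the perturbation reproducible, which helps when comparing runs across different hidden state sizes.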

What's in the `testsequence::Array{Float64}` array? 
Depending upon how many _flips_ we do, this sequence will be _similar_ but not the same as the input sequence.

In [212]:
let
    z = testsequence .|> x-> round(Int64,x) .|> x-> abs(x);
    tmp = decode(z, corpusmodel.inverse);
end

151-element Vector{String}:
 "thirty"
 "scientists"
 "unveil"
 "doomsday"
 "clock"
 "of"
 "haim"
 "losingpowerballnumbers"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 ⋮
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"
 "<PAD>"

Let's give the `testsequence::Array{Float64}` array to the `generate(...)` method and see what it returns.

In [214]:
whathappens = let
    

    # compute raw output -
    (T4,X4,Y4) = VLS4ModelingKit.generate(trainedmodel, tspan, testsequence, S=L);
    
    # round to the nearest integer token id, take the absolute value, then decode -
    z = Y4 .|> x-> round(Int64,x) .|> x-> abs(x);
    generated_sequence = decode(z, corpusmodel.inverse);

    # return -
    generated_sequence
end

9-element Vector{String}:
 "thirtysomething"
 "repair"
 "twocassette"
 "enjoyed"
 "closest"
 "mocked"
 "gown"
 "mingles"
 "archive"

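To put a number on the comparison, we can count position-by-position agreement between sequences. The sketch below copies three decoded sequences printed in this notebook (the first nine tokens of the original headline, the echo output, and the output for the perturbed input). The name `agreement` is a hypothetical helper.

```julia
# Number of positions where two equal-length sequences hold the same word
agreement(a::Vector{String}, b::Vector{String}) = count(a .== b);

original = ["thirtysomething", "scientists", "unveil", "doomsday", "clock", "of", "hair", "loss", "<PAD>"];
echoed = ["thirtysomething", "repairing", "twocassette", "enjoyed", "closes", "mocked", "gown", "mingles", "archive"];
perturbed_output = ["thirtysomething", "repair", "twocassette", "enjoyed", "closest", "mocked", "gown", "mingles", "archive"];

println("original vs echo: ", agreement(original, echoed), " of ", length(original));
println("echo vs perturbed output: ", agreement(echoed, perturbed_output), " of ", length(echoed));
```

For the run shown here, the echo matches the original headline only at the first token, which is consistent with the large training loss. The outputs for the original and perturbed inputs agree at seven of nine positions, and the two that differ ("repairing" vs. "repair", "closes" vs. "closest") look like nearby vocabulary entries, suggesting that small input changes produced small output changes.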
___

## Next time
In the lecture `L14a` (and associated lab), we'll introduce (arguably) the most important development in machine learning in the last 10 years: [the transformer model](https://arxiv.org/abs/1706.03762). 
* The [transformer model, introduced in the landmark 2017 paper "Attention Is All You Need"](https://arxiv.org/abs/1706.03762), is a neural network architecture that relies entirely on attention mechanisms, dispensing with recurrence and convolutions, to model relationships within sequential data. Its core innovation, the self-attention mechanism, allows each element in a sequence to attend to every other element directly, so global dependencies are captured without passing information step by step through a recurrent state.
* We'll explore transformers by showing that the transformer model is just a special case of [a Modern Hopfield network](https://arxiv.org/pdf/2008.02217)!
___