# Example: Loading and Analyzing the Sarcasm Dataset
This example will familiarize students with working with unstructured text, particularly the generation of a vocabulary model and the tokenization of text, i.e., the conversion of sentences into a mathematical representation.

### Learning tasks
* __Task 1__: Load the public sarcasm dataset. In this task, we'll load a public dataset of headlines that have been curated as either sarcastic or not sarcastic.
* __Task 2__: Tokenize the headline records. In this task, we'll use the corpus model, particularly the `tokens::Dict{String, Int64}` dictionary, to tokenize headlines in our dataset, i.e., convert a text representation into a numerical vector representation.

## Setup
We set up the computational environment by including [the `Include.jl` file](Include.jl) using [the `include(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads the external packages and functions we will use in these examples. 
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Task 1: Load the public sarcasm dataset
In this task, we'll load a public dataset of headlines that have been curated as either sarcastic or not sarcastic. The dataset we'll use is available on [Kaggle](https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection) and is also discussed in the publications:
1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

The data is encoded as a collection of `JSON` records (although it is not directly readable using a JSON parser). Each record has the following fields:
* `is_sarcastic`: has a value of `1` if the record is sarcastic; otherwise, `0`.
* `headline`: the headline of the article (unstructured text).
* `article_link`: link to the original news article, which is useful for collecting supplementary data.

We've developed a parser to read the sarcasm data file. The [`corpus(...)` method](src/Files.jl) takes the `path::String` argument (the path to the datafile) and returns a [`MySarcasmRecordCorpusModel` instance](src/Types.jl) which holds the data. 

In [6]:
corpusmodel = joinpath(_PATH_TO_DATA, "Sarcasm_Headlines_Dataset_v2.txt") |> corpus;

The [`MySarcasmRecordCorpusModel` instance](src/Types.jl) has the following fields, which are populated when we read the file:
* The `records::Dict{Int, MySarcasmRecordModel}` field holds the original records data as a dictionary, where the keys of the dictionary correspond to the headline index, and the values are [instances of the `MySarcasmRecordModel` type](src/Types.jl).
* The `tokens::Dict{String, Int64}` field holds the vocabulary computed over the dataset as a dictionary, where the dictionary's keys are the words (called tokens) and the values are the indexes of the words. We assemble the `tokens` dictionary in alphabetical order. This field is initially undefined.
* The `inverse::Dict{Int64, String}` field is the inverse of the `tokens` dictionary, where the keys are the token indexes and the values are the tokens (words).
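
To make the relationship between the `tokens` and `inverse` dictionaries concrete, here is a minimal sketch that builds both from two made-up headlines. The `toy_*` names and the whitespace splitting are assumptions for illustration; this is not the code used by [the `corpus(...)` method](src/Files.jl).

```julia
# Hypothetical toy headlines (not taken from the sarcasm dataset)
toy_headlines = ["cat sits on mat", "dog sits on cat"];

# gather every word, then keep the unique words in alphabetical order
toy_words = String[];
for h ∈ toy_headlines
    append!(toy_words, split(h)); # split on whitespace
end
toy_vocabulary = toy_words |> unique |> sort;

# toy_tokens maps word -> index, toy_inverse maps index -> word
toy_tokens = Dict{String, Int64}();
toy_inverse = Dict{Int64, String}();
for (i, w) ∈ enumerate(toy_vocabulary)
    toy_tokens[w] = i;
    toy_inverse[i] = w;
end
toy_tokens["cat"], toy_inverse[5] # returns (1, "sits")
```

Because the vocabulary is sorted before indexing, looking up a word in `toy_tokens` and then looking up that index in `toy_inverse` returns the original word.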

### TODO: Fix the punctuation issue
* __Hmmm__: In this implementation, we allow punctuation characters in the tokens. We need to update the code to fix this issue. We have `29663` tokens in the lecture example, but we have `38246` tokens here. Fix this!
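
One possible starting point for this fix, sketched under the assumption that punctuation should simply be dropped before the vocabulary is built. The `strip_punctuation` helper is hypothetical (it is not part of the course code) and relies on the built-in `ispunct(...)` function, which tests for Unicode punctuation characters.

```julia
# Hypothetical helper (not in src/): remove Unicode punctuation characters from a headline
strip_punctuation(headline::String)::String = filter(c -> !ispunct(c), headline);

strip_punctuation("mother comes pretty close to using word 'streaming' correctly")
# returns "mother comes pretty close to using word streaming correctly"
```

Note the trade-off: hyphens count as punctuation, so a word such as `high-tech` becomes `hightech` under this approach. Whether this recovers the token count from the lecture example would need to be checked against the actual parser.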

In [46]:
corpusmodel.tokens;

Each [`MySarcasmRecordModel` instance](src/Types.jl) has three fields that mirror the original data records: an `issarcastic::Bool` field holding the label for this record, a `headline::String` field holding the headline, and an `article::String` field holding a link to the original article.

In [10]:
corpusmodel.records[5].headline

"mother comes pretty close to using word 'streaming' correctly"

## Task 2: Tokenize the headline records
In this task, we'll use the corpus model, particularly the `tokens::Dict{String, Int64}` dictionary, to tokenize headlines in our dataset, i.e., convert a text representation into a numerical vector representation. 

To better understand how this works, let's first examine a single (random) record and tokenize it. We'll select a random record from the `number_of_records::Int64` possible records [using the built-in `rand(...)` method](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand), and store it in the `random_test_record::MySarcasmRecordModel` variable:

In [12]:
number_of_records = corpusmodel.records |> length; # pipe the records dictionary into length(...) to count the records
random_test_record = rand(1:number_of_records) |> i -> corpusmodel.records[i]

MySarcasmRecordModel(true, "convention crowd really hoping bill clinton breaks tension with joke about how terrible he looks", "https://politics.theonion.com/convention-crowd-really-hoping-bill-clinton-breaks-tens-1819579065")

In [13]:
random_test_record.headline

"convention crowd really hoping bill clinton breaks tension with joke about how terrible he looks"

Next, let's call [the `tokenize(...)` method](src/Compute.jl), which takes the `headline::String` that we want to tokenize and our vocabulary stored in the `tokens::Dict{String, Int64}` dictionary, and returns a token vector:

In [15]:
tv = tokenize(random_test_record.headline, corpusmodel.tokens)

15-element Vector{Int64}:
 10083
 10694
 28518
 18099
  6432
  9187
  7221
 34267
 37613
 19932
  3616
 18254
 34292
 17445
 21618

### Hmmm. What happens if a token is not in the dataset?
We have created the vocabulary in the `tokens::Dict{String, Int64}` dictionary by analyzing the entire dataset, but suppose we have new samples that aren't in the dataset; what happens then? We've added the `<OOV>` token to our dataset; let's see if that works. 
* Let's take the headline from the `random_test_record::MySarcasmRecordModel` instance and add something to the end, e.g., `#ilovemyroomba`. We should get the `<OOV>` token at the end of the token vector.

In [17]:
words = corpusmodel.tokens |> keys |> collect; # get the keys (words) of the tokens dictionary and collect them into an array
"#ilovemyroomba" ∈ words # check whether the new word is in the vocabulary array

false

Create a new headline by appending `#ilovemyroomba` to the old headline. String concatenation in Julia uses [the `*` operator](https://docs.julialang.org/en/v1/manual/strings/):

In [19]:
new_test_headline = random_test_record.headline * " " * "#ilovemyroomba"

"convention crowd really hoping bill clinton breaks tension with joke about how terrible he looks #ilovemyroomba"

Tokenize the `new_test_headline::String`, and let's see what happens:

In [21]:
tv = tokenize(new_test_headline, corpusmodel.tokens)

LoadError: KeyError: key "<OOV>" not found

### TODO: Fix the `<OOV>` token issue 
* __Hmmm__: That didn't work as expected! We should have added the `<OOV>` token at the end of the token sequence, but for some reason we can't find the `<OOV>` token. Fix this!
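
The `KeyError` suggests that the `tokens` dictionary has no `<OOV>` entry. One way the fallback could work, assuming the vocabulary is given an `<OOV>` entry when it is built, is sketched below. The `sketch_tokenize` function and the toy vocabulary (with its indexes) are hypothetical and stand in for [the `tokenize(...)` method](src/Compute.jl), whose implementation is not shown here.

```julia
# Hypothetical sketch, not the tokenize(...) method in src/Compute.jl
function sketch_tokenize(headline::String, tokens::Dict{String, Int64};
    oov::String = "<OOV>")::Vector{Int64}

    oov_index = tokens[oov]; # throws a KeyError if the vocabulary lacks the <OOV> entry
    # look up each word; unknown words fall back to the <OOV> index
    return [get(tokens, String(word), oov_index) for word ∈ split(headline)];
end

# toy vocabulary that includes the <OOV> entry (assumed indexes)
toy_vocabulary_with_oov = Dict{String, Int64}("<OOV>" => 1, "bill" => 2, "clinton" => 3);
sketch_tokenize("bill clinton #ilovemyroomba", toy_vocabulary_with_oov) # returns [2, 3, 1]
```

Using `get(...)` with a default value avoids a `KeyError` for unseen words, while the explicit `tokens[oov]` lookup still fails loudly if the vocabulary itself is missing the `<OOV>` entry.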

### Compute the maximum pad length
Not every headline has the same length, but we want the token vectors to have the same size. Thus, we'll find the longest token vector in the dataset and pad all token vectors to that length. To do that, let's iterate through each headline, compute the number of tokens it contains, and save this length if it is longer than any we've seen before.

In [24]:
max_pad_length = 0; # initialize: we have 0 length
for i ∈ 1:number_of_records
    test_record_length = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens) |> length; # tokenize, and calc the number of tokens
    if (test_record_length > max_pad_length)
        max_pad_length = test_record_length; # we've found a new longest headline!
    end
end
max_pad_length

151
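
The same search can be written more compactly with the built-in `maximum(f, itr)` method. The sketch below uses made-up headlines, and whitespace splitting stands in for `tokenize(...)`; both are assumptions for illustration.

```julia
# Hypothetical sample headlines; split(...) stands in for tokenize(...)
sample_headlines = ["cat sits on mat", "dog barks", "a very long headline about nothing at all"];
sample_max_pad_length = maximum(h -> length(split(h)), sample_headlines) # returns 8
```

The explicit loop above makes each step visible, while `maximum(f, itr)` expresses the same computation in a single line.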

### Compute the vector representation of all headline samples
Finally, now that we have found the `max_pad_length::Int64`, we can tokenize all records using the `max_pad_length::Int64` value as the `pad` value in [the `tokenize(...)` method](src/Compute.jl). 
* We'll use `right-padding` and store the tokenized records for each headline in the `token_record_dictionary::Dict{Int64, Array{Int64,1}}` dictionary, where the keys are the record indexes and the values are the tokenized records (which are of type `Array{Int64,1}`).

In [26]:
token_record_dictionary = Dict{Int64, Array{Int64,1}}();
for i ∈ 1:number_of_records
    
    v = tokenize(corpusmodel.records[i].headline, corpusmodel.tokens, 
            pad = max_pad_length); 
    token_record_dictionary[i] = v;
end
token_record_dictionary[1] # tokenized record 1

151-element Vector{Int64}:
 34518
 30554
 36111
 12532
  9208
 24906
 17031
 21668
     0
     0
     0
     0
     0
     ⋮
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
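
The right-padding visible in this output (token indexes first, then zeros) can be sketched as follows. The `sketch_right_pad` helper is hypothetical, and the choice of `0` as the pad value is an assumption based on the output shown above, not on the `tokenize(...)` source.

```julia
# Hypothetical helper: right-pad a token vector with zeros up to length pad
function sketch_right_pad(tv::Vector{Int64}, pad::Int64)::Vector{Int64}
    padded = zeros(Int64, max(pad, length(tv))); # this sketch never truncates
    padded[1:length(tv)] .= tv; # copy the tokens into the leading positions
    return padded;
end

sketch_right_pad([34518, 30554, 36111], 6) # returns [34518, 30554, 36111, 0, 0, 0]
```

Padding on the right keeps the tokens in their original positions, so the first entries of every padded vector are the headline's tokens.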

## Final: Save data to disk
We did a bunch of stuff in this example, and we don't want to have to recompute the corpus, token dictionary, etc. So let's save it [in an HDF5 encoded binary file](https://en.wikipedia.org/wiki/Hierarchical_Data_Format). To start, specify a path:

In [28]:
path_to_save_file = joinpath(_PATH_TO_DATA, "L4a-SarcasmSamplesTokenizer-SavedData.jld2"); # JLD2 package encodes data

We'll write the data to disk as a `jld2` (binary) file using [the `save(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). This will save the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict). The saved file uses [an HDF5-compatible file format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format), which is a compact binary encoding, which is excellent! 

In [30]:
save(path_to_save_file, Dict("corpus" => corpusmodel, "number_of_records" => number_of_records)); # encode, and write