# Example: Vector Hashing of Sarcasm Samples
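In this example, we compute feature hash vectors for the sarcastic and non-sarcastic headline samples using the token dictionary that we developed in the previous example.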

## Setup
We set up the computational environment by including [the `Include.jl` file](Include.jl) using [the `include(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads the external packages and functions that we use in these examples.
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Prerequisites
Before we start, we need the data saved in the previous example. First, we build the path to the saved file:

In [5]:
path_to_save_file = joinpath(_PATH_TO_DATA, "L4a-SarcasmSamplesTokenizer-SavedData.jld2");

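Next, we load the saved data and pull out the corpus model, the token dictionary, the inverse token dictionary, and the number of records: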

In [7]:
saved_data_dictionary = load(path_to_save_file);

# pull data from the saved_data_dictionary -
corpusmodel = saved_data_dictionary["corpus"];

tokendictionary = corpusmodel.tokens;
inversetokendictionary = corpusmodel.inverse;
number_of_records = saved_data_dictionary["number_of_records"];

# compute some values needed later -
number_of_tokens = tokendictionary |> length; # size of the token dictionary

In [8]:
corpusmodel.records

Dict{Int64, MySarcasmRecordModel} with 28619 entries:
  19700 => MySarcasmRecordModel(false, "in memoriam robin thickes career", "htt…
  21664 => MySarcasmRecordModel(false, "security video pokes holes in robbers t…
  11950 => MySarcasmRecordModel(true, "author dismayed by amazon customers othe…
  1703  => MySarcasmRecordModel(true, "dollhousing crisis set to worsen mean ol…
  12427 => MySarcasmRecordModel(false, "muslims respond to hateful protests wit…
  7685  => MySarcasmRecordModel(false, "this teens trying to make the road safe…
  18374 => MySarcasmRecordModel(true, "fear notshe means you no harm says eliza…
  3406  => MySarcasmRecordModel(false, "great lakes amazing connections the pow…
  23970 => MySarcasmRecordModel(true, "clinton credits nevada victory to inesca…
  27640 => MySarcasmRecordModel(true, "merger of advertising giants brings toge…
  28576 => MySarcasmRecordModel(true, "bartender refuses to acknowledge patrons…
  1090  => MySarcasmRecordModel(false, "clinton announc

## Task 1: Compute the hash vectors for sarcastic samples
In this task, we'll compute [the feature hash vector representation](https://en.wikipedia.org/wiki/Feature_hashing) of the text headline for each sarcastic sample using the token dictionary that we developed in the previous example.
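
To make the idea concrete, here is a minimal sketch of a count-based feature vector built from a token dictionary. The `toy_hashing` function, the `toy_tokens` dictionary, and the choice to skip unknown words are assumptions made for illustration; they are not the `hashing(...)` function loaded by `Include.jl`, whose internals are not shown in this example.

```julia
# Hypothetical stand-in for `hashing(...)`: count how often each known token occurs.
function toy_hashing(fields::Vector{String}, tokens::Dict{String,Int64}, n::Int64)
    fv = zeros(Int64, n)                 # one slot per token index
    for word in fields
        if haskey(tokens, word)          # assumption: words not in the dictionary are skipped
            fv[tokens[word]] += 1        # increment the count at the token's index
        end
    end
    return fv
end

# toy token dictionary (word => index); not the dictionary from the saved data
toy_tokens = Dict("clock" => 1, "doomsday" => 2, "hair" => 3,
    "loss" => 4, "of" => 5, "scientists" => 6)

toy_fields = split("doomsday clock of hair loss", ' ') .|> String
toy_fv = toy_hashing(toy_fields, toy_tokens, length(toy_tokens))
```

Here `toy_fv` has a `1` at the index of each word in the toy headline and a `0` at the index of `scientists`, which does not appear in it.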



In [10]:
sarcasim_hashed_dictionary = let
    sarcasim_hashed_dictionary = Dict{Int64, Array{Int64,1}}();
    for i ∈ 1:number_of_records
        record = corpusmodel.records[i];
        is_sarcastic_flag = record.issarcastic
        if (is_sarcastic_flag == true)        
            fields = split(record.headline, ' ') .|> String        
            sarcasim_hashed_dictionary[i] = hashing(fields, hash = tokendictionary, 
                size = (number_of_tokens+1));
        end
    end
    sarcasim_hashed_dictionary
end

Dict{Int64, Vector{Int64}} with 13634 entries:
  11950 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  1703  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  18374 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  23970 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  27640 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  28576 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  2015  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  11280 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  28165 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  3220  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  422   => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  15370 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  15859 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0,

In this representation, most of the entries of each hashed feature vector are zero. The position of a non-zero entry is the index of a word in the token dictionary, and the value at that position is the number of times that word occurs in the headline.

For example, the headline at index `1`:

In [12]:
corpusmodel.records[1].headline

"thirtysomething scientists unveil doomsday clock of hair loss"

has the following non-zero entries in its hashed feature vector, which we find using [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}).
* The first argument to [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}) is [an anonymous function](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions) that evaluates to a Boolean for each value passed in. Thus, the call to [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}) returns the indexes of the values that satisfy the condition `x != 0`.

The non-zero indexes correspond to the indexes of the words in the `tokendictionary::Dict{String,Int64}`:

In [14]:
i = findall(x -> x != 0, sarcasim_hashed_dictionary[1])

8-element Vector{Int64}:
  5552
  8294
 12046
 15827
 18532
 23294
 26616
 27979

where each word occurs only once in this particular headline:

In [16]:
sarcasim_hashed_dictionary[1][i]

8-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
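
The `findall(...)` pattern used above can be tried on a small vector. This is a sketch on made-up data; `fv`, `idx`, and `counts` are illustrative names, not values from the corpus.

```julia
# toy feature vector: counts at positions 2, 5 and 6, zeros elsewhere
fv = [0, 1, 0, 0, 2, 1, 0]

idx = findall(x -> x != 0, fv)   # positions of the non-zero entries
counts = fv[idx]                 # counts stored at those positions

# the predicate can also be written with Base functions instead of a lambda
idx_alt = findall(!iszero, fv)
```

Indexing the vector with `idx` returns only the stored counts, which is how the notebook reads off the word counts for a headline.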

### Check: What if we have a repeated word? 
Let's quickly check an example where words, i.e., tokens, are repeated in the headline.  For example, let's consider the headline at index `28616`.

In [18]:
corpusmodel.records[28616].headline

"internal affairs investigator disappointed conspiracy doesnt go all the way to the top"

The `the` token is repeated in the sentence. Thus, we should have a value of `2` in the hashed feature vector at the position corresponding to the token `the`.
* Let's check this out; first, find the indexes of the non-zero elements of the feature vector, then look at the values in the feature vector at those indexes, and finally, show that the index holding a value of `2` corresponds to the `the` token.

In [20]:
let
    story_index_to_check = 28616; # What story do we want to look at?
    i = findall(x -> x != 0, sarcasim_hashed_dictionary[story_index_to_check]); # indexes of the non-zero fv values
    j = sarcasim_hashed_dictionary[story_index_to_check][i]; # non-zero elements of the feature vector

    # Find the position of the value `2` among the non-zero elements
    k = findfirst(x -> x == 2, j) # Note: k is a position in j (and in i), not in the feature vector

    # check: if not equal to `the`, we'll get an AssertionError
    @assert tokendictionary["the"] == i[k]
end
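
As an independent cross-check, we can count the words of this headline directly, without the feature vector. This is a minimal sketch in base Julia; `word_counts` and `repeated` are illustrative names introduced here.

```julia
headline = "internal affairs investigator disappointed conspiracy doesnt go all the way to the top"

# count occurrences of each word with a plain dictionary
word_counts = Dict{String,Int64}()
for word in split(headline, ' ')
    key = String(word)
    word_counts[key] = get(word_counts, key, 0) + 1
end

# keep only the words that occur more than once
repeated = filter(p -> p.second > 1, word_counts)
```

Only `the` survives the filter, with a count of `2`, which agrees with the value found in the feature vector.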

### Check: Can we recreate the sentence from the hashed feature vector?
Let's quickly check to see if we can recreate the original headline using the feature vector representation of the text. For example, can we reassemble the headline at index `1` in the `sarcasim_hashed_dictionary`?

In [43]:
let
    h = corpusmodel.records[1].headline; # headline string from the corpusmodel
    fv = sarcasim_hashed_dictionary[1]; # get the feature vector from the sarcasim_hashed_dictionary, save this in fv
    iv = findall(x -> x != 0, fv) # indexes of the non-zero fv values
    
    tmp = Array{String,1}();
    for i ∈ iv
        push!(tmp, inversetokendictionary[i])
        push!(tmp," ");
    end

    # recreate the headline string and strip off the trailing space (same as tmp |> join |> strip)
    test_headline = join(tmp) |> strip
    @assert h == test_headline # are they the same?
end

LoadError: AssertionError: h == test_headline

__Hmmm__: That didn't go as expected. Why?
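
One likely explanation is that the feature vector stores word counts but not word order, and `findall(...)` returns indexes in ascending order. The sketch below illustrates this on a toy example; the `toy_inverse` dictionary and its indexes are hypothetical and are not taken from the saved data.

```julia
# toy inverse dictionary (index => word); the indexes are hypothetical
toy_inverse = Dict(1 => "clock", 2 => "doomsday", 3 => "hair", 4 => "loss", 5 => "of")

h = "doomsday clock of hair loss"   # toy headline
fv = [1, 1, 1, 1, 1]                # toy feature vector for h

iv = findall(x -> x != 0, fv)       # ascending index order, not headline order
rebuilt = join([toy_inverse[i] for i in iv], " ")

same_string = (h == rebuilt)                                     # compares the exact strings
same_words = (sort(split(h, ' ')) == sort(split(rebuilt, ' ')))  # compares the words, ignoring order
```

In this toy case `same_string` is `false` while `same_words` is `true`: the rebuilt string contains the right words in dictionary-index order. A headline with a repeated word would lose the repeat as well, because each non-zero index is visited only once.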

## Task 2: Compute the hash vectors for non-sarcastic samples
In this task, we'll repeat the computation from Task 1, except that we'll construct the feature vectors for the non-sarcastic headline samples. We'll save these in the `unsarcasim_hashed_dictionary::Dict{Int64, Array{Int64,1}}` variable.

In [24]:
unsarcasim_hashed_dictionary = let         
    unsarcasim_hashed_dictionary = Dict{Int64, Array{Int64,1}}();
    for i ∈ 1:number_of_records
        record = corpusmodel.records[i];
        is_sarcastic_flag = record.issarcastic
        if (is_sarcastic_flag == false)
            
            # split -
            fields = split(record.headline, ' ') .|> String        
            unsarcasim_hashed_dictionary[i] = hashing(fields, 
                hash = tokendictionary, size = (number_of_tokens+1));
        end
    end
    unsarcasim_hashed_dictionary
end

Dict{Int64, Vector{Int64}} with 14985 entries:
  12427 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  7685  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  3406  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  1090  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  18139 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  17088 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  16805 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  11251 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  25327 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  8060  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  14167 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  8660  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  18475 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0,

## Task 3: Can we compute a value for the similarity of the feature vectors?