# Example: Feature Hashing of Sarcasm Samples
Another strategy to represent text in mathematical form is to [use a feature-hashing approach](https://en.wikipedia.org/wiki/Feature_hashing) that takes the text and projects it into high-dimensional vectors (called feature vectors) that live in the space of the token dictionary. Thus, each text blurb, e.g., a headline, can be represented as a vector in this space. This approach avoids the padding challenges we saw previously but has some issues of its own.

### Learning objectives
This example will familiarize students with using [the Weinberger feature hashing algorithm](https://en.wikipedia.org/wiki/Feature_hashing) to compute high-dimensional vectors representing unstructured text. This approach differs from the tokenizer vectors we created in the previous example.
* __Prerequisites__: To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). We can then pull out some of the data we computed last time and reuse it here.
* __Task 1__: Compute the feature hash vectors for the sarcastic samples. In this task, we'll compute [the feature hash vector representation](https://en.wikipedia.org/wiki/Feature_hashing) of the text headline for each sarcastic sample.
* __Task 2__: Compute the feature hash vectors for the unsarcastic samples. In this task, we'll do a similar computation as task 1, except we'll construct the feature vectors for the non-sarcastic headline samples.
* __Task 3__: Can we compute a value for the similarity of the feature vectors? In this task, we calculate the similarity of feature vectors using [a kernel function](https://en.wikipedia.org/wiki/Kernel_method).

## Setup
We set up the computational environment by including [the `Include.jl` file](Include.jl) using [the `include(...)` method](https://docs.julialang.org/en/v1/base/base/#Base.include). The [`Include.jl` file](Include.jl) loads external packages and functions we will use in these examples. 
* For additional information on functions and types used in this example, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [3]:
include("Include.jl");

## Prerequisites
To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). To load the `jld2` (binary) saved file, we pass the path to the file we want to load to the [`load(...)` function](https://github.com/JuliaIO/FileIO.jl). This call returns the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict). 
* Let's set the path to the save file in the `path_to_save_file::String` variable.

In [5]:
path_to_save_file = joinpath(_PATH_TO_DATA, "L4a-SarcasmSamplesTokenizer-SavedData.jld2");

Then we load the `jld2` file using [the `load(...)` method](https://juliaio.github.io/FileIO.jl/stable/reference/#FileIO.load), where the contents of the file are stored in the `saved_data_dictionary::Dict{String, Any}` variable. 
* We saved the `corpusmodel::MySarcasmRecordCorpusModel` instance, which holds the other data of interest, e.g., the `tokendictionary`. Thus, we can get most of what we need from the `corpusmodel`.

In [7]:
saved_data_dictionary = load(path_to_save_file);

# pull data from the saved_data_dictionary -
corpusmodel = saved_data_dictionary["corpus"];

tokendictionary = corpusmodel.tokens;
inversetokendictionary = corpusmodel.inverse;
number_of_records = saved_data_dictionary["number_of_records"];

# compute some values needed later -
number_of_tokens = tokendictionary |> length; # size of the token dictionary

In [8]:
saved_data_dictionary

Dict{String, Any} with 2 entries:
  "corpus"            => MySarcasmRecordCorpusModel(Dict{Int64, MySarcasmRecord…
  "number_of_records" => 28619

In [9]:
corpusmodel.records[1]

MySarcasmRecordModel(true, "thirtysomething scientists unveil doomsday clock of hair loss", "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205")

## Task 1: Compute the feature hash vectors for sarcastic samples
In this task, we'll compute [the feature hash vector representation](https://en.wikipedia.org/wiki/Feature_hashing) of the text headline for each sarcastic sample using the token dictionary we developed in the previous example. We'll save these feature vectors in the `sarcasim_hashed_dictionary::Dict{Int64, Vector{Int64}}` variable, where the keys are the sample indexes, and the values are the feature vectors.

In [11]:
sarcasim_hashed_dictionary = let

    length_of_feature_vector = length(tokendictionary) + 1;
    #length_of_feature_vector = 64;
    
    sarcasim_hashed_dictionary = Dict{Int64, Array{Int64,1}}();
    for i ∈ 1:number_of_records
        record = corpusmodel.records[i];
        is_sarcastic_flag = record.issarcastic
        if (is_sarcastic_flag == true)        
            fields = split(record.headline, ' ') .|> String        
            sarcasim_hashed_dictionary[i] = hashing(fields, hash = tokendictionary, 
                size = length_of_feature_vector);
        end
    end
    sarcasim_hashed_dictionary
end

Dict{Int64, Vector{Int64}} with 13634 entries:
  11950 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
  1703  => [0, 0, 0, 0, 2, 0, 0, 0, 0, 0  …  0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  18374 => [0, 0, 0, 1, 1, 0, 1, 0, 0, 1  …  0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
  23970 => [0, 0, 0, 0, 0, 1, 0, 0, 0, 0  …  0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
  27640 => [0, 0, 0, 1, 0, 0, 0, 1, 2, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
  28576 => [0, 0, 0, 0, 1, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
  2015  => [0, 0, 0, 0, 0, 1, 0, 0, 0, 0  …  0, 0, 0, 1, 0, 0, 0, 0, 2, 0]
  11280 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
  28165 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  3220  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
  422   => [1, 0, 0, 0, 0, 1, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  15370 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  15859 => [0, 0, 0, 0, 0, 0, 0, 1, 0, 0  …  0, 1, 1,

In this representation, most entries of each feature vector are zero. Each non-zero entry sits at a position that corresponds to the index of a word in the token dictionary, and the value at that position is the number of times that word occurs in the sample. 

For example, the headline at index `1`:

In [13]:
corpusmodel.records[1].headline

"thirtysomething scientists unveil doomsday clock of hair loss"

has the following non-zero entries in the hashed dictionary, which we calculate using [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}). 
* The first argument to [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}) is an example of [an anonymous function](https://docs.julialang.org/en/v1/manual/functions/#man-anonymous-functions). It evaluates to a boolean condition for each value passed in. Thus, the call to [the `findall(...)` method](https://docs.julialang.org/en/v1/base/arrays/#Base.findall-Tuple{Any}) returns the indexes of the values that meet the condition `x != 0`.

The non-zero indexes correspond to the indexes of the words in the `tokendictionary::Dict{String,Int64}`:

In [15]:
i = findall(x -> x != 0, sarcasim_hashed_dictionary[1])

8-element Vector{Int64}:
 13
 16
 21
 38
 40
 50
 58
 64

where each word occurs only once in this particular headline:

In [17]:
sarcasim_hashed_dictionary[1][i]

8-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1

In [18]:
inversetokendictionary[12048 - 1] 

"hair"

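The mapping from tokens to feature-vector positions used in this task can be illustrated with a small, self-contained sketch. The `hashing(...)` helper loaded by `Include.jl` is not shown in this example, so `toy_hashing`, `toy_dictionary`, and `toy_inverse` below are hypothetical stand-ins. The sketch assumes a 0-based token dictionary, which is why one is added when writing into the (1-based) Julia array and subtracted when reading back.

```julia
# Hypothetical stand-in for the `hashing(...)` helper (not the actual implementation)
function toy_hashing(tokens::Vector{String}, dictionary::Dict{String,Int}; size::Int)
    fv = zeros(Int, size)
    for token in tokens
        haskey(dictionary, token) || continue  # skip tokens missing from the dictionary
        fv[dictionary[token] + 1] += 1         # 0-based token id -> 1-based array index
    end
    return fv
end

# Toy 0-based token dictionary (assumed layout) and its inverse
toy_dictionary = Dict("all" => 0, "the" => 1, "to" => 2, "top" => 3, "way" => 4)
toy_inverse = Dict(v => k for (k, v) in toy_dictionary)

tokens = split("all the way to the top", ' ') .|> String
fv = toy_hashing(tokens, toy_dictionary; size = length(toy_dictionary) + 1)

nonzero = findall(x -> x != 0, fv)             # positions with at least one occurrence
words = [toy_inverse[i - 1] for i in nonzero]  # shift back to the 0-based token ids
```

In this toy case, the repeated token `the` produces a count of `2` at position `2`, and the last position stays at zero because no token maps to it.
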
### Check: What if we have a repeated word? 
Let's quickly check an example where words, i.e., tokens, are repeated in the headline.  For example, let's consider the headline at index `28616`.

In [20]:
corpusmodel.records[28616].headline

"internal affairs investigator disappointed conspiracy doesnt go all the way to the top"

The `the` token is repeated in the sentence. Thus, we should have a value of `2` in the hashed feature vector at the position corresponding to the token `the`. 
* Let's check this out; first, find the indexes of the non-zero elements of the feature vector, then look at the values in the feature vector at those indexes, and finally, show that the index holding a value of `2` corresponds to the `the` token.

In [75]:
let
    story_index_to_check = 28616; # What story do we want to look at?
    i = findall(x -> x != 0, sarcasim_hashed_dictionary[story_index_to_check]); # find indexes non-zero fv values

    j = sarcasim_hashed_dictionary[story_index_to_check][i]; # elements of the feature vector
    
    # Find the index corresponding to `2` in the feature vector
    k = findfirst(x -> x != 1, j) # Note: this is an index into the i-vector, not into the feature vector

    # check: if not equal to `the`, we'll get an AssertionError
    @assert tokendictionary["the"] == (i[k] - 1) # why -1?
end

j = [1, 1, 1, 1, 1, 3, 3, 1, 1]


LoadError: AssertionError: tokendictionary["the"] == i[k] - 1

### Check: Can we recreate the headline from the hashed feature vector?
Let's quickly check to see if we can recreate the original headline using the feature vector representation of the text. For example, can we reassemble the headline at index `1` in the `sarcasim_hashed_dictionary`?

In [71]:
let
    h = corpusmodel.records[1].headline; # headline string from the corpusmodel
    fv = sarcasim_hashed_dictionary[1]; # get the feature vector from the sarcasim_hashed_dictionary, save this in fv
    iv = findall(x -> x != 0, fv) # find indexes non-zero fv values
    
    tmp = Array{String,1}();
    for i ∈ iv
        push!(tmp, inversetokendictionary[i-1]); # inverse is 0-based
        push!(tmp," ");
    end

    test_headline = join(tmp) |> strip # same as tmp |> join |> strip. recreate the headline string, strip off the trailing space
    # @assert h == test_headline # are the same?
    test_headline, h
end

("#brownribboncampaign #digitalhealth #feelthebern #napaquake #nevertrump #starwarschristmascarols #trickortreatin100years #xmasgiftsfromtrump", "thirtysomething scientists unveil doomsday clock of hair loss")

__Hmmm__: That didn't go as expected. What happened?
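
One property worth keeping in mind when thinking about this question: a count-based feature vector records which tokens occur and how often, but not their order. The sketch below uses a hypothetical three-word dictionary and a hypothetical `count_vector` helper (neither is part of the example's codebase) and again assumes 0-based token ids. It may not explain everything in the output above, which also contains tokens that never appear in the headline.

```julia
# Assumed toy setup: a 0-based token dictionary and a simple count-vector builder
toy_dictionary = Dict("bites" => 0, "dog" => 1, "man" => 2)
toy_inverse = Dict(v => k for (k, v) in toy_dictionary)

function count_vector(headline::String, dictionary::Dict{String,Int})
    fv = zeros(Int, length(dictionary))
    for token in split(headline, ' ')
        fv[dictionary[String(token)] + 1] += 1  # 0-based id -> 1-based index
    end
    return fv
end

a = count_vector("dog bites man", toy_dictionary)
b = count_vector("man bites dog", toy_dictionary)
same_vector = (a == b)  # both headlines map to the same feature vector

# Reading the non-zero positions back gives dictionary order, not headline order
rebuilt = join([toy_inverse[i - 1] for i in findall(x -> x != 0, a)], " ")
```

Here `rebuilt` comes back in dictionary order, so even with a correct index mapping, the comparison against the original headline string would only succeed if the headline happened to be in that order.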

## Task 2: Compute the feature hash vectors for the unsarcastic samples
In this task, we'll do a similar computation as task 1, except we'll construct the feature vectors for the non-sarcastic headline samples. We'll save these in the `unsarcasim_hashed_dictionary::Dict{Int64, Array{Int64,1}}` variable.

In [27]:
unsarcasim_hashed_dictionary = let         
    unsarcasim_hashed_dictionary = Dict{Int64, Array{Int64,1}}();
    for i ∈ 1:number_of_records
        record = corpusmodel.records[i];
        is_sarcastic_flag = record.issarcastic
        if (is_sarcastic_flag == false)
            
            # split -
            fields = split(record.headline, ' ') .|> String        
            unsarcasim_hashed_dictionary[i] = hashing(fields, 
                hash = tokendictionary, size = (number_of_tokens+1));
        end
    end
    unsarcasim_hashed_dictionary
end

Dict{Int64, Vector{Int64}} with 14985 entries:
  12427 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  7685  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  3406  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  1090  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  18139 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  17088 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  16805 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  11251 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  25327 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  8060  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  14167 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  8660  => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  18475 => [0, 0, 0, 0, 0, 0, 0, 0, 0, 0  …  0, 0, 0,

## Task 3: Can we compute a value for the similarity of the feature vectors?
In this task, we calculate the similarity of feature vectors using [a kernel function](https://en.wikipedia.org/wiki/Kernel_method). A [kernel function](https://en.wikipedia.org/wiki/Kernel_method) is a measure of the similarity between two feature vectors $\mathbf{x}\in\mathcal{X}$ and $\mathbf{x}^{\prime}\in\mathcal{X}$ such that $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$. In this task, we'll use the squared exponential kernel function defined as:
$$
\begin{equation}
k\left(\mathbf{x},\mathbf{x}^{\prime}\right) = \exp\left(-\gamma\cdot{d\left(\mathbf{x},\mathbf{x}^{\prime}\right)^{2}}\right)
\end{equation}
$$
where $d\left(\mathbf{x},\mathbf{x}^{\prime}\right) = ||\mathbf{x} - \mathbf{x}^{\prime}||_{2}$, i.e., the $L^{2}$-norm of the difference between $\mathbf{x}$ and $\mathbf{x}^{\prime}$, and $\gamma > 0$ is an adjustable scale parameter that controls how quickly the similarity decays with distance.
* if $d\left(\mathbf{x},\mathbf{x}^{\prime}\right)\rightarrow\infty$, the kernel function $k\left(\mathbf{x},\mathbf{x}^{\prime}\right)\rightarrow{0}$, i.e., for a large distance between the feature vectors, we'll have a small similarity score.
* if $d\left(\mathbf{x},\mathbf{x}^{\prime}\right)\rightarrow{0}$, the kernel function $k\left(\mathbf{x},\mathbf{x}^{\prime}\right)\rightarrow{1}$, i.e., for a small distance between the feature vectors, we'll have a large similarity score, approaching the maximum value of `1`.
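
To make the two limiting cases concrete, here is a minimal sketch that writes the kernel out directly from the definition above. The name `sq_exp_kernel` and the test vectors are illustrative assumptions, not part of any library.

```julia
# Squared exponential kernel written out from the definition
# (`sq_exp_kernel` is an illustrative name, not a library function)
function sq_exp_kernel(x::AbstractVector, x′::AbstractVector; γ::Float64 = 1.0)
    dsq = sum(abs2, x .- x′)  # squared L2 distance between the two vectors
    return exp(-γ * dsq)
end

x = [1, 0, 2, 0]
near = [1, 0, 2, 0]  # identical vector, so d = 0
far = [0, 9, 0, 9]   # very different vector, so d is large

k_same = sq_exp_kernel(x, near)  # zero distance gives the maximum similarity
k_far = sq_exp_kernel(x, far)    # large distance gives a similarity near zero
```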

### Similarity between sarcastic samples
Let's use the [kernel function](https://en.wikipedia.org/wiki/Kernel_method) to compute the similarity between like samples, e.g., different sarcasm samples. We'll use [the `SqExponentialKernel()` function exported from the KernelFunctions.jl package](https://github.com/JuliaGaussianProcesses/KernelFunctions.jl/tree/master).
* What is going on with [the `with_lengthscale(...)` function](https://juliagaussianprocesses.github.io/KernelFunctions.jl/stable/userguide/#Kernel-Creation)? It sets the lengthscale $\ell$ of the kernel. In KernelFunctions.jl, the squared exponential kernel with lengthscale $\ell$ is $\exp\left(-d^{2}/(2\ell^{2})\right)$, so $\gamma = 1/(2\ell^{2})$, and a lengthscale of `2.0` corresponds to $\gamma = 1/8$. We save the kernel function in the `k` variable.

In [30]:
k = with_lengthscale(SqExponentialKernel(), 2.0);

Given our kernel (similarity) function, we can iterate through the sarcasm samples and compute a `sarcasim_distance_matrix::Array{Float16,2}`. Despite its name, this matrix holds similarity scores: it is an $N\times{N}$ lower triangular matrix with the self-similarity values on the diagonal and the pairwise similarities in the off-diagonal positions.

In [32]:
sarcasim_distance_matrix = let

    number_of_sarchastic_samples = 20; # let's look at the first N
    samples = keys(sarcasim_hashed_dictionary) |> collect |> sort; # get a sorted list of keys
    sarcasim_distance_matrix = zeros(Float16, number_of_sarchastic_samples, number_of_sarchastic_samples); # initialize with zeros

    # process N keys -
    for i ∈ 1:number_of_sarchastic_samples
        x = sarcasim_hashed_dictionary[samples[i]];
        
        for j ∈ 1:i
            x′ = sarcasim_hashed_dictionary[samples[j]]
            sarcasim_distance_matrix[i,j] = k(x,x′) |> Float16
        end
    end
    sarcasim_distance_matrix
end;

In [33]:
sarcasim_distance_matrix

20×20 Matrix{Float16}:
 1.0      0.0      0.0      0.0      0.0      …  0.0     0.0     0.0     0.0
 0.1354   1.0      0.0      0.0      0.0         0.0     0.0     0.0     0.0
 0.11945  0.1969   1.0      0.0      0.0         0.0     0.0     0.0     0.0
 0.11945  0.04395  0.0821   1.0      0.0         0.0     0.0     0.0     0.0
 0.1054   0.1738   0.093    0.05643  1.0         0.0     0.0     0.0     0.0
 0.1533   0.07245  0.1054   0.0388   0.05643  …  0.0     0.0     0.0     0.0
 0.1533   0.11945  0.1354   0.0821   0.1969      0.0     0.0     0.0     0.0
 0.253    0.1533   0.1738   0.0821   0.1969      0.0     0.0     0.0     0.0
 0.0821   0.1354   0.093    0.0342   0.0639      0.0     0.0     0.0     0.0
 0.0821   0.04977  0.07245  0.0342   0.1354      0.0     0.0     0.0     0.0
 0.0388   0.1054   0.093    0.01616  0.04977  …  0.0     0.0     0.0     0.0
 0.1354   0.1738   0.11945  0.05643  0.0821      0.0     0.0     0.0     0.0
 0.0639   0.0388   0.05643  0.04395  0.02351     0.0 

Let's compare a pair of headlines with a relatively high similarity score, e.g., samples `20` and `1`, which have a similarity score of `0.2231`, with a less similar pair, such as samples `13` and `1`, which have a similarity score of `0.0388`.

In [35]:
samples_sarchastic = keys(sarcasim_hashed_dictionary) |> collect |> sort;

#### Similar

In [37]:
corpusmodel.records[samples_sarchastic[20]]

MySarcasmRecordModel(true, "report make it stop", "https://www.theonion.com/report-make-it-stop-1822874962")

In [38]:
corpusmodel.records[samples_sarchastic[1]]

MySarcasmRecordModel(true, "thirtysomething scientists unveil doomsday clock of hair loss", "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205")

#### Not similar

In [40]:
corpusmodel.records[samples_sarchastic[13]]

MySarcasmRecordModel(true, "expansive obama state of the union speech to touch on patent law entomology the films of robert altman", "https://politics.theonion.com/expansive-obama-state-of-the-union-speech-to-touch-on-p-1819574546")

In [41]:
corpusmodel.records[samples_sarchastic[1]]

MySarcasmRecordModel(true, "thirtysomething scientists unveil doomsday clock of hair loss", "https://www.theonion.com/thirtysomething-scientists-unveil-doomsday-clock-of-hai-1819586205")
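
The two scores quoted for these pairs can be checked by hand. The sketch below assumes the kernel has the form $\exp\left(-d^{2}/(2\ell^{2})\right)$ with $\ell = 2$, and it builds count vectors over only the words in the two headlines being compared (words absent from both contribute nothing to the distance). The helper name `headline_similarity` is hypothetical.

```julia
# Similarity of two headlines from word counts over their shared vocabulary
# (illustrative helper; assumes k = exp(-d^2 / (2ℓ^2)))
function headline_similarity(h1::String, h2::String; ℓ::Float64 = 2.0)
    w1, w2 = split(h1, ' '), split(h2, ' ')
    vocabulary = unique(vcat(w1, w2))
    x = [count(==(w), w1) for w in vocabulary]   # word counts for headline 1
    x′ = [count(==(w), w2) for w in vocabulary]  # word counts for headline 2
    dsq = sum(abs2, x .- x′)                     # squared L2 distance
    return exp(-dsq / (2 * ℓ^2))
end

h1 = "thirtysomething scientists unveil doomsday clock of hair loss"
h20 = "report make it stop"
h13 = "expansive obama state of the union speech to touch on patent law entomology the films of robert altman"

similar = headline_similarity(h1, h20)     # squared distance 12
dissimilar = headline_similarity(h1, h13)  # squared distance 26
```

Note that the "similar" pair shares no words at all. Its score is higher only because both headlines are short, so the squared distance (the total number of unshared word counts) is small.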

### Similarity between sarcastic and unsarcastic samples
Ultimately, we want to classify an unseen headline as either sarcastic or not sarcastic. Could we do that using the kernel similarity scores?
* `Hypothesis`: Pairs of same-type headlines, i.e., sarcastic with sarcastic, will have higher similarity scores than mixed pairs in which headline `i` is sarcastic and headline `j` is non-sarcastic. Thus, we would expect the cross-similarity scores to be small compared with the same-type scores. 

In [43]:
samples_not_sarchastic = keys(unsarcasim_hashed_dictionary) |> collect |> sort;

Let's compute an $N\times{N}$ cross-similarity matrix where each entry holds the similarity between headline `i` (a sarcastic headline) and headline `j` (a non-sarcastic headline). We'll save this data in the `cross_comparision_matrix::Array{Float16,2}` matrix. Rows will be sarcastic headlines, and the columns will be non-sarcastic headlines.

In [45]:
cross_comparision_matrix = let

    number_of_cross_samples = 20; # let's look at the first 20
    cross_comparision_matrix = zeros(Float16, number_of_cross_samples, number_of_cross_samples); # initialize with zeros
    
    for i ∈ 1:number_of_cross_samples

        x = sarcasim_hashed_dictionary[samples_sarchastic[i]];
        
        for j ∈ 1:number_of_cross_samples
            x′ = unsarcasim_hashed_dictionary[samples_not_sarchastic[j]]
            cross_comparision_matrix[i,j] = k(x,x′) |> Float16
        end
    end
    cross_comparision_matrix
end

LoadError: DimensionMismatch: first array has length 64 which does not match the length of the second, 29665.

### Discussion
* Could we use the headline hashed feature vectors and the similarity scores computed using the squared exponential kernel function to classify sarcasm versus non-sarcasm? (__Hint__: look at some of the elements of the cross-similarity matrix)