# Example: Feature Hashing of Sarcasm Samples
Another strategy to represent text in mathematical form is [feature-hashing](https://en.wikipedia.org/wiki/Feature_hashing). 

Feature hashing converts text into a fixed-size numerical representation __without__ an explicit vocabulary $\mathcal{V}$, so it handles large vocabularies and unseen words gracefully. Let's look at a specific algorithm, the __Weinberger feature hashing algorithm__, also known as the __hashing trick__.

__Initialization:__ Given an array of tokens $\mathbf{X} = \{x_1, x_2, \ldots, x_n\}$, where $x_{i}\in\mathcal{V}$, and a dimension $d$, initialize a result array $\mathbf R = \mathbf 0\in\mathbb R^d.$


For each $x\in\mathbf{X}$ __do__:
1. Compute the hash value of the current token: $h \gets\texttt{hash}(x)$.
2. Compute the index of the hash value in the result array: $i \gets h \bmod d$ (with 1-based arrays, as in Julia, use $i \gets (h \bmod d) + 1$).
3. Update the result array: $\mathbf{R}_{i} \gets \mathbf{R}_{i} + 1$.

> **Note (Weinberger sign variant):**  
> Optionally use a sign function $s(x)\in\{+1,-1\}$ (e.g., one bit of a second, independent hash of $x$) and update  
> $$\mathbf{R}_{i} \gets \mathbf{R}_{i} + s(x),$$  
> so that tokens colliding in the same bucket tend to cancel rather than accumulate.


**Example**  
```text
Tokens: ["Hello", "world!", "This", "is", "a", "test", "."]
d = 10

Possible output (unsigned; the counts sum to the 7 tokens):
[0, 1, 2, 0, 1, 1, 0, 1, 1, 0]
```
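
The loop above, together with the optional sign variant, can be sketched in a few lines of Julia. This is a minimal illustration, not the `featurehashing(...)` implementation from the course package: the name `hashing_trick` and its keyword arguments are hypothetical, Julia's built-in `hash` stands in for the hash function, and the sign is taken from a second, seeded call to `hash` (an assumption made here so that the sign is not tied to the bucket index).

```julia
# Hypothetical helper for illustration; not the package's featurehashing(...) method.
function hashing_trick(tokens::Vector{String}; d::Int = 10, signed::Bool = false)
    R = zeros(Int, d)                          # result array R = 0
    for x in tokens
        h = hash(x)                            # step 1: hash the token
        i = Int(mod(h, UInt(d))) + 1           # step 2: bucket index (+1: Julia arrays are 1-based)
        s = 1                                  # unsigned update adds +1
        if signed && isodd(hash(x, UInt(1)))   # assumption: sign from a second, seeded hash
            s = -1
        end
        R[i] += s                              # step 3: update the bucket
    end
    return R
end

tokens = ["Hello", "world!", "This", "is", "a", "test", "."];
r_unsigned = hashing_trick(tokens; d = 10);
r_signed = hashing_trick(tokens; d = 10, signed = true);
```

Julia's `hash` values may differ between Julia versions, so the exact bucket positions are not guaranteed. What should always hold is that `sum(r_unsigned) == length(tokens)` and that `abs.(r_signed) .<= r_unsigned` elementwise.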

### Learning objectives
This example will familiarize students with using [the Weinberger feature hashing algorithm](https://en.wikipedia.org/wiki/Feature_hashing) to compute high-dimensional vectors representing unstructured text. The tasks for this example are:
* __Task 1: Prerequisites.__ To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). We can then reuse the corpus model and records computed in that example.
* __Task 2: Explore Feature Hashing.__ In this task, we'll compute [the feature hash vector representation](https://en.wikipedia.org/wiki/Feature_hashing) of the headline text of the sarcastic samples.

Let's get started!
___

## Task 1: Setup, Data, and Prerequisites
First, we set up the computational environment by including the `Include.jl` file and loading any needed resources.
* The [include command](https://docs.julialang.org/en/v1/base/base/#include) evaluates the contents of the input source file, `Include.jl`, in the notebook's global scope. The `Include.jl` file sets paths, loads required external packages, etc. For additional information on functions and types used in this material, see the [Julia programming language documentation](https://docs.julialang.org/en/v1/). 

In [39]:
include("Include.jl");

In addition to standard Julia libraries, we'll also use [the `VLDataScienceMachineLearningPackage.jl` package](https://github.com/varnerlab/VLDataScienceMachineLearningPackage.jl). Check out [the documentation](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/) for more information on the functions, types, and data used in this material.

### Data
To save some time, we'll load the saved file from the `SarcasmSamplesTokenizer` example using [the `load(...)` method exported by the FileIO.jl package](https://github.com/JuliaIO/FileIO.jl). To load the `jld2` (binary) saved file, we pass the path of the file we want to load to the [`load(...)` function](https://github.com/JuliaIO/FileIO.jl). This call returns the data as a [Julia `Dict` type](https://docs.julialang.org/en/v1/base/collections/#Base.Dict).

Set the path to the save file in the `path_to_save_file::String` variable. Then load the `jld2` file using [the `load(...)` method](https://juliaio.github.io/FileIO.jl/stable/reference/#FileIO.load), where the contents of the file are stored in the `saved_data_dictionary::Dict{String, Any}` variable. 

We saved the `corpusmodel::MySarcasmRecordCorpusModel` instance, which holds the other data of interest, e.g., the `tokendictionary`. Thus, we can get most of what we need from the `corpusmodel`.

In [40]:
path_to_save_file = joinpath(_PATH_TO_DATA, "CHEME-141-M4-SarcasmSamplesTokenizer-SavedData.jld2"); # JLD2 package encodes data
saved_data_dictionary = load(path_to_save_file);
corpusmodel = saved_data_dictionary["corpus"]; # pull data from the saved_data_dictionary -
list_of_records = corpusmodel.records; # the list of records from the corpus

___

## Task 2: Compute the feature hash vectors for sarcastic samples
In this task, we'll compute the feature hash vector representation of the text headlines for each __sarcastic sample__ in our sarcasm dataset. 

First, we need to collect all the sarcastic samples from the `corpusmodel::MySarcasmRecordCorpusModel` we loaded from the previous example. Let's iterate through the records and grab those whose `issarcastic` field indicates a sarcastic headline.

> __Record model__: The `corpusmodel.records` are [instances of the `MySarcasmRecordModel` type](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/types/#VLDataScienceMachineLearningPackage.MySarcasmRecordModel). Each record model has the following fields:
>    * `issarcastic`: has a value of `1` (`true`) if the record is sarcastic; otherwise, `0` (`false`).
>    * `headline`: the headline of the article, unstructured text
>    * `article_link`: link to the original news article. Useful in collecting supplementary data

Let's save the sarcastic samples in the `my_sarcastic_samples::Vector{MySarcasmRecordModel}` variable.

In [41]:
my_sarcastic_samples = let

    # initialize -
    records = Vector{MySarcasmRecordModel}(); # we use an empty vector to hold the sarcastic samples

    # process each record in the list of records -
    for (k,v) ∈ list_of_records
        if v.issarcastic == 1 # check: is the record sarcastic?
            push!(records, v); # if yes, grab it
        end
    end
    records;
end

13634-element Vector{MySarcasmRecordModel}:
 MySarcasmRecordModel(true, "author dismayed by amazon customers other purchases", "https://www.theonion.com/author-dismayed-by-amazon-customers-other-purchases-1819567854")
 MySarcasmRecordModel(true, "dollhousing crisis set to worsen mean older brother says", "https://entertainment.theonion.com/doll-housing-crisis-set-to-worsen-mean-older-brother-s-1819569425")
 MySarcasmRecordModel(true, "fear notshe means you no harm says elizabeth warren revealing docile hillary clinton to crowd", "https://politics.theonion.com/fear-not-she-means-you-no-harm-says-elizabeth-warren-1819579041")
 MySarcasmRecordModel(true, "clinton credits nevada victory to inescapable pitchblack tide of fate", "https://politics.theonion.com/clinton-credits-nevada-victory-to-inescapable-pitch-bl-1819578631")
 MySarcasmRecordModel(true, "merger of advertising giants brings together largest collection of people with no discernible skills", "https://www.theonion.com/merger-of-

To see what is going on with feature hashing, let's grab a random sarcastic sample from our dataset and compute its feature hash vector representation.

> __Idea__: We'll draw a random index between `1` and `number_of_records::Int64` [using the built-in `rand(...)` method](https://docs.julialang.org/en/v1/stdlib/Random/#Base.rand), and store the record at that index in the `random_test_record::MySarcasmRecordModel` variable.


Select a random record:

In [42]:
random_test_record = let 
    number_of_records = length(my_sarcastic_samples); # how many sarcastic records do we have?
    random_index = rand(1:number_of_records);          # pick a random index
    my_sarcastic_samples[random_index]                  # get the record at that index
end

MySarcasmRecordModel(true, "new ted cruz attack ad declares beto orourke too good for texas", "https://politics.theonion.com/new-ted-cruz-attack-ad-declares-beto-o-rourke-too-good-1829842240")

Next, let's chop up the headline text into tokens, and compute the feature hash vector for the random record. We'll start by __tokenizing__ the `headline` field of the `random_test_record` variable.
> __Tokenization:__ breaks down a string of text into smaller components, called tokens. In this case, we will tokenize the `headline` field of the `random_test_record` variable into an array of words by using [the split(...) method](https://docs.julialang.org/en/v1/base/strings/#Base.split) to split the `headline` string by cutting at the space characters. 

Let's save the array of tokens (words) in the `tokens::Vector{String}` variable:

In [43]:
tokens = split(random_test_record.headline, " ") .|> String # tokenize the headline text

12-element Vector{String}:
 "new"
 "ted"
 "cruz"
 "attack"
 "ad"
 "declares"
 "beto"
 "orourke"
 "too"
 "good"
 "for"
 "texas"
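
One detail worth checking before hashing: splitting on an explicit `" "` delimiter keeps empty strings when the text contains consecutive spaces, whereas calling `split` without a delimiter splits on any whitespace and drops empty pieces. The headline string below is made up for illustration and does not come from the dataset.

```julia
messy_headline = "new  ted cruz";                 # hypothetical headline with a double space

with_delim = String.(split(messy_headline, " "))  # ["new", "", "ted", "cruz"]
no_delim = String.(split(messy_headline))         # ["new", "ted", "cruz"]

# if the " " delimiter is kept, empty tokens can be dropped explicitly
cleaned = String.(split(messy_headline, " "; keepempty = false))
```

An empty token would still be hashed into some bucket, so filtering it out before hashing avoids a spurious count.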

Now, we compute the feature hash vector for the `random_test_record` variable using [the `featurehashing(...)` method](https://varnerlab.github.io/VLDataScienceMachineLearningPackage.jl/dev/text/#VLDataScienceMachineLearningPackage.featurehashing). This method takes the `tokens::Vector{String}` variable, the dimension `d` of the output vector, and an `algorithm` parameter (signed or unsigned).

In [44]:
example_hash_vectors = let

    # initialize -
    d = 20; # size of the output vector

    # compute the vectors -
    v₁ = featurehashing(tokens, d = d, algorithm = UnsignedFeatureHashing()); # unsigned feature hashing
    v₂ = featurehashing(tokens, d = d, algorithm = SignedFeatureHashing()); # signed feature hashing

    [v₁ v₂] # return: [unsigned signed]
end

20×2 Matrix{Int64}:
 0   0
 0   0
 2   2
 1  -1
 2   2
 1  -1
 1   1
 0   0
 0   0
 0   0
 1   1
 0   0
 0   0
 0   0
 1   1
 0   0
 0   0
 0   0
 2   2
 1  -1

The `example_hash_vectors` array shows two different feature hash representations of our random headline, displayed side-by-side for comparison. The first column shows the basic Weinberger feature hashing algorithm, where each token increments its corresponding hash bucket by +1. The second column shows the signed variant, where each token adds either +1 or -1 to its hash bucket, depending on the sign function. In this sample, the two columns have the same magnitude in every bucket and differ only in sign in rows 4, 6, and 20.
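
A quick sanity check on the matrix printed above confirms how the two columns relate. The values are copied by hand from that output, and the variable names are chosen here for illustration only.

```julia
# Columns copied from the 20×2 output above: [unsigned signed]
H = [0 0; 0 0; 2 2; 1 -1; 2 2; 1 -1; 1 1; 0 0; 0 0; 0 0;
     1 1; 0 0; 0 0; 0 0; 1 1; 0 0; 0 0; 0 0; 2 2; 1 -1];
unsigned_col, signed_col = H[:, 1], H[:, 2];

total_count = sum(unsigned_col)                            # one count per token: 12
bounded = all(abs.(signed_col) .<= unsigned_col)           # signs can cancel, never add mass
same_parity = all(iseven.(unsigned_col .- signed_col))     # per-bucket parity matches
occupied = count(!iszero, unsigned_col)                    # number of non-empty buckets: 9
```

The unsigned column sums to the 12 tokens of the headline, while only 9 of the 20 buckets are occupied, so several tokens must share a bucket.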

#### Key Differences from Tokenization (M4 vs. Previous Tokenizer Example)

The `d` parameter controls the fixed dimensionality of our feature vectors, which is fundamentally different from what we did in the tokenizer notebook. There, the vocabulary size was determined by the dataset (more than 29,000 unique tokens). Here, we choose `d = 20` to create exactly 20-dimensional vectors regardless of how many unique words exist in our text.

This represents a completely different philosophy for text representation. Feature hashing doesn't require building or storing a vocabulary dictionary: each word is mapped directly to a position using a hash function, not a lookup table. Every text sample produces a vector of the same size (`d` dimensions), regardless of text length or vocabulary diversity, whereas the tokenizer produced variable-length sequences that required padding.
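
To illustrate the fixed-size property, the sketch below hashes three headlines of different lengths (taken from the sample output earlier in this example) into the rows of a single matrix. The helper `unsigned_hash_vector` is a hypothetical stand-in built on Julia's `hash`, not the package's `featurehashing(...)` method, so its bucket positions will generally differ from the package's.

```julia
# Hypothetical stand-in for an unsigned feature hasher (not the package method).
function unsigned_hash_vector(tokens::Vector{String}, d::Int)
    v = zeros(Int, d)
    for t in tokens
        v[Int(mod(hash(t), UInt(d))) + 1] += 1   # bucket index, shifted for 1-based arrays
    end
    return v
end

headlines = [
    "new ted cruz attack ad declares beto orourke too good for texas",
    "author dismayed by amazon customers other purchases",
    "clinton credits nevada victory to inescapable pitchblack tide of fate",
];
d = 20;
X = zeros(Int, length(headlines), d);            # one row per headline, d columns
for (r, headline) in enumerate(headlines)
    X[r, :] = unsigned_hash_vector(String.(split(headline)), d)
end
size(X)                                          # (3, 20)
```

Each row has length `d` no matter how many tokens the headline contains, and each row sums to that headline's token count.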

The trade-off here is that multiple different words can map to the same position, which we call a hash collision. This is a controlled compromise where we accept some information loss for computational efficiency and memory savings. Unknown words are handled automatically through hashing, without needing special `<unk>` tokens like we used in the tokenization approach.
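
How likely are collisions for a given `d`? Under the idealized assumption that each distinct token is hashed uniformly and independently into one of `d` buckets (real hash functions only approximate this), the birthday-problem calculation below gives a rough estimate. The function names are hypothetical and chosen for this sketch.

```julia
# Probability that at least two of n distinct tokens share a bucket,
# assuming uniform, independent hashing into d buckets.
function collision_probability(n::Int, d::Int)
    n > d && return 1.0                  # pigeonhole: more tokens than buckets
    p_none = 1.0
    for k in 0:(n - 1)
        p_none *= (d - k) / d            # the (k+1)-th token avoids the k occupied buckets
    end
    return 1.0 - p_none
end

# Expected number of non-empty buckets under the same assumption.
expected_occupied(n::Int, d::Int) = d * (1 - (1 - 1 / d)^n)

p_collision = collision_probability(12, 20);     # 12 tokens, d = 20, as in the example
e_occupied = expected_occupied(12, 20);          # ≈ 9.2
```

For 12 tokens and `d = 20`, the expected number of occupied buckets is close to the 9 non-empty rows in the example output, and at least one collision is very likely. Increasing `d` is the main lever for reducing collisions.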

The signed variant helps reduce the impact of these hash collisions. When two words map to the same bucket, instead of both adding +1 (which could artificially inflate that feature's importance), they might add +1 and -1, potentially canceling each other out and reducing spurious correlations.

This approach trades some precision for significant computational advantages, especially when dealing with large vocabularies or streaming text where building explicit vocabularies is impractical.