# Example: Bag of Words, TF-IDF and PMI
In this example, we'll play around with simple collections of text data and explore how to create basic text embeddings using the Bag of Words model, Term Frequency-Inverse Document Frequency (TF-IDF), and Pointwise Mutual Information (PMI).

> __Learning Objectives:__
> 
> By the end of this example, you should be able to:
>
> Three learning objectives go here.

Let's get started!
___

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants.

> The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, the packages are loaded. Otherwise, the packages are downloaded and then loaded.

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

### Data
Here, let's construct a small corpus of simple sentences to work with in the `sentences::Array{String,1}` variable.

In [None]:
sentences = let

    # initialize -
    sentences_array = Array{String,1}(); # initialize an array of sentence strings

    # add sentences -
    push!(sentences_array, "I love machine learning and data science .");
    push!(sentences_array, "Machine learning is fun ."); # the second sentence is close to the third sentence
    push!(sentences_array, "Machine learning is great ."); # expect similarity b/w 2nd and 3rd sentences
    push!(sentences_array, "I love coding in Julia !");
    push!(sentences_array, "Julia is a great programming language ?");
    push!(sentences_array, "I enjoy learning new things about data science , machine learning , and artificial intelligence .");
    
    # return the array
    sentences_array
end

Next, we'll preprocess the data in the `sentences::Array{String,1}` variable by converting each sentence to lowercase and splitting it on whitespace into tokens. The unique tokens, together with our control tokens, form the vocabulary.

We'll store the vocabulary in the `vocabulary::Dict{String, Int64}` variable, where the keys are the unique words and the values are their corresponding indices, and the inverse vocabulary in the `inverse_vocabulary::Dict{Int64, String}` variable, where the keys are the indices and the values are the unique words.

In [None]:
vocabulary, inverse_vocabulary = let

    # initialize -
    vocabulary = Dict{String, Int64}(); # initialize the vocabulary dictionary
    inverse_vocabulary = Dict{Int64, String}(); # initialize the inverse vocabulary dictionary
    index = 1; # initialize the index counter

    # control tokens -
    control_tokens = ["<bos>", "<eos>", "<pad>", "<unk>"]; # define control tokens

    # tmp variables -
    words = Set{String}(); # temporary set to hold the unique words
    for sentence in sentences
        tmp = split(lowercase(sentence)); # convert to lowercase and split by whitespace
        push!(words, tmp...); # add words to the set
    end
    words_array = collect(words) |> sort; # convert set to array, and sort it

    # append the control tokens to the words array
    words_array = vcat(words_array, control_tokens);
    for word in words_array
        vocabulary[word] = index;
        inverse_vocabulary[index] = word;
        index += 1;
    end

    # return the vocabulary and inverse vocabulary
    (vocabulary, inverse_vocabulary)
end

(Dict("!" => 1, "is" => 17, "enjoy" => 11, "data" => 10, "language" => 19, "coding" => 9, "science" => 25, "<bos>" => 27, "a" => 5, "and" => 7…), Dict(5 => "a", 16 => "intelligence", 20 => "learning", 12 => "fun", 24 => "programming", 28 => "<eos>", 8 => "artificial", 17 => "is", 30 => "<unk>", 1 => "!"…))

What's in the `vocabulary::Dict{String, Int64}` and `inverse_vocabulary::Dict{Int64, String}` variables? Let's take a look!

In [None]:
vocabulary

Dict{String, Int64} with 30 entries:
  "!"            => 1
  "is"           => 17
  "enjoy"        => 11
  "data"         => 10
  "language"     => 19
  "coding"       => 9
  "science"      => 25
  "<bos>"        => 27
  "a"            => 5
  "and"          => 7
  ","            => 2
  "programming"  => 24
  "love"         => 21
  "?"            => 4
  "."            => 3
  "in"           => 15
  "i"            => 14
  "<unk>"        => 30
  "about"        => 6
  "<pad>"        => 29
  "machine"      => 22
  "artificial"   => 8
  "learning"     => 20
  "intelligence" => 16
  "great"        => 13
  ⋮              => ⋮
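To make the role of the two dictionaries concrete, here is a minimal sketch of an encode/decode round trip. It uses a hypothetical five-entry vocabulary (`toy_vocabulary`, `toy_inverse_vocabulary`) rather than the 30-entry one built above, and the helper names `encode` and `decode` are assumptions for illustration, not functions from `Include.jl`.

```julia
# Hypothetical miniature vocabulary (the notebook's `vocabulary` has 30 entries)
toy_words = ["fun", "is", "learning", "machine", "<unk>"];
toy_vocabulary = Dict(word => index for (index, word) in enumerate(toy_words));
toy_inverse_vocabulary = Dict(index => word for (word, index) in toy_vocabulary);

# encode: token -> index, falling back to the <unk> index for out-of-vocabulary tokens
encode(tokens) = [get(toy_vocabulary, t, toy_vocabulary["<unk>"]) for t in tokens];

# decode: index -> token
decode(indices) = [toy_inverse_vocabulary[i] for i in indices];

tokens = String.(split(lowercase("Machine learning is hard")));
ids = encode(tokens);     # [4, 3, 2, 5], since "hard" is not in the toy vocabulary
roundtrip = decode(ids);  # ["machine", "learning", "is", "<unk>"]
```

The round trip is lossy for out-of-vocabulary tokens: "hard" comes back as `<unk>`, which is the same fallback the Bag of Words cell uses in Task 1.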

### Helper Implementations
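The `hashing_vectorizer(...)` function maps each feature (e.g., a word) to a position in a fixed-length count vector using its hash value. We'll use it in Task 1 as an alternative to the dictionary-based Bag of Words representation.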

In [None]:
function hashing_vectorizer(features::Array{String,1}; length::Int64 = 10)::Array{Int64,1}

    # initialize -
    new_hash_vector = zeros(Int,length);
    for i ∈ eachindex(features)
        feature = features[i]; # get feature
        h = hash(feature); # hash value of the feature
        j = mod(h,length); # this gives us an index in 0:(length-1)
        
        if (j == 0)
            j = length; # Julia arrays are 1-based, so map bucket 0 to the last position
        end
        new_hash_vector[j] += 1;
    end
   
    new_hash_vector; # return
end

hashing_vectorizer (generic function with 1 method)

___

## Task 1: Bag of Words Representations
In this task, we'll create a Bag of Words representation for our corpus of sentences. 

> __Bag of Words Model__
>
> The Bag of Words (BoW) model is a technique for text embedding. As the name suggests, we represent a text (such as a sentence or a document) as a "bag" (multiset) of its words, disregarding grammar and word order but keeping multiplicity. Given a vocabulary of unique words, each text is represented as a vector where each dimension corresponds to a word in the vocabulary, and the value in that dimension indicates the frequency of that word in the text.

Let's compute the Bag of Words representation for each sentence in our corpus and store the results in the `bow_matrix::Array{Int64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.

In [None]:
bow_matrix = let

    # initialize -
    num_sentences = length(sentences); # number of sentences
    vocab_size = length(vocabulary); # size of the vocabulary
    bow_matrix = zeros(Int64, num_sentences, vocab_size); # initialize the Bag of Words matrix

    # populate the Bag of Words matrix -
    for (i, sentence) in enumerate(sentences)

        # add the <bos> ... <eos> token wrappers
        augmented_sentence = "<bos> " * sentence * " <eos>";
        words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
        for word in words
            if haskey(vocabulary, word)
                index = vocabulary[word];
                bow_matrix[i, index] += 1; # increment the count for the word
            else
                unk_index = vocabulary["<unk>"];
                bow_matrix[i, unk_index] += 1; # increment the count for unknown words
            end
        end
    end

    # return the Bag of Words matrix
    bow_matrix
end

5×30 Matrix{Int64}:
 0  0  1  0  0  0  1  0  0  1  0  0  0  …  0  1  1  1  0  0  1  0  1  1  0  0
 0  0  1  0  0  0  0  0  0  0  0  1  0     0  1  0  1  0  0  0  0  1  1  0  0
 1  0  0  0  0  0  0  0  1  0  0  0  0     0  0  1  0  0  0  0  0  1  1  0  0
 0  0  0  1  1  0  0  0  0  0  0  0  1     1  0  0  0  0  1  0  0  1  1  0  0
 0  2  1  0  0  1  1  1  0  1  1  0  0     0  2  0  1  1  0  1  1  1  1  0  0
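As a small aside, the sketch below isolates the "disregarding word order but keeping multiplicity" property of the Bag of Words model. The helper `toy_bow` and the four-word `vocab` are hypothetical and are not part of the notebook's code.

```julia
# Hypothetical helper: count how often each vocabulary word occurs in a sentence
function toy_bow(sentence::String, vocab::Vector{String})::Vector{Int}
    tokens = split(lowercase(sentence)); # lowercase, then split by whitespace
    return [count(==(word), tokens) for word in vocab];
end

vocab = ["bites", "dog", "man", "the"];
a = toy_bow("The dog bites the man", vocab); # [1, 1, 1, 2]
b = toy_bow("The man bites the dog", vocab); # [1, 1, 1, 2]

println(a == b); # true: the two sentences are indistinguishable
```

The repeated word "the" is counted twice (multiplicity is kept), but the two sentences, which mean different things, receive identical vectors (order is lost).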

### What is wrong with this representation? 
Fill me in later.

Let's consider an alternative representation. Instead of maintaining a vocabulary dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying [a hash function `h(...)`](https://docs.julialang.org/en/v1/base/base/#Base.hash) to the features (e.g., words), then using the hash values (reduced modulo the vector length) as feature indices and updating the resulting vector at those indices.

We've implemented a simple version of a hashing vectorizer in the `hashing_vectorizer(...)` function. We pass in a list of words (from a sentence) and the desired length of the output vector, and the function returns a vector of the specified length with counts of the hashed words.

We save this output in the `vectorized_sentence::Array{Int64,1}` variable. Let's take a look!

In [None]:
vectorized_sentence = let

    # initialize -
    i = 1; # index of sentence to hash
    sentence = sentences[i];
    augmented_sentence = "<bos> " * sentence * " <eos>";
    words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
    tv = hashing_vectorizer(words, length = length(vocabulary));
end

30-element Vector{Int64}:
 0
 1
 1
 2
 0
 0
 1
 1
 0
 0
 0
 1
 1
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

You can change the sentence index `i` in the code block above to see how different sentences are represented using this hashing vectorizer. Further, you can modify the `length` parameter to see how the size of the output vector affects the representation.

Play around with different sentences and vector lengths to see how the hashing vectorizer performs! As a starting point, the next cell hashes the same sentence into a shorter vector of length 10 and saves the result in the `alternative_vectorized_sentence::Array{Int64,1}` variable. Let's take a look!

In [None]:
alternative_vectorized_sentence = let

    # initialize -
    i = 1; # index of sentence to hash
    sentence = sentences[i];
    desired_sentence_length = 10; # desired length of the output vector
    augmented_sentence = "<bos> " * sentence * " <eos>";
    words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
    tv = hashing_vectorizer(words, length = desired_sentence_length);
end

10-element Vector{Int64}:
 0
 2
 2
 2
 0
 0
 1
 3
 0
 0
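Entries larger than one in a hashed vector can come from collisions, where distinct words land in the same bucket. The sketch below illustrates this under stated assumptions: `toy_hashed_counts` is a hypothetical helper that mirrors the bucket rule of `hashing_vectorizer(...)` (it is not the notebook's function), and the exact bucket positions depend on Julia's `hash` implementation, which may differ between Julia versions.

```julia
# Hypothetical helper mirroring the bucket rule of `hashing_vectorizer`
function toy_hashed_counts(features::Vector{String}, nbuckets::Int)::Vector{Int}
    counts = zeros(Int, nbuckets);
    for feature in features
        j = Int(mod1(hash(feature), UInt(nbuckets))); # bucket index in 1:nbuckets
        counts[j] += 1;
    end
    return counts;
end

# ten distinct tokens (the first sentence, wrapped in <bos> ... <eos>)
features = ["<bos>", "i", "love", "machine", "learning", "and", "data", "science", ".", "<eos>"];
wide = toy_hashed_counts(features, 30);  # many buckets: collisions are possible but less likely
narrow = toy_hashed_counts(features, 4); # few buckets: collisions are unavoidable

println(sum(wide), " ", sum(narrow)); # both sums equal the number of tokens (10)
println(maximum(narrow));             # at least 3, by the pigeonhole principle
```

Every token is counted exactly once, so the vector always sums to the number of tokens. What shrinks with a shorter vector is our ability to tell which word produced which count.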

## Task 2: TF-IDF Representations
In this task, we'll compute the Term Frequency-Inverse Document Frequency (TF-IDF) representation for our corpus of sentences. The TF-IDF score for a term $t$ in a document $d$ is given by the product of two terms:
$$
\boxed{
\begin{align*}
\text{TF-IDF}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t, \mathcal{D})
\end{align*}}
$$

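The __Term Frequency__ ($\text{tf}$) is the raw count of term $t$ in document $d$, often normalized by the total number of words in $d$. In this example, we normalize by the total number of tokens in $d$, including the `<bos>` and `<eos>` control tokens. The __Inverse Document Frequency__ ($\text{idf}$) measures how rare the term is across the corpus: terms that appear in only a few documents receive a high score, while terms that appear in every document receive a score of zero. It is calculated as:
$$
\text{idf}(t, \mathcal{D}) = \log \left( \frac{N}{|\{d \in \mathcal{D} : t \in d\}|} \right)
$$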
where $N$ is the total number of documents in the corpus $\mathcal{D}$, and the denominator is the number of documents where the term $t$ appears. In practice (especially for small corpora), we often use a smoothed IDF to avoid division-by-zero and to reduce extreme values; we'll use that smoothed form below.

First, let's compute the TF values for each term in our vocabulary for each sentence in the corpus. We'll store these values in the `TF_matrix::Array{Float64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.


In [None]:
TF_matrix = let

    # initialize -
    num_sentences = size(bow_matrix, 1); # number of sentences
    vocab_size = size(bow_matrix, 2); # size of the vocabulary
    TF_matrix = zeros(Float64, num_sentences, vocab_size); # initialize the TF matrix

    # populate the TF matrix -
    for i in 1:num_sentences
        total_terms = sum(bow_matrix[i, :]);
        if total_terms == 0
            continue
        end
        for j in 1:vocab_size
            TF_matrix[i, j] = bow_matrix[i, j] / total_terms;
        end
    end

    # return the TF matrix
    TF_matrix
end

5×30 Matrix{Float64}:
 0.0    0.0       0.1        0.0       …  0.1        0.1        0.0  0.0
 0.0    0.0       0.142857   0.0          0.142857   0.142857   0.0  0.0
 0.125  0.0       0.0        0.0          0.125      0.125      0.0  0.0
 0.0    0.0       0.0        0.111111     0.111111   0.111111   0.0  0.0
 0.0    0.111111  0.0555556  0.0          0.0555556  0.0555556  0.0  0.0

Next, let's compute the IDF values for each term in our vocabulary across the entire corpus. We'll store these values in the `IDF_values_dictionary::Dict{String, Float64}` variable, where the keys are the unique words and the values are their corresponding IDF scores.

To keep IDF values finite in small corpora (and to avoid division-by-zero when a term appears in zero documents), we'll use a smoothed IDF:
$$
\boxed{
\begin{align*}
\text{idf}(t, \mathcal{D}) &= \log \left( \frac{N + 1}{|\{d \in \mathcal{D} : t \in d\}| + 1} \right)
\end{align*}}
$$
where $N$ is the total number of documents in the corpus $\mathcal{D}$.
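To see what the smoothing changes, the sketch below evaluates both IDF formulas for a few document frequencies. The helper names `idf_raw` and `idf_smooth` and the corpus size `N = 6` are assumptions chosen for illustration.

```julia
# raw and smoothed IDF, following the two formulas in the text
idf_raw(N::Int, df::Int) = log(N / df);
idf_smooth(N::Int, df::Int) = log((N + 1) / (df + 1));

N = 6; # hypothetical corpus size
for df in (0, 1, 3, 6)
    println("df = $df: raw = $(idf_raw(N, df)), smoothed = $(idf_smooth(N, df))");
end
```

For a term that appears in no document (`df = 0`), the raw formula returns `Inf`, while the smoothed formula returns the finite value $\log(N + 1)$. For a term that appears in every document (`df = N`), both formulas return zero.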


In [None]:
IDF_values_dictionary = let

    # initialize -
    num_sentences = length(sentences); # number of sentences
    IDF_values_dictionary = Dict{String, Float64}(); # initialize dictionary of IDF values

    # compute smoothed IDF for each term in the vocabulary -
    for (word, index) in vocabulary
        doc_frequency = sum(bow_matrix[:, index] .> 0);
        IDF_values_dictionary[word] = log((num_sentences + 1) / (doc_frequency + 1));
    end

    # return dictionary
    IDF_values_dictionary
end


Dict{String, Float64} with 30 entries:
  "!"            => 1.60944
  "is"           => 0.916291
  "enjoy"        => 1.60944
  "data"         => 0.916291
  "language"     => 1.60944
  "coding"       => 1.60944
  "science"      => 0.916291
  "<bos>"        => 0.0
  "a"            => 1.60944
  "and"          => 0.916291
  ","            => 1.60944
  "programming"  => 1.60944
  "love"         => 0.916291
  "?"            => 1.60944
  "."            => 0.510826
  "in"           => 1.60944
  "i"            => 0.510826
  "<unk>"        => 0.0
  "about"        => 1.60944
  "<pad>"        => 0.0
  "machine"      => 0.510826
  "artificial"   => 1.60944
  "learning"     => 0.510826
  "intelligence" => 1.60944
  "great"        => 1.60944
  ⋮              => ⋮

Finally, we can compute the TF-IDF representation for each sentence in our corpus by multiplying the TF values with the corresponding IDF values. We'll store the resulting TF-IDF representations in the `TFIDF_matrix::Array{Float64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.

In [None]:
TFIDF_matrix = let

    # initialize -
    num_sentences = size(bow_matrix, 1); # number of sentences
    vocab_size = size(bow_matrix, 2); # size of the vocabulary
    TFIDF_matrix = zeros(Float64, num_sentences, vocab_size); # initialize the TF-IDF matrix

    # populate the TF-IDF matrix -
    for i in 1:num_sentences
        for j in 1:vocab_size
            word = inverse_vocabulary[j];
            idf_value = IDF_values_dictionary[word];
            TFIDF_matrix[i, j] = TF_matrix[i, j] * idf_value;
        end
    end

    # return the TF-IDF matrix
    TFIDF_matrix
end

5×30 Matrix{Float64}:
 0.0      0.0       0.0510826  0.0       …  0.0        0.0  0.0  0.0  0.0
 0.0      0.0       0.0729751  0.0          0.0        0.0  0.0  0.0  0.0
 0.20118  0.0       0.0        0.0          0.0        0.0  0.0  0.0  0.0
 0.0      0.0       0.0        0.178826     0.0        0.0  0.0  0.0  0.0
 0.0      0.178826  0.0283792  0.0          0.0894132  0.0  0.0  0.0  0.0

So what does the TF-IDF representation tell us about our sentences?
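One possible way to start exploring this question is to compare sentences by the cosine similarity of their TF-IDF rows. The sketch below is a minimal, self-contained illustration on a hypothetical three-sentence toy corpus (not the notebook's `sentences`). The names `docs`, `cosine`, `s12`, and `s13` are assumptions for illustration, and cosine similarity is only one of several reasonable similarity measures.

```julia
using LinearAlgebra # dot, norm (standard library)

# Hypothetical toy corpus: the first two sentences share three of their four words
docs = ["machine learning is fun", "machine learning is great", "i love coding in julia"];
tokens = [String.(split(d)) for d in docs];
vocab = sort(unique(vcat(tokens...)));
N = length(docs);

# term counts (documents × vocabulary), then TF as row-normalized counts
counts = [count(==(w), tokens[i]) for i in 1:N, w in vocab];
TF = counts ./ sum(counts, dims = 2);

# smoothed IDF, same form as the boxed equation above
df = vec(sum(counts .> 0, dims = 1));
IDF = log.((N + 1) ./ (df .+ 1));

# TF-IDF: scale each column of TF by the IDF of the corresponding word
TFIDF = TF .* IDF';

cosine(a, b) = dot(a, b) / (norm(a) * norm(b));
s12 = cosine(TFIDF[1, :], TFIDF[2, :]); # sentences 1 and 2 share three words
s13 = cosine(TFIDF[1, :], TFIDF[3, :]); # sentences 1 and 3 share no words

println(s12 > s13); # true
```

Sentences with no words in common have a similarity of exactly zero, while the shared words "machine", "learning", and "is" give the first two sentences a positive similarity. Because those shared words also have the lowest IDF in this toy corpus, the distinguishing words "fun" and "great" carry more weight, which keeps the similarity well below one.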

___

## Task 3: PMI Representations
Fill me in later.

## Summary
One concise, direct summary sentence goes here.

> __Key Takeaways:__
> 
> Three key takeaways go here.

One concise, direct concluding sentence goes here.
___