# Example: Bag of Words, TF-IDF and PMI
In this example, we'll play around with simple collections of text data and explore how to create basic text embeddings using the Bag of Words model, Term Frequency-Inverse Document Frequency (TF-IDF), and Pointwise Mutual Information (PMI).

> __Learning Objectives:__
> 
> By the end of this example, you should be able to:
>
> * __Bag of Words construction:__ Construct Bag of Words representations for text data using both dictionary-based and hashing-based vectorization methods.
> * __TF-IDF computation:__ Compute Term Frequency-Inverse Document Frequency (TF-IDF) scores to re-weight word counts and identify distinctive terms in a document.
> * __PMI estimation:__ Estimate Pointwise Mutual Information (PMI) and Positive PMI (PPMI) matrices from corpus statistics to quantify word associations.

Let's get started!
___

## Setup, Data, and Prerequisites
We set up the computational environment by including the `Include.jl` file, loading any needed resources, such as sample datasets, and setting up any required constants.

> The `Include.jl` file also loads external packages, various functions that we will use in the exercise, and custom types to model the components of our problem. It checks for a `Manifest.toml` file; if it finds one, the packages are loaded. Otherwise, the packages are downloaded and then loaded.

In [1]:
include(joinpath(@__DIR__, "Include.jl")); # include the Include.jl file

### Data
Here, let's construct a small corpus of simple sentences to work with in the `sentences::Array{String,1}` variable.

In [2]:
sentences = let

    # initialize -
    sentences_array = Array{String,1}(); # initialize an array of sentence strings

    # add sentences -
    push!(sentences_array, "I love machine learning and data science ."); #1 the first sentence is different than the others
    push!(sentences_array, "Machine learning is fun ."); # the second sentence is close to the third sentence
    push!(sentences_array, "Machine learning is great ."); # expect similarity b/w 2nd and 3rd sentences
    push!(sentences_array, "I love coding machine learning in Julia !"); # 4 the fourth sentence is similar to the first sentence
    push!(sentences_array, "Julia is a great programming language ?");
    push!(sentences_array, "I enjoy learning new things about data science , machine learning , and artificial intelligence .");
    
    # return the array
    sentences_array
end

6-element Vector{String}:
 "I love machine learning and data science ."
 "Machine learning is fun ."
 "Machine learning is great ."
 "I love coding machine learning in Julia !"
 "Julia is a great programming language ?"
 "I enjoy learning new things abo" ⋯ 35 bytes ⋯ ", and artificial intelligence ."

Next, we'll preprocess the data in the `sentences::Array{String,1}` variable by tokenizing the sentences into words, and converting them to lowercase. This, along with our control tokens, will be our vocabulary model.

We'll store the vocabulary in the `vocabulary::Dict{String, Int64}` variable, where the keys are the unique words and the values are their corresponding indices, and the inverse vocabulary in the `inverse_vocabulary::Dict{Int64, String}` variable, where the keys are the indices and the values are the unique words.

In [3]:
vocabulary, inverse_vocabulary = let

    # initialize -
    vocabulary = Dict{String, Int64}(); # initialize the vocabulary dictionary
    inverse_vocabulary = Dict{Int64, String}(); # initialize the inverse vocabulary dictionary
    index = 1; # initialize the index counter

    # control tokens -
    control_tokens = ["<bos>", "<eos>", "<pad>", "<unk>"]; # define control tokens

    # tmp variables -
    words = Set{String}(); # temporary set to hold the unique words
    for sentence in sentences
        tmp = split(lowercase(sentence)); # convert to lowercase and split by whitespace
        push!(words, tmp...); # add words to the set
    end
    words_array = collect(words) |> sort; # convert set to array, and sort it

    # append the control tokens to the words array
    words_array = vcat(words_array, control_tokens);
    for word in words_array
        vocabulary[word] = index;
        inverse_vocabulary[index] = word;
        index += 1;
    end

    # return the vocabulary and inverse vocabulary
    (vocabulary, inverse_vocabulary)
end

(Dict("!" => 1, "is" => 17, "enjoy" => 11, "data" => 10, "language" => 19, "coding" => 9, "science" => 25, "<bos>" => 27, "a" => 5, "and" => 7…), Dict(5 => "a", 16 => "intelligence", 20 => "learning", 12 => "fun", 24 => "programming", 28 => "<eos>", 8 => "artificial", 17 => "is", 30 => "<unk>", 1 => "!"…))

What's in the `vocabulary::Dict{String, Int64}` and `inverse_vocabulary::Dict{Int64, String}` variables? Let's take a look!

In [4]:
vocabulary

Dict{String, Int64} with 30 entries:
  "!"            => 1
  "is"           => 17
  "enjoy"        => 11
  "data"         => 10
  "language"     => 19
  "coding"       => 9
  "science"      => 25
  "<bos>"        => 27
  "a"            => 5
  "and"          => 7
  ","            => 2
  "programming"  => 24
  "love"         => 21
  "?"            => 4
  "."            => 3
  "in"           => 15
  "i"            => 14
  "<unk>"        => 30
  "about"        => 6
  "<pad>"        => 29
  "machine"      => 22
  "artificial"   => 8
  "learning"     => 20
  "intelligence" => 16
  "great"        => 13
  ⋮              => ⋮

What can we do with these vocabulary models? One thing we can do is take a sentence, and convert it into an index vector using the vocabulary model. Let's check that out!

In [5]:
let

    # initialize -
    i = 1; # select a sentence index to inspect
    sentence = sentences[i]; # get the sentence

    augmented_sentence = "<bos> " * sentence * " <eos>";
    words = split(lowercase(augmented_sentence)) .|> String;
    word_indices = [get(vocabulary, word, vocabulary["<unk>"]) for word in words]; # wow! nice one-liner to get indices with <unk> fallback

    @show sentence;
    @show words;
    @show word_indices;
end;

sentence = "I love machine learning and data science ."
words = ["<bos>", "i", "love", "machine", "learning", "and", "data", "science", ".", "<eos>"]
word_indices = [27, 14, 21, 22, 20, 7, 10, 25, 3, 28]
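
The inverse vocabulary lets us go the other way, from indices back to tokens. Here is a minimal, self-contained sketch of that round trip. It uses a small hypothetical vocabulary (`toy_vocabulary`) and two illustrative helpers (`encode`, `decode`) that are not part of the notebook:

```julia
# hypothetical toy vocabulary (not the one constructed above)
toy_vocabulary = Dict("<bos>" => 1, "<eos>" => 2, "<unk>" => 3, "julia" => 4, "is" => 5, "fun" => 6);
toy_inverse_vocabulary = Dict(index => word for (word, index) in toy_vocabulary);

# encode: tokens -> indices, falling back to <unk> for out-of-vocabulary words
encode(tokens) = [get(toy_vocabulary, token, toy_vocabulary["<unk>"]) for token in tokens];

# decode: indices -> tokens
decode(indices) = [toy_inverse_vocabulary[index] for index in indices];

tokens = split(lowercase("<bos> Julia is awesome <eos>")) .|> String;
indices = encode(tokens);   # "awesome" is not in the toy vocabulary
decoded = decode(indices);  # so it comes back as "<unk>"
@show indices;
@show decoded;
```

The round trip is lossy for unknown words: once a token has been mapped to `<unk>`, the original word cannot be recovered.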


### Helper Implementations
We implement a helper function `hashing_vectorizer` to convert a list of features (words) into a fixed-length vector using the hashing trick. This function maps each feature to an index using a hash function and increments the count at that index.

In [6]:
function hashing_vectorizer(features::Array{String,1}; length::Int64 = 10)::Array{Int64,1}

    # initialize -
    new_hash_vector = zeros(Int,length);
    for i ∈ eachindex(features)
        feature = features[i]; # get feature
        j = hash(feature) |> h-> mod1(h, length); # this gives us an index in 1:length (Julia's 1-based indexing)
        new_hash_vector[j] += 1;
    end
   
    new_hash_vector; # return
end

hashing_vectorizer (generic function with 1 method)

___

## Task 1: Bag of Words Representations
In this task, we'll create a Bag of Words representation for our corpus of sentences. 

> __Bag of Words Model__
>
> The Bag of Words (BoW) model is a technique for text embedding. As the name suggests, we represent a text (such as a sentence or a document) as a "bag" (multiset) of its words, disregarding grammar and word order but keeping multiplicity. Given a vocabulary of unique words, each text is represented as a vector where each dimension corresponds to a word in the vocabulary, and the value in that dimension indicates the frequency of that word in the text.

Let's compute the Bag of Words representation for each sentence in our corpus and store the results in the `bow_matrix::Array{Int64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.

In [7]:
bow_matrix = let

    # initialize -
    num_sentences = length(sentences); # number of sentences
    vocab_size = length(vocabulary); # size of the vocabulary
    bow_matrix = zeros(Int64, num_sentences, vocab_size); # initialize the Bag of Words matrix

    # populate the Bag of Words matrix -
    for (i, sentence) in enumerate(sentences)

        # add the <BOS> ... <EOS> token wrappers
        augmented_sentence = "<bos> " * sentence * " <eos>";
        words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
        for word in words
            if haskey(vocabulary, word)
                index = vocabulary[word];
                bow_matrix[i, index] += 1; # increment the count for the word
            else
                unk_index = vocabulary["<unk>"];
                bow_matrix[i, unk_index] += 1; # increment the count for unknown words
            end
        end
    end

    # return the Bag of Words matrix
    bow_matrix
end

6×30 Matrix{Int64}:
 0  0  1  0  0  0  1  0  0  1  0  0  0  …  0  1  1  1  0  0  1  0  1  1  0  0
 0  0  1  0  0  0  0  0  0  0  0  1  0     0  1  0  1  0  0  0  0  1  1  0  0
 0  0  1  0  0  0  0  0  0  0  0  0  1     0  1  0  1  0  0  0  0  1  1  0  0
 1  0  0  0  0  0  0  0  1  0  0  0  0     0  1  1  1  0  0  0  0  1  1  0  0
 0  0  0  1  1  0  0  0  0  0  0  0  1     1  0  0  0  0  1  0  0  1  1  0  0
 0  2  1  0  0  1  1  1  0  1  1  0  0  …  0  2  0  1  1  0  1  1  1  1  0  0
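
One consequence of the "bag" view is that word order is discarded. The sketch below illustrates this with a hypothetical three-word vocabulary and an illustrative helper `toy_bow` (neither is part of the notebook): two sentences with the same words in a different order get identical vectors.

```julia
# count tokens into a fixed-length vector using a (hypothetical) toy vocabulary
function toy_bow(sentence::String, vocabulary::Dict{String,Int})::Vector{Int}
    counts = zeros(Int, length(vocabulary));
    for token in String.(split(lowercase(sentence)))
        if haskey(vocabulary, token)
            counts[vocabulary[token]] += 1; # increment the count for this word
        end
    end
    return counts;
end

toy_vocabulary = Dict("dog" => 1, "bites" => 2, "man" => 3);
a = toy_bow("Dog bites man", toy_vocabulary);
b = toy_bow("Man bites dog", toy_vocabulary);
@show a;
@show b;
@show a == b; # same bag of words, very different meaning
```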

### What is wrong with this representation? 
There are some issues with the Bag of Words representation.

> __So what's the problem?__
>
> We had text, now we have numerical vectors. Great! Why are you complaining?
> 
> * __Sparsity and dimension:__ The counts require us to maintain a vocabulary dictionary that maps words to indices in our vectors. With a very large vocabulary, each vector has one dimension per vocabulary word, yet most of its entries are zero. This is called a "sparse" representation, and it can be inefficient in terms of both storage and computation. 
> 
> * __Order:__ We also have to impose some order on our vocabulary (e.g., alphabetical order) to ensure that the indices are consistent across different texts. This can be cumbersome, especially when dealing with large and dynamic vocabularies.
> 
> Let's consider an alternative representation. 

Instead of maintaining a vocabulary dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying [a hash function `h(...)`](https://docs.julialang.org/en/v1/base/base/#Base.hash) to the features (e.g., words). The vectorizer uses the resulting hash values, reduced modulo the vector length, directly as feature indices, incrementing the counts in the output vector at those indices. 

> __Hash function?__
> 
> A hash function is a function that takes an input (or 'message') and returns a fixed-size string of bytes. The output, typically a hash code or hash value, is deterministic (same input always produces the same output) but not necessarily unique (different inputs can produce the same hash, called a collision). Hash functions are commonly used in computer science for tasks such as data retrieval, cryptography, and data integrity verification.

We've implemented a simple version of a hashing vectorizer in the `hashing_vectorizer(...)` function that uses [Julia's built-in `hash(...)` function](https://docs.julialang.org/en/v1/base/base/#Base.hash). We pass in a list of words (from a sentence) and the desired length of the output vector, and the function returns a vector of the specified length with counts of the hashed words.

We save this output in the `vectorized_sentence::Array{Int64,1}` variable. Let's take a look!

In [8]:
vectorized_sentence = let

    # initialize -
    i = 1; # index of sentence to hash
    sentence = sentences[i];
    augmented_sentence = "<bos> " * sentence * " <eos>";
    words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
    tv = hashing_vectorizer(words, length = length(vocabulary));
end

30-element Vector{Int64}:
 0
 1
 1
 2
 0
 0
 1
 1
 0
 0
 0
 1
 1
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

You can change the sentence index `i` in the code block above to see how different sentences are represented using this hashing vectorizer. Further, you can modify the `length` parameter to see how the size of the output vector affects the representation.

Play around with different sentences and vector lengths to see how the hashing vectorizer performs! As an example, the next cell hashes sentence 2 into a vector of length 10. We'll save this output in the `alternative_vectorized_sentence::Array{Int64,1}` variable. Let's take a look!

In [9]:
alternative_vectorized_sentence = let

    # initialize -
    i = 2; # index of sentence to hash
    sentence = sentences[i];
    desired_sentence_length = 10; # desired length of the output vector
    augmented_sentence = "<bos> " * sentence * " <eos>";
    words = split(lowercase(augmented_sentence)) .|> String; # convert to lowercase and split by whitespace
    tv = hashing_vectorizer(words, length = desired_sentence_length);
end

10-element Vector{Int64}:
 0
 0
 1
 2
 0
 0
 1
 2
 1
 0

### Hash Collisions
When using the hashing trick, multiple words may map to the same index in the output vector. This is called a **collision**.

> __What happens with collisions?__
>
> *   **Information loss**: When two different words map to the same index, their counts are combined, and the model cannot distinguish between them.
> *   **Dimensionality trade-off**: Smaller vector lengths increase the probability of collisions but reduce memory usage. Larger vector lengths reduce collisions but increase sparsity and memory usage.
> *   **Mitigation**: In practice, we choose a vector length large enough (e.g., $2^{18}$ or $2^{20}$) to minimize collisions for the given vocabulary size.
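
To make the trade-off concrete, here is a small sketch that hashes the same eight distinct words into vectors of different lengths and counts how many buckets end up occupied. The helper `toy_hashing_vectorizer` is an illustrative re-declaration of the idea behind `hashing_vectorizer(...)` so that the block is self-contained, and the word list is arbitrary. The exact numbers depend on the hash values, which are not guaranteed to be stable across Julia versions or sessions, so no specific output is shown.

```julia
# minimal hashing vectorizer (illustrative; mirrors the idea of hashing_vectorizer above)
function toy_hashing_vectorizer(features::Vector{String}, n::Int)::Vector{Int}
    counts = zeros(Int, n);
    for feature in features
        j = Int(mod1(hash(feature), UInt(n))); # bucket index in 1:n
        counts[j] += 1;
    end
    return counts;
end

words = ["machine", "learning", "is", "fun", "julia", "great", "data", "science"]; # 8 distinct words
for n in (4, 16, 1024)
    v = toy_hashing_vectorizer(words, n);
    occupied = count(>(0), v);            # buckets that received at least one word
    collided = length(words) - occupied;  # words that had to share a bucket
    println("length = $n: occupied buckets = $occupied, words sharing a bucket = $collided");
end
```

With a length of 4, at least four of the eight words must share a bucket (there are more words than buckets). As the length grows, collisions become less likely, at the cost of a longer and sparser vector.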

___

## Task 2: TF-IDF Representations
In this task, we'll compute the Term Frequency-Inverse Document Frequency (TF-IDF) representation for our corpus of sentences. The TF-IDF score for a term $t$ in a document $d$ is given by the product of two terms:
$$
\boxed{
\begin{align*}
\text{TF-IDF}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t, \mathcal{D})
\end{align*}}
$$

The __Term Frequency__ ($\text{tf}$) is the raw count of term $t$ in document $d$, often normalized by the total number of words in $d$. The __Inverse Document Frequency__ ($\text{idf}$) measures how rare the term is across the corpus: a term that appears in only a few documents receives a high IDF. It is calculated as:
$$ \text{idf}(t, \mathcal{D}) = \ln \left( \frac{N}{|\{d \in \mathcal{D} : t \in d\}|} \right) $$
where $N$ is the total number of documents in the corpus $\mathcal{D}$, and the denominator is the number of documents where the term $t$ appears. In practice (especially for small corpora), we often use a smoothed IDF to avoid division-by-zero and to reduce extreme values; we'll use that smoothed form below.

First, let's compute the TF values for each term in our vocabulary for each sentence in the corpus. We'll store these values in the `TF_matrix::Array{Float64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.


In [10]:
TF_matrix = let

    # initialize -
    num_sentences = size(bow_matrix, 1); # number of sentences
    vocab_size = size(bow_matrix, 2); # size of the vocabulary
    TF_matrix = zeros(Float64, num_sentences, vocab_size); # initialize the TF matrix

    # populate the TF matrix -
    for i in 1:num_sentences
        total_terms = sum(bow_matrix[i, :]);
        if total_terms == 0
            continue
        end
        for j in 1:vocab_size
            TF_matrix[i, j] = bow_matrix[i, j] / total_terms;
        end
    end

    # return the TF matrix
    TF_matrix
end

6×30 Matrix{Float64}:
 0.0  0.0       0.1        0.0       …  0.1        0.1        0.0  0.0
 0.0  0.0       0.142857   0.0          0.142857   0.142857   0.0  0.0
 0.0  0.0       0.142857   0.0          0.142857   0.142857   0.0  0.0
 0.1  0.0       0.0        0.0          0.1        0.1        0.0  0.0
 0.0  0.0       0.0        0.111111     0.111111   0.111111   0.0  0.0
 0.0  0.111111  0.0555556  0.0       …  0.0555556  0.0555556  0.0  0.0

Next, let's compute the IDF values for each term in our vocabulary across the entire corpus. We'll store these values in the `IDF_values_dictionary::Dict{String, Float64}` variable, where the keys are the unique words and the values are their corresponding IDF scores.

To keep IDF values finite in small corpora (and to avoid division-by-zero when a term appears in zero documents), we'll use a smoothed IDF:
$$
\boxed{
\begin{align*}
\text{idf}(t, \mathcal{D}) &= \ln \left( \frac{N + 1}{|\{d \in \mathcal{D} : t \in d\}| + 1} \right)
\end{align*}}
$$
where $N$ is the total number of documents in the corpus $\mathcal{D}$.


In [11]:
IDF_values_dictionary = let

    # initialize -
    num_sentences = length(sentences); # number of sentences
    IDF_values_dictionary = Dict{String, Float64}(); # initialize dictionary of IDF values

    # compute smoothed IDF for each term in the vocabulary -
    for (word, index) in vocabulary
        doc_frequency = sum(bow_matrix[:, index] .> 0);
        IDF_values_dictionary[word] = log((num_sentences + 1) / (doc_frequency + 1));
    end

    # return dictionary
    IDF_values_dictionary
end


Dict{String, Float64} with 30 entries:
  "!"            => 1.25276
  "is"           => 0.559616
  "enjoy"        => 1.25276
  "data"         => 0.847298
  "language"     => 1.25276
  "coding"       => 1.25276
  "science"      => 0.847298
  "<bos>"        => 0.0
  "a"            => 1.25276
  "and"          => 0.847298
  ","            => 1.25276
  "programming"  => 1.25276
  "love"         => 0.847298
  "?"            => 1.25276
  "."            => 0.336472
  "in"           => 1.25276
  "i"            => 0.559616
  "<unk>"        => 1.94591
  "about"        => 1.25276
  "<pad>"        => 1.94591
  "machine"      => 0.154151
  "artificial"   => 1.25276
  "learning"     => 0.154151
  "intelligence" => 1.25276
  "great"        => 0.847298
  ⋮              => ⋮
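
To see what the smoothing changes, here is a small sketch that evaluates both IDF formulas for $N = 6$ documents and a few document frequencies. The helper names `idf_unsmoothed` and `idf_smoothed` are introduced only for illustration.

```julia
N = 6; # number of documents (sentences) in the corpus above

idf_unsmoothed(df) = log(N / df);             # ln(N / df); infinite when df = 0
idf_smoothed(df) = log((N + 1) / (df + 1));   # ln((N + 1) / (df + 1)); always finite

for df in (0, 3, 5, 6)
    println("df = $df: unsmoothed = $(idf_unsmoothed(df)), smoothed = $(round(idf_smoothed(df), digits = 6))");
end
```

The smoothed values for `df = 0, 3, 5, 6` agree with the dictionary entries above for `<unk>`, `is`, `machine`, and `<bos>`, respectively, while the unsmoothed formula returns `Inf` for a term that appears in no document.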

Finally, we can compute the TF-IDF representation for each sentence in our corpus by multiplying the TF values with the corresponding IDF values. We'll store the resulting TF-IDF representations in the `TFIDF_matrix::Array{Float64, 2}` variable, where each row corresponds to a sentence and each column corresponds to a word in the vocabulary.

In [12]:
TFIDF_matrix = let

    # initialize -
    num_sentences = size(bow_matrix, 1); # number of sentences
    vocab_size = size(bow_matrix, 2); # size of the vocabulary
    TFIDF_matrix = zeros(Float64, num_sentences, vocab_size); # initialize the TF-IDF matrix

    # populate the TF-IDF matrix -
    for i in 1:num_sentences
        for j in 1:vocab_size
            word = inverse_vocabulary[j];
            idf_value = IDF_values_dictionary[word];
            TFIDF_matrix[i, j] = TF_matrix[i, j] * idf_value;
        end
    end

    # return the TF-IDF matrix
    TFIDF_matrix
end

6×30 Matrix{Float64}:
 0.0       0.0       0.0336472  0.0       …  0.0        0.0  0.0  0.0  0.0
 0.0       0.0       0.0480675  0.0          0.0        0.0  0.0  0.0  0.0
 0.0       0.0       0.0480675  0.0          0.0        0.0  0.0  0.0  0.0
 0.125276  0.0       0.0        0.0          0.0        0.0  0.0  0.0  0.0
 0.0       0.0       0.0        0.139196     0.0        0.0  0.0  0.0  0.0
 0.0       0.139196  0.0186929  0.0       …  0.0695979  0.0  0.0  0.0  0.0
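
As a hand check of one entry, using only values printed above: sentence 1 has 10 tokens once `<bos>` and `<eos>` are added, and the token `.` (column 3) appears once in it and in 4 of the 6 sentences. So:
$$
\begin{align*}
\text{tf}(\texttt{.}, d_1) &= \frac{1}{10} = 0.1 \\
\text{idf}(\texttt{.}, \mathcal{D}) &= \ln \left( \frac{6 + 1}{4 + 1} \right) \approx 0.336472 \\
\text{TF-IDF}(\texttt{.}, d_1) &= 0.1 \cdot 0.336472 \approx 0.0336472
\end{align*}
$$
which matches the entry in the first row, third column of the matrix.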

So what does the TF-IDF representation tell us about our sentences? Let's compute the similarity between the TF-IDF vectors of two sentences to see how similar they are in terms of their content. 

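> __Test:__ Let's compute the cosine similarity between the vectors (TF or TF-IDF) of sentence 2 and sentence 3, which share several words, and compare it with the similarity between sentence 1 and sentence 5, which share no words. Vectors 2 and 3 should be more similar than vectors 1 and 5. Note that every sentence contains the `<bos>` and `<eos>` control tokens, so the TF vectors of sentences 1 and 5 still overlap on those two columns; in the TF-IDF vectors these tokens carry zero weight because their IDF is zero.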

What do we get?

In [13]:
let

    # initialize -
    i = 2; # index of first sentence to inspect
    j = 3; # index of second sentence to inspect
    D = TFIDF_matrix; # you can use TF-IDF or TF matrix for this example

    # get the TF-IDF vectors for the two sentences
    vᵢ = D[i, :];
    vⱼ = D[j, :];

    dot_product = dot(vᵢ, vⱼ);
    magnitude_vᵢ = norm(vᵢ);
    magnitude_vⱼ = norm(vⱼ);
    
    if magnitude_vᵢ == 0 || magnitude_vⱼ == 0
        return 0.0 # Return 0 for vectors with zero magnitude
    end
    
    value = dot_product / (magnitude_vᵢ * magnitude_vⱼ);
    println("Cosine similarity between sentence $i and sentence $j: $value");
end

Cosine similarity between sentence 2 and sentence 3: 0.3036826972774764
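
If you want to compare several sentence pairs, it can be convenient to wrap the computation in a function. The sketch below is one possible version. The name `cosine_similarity` is ours, and we load `LinearAlgebra` explicitly here on the assumption that the notebook gets `dot` and `norm` from `Include.jl`.

```julia
using LinearAlgebra # provides dot and norm

# cosine similarity with a guard for zero-magnitude vectors
function cosine_similarity(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})::Float64
    denominator = norm(a) * norm(b);
    return denominator == 0 ? 0.0 : dot(a, b) / denominator;
end

@show cosine_similarity([1.0, 0.0], [0.0, 1.0]); # orthogonal vectors
@show cosine_similarity([1.0, 2.0], [2.0, 4.0]); # parallel vectors
@show cosine_similarity([0.0, 0.0], [1.0, 1.0]); # zero vector, handled by the guard
```

With the notebook's matrices in scope, `cosine_similarity(TFIDF_matrix[1, :], TFIDF_matrix[5, :])` would give the comparison for sentences 1 and 5.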


___

## Task 3: PMI Representations
In this task, we'll compute a Pointwise Mutual Information (PMI) representation for our corpus using word co-occurrence counts within a fixed context window.

> __Pointwise Mutual Information (PMI)__
>
> PMI compares how often a word $w$ and a context word $c$ appear together relative to how often we would expect them to co-occur if they were independent.

The PMI between a word $w$ and a context word $c$ is defined as:
$$
\boxed{
\begin{align*}
\text{PMI}(w, c) &= \log_2 \frac{P(w, c)}{P(w)P(c)}
\end{align*}}
$$
where $P(w, c)$ is the probability of observing $w$ and $c$ in the same window, and $P(w)$ and $P(c)$ are the corresponding marginal probabilities. We'll estimate these probabilities from word-context co-occurrence counts in the corpus.
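
Before working with the real corpus, a tiny numerical example may help. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```julia
# hypothetical counts (not taken from the corpus above)
total_pairs = 100;  # total number of word-context pairs observed
count_wc = 10;      # number of pairs where w and c co-occur
count_w = 20;       # number of pairs with w as the target
count_c = 25;       # number of pairs with c as the context

P_wc = count_wc / total_pairs; # 0.10
P_w = count_w / total_pairs;   # 0.20
P_c = count_c / total_pairs;   # 0.25

pmi = log2(P_wc / (P_w * P_c)); # the ratio is 0.10 / 0.05 = 2
ppmi = max(pmi, 0.0);           # PPMI clips negative PMI values to zero
@show pmi;
@show ppmi;
```

Here the pair co-occurs twice as often as independence would predict, so the PMI is about one bit. A ratio below 1 would give a negative PMI, which PPMI replaces with zero.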

First, let's build a word-context co-occurrence matrix using a window size of $m=2$ tokens on each side. We'll store these counts in the `cooccurrence_matrix::Array{Int64, 2}` variable, where rows correspond to target words and columns correspond to context words.

In [14]:
cooccurrence_matrix = let

    # initialize -
    window_size = 2; # number of tokens on each side
    vocab_size = length(vocabulary); # size of the vocabulary
    cooccurrence_matrix = zeros(Int64, vocab_size, vocab_size); # initialize the co-occurrence matrix

    # populate the co-occurrence matrix -
    for sentence in sentences
        augmented_sentence = "<bos> " * sentence * " <eos>";
        words = split(lowercase(augmented_sentence)) .|> String;
        word_indices = [get(vocabulary, word, vocabulary["<unk>"]) for word in words]; # wow! nice one-liner to get indices with <unk> fallback

        for i ∈ eachindex(word_indices)
            target_index = word_indices[i];
            left = max(1, i - window_size); # ensure we don't go out of bounds to the left
            right = min(length(word_indices), i + window_size); # ensure we don't go out of bounds to the right
            
            
            for j ∈ left:right
                if j == i
                    continue
                end
                context_index = word_indices[j];
                cooccurrence_matrix[target_index, context_index] += 1;
            end
        end
    end

    # return the co-occurrence matrix
    cooccurrence_matrix
end

30×30 Matrix{Int64}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  1  0  0
 0  0  0  0  0  0  1  1  0  1  0  0  0     0  2  0  2  0  0  1  0  0  0  0  0
 0  0  0  0  0  0  0  1  0  1  0  1  1     0  0  0  0  0  0  1  0  0  4  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     1  0  0  0  0  1  0  0  0  1  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  1     0  0  0  0  0  1  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  1  0  0  0  …  0  0  0  0  1  0  1  1  0  0  0  0
 0  1  0  0  0  0  0  1  0  1  0  0  0     0  2  0  1  0  0  1  0  0  0  0  0
 0  1  1  0  0  0  1  0  0  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0     0  1  1  1  0  0  0  0  0  0  0  0
 0  1  1  0  0  1  1  0  0  0  0  0  0     0  1  0  0  0  0  2  1  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  1  0  0  1  0  0  0  1  0  0  0
 0  0  1  0  0  0  0  0  0  0  0  0  0     0  1  0  0  0  0  0  0  0  1  0  0
 0  0  1  0  1  0  0  0  0  0  0  0  0  …
 ⋮

Next, we'll convert the co-occurrence counts to probabilities and compute PMI. We'll store the PMI values in the `PMI_matrix::Array{Float64, 2}` variable, and the Positive PMI values in the `PPMI_matrix::Array{Float64, 2}` variable by zeroing out negative entries.

In [15]:
PMI_matrix, PPMI_matrix = let

    # initialize -
    vocab_size = size(cooccurrence_matrix, 1); # size of the vocabulary
    total_pairs = sum(cooccurrence_matrix); # total number of word-context pairs observed

    # compute probabilities - all from the co-occurrence sample space
    P_wc = cooccurrence_matrix / total_pairs; # P(w,c) = joint probability of co-occurrence
    
    # Marginal probabilities from co-occurrence matrix
    # P(w) = sum over all contexts of P(w,c) = probability that word w appears as target
    # P(c) = sum over all targets of P(w,c) = probability that word c appears as context
    P_w = vec(sum(P_wc, dims=2)); # sum across columns (all contexts for each target word)
    P_c = vec(sum(P_wc, dims=1)); # sum across rows (all targets for each context word)

    # compute PMI -
    PMI_matrix = fill(-Inf, vocab_size, vocab_size); # initialize PMI matrix
    for i in 1:vocab_size
        for j in 1:vocab_size
            p_wc = P_wc[i, j];
            p_w = P_w[i];
            p_c = P_c[j];
            if p_wc == 0 || p_w == 0 || p_c == 0
                continue
            end
            PMI_matrix[i, j] = log2(p_wc / (p_w * p_c));
        end
    end

    # compute PPMI -
    PPMI_matrix = max.(PMI_matrix, 0.0);

    # return PMI and PPMI matrices
    (PMI_matrix, PPMI_matrix)
end;
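
The nested loops above can also be written with broadcasting. The following is a sketch of that alternative on a hypothetical 2×2 count matrix `C`, not a replacement for the cell above. It adds an explicit guard because a word with no co-occurrences at all (such as `<pad>` in our vocabulary) would otherwise produce `0/0 = NaN`.

```julia
# hypothetical 2×2 co-occurrence counts (rows = target words, columns = context words)
C = [0 2; 2 1];

P = C / sum(C);          # joint probabilities P(w, c)
P_w = sum(P, dims = 2);  # 2×1 column of target marginals P(w)
P_c = sum(P, dims = 1);  # 1×2 row of context marginals P(c)

PMI = log2.(P ./ (P_w * P_c)); # P_w * P_c is the outer product P(w)P(c); log2(0.0) is -Inf
PMI[isnan.(PMI)] .= -Inf;      # guard: 0/0 entries (all-zero rows or columns) count as "never co-occur"
PPMI = max.(PMI, 0.0);         # clip negative values (and -Inf) to zero
@show PPMI;
```

In this toy matrix, only the off-diagonal pairs co-occur more often than independence predicts, so they are the only nonzero PPMI entries.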

Now we can inspect the PMI or PPMI matrices to see which word pairs co-occur more than expected in this small corpus.

In [16]:
PPMI_matrix

30×30 Matrix{Float64}:
 0.0      0.0      0.0      0.0      0.0      …  0.0       2.53051  0.0  0.0
 0.0      0.0      0.0      0.0      0.0         0.0       0.0      0.0  0.0
 0.0      0.0      0.0      0.0      0.0         0.0       2.53051  0.0  0.0
 0.0      0.0      0.0      0.0      0.0         0.0       2.53051  0.0  0.0
 0.0      0.0      0.0      0.0      0.0         0.0       0.0      0.0  0.0
 0.0      0.0      0.0      0.0      0.0      …  0.0       0.0      0.0  0.0
 0.0      1.70044  0.0      0.0      0.0         0.0       0.0      0.0  0.0
 0.0      2.70044  2.11548  0.0      0.0         0.0       0.0      0.0  0.0
 0.0      0.0      0.0      0.0      0.0         0.0       0.0      0.0  0.0
 0.0      1.70044  1.11548  0.0      0.0         0.0       0.0      0.0  0.0
 0.0      0.0      0.0      0.0      0.0      …  2.11548   0.0      0.0  0.0
 0.0      0.0      2.11548  0.0      0.0         0.0       2.11548  0.0  0.0
 0.0      0.0      1.11548  0.0      2.70044     0.0  …
 ⋮

### Understanding Window Size and Co-occurrence
For our PMI computation, two words "co-occur" only if they appear within the specified window of each other, not merely in the same sentence.

> __Important:__ With `window_size = 2`, a word only "sees" the 2 tokens immediately before and 2 tokens immediately after it. Words that are more than 2 positions apart do NOT co-occur, even if they're in the same sentence.
>
> For example, in "I love coding machine learning in Julia", the words "love" and "julia" are 5 positions apart, so they don't co-occur with `window_size = 2`.

Let's test this by comparing two cases:

In [19]:
let
    # Example 1: Words that are FAR apart in a sentence (window_size = 2)
    word_1 = "love" |> lowercase;
    word_2 = "julia" |> lowercase;
    index_1 = vocabulary[word_1];
    index_2 = vocabulary[word_2];
    
    # Check the co-occurrence count
    cooccurrence_count = cooccurrence_matrix[index_1, index_2];
    ppmi_value = PPMI_matrix[index_1, index_2];
    
    println("Example 1: \"$word_1\" and \"$word_2\":");
    println("  Co-occurrence count: $cooccurrence_count");
    println("  PPMI value: $ppmi_value");
    println("  Note: These words appear in sentence 4 but are 5 positions apart (beyond window_size=2)\n");
    
    # Example 2: Words that ARE within the window
    word_3 = "machine" |> lowercase;
    word_4 = "learning" |> lowercase;
    index_3 = vocabulary[word_3];
    index_4 = vocabulary[word_4];
    
    cooccurrence_count_2 = cooccurrence_matrix[index_3, index_4];
    ppmi_value_2 = PPMI_matrix[index_3, index_4];
    
    println("Example 2: \"$word_3\" and \"$word_4\":");
    println("  Co-occurrence count: $cooccurrence_count_2");
    println("  PPMI value: $ppmi_value_2");
    println("  Note: These words appear together frequently within window_size=2");
end

Example 1: "love" and "julia":
  Co-occurrence count: 0
  PPMI value: 0.0
  Note: These words appear in sentence 4 but are 5 positions apart (beyond window_size=2)

Example 2: "machine" and "learning":
  Co-occurrence count: 5
  PPMI value: 1.267480310864986
  Note: These words appear together frequently within window_size=2
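
To see how the window size changes this picture, the sketch below counts co-occurrences for the single sentence from the example at two window sizes. The helper `window_cooccurrence` is illustrative and not part of the notebook. It stores counts in a dictionary keyed by `(target, context)` pairs instead of a matrix, and it omits the `<bos>` and `<eos>` tokens for brevity.

```julia
# count co-occurrences within a symmetric window; keys are (target, context) pairs
function window_cooccurrence(tokens::Vector{String}, window_size::Int)::Dict{Tuple{String,String},Int}
    counts = Dict{Tuple{String,String},Int}();
    for i in eachindex(tokens)
        left = max(1, i - window_size);                # stay in bounds on the left
        right = min(length(tokens), i + window_size);  # stay in bounds on the right
        for j in left:right
            j == i && continue; # skip the target itself
            key = (tokens[i], tokens[j]);
            counts[key] = get(counts, key, 0) + 1;
        end
    end
    return counts;
end

tokens = String.(split(lowercase("I love coding machine learning in Julia")));
for m in (2, 5)
    pair_counts = window_cooccurrence(tokens, m);
    love_julia = get(pair_counts, ("love", "julia"), 0);
    machine_learning = get(pair_counts, ("machine", "learning"), 0);
    println("window_size = $m: love-julia = $love_julia, machine-learning = $machine_learning");
end
```

With `window_size = 2`, the pair `("love", "julia")` has a count of 0. Widening the window to 5 brings it to 1, while the adjacent pair `("machine", "learning")` is counted once in both cases. Because the window is symmetric, the count for `(a, b)` always equals the count for `(b, a)`.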


___

## Summary
In this example, we implemented and explored three fundamental text embedding techniques: Bag of Words, TF-IDF, and PMI.

> __Key Takeaways:__
> 
> * **Bag of Words limitations:** BoW provides a simple frequency-based representation but suffers from sparsity, high dimensionality, and a lack of context or semantic meaning.
> * **TF-IDF re-weighting:** TF-IDF improves upon raw counts by penalizing common words and emphasizing terms that are distinctive to specific documents.
> * **PMI for associations:** PMI and PPMI quantify the statistical association between words based on co-occurrence, capturing semantic relationships that simple counts miss.

These methods form the foundation for more advanced natural language processing tasks.
___