# L9a: Text Embeddings: Bag of Words, TF-IDF and Pointwise Mutual Information (PMI)
In this lecture, we cover methods for representing text as vectors of numbers, a prerequisite for most natural language processing (NLP) tasks. We cover three count-based methods, Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Pointwise Mutual Information (PMI), and one prediction-based method, Continuous Bag of Words (CBOW). These methods convert text into numerical inputs for machine learning models.

> __Learning Objectives:__
> 
> By the end of this lecture, you should be able to:
> 
> * __Bag of Words representation:__ Explain how BoW converts text into count vectors with zeros for most entries and identify limitations in dimensionality, context, and semantics.
> * __TF-IDF and PMI weighting:__ Compute TF-IDF scores to re-weight terms and use PMI to measure word association from co-occurrence probabilities.
> * __CBOW model and embeddings:__ Outline the CBOW architecture, training objective with cross-entropy loss, and how to extract word embeddings from the hidden layer.

Let's get started!
___


## Concept: Text Embedding Models
The goal of [embedding models](https://en.wikipedia.org/wiki/Word_embedding) is to represent language sequences (characters, words, documents, etc.) in a vector space so that words with similar usage are near each other.

Before we discuss the details of these models, let's introduce some concepts, terminology, and notation that will be used throughout this lecture.

### Vocabulary, Tokens, and Tokenization
Let $\mathcal{V}$ be the vocabulary of tokens (characters, sub-words, whole words, documents, etc.) in our [corpus](https://en.wikipedia.org/wiki/Corpus), and let $N_{\mathcal{V}} = |\mathcal{V}|$ be the vocabulary size. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ with $x_i\in\mathcal{V}$ be a sequence of tokens in the corpus, i.e., a sentence or document, where $n$ is the length of the sequence and $x_i$ is the $i$-th token in the sequence.

Let's consider an example: `My grandma makes the best apple pie.`

Tokens are the units of text that we will be working with. In this setting, tokens can be characters, sub-words, whole words, or documents. Converting a sequence of text into tokens is called _tokenization_.
* __Character-level tokenization__. Given the example above, one possible choice is to let the vocabulary $\mathcal{V}$ be the (English) alphabet plus whitespace and punctuation. Thus, we'd get a sequence $\mathbf{x}\in\mathcal{V}^{36}$ of length 36: `['M', 'y', ' ', ..., '.']`. Character-level tokenization keeps the vocabulary small but produces long sequences.
* __Word-level tokenization__. Another possible choice is to let the vocabulary $\mathcal{V}$ be the set of all words in the corpus (plus punctuation). Thus, we'd get a sequence $\mathbf{x}\in\mathcal{V}^{8}$ of length 8: `['My', 'grandma', 'makes', 'the', 'best', 'apple', 'pie', '.']`. Word-level tokenization produces short sequences but needs a large vocabulary and cannot represent words that were not seen during training (out-of-vocabulary words).
* __Sub-word tokenization__. A third possible choice is to let the vocabulary $\mathcal{V}$ be the set of word segments like `cious`, `ing`, `pre`. Words like `is` are often a separate token, and single characters are included in the vocabulary $\mathcal{V}$ to ensure all words are expressible.
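
The three tokenization schemes above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer. The regular expression used for word-level splitting, the `toy_subwords` vocabulary, and the helper `subword_tokenize` are all assumptions made for this example. The greedy longest-match rule stands in for learned sub-word schemes, which build their vocabulary from corpus statistics.

```python
import re

text = "My grandma makes the best apple pie."

# Character-level: every character (letters, spaces, punctuation) is a token.
char_tokens = list(text)

# Word-level: runs of word characters are tokens; each punctuation mark is its own token.
word_tokens = re.findall(r"\w+|[^\w\s]", text)


def subword_tokenize(word, vocab):
    """Greedy longest-match segmentation with single characters as the fallback."""
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first and shrink until a vocabulary entry matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


# Hypothetical sub-word vocabulary, chosen by hand for this example.
toy_subwords = {"grand", "ma", "mak", "es"}

print(len(char_tokens))                            # 36
print(word_tokens)                                 # ['My', 'grandma', ..., 'pie', '.']
print(subword_tokenize("grandma", toy_subwords))   # ['grand', 'ma']
print(subword_tokenize("pie", toy_subwords))       # ['p', 'i', 'e']
```

The last call shows why single characters are kept in the vocabulary: a word with no matching segments is still expressible, one character at a time.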

Given a choice of tokenization/vocabulary, each vocabulary element is assigned an index $i\in\left\{1, 2,\dots,N_{\mathcal{V}},N_{\mathcal{V}}+1,N_{\mathcal{V}}+2,N_{\mathcal{V}}+3,\dots\right\}$ where we've added several control tokens to the vocabulary. For example (there can be more control tokens, depending on the application):
* $\texttt{mask} \rightarrow N_{\mathcal{V}} + 1$: the `mask` token that is used to mask out a token in the input sequence. This is used in training to predict the masked word.
* $\texttt{bos} \rightarrow N_{\mathcal{V}} + 2$: the beginning of the sequence (bos) token is used to indicate the start of a sequence.
* $\texttt{eos} \rightarrow N_{\mathcal{V}} + 3$: the end of sequence (eos) token is used to indicate the end of a sequence.

A piece of text of length $n$ is represented as a sequence of indices (called token IDs) $i_{1}, i_{2}, \ldots, i_{n}$ corresponding to its (sub)words, preceded by the $\texttt{bos}$-token and followed by the $\texttt{eos}$-token. Alternatively, the text can be represented as a sequence of one-hot encoded vectors $\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}$, where each vector $\mathbf{x}_{k}\in\{0,1\}^{N_{\mathcal{V}}}$ has a 1 in the position given by the token ID $i_{k}$ and 0s elsewhere (the dimension grows to $N_{\mathcal{V}}+3$ if the three control tokens above are one-hot encoded as well).
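
To make the mapping from tokens to token IDs and one-hot vectors concrete, here is a small sketch. The helper names `encode` and `one_hot`, the control-token spellings `<mask>`, `<bos>`, `<eos>`, and the choice to size the one-hot vectors so that they also cover the control tokens are assumptions made for this example.

```python
# Toy word-level vocabulary, sorted alphabetically; IDs are 1-based as in the text.
vocab = sorted({"my", "grandma", "makes", "the", "best", "apple", "pie"})
token_to_id = {tok: i for i, tok in enumerate(vocab, start=1)}
n_vocab = len(vocab)

# Control tokens are appended after the regular vocabulary entries.
token_to_id.update({"<mask>": n_vocab + 1, "<bos>": n_vocab + 2, "<eos>": n_vocab + 3})


def encode(tokens):
    """Map tokens to IDs, wrapped in the bos and eos control tokens."""
    return [token_to_id["<bos>"]] + [token_to_id[t] for t in tokens] + [token_to_id["<eos>"]]


def one_hot(token_id, size):
    """Return a 0/1 list with a single 1 at the (1-based) position token_id."""
    vec = [0] * size
    vec[token_id - 1] = 1
    return vec


ids = encode(["my", "grandma", "makes", "the", "pie"])
print(ids)                                   # [9, 5, 3, 4, 7, 6, 10]
print(one_hot(ids[1], len(token_to_id)))     # 1 in position 5 ('my'), 0s elsewhere
```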

Now that we have introduced some terminology, let's start with Bag of Words (BoW).

___

## Bag of Words (BoW)
The Bag of Words (BoW) model is a technique for text embedding. As the name suggests, we represent a text (such as a sentence or a document) as a "bag" (multiset) of its words, disregarding grammar and word order but keeping multiplicity.

> __What is it?__
>
> The Bag of Words (BoW) model represents a document as a vector $\mathbf{x} \in \mathbb{R}^{N_{\mathcal{V}}}$, where each element $x_j$ corresponds to the number of times the $j$-th word in the vocabulary $\mathcal{V}$ appears in the document.

### Example
Consider the example vocabulary $\mathcal{V}$ derived from the sentence: `My grandma makes the best apple pie.`
Suppose our vocabulary is: $\mathcal{V} = \{\text{apple}, \text{best}, \text{grandma}, \text{makes}, \text{my}, \text{pie}, \text{the}\}$ (sorted alphabetically, $N_{\mathcal{V}}=7$).

If we have a document: `The grandma makes the pie.`
The BoW representation would be a vector of counts: $\mathbf{x} = [0, 0, 1, 1, 0, 1, 2]$
corresponding to `[apple, best, grandma, makes, my, pie, the]`. Note that `the` appears twice, so its corresponding entry is 2.
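
The count vector in this example can be reproduced with a short sketch. The helper `bag_of_words` is a name introduced here for illustration, and lowercasing plus dropping punctuation are preprocessing assumptions that the example relies on (otherwise `The` and `the` would be different tokens).

```python
import re
from collections import Counter

vocab = ["apple", "best", "grandma", "makes", "my", "pie", "the"]


def bag_of_words(document, vocab):
    """Count how often each vocabulary word occurs in the document."""
    tokens = re.findall(r"\w+", document.lower())   # lowercase, keep word characters only
    counts = Counter(tokens)
    return [counts[word] for word in vocab]         # Counter returns 0 for absent words


x = bag_of_words("The grandma makes the pie.", vocab)
print(x)  # [0, 0, 1, 1, 0, 1, 2]
```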

### Limitations
BoW has limitations:

> __What are the BoW model's limitations?__
>
> *   **Dimensionality and sparsity**: For vocabularies with more than 100k words, each feature vector has $N_{\mathcal{V}}$ dimensions, almost all of which are zero for any single document. This increases computational cost and exposes models to the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality), so they need more data to learn patterns.
> *   **Loss of sequence and context**: The BoW model discards word order and grammatical structure. For example, "The dog bit the man" and "The man bit the dog" have identical BoW representations despite having opposite meanings. This limits performance on tasks where context matters.
> *   **Lack of semantic meaning**: The model does not capture semantic similarity between words. In the vector space, `grandma` and `grandmother` are orthogonal (perpendicular) to each other, just as `grandma` and `truck` are. The model does not represent relationships in meaning between words.
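
Two of these limitations are easy to check numerically. The sketch below uses a small hand-picked vocabulary and the helper names `bag_of_words` and `dot`, all of which are assumptions made for this example.

```python
import re
from collections import Counter


def bag_of_words(document, vocab):
    counts = Counter(re.findall(r"\w+", document.lower()))
    return [counts[word] for word in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


vocab = ["bit", "dog", "grandma", "grandmother", "man", "the", "truck"]

# Loss of word order: both sentences map to the same count vector.
a = bag_of_words("The dog bit the man", vocab)
b = bag_of_words("The man bit the dog", vocab)
print(a == b)  # True

# Lack of semantics: any two distinct words have a zero inner product.
grandma = bag_of_words("grandma", vocab)
grandmother = bag_of_words("grandmother", vocab)
truck = bag_of_words("truck", vocab)
print(dot(grandma, grandmother), dot(grandma, truck))  # 0 0
```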

To reduce the influence of words that appear in many documents, we re-weight counts with TF-IDF.
___

## Term Frequency-Inverse Document Frequency (TF-IDF)
BoW counts treat words such as "the", "is", and "and" the same as other terms, so they can dominate counts even though they carry less topic information. TF-IDF re-weights counts to reduce the impact of common words.

> __What is it?__
>
> **Term Frequency-Inverse Document Frequency (TF-IDF)** is a statistic that re-weights word counts by penalizing terms that appear frequently across *all* documents.

The TF-IDF score for a term $t$ in a document $d$ is given by the product of two terms:
$$
\boxed{
\begin{align*}
\text{TF-IDF}(t, d) &= \text{tf}(t, d) \cdot \text{idf}(t, \mathcal{D})
\end{align*}}
$$

The __Term Frequency__ ($\text{tf}$) is the raw count of term $t$ in document $d$, often normalized by the total number of words in $d$. The __Inverse Document Frequency__ ($\text{idf}$) measures how much the term is tied to a subset of documents. It is calculated as:
$$
\text{idf}(t, \mathcal{D}) = \log \left( \frac{N}{|\{d \in \mathcal{D} : t \in d\}|} \right)
$$
where $N$ is the total number of documents in the corpus $\mathcal{D}$, and the denominator is the number of documents where the term $t$ appears.
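
The two factors can be computed directly from their definitions. The sketch below is one possible reading, under stated assumptions: the three-document toy corpus is made up for this example, $\text{tf}$ is the length-normalized count, the logarithm is the natural logarithm, and no smoothing is applied (so `idf` assumes the term occurs in at least one document). Library implementations often add smoothing terms and may therefore give different numbers.

```python
import math
import re

# Toy corpus, made up for this example.
corpus = [
    "My grandma makes the best apple pie.",
    "The grandma makes the pie.",
    "The dog bit the man.",
]
docs = [re.findall(r"\w+", d.lower()) for d in corpus]


def tf(term, doc):
    """Count of the term in the document, normalized by document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(N / number of documents containing the term); assumes that number is > 0."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf_idf("the", docs[1], docs))  # 0.0: 'the' occurs in every document
print(tf_idf("dog", docs[2], docs))  # positive: 'dog' occurs in one document only
```

Even though `the` is the most frequent word in the second document, its score is zero because it appears in all three documents.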

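> __Interpretation__
> 
> The TF-IDF score is large when term $t$ occurs often in document $d$ but in few documents of the corpus $\mathcal{D}$. A term that appears in every document has $\text{idf}(t, \mathcal{D}) = \log(N/N) = 0$, so its TF-IDF score is zero no matter how often it occurs in $d$.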


Next, we define PMI to measure word association from co-occurrence.


___

## Pointwise Mutual Information (PMI)
Pointwise Mutual Information (PMI) measures the association between two events (e.g., the occurrence of two words).

> __What is it?__
>
> **Pointwise Mutual Information (PMI)** compares the probability that a word $w$ and a context word $c$ co-occur in a context window with the probability of that co-occurrence if the two words occurred independently.

The PMI between a word $w$ and a context word $c$ is defined as:
$$
\boxed{
\begin{align*}
\text{PMI}(w, c) &= \log_2 \frac{P(w, c)}{P(w)P(c)}
\end{align*}}
$$
where $P(w, c)$ is the joint probability of word $w$ and context $c$ appearing together, and $P(w)$ and $P(c)$ are the marginal probabilities of $w$ and $c$. If $w$ and $c$ were independent, we would have $P(w, c) = P(w)P(c)$ and the PMI would be zero.

In practice, these probabilities are estimated from corpus statistics: $P(w, c)$ is the count of co-occurrences of $w$ and $c$ within a window size divided by the total number of word-context pairs in the corpus, $P(w)$ is the count of $w$ divided by the total number of words, and similarly for $P(c)$.

> **Interpretation**:
>
>*   **PMI > 0**: The words co-occur more often than expected by chance (association).
>*   **PMI $\approx$ 0**: The words are independent.
>*   **PMI < 0**: The words co-occur less often than expected (complementary distribution).

In practice, we often use **Positive PMI (PPMI)**, which replaces negative values with zero: $\text{PPMI}(w, c) = \max(\text{PMI}(w, c), 0)$.
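
The estimation recipe above can be sketched as follows. The toy token sequence, the window size of 1, and the helper names `pmi` and `ppmi` are assumptions made for this example. Following the text, the joint probability is estimated from word-context pair counts and the marginals from plain word counts. Pairs that never co-occur have $P(w, c) = 0$, so their PMI is $-\infty$, which PPMI maps to zero.

```python
import math
from collections import Counter

tokens = "the dog bit the man and the man bit the dog".split()
window = 1  # context = one token to the left and one to the right

word_counts = Counter(tokens)
pair_counts = Counter()
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            pair_counts[(w, tokens[j])] += 1

n_words = len(tokens)
n_pairs = sum(pair_counts.values())


def pmi(w, c):
    """log2 of P(w, c) / (P(w) P(c)), estimated from corpus counts."""
    p_wc = pair_counts[(w, c)] / n_pairs
    if p_wc == 0:
        return float("-inf")          # the pair was never observed
    p_w = word_counts[w] / n_words
    p_c = word_counts[c] / n_words
    return math.log2(p_wc / (p_w * p_c))


def ppmi(w, c):
    return max(pmi(w, c), 0.0)


print(pmi("the", "dog") > 0)   # True: 'the' and 'dog' are neighbors more often than chance
print(ppmi("dog", "and"))      # 0.0: the pair never co-occurs in this corpus
```
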
Next, we move from count-based statistics to prediction-based embeddings with CBOW.
___

## Continuous Bag of Words (CBOW)
The Continuous Bag of Words (CBOW) model is a neural network architecture for learning word embeddings. It is described in the [word2vec work](https://arxiv.org/abs/1301.3781).

> __What is it?__ 
>
> The Continuous Bag of Words (CBOW) model predicts the probability of a _target word_ based on its surrounding _context words_. The CBOW is encoded as a feedforward neural network with a single hidden layer. The input vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is the sum (or average) of the [one-hot encoded vectors](https://en.wikipedia.org/wiki/One-hot) of the _context words_. The output is a _softmax layer_ that computes the probability of the target word given the context.
> 
> __Reference__: [Rong, X. (2014). word2vec Parameter Learning Explained. ArXiv, abs/1411.2738.](https://arxiv.org/abs/1411.2738)

Let $\mathcal{C}$ be the set of indices of the context words surrounding a target word $w_t$. For a window size of $m$, the context is $\mathcal{C} = \{t-m, \dots, t-1, t+1, \dots, t+m\}$. Let $\mathbf{v}_{w} \in \{0,1\}^{N_{\mathcal{V}}}$ be the one-hot encoded vector for word $w$, so that $\mathbf{v}_{w_k}$ is the one-hot vector of the word at position $k$.

The input vector $\mathbf{x}$ to the network is the aggregate of the one-hot vectors of the context words:
$$
\mathbf{x} = \sum_{k \in \mathcal{C}} \mathbf{v}_{w_k}
$$
Alternatively, we can take the average: $\mathbf{x} = \frac{1}{|\mathcal{C}|}\sum_{k \in \mathcal{C}} \mathbf{v}_{w_k}$. This input vector $\mathbf{x}\in\mathbb{R}^{N_{\mathcal{V}}}$ is connected to a hidden layer $\mathbf{h}\in\mathbb{R}^{h}$, which is computed using a linear transformation with no activation function (equivalently, an identity activation):
$$
\begin{align*}
\mathbf{h} &= \mathbf{W}_{1} \cdot \mathbf{x} \\
\end{align*}
$$
where $\mathbf{W}_{1}\in\mathbb{R}^{h\times{N_{\mathcal{V}}}}$ is the weight matrix of the hidden layer. The hidden layer is then mapped through another linear layer:
$$
\begin{align*}
\mathbf{u} &= \mathbf{W}_{2} \cdot \mathbf{h} \\
\end{align*}
$$
which produces the $\mathbf{u}\in\mathbb{R}^{N_{\mathcal{V}}}$ vector, where $\mathbf{W}_{2}\in\mathbb{R}^{N_{\mathcal{V}}\times{h}}$ is the weight matrix for the output layer. The output layer is then passed through a softmax activation function to obtain the probability distribution over the vocabulary:
$$
\begin{align*}
p(w_{i} | \mathbf{x}) = \hat{y}_i &= \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \\
\end{align*}
$$
where $\hat{y}_i$ is the predicted probability of observing the $i$-th token in the vocabulary as the target, given the context vector $\mathbf{x}$. The term $N_{\mathcal{V}}$ is the size of the vocabulary, and $e^{\mathbf{u}_i}$ is the exponential function applied to the $i$-th element of the vector $\mathbf{u}$.
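
The forward pass described above can be sketched in plain Python. This is an illustration under stated assumptions, not the word2vec implementation: the toy sizes `N_V = 7` and `H = 3`, the uniform random initialization, the helper names `matvec`, `softmax`, and `cbow_forward`, and the use of 0-based token indices are all choices made for this example. Subtracting the maximum inside `softmax` is a standard numerical-stability step that does not change the result.

```python
import math
import random

random.seed(0)
N_V, H = 7, 3  # vocabulary size and hidden dimension (toy values)

# W1 has shape (H, N_V) and W2 has shape (N_V, H), matching the text.
W1 = [[random.uniform(-0.5, 0.5) for _ in range(N_V)] for _ in range(H)]
W2 = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(N_V)]


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(u):
    shift = max(u)  # stability shift; cancels in the ratio
    exps = [math.exp(ui - shift) for ui in u]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_forward(context_ids, average=True):
    """Return the predicted distribution over the vocabulary for one context."""
    # x is the sum (or average) of the one-hot vectors of the context words.
    x = [0.0] * N_V
    for k in context_ids:          # 0-based token indices
        x[k] += 1.0
    if average:
        x = [xi / len(context_ids) for xi in x]
    h = matvec(W1, x)              # hidden layer, no activation function
    u = matvec(W2, h)              # one score per vocabulary entry
    return softmax(u)


y_hat = cbow_forward([2, 3, 5, 6])
print(sum(y_hat))  # approximately 1.0: a valid probability distribution
```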

> __Why no activation function?__
>
> The hidden layer uses a linear map. This reduces computation and matches the linear structure of the embedding space.
> *   **Computation**: Removing the non-linearity reduces cost and supports training on corpora with billions of words.
> *   **Linear map**: The linear projection preserves linear relationships in the embedding space, enabling vector arithmetic like $\texttt{vector}(\text{King}) - \texttt{vector}(\text{Man}) + \texttt{vector}(\text{Woman}) \approx \texttt{vector}(\text{Queen})$.
>
> This supports embedding learning but limits nonlinear patterns that networks with more layers can capture.

#### Training
The training objective of the CBOW model is to _maximize_ the likelihood of the target word given the context words. This is done by _minimizing_ the __negative log-likelihood loss function__ (cross-entropy loss) for one training example. The loss function is defined as:
$$
\begin{align*}
\mathcal{L} &= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log \hat{y}_i \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\log \left( \frac{e^{\mathbf{u}_i}}{\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}} \right) \\
&= -\sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left( \mathbf{u}_i - \log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) \right) \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j} \right) -  \mathbf{u}_i\right)\quad\text{substitute}~\mathbf{u}_{i} = \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle \\
&= \sum_{i=1}^{N_{\mathcal{V}}} y_{i}\cdot\left(\log \left( \sum_{j=1}^{N_{\mathcal{V}}} e^{\langle \mathbf{w}_{2}^{(j)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle} \right) -  \langle \mathbf{w}_{2}^{(i)},\mathbf{W}_{1}\cdot\mathbf{x}\rangle\right)\blacksquare\\
\end{align*}
$$
where $\mathcal{L}$ is the loss function, $y_{i}$ is the $i$-th element of the **ground truth** one-hot encoded vector (1 for the actual target word, 0 otherwise), and $\hat{y}_i$ is the model's predicted probability. $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ are the weight matrices, and $\langle \cdot,\cdot\rangle$ is the inner product. Finally, the term $\mathbf{w}_{2}^{(i)}$ is the $i$-th row of the weight matrix $\mathbf{W}_{2}$.
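
Because $y$ is one-hot, the sum over $i$ keeps only the target term, so the loss reduces to $\log\sum_{j} e^{\mathbf{u}_j} - \mathbf{u}_{\text{target}}$. The sketch below checks this numerically for a made-up score vector. The function names `log_sum_exp`, `cross_entropy`, and `cbow_loss` and the example scores are assumptions made for this illustration.

```python
import math


def log_sum_exp(u):
    """Numerically stable log(sum_j exp(u_j))."""
    shift = max(u)
    return shift + math.log(sum(math.exp(uj - shift) for uj in u))


def cross_entropy(u, y):
    """General form: -sum_i y_i * (u_i - log sum_j exp(u_j))."""
    lse = log_sum_exp(u)
    return -sum(yi * (ui - lse) for yi, ui in zip(y, u))


def cbow_loss(u, target):
    """One-hot special case: log sum_j exp(u_j) - u_target."""
    return log_sum_exp(u) - u[target]


u = [2.0, 1.0, 0.1]   # made-up output scores
y = [1, 0, 0]         # ground truth: the target is token 0

print(abs(cross_entropy(u, y) - cbow_loss(u, 0)) < 1e-12)  # True
```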

### Extracting the Embeddings
Once the model is trained, we are not interested in the prediction task itself (i.e., predicting the center word). Instead, our goal is the weight matrix $\mathbf{W}_{1}$.

> __Where are the embeddings?__
>
> The columns of the weight matrix $\mathbf{W}_{1} \in \mathbb{R}^{h \times N_{\mathcal{V}}}$ constitute the **word embeddings**. Column $i$ is the $h$-dimensional vector representation of the $i$-th word in the vocabulary, because multiplying $\mathbf{W}_{1}$ by the one-hot vector of word $i$ selects exactly that column. To get the embedding for a specific word, we simply "look up" the corresponding column in $\mathbf{W}_{1}$ (equivalently, the corresponding row of $\mathbf{W}_{1}^{\top}$). This layer is called the **Lookup Table**.
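
With $\mathbf{W}_{1}$ stored as an $h \times N_{\mathcal{V}}$ matrix, as in the hidden-layer equation $\mathbf{h} = \mathbf{W}_{1}\cdot\mathbf{x}$, the embedding of token $i$ is column $i$ of $\mathbf{W}_{1}$ (equivalently, row $i$ of its transpose). The sketch below checks that multiplying by a one-hot vector and reading off the column give the same result. The toy matrix values and the helper names `embedding` and `matvec` are assumptions made for this example, and indices are 0-based.

```python
# Toy weight matrix with h = 2 and N_V = 3 (values made up for this example).
W1 = [
    [0.1, 0.4, 0.7],
    [0.2, 0.5, 0.8],
]


def embedding(W1, i):
    """Look up the embedding of token i (0-based): column i of W1."""
    return [row[i] for row in W1]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


one_hot = [0, 1, 0]                        # one-hot vector for token 1
print(embedding(W1, 1))                    # [0.4, 0.5]
print(matvec(W1, one_hot) == embedding(W1, 1))  # True
```

The lookup avoids the full matrix-vector product, which is why embedding layers are implemented as table lookups rather than multiplications.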

### Practical Note: The Softmax Bottleneck
In the derivation above, the denominator of the softmax function requires summing over the entire vocabulary size $N_{\mathcal{V}}$:
$$
\sum_{j=1}^{N_{\mathcal{V}}} e^{\mathbf{u}_j}
$$
For vocabularies with $N_{\mathcal{V}} \approx 1{,}000{,}000$, computing this sum for every training example is a bottleneck. Implementations (like the original Word2Vec C code or Gensim) use techniques such as **Negative Sampling** or **Hierarchical Softmax** to avoid the full sum over the vocabulary.
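
As a rough sketch of the idea behind negative sampling, the loss for one training example can be replaced by $-\log\sigma(\mathbf{u}_{\text{target}}) - \sum_{k\in\text{neg}}\log\sigma(-\mathbf{u}_k)$, which needs only $K+1$ scores instead of $N_{\mathcal{V}}$. The code below is an illustration under stated assumptions: the sizes, the random weights, the function names, and in particular the uniform sampling of negatives are choices made for this example (word2vec draws negatives from a smoothed unigram distribution instead).

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def negative_sampling_loss(h, W2, target, negatives):
    """Score only the target and the sampled negatives, not the whole vocabulary."""
    def score(i):
        return sum(w * hi for w, hi in zip(W2[i], h))   # u_i = <w2^(i), h>

    loss = -log_sigmoid(score(target))
    for k in negatives:
        loss -= log_sigmoid(-score(k))
    return loss


random.seed(0)
N_V, H, K = 1000, 8, 5   # toy vocabulary size, hidden size, number of negatives
W2 = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(N_V)]
h = [random.uniform(-0.5, 0.5) for _ in range(H)]

target = 42
negatives = random.sample([i for i in range(N_V) if i != target], K)  # uniform, an assumption
loss = negative_sampling_loss(h, W2, target, negatives)
print(loss)  # computed from K + 1 = 6 inner products rather than 1000
```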
___

## Summary
This lecture covered count-based text representations (BoW, TF-IDF, and PMI) and a prediction-based embedding model (CBOW).

> __Key Takeaways:__
> 
> * __BoW and TF-IDF:__ BoW provides word count vectors, while TF-IDF re-weights them based on document frequency.
> * __PMI for associations:__ PMI measures word co-occurrence, with positive values indicating association and negative values showing avoidance.
> * __CBOW embeddings:__ CBOW learns word embeddings by predicting targets from context, with the hidden-layer weight matrix $\mathbf{W}_{1}$ serving as the embedding matrix.

These methods support later NLP models.

___
