# L10a: The GloVe Model for Word Embeddings
In this lecture, we explore the GloVe (Global Vectors) model, an approach for learning word embeddings from global word-word co-occurrence statistics in a corpus. Unlike prediction-based methods like CBOW, GloVe leverages count-based statistics to capture semantic relationships, providing a bridge between statistical methods (like PMI) and neural embeddings.

> __Learning Objectives:__
>
> By the end of this lecture, you should be able to:
> * __Define the GloVe model:__ Describe how GloVe connects word-context co-occurrence counts to vector representations with a weighted least-squares objective.
> * __Construct co-occurrence statistics:__ Build a word-word co-occurrence matrix from a corpus using a sliding context window and distance-based weighting.
> * __Interpret the training objective:__ Explain the roles of the weighting function, bias terms, and separate word/context vectors in the GloVe loss.

Let's get started!
___

## Examples
Today, we will use the following examples to illustrate key concepts:

> [▶ Implement GloVe Embeddings](CHEME-5820-L10a-Example-GloVeEmbeddings-Spring-2026.ipynb). In this example, we build a co-occurrence matrix from a small corpus, train GloVe embeddings, and evaluate their semantic properties through vector arithmetic.

___

## What is GloVe?
GloVe (Global Vectors) is a count-based model that learns word embeddings by approximately factorizing the logarithm of a global word-word co-occurrence matrix. Unlike local context methods (like CBOW) that predict nearby words one window at a time, GloVe fits statistics aggregated over the entire corpus.

> __Why GloVe?__
>
> GloVe combines the advantages of count-based methods (like PMI, which capture global statistics) with the efficiency of neural embeddings. It addresses PMI's limitations by learning low-dimensional vectors that preserve linear relationships, enabling semantic arithmetic like $\text{vector}(\text{King}) - \text{vector}(\text{Man}) + \text{vector}(\text{Woman}) \approx \text{vector}(\text{Queen})$.

GloVe assigns each vocabulary item $i$ a vector $\mathbf{w}_{i} \in \mathbb{R}^{d}$ so that the geometry of the vector space encodes co-occurrence structure.

> __Two Vectors per Word__
>
> GloVe learns two vectors per word and combines them after training:
> * __Word vectors:__ $\mathbf{w}_{i}$ is used when word $i$ is the center word in the context window.
> * __Context vectors:__ $\tilde{\mathbf{w}}_{i}$ is used when word $i$ appears as a context word.
> * __Final Embedding:__ Use $\mathbf{w}_{i} + \tilde{\mathbf{w}}_{i}$ (or their average) as the embedding for word $i$.

This asymmetry accounts for the two roles a word can play in a co-occurrence pair: center word or context word. Next, we define the co-occurrence counts that form the foundation of the model.
___

## From Co-occurrence Counts to Meaning
The core idea of GloVe is to model the relationship between words through their co-occurrence probabilities. Let $X_{ij}$ denote the weighted number of times word $j$ occurs in the context window of word $i$. We define the total co-occurrence count for word $i$ as:
$$
X_{i} = \sum_{k} X_{ik}
$$
and the conditional probability as:
$$
P_{ij} = \frac{X_{ij}}{X_{i}}
$$
so that $P_{ij}$ is the empirical probability that word $j$ appears near word $i$.

> __Interpreting Ratios__
>
> The key insight is to examine ratios of these probabilities:
> * __Large ratio:__ If $P_{ik}/P_{jk}$ is large, context word $k$ is much more associated with center word $i$ than with $j$.
> * __Ratio near 1:__ If $P_{ik}/P_{jk} \approx 1$, word $k$ is related to $i$ and $j$ in similar ways.
> * __Small ratio:__ If $P_{ik}/P_{jk}$ is small, word $k$ is more associated with $j$ than $i$.

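For example, take $i=\text{ice}$ and $j=\text{steam}$. For the context word $k=\text{solid}$, the ratio $P_{\text{ice},\text{solid}} / P_{\text{steam},\text{solid}}$ is large because "solid" co-occurs more with "ice" than with "steam". For $k=\text{gas}$ the ratio is small, and for a word like "water" (related to both) or "fashion" (related to neither) it is close to 1.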

> __Numerical Example__
>
> Consider a small corpus with words: "cat", "dog", "food", "sleep". Suppose we observe these co-occurrence counts within a context window:
> * $X_{\text{cat},\text{food}} = 15$, $X_{\text{cat},\text{sleep}} = 8$
> * $X_{\text{dog},\text{food}} = 12$, $X_{\text{dog},\text{sleep}} = 2$
> * Total counts: $X_{\text{cat}} = 23$, $X_{\text{dog}} = 14$
>
> The conditional probabilities are:
> * $P_{\text{cat},\text{food}} = 15/23 \approx 0.65$
> * $P_{\text{cat},\text{sleep}} = 8/23 \approx 0.35$
> * $P_{\text{dog},\text{food}} = 12/14 \approx 0.86$
> * $P_{\text{dog},\text{sleep}} = 2/14 \approx 0.14$
>
> The ratio $P_{\text{cat},\text{food}} / P_{\text{dog},\text{food}} = (15/23)/(12/14) \approx 0.76$ is fairly close to 1, indicating that "food" is associated with both words to a similar degree. The ratio $P_{\text{cat},\text{sleep}} / P_{\text{dog},\text{sleep}} = (8/23)/(2/14) \approx 2.4$ is larger, indicating that "sleep" is more associated with "cat" than with "dog".
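
The arithmetic in the numerical example can be reproduced with a few lines of Python. This is a minimal sketch: the nested-dictionary layout and the helper names `conditional_probabilities` and `probability_ratio` are illustrative choices, not part of the lecture's notebook.

```python
# Co-occurrence counts from the numerical example (center word -> context word -> count).
X = {
    "cat": {"food": 15.0, "sleep": 8.0},
    "dog": {"food": 12.0, "sleep": 2.0},
}

def conditional_probabilities(X):
    """Return P[i][j] = X[i][j] / X_i, where X_i is the row total for center word i."""
    P = {}
    for i, row in X.items():
        X_i = sum(row.values())
        P[i] = {j: x_ij / X_i for j, x_ij in row.items()}
    return P

def probability_ratio(P, i, j, k):
    """Return P_ik / P_jk for center words i, j and context word k."""
    return P[i][k] / P[j][k]

P = conditional_probabilities(X)
ratio_food = probability_ratio(P, "cat", "dog", "food")
ratio_sleep = probability_ratio(P, "cat", "dog", "sleep")
print(f"P(food|cat)  / P(food|dog)  = {ratio_food:.2f}")
print(f"P(sleep|cat) / P(sleep|dog) = {ratio_sleep:.2f}")
```

Working from unrounded probabilities, the "sleep" ratio comes out near 2.43; dividing the rounded values 0.35 and 0.14 gives 2.5 instead.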

With these statistics defined, we next describe how the counts $X_{ij}$ are built from a corpus, and then connect them to the vector learning objective.

## Building the Co-occurrence Matrix
The co-occurrence matrix $X$ is built by scanning the corpus with a sliding context window and accumulating weighted counts.

> __Construction Steps__
>
> * __Step 1 (Vocabulary):__ Tokenize the corpus, clean the text, and select a vocabulary of the most frequent words.
> * __Step 2 (Context Window):__ Choose a symmetric context window of size $c$ around each center word.
> * __Step 3 (Weighted Counts):__ For each center word $i$ and context word $j$ within the window, add a weight to $X_{ij}$ (e.g., $1/\delta$, where $\delta$ is the number of positions separating the two words). This gives higher weight to closer words.
> * __Symmetry Note:__ With a symmetric window, every pair of positions contributes the same weight to $X_{ij}$ and $X_{ji}$, so $X$ is symmetric. $X$ becomes asymmetric only if the window is one-sided (e.g., left context only).
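
The construction steps can be sketched in plain Python as follows. The function name `build_cooccurrence`, the sparse dictionary keyed by index pairs, and the three-sentence toy corpus are assumptions made for illustration; dropping out-of-vocabulary tokens before windowing is also just one possible convention.

```python
from collections import Counter, defaultdict

def build_cooccurrence(sentences, window=2, max_vocab=None):
    """Build a distance-weighted word-word co-occurrence table.

    sentences: list of token lists (already tokenized and cleaned).
    window:    symmetric context window size c.
    Returns (vocab, X): vocab maps word -> index, X maps (i, j) -> weighted count.
    """
    # Step 1: keep the most frequent words (all words if max_vocab is None).
    freq = Counter(tok for sent in sentences for tok in sent)
    vocab = {w: idx for idx, (w, _) in enumerate(freq.most_common(max_vocab))}

    # Steps 2-3: slide a symmetric window and add 1/distance per (center, context) pair.
    X = defaultdict(float)
    for sent in sentences:
        # Assumption of this sketch: out-of-vocabulary tokens are dropped first.
        ids = [vocab[t] for t in sent if t in vocab]
        for pos, i in enumerate(ids):
            lo = max(0, pos - window)
            hi = min(len(ids), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue  # a word is not its own context
                X[(i, ids[ctx_pos])] += 1.0 / abs(ctx_pos - pos)
    return vocab, dict(X)

# Hypothetical toy corpus.
corpus = [
    "the cat eats food".split(),
    "the dog eats food".split(),
    "the cat likes sleep".split(),
]
vocab, X = build_cooccurrence(corpus, window=2)
print(X[(vocab["cat"], vocab["the"])], X[(vocab["cat"], vocab["food"])])
```

Only nonzero entries are stored, which matters in practice because most word pairs never co-occur. Since the window here is symmetric, the table satisfies `X[(i, j)] == X[(j, i)]` for every stored pair.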

With $X$ in hand, we can train the model parameters.

___

## Learning GloVe Embeddings
GloVe learns word and context vectors by minimizing a weighted least-squares objective that captures the relationship between co-occurrence statistics and vector geometry.

> __Objective Function__
> 
> The model fits vectors by minimizing this objective over all pairs $(i,j)$ where $X_{ij} > 0$:
> $$
\boxed{
\begin{align*}
J = \sum_{i,j: X_{ij} > 0} f(X_{ij})\left(\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}\right)^{2}
\end{align*}}
> $$
>
> __Components of the Objective__
>
> * __Dot Product:__ $\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j}$ models the log co-occurrence count.
> * __Bias Terms:__ $b_{i}$ and $\tilde{b}_{j}$ account for word frequency differences. Frequent words like "the" need larger bias terms to match their higher co-occurrence counts.
> * __Weighting Function:__ $f(X_{ij})$ keeps both rare and very frequent co-occurrences from dominating training. Rare pairs have noisy counts and are down-weighted, while the cap at 1 stops very frequent pairs (often involving stop words like "the") from overwhelming the sum. A common choice is $f(x) = \min\left(1, \left(\frac{x}{x_{\max}}\right)^{\alpha}\right)$ with $x_{\max} = 100$ and $\alpha = 3/4$, as used in the original GloVe paper. For example, a co-occurrence count of 50 gets weight $\min(1, (50/100)^{3/4}) \approx 0.59$, while a count of 200 gets weight 1.
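
To get a feel for the weighting function, it can help to evaluate it at a few counts. The sketch below uses the default values $x_{\max} = 100$ and $\alpha = 3/4$ mentioned above; the function name `glove_weight` is an illustrative choice.

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) = min(1, (x / x_max) ** alpha)."""
    return min(1.0, (x / x_max) ** alpha)

# Weights grow sublinearly with the count and saturate at 1 once x >= x_max.
for count in (1, 10, 50, 100, 200):
    print(f"f({count:>3}) = {glove_weight(count):.3f}")
```

Because $\alpha < 1$, the weight rises quickly for small counts and flattens as the count approaches $x_{\max}$, so a pair seen 10 times is not treated as ten times more informative than a pair seen once.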

This objective encourages the vectors to satisfy $\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} \approx \log X_{ij}$, capturing the statistical relationships between words.
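
One way to see how this fitted relation connects back to the probability ratios is to write it for two center words $i$ and $j$ that share a context word $k$, and subtract. The sketch below treats the approximation as if it held for both pairs and uses $X_{ik} = P_{ik} X_{i}$ from the definitions earlier:
$$
\begin{align*}
(\mathbf{w}_{i} - \mathbf{w}_{j})^{\top}\tilde{\mathbf{w}}_{k} + (b_{i} - b_{j}) &\approx \log X_{ik} - \log X_{jk} \\
&= \log\frac{P_{ik}}{P_{jk}} + \log\frac{X_{i}}{X_{j}}
\end{align*}
$$
The context bias $\tilde{b}_{k}$ cancels in the subtraction. If the word biases end up absorbing the row totals, $b_{i} \approx \log X_{i}$, then the remaining relation is $(\mathbf{w}_{i} - \mathbf{w}_{j})^{\top}\tilde{\mathbf{w}}_{k} \approx \log(P_{ik}/P_{jk})$, so differences between word vectors encode log ratios of co-occurrence probabilities.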

### Optimization Strategy
The GloVe objective is optimized using stochastic gradient descent methods, as the full matrix $X$ is typically too large for batch optimization.

> __Gradient Computation__
>
> For a single training pair $(i,j)$, the corresponding term of $J$ contributes the following gradients with respect to the parameters:
> $$
\begin{align*}
\frac{\partial J}{\partial \mathbf{w}_{i}} &= 2f(X_{ij})(\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij})\tilde{\mathbf{w}}_{j} \\
\frac{\partial J}{\partial \tilde{\mathbf{w}}_{j}} &= 2f(X_{ij})(\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij})\mathbf{w}_{i} \\
\frac{\partial J}{\partial b_{i}} &= 2f(X_{ij})(\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij}) \\
\frac{\partial J}{\partial \tilde{b}_{j}} &= 2f(X_{ij})(\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j} + b_{i} + \tilde{b}_{j} - \log X_{ij})
\end{align*}
>$$
>
> With gradients in hand, we can now describe the practical training algorithm that implements these updates.
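
A quick way to build confidence in these formulas is to compare them against a finite-difference estimate for one pair. This is a sketch under stated assumptions: the parameter values and the count $X_{ij} = 12$ are made up, and the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def glove_weight(x, x_max=100.0, alpha=0.75):
    return min(1.0, (x / x_max) ** alpha)

def pair_loss(w_i, wt_j, b_i, bt_j, x_ij):
    """One term of J: f(X_ij) * (w_i . wt_j + b_i + bt_j - log X_ij)^2."""
    err = dot(w_i, wt_j) + b_i + bt_j - math.log(x_ij)
    return glove_weight(x_ij) * err ** 2

def pair_gradients(w_i, wt_j, b_i, bt_j, x_ij):
    """Analytic gradients of one term of J with respect to its four parameters."""
    err = dot(w_i, wt_j) + b_i + bt_j - math.log(x_ij)
    scale = 2.0 * glove_weight(x_ij) * err
    grad_w = [scale * v for v in wt_j]   # dJ/dw_i
    grad_wt = [scale * v for v in w_i]   # dJ/dwt_j
    return grad_w, grad_wt, scale, scale  # last two: dJ/db_i and dJ/dbt_j

# Hypothetical parameter values for a single pair.
w_i, wt_j, b_i, bt_j, x_ij = [0.1, -0.2, 0.3], [0.4, 0.0, -0.1], 0.05, -0.02, 12.0
grad_w, grad_wt, grad_b, grad_bt = pair_gradients(w_i, wt_j, b_i, bt_j, x_ij)

# Central finite difference on the first component of w_i as a sanity check.
h = 1e-6
plus = pair_loss([w_i[0] + h] + w_i[1:], wt_j, b_i, bt_j, x_ij)
minus = pair_loss([w_i[0] - h] + w_i[1:], wt_j, b_i, bt_j, x_ij)
numeric = (plus - minus) / (2 * h)
print(f"analytic = {grad_w[0]:.6f}, numeric = {numeric:.6f}")
```

All four gradients share the same scalar factor $2 f(X_{ij}) e_{ij}$, so an implementation can compute it once per pair and reuse it.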

> __Training Algorithm__
>
> * __Mini-batch SGD:__ Process small batches of word pairs $(i,j)$ to compute gradient estimates and update parameters.
> * __Adaptive Optimizers:__ Use AdaGrad or Adam to handle the sparse gradients and varying scales across different word frequencies.
> * __Learning Rate Schedule:__ Start with higher learning rates and decay over time to ensure convergence.
> * __Early Stopping:__ Monitor validation performance on semantic tasks to prevent overfitting.
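
As an illustration of the adaptive-optimizer idea, here is a minimal AdaGrad update for one parameter vector. The function name `adagrad_step`, the learning rate, the zero-initialized accumulator, and the gradient values are assumptions chosen for the example.

```python
import math

def adagrad_step(params, grads, accum, lr=0.05, eps=1e-8):
    """Update params in place with one AdaGrad step.

    accum holds the running sum of squared gradients per parameter, so
    parameters that keep receiving large gradients take smaller steps.
    """
    for idx, g in enumerate(grads):
        accum[idx] += g * g
        params[idx] -= lr * g / (math.sqrt(accum[idx]) + eps)

# Hypothetical 3-dimensional word vector and two identical gradients in a row.
w = [0.1, -0.2, 0.3]
G = [0.0, 0.0, 0.0]  # accumulator, one entry per parameter
adagrad_step(w, [0.5, 0.0, -1.0], G)
adagrad_step(w, [0.5, 0.0, -1.0], G)
print(w)
```

In this sketch the first step moves each parameter with a nonzero gradient by about `lr` regardless of the gradient's magnitude, and the second step is smaller by a factor of $\sqrt{2}$. Parameters with zero gradient (as happens for words absent from a batch) are left untouched, which is why this family of optimizers suits sparse updates.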

This training approach raises the question: why choose gradient descent over other optimization methods?

> __Why Gradient Descent?__
>
> While heuristic methods like simulated annealing or genetic algorithms could theoretically optimize the GloVe objective, gradient descent is the practical choice for several reasons:
>
> * __Scalability:__ The GloVe objective involves millions of parameters and training pairs. Gradient-based methods scale to this size efficiently.
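> * __Convergence Behavior:__ The objective is not jointly convex, because the dot product $\mathbf{w}_{i}^{\top}\tilde{\mathbf{w}}_{j}$ is bilinear in the two sets of vectors. It is smooth, however, and it reduces to a weighted least-squares problem in one set of vectors when the other is held fixed, so gradient methods with suitable step sizes tend to reach good stationary points in practice.
> * __Computational Efficiency:__ Each per-pair gradient costs $O(d)$, where $d$ is the embedding dimension (typically 50-300), making training feasible for large vocabularies.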
> * __Established Success:__ Gradient descent has proven effective for learning embeddings that capture semantic relationships.

While gradient descent is the standard choice, it's worth considering alternative optimization approaches that could theoretically be applied to the GloVe objective.

> __Alternative Approaches__
>
> Other optimization methods are possible but less practical:
> * __Second-order Methods:__ Newton's method or quasi-Newton (L-BFGS) could converge faster but require computing or approximating the Hessian matrix, which is computationally prohibitive for high-dimensional problems.
> * __Heuristic Methods:__ Simulated annealing or evolutionary algorithms lack the theoretical convergence properties needed for reliable embedding learning.
> * __Coordinate Descent:__ Could optimize one parameter at a time but would be slower than mini-batch gradient methods.

Despite gradient descent's potential issues (local minima, sensitivity to learning rates), it remains the standard approach due to its balance of efficiency, scalability, and effectiveness.

### GloVe Training Pseudocode
__Initialization__: Given vocabulary size $V$, embedding dimension $d$, co-occurrence matrix $X$, and hyperparameters $(\alpha, x_{\max}, \eta_0, \tau)$, initialize word vectors $\mathbf{W} \in \mathbb{R}^{V \times d}$ and context vectors $\tilde{\mathbf{W}} \in \mathbb{R}^{V \times d}$ randomly (e.g., from $\mathcal{N}(0, 0.01)$), bias vectors $\mathbf{b}, \tilde{\mathbf{b}} \in \mathbb{R}^{V}$ to zeros. Set learning rate $\eta \gets \eta_0$, batch size $B$, maximum iterations $K$, tolerance $\epsilon > 0$. Set $\texttt{converged} \gets \texttt{false}$, iteration counter $k \gets 0$.

While not $\texttt{converged}$ __do__:
1. Sample a mini-batch of word pairs $\{(i^{(m)}, j^{(m)})\}_{m=1}^B$ where $X_{i^{(m)}j^{(m)}} > 0$. For each pair $(i,j)$ in the batch, compute prediction $\hat{y}_{ij} = \mathbf{w}_i^\top \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j$, error $e_{ij} = \hat{y}_{ij} - \log X_{ij}$, weighting factor $f_{ij} = \min(1, (X_{ij}/x_{\max})^\alpha)$, and gradients $\frac{\partial J}{\partial \mathbf{w}_i} = 2f_{ij} e_{ij} \tilde{\mathbf{w}}_j$, $\frac{\partial J}{\partial \tilde{\mathbf{w}}_j} = 2f_{ij} e_{ij} \mathbf{w}_i$, $\frac{\partial J}{\partial b_i} = 2f_{ij} e_{ij}$, $\frac{\partial J}{\partial \tilde{b}_j} = 2f_{ij} e_{ij}$.
2. Update all parameters using the accumulated gradients and current learning rate $\eta$. $\texttt{Note}$: Use an adaptive optimizer like AdaGrad to handle sparse gradients and varying scales across word frequencies.
3. Check convergence: If the change in the objective function is $\leq \epsilon$, set $\texttt{converged} \gets \texttt{true}$. Otherwise, if $k \geq K$, set $\texttt{converged} \gets \texttt{true}$ and warn that the maximum number of iterations was reached without convergence. $\texttt{Note}$: Other convergence criteria could monitor gradient norms or validation performance on semantic tasks.
4. Update learning rate $\eta \gets \eta_0 / (1 + \tau k)$, increment iteration counter $k \gets k + 1$.

__Post-training__: Combine word and context vectors $\mathbf{w}_i \gets \mathbf{w}_i + \tilde{\mathbf{w}}_i$ for each word $i$ to get the final embeddings.
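
The pseudocode can be turned into a small runnable sketch in plain Python. Several details here are assumptions rather than part of the lecture: the co-occurrence counts are hypothetical, gradients within a batch are summed, plain SGD with the decay schedule is used instead of AdaGrad, $\mathcal{N}(0, 0.01)$ is read as variance 0.01 (standard deviation 0.1), and the full objective is recomputed every iteration, which is only affordable for a toy problem.

```python
import math
import random

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def glove_weight(x, x_max=100.0, alpha=0.75):
    return min(1.0, (x / x_max) ** alpha)

def objective(X, W, Wt, b, bt):
    """Full objective J summed over all observed pairs (X_ij > 0)."""
    return sum(
        glove_weight(x) * (dot(W[i], Wt[j]) + b[i] + bt[j] - math.log(x)) ** 2
        for (i, j), x in X.items()
    )

def train_glove(X, V, d=5, eta0=0.05, tau=0.01, B=8, K=2000, eps=1e-10, seed=0):
    rng = random.Random(seed)
    # Initialization: small random vectors (std 0.1), zero biases.
    W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(V)]
    Wt = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(V)]
    b, bt = [0.0] * V, [0.0] * V
    pairs = list(X.items())
    history = [objective(X, W, Wt, b, bt)]
    for k in range(K):
        eta = eta0 / (1.0 + tau * k)  # learning-rate decay
        # Step 1: sample a mini-batch and compute the per-pair factor
        # 2 * f_ij * e_ij using the parameters at the start of the batch.
        batch = [rng.choice(pairs) for _ in range(B)]
        grads = []
        for (i, j), x in batch:
            err = dot(W[i], Wt[j]) + b[i] + bt[j] - math.log(x)
            grads.append((i, j, 2.0 * glove_weight(x) * err, W[i][:], Wt[j][:]))
        # Step 2: apply the accumulated (summed) gradients.
        for i, j, g, w_i, wt_j in grads:
            for n in range(d):
                W[i][n] -= eta * g * wt_j[n]
                Wt[j][n] -= eta * g * w_i[n]
            b[i] -= eta * g
            bt[j] -= eta * g
        # Step 3: stop when the objective changes by no more than eps.
        history.append(objective(X, W, Wt, b, bt))
        if abs(history[-2] - history[-1]) <= eps:
            break
    # Post-training: the final embedding is the sum of word and context vectors.
    embeddings = [[a + c for a, c in zip(W[i], Wt[i])] for i in range(V)]
    return embeddings, history

# Hypothetical symmetric counts for indices 0=cat, 1=food, 2=sleep, 3=dog.
counts = {(0, 1): 15.0, (0, 2): 8.0, (3, 1): 12.0, (3, 2): 2.0, (1, 2): 3.0}
X = {**counts, **{(j, i): x for (i, j), x in counts.items()}}
embeddings, history = train_glove(X, V=4)
print(f"J before = {history[0]:.3f}, J after = {history[-1]:.5f}")
```

For a real corpus, the objective would typically be monitored on a sample of pairs or once per epoch, and an adaptive optimizer would replace the fixed decay schedule.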

This optimization approach scales efficiently to large corpora while learning embeddings that preserve semantic relationships.

**Reference:** Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)* (pp. 1532-1543). [PDF](https://nlp.stanford.edu/pubs/glove.pdf)

___

## Summary
GloVe provides a global view of word co-occurrence statistics to learn word embeddings that encode semantic meaning.

> __Key Takeaways__
>
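> * GloVe learns embeddings by fitting the logarithm of co-occurrence counts with a weighted least-squares objective.
> * Local context windows define the counts, and aggregating them over the whole corpus yields global statistics. Ratios of co-occurrence probabilities reveal which context words distinguish one word from another.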
> * Training is efficient and scalable, making GloVe suitable for large corpora.
> * Resulting embeddings enable semantic arithmetic and improve performance on NLP tasks like analogy completion.

GloVe bridges the gap between count-based and prediction-based methods, offering interpretable and effective word representations.

___