---
title: "Word Embeddings"
jupyter: applsoftcomp
execute:
    enabled: true
    cache: true
---



### The Spoiler
Meaning isn't stored in words; it's stored in the geometric relationship between them.

### Words are not containers for meaning

We intuitively assume words are containers for meaning—that "dog" holds the concept of a canine. This is incorrect. Structural linguistics reveals that a sign is defined solely by its relationships: "dog" means "dog" only because it is not "cat," "wolf," or "log." Meaning is differential, not intrinsic.


::: {#fig-structuralism}

![](../figs/word2vec-manga-1.jpg)

Green is the color that is not non-green (not red, not blue, not yellow, ...).

:::


**Word2vec**, a foundational model of modern NLP, learns to map the statistical topology of language. Think of it like mapping a city based purely on traffic data. You don't know what a "school" is, but you see that "buses" and "children" congregate there at 8 AM. By placing these entities close together on a map, you reconstruct the city's functional structure. Word2vec does this for language, turning semantic proximity into geometric distance.

## Word2vec

Let's first see what Word2vec can do and then understand how it works. We will use a pre-trained model. We aren't teaching it anything; we are simply inspecting the map it created from about 100 billion words of Google News.

In [None]:
import gensim.downloader as api
import numpy as np

# Load pre-trained Word2vec embeddings
print("Loading Word2vec model...")
model = api.load("word2vec-google-news-300")
print(f"Loaded embeddings for {len(model):,} words.")

If the map is accurate, "dog" should be surrounded by its semantic kin. We query the nearest neighbors in the vector space.

In [None]:
similar_to_dog = model.most_similar("dog", topn=10)

print("Words most similar to 'dog':")
for word, similarity in similar_to_dog:
    print(f"  {word:20s} {similarity:.3f}")

The model groups "dog" with "dogs," "puppy," and "pooch" **not because it knows biology**, but because they are statistically interchangeable in sentences.
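
The ranking above is based on cosine similarity between vectors. As a minimal sketch of that idea, the snippet below uses made-up 3-dimensional vectors (the `toy` dictionary, its values, and the `nearest` helper are hypothetical, not taken from the model) and plain Python so it runs without downloading anything:

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (illustrative values only)
toy = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def nearest(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for other, score in nearest("dog", toy):
    print(f"  {other:10s} {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, two words count as neighbors when their vectors point in a similar direction, regardless of vector length.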

Since words are vectors, we can perform arithmetic on meaning. The relationship between "King" and "Man" is a vector. If we add that vector to "Woman," we should arrive at "Queen."

$$ \vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} \approx \vec{\text{Queen}} $$
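
To make the arithmetic concrete before running it on the real model, here is a small sketch on hypothetical 2-dimensional vectors, where one axis loosely stands for "royalty" and the other for "gender". The values and the `analogy` helper are assumptions for illustration only. Like gensim's `most_similar`, the sketch excludes the query words from the candidates; unlike gensim, it works on the raw vectors for simplicity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king": [0.9, 0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "queen": [0.9, -0.8],
    "prince": [0.8, 0.7],
}


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


print(analogy("king", "man", "woman", toy))  # prints "queen"
```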

In [None]:
result = model.most_similar(
    positive=["king", "woman"],
    negative=["man"],
    topn=5,
)

print("king - man + woman =")
for word, similarity in result:
    print(f"  {word:15s} {similarity:.3f}")

We cannot see in 300 dimensions, but we can project the space down to 2D using PCA. This reveals the consistent structures—like the "capital city" relationship—that the model has learned.

In [None]:
#| fig-cap: The 'Capital Of' relationship appears as a consistent direction in vector space.
#| code-fold: true
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd

countries = ["Germany", "France", "Italy", "Spain", "Portugal", "Greece"]
capitals = ["Berlin", "Paris", "Rome", "Madrid", "Lisbon", "Athens"]

# Get embeddings
country_embeddings = np.array([model[country] for country in countries])
capital_embeddings = np.array([model[capital] for capital in capitals])

# PCA to 2D
pca = PCA(n_components=2)
embeddings = np.vstack([country_embeddings, capital_embeddings])
embeddings_pca = pca.fit_transform(embeddings)

# Create DataFrame
df = pd.DataFrame(embeddings_pca, columns=["PC1", "PC2"])
df["Label"] = countries + capitals
df["Type"] = ["Country"] * len(countries) + ["Capital"] * len(capitals)

# Plot
fig, ax = plt.subplots(figsize=(12, 10))

for idx, row in df.iterrows():
    color = "#e74c3c" if row["Type"] == "Country" else "#3498db"
    marker = "o" if row["Type"] == "Country" else "s"
    ax.scatter(
        row["PC1"],
        row["PC2"],
        c=color,
        marker=marker,
        s=200,
        edgecolors="black",
        linewidth=1.5,
        alpha=0.7,
        zorder=3,
    )
    ax.text(
        row["PC1"],
        row["PC2"] + 0.15,
        row["Label"],
        fontsize=12,
        ha="center",
        va="bottom",
        fontweight="bold",
        bbox=dict(facecolor="white", edgecolor="none", alpha=0.8),
    )

# Draw arrows
for i in range(len(countries)):
    country_pos = df.iloc[i][["PC1", "PC2"]].values
    capital_pos = df.iloc[i + len(countries)][["PC1", "PC2"]].values
    ax.arrow(
        country_pos[0],
        country_pos[1],
        capital_pos[0] - country_pos[0],
        capital_pos[1] - country_pos[1],
        color="gray",
        alpha=0.6,
        linewidth=2,
        head_width=0.15,
        head_length=0.1,
        zorder=2,
    )

ax.set_title(
    'The "Capital Of" Relationship as Parallel Transport',
    fontsize=16,
    fontweight="bold",
    pad=20,
)
ax.grid(alpha=0.3, linestyle="--")
plt.tight_layout()
plt.show()

## Let's unbox Word2Vec

We intuitively treat words as containers that hold meaning—that the word "Green" contains the visual concept of a specific color. This is incorrect. Nature presents us with a messy, continuous spectrum without hard borders; language is simply the set of arbitrary cuts we make in that continuum to create order.

**Word2Vec** operationalizes this by treating meaning as a game of contrast. It functions as a pair of "Linguistic Scissors." It does not learn what a word is by looking up a definition; it learns what a word is *like* by pulling it close to neighbors, and more importantly, it learns what a word is *not* by pushing it away from random noise. The meaning of "Green" is simply the geometric region that remains after we have pushed away "Red," "Purple," and "Banana."

::: {#fig-contrastive}

![](../figs/word2vec-manga-2.jpg)


Starting from random vectors, word2vec iteratively learns to push apart words that are unrelated and pull together words that are related. The resulting vector space is a map of the relationships between words.

:::

This process of carving structure out of noise relies on a technique called **Contrastive Learning**. We cannot teach the model the exact meaning of each word, but we can let it learn the relationships between words through a binary classification problem: are these two words neighbors, or are they strangers?

The training loop provides a **positive pair** from the text, instructing the model to maximize the similarity between their vectors. Simultaneously, it grabs random **negative samples** (imposters drawn from the vocabulary) and demands that the model minimize their similarity to the center word. This push-and-pull mechanic creates the vector space; the "Green" cluster forms not because the model understands color, but because those words are statistically interchangeable when opposed to "Red."
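
The push-and-pull can be sketched as a single gradient step on the negative-sampling loss, $-\log \sigma(u \cdot v_{\text{pos}}) - \sum_k \log \sigma(-u \cdot v_{\text{neg},k})$. This is a minimal illustration in plain Python, not the optimized training code of any library. The function name `sgns_step`, the learning rate, the vector dimension, and the random toy vectors are all assumptions. The negatives are fixed here for simplicity, whereas actual training would re-sample them for each pair.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_step(u, v_pos, v_negs, lr=0.1):
    """One SGD step on the negative-sampling loss; updates vectors in place.

    Returns the loss evaluated before the update.
    """
    grad_u = [0.0] * len(u)

    # Positive pair: target label 1, error = sigmoid(score) - 1 (pull together)
    score = dot(u, v_pos)
    loss = -math.log(sigmoid(score))
    g = sigmoid(score) - 1.0
    for d in range(len(u)):
        grad_u[d] += g * v_pos[d]
        v_pos[d] -= lr * g * u[d]

    # Negative samples: target label 0, error = sigmoid(score) (push apart)
    for v_neg in v_negs:
        score = dot(u, v_neg)
        loss += -math.log(sigmoid(-score))
        g = sigmoid(score)
        for d in range(len(u)):
            grad_u[d] += g * v_neg[d]
            v_neg[d] -= lr * g * u[d]

    # Update the center (query) vector last, so all gradients use the old u
    for d in range(len(u)):
        u[d] -= lr * grad_u[d]
    return loss


random.seed(0)
dim = 5
u = [random.uniform(-0.5, 0.5) for _ in range(dim)]
v_pos = [random.uniform(-0.5, 0.5) for _ in range(dim)]
v_negs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(3)]

losses = [sgns_step(u, v_pos, v_negs) for _ in range(200)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Repeating the step lowers the loss: the dot product with the positive vector grows while the dot products with the negatives shrink.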

To generate these pairs without human labeling, we employ a **sliding window technique**. This moves over the raw text corpus, converting a sequence of words into a system of geometric queries.

::: {#fig-sliding-window}

![](../figs/word2vec-manga-3.jpg)

Without human labeling, word2vec assumes that words appearing in the same context are related. The context of a word is the set of words within a window of a predefined size around it. For example, in the sentence "The quick brown fox jumps over the lazy dog", with a window of two words on each side, the context of "fox" is "quick", "brown", "jumps", and "over".

:::
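
As a rough sketch of how a sliding window turns raw text into training pairs, the snippet below yields every (center, context) pair for a window of two words on each side. The function name `context_pairs` and the window size are illustrative assumptions.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for all tokens within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = list(context_pairs(sentence, window=2))

print([context for center, context in pairs if center == "fox"])
# prints ['quick', 'brown', 'jumps', 'over']
```

Each pair produced this way serves as a positive example; no human annotation is involved.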


::: {.column-margin}
Word2Vec is a simple neural network with one hidden layer. The input is a one-hot encoded vector of a word, which triggers the neurons in the hidden layer to fire.
The connection strengths from the neuron representing the input word to the neurons in the hidden layer (marked by red arrows) form the query vector, $u$.
The hidden layer neurons then trigger the output layer neurons, whose activations represent the probability of word $w$ appearing in the context of the word $w_i$.
The connection strengths from an output word neuron to the hidden layer neurons form the key vector, $v$.

![](../figs/word2vec.jpg)
:::
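
One way to see why the input-to-hidden weights *are* the query vectors: multiplying a one-hot vector by the weight matrix simply selects one row. The sketch below uses a hypothetical three-word vocabulary and made-up weights (`W_in` and its values are assumptions for illustration).

```python
vocab = ["the", "quick", "fox"]

# Hypothetical input-to-hidden weights: one row per word, one column per hidden unit
W_in = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]


def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def hidden_activation(x, W):
    """Vector-matrix product x @ W, written out explicitly."""
    return [sum(x[r] * W[r][c] for r in range(len(W))) for c in range(len(W[0]))]


i = vocab.index("fox")
u = hidden_activation(one_hot(i, len(vocab)), W_in)
print(u == W_in[i])  # prints True: the hidden activation is row i of W_in
```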

The word in the center of the window acts as the Query vector ($u$), broadcasting its position to the surrounding Context words, which act as Keys ($v$). The neural network adjusts its weights to maximize the dot product $u \cdot v$ for these specific context pairs while suppressing the dot product for the negative samples. The probability of a word appearing in context is thus a function of their vector alignment.

$$
P(j \vert i) = \frac{P(j) \exp(u_i \cdot v_j)}{\sum_{k=1}^{V} P(k) \exp(u_i \cdot v_k)}
$$

where $P(j)$ is the marginal probability of word $j$ in the corpus and $V$ is the vocabulary size.
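
To make the formula tangible, here is a small sketch that evaluates it on hypothetical values: a three-word vocabulary, 2-dimensional vectors, and made-up marginal probabilities. The function name `context_probability` and all numbers are assumptions for illustration.

```python
import math


def context_probability(u_i, V_out, p):
    """P(j | i) proportional to p[j] * exp(u_i . v_j), normalized over the vocabulary."""
    weights = [
        p_j * math.exp(sum(a * b for a, b in zip(u_i, v_j)))
        for v_j, p_j in zip(V_out, p)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy values
u_i = [1.0, 0.0]                               # query vector of the center word
V_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # key vectors, one per word
p = [0.2, 0.5, 0.3]                            # marginal word probabilities

probs = context_probability(u_i, V_out, p)
print([round(x, 3) for x in probs])
```

The word whose key vector aligns with the query gets the largest probability even though it is the rarest of the three. If every entry of `p` is equal, the weighting cancels and the expression reduces to a plain softmax over the dot products.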

::: {.callout-note  collapse="true"}

## Original Formulation of Word2Vec is different from the one we use here

The original word2vec paper gives the following formula for the probability:

$$
P(j \vert i) = \frac{\exp(u_i \cdot v_j)}{\sum_{k=1}^{V} \exp(u_i \cdot v_k)}.
$$

Notice that $P(j)$, the marginal probability of word $j$, is omitted in this formulation [@mikolov2013distributed]. This original formulation is correct conceptually but does not describe what is learned in practice. Word2vec is usually trained with negative sampling, an efficient but biased training algorithm. Accounting for that bias introduces the term $P(j)$ into $P(j \vert i)$ [@kojaku2021residual], which is why we include it in the formula above.
:::

This closes the loop between high-level linguistic philosophy and low-level matrix operations. The machine proves the structuralist hypothesis: that meaning is relational. By mechanically slicing the continuum of language and applying the pressure of negative sampling, the model reconstructs a functional map of human concepts. We have successfully turned a philosophy of meaning into a runnable algorithm.

::: {#fig-structuralism-loop}

![](../figs/word2vec-manga-4.jpg)

:::


## Other references

Chris McCormick's [blog post](https://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) walks through the inner workings of Word2Vec's skip-gram model. I strongly encourage you to read it.

## The Takeaway

You don't need to know what a thing *is* to understand it; you only need to know where it stands relative to everything it isn't.