## *Teaching computers what words mean*
---
**Word embeddings**

How does ChatGPT know that **"dog"** and **"puppy"** are similar?

Or that **"Paris"** is to **"France"** as **"Tokyo"** is to **"Japan"**?

The answer: **word embeddings** — representing every word as a list of numbers (a *vector*) so that words with similar meanings end up close together in space.

```
king   → [0.32, -0.51, 0.78, 0.14, ...300 numbers...]
queen  → [0.30, -0.48, 0.75, 0.19, ...]
dog    → [-0.62, 0.33, -0.10, 0.88, ...]
```

This idea is a **foundation of modern AI language models**, including ChatGPT, Gemini, and Claude: each of them starts by turning text into vectors.

In [10]:
from pathlib import Path
import urllib.request
import zipfile

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patheffects as pe
import seaborn as sns
import ipywidgets as widgets
from IPython.display import display
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

sns.set_theme(style="ticks", font_scale=1.15)
plt.ioff()

data_dir = Path("../data")
zip_path = data_dir / "wiki-news-300d-1M.vec.zip"
vec_path = data_dir / "wiki-news-300d-1M.vec"
url = "https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip"

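data_dir.mkdir(parents=True, exist_ok=True)  # urlretrieve fails if the folder is missing

# Skip the download when either the archive or the extracted file is already present.
if not vec_path.exists() and not zip_path.exists():
    print("Downloading FastText vectors (first run only, large file)...")
    urllib.request.urlretrieve(url, zip_path)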

if not vec_path.exists():
    print("Extracting .vec file...")
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extract("wiki-news-300d-1M.vec", data_dir)

# Use a practical limit for live demos; set to None to load the full 1M vocabulary.
LOAD_LIMIT = 300_000
ft = KeyedVectors.load_word2vec_format(vec_path, binary=False, limit=LOAD_LIMIT)

print(f"Loaded FastText vectors: {len(ft):,} words, {ft.vector_size} dimensions.")

Loaded FastText vectors: 300,000 words, 300 dimensions.
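
For readers curious about what the loader is reading, here is a minimal sketch of the word2vec text format that `.vec` files use: a header line with the vocabulary size and the number of dimensions, then one word per line followed by its values. The `TOY_VEC` string and the `read_vec` helper are hypothetical and exist only for illustration; the notebook itself relies on gensim's loader.

```python
import io

import numpy as np

# Hypothetical miniature file in word2vec text format (values are made up).
TOY_VEC = """3 4
king 0.32 -0.51 0.78 0.14
queen 0.30 -0.48 0.75 0.19
dog -0.62 0.33 -0.10 0.88
"""


def read_vec(stream, limit=None):
    """Parse word2vec-format text into {word: vector}, reading at most `limit` words."""
    vocab_size, dims = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break  # stop early, which is the effect a load limit has
        parts = line.rstrip().split(" ")
        word = parts[0]
        vec = np.array([float(v) for v in parts[1:]], dtype=np.float32)
        if vec.shape != (dims,):
            raise ValueError(f"unexpected vector length for {word!r}")
        vectors[word] = vec
    return vectors


toy = read_vec(io.StringIO(TOY_VEC), limit=2)
print(list(toy))  # only the first two words in file order are kept
```

A limit keeps whichever words come first in the file, so a word that appears later is simply absent from the loaded vocabulary.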


## What is a Vector?

Think of it as **coordinates** for meaning.

In 2D, we can place words on a map. In 300D, we can capture much richer relationships.

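**Cosine similarity** measures how closely two word vectors point in the same direction:
- Score **1.0** = identical direction (the words are used in nearly the same way)
- Score **0.0** = perpendicular (unrelated)
- Score **-1.0** = opposite direction (rare in practice; antonyms such as "happy" and "sad" usually score fairly high, because they appear in similar contexts)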
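
The three reference scores can be checked by hand on tiny 2D vectors. This is a small sketch with made-up vectors, not real embeddings; the `cosine` helper here is a standalone stand-in for the notebook's own function.

```python
import numpy as np


def cosine(v1, v2):
    """Cosine of the angle between v1 and v2."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


a = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])  # longer, but points the same way
perpendicular = np.array([0.0, 2.0])
opposite = np.array([-2.0, 0.0])

print(cosine(a, same_direction))  # 1.0
print(cosine(a, perpendicular))   # 0.0
print(cosine(a, opposite))        # -1.0
```

Note that vector length does not matter: only the direction affects the score.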

In [11]:
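def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


def get_vector(word: str) -> np.ndarray:
    """Look up a word's 300D vector; raises KeyError if the word was not loaded."""
    return ft[word]


def nearest_neighbors(word: str, n: int = 5) -> list[tuple[str, float]]:
    """Return the n most similar words as (word, cosine similarity) pairs."""
    return ft.most_similar(word, topn=n)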

Pick a word and see which other words are most similar to it.
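
Conceptually, a nearest-neighbor lookup compares the query vector with every other vector and keeps the highest cosine scores. The sketch below does this by brute force on a hypothetical four-word vocabulary with invented 3D values; `toy_vectors` and `toy_nearest` are illustrative names, and gensim's real implementation is more optimized.

```python
import numpy as np

# Hypothetical 3D vectors, invented for illustration (not real FastText values).
toy_vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "cat": np.array([0.7, 0.3, 0.0]),
    "guitar": np.array([0.0, 0.1, 0.9]),
}


def toy_nearest(word, vectors, n=2):
    """Return the n words with the highest cosine similarity to `word`."""
    words = [w for w in vectors if w != word]  # never return the query itself
    matrix = np.vstack([vectors[w] for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # unit rows
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = matrix @ query  # cosine similarity, since both sides are unit length
    order = np.argsort(-scores)[:n]  # indices of the highest scores first
    return [(words[i], float(scores[i])) for i in order]


print(toy_nearest("dog", toy_vectors))
```

With these made-up values, "puppy" ranks first and "cat" second, while "guitar" points in a very different direction.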

In [12]:
demo_words = [
    "king", "queen", "doctor", "nurse", "dog", "cat", "lion", "python",
    "computer", "internet", "music", "guitar", "happy", "sad", "city", "country"
]

@widgets.interact(
    word=widgets.Dropdown(options=demo_words, value="king", description="Word:",
                          layout=widgets.Layout(width="260px")),
    n=widgets.IntSlider(value=8, min=3, max=12, step=1, description="Show top:",
                        style={"description_width": "initial"},
                        layout=widgets.Layout(width="300px")),
)
def show_neighbors(word, n):
    """Plot the n nearest neighbors of `word` as a horizontal bar chart."""
    neighbors = nearest_neighbors(word, n)
    neighbor_words = [w for w, _ in neighbors]
    neighbor_scores = [s for _, s in neighbors]

    fig, ax = plt.subplots(figsize=(8.5, max(2.2, n * 0.4)))
    colors = plt.cm.RdYlGn(np.clip(neighbor_scores, 0, 1))

    bars = ax.barh(range(n), neighbor_scores, color=colors, edgecolor="white", linewidth=1)
    ax.set_yticks(range(n))
    ax.set_yticklabels(neighbor_words, fontsize=11)
    ax.invert_yaxis()
    ax.set_xlim(0.4, 1.0)
    ax.set_xlabel("Cosine Similarity")
    ax.set_title(f'FastText nearest neighbors for "{word}"', fontsize=14, weight="bold")
    ax.bar_label(bars, fmt="%.3f", padding=4, fontsize=10)

    # sns.despine(left=True, bottom=True)
    plt.tight_layout()
    plt.show()


FastText vectors are **300-dimensional**, so we still need dimensionality reduction to visualize them.

We use **PCA** to project carefully chosen semantic groups into 2D.
Even after projection, related concepts should stay close together.
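
To see why well-separated groups survive the projection, here is a small sketch of PCA written with NumPy's SVD and applied to synthetic data. The two "groups" are random points around two invented centers in 10 dimensions, not real embeddings, and `pca_2d` is a hypothetical helper; the notebook itself uses scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two semantic groups in 10 dimensions.
center_a = np.zeros(10)
center_a[0] = 5.0
center_b = np.zeros(10)
center_b[0] = -5.0
group_a = center_a + 0.1 * rng.standard_normal((8, 10))
group_b = center_b + 0.1 * rng.standard_normal((8, 10))
data = np.vstack([group_a, group_b])


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)  # PCA works on mean-centered data
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T  # coordinates along the top two directions
    explained = singular_values**2 / np.sum(singular_values**2)
    return coords, explained[:2]


coords, explained = pca_2d(data)
print(coords.shape)  # (16, 2)
print(explained)     # share of variance captured by PC 1 and PC 2
```

Because the groups differ mainly along one direction, the first component captures most of the variance and the two groups land on opposite sides of the plot.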

In [13]:
word_groups = {
    "royalty": ["king", "queen", "prince", "princess", "duke", "duchess", "throne", "crown"],
    "countries": ["france", "japan", "china", "germany", "italy", "canada", "brazil", "india"],
    "capitals": ["paris", "tokyo", "beijing", "berlin", "rome", "ottawa", "brasilia", "delhi"],
    "technology": ["computer", "internet", "software", "hardware", "python", "java", "database", "algorithm"],
    "animals": ["dog", "cat", "lion", "tiger", "wolf", "bear", "eagle", "shark"],
    "emotions": ["happy", "sad", "angry", "calm", "joy", "fear", "love", "stress"],
}

cat_palette = {
    "royalty": "#E8575A",
    "countries": "#00B4D8",
    "capitals": "#9B5DE5",
    "technology": "#6BCB77",
    "animals": "#5B8FB9",
    "emotions": "#FF6B9D",
}

# Keep only words present in the loaded vocabulary (it is capped by LOAD_LIMIT).
word_groups = {
    group: [w for w in words if w in ft.key_to_index]
    for group, words in word_groups.items()
}

@widgets.interact(
    cats=widgets.SelectMultiple(
        options=sorted(word_groups.keys()),
        value=tuple(sorted(word_groups.keys())),
        description="Groups:",
        layout=widgets.Layout(height="160px", width="230px"),
    ),
    show_labels=widgets.Checkbox(value=True, description="Show word labels"),
)
def plot_word_map(cats, show_labels):
    selected_words = []
    selected_cats = []
    for cat in cats:
        for word in word_groups[cat]:
            selected_words.append(word)
            selected_cats.append(cat)

    if len(selected_words) < 3:
        print("Please select at least one group with 3+ words in vocabulary.")
        return

    vectors = np.vstack([get_vector(w) for w in selected_words])
    pca = PCA(n_components=2, random_state=42)
    coords = pca.fit_transform(vectors)
    var_exp = pca.explained_variance_ratio_

    fig, ax = plt.subplots(figsize=(11, 8))

    for cat in cats:
        mask = np.array([c == cat for c in selected_cats])
        ax.scatter(
            coords[mask, 0], coords[mask, 1],
            c=cat_palette[cat], s=120, label=cat.capitalize(),
            edgecolors="white", linewidths=0.9, zorder=3, alpha=0.9,
        )

        if show_labels:
            for i, (x, y) in enumerate(coords[mask]):
                w = np.array(selected_words)[mask][i]
                ax.text(
                    x + 0.02, y, w, fontsize=8.5, color=cat_palette[cat],
                    path_effects=[pe.withStroke(linewidth=2, foreground="white")],
                )

    ax.set_xlabel(f"PC 1 ({var_exp[0]*100:.1f}% variance)", fontsize=11)
    ax.set_ylabel(f"PC 2 ({var_exp[1]*100:.1f}% variance)", fontsize=11)
    ax.set_title("FastText Word Map — 300D to 2D via PCA", fontsize=14, weight="bold")
    ax.legend(title="Group", bbox_to_anchor=(1.01, 1), loc="upper left", fontsize=10, title_fontsize=11)
    ax.grid(True, alpha=0.25)
    sns.despine()
    plt.tight_layout()
    plt.show()


## Word Arithmetic

The vectors also support analogy-style arithmetic: subtract one word's vector, add another's, and look for the word nearest to the result. For example:

- `paris - france + japan ≈ tokyo`
- `walk - walking + swimming ≈ swim`
- `good - bad + terrible ≈ awful`

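These relationships are learned from context in large text corpora, so they hold only approximately: the expected word usually ranks among the top results, but not always first.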
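
The arithmetic itself is simple enough to sketch on a toy vocabulary. In the example below, the 2D vectors are invented so that one axis loosely stands for "royalty" and the other for "gender"; `toy` and `toy_analogy` are hypothetical names, and real embeddings are far noisier than this.

```python
import numpy as np

# Hypothetical 2D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}


def toy_analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # otherwise an input word tends to win trivially
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, float(score)
    return best_word, best_score


print(toy_analogy("king", "man", "woman", toy))
```

Skipping the input words matters: without that step, the nearest word to `a - b + c` is often `a` or `c` itself. Gensim's `most_similar` likewise leaves the input words out of its results.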

In [19]:
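# Each triple (a, b, c) is evaluated as a - b + c, e.g. paris - france + japan.
# Lookups are case-sensitive, so "Japan" and "japan" are different entries.
manual_candidates = {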
    "Capital transfer": ("paris", "france", "japan"),
    "Country transfer": ("Japan", "Tokyo", "Delhi"),
    "Gender relation": ("father", "son", "daughter"),
    "Royal gender": ("queen", "woman", "man"),
    "Verb morphology": ("walking", "walk", "swim"),
    "Adjective relation": ("good", "bad", "terrible"),
    "Plural form": ("people", "person", "child"),
}


def in_vocab(*words: str) -> bool:
    return all(w in ft.key_to_index for w in words)


analogy_examples = {
    name: triple for name, triple in manual_candidates.items()
    if in_vocab(*triple)
}

if not analogy_examples:
    print(
        "No analogy examples available with the current vocabulary limit "
        f"(LOAD_LIMIT={LOAD_LIMIT:,}). Increase the limit in the setup cell."
    )
else:
    options = list(analogy_examples.keys())

    @widgets.interact(
        example=widgets.Dropdown(
            options=options,
            value=options[0],
            description="Example:",
            layout=widgets.Layout(width="320px"),
        ),
        topn=widgets.IntSlider(
            value=8,
            min=3,
            max=12,
            step=1,
            description="Show top:",
            style={"description_width": "initial"},
            layout=widgets.Layout(width="300px"),
        ),
    )
    def word_arithmetic(example, topn):
        a, b, c = analogy_examples[example]

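        # Gensim ranks candidates against the mean of the unit-normalized input
        # vectors and leaves a, b, and c out of the results.
        top = ft.most_similar(positive=[a, c], negative=[b], topn=topn)
        # Raw (unnormalized) arithmetic, so the cosine below can differ slightly
        # from the score gensim reports.
        result_vec = get_vector(a) - get_vector(b) + get_vector(c)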

        best_word, _ = top[0]
        manual_score = cosine_similarity(result_vec, get_vector(best_word))

        print(f'"{a}" - "{b}" + "{c}" = ?')
        print(f"Top prediction: {best_word} (manual cosine to result vector: {manual_score:.3f})")

        labels = [w for w, _ in top]
        scores = [s for _, s in top]

        fig, ax = plt.subplots(figsize=(8.5, max(3.2, topn * 0.48)))
        bars = ax.barh(range(topn), scores, color=plt.cm.viridis(scores), edgecolor="white")
        ax.set_yticks(range(topn))
        ax.set_yticklabels(labels, fontsize=11)
        ax.invert_yaxis()
        ax.set_xlim(0.35, 1.0)
        ax.set_xlabel("Cosine Similarity")
        ax.set_title(f'Analogy Results: "{a}" - "{b}" + "{c}"', fontsize=13.5, weight="bold")
        ax.bar_label(bars, fmt="%.3f", padding=4, fontsize=10)

        sns.despine(left=True, bottom=True)
        plt.tight_layout()
        display(fig)
        plt.close(fig)


## Key Takeaways

1. **Words are vectors.** Here we used pretrained **FastText 300D embeddings**.
2. **Similar meanings → nearby vectors.** We measured this with cosine similarity.
3. **Gensim APIs** make practical tasks easy (`most_similar`, analogy queries).
4. **PCA maps** help humans inspect structure in high-dimensional spaces.
5. **Word arithmetic** can reveal semantic and grammatical relations.

This embedding intuition still underpins modern language models, even though production systems now use richer contextual representations.
