# Embeddings and Vector Spaces — Try it in PyTorch

This is an **optional** hands-on companion to [Chapter 4](https://learnai.robennals.org/04-embeddings). You'll build one-hot vectors, train your own embeddings, explore real word analogies with GloVe, and see how modern tokenizers work.

In [None]:
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

## One-Hot Encoding

Before a neural network can work with words, we need to turn them into numbers. The simplest approach is **one-hot encoding**: give each word a vector (a list of numbers) that's all zeros except for a single 1 in a unique position. It works, but it has a big problem: every word ends up exactly as far from every other word, so the vectors say nothing about which words are similar.

In [None]:
# One-hot encoding: each word gets a vector with a single 1
words = ["cat", "dog", "fish", "car"]
one_hot = torch.eye(len(words))

print("One-hot encodings:")
for word, vec in zip(words, one_hot):
    print(f"  {word:>4}: {vec.tolist()}")

# Problem: all words are equally distant from each other!
print("\nDistances between words (Euclidean):")
for i in range(len(words)):
    for j in range(i+1, len(words)):
        dist = torch.dist(one_hot[i], one_hot[j])
        print(f"  {words[i]:>4} ↔ {words[j]:<4}: {dist.item():.3f}")

print("\nEvery pair has distance √2 ≈ 1.414 — 'cat' is as far from 'dog' as from 'car'!")

## Learning Embeddings

An **embedding** is a short list of numbers that represents a word (or any item). Unlike one-hot vectors, embeddings are *learned* — the network adjusts the numbers during training so that similar words end up close together.

`nn.Embedding(6, 2)` creates an embedding table: 6 words, each represented by 2 numbers. We use just 2 dimensions here so we can plot them on a flat graph. Real embeddings use hundreds of dimensions.
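One way to see what the table does: looking up word `i` returns row `i` of a weight matrix, which gives the same result as multiplying that word's one-hot vector by the matrix. The sketch below checks this on a fresh, untrained table. The name `table` and the seed are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
table = nn.Embedding(6, 2)            # 6 words, 2 numbers each

idx = torch.tensor(3)                 # look up word number 3
looked_up = table(idx)                # row 3 of the weight matrix

# Same result via one-hot encoding: the single 1 in position 3 selects row 3
one_hot_3 = torch.eye(6)[3]
via_matmul = one_hot_3 @ table.weight

print(table.weight.shape)             # the table is a 6 x 2 matrix
print(torch.allclose(looked_up, via_matmul))
```

The lookup skips the multiplication entirely, which matters once the vocabulary has tens of thousands of words.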

In [None]:
# Instead of one-hot, learn a dense vector for each word
# Words that appear in similar contexts should end up close together

vocab = ["cat", "dog", "kitten", "puppy", "car", "truck"]
similar_pairs = [(0,2), (1,3), (4,5), (0,1), (2,3)]  # cat-kitten, dog-puppy, car-truck, cat-dog, kitten-puppy

# Create a 2D embedding (so we can visualize it)
torch.manual_seed(42)
embedding = nn.Embedding(len(vocab), 2)
optimizer = torch.optim.Adam(embedding.parameters(), lr=0.05)

# Train: make similar words close, random words far
for epoch in range(500):
    total_loss = 0
    for i, j in similar_pairs:
        vi = embedding(torch.tensor(i))
        vj = embedding(torch.tensor(j))
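        # Pull similar words together
        loss = torch.dist(vi, vj)**2

        # Push a random word away (resample so we never push i or j itself)
        k = torch.randint(len(vocab), (1,)).item()
        while k in (i, j):
            k = torch.randint(len(vocab), (1,)).item()
        vk = embedding(torch.tensor(k))
        loss -= 0.3 * torch.dist(vi, vk)**2
        loss += 2.0  # Constant offset; it shifts the reported loss but has no effect on the gradients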
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

# Visualize the learned embeddings
plt.figure(figsize=(6, 5))
with torch.no_grad():
    for i, word in enumerate(vocab):
        x, y = embedding(torch.tensor(i)).tolist()  # plain Python floats for matplotlib
        plt.plot(x, y, 'o', markersize=10)
        plt.annotate(word, (x + 0.02, y + 0.02), fontsize=12)

plt.title("Learned 2D embeddings: similar words cluster together")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Word Analogies with Real Embeddings

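This section downloads pre-trained **GloVe** embeddings from Stanford. The `glove.6B.zip` archive is a large download (about 822 MB) because it bundles 50-, 100-, 200-, and 300-dimensional vectors; the code below extracts only the 50-dimensional file. These are real word vectors trained on about 6 billion tokens of text. We use **cosine similarity** to compare vectors: it measures whether two vectors point in the same direction, regardless of their length (1.0 = identical direction, 0 = perpendicular, which we read as unrelated, -1 = opposite).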
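As a rough illustration of that scale, here is cosine similarity written out by hand (dot product divided by the product of the two lengths) on a few made-up 2D vectors. The helper `cosine` and the example vectors are hypothetical and have nothing to do with GloVe.

```python
import torch
import torch.nn.functional as F

def cosine(u, v):
    """Dot product divided by the product of the two vector lengths."""
    return (u @ v) / (u.norm() * v.norm())

a = torch.tensor([1.0, 2.0])
same_direction = torch.tensor([3.0, 6.0])   # a scaled by 3: longer, same direction
perpendicular = torch.tensor([-2.0, 1.0])   # at a right angle to a
opposite = -a                               # points the other way

print(cosine(a, same_direction).item())     # close to 1
print(cosine(a, perpendicular).item())      # close to 0
print(cosine(a, opposite).item())           # close to -1

# PyTorch's built-in function agrees with the hand-written version
builtin = F.cosine_similarity(a, same_direction, dim=0)
print(torch.allclose(builtin, cosine(a, same_direction)))
```

Scaling a vector changes its length but not its direction, which is why `same_direction` still scores about 1.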

In [None]:
import urllib.request
import os
import zipfile

glove_path = "glove.6B.50d.txt"
if not os.path.exists(glove_path):
    print("Downloading GloVe embeddings (~822MB zip, this may take a while)...")
    url = "https://nlp.stanford.edu/data/glove.6B.zip"
    urllib.request.urlretrieve(url, "glove.6B.zip")
    with zipfile.ZipFile("glove.6B.zip", 'r') as z:
        z.extract("glove.6B.50d.txt")
    print("Done!")

# Load embeddings
print("Loading embeddings...")
word2vec = {}
with open(glove_path, 'r', encoding='utf-8') as f:  # the file contains non-ASCII tokens
    for line in f:
        parts = line.split()
        word = parts[0]
        vec = torch.tensor([float(x) for x in parts[1:]])
        word2vec[word] = vec
print(f"Loaded {len(word2vec)} word vectors")

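# Stack all vectors into one matrix so each analogy is a single vectorized comparison
glove_words = list(word2vec)
glove_matrix = torch.stack([word2vec[w] for w in glove_words])
word_index = {w: i for i, w in enumerate(glove_words)}

def analogy(a, b, c):
    """a is to b as c is to ??? (computes b - a + c, then finds the nearest word)"""
    target = word2vec[b] - word2vec[a] + word2vec[c]
    sims = torch.cosine_similarity(glove_matrix, target.unsqueeze(0), dim=1)
    # Exclude the three input words so they cannot be returned as the answer
    for w in (a, b, c):
        sims[word_index[w]] = -float("inf")
    return glove_words[sims.argmax().item()]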

print(f"\nking - man + woman = {analogy('man', 'king', 'woman')}")
print(f"paris - france + japan = {analogy('france', 'paris', 'japan')}")
print(f"slower - slow + fast = {analogy('slow', 'slower', 'fast')}")
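The arithmetic inside `analogy` is easier to follow on a tiny hand-made vocabulary. In this sketch the 2D vectors are invented for illustration (the first number stands loosely for "royalty", the second for "gender"); real GloVe vectors are learned and far less tidy. `toy` and `toy_analogy` are hypothetical names.

```python
import torch
import torch.nn.functional as F

# Invented 2D vectors: [royalty, gender]. Not real embeddings.
toy = {
    "king":  torch.tensor([1.0,  1.0]),
    "queen": torch.tensor([1.0, -1.0]),
    "man":   torch.tensor([0.0,  1.0]),
    "woman": torch.tensor([0.0, -1.0]),
    "apple": torch.tensor([-1.0, 0.0]),
}
names = list(toy)
matrix = torch.stack([toy[n] for n in names])    # shape (5, 2)

def toy_analogy(a, b, c):
    """a is to b as c is to ???"""
    target = toy[b] - toy[a] + toy[c]             # e.g. king - man + woman
    sims = F.cosine_similarity(matrix, target.unsqueeze(0), dim=1)
    for w in (a, b, c):                           # never return an input word
        sims[names.index(w)] = -float("inf")
    return names[sims.argmax().item()]

print(toy_analogy("man", "king", "woman"))        # queen
```

Subtracting `man` from `king` leaves only the "royalty" part; adding `woman` then lands exactly on `queen` in this toy space.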

## Tokenization

Modern AI models don't work with whole words — they split text into **tokens**, which are pieces of words. Common words like "the" stay whole, but rare words get broken into smaller chunks. This lets the model handle any word, even ones it's never seen before, by combining familiar pieces.
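To make the idea concrete, here is a toy splitter that always takes the longest piece it recognizes and falls back to single characters. It is a simplified sketch with a hypothetical six-piece vocabulary, not the learned byte-pair encoding that GPT-style tokenizers use.

```python
def toy_tokenize(text, vocab):
    """Greedy longest-match split; unknown characters become their own token."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known piece matches
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])    # nothing matched: emit a single character
            i += 1
    return tokens

# Hypothetical vocabulary of known pieces
toy_vocab = {"the", "cat", "un", "believ", "able", "s"}

print(toy_tokenize("the", toy_vocab))           # a common word stays whole
print(toy_tokenize("unbelievable", toy_vocab))  # a longer word splits into pieces
print(toy_tokenize("cats", toy_vocab))
```

Because of the single-character fallback, the splitter can handle any input string, which mirrors how subword tokenizers avoid "unknown word" failures.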

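`tiktoken` is OpenAI's open-source tokenizer library. Calling `tiktoken.encoding_for_model("gpt-4")` returns the `cl100k_base` encoding that GPT-4 uses. You'll need to install it: `pip install tiktoken`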

In [None]:
# Modern AI models don't work with whole words — they use tokens
# Tokens are pieces of words (subword units)

# pip install tiktoken
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

examples = [
    "Hello, world!",
    "The cat sat on the mat.",
    "Supercalifragilisticexpialidocious",
    "PyTorch is awesome!",
]

for text in examples:
    tokens = enc.encode(text)
    decoded = [enc.decode([t]) for t in tokens]
    print(f"Text: {text!r}")
    print(f"  Token IDs: {tokens}")
    print(f"  Tokens:    {decoded}")
    print(f"  Count:     {len(tokens)} tokens")
    print()

---

*This notebook accompanies [Chapter 4: Embeddings and Vector Spaces](https://learnai.robennals.org/04-embeddings). The interactive widgets in the web version let you explore these concepts visually.*

*New to PyTorch? See the [PyTorch from Scratch](https://learnai.robennals.org/appendix-pytorch) appendix for a beginner-friendly introduction.*