<img src="https://theaiengineer.dev/tae_logo_gw_flatter.png" width=35% align=right>

# Building a Large Language Model from Scratch — A Step-by-Step Guide Using Python and PyTorch
## Chapter 6 — From Words to Vectors
**© Dr. Yves J. Hilpisch**<br>AI-Powered by GPT-5.

## How to Use This Notebook

- Build intuition for distributional semantics with co-occurrence counts and simple embeddings.
- Visualize learned vectors to validate that geometry matches linguistic expectations.
- Evaluate embedding quality with quick intrinsic metrics before using them downstream.

### Roadmap

You will progress from count-based representations to learned embeddings, finishing with visual diagnostics that highlight semantic structure.
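
As a concrete picture of the count-based starting point, the sketch below counts symmetric co-occurrences inside a fixed window. The function name `cooccurrence_counts`, the toy sentence, and the window size of 1 are illustrative assumptions and not part of the notebook's own code.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=1):
    """Count (center, context) pairs within `window` positions on each side."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                counts[(center, tokens[j])] += 1
    return counts

# Hypothetical toy sentence, already lowercased and split on whitespace
tokens = 'hello world hello vectors'.split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[('hello', 'world')], counts[('hello', 'vectors')])
```

Each row of such a count table (one row per center word) can already serve as a sparse vector for that word; learned embeddings compress the same kind of signal into far fewer dimensions.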

### Study Tips

Keep the plots visible as you tweak hyperparameters. Subtle changes in window size or negative sampling show up immediately in the geometry.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'


In [None]:
# Ensure torch is available (Colab friendly)
try:
    import torch  # noqa
    print('torch:', torch.__version__)
except Exception:
    import os
    gpu = os.system('nvidia-smi > /dev/null 2>&1') == 0
    index = (
        'https://download.pytorch.org/whl/cu121'
        if gpu else 'https://download.pytorch.org/whl/cpu'
    )
    get_ipython().run_line_magic('pip', f'install -q torch --index-url {index}')
    import torch
    print('torch:', torch.__version__)


In [None]:
# A tiny corpus (use data/philosophy.txt if you have it)
from pathlib import Path
text_path = Path('mini.txt')
if not text_path.exists():
    text_path.write_text('Hello world. Hello vectors.', encoding='utf-8')
text_path, text_path.read_text(encoding='utf-8')


In [None]:
from __future__ import annotations
from collections import Counter
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List, Dict

@dataclass
class Vocab:
    token_to_id: Dict[str, int]
    id_to_token: List[str]
    pad: int
    unk: int
    @classmethod
    def build(
        cls,
        tokens: Iterable[str],
        min_freq: int = 1,
        specials: Iterable[str] = ('<PAD>', '<UNK>'),
    ) -> 'Vocab':
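        counter = Counter(tokens)
        # Materialize specials so they can be indexed even if passed as a generator
        id_to_token = list(specials)
        special_set = set(id_to_token)
        # Most frequent tokens receive the smallest ids after the specials
        for tok, freq in counter.most_common():
            if freq >= min_freq and tok not in special_set:
                id_to_token.append(tok)
        token_to_id = {t: i for i, t in enumerate(id_to_token)}
        # By convention the first two specials are PAD and UNK
        pad = token_to_id[id_to_token[0]]
        unk = token_to_id[id_to_token[1]]
        return cls(token_to_id, id_to_token, pad, unk)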
    def __len__(self) -> int:
        return len(self.id_to_token)

class SimpleTokenizer:
    """Tiny tokenizer for chapter 6 (char or word level)."""
    def __init__(self, vocab: Vocab, level: str = 'char') -> None:
        assert level in {'char','word'}
        self.vocab = vocab
        self.level = level
        self.pad = vocab.pad
        self.unk = vocab.unk
    @staticmethod
    def _split(text: str, level: str) -> List[str]:
        if level == 'char':
            return list(text)
        # Word level: lowercase alphanumeric runs; punctuation becomes its own token
        out: List[str] = []
        token: List[str] = []
        for ch in text:
            if ch.isalnum():
                token.append(ch.lower())
            else:
                if token:
                    out.append(''.join(token))
                    token = []
                if ch.strip():  # keep punctuation, drop whitespace
                    out.append(ch)
        if token:  # flush the last word
            out.append(''.join(token))
        return out
    @classmethod
    def from_file(
        cls,
        path: str | Path,
        level: str = 'char',
        min_freq: int = 1,
    ) -> 'SimpleTokenizer':
        text = Path(path).read_text(encoding='utf-8')
        tokens = cls._split(text, level)
        vocab = Vocab.build(tokens, min_freq=min_freq)
        return cls(vocab=vocab, level=level)
    def encode(self, text: str) -> List[int]:
        ids = []
        for tok in self._split(text, self.level):
            ids.append(self.vocab.token_to_id.get(tok, self.unk))
        return ids
    def decode(self, ids: Iterable[int]) -> str:
        toks: List[str] = []
        for i in ids:
            if 0 <= i < len(self.vocab.id_to_token):
                tok = self.vocab.id_to_token[i]
                if tok not in {'<PAD>','<UNK>'}:
                    toks.append(tok)
            else:
                toks.append('<UNK>')
        if self.level == 'char':
            return ''.join(toks)
        out: List[str] = []
        for t in toks:
            if not out: out.append(t)
            elif t.isalnum(): out.append(' ' + t)
            else: out.append(t)
        return ''.join(out)


In [None]:
# Build a character-level tokenizer using the class above
tok = SimpleTokenizer.from_file(str(text_path), level='char')
len(tok.vocab), list(tok.vocab.token_to_id.items())[:10]


In [None]:
# Ensure `tok` exists (create if missing)
try:
    tok
except NameError:
    tok = SimpleTokenizer.from_file(str(text_path), level='char')
# Encode and decode
ids = tok.encode('Hello world.')
decoded = tok.decode(ids)
ids, decoded


In [None]:
# Embedding table
E = torch.nn.Embedding(num_embeddings=len(tok.vocab), embedding_dim=8)
E
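
An embedding table is a learnable matrix with one row per token id, and a lookup simply selects rows. The sketch below checks that this selection gives the same result as multiplying one-hot vectors by the weight matrix. The sizes and the ids are hypothetical values chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 6, 4                      # hypothetical sizes
emb = torch.nn.Embedding(vocab_size, dim)
ids = torch.tensor([2, 5, 2])               # hypothetical token ids

looked_up = emb(ids)                                        # (3, 4): row selection
one_hot = F.one_hot(ids, num_classes=vocab_size).float()    # (3, 6)
via_matmul = one_hot @ emb.weight                           # (3, 4): same rows via matmul

same = torch.allclose(looked_up, via_matmul)
print(same)
```

The lookup form is preferred in practice because it avoids materializing the one-hot matrix, which grows with the vocabulary size.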


In [None]:
# Build a small batch of token ids
batch = [tok.encode('Hello'), tok.encode('vectors')]
batch


In [None]:
# Pad to equal length (PAD=0)
P = tok.pad
P


In [None]:
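# Length of the longest sequence in the batch
max_len = max(len(s) for s in batch)
max_len


In [None]:
# Right-pad every sequence with PAD up to max_len
x = torch.tensor([s + [P]*(max_len-len(s)) for s in batch])
x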


In [None]:
# Lookup embeddings
E(x).shape
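
The lookup above also returns trainable vectors for the PAD positions. One option, shown here as a sketch and not as the notebook's own approach, is the `padding_idx` argument of `torch.nn.Embedding`, which keeps the PAD row at zero and excludes it from gradient updates. The table size and the ids are hypothetical.

```python
import torch

PAD = 0  # assumed pad id, matching the PAD=0 convention used above
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=PAD)

x = torch.tensor([[3, 4, PAD, PAD],
                  [5, 6, 7, 8]])            # hypothetical padded batch
out = emb(x)                                # shape (2, 4, 8)
pad_rows_are_zero = bool((out[0, 2:] == 0).all())

# After a backward pass the PAD row receives no gradient
out.sum().backward()
pad_grad_is_zero = bool((emb.weight.grad[PAD] == 0).all())
print(out.shape, pad_rows_are_zero, pad_grad_is_zero)
```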


## Word‑Level Example

In [None]:
# Using SimpleTokenizer defined in a previous cell
tok_w = SimpleTokenizer.from_file(str(text_path), level='word')
len(tok.vocab), len(tok_w.vocab)


In [None]:
tok_w.encode('Hello vectors.')


## Padding Strategies and Masks

In [None]:
P = tok.pad
P


In [None]:
batch = [tok.encode('Hello'), tok.encode('vectors')]
batch


In [None]:
L = max(len(s) for s in batch)
L


In [None]:
right_pad = [s + [P]*(L-len(s)) for s in batch]
right_pad


In [None]:
left_pad  = [[P]*(L-len(s)) + s for s in batch]
left_pad
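
The right-padding comprehension can also be written with PyTorch's `pad_sequence` helper, which takes a list of 1-D tensors of different lengths. This is a minimal sketch with hypothetical token ids; it covers right padding only.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 0  # assumed pad id
seqs = [torch.tensor([5, 6, 7]),
        torch.tensor([8, 9, 10, 11, 12])]   # hypothetical id sequences

# batch_first=True gives shape (batch, time); shorter rows are filled with PAD on the right
padded = pad_sequence(seqs, batch_first=True, padding_value=PAD)
lengths = torch.tensor([len(s) for s in seqs])
print(padded)
print(lengths)
```

Keeping `lengths` alongside the padded tensor makes it possible to rebuild the padding mask later without relying on the PAD id.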


In [None]:
import torch
x = torch.tensor(right_pad)
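pad_mask = (x != P).float()               # (B, T): 1 for real tokens, 0 for PAD
T = x.size(1)
causal = torch.tril(torch.ones(T, T))     # (T, T): position i may attend to j <= i
combined = pad_mask[:, None, :] * causal  # (B, T, T): broadcast over the batch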
pad_mask.shape, causal.shape, combined.shape
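
To show how a combined mask of this kind is typically used, the sketch below applies it to a stand-in score matrix before a softmax. The all-zero `scores` tensor replaces real query-key products, and the token ids are hypothetical; boolean masks are used here instead of float masks so that they can be passed to `masked_fill` directly.

```python
import torch

PAD = 0  # assumed pad id
x = torch.tensor([[4, 5, 6, PAD, PAD],
                  [7, 8, 9, 10, 11]])                      # hypothetical right-padded ids

pad_mask = (x != PAD)                                      # (B, T), True for real tokens
T = x.size(1)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))    # (T, T), True where j <= i
combined = pad_mask[:, None, :] & causal                   # (B, T, T)

scores = torch.zeros(x.size(0), T, T)                      # stand-in for attention scores
scores = scores.masked_fill(~combined, float('-inf'))      # blocked positions get -inf
weights = torch.softmax(scores, dim=-1)                    # blocked positions get weight 0
print(weights[0])
```

With right padding every row keeps at least one allowed key (position 0), so the softmax is well defined. With left padding, rows that belong to PAD queries can be fully masked, and a softmax over only `-inf` values yields NaN, so those rows need separate handling.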


## Exercises

- Train embeddings on a new corpus and compare cosine similarities for a handful of anchor words.
- Experiment with dimensionalities of 16, 64, and 128 to see how expressiveness and overfitting trade off.
- Add an intrinsic evaluation metric (analogy completion or clustering) and report the results.
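
As a starting point for the first exercise, the sketch below compares cosine similarities between a few anchor words. The word list, the `word_to_id` mapping, and the helper `cosine` are hypothetical, and the embedding table is untrained, so the printed values are random until you replace it with trained weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
words = ['hello', 'world', 'vectors']            # hypothetical anchor words
word_to_id = {w: i for i, w in enumerate(words)}
emb = torch.nn.Embedding(len(words), 16)         # untrained, random weights

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the embedding rows of two words."""
    va = emb.weight[word_to_id[a]]
    vb = emb.weight[word_to_id[b]]
    return F.cosine_similarity(va, vb, dim=0).item()

with torch.no_grad():
    # One entry per unordered pair of distinct words
    sims = {(a, b): cosine(a, b) for a in words for b in words if a < b}

for pair, value in sims.items():
    print(pair, round(value, 3))
```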
