# Makemore Part 2: MLP Character-Level Language Model

This notebook extends the bigram model to a context window of several characters. It follows the approach of [Bengio et al. 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf): embed each character into a learned vector space, then predict the next character with a single-hidden-layer MLP.

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

## Data Loading and Vocabulary

In [None]:
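# Read one name per line; the context manager closes the file afterwards
with open("names.txt", 'r') as f:
    words = f.read().splitlines()

# Vocabulary: letters get indices 1..N, and "." (index 0) marks name boundaries
chars = sorted(set("".join(words)))
stoi = {s:i+1 for i,s in enumerate(chars)}
stoi["."] = 0
itos = {i:s for s,i in stoi.items()}
print(f'{len(words)} names, {len(stoi)} unique characters')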

## Build the Dataset

Each training example pairs a context window of `block_size` characters with the character that follows it. At the start of each name the context is padded with `.` (index 0), and a trailing `.` is appended as the final target so the model learns where names end.
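
To make the windowing concrete, here is a minimal sketch that prints the examples generated from a single name. The helper `build_examples` and the four-symbol `toy_stoi` vocabulary are hypothetical and used only for illustration; the notebook itself builds `stoi` from `names.txt`.

```python
def build_examples(word, stoi, block_size=3):
    """Return (context, target) index pairs for one word, padding with index 0."""
    context = [0] * block_size
    pairs = []
    for ch in word + ".":
        idx = stoi[ch]
        pairs.append((list(context), idx))
        context = context[1:] + [idx]  # slide the window one character forward
    return pairs

# Hypothetical toy vocabulary, just large enough to spell "emma"
toy_stoi = {".": 0, "a": 1, "e": 2, "m": 3}
toy_itos = {i: s for s, i in toy_stoi.items()}

pairs = build_examples("emma", toy_stoi)
for context, target in pairs:
    print("".join(toy_itos[i] for i in context), "->", toy_itos[target])
# ... -> e
# ..e -> m
# .em -> m
# emm -> a
# mma -> .
```

A name of length n therefore contributes n + 1 examples: one per character, plus one for the closing `.`.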

In [None]:
block_size = 3
X, Y = [], []

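for w in words:
    context = [0] * block_size  # every name starts from an all-padding context
    for ch in w + '.':  # the trailing '.' is the end-of-name target
        idx = stoi[ch]
        X.append(context)
        Y.append(idx)
        context = context[1:] + [idx]  # drop the oldest index, append the newest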

X = torch.tensor(X)
Y = torch.tensor(Y)
print(f'Dataset: {X.shape[0]} examples, context size {X.shape[1]}')

## Model Architecture

- **Embedding layer** `C`: maps each of the 27 tokens (the letters plus `.`) to a 2-dimensional vector
- **Hidden layer**: 100 neurons with `tanh` activation; its input is the concatenated embeddings of the context (3 characters × 2 dimensions = 6 inputs)
- **Output layer**: projects the 100 hidden activations to 27 logits (one per token)
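
As a sanity check on the layer sizes listed above, the parameter count can be worked out by hand. The sketch below uses hypothetical names (`n_vocab`, `n_embd`, `n_context`, `n_hidden`) for the sizes described in the list; the notebook hard-codes these numbers instead.

```python
import math

# Sizes taken from the architecture description (variable names are illustrative)
n_vocab = 27     # tokens, including "."
n_embd = 2       # embedding dimensions per character
n_context = 3    # characters in the context window
n_hidden = 100   # hidden-layer neurons

shapes = {
    "C":  (n_vocab, n_embd),               # embedding table
    "W1": (n_context * n_embd, n_hidden),  # hidden-layer weights
    "b1": (n_hidden,),                     # hidden-layer bias
    "W2": (n_hidden, n_vocab),             # output-layer weights
    "b2": (n_vocab,),                      # output-layer bias
}

for name, shape in shapes.items():
    print(f"{name:>2}: {shape} -> {math.prod(shape)} parameters")

total = sum(math.prod(shape) for shape in shapes.values())
print(total)  # 54 + 600 + 100 + 2700 + 27 = 3481
```

The output layer dominates the count, since its weight matrix alone holds 2,700 of the 3,481 parameters.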

In [None]:
C = torch.randn((27, 2))
W1 = torch.randn((6, 100))
b1 = torch.randn(100)
W2 = torch.randn((100, 27))
b2 = torch.randn(27)

parameters = [C, W1, b1, W2, b2]
for p in parameters:
    p.requires_grad = True

print(f'{sum(p.nelement() for p in parameters)} total parameters')

## Training

In [None]:
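for i in range(1000):
    # Forward pass over the full dataset (N examples)
    emb = C[X]  # (N, 3, 2): one 2D embedding per context character
    h = torch.tanh(emb.view(emb.shape[0], 6) @ W1 + b1)  # (N, 100)
    logits = h @ W2 + b2  # (N, 27)
    loss = F.cross_entropy(logits, Y)

    # Backward pass: reset gradients, then backpropagate
    for p in parameters:
        p.grad = None
    loss.backward()

    # Gradient descent step with learning rate 0.1
    for p in parameters:
        p.data += -0.1 * p.grad

    # The reported loss is the one computed before this step's update
    if i % 100 == 0:
        print(f'Step {i:4d} | Loss = {loss.item():.4f}')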

print(f'Final loss: {loss.item():.4f}')
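
To interpret the loss values printed above, it can help to see the arithmetic behind `F.cross_entropy` written out. The following is a plain-Python sketch of the same quantity (the mean negative log-probability that softmax assigns to the correct class), not PyTorch's actual implementation; the function name `cross_entropy` here is only illustrative.

```python
import math

def cross_entropy(logits_batch, targets):
    """Mean negative log-likelihood of the targets under softmax(logits)."""
    total = 0.0
    for logits, target in zip(logits_batch, targets):
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)

# A model that spreads probability uniformly over 27 tokens scores log(27)
uniform_loss = cross_entropy([[0.0] * 27], [5])
print(f"{uniform_loss:.4f}")  # 3.2958
```

This gives a useful reference point: a loss near 3.30 means the model is doing no better than guessing uniformly among the 27 tokens, so training is only helping once the loss drops below that.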