# Makemore: Implement the paper MLP

Following [Bengio et al. 2003](https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and [Andrej Karpathy's makemore](https://github.com/karpathy/makemore) repo, we implement the model step by step across the following notebooks:

1. `makemore-practice-part1-paper-mlp`: 
    - Demonstrate how statistical models can be transformed into scalable nn models, for example turning a bigram model into a nn from scratch.
2. `makemore-practice-part2-paper-mlp`: 
    - Implement the nn from the paper.
    - Introduce basic concepts: creating a dataset for language models, embeddings, tensor indexing, matrix operations, train-test split, model fitting, mini-batches, stochastic gradients, visualizing learned weights for model explanation, and prediction.
3. `makemore-practice-part3-paper-mlp`: 
    - Diagnose nn initialization (after linear/non-linear transformations) and apply tricks for a good init, for example: the ideal initial loss, avoiding overconfidence by scaling, and gain.
    - Introduce batch norm from scratch to stabilize nn training
4. `makemore-practice-part4-paper-mlp`: 
    - Restructure the code to mimic PyTorch.
    - Then introduce nn training diagnostics such as weight, grad, grad-to-weight ratio, and the log of the grad-std to weight-std ratio
5. `makemore-practice-part5-paper-mlp`:
    - Everything from scratch

In [1]:
import math
import os
import random
import string

import matplotlib.pyplot as plt
import numpy as np
import torch

from typing import Any
from graphviz import Digraph
from torch.nn import functional as F

os.chdir("/home/project")
%matplotlib inline

## 1. Make the process more like PyTorch

**1.1 Model global settings**

In [2]:
block_size = 3
v_size = 27
n_embed = 8
n_hidden = 64
g = torch.Generator().manual_seed(2147483647)

**1.2 Read and Prepare Data**

In [8]:
def read_txt(nrows: int = None):
    words = []
    with open("practice/makemore/name.txt", "r") as f:
        if nrows:
            for _ in range(nrows):
                words.append(f.readline().splitlines()[0])
        else:
            for line in f:
                words.append(line.splitlines()[0])
    return words


def get_lookup_map():
    lowercases = string.ascii_lowercase
    stoi = {s: i for i, s in enumerate(lowercases, start=1)}
    stoi["."] = 0
    itos = {val: key for key, val in stoi.items()}
    return stoi, itos


def preprocess(words: list, stoi: dict, size: int = 3):
    x, y = [], []
    for word in words:
        context = [0] * size
        for s in word + ".":
            idx = stoi[s]
            x.append(context)
            y.append(idx)
            context = context[1:] + [idx]
    x = torch.tensor(x)
    y = torch.tensor(y)
    print(x.shape, y.shape)
    return (x, y)


def train_eval_test_split(X: torch.Tensor, Y: torch.Tensor) -> tuple:
    # 80% train / 10% eval / 10% test; only the training rows are shuffled,
    # eval and test keep the original file order
    train_cnt = int(X.shape[0] * 0.8)
    test_cnt = int(X.shape[0] * 0.9)  # index where the test split starts
    idx = torch.randperm(train_cnt)
    x_train, y_train = X[idx], Y[idx]
    x_eval, y_eval = X[train_cnt:test_cnt], Y[train_cnt:test_cnt]
    x_test, y_test = X[test_cnt:], Y[test_cnt:]
    return x_train, y_train, x_eval, y_eval, x_test, y_test


words = read_txt(None)
stoi, itos = get_lookup_map()
X, Y = preprocess(words=words[:], stoi=stoi, size=block_size)
x_train, y_train, x_eval, y_eval, x_test, y_test = train_eval_test_split(X, Y)
print(f"Rows: {len(words)}")
print(f"Top 5 rows: {words[:5]}")
print(stoi)
print(itos)
print(x_train[:5])
print(y_train[:5])
print(x_eval[:5])
print(y_eval[:5])

torch.Size([228146, 3]) torch.Size([228146])
Rows: 32033
Top 5 rows: ['emma', 'olivia', 'ava', 'isabella', 'sophia']
{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '.': 0}
{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
tensor([[26,  9,  1],
        [12,  5,  1],
        [ 0, 13,  9],
        [ 9,  5,  1],
        [13,  5, 18]])
tensor([ 8,  8, 12, 14,  1])
tensor([[ 0,  0, 26],
        [ 0, 26, 21],
        [26, 21,  8],
        [21,  8,  1],
        [ 8,  1,  9]])
tensor([21,  8,  1,  9,  2])
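
To make the sliding context window concrete, here is a minimal pure-Python sketch of what `preprocess` does for a single word. The helper name `build_examples` is introduced only for this illustration; it mirrors the loop above without creating tensors.

```python
import string


def build_examples(word, stoi, size=3):
    """Return (context, target) index pairs for one word, padded with '.' (index 0)."""
    context = [0] * size
    pairs = []
    for ch in word + ".":
        idx = stoi[ch]
        pairs.append((list(context), idx))
        context = context[1:] + [idx]  # drop the oldest character, append the new one
    return pairs


stoi = {s: i for i, s in enumerate(string.ascii_lowercase, start=1)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}

pairs = build_examples("emma", stoi)
for context, target in pairs:
    print("".join(itos[i] for i in context), "->", itos[target])
```

A word of length k yields k + 1 examples, because the closing `.` is also a prediction target. Summed over all words, this is how 32033 names become 228146 rows.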


In [9]:
g = torch.Generator().manual_seed(2147483647)  # for reproducibility
C = torch.randn((v_size, n_embed), generator=g)
# Layer 1
W1 = (
    torch.randn((n_embed * block_size, n_hidden), generator=g)
    * (5 / 3)
    / ((n_embed * block_size) ** 0.5)
)
b1 = (
    torch.randn(n_hidden, generator=g) * 0.1
)  # using b1 just for fun, it's useless because of BN
# Layer 2
W2 = torch.randn((n_hidden, v_size), generator=g) * 0.1
b2 = torch.randn(v_size, generator=g) * 0.1
# BatchNorm parameters
bngain = torch.randn((1, n_hidden)) * 0.1 + 1.0
bnbias = torch.randn((1, n_hidden)) * 0.1

# Note: I am initializing many of these parameters in non-standard ways
# because sometimes initializing with e.g. all zeros could mask an incorrect
# implementation of the backward pass.

parameters = [C, W1, b1, W2, b2, bngain, bnbias]
print(sum(p.nelement() for p in parameters))  # number of parameters in total
for p in parameters:
    p.requires_grad = True

3699
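
As a sanity check on the printed total, the count can be reproduced from the tensor shapes alone. This is a small sketch using only the settings from section 1.1; the `shapes` dictionary is an illustration and not part of the notebook.

```python
import math

block_size, v_size, n_embed, n_hidden = 3, 27, 8, 64

# Shapes of every trainable tensor defined in the cell above
shapes = {
    "C": (v_size, n_embed),
    "W1": (n_embed * block_size, n_hidden),
    "b1": (n_hidden,),
    "W2": (n_hidden, v_size),
    "b2": (v_size,),
    "bngain": (1, n_hidden),
    "bnbias": (1, n_hidden),
}

counts = {name: math.prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())

for name, count in counts.items():
    print(f"{name:>6}: {count}")
print("total:", total)
```

The two weight matrices `W1` and `W2` account for most of the parameters.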


In [10]:
batch_size = 32
n = batch_size # a shorter variable also, for convenience
# construct a minibatch
ix = torch.randint(0, x_train.shape[0], (batch_size,), generator=g)
x_train_batch, y_train_batch = x_train[ix], y_train[ix] # batch X,Y

In [11]:
# forward pass, "chunkated" into smaller steps that are possible to backward one at a time

emb = C[x_train_batch] # embed the characters into vectors
embcat = emb.view(emb.shape[0], -1) # concatenate the vectors
# Linear layer 1
hprebn = embcat @ W1 + b1 # hidden layer pre-activation
# BatchNorm layer
bnmeani = 1/n*hprebn.sum(0, keepdim=True)
bndiff = hprebn - bnmeani
bndiff2 = bndiff**2
bnvar = 1/(n-1)*(bndiff2).sum(0, keepdim=True) # note: Bessel's correction (dividing by n-1, not n)
bnvar_inv = (bnvar + 1e-5)**-0.5
bnraw = bndiff * bnvar_inv
hpreact = bngain * bnraw + bnbias
# Non-linearity
h = torch.tanh(hpreact) # hidden layer
# Linear layer 2
logits = h @ W2 + b2 # output layer
# cross entropy loss (same as F.cross_entropy(logits, y_train_batch))
logit_maxes = logits.max(1, keepdim=True).values
norm_logits = logits - logit_maxes # subtract max for numerical stability
counts = norm_logits.exp()
counts_sum = counts.sum(1, keepdims=True)
counts_sum_inv = counts_sum**-1 # if I use (1.0 / counts_sum) instead then I can't get backprop to be bit exact...
probs = counts * counts_sum_inv
logprobs = probs.log()
loss = -logprobs[range(n), y_train_batch].mean()

# PyTorch backward pass
for p in parameters:
  p.grad = None
for t in [logprobs, probs, counts, counts_sum, counts_sum_inv, # afaik there is no cleaner way
          norm_logits, logit_maxes, logits, h, hpreact, bnraw,
         bnvar_inv, bnvar, bndiff2, bndiff, hprebn, bnmeani,
         embcat, emb]:
  t.retain_grad()
loss.backward()
loss

tensor(3.3522, grad_fn=<NegBackward0>)
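
The batch-norm steps in the forward pass can be followed on a single hidden unit. The sketch below is a pure-Python illustration, assuming one column of pre-activations; `batchnorm_column` is a name made up for this example. It uses the same Bessel-corrected variance (divide by n-1) as the cell above.

```python
import statistics


def batchnorm_column(values, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one column of a batch, step by step as in the forward pass."""
    n = len(values)
    mean = sum(values) / n                     # bnmeani
    diff = [v - mean for v in values]          # bndiff
    var = sum(d * d for d in diff) / (n - 1)   # bnvar, with Bessel's correction
    var_inv = (var + eps) ** -0.5              # bnvar_inv
    raw = [d * var_inv for d in diff]          # bnraw
    return [gain * r + bias for r in raw]      # hpreact


column = [0.2, -1.0, 3.5, 0.7]
out = batchnorm_column(column)

# statistics.variance divides by n-1, statistics.pvariance divides by n
print("sample variance of input:    ", statistics.variance(column))
print("population variance of input:", statistics.pvariance(column))
print("mean of output:              ", statistics.mean(out))
print("sample variance of output:   ", statistics.variance(out))
```

With gain 1 and bias 0 the output has mean close to 0 and sample variance slightly below 1; the small gap comes from `eps`.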

In [12]:
counts.shape, counts_sum_inv.shape

(torch.Size([32, 27]), torch.Size([32, 1]))
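
Two properties of the cross-entropy steps above are easy to check in isolation: subtracting the row maximum does not change the loss, and uniform predictions over 27 characters give a loss of ln 27 ≈ 3.2958, which is close to the initial loss printed above. The following is a minimal pure-Python sketch for one row; the function name `cross_entropy` is local to this example.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), max-shifted for stability."""
    shift = max(logits)
    counts = [math.exp(l - shift) for l in logits]
    counts_sum_inv = sum(counts) ** -1
    return -math.log(counts[target] * counts_sum_inv)


logits = [1.0, -0.5, 3.0]
loss = cross_entropy(logits, target=2)

# Adding a constant to every logit leaves the loss unchanged;
# without the shift, math.exp(1003.0) would raise OverflowError.
loss_offset = cross_entropy([l + 1000.0 for l in logits], target=2)

# Baseline: identical logits for all 27 characters
uniform_loss = cross_entropy([0.0] * 27, target=5)

print(loss, loss_offset)
print(uniform_loss, math.log(27))
```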

In [None]:
# Exercise 1: backprop through the whole thing manually, 
# backpropagating through exactly all of the variables 
# as they are defined in the forward pass above, one by one

# d loss / d logprobs
# loss = -logprobs[range(n), y_train_batch].mean() reads only one entry per row,
# so every other entry of logprobs does not affect the loss and gets grad == 0.
# [a1, a2, a3] -> mean() -> each picked entry has local derivative 1/3 (here 1/n)
dlogprobs = torch.zeros_like(logprobs)
dlogprobs[range(n), y_train_batch] = -1 / batch_size

# d loss / d probs =  (d logprobs / d probs) * (d loss / d logprobs)
dprobs = probs ** -1 * dlogprobs

# d loss / d counts_sum_inv =  (d probs / d counts_sum_inv) * (d loss / d probs)
# d loss / d counts = (d probs / d counts) * (d loss / d probs)
# counts.shape, counts_sum_inv.shape -> (torch.Size([32, 27]), torch.Size([32, 1]))
# probs = counts * counts_sum_inv broadcasts counts_sum_inv across the 27 columns:
# a11 * b1, a12 * b1, a13 * b1
# a21 * b2, a22 * b2, a23 * b2 -> sum over the columns (dim 1) for dcounts_sum_inv
dcounts_sum_inv = (counts * dprobs).sum(1, keepdim=True)
dcounts = counts_sum_inv * dprobs

# d loss / d counts_sum = (d counts_sum_inv / d counts_sum) * (d loss / d counts_sum_inv)
# counts_sum_inv = counts_sum**-1, so its local derivative is -counts_sum**-2
dcounts_sum = -(counts_sum**-2) * dcounts_sum_inv

# d loss / d counts = (d counts_sum / d counts) * (d loss / d counts_sum)
dcounts += torch.ones_like(counts) * dcounts_sum

# d loss / d norm_logits = (d counts / d norm_logits) * (d loss / d counts)
dnorm_logits = counts * dcounts

# d loss / d logits = (d norm_logits / d logits) * (d loss / d norm_logits)
# d loss / d logit_maxes = (d norm_logits / d logit_maxes) * (d loss / d norm_logits)
# logits.shape, logit_maxes.shape -> (torch.Size([32, 27]), torch.Size([32, 1]))
dlogits = (1 - 0)  * dnorm_logits.clone()
dlogit_maxes = (0 - 1) * dnorm_logits.sum(1, keepdim=True)

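# d loss / d logits += (d logit_maxes / d logits) * (d loss / d logit_maxes)
# each row max depends only on its argmax entry, so route dlogit_maxes there
dlogits += F.one_hot(logits.max(1).indices, num_classes=logits.shape[1]) * dlogit_maxes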
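
For reference, the chain-rule steps for the cross-entropy part can be written out compactly. The symbols are shorthand introduced here, not notation from the notebook: $z_{ij}$ is `logits`, $m_i$ is `logit_maxes`, $c_{ij}$ is `counts`, $s_i$ is `counts_sum`, $p_{ij}$ is `probs`, $y_i$ is the target index of row $i$, and $n$ is the batch size.

```latex
\begin{aligned}
L &= -\frac{1}{n}\sum_{i=1}^{n}\log p_{i,y_i},
\qquad p_{ij} = c_{ij}\,s_i^{-1},
\qquad s_i = \sum_j c_{ij},
\qquad c_{ij} = e^{\,z_{ij}-m_i} \\[4pt]
\frac{\partial L}{\partial \log p_{ij}} &= -\frac{1}{n}\,\mathbb{1}[j=y_i] \\[4pt]
\frac{\partial L}{\partial p_{ij}} &= \frac{1}{p_{ij}}\,\frac{\partial L}{\partial \log p_{ij}} \\[4pt]
\frac{\partial L}{\partial s_i^{-1}} &= \sum_j c_{ij}\,\frac{\partial L}{\partial p_{ij}} \\[4pt]
\frac{\partial L}{\partial s_i} &= -\,s_i^{-2}\,\frac{\partial L}{\partial s_i^{-1}} \\[4pt]
\frac{\partial L}{\partial c_{ij}} &= s_i^{-1}\,\frac{\partial L}{\partial p_{ij}} + \frac{\partial L}{\partial s_i} \\[4pt]
\frac{\partial L}{\partial (z_{ij}-m_i)} &= c_{ij}\,\frac{\partial L}{\partial c_{ij}}
= \frac{1}{n}\left(p_{ij}-\mathbb{1}[j=y_i]\right) \\[4pt]
\frac{\partial L}{\partial m_i} &= -\sum_j \frac{\partial L}{\partial (z_{ij}-m_i)}
= -\frac{1}{n}\Big(\sum_j p_{ij} - 1\Big) = 0
\end{aligned}
```

Because the gradient with respect to the row maximum is zero analytically, the total gradient with respect to the logits reduces to $\frac{1}{n}\left(p_{ij}-\mathbb{1}[j=y_i]\right)$. In floating point, `dlogit_maxes` is only approximately zero.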

In [14]:
logits.shape, logit_maxes.shape

(torch.Size([32, 27]), torch.Size([32, 1]))
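
One way to gain confidence in a manual backward pass is to compare it against a numerical gradient. The sketch below does this for a single row of logits in pure Python, following the same intermediate steps as the exercise. The function names and the example values are assumptions made for this illustration, and it uses finite differences rather than the autograd comparison that the notebook sets up with `retain_grad()`.

```python
import math


def forward(logits, y):
    """Softmax cross-entropy for one example, split into small steps."""
    logit_max = max(logits)
    norm_logits = [l - logit_max for l in logits]
    counts = [math.exp(l) for l in norm_logits]
    counts_sum = sum(counts)
    counts_sum_inv = counts_sum ** -1
    probs = [c * counts_sum_inv for c in counts]
    loss = -math.log(probs[y])
    return loss, (counts, counts_sum, counts_sum_inv, probs)


def manual_backward(logits, y):
    """Backpropagate through each forward step by hand."""
    _, (counts, counts_sum, counts_sum_inv, probs) = forward(logits, y)
    dlogprobs = [0.0] * len(logits)
    dlogprobs[y] = -1.0  # batch of one, so the mean contributes a factor of 1
    dprobs = [dlp / p for dlp, p in zip(dlogprobs, probs)]
    dcounts_sum_inv = sum(c * dp for c, dp in zip(counts, dprobs))
    dcounts = [counts_sum_inv * dp for dp in dprobs]
    dcounts_sum = -(counts_sum ** -2) * dcounts_sum_inv
    dcounts = [dc + dcounts_sum for dc in dcounts]
    dnorm_logits = [c * dc for c, dc in zip(counts, dcounts)]
    dlogits = list(dnorm_logits)
    dlogit_max = -sum(dnorm_logits)
    dlogits[logits.index(max(logits))] += dlogit_max  # route to the argmax entry
    return dlogits


def numerical_grad(logits, y, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, y)[0] - forward(down, y)[0]) / (2 * eps))
    return grads


logits, y = [0.5, -1.2, 2.0, 0.1], 1
manual = manual_backward(logits, y)
numeric = numerical_grad(logits, y)
probs = forward(logits, y)[1][3]
closed_form = [p - (1.0 if i == y else 0.0) for i, p in enumerate(probs)]

print("max |manual - numeric|:    ", max(abs(a - b) for a, b in zip(manual, numeric)))
print("max |manual - closed form|:", max(abs(a - b) for a, b in zip(manual, closed_form)))
```

The closed form `probs - one_hot(y)` is the standard softmax cross-entropy gradient, so agreement with it suggests the step-by-step chain is consistent.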

In [1]:
import torch
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt


training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 26421880/26421880 [00:01<00:00, 14157045.80it/s]


Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 29515/29515 [00:00<00:00, 258821.05it/s]


Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 4422102/4422102 [00:00<00:00, 4768799.52it/s]


Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 5148/5148 [00:00<00:00, 4869706.13it/s]


Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw



In [3]:
type(training_data)

torchvision.datasets.mnist.FashionMNIST

In [5]:
training_data.data.shape

torch.Size([60000, 28, 28])

In [7]:
training_data.__dict__.keys()

dict_keys(['root', 'transform', 'target_transform', 'transforms', 'train', 'data', 'targets'])

In [9]:
training_data.targets.shape

torch.Size([60000])

In [14]:
training_data.transform

ToTensor()

In [15]:
ToTensor

torchvision.transforms.transforms.ToTensor
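
The inspection above shows `training_data.data` holding the raw `[60000, 28, 28]` array while the transform is stored as a separate attribute. A reasonable reading is that the transform is applied per item when the dataset is indexed, not to the stored array. The sketch below illustrates that pattern in pure Python; `TinyImageDataset` and `scale_to_unit` are hypothetical names for this example and are not torchvision APIs.

```python
class TinyImageDataset:
    """Hypothetical stand-in for a map-style dataset: raw storage plus a lazy transform."""

    def __init__(self, data, targets, transform=None):
        self.data = data            # raw pixel values, stored untouched (0..255)
        self.targets = targets
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        image, label = self.data[index], self.targets[index]
        if self.transform is not None:
            image = self.transform(image)  # applied per item, on access
        return image, label


def scale_to_unit(image):
    """Toy transform: map 0..255 integers to floats in [0, 1]."""
    return [[pixel / 255 for pixel in row] for row in image]


raw = [[[0, 255], [51, 102]]]  # one 2x2 "image"
dataset = TinyImageDataset(raw, targets=[9], transform=scale_to_unit)

image, label = dataset[0]
print(dataset.data[0])  # raw values are unchanged
print(image, label)     # transformed copy returned by indexing
```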