# Exercises from Makemore Part 1
## 1. Build a trigram language model
E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss; Did it improve over a bigram model?
### 1.1 Parse Dataset

In [39]:
import torch

In [65]:
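# NOTE: `words` is loaded from names.txt in cell In [50] below; run that cell first
# Character vocabulary: the 26 letters plus '.' as the start/end token
chars = set("".join(words))
chars.add('.')
chars = sorted(chars)
stoi = {s:i for i,s in enumerate(chars)}
itos = {i:s for s,i in stoi.items()}

# Two-character contexts: all 27*27 pairs except '..'
twochars = set()
for a in chars:
    for b in chars:
        twochars.add(a+b)
twochars = sorted(twochars)
twochars.pop(0)  # drop '..' (it sorts first), leaving 728 contexts
sstoi = {s:i for i, s in enumerate(twochars)}
itoss = {i:s for s,i in sstoi.items()}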

sstoi

{'.a': 0,
 '.b': 1,
 '.c': 2,
 '.d': 3,
 '.e': 4,
 '.f': 5,
 '.g': 6,
 '.h': 7,
 '.i': 8,
 '.j': 9,
 '.k': 10,
 '.l': 11,
 '.m': 12,
 '.n': 13,
 '.o': 14,
 '.p': 15,
 '.q': 16,
 '.r': 17,
 '.s': 18,
 '.t': 19,
 '.u': 20,
 '.v': 21,
 '.w': 22,
 '.x': 23,
 '.y': 24,
 '.z': 25,
 'a.': 26,
 'aa': 27,
 'ab': 28,
 'ac': 29,
 'ad': 30,
 'ae': 31,
 'af': 32,
 'ag': 33,
 'ah': 34,
 'ai': 35,
 'aj': 36,
 'ak': 37,
 'al': 38,
 'am': 39,
 'an': 40,
 'ao': 41,
 'ap': 42,
 'aq': 43,
 'ar': 44,
 'as': 45,
 'at': 46,
 'au': 47,
 'av': 48,
 'aw': 49,
 'ax': 50,
 'ay': 51,
 'az': 52,
 'b.': 53,
 'ba': 54,
 'bb': 55,
 'bc': 56,
 'bd': 57,
 'be': 58,
 'bf': 59,
 'bg': 60,
 'bh': 61,
 'bi': 62,
 'bj': 63,
 'bk': 64,
 'bl': 65,
 'bm': 66,
 'bn': 67,
 'bo': 68,
 'bp': 69,
 'bq': 70,
 'br': 71,
 'bs': 72,
 'bt': 73,
 'bu': 74,
 'bv': 75,
 'bw': 76,
 'bx': 77,
 'by': 78,
 'bz': 79,
 'c.': 80,
 'ca': 81,
 'cb': 82,
 'cc': 83,
 'cd': 84,
 'ce': 85,
 'cf': 86,
 'cg': 87,
 'ch': 88,
 'ci': 89,
 'cj': 90,
 'ck': 91

In [50]:
words = open("names.txt", 'r').read().splitlines()

In [66]:
xs, ys = [], []
# Inputs are two-character contexts such as '.a' or 'ab'; sstoi holds 26*26 + 26 + 26 = 728 of them
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        xs.append(sstoi[ch1+ch2])
        ys.append(stoi[ch3])

In [None]:
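# Sanity check: decode each (context, target) pair back to characters.
# '..' was dropped from twochars, so indices are shifted by one and
# x // 27, x % 27 would decode the wrong pair; look the pair up in itoss instead.

for x, y in zip(xs, ys):
    pair = itoss[x]
    print(f"{pair[0]} {pair[1]} -> {itos[y]}")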
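
Because `'..'` is removed from `twochars`, every context index is shifted by one relative to its position in the full 27 x 27 grid. The following is a minimal sketch of that arithmetic, assuming the vocabulary is `.` followed by `a` to `z`; `encode_pair` and `decode_pair` are hypothetical helper names that do not appear in the notebook.

```python
import string

# Assumed vocabulary: '.' at index 0, then 'a'..'z' at 1..26
chars = ['.'] + list(string.ascii_lowercase)
stoi = {s: i for i, s in enumerate(chars)}

# All 27*27 pairs in sorted order, minus '..' which sorts first
twochars = sorted(a + b for a in chars for b in chars)
twochars.pop(0)
sstoi = {s: i for i, s in enumerate(twochars)}

def encode_pair(pair):
    """Position in the full 27 x 27 grid, shifted by one for the dropped '..'."""
    return 27 * stoi[pair[0]] + stoi[pair[1]] - 1

def decode_pair(x):
    """Inverse of encode_pair: undo the shift before dividing."""
    return chars[(x + 1) // 27] + chars[(x + 1) % 27]

# The closed form agrees with the lookup table for every context
assert all(encode_pair(p) == i for p, i in sstoi.items())
assert all(decode_pair(i) == p for p, i in sstoi.items())
```

In practice the `itoss` dictionary is the simpler way to decode; the closed form is mainly useful for checking that the table is laid out as expected.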

### 1.2 Initialize neural network

In [None]:
# Input: two characters
# Output: next character, then continue until the end character
# Need a 728 x 27 tensor for weights

In [67]:
xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples:', num)

# Initialize the 'network': one row of 27 logits per two-character context
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((728, 27), generator=g, requires_grad=True)

number of examples: 196113


### 1.3 Train neural network

In [74]:
import torch.nn.functional as F

# Gradient descent
for i in range(100):

    # Forward pass
    xenc = F.one_hot(xs, num_classes=728).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01*(W**2).mean()

    # Backward pass
    W.grad = None
    loss.backward()

    # Update weights
    with torch.no_grad():
        W += -5 * W.grad

    if (i+1) % 20 == 0:
        print(f"Loss at iteration {i+1}: {loss.item()}")

Loss at iteration 20: 2.361903429031372
Loss at iteration 40: 2.358445882797241
Loss at iteration 60: 2.355074405670166
Loss at iteration 80: 2.351785182952881
Loss at iteration 100: 2.348576307296753


In [89]:
loss

tensor(2.3486, grad_fn=<AddBackward0>)

### 1.4 Generate from neural network

In [147]:
g = torch.Generator().manual_seed(2147483647)

for i in range(5):
    start = 'd'
    out = [start]
    ix = sstoi['.' + start]  # initial context '.d'; contexts '.a' to '.z' have indices 0 to 25
    while True:
        logits = W[[ix]]  # row of logits for the current two-character context
        counts = logits.exp()  # counts, equivalent to N
        p = counts / counts.sum(1, keepdims=True)  # probabilities for next character

        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        ch = itos[ix]  # renamed from `next` to avoid shadowing the builtin
        if ix == 0:  # '.' marks the end of the name
            break
        out.append(ch)
        ix = sstoi[start + ch]  # slide the context forward by one character
        start = ch
    print(''.join(out))

daexze
daogjkus
dila
da
dah
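
E01 also asks whether the trigram model improves on a bigram model, and the notebook does not record a bigram loss to compare against. One way to make the comparison is with count-based models, sketched below under several assumptions: `toy_words` is a hypothetical stand-in for `names.txt`, `ngram_nll` is a hypothetical helper, and each name is padded with `context_len` start tokens so that both models score exactly the same targets. The notebook's trigram examples, by contrast, start at the second character, so its loss is not directly comparable to a bigram loss that also scores the first character.

```python
import math
from collections import Counter

def ngram_nll(words, context_len, alpha=1.0, vocab_size=27):
    """Average negative log-likelihood of a count-based model with add-alpha smoothing.

    context_len=1 gives a bigram model, context_len=2 a trigram model.
    The model is fit and evaluated on the same words (training loss).
    """
    pair_counts, ctx_counts = Counter(), Counter()
    examples = []
    for w in words:
        chs = ['.'] * context_len + list(w) + ['.']
        for i in range(context_len, len(chs)):
            ctx, target = ''.join(chs[i - context_len:i]), chs[i]
            examples.append((ctx, target))
            pair_counts[(ctx, target)] += 1
            ctx_counts[ctx] += 1
    total = 0.0
    for ctx, target in examples:
        p = (pair_counts[(ctx, target)] + alpha) / (ctx_counts[ctx] + alpha * vocab_size)
        total -= math.log(p)
    return total / len(examples)

toy_words = ["emma", "olivia", "ava", "isabella", "sophia"]  # hypothetical sample
bigram_loss = ngram_nll(toy_words, context_len=1)
trigram_loss = ngram_nll(toy_words, context_len=2)
print(f"bigram NLL:  {bigram_loss:.4f}")
print(f"trigram NLL: {trigram_loss:.4f}")
```

With very light smoothing, the trigram training loss cannot exceed the bigram training loss, because the longer context contains the shorter one. Whether the gain carries over to held-out names is what the split in the next section is for.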


## 2. Split the dataset
E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [93]:
import random

words = open("names.txt", 'r').read().splitlines()
random.shuffle(words)
numwords = len(words)

train = words[:int(numwords*0.8)]
val = words[int(numwords*0.8):int(numwords*0.9)]
test = words[int(numwords*0.9):]
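
`random.shuffle` without a seed produces a different split on every run, so the losses recorded below will not reproduce exactly. A minimal sketch of a seeded split follows; `split_words`, its default seed, and the `demo` list are hypothetical and not part of the notebook.

```python
import random

def split_words(words, seed=42, train_frac=0.8, dev_frac=0.1):
    """Shuffle a copy of `words` with a fixed seed and cut it into train/dev/test."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)
    n1 = int(len(shuffled) * train_frac)
    n2 = int(len(shuffled) * (train_frac + dev_frac))
    return shuffled[:n1], shuffled[n1:n2], shuffled[n2:]

demo = [f"name{i}" for i in range(20)]  # placeholder data
tr, dev, te = split_words(demo)
print(len(tr), len(dev), len(te))
```

Using a private `random.Random` instance keeps the split independent of any other code that touches the global random state.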

In [135]:
xs, ys = [], []
# Inputs are two-character contexts such as '.a' or 'ab'; sstoi holds 26*26 + 26 + 26 = 728 of them
for w in train:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        xs.append(sstoi[ch1+ch2])
        ys.append(stoi[ch3])

xs = torch.tensor(xs)
ys = torch.tensor(ys)
num = xs.nelement()
print('number of examples:', num)

# Initialize the 'network': one row of 27 logits per two-character context
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((728, 27), generator=g, requires_grad=True)

number of examples: 156942


In [136]:
import torch.nn.functional as F

# Gradient descent
for i in range(200):

    # Forward pass
    logits = W[xs]
    loss = F.cross_entropy(logits, ys) + 0.1*(W**2).mean()

    # Backward pass
    W.grad = None
    loss.backward()

    # Update weights
    with torch.no_grad():
        W += -50 * W.grad

    if (i+1) % 20 == 0:
        print(f"Loss at iteration {i+1}: {loss.item()}")


Loss at iteration 20: 3.027872323989868
Loss at iteration 40: 2.761096954345703
Loss at iteration 60: 2.6268115043640137
Loss at iteration 80: 2.5431036949157715
Loss at iteration 100: 2.485710620880127
Loss at iteration 120: 2.4437415599823
Loss at iteration 140: 2.411585807800293
Loss at iteration 160: 2.3861031532287598
Loss at iteration 180: 2.3653981685638428
Loss at iteration 200: 2.3482425212860107


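In [137]:
# `loss` here still includes the 0.1 * L2 penalty from the training loop above
print('Training loss including the L2 penalty:', loss.item())

Training loss including the L2 penalty: 2.3482425212860107


In [116]:
# Subtracting the penalty leaves the data term (NLL).
# This output was recorded after training with the penalty coefficient set to 1.
print('Training NLL, L2 strength 1:', loss.item() - 1*(W**2).mean().item())

Training NLL, L2 strength 1: 2.3131976425647736


In [139]:
# Same subtraction for the run shown above (coefficient 0.1)
print('Training NLL, L2 strength 0.1:', loss.item() - 0.1*(W**2).mean().item())

Training NLL, L2 strength 0.1: 2.2590897500514986


In [127]:
# Recorded after training with the penalty coefficient set to 0.01
print('Training NLL, L2 strength 0.01:', loss.item() - 0.01*(W**2).mean().item())

Training NLL, L2 strength 0.01: 2.2583543026447295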


In [140]:
# Validation and testing loss

valxs, valys = [], []
# Inputs are two-character contexts such as '.a' or 'ab'; sstoi holds 26*26 + 26 + 26 = 728 of them

for w in val:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        valxs.append(sstoi[ch1+ch2])
        valys.append(stoi[ch3])

valxs = torch.tensor(valxs)
valys = torch.tensor(valys)

valnum = valxs.nelement()
print('number of validation examples:', valnum)

logits = W[valxs]
loss = F.cross_entropy(logits, valys)
print(f"Validation loss: {loss.item()}")

number of validation examples: 19529
Validation loss: 2.2621829509735107


In [141]:
testxs, testys = [], []

for w in test:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        testxs.append(sstoi[ch1+ch2])
        testys.append(stoi[ch3])

testxs = torch.tensor(testxs)
testys = torch.tensor(testys)

testnum = testxs.nelement()
print('number of test examples:', testnum)

logits = W[testxs]
loss = F.cross_entropy(logits, testys)
print(f"Test loss: {loss.item()}")

number of test examples: 19642
Test loss: 2.2907931804656982


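Without L2 regularization, the validation loss was about 2.28 and the test loss about 2.29.
With an L2 strength of 1, the training and validation losses rose to about 2.31.
The run shown above (strength 0.1) gives a validation loss of 2.262 and a test loss of 2.291.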
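
The three dataset-building loops in this section differ only in the word list they iterate over, so they could be folded into one function. The following is a sketch, not the notebook's code: `build_dataset` is a hypothetical helper, and the vocabulary is rebuilt inline only to keep the example self-contained.

```python
import string

def build_dataset(words, sstoi, stoi):
    """Return (context indices, target indices) for a trigram model as Python lists."""
    xs, ys = [], []
    for w in words:
        chs = ['.'] + list(w) + ['.']
        for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
            xs.append(sstoi[ch1 + ch2])
            ys.append(stoi[ch3])
    return xs, ys

# Assumed vocabulary: '.' plus 'a'..'z', with '..' dropped from the contexts
chars = ['.'] + list(string.ascii_lowercase)
stoi = {s: i for i, s in enumerate(chars)}
twochars = sorted(a + b for a in chars for b in chars)[1:]
sstoi = {s: i for i, s in enumerate(twochars)}

demo_xs, demo_ys = build_dataset(["ava"], sstoi, stoi)
print(demo_xs, demo_ys)
```

In the notebook, the returned lists would be wrapped with `torch.tensor` once per split, for example `xs, ys = map(torch.tensor, build_dataset(train, sstoi, stoi))`.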

## 3. Smoothing the trigram model
E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
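
No solution is recorded for this exercise. A possible approach is to retrain the weight table once per candidate strength and compare the dev losses, as in the sketch below. Everything here is an assumption for illustration: `train_and_eval` is a hypothetical helper, the step count and learning rate are arbitrary, and the tensors are random placeholder data rather than the real splits.

```python
import torch
import torch.nn.functional as F

def train_and_eval(xs, ys, dev_xs, dev_ys, reg, steps=100, lr=10.0, n_ctx=728, n_out=27):
    """Train one weight table with L2 strength `reg`.

    Returns (train NLL, dev NLL), both measured without the penalty term.
    """
    g = torch.Generator().manual_seed(2147483647)
    W = torch.randn((n_ctx, n_out), generator=g, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(W[xs], ys) + reg * (W ** 2).mean()
        W.grad = None
        loss.backward()
        with torch.no_grad():
            W -= lr * W.grad
    with torch.no_grad():
        return F.cross_entropy(W[xs], ys).item(), F.cross_entropy(W[dev_xs], dev_ys).item()

# Placeholder data (random indices), only to make the sketch runnable
g = torch.Generator().manual_seed(0)
toy_xs = torch.randint(0, 728, (500,), generator=g)
toy_ys = torch.randint(0, 27, (500,), generator=g)
toy_dev_xs = torch.randint(0, 728, (100,), generator=g)
toy_dev_ys = torch.randint(0, 27, (100,), generator=g)

strengths = [0.0, 0.01, 0.1, 1.0]
results = {reg: train_and_eval(toy_xs, toy_ys, toy_dev_xs, toy_dev_ys, reg) for reg in strengths}
best_reg = min(results, key=lambda r: results[r][1])  # select on dev loss only
for reg, (train_nll, dev_nll) in results.items():
    print(f"reg={reg}: train {train_nll:.4f}, dev {dev_nll:.4f}")
print("best strength by dev loss:", best_reg)
```

With the real data, `toy_xs`, `toy_ys` would be replaced by the training tensors and `toy_dev_xs`, `toy_dev_ys` by `valxs`, `valys`. The test set would then be evaluated once, using only the model trained with `best_reg`.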

## 4. Deleting one-hot
E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
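
The training loop in Section 2 already uses `W[xs]` in place of the one-hot product. The equivalence can be checked directly, as in this small sketch; the seed and the three example indices are arbitrary choices.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(0)
W = torch.randn((728, 27), generator=g)
idx = torch.tensor([0, 48, 594])  # three example context indices

# One-hot route: builds a (3, 728) matrix, then multiplies by the (728, 27) weights
logits_onehot = F.one_hot(idx, num_classes=728).float() @ W
# Indexing route: reads rows 0, 48 and 594 of W directly
logits_index = W[idx]

print(torch.allclose(logits_onehot, logits_index))
```

Indexing avoids allocating the one-hot matrix, which for the full training set in Section 1 has 196113 x 728 entries.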

## 5. Cross entropy instead of NLL
E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
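
One reason to prefer `F.cross_entropy` is numerical stability: exponentiating large logits by hand overflows, while the library routine works in log space. The sketch below compares the two routes on moderate and on large logits; `manual_nll` is a hypothetical name for the softmax-plus-NLL computation used in the Section 1.3 training loop, and the logit values are made up.

```python
import torch
import torch.nn.functional as F

def manual_nll(logits, ys):
    """Explicit softmax followed by the mean negative log-likelihood."""
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)
    return -probs[torch.arange(ys.nelement()), ys].log().mean()

targets = torch.tensor([0, 2])

# Moderate logits: both routes give the same value
small = torch.tensor([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
print(manual_nll(small, targets).item(), F.cross_entropy(small, targets).item())

# Large logits: exp(100) overflows float32, so the manual route returns nan
large = torch.tensor([[100.0, 0.0, 0.0], [0.0, 0.0, 100.0]])
print(manual_nll(large, targets).item(), F.cross_entropy(large, targets).item())
```

`F.cross_entropy` also avoids materializing the intermediate `counts` and `probs` tensors, which keeps the forward and backward passes cheaper.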

## 6. Meta-exercise
E06: meta-exercise! Think of a fun/interesting exercise and complete it.