
Exercises:
- E01: train a trigram language model, i.e. take two characters as an input to predict the 3rd one. Feel free to use either counting or a neural net. Evaluate the loss. Did it improve over a bigram model?
- E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?
- E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
- E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels - wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?
- E05: look up and use F.cross_entropy instead. You should achieve the same result. Can you think of why we'd prefer to use F.cross_entropy instead?
- E06: meta-exercise! Think of a fun/interesting exercise and complete it.
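E01 allows either counting or a neural net. As a point of comparison for the neural-net cells in this notebook, a counting-based trigram model could be sketched roughly as follows. The word list `toy_words` is a hypothetical stand-in for `names.txt`, and add-one smoothing is an assumption made here so that unseen trigrams do not get zero probability.

```python
import torch

# Hypothetical toy word list; the notebook itself reads names.txt instead.
toy_words = ["emma", "olivia", "ava", "isabella", "sophia"]

chars = ["."] + sorted(set("".join(toy_words)))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# N[i, j, k] = number of times character k follows the pair (i, j)
N = torch.zeros((V, V, V), dtype=torch.int32)
for w in toy_words:
    padded = "." + w + "."
    for c1, c2, c3 in zip(padded, padded[1:], padded[2:]):
        N[stoi[c1], stoi[c2], stoi[c3]] += 1

# Add-one smoothing (an assumption), then normalize over the last axis
# to get P(k | i, j)
P = (N + 1).float()
P = P / P.sum(dim=2, keepdim=True)

# Average negative log-likelihood over all trigrams in the toy data
log_likelihood = 0.0
n = 0
for w in toy_words:
    padded = "." + w + "."
    for c1, c2, c3 in zip(padded, padded[1:], padded[2:]):
        log_likelihood += torch.log(P[stoi[c1], stoi[c2], stoi[c3]]).item()
        n += 1
nll = -log_likelihood / n
print(f"trigrams: {n}, average nll: {nll:.4f}")
```

The same average negative log-likelihood can be computed for a bigram table on the same data, which gives the comparison E01 asks for.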


### Tri-gram model

In [35]:
from micrograd import MLP
import torch
import matplotlib.pyplot as plt
import torch.nn.functional as F
%matplotlib inline

In [36]:
words = open("names.txt", "r").read().splitlines()

In [37]:
for w in words[:1]:
    temp = '.' + w + '.'
    # slide a window of 3 over the padded word: 2 context characters, 1 target
    for i in range(len(temp) - 2):
        print("prev:", temp[i:i+2])
        print("next:", temp[i+2])

prev: .e
next: m
prev: em
next: m
prev: mm
next: a
prev: ma
next: .


In [38]:
char = sorted(list(set(''.join(words))))
char = ['.'] + char
char

['.',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [39]:
stoi = {}
for i in range(len(char)):
    stoi[char[i]] = i
stoi

{'.': 0,
 'a': 1,
 'b': 2,
 'c': 3,
 'd': 4,
 'e': 5,
 'f': 6,
 'g': 7,
 'h': 8,
 'i': 9,
 'j': 10,
 'k': 11,
 'l': 12,
 'm': 13,
 'n': 14,
 'o': 15,
 'p': 16,
 'q': 17,
 'r': 18,
 's': 19,
 't': 20,
 'u': 21,
 'v': 22,
 'w': 23,
 'x': 24,
 'y': 25,
 'z': 26}

In [40]:
itoi = {}
for i in range(len(char)):
    itoi[i] = char[i]
itoi

{0: '.',
 1: 'a',
 2: 'b',
 3: 'c',
 4: 'd',
 5: 'e',
 6: 'f',
 7: 'g',
 8: 'h',
 9: 'i',
 10: 'j',
 11: 'k',
 12: 'l',
 13: 'm',
 14: 'n',
 15: 'o',
 16: 'p',
 17: 'q',
 18: 'r',
 19: 's',
 20: 't',
 21: 'u',
 22: 'v',
 23: 'w',
 24: 'x',
 25: 'y',
 26: 'z'}

In [41]:
for w in words[:1]:
    padded = '.' + w + '.'   # local copy, so `words` itself is not modified
    for i in range(len(padded) - 2):
        prev = padded[i:i+2]
        after = padded[i+2]
        print("prev:", stoi[prev[0]], stoi[prev[1]])
        print("next:", stoi[after])

prev: 0 5
next: 13
prev: 5 13
next: 13
prev: 13 13
next: 1
prev: 13 1
next: 0


In [42]:
# reset dataset
words = open("names.txt", "r").read().splitlines()

In [43]:
xs = []
ys = []

for w in words:
    padded = '.' + w + '.'   # pad a local copy instead of overwriting words[j]
    for i in range(len(padded) - 2):
        xs.append([stoi[padded[i]], stoi[padded[i+1]]])   # two context characters
        ys.append(stoi[padded[i+2]])                      # character to predict

In [44]:
xs[0]

[0, 5]

In [45]:
ys[0]

13

In [46]:
xs = torch.tensor(xs)
ys = torch.tensor(ys)
xs.shape, ys.shape

(torch.Size([196113, 2]), torch.Size([196113]))

In [47]:
xs = F.one_hot(xs, num_classes=27).float()
ys = F.one_hot(ys, num_classes=27).float()

In [48]:
xs[0]

tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [49]:
ys[0]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [50]:
xs_combined = []
for i in range(xs.shape[0]):
    xs_combined.append(torch.cat((xs[i][0], xs[i][1]), axis=0))

In [51]:
xs_combined = torch.stack(xs_combined)

In [52]:
print(xs_combined.shape)
print(xs_combined[0])

torch.Size([196113, 54])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
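The loop above concatenates the two one-hot rows of every example one at a time. Since the two rows are already adjacent in memory, a single reshape should give the same `(N, 54)` tensor. A small check, using a hypothetical four-example batch `idx` in place of the full `xs`:

```python
import torch
import torch.nn.functional as F

# Hypothetical mini-batch of index pairs, standing in for the notebook's xs
idx = torch.tensor([[0, 5], [5, 13], [13, 13], [13, 1]])
one_hot = F.one_hot(idx, num_classes=27).float()        # (4, 2, 27)

# Loop-and-concatenate version, as in the cells above
looped = torch.stack([torch.cat((p[0], p[1]), dim=0) for p in one_hot])

# Single reshape: flatten the last two dimensions into one of size 54
flat = one_hot.reshape(one_hot.shape[0], -1)             # (4, 54)

print(flat.shape, torch.equal(looped, flat))
```

On the full dataset the reshape avoids roughly 196k Python-level `torch.cat` calls.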


In [53]:
ys[0]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [54]:
xs_combined.shape, ys.shape

(torch.Size([196113, 54]), torch.Size([196113, 27]))

In [55]:
nn = MLP(27+27,[256,128,27])

In [56]:
output = nn(xs_combined[:10], activation="softmax")

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([1., 0., 0., 0., 0., 0., 0., 

In [57]:
print("output shape:", len(output), len(output[0]), "\n")
print(len(output[0]), output[0])
print("\n",sum(output[0]))

output shape: 10 27 

27 [Value(data=0.06676130969401942, grad=0.0), Value(data=0.009079568342766015, grad=0.0), Value(data=0.06708854209468193, grad=0.0), Value(data=0.06669485239783271, grad=0.0), Value(data=0.06708942817219318, grad=0.0), Value(data=0.06707802573033872, grad=0.0), Value(data=0.06708943983705477, grad=0.0), Value(data=0.015617456442614383, grad=0.0), Value(data=0.009239423088165217, grad=0.0), Value(data=0.009139958654519695, grad=0.0), Value(data=0.06708943247718163, grad=0.0), Value(data=0.009309121956545294, grad=0.0), Value(data=0.009079568342797363, grad=0.0), Value(data=0.009079643882747911, grad=0.0), Value(data=0.009079828905049601, grad=0.0), Value(data=0.0670894398377483, grad=0.0), Value(data=0.0664424537373773, grad=0.0), Value(data=0.009175716785548385, grad=0.0), Value(data=0.06705835889698941, grad=0.0), Value(data=0.009079568342663772, grad=0.0), Value(data=0.009089081770884494, grad=0.0), Value(data=0.06708903688552054, grad=0.0), Value(data=0.019001

In [58]:
loss = nn.cross_entropy_loss(output, ys[:10])
print(loss)
print("number of loss:",len(loss))

[Value(data=4.701720237731934, grad=0.0), Value(data=4.531747817993164, grad=0.0), Value(data=3.7306344509124756, grad=0.0), Value(data=3.667306423187256, grad=0.0), Value(data=4.799770832061768, grad=0.0), Value(data=4.819977283477783, grad=0.0), Value(data=2.789294958114624, grad=0.0), Value(data=4.67116641998291, grad=0.0), Value(data=3.3350625038146973, grad=0.0), Value(data=2.6640658378601074, grad=0.0)]
number of loss: 10


In [59]:
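lr = 0.01
for i in range(len(xs[:10])):
    x = torch.cat((xs[i][0], xs[i][1]), dim=0)   # (54,) input for example i
    label = ys[i]

    logits = nn([x], activation="softmax")
    # NOTE: this call raises the TypeError shown in the output of this cell.
    # In [58] cross_entropy_loss received a batch (a list of per-example outputs)
    # and returned one loss per example. Here logits[0] is a single example's list
    # of Values, so the method presumably ends up calling len() on one Value.
    # Passing `logits` and `ys[i:i+1]` would match the In [58] call pattern.
    loss = nn.cross_entropy_loss(logits[0], label)
    print("loss: ", loss)
    
    # zero grad
    for p in nn.parameters():
        p.grad = 0.0

    # backward   
    loss.backward()
    
    # gradient descent step
    for p in nn.parameters():
        p.data += -lr * p.grad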

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


TypeError: object of type 'Value' has no len()
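The micrograd loop above stops at the loss call. For comparison, the same kind of gradient step can be written in plain PyTorch, which also touches on E05. The sketch below is not the notebook's MLP: it assumes a single linear layer `W` of shape `(54, 27)` and a hypothetical four-example batch. It computes the loss both by hand (softmax followed by negative log-likelihood) and with `F.cross_entropy`, which takes raw logits and integer class indices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in data: 4 trigram examples as (char1, char2) -> next char
x_idx = torch.tensor([[0, 5], [5, 13], [13, 13], [13, 1]])
y_idx = torch.tensor([13, 13, 1, 0])

x = F.one_hot(x_idx, num_classes=27).float().reshape(-1, 54)   # (4, 54)
W = torch.randn((54, 27), requires_grad=True)

lr = 0.1
losses = []
for step in range(50):
    logits = x @ W                                  # (4, 27)

    # Manual softmax + negative log-likelihood ...
    probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)
    manual = -probs[torch.arange(len(y_idx)), y_idx].log().mean()

    # ... versus F.cross_entropy on the raw logits and class indices
    loss = F.cross_entropy(logits, y_idx)

    W.grad = None
    loss.backward()
    with torch.no_grad():
        W -= lr * W.grad
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
print("manual vs F.cross_entropy on last step:", manual.item(), loss.item())
```

The two loss values should agree up to floating-point error. `F.cross_entropy` is usually preferred because it works from logits with a numerically stable log-softmax, so large logits do not overflow in `exp`, and it avoids materializing the intermediate probabilities and one-hot labels.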

##### E02: split up the dataset randomly into 80% train set, 10% dev set, 10% test set. Train the bigram and trigram models only on the training set. Evaluate them on dev and test splits. What can you see?

In [62]:
# Sequential 80/10/10 split over the trigram examples (no shuffling here)

print("dataset x:", xs.shape)
print("dataset y:", ys.shape)

train_percent = int(.8 * xs.shape[0])
train_set_xs = xs[:train_percent]

# Slice from train_percent and from the midpoint themselves (no +1),
# so that no example is skipped at either boundary
rest_xs = xs[train_percent:]
dev_set_xs, test_set_xs = rest_xs[:len(rest_xs)//2], rest_xs[len(rest_xs)//2:]

print("\ntrain_set_xs: ", train_set_xs.shape)
print("dev_set_xs: ", dev_set_xs.shape)
print("test_set_xs: ", test_set_xs.shape)

train_set_ys = ys[:train_percent]

rest_ys = ys[train_percent:]
dev_set_ys, test_set_ys = rest_ys[:len(rest_ys)//2], rest_ys[len(rest_ys)//2:]

print("\ntrain_set_ys: ", train_set_ys.shape)
print("dev_set_ys: ", dev_set_ys.shape)
print("test_set_ys: ", test_set_ys.shape)

dataset x: torch.Size([196113, 2, 27])
dataset y: torch.Size([196113, 27])

train_set_xs:  torch.Size([156890, 2, 27])
dev_set_xs:  torch.Size([19611, 2, 27])
test_set_xs:  torch.Size([19612, 2, 27])

train_set_ys:  torch.Size([156890, 27])
dev_set_ys:  torch.Size([19611, 27])
test_set_ys:  torch.Size([19612, 27])
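E02 asks for a random split, while the cell above slices the examples in their original order. One way to randomize, sketched here with hypothetical stand-in tensors `xs_demo` and `ys_demo`, is to draw a single permutation with `torch.randperm` and index both tensors with it so that inputs and labels stay aligned.

```python
import torch

torch.manual_seed(42)

# Hypothetical stand-in tensors; in the notebook these would be xs and ys
n = 1000
xs_demo = torch.arange(n * 2).reshape(n, 2)
ys_demo = torch.arange(n)

perm = torch.randperm(n)                 # random ordering of example indices
n_train = int(0.8 * n)
n_dev = int(0.1 * n)

train_idx = perm[:n_train]
dev_idx = perm[n_train:n_train + n_dev]
test_idx = perm[n_train + n_dev:]        # whatever remains, so nothing is dropped

x_train, y_train = xs_demo[train_idx], ys_demo[train_idx]
x_dev, y_dev = xs_demo[dev_idx], ys_demo[dev_idx]
x_test, y_test = xs_demo[test_idx], ys_demo[test_idx]

print(len(x_train), len(x_dev), len(x_test))
```

An alternative design is to shuffle the list of words before building the trigrams, so that all trigrams from one name land in the same split.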


In [None]:
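lr = 0.01
epochs = 1
batch_size = 10

for epoch in range(epochs):
    # flatten each (2, 27) pair of one-hot rows into a single 54-dim input
    inputs = [torch.cat((x[0], x[1]), dim=0) for x in train_set_xs[:batch_size]]
    labels = train_set_ys[:batch_size]

    probs = nn(inputs, activation="softmax")
    # Assumption: as in In [58], cross_entropy_loss returns one Value per example,
    # so the per-example losses are averaged into a single scalar before backward()
    losses = nn.cross_entropy_loss(probs, labels)
    mean_loss = sum(losses) * (1.0 / len(losses))

    # zero grad
    for p in nn.parameters():
        p.grad = 0.0

    # backward
    mean_loss.backward()

    # gradient descent step
    for p in nn.parameters():
        p.data += -lr * p.grad

print("loss: ", mean_loss.data)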

In [63]:
import numpy as np

def get_char_from_one_hot(one_hot):
    # the position of the single 1 in the one-hot vector is the character index
    return itoi[int(one_hot.argmax())]

In [None]:

for i in range(len(test_set_xs[:10])):
    
    x = torch.cat((test_set_xs[i][0], test_set_xs[i][1]), dim=0)
    # the MLP is called with a list of inputs elsewhere in this notebook,
    # so wrap x in a list and take the first (only) result
    probs = nn([x], activation="softmax")[0]

    # argmax over the 27 output probabilities
    index = 0
    for j in range(len(probs)):
        if probs[j].data > probs[index].data:
            index = j
    
    print("characters:", get_char_from_one_hot(test_set_xs[i][0]), get_char_from_one_hot(test_set_xs[i][1]))
    print("next char prediction:", itoi[index])
    print("actual answer:", get_char_from_one_hot(test_set_ys[i]))
    print("\n")

In [65]:
# 54 input -> 1 neuron -> 1 output
W = torch.randn((54,1))

train_set_combined = torch.cat((train_set_xs[0][0],train_set_xs[0][1]), axis=0)

print("input shape:",train_set_combined.shape)
print(train_set_combined)
print("\n")

print("layer shape:", W.shape)
#print("layer:", W)
print("\n")
forward = torch.matmul(train_set_combined, W)
print("output shape -> input * W:",forward.shape)
print(forward)
# Single neuron of 54 weights

input shape: torch.Size([54])
tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


layer shape: torch.Size([54, 1])


output shape -> input * W: torch.Size([1])
tensor([-0.2924])


In [66]:
# 54 input -> 27 NEURONS -> 27 output vector
W = torch.randn((54,27))

train_set_combined = [torch.cat((x[0],x[1]), axis=0) for x in train_set_xs[:1]]
train_set_combined = torch.stack(train_set_combined)

In [67]:
print("input shape:",train_set_combined.shape)
print(train_set_combined)
print("\n")

print("layer shape:", W.shape)
print("layer:", W)
print("\n")
forward = torch.matmul(train_set_combined, W)
print("output shape -> input * W:",forward.shape)
print(forward)
# 27 neurons with 54 weights each -> one 27-dim output vector per input

input shape: torch.Size([1, 54])
tensor([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])


layer shape: torch.Size([54, 27])
layer: tensor([[ 1.0635, -1.0277,  0.2941,  ..., -1.5864, -0.6445, -0.1288],
        [-0.9908,  0.4976, -1.1646,  ..., -1.8226,  1.3122, -0.9762],
        [ 0.6450, -0.6897, -0.8464,  ..., -1.3007, -0.6714, -1.4587],
        ...,
        [-0.7110,  0.2368, -1.1992,  ...,  0.5836, -0.0486,  1.0179],
        [-0.3976,  1.1758, -1.8394,  ...,  1.1138, -0.6366, -1.6548],
        [-0.7019, -0.2638,  0.1192,  ...,  0.8710,  0.5048,  0.0938]])


output shape -> input * W: torch.Size([1, 27])
tensor([[ 0.1861, -1.0648, -0.9486, -0.5944,  0.4486,  1.8571, -1.9449, -0.4053,
          2.9364, -0.9600,  0.2285, -2.4027,  1.1911, -0.1877, -1.2792,  1.5652,
         -1.3624, -2.4760, -1.3422

In [68]:
train_set_combined = [torch.cat((x[0],x[1]), axis=0) for x in train_set_xs[:100]]
train_set_combined = torch.stack(train_set_combined)
print("input shape:",train_set_combined.shape)
print(train_set_combined)
print("\n")

print("layer shape:", W.shape)
print("layer:", W)
print("\n")
forward = torch.matmul(train_set_combined, W)
print("output shape -> input * W:",forward.shape)
print(forward)
# The batch size does not affect the layer: the output shape is always (number of inputs, number of neurons)

input shape: torch.Size([100, 54])
tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])


layer shape: torch.Size([54, 27])
layer: tensor([[ 1.0635, -1.0277,  0.2941,  ..., -1.5864, -0.6445, -0.1288],
        [-0.9908,  0.4976, -1.1646,  ..., -1.8226,  1.3122, -0.9762],
        [ 0.6450, -0.6897, -0.8464,  ..., -1.3007, -0.6714, -1.4587],
        ...,
        [-0.7110,  0.2368, -1.1992,  ...,  0.5836, -0.0486,  1.0179],
        [-0.3976,  1.1758, -1.8394,  ...,  1.1138, -0.6366, -1.6548],
        [-0.7019, -0.2638,  0.1192,  ...,  0.8710,  0.5048,  0.0938]])


output shape -> input * W: torch.Size([100, 27])
tensor([[ 0.1861, -1.0648, -0.9486,  ...,  0.0481,  1.3135,  0.4489],
        [ 0.2379,  2.0707,  0.2634,  ...,  3.5340,  4.6433, -0.8014],
        [-0.1425, -0.3339, -0.5058,  ...,  

In [69]:
def softmax(x):
    
    counts = [logit.exp() for logit in x]
    denominator = sum(counts)
    out = [c / denominator for c in counts]
    
    return out

softmax_layer = []
for i in range(len(forward)):
    softmax_layer.append(softmax(forward[i]))

In [70]:
print(torch.tensor(softmax_layer[0]))
print(sum(softmax_layer[0]))
print(len(softmax_layer[0]))

tensor([0.0238, 0.0068, 0.0076, 0.0109, 0.0309, 0.1265, 0.0028, 0.0132, 0.3722,
        0.0076, 0.0248, 0.0018, 0.0650, 0.0164, 0.0055, 0.0945, 0.0051, 0.0017,
        0.0052, 0.0407, 0.0063, 0.0023, 0.0031, 0.0005, 0.0207, 0.0734, 0.0309])
tensor(1.0000)
27
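The list-based `softmax` above processes one row at a time. The same row-wise normalization can be done on the whole tensor at once, either by hand with broadcasting or with `torch.softmax`. A short check, with a random `logits` tensor as a hypothetical stand-in for `forward`:

```python
import torch

torch.manual_seed(0)
logits = torch.randn((100, 27))     # hypothetical stand-in for `forward`

# Row-wise softmax written out by hand
counts = logits.exp()
manual = counts / counts.sum(dim=1, keepdim=True)

# Built-in equivalent
builtin = torch.softmax(logits, dim=1)

print(torch.allclose(manual, builtin), builtin.sum(dim=1)[:3])
```

`keepdim=True` keeps the row sums as shape `(100, 1)`, so the division broadcasts across each row rather than across columns.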


In [71]:
# Shape trace for a batch of 100 inputs (no bias or nonlinearity here, only the matmuls):
# (100, 54) input   @ (54, 128) weights -> (100, 128)
# (100, 128) hidden @ (128, 64) weights -> (100, 64)
# (100, 64) hidden  @ (64, 27) weights  -> (100, 27)

layer_1 = torch.randn((54,128)) # Layer(54,128)
layer_2 = torch.randn((128,64)) # Layer(128, 64)
layer_3 = torch.randn((64,27)) # Layer(64,27)

layer_1_output = torch.matmul(train_set_combined,layer_1)
layer_2_output = torch.matmul(layer_1_output,layer_2)
layer_3_output = torch.matmul(layer_2_output,layer_3)

In [72]:
print(layer_1_output.shape)
print(layer_2_output.shape)
print(layer_3_output.shape)

torch.Size([100, 128])
torch.Size([100, 64])
torch.Size([100, 27])


##### E04: we saw that our 1-hot vectors merely select a row of W, so producing these vectors explicitly feels - wasteful. Can you delete our use of F.one_hot in favor of simply indexing into rows of W?

In [100]:
# Multiplying a one-hot vector by W just selects a row of W: if the 1 sits at index 4,
# every other row is multiplied by 0 and drops out, so the product equals W[4],
# i.e. the weight that each neuron (column) assigns to input 4.

one_hot_27 = train_set_combined[0][:27]
layer_27_neuron = torch.randn((27,27))

print(one_hot_27)

tensor([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])


In [99]:
forward = torch.matmul(one_hot_27, layer_27_neuron)
layer_row_0th = layer_27_neuron[0] # 0th index at one hot is 1

print(forward)
print(layer_row_0th)

tensor([-1.4642, -0.5628,  1.4523,  0.7894, -1.0376,  0.8167,  2.0185, -0.5827,
        -0.3450, -0.1342, -0.3770, -0.4076,  0.5959, -0.9479,  0.4270, -0.4514,
         1.5176, -1.2231, -1.3207, -0.7192,  0.9132, -0.1520, -0.3985,  1.1625,
         0.9509,  1.0119,  0.8860])
tensor([-1.4642, -0.5628,  1.4523,  0.7894, -1.0376,  0.8167,  2.0185, -0.5827,
        -0.3450, -0.1342, -0.3770, -0.4076,  0.5959, -0.9479,  0.4270, -0.4514,
         1.5176, -1.2231, -1.3207, -0.7192,  0.9132, -0.1520, -0.3985,  1.1625,
         0.9509,  1.0119,  0.8860])


In [98]:
forward == layer_row_0th

tensor([True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True, True, True, True, True, True, True, True, True, True,
        True, True, True])
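The check above covers a single 27-dim one-hot vector. For the trigram input, which concatenates two one-hot vectors into 54 dimensions, the same idea would select two rows of a `(54, 27)` weight matrix and add them: one from the first 27 rows for the first character and one from the last 27 rows for the second. A minimal sketch, with `W` and `idx` as hypothetical stand-ins:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W = torch.randn((54, 27))

# Hypothetical batch of (char1, char2) index pairs
idx = torch.tensor([[0, 5], [5, 13], [13, 13], [13, 1]])

# One-hot route: build (4, 54) inputs and multiply
x = F.one_hot(idx, num_classes=27).float().reshape(-1, 54)
via_matmul = x @ W

# Indexing route: the first character selects a row from W[:27],
# the second character selects a row from W[27:]; the two rows are summed
via_index = W[idx[:, 0]] + W[27 + idx[:, 1]]

print(torch.allclose(via_matmul, via_index))
```

With indexing, `xs` can stay as the integer tensor of shape `(N, 2)` and `F.one_hot` is no longer needed for the inputs.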

##### E03: use the dev set to tune the strength of smoothing (or regularization) for the trigram model - i.e. try many possibilities and see which one works best based on the dev set loss. What patterns can you see in the train and dev set loss as you tune this strength? Take the best setting of the smoothing and evaluate on the test set once and at the end. How good of a loss do you achieve?
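One possible shape for this tuning loop, sketched on a counting-based trigram model rather than the neural net: build counts from the training split only, sweep a few smoothing strengths, and pick the one with the lowest dev loss. The word lists and the candidate strengths below are hypothetical, and no numbers from the real dataset are implied.

```python
import torch

# Hypothetical toy splits; the notebook would build these from names.txt
train_words = ["emma", "olivia", "ava", "isabella", "sophia", "mia", "amelia"]
dev_words = ["ella", "maya", "lia"]

chars = ["."] + sorted(set("".join(train_words + dev_words)))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

def trigrams(words):
    """Return a list of (i, j, k) index triples for all trigrams in words."""
    out = []
    for w in words:
        p = "." + w + "."
        out += [(stoi[a], stoi[b], stoi[c]) for a, b, c in zip(p, p[1:], p[2:])]
    return out

train_tri, dev_tri = trigrams(train_words), trigrams(dev_words)

# Counts come from the training split only
N = torch.zeros((V, V, V))
for i, j, k in train_tri:
    N[i, j, k] += 1

def nll(tri, smoothing):
    """Average negative log-likelihood of tri under add-`smoothing` counts."""
    P = N + smoothing
    P = P / P.sum(dim=2, keepdim=True)
    idx = torch.tensor(tri)
    return -P[idx[:, 0], idx[:, 1], idx[:, 2]].log().mean().item()

# Hypothetical grid of smoothing strengths
results = {s: (nll(train_tri, s), nll(dev_tri, s)) for s in [0.01, 0.1, 1.0, 10.0]}
for s, (tr, dv) in results.items():
    print(f"smoothing={s:<5} train nll={tr:.4f} dev nll={dv:.4f}")

best = min(results, key=lambda s: results[s][1])
print("best smoothing on dev:", best)
```

In this setup the training loss can only get worse as smoothing grows, because the unsmoothed counts are the maximum-likelihood fit to the training data, while the dev loss typically improves up to some point and then degrades. The test split would be evaluated once, at the end, with `best`.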