# Building A Multi Layer Perceptron

## Building Training Dataset

In [7]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plot
%matplotlib inline

In [2]:
words = open('../data/names.txt', 'r').read().splitlines()
words[:8]

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte', 'mia', 'amelia']

In [3]:
len(words)

32033

In [14]:
# build vocab and mappings
chars = sorted(list(set(''.join(words))))
chtoi = {ch: i + 1 for i, ch in enumerate(chars)} # char to int mapping
chtoi['.'] = 0 # add separator
itoch = {i:ch for ch, i in chtoi.items()} # int to char mapping
print(itoch)

{1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 0: '.'}
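As a quick sanity check of the two mappings, the sketch below rebuilds them from a tiny stand-in word list and round-trips one word. The word list is an assumption so that the snippet runs without `names.txt`, which also means the indices differ from the full 27-token vocabulary above. The helper names `encode` and `decode` are hypothetical and not part of the notebook.

```python
# Stand-in for the real word list (assumption: lets the snippet run without names.txt)
sample_words = ['emma', 'olivia', 'ava']

chars = sorted(set(''.join(sample_words)))
chtoi = {ch: i + 1 for i, ch in enumerate(chars)}  # char to int, index 0 is reserved
chtoi['.'] = 0                                     # separator token
itoch = {i: ch for ch, i in chtoi.items()}         # int to char

def encode(word):
    """Hypothetical helper: map each character (plus the end marker) to its index."""
    return [chtoi[ch] for ch in word + '.']

def decode(indices):
    """Hypothetical helper: map indices back to characters."""
    return ''.join(itoch[i] for i in indices)

ids = encode('ava')
print(ids, '->', decode(ids))
```

Encoding followed by decoding should return the original word with the `.` marker appended.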


In [15]:
# building the dataset
block_size = 3 

`block_size` is the context length, i.e. how many characters the model takes as input to predict the next one.
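To get a feel for what the context length controls, here is a small sketch that lists the (context, target) pairs of one word for two different values of `block_size`. The function `context_windows` is a hypothetical helper written only for this illustration, and it works on plain strings rather than integer indices.

```python
def context_windows(word, block_size, pad='.'):
    """Hypothetical helper: list (context, target) string pairs for one word."""
    context = pad * block_size          # start with a fully padded context
    pairs = []
    for ch in word + pad:               # the trailing pad marks the end of the word
        pairs.append((context, ch))
        context = context[1:] + ch      # slide the window one character to the right
    return pairs

for size in (2, 4):
    print(f'block_size = {size}')
    for context, target in context_windows('emma', size):
        print(f'  {context} -> {target}')
```

A larger `block_size` gives the model more history per prediction, while the number of examples per word stays the same.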

In [17]:
X, Y = [], []

for w in words[:3]:
    print(w)
    context = [0] * block_size ## start with padded context of 0s
    for ch in w + '.':
        ix = chtoi[ch]
        X.append(context)
        Y.append(ix)
        print(''.join(itoch[i] for i in context), '->', itoch[ix])
        context = context[1:] + [ix] # update the context window

emma
... -> e
..e -> m
.em -> m
emm -> a
mma -> .
olivia
... -> o
..o -> l
.ol -> i
oli -> v
liv -> i
ivi -> a
via -> .
ava
... -> a
..a -> v
.av -> a
ava -> .


In [20]:
print(f'inputs:\n{X}\n')
print(f'targets:\n{Y}\n')

inputs:
[[0, 0, 0], [0, 0, 5], [0, 5, 13], [5, 13, 13], [13, 13, 1], [0, 0, 0], [0, 0, 15], [0, 15, 12], [15, 12, 9], [12, 9, 22], [9, 22, 9], [22, 9, 1], [0, 0, 0], [0, 0, 1], [0, 1, 22], [1, 22, 1]]

targets:
[5, 13, 13, 1, 0, 15, 12, 9, 22, 9, 1, 0, 1, 22, 1, 0]



In [21]:
X = torch.tensor(X)
Y = torch.tensor(Y)

Notice that the single word "emma" yields 5 input/target examples: one for each of its 4 characters, plus one for the end marker `.`.

In [22]:
X.shape, X.dtype, Y.shape, Y.dtype

(torch.Size([16, 3]), torch.int64, torch.Size([16]), torch.int64)
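The steps of this section can be collected into one function. The version below is only a sketch: `build_dataset` is a hypothetical name, and the three-word list with its reduced vocabulary is a stand-in so that the snippet runs without `names.txt`.

```python
import torch

def build_dataset(words, chtoi, block_size=3):
    """Hypothetical helper: turn a list of words into (X, Y) index tensors."""
    X, Y = [], []
    for w in words:
        context = [0] * block_size               # padded context of separator tokens
        for ch in w + '.':
            ix = chtoi[ch]
            X.append(context)
            Y.append(ix)
            context = context[1:] + [ix]         # slide the context window
    return torch.tensor(X), torch.tensor(Y)

sample_words = ['emma', 'olivia', 'ava']          # stand-in for the first three names
chars = sorted(set(''.join(sample_words)))        # note: a reduced vocab, not all 26 letters
chtoi = {ch: i + 1 for i, ch in enumerate(chars)}
chtoi['.'] = 0

X, Y = build_dataset(sample_words, chtoi, block_size=3)
print(X.shape, Y.shape)
# each word of length n contributes n + 1 examples
print(sum(len(w) + 1 for w in sample_words))
```

The number of rows in `X` equals the sum of `len(w) + 1` over all words, which matches the 16 examples seen above for these three names.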

## Building The Embedding Table

In [24]:
C = torch.randn((27, 2)) # creating a 2D vector space for a 27 token vocab size
C 

tensor([[-2.5095e-01,  6.3268e-01],
        [ 6.7109e-01, -6.0272e-01],
        [ 1.6476e-01, -1.6241e+00],
        [-4.4414e-01, -2.6809e-01],
        [ 1.9324e+00,  1.2073e+00],
        [-9.7633e-01, -1.6983e+00],
        [-5.4336e-01,  8.6897e-01],
        [ 2.1105e-03,  4.4206e-01],
        [-3.1510e-01, -1.7038e+00],
        [ 8.8760e-01, -1.7088e+00],
        [ 1.1976e+00,  4.5825e-01],
        [ 3.7455e-01,  3.0131e-01],
        [-1.2639e-01, -3.1526e-01],
        [-7.3973e-01,  1.1733e+00],
        [-2.5549e-01,  2.6838e-01],
        [-5.1717e-01, -9.5439e-01],
        [ 2.3148e+00, -6.8375e-01],
        [-9.4651e-01,  1.0856e+00],
        [-3.5373e-01,  1.7227e+00],
        [-1.2434e+00, -5.8994e-01],
        [ 2.1959e-01,  1.4281e+00],
        [-7.1352e-01, -1.0851e-01],
        [ 1.4134e+00,  9.6002e-01],
        [-2.3733e+00, -4.9556e-01],
        [-1.1405e-01, -1.9218e+00],
        [ 1.3865e+00,  1.0302e+00],
        [-5.4251e-01, -3.1156e+00]])

Embedding a single integer in the embedding lookup table

In [25]:
C[5] # row corresponding to index 5 in the embedding table

tensor([-0.9763, -1.6983])

In [26]:
F.one_hot(torch.tensor(5), num_classes=27) # encoding integer 5 to a vector

tensor([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

In [28]:
F.one_hot(torch.tensor(5), num_classes=27).float() @ C

tensor([-0.9763, -1.6983])

Notice that the output is identical. Because every entry of the one-hot vector is 0 except the one at index 5, the matrix product discards every row of C except row 5.

So the embedding of an integer can be seen either as indexing into a lookup table C, or as the first layer of a larger neural network: a layer whose neurons have no non-linearity and whose weight matrix is C, fed with one-hot encoded integers.
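The same equivalence holds for a whole batch of contexts, not just a single integer. The sketch below checks this with a randomly initialised stand-in for `C` and a small hand-written index tensor; the seed and the variable names `idx`, `by_lookup` and `by_matmul` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                         # arbitrary seed, only for reproducibility
C = torch.randn((27, 2))                     # stand-in embedding table
idx = torch.tensor([[0, 0, 5], [0, 5, 13]])  # two contexts of three indices each

by_lookup = C[idx]                                       # shape (2, 3, 2)
by_matmul = F.one_hot(idx, num_classes=27).float() @ C   # (2, 3, 27) @ (27, 2)

print(by_lookup.shape, by_matmul.shape)
print(torch.allclose(by_lookup, by_matmul))
```

Indexing is the cheaper of the two views, since it never materialises the mostly-zero one-hot tensor.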

Getting embeddings of multiple integers

In [29]:
C[[5, 6, 7]]

tensor([[-0.9763, -1.6983],
        [-0.5434,  0.8690],
        [ 0.0021,  0.4421]])

In [31]:
C[torch.tensor([5, 6, 7])]

tensor([[-0.9763, -1.6983],
        [-0.5434,  0.8690],
        [ 0.0021,  0.4421]])

In [33]:
# we can also index with multi dim tensors
C[X][:3]

tensor([[[-0.2509,  0.6327],
         [-0.2509,  0.6327],
         [-0.2509,  0.6327]],

        [[-0.2509,  0.6327],
         [-0.2509,  0.6327],
         [-0.9763, -1.6983]],

        [[-0.2509,  0.6327],
         [-0.9763, -1.6983],
         [-0.7397,  1.1733]]])

In [36]:
X.shape

torch.Size([16, 3])

In [37]:
C[X].shape # every integer in X is replaced by its 2D embedding

torch.Size([16, 3, 2])

In [40]:
# example 
print(X[13, 2])
print(C[X][13, 2])
print(C[1])

tensor(1)
tensor([ 0.6711, -0.6027])
tensor([ 0.6711, -0.6027])


In [41]:
# so embedding all ints
emb = C[X]
emb.shape

torch.Size([16, 3, 2])

## Implementing The Hidden Layer

Following the paper, each example feeds 3 embeddings into this layer, and in our case each embedding has 2 dimensions, so the layer has 3 × 2 = 6 inputs. The number of neurons is up to us; let's use 100.

In [45]:
W1 = torch.randn((6, 100)) # weights
B1 = torch.randn(100) # biases

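We want to compute `emb @ W1 + B1`, but each example's three embeddings are stacked along their own dimension, so `emb` has shape (16, 3, 2) and cannot be multiplied with the (6, 100) matrix `W1`. We first need to concatenate the three embeddings of each example, i.e. turn the (16, 3, 2) tensor into a (16, 6) tensor.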
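A minimal sketch of the shape problem, using random stand-ins for `emb` and `W1` (the variable `raised` is only there to record what happened):

```python
import torch

emb = torch.randn((16, 3, 2))   # stand-in for the embedded inputs C[X]
W1 = torch.randn((6, 100))

raised = False
try:
    emb @ W1                    # last dim of emb is 2, but W1 has 6 input rows
except RuntimeError as err:
    raised = True
    print('matmul failed:', err)

# after concatenating the 3 embeddings of each example the shapes line up
flat = torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], dim=1)
out = flat @ W1
print(flat.shape, out.shape)
```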

In [47]:
emb[:, 0, :].shape # embeddings of the first context position, for all 16 examples

torch.Size([16, 2])

In [48]:
torch.cat([emb[:, 0, :], emb[:, 1, :], emb[:, 2, :]], 1).shape # concatenate the seq across dim 1

torch.Size([16, 6])

In [51]:
# to generalize the above operation
len(torch.unbind(emb, 1)) # tuple of tensors, the same slices as in the above operation

3

In [52]:
# so now we can do the general operation
torch.cat(torch.unbind(emb, 1), 1).shape

torch.Size([16, 6])

Turns out there is an even better way of doing this.

In [54]:
example = torch.arange(18)
example

tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17])

In [56]:
# the same 18 elements can be viewed as tensors of different shapes
example.view(2, 9)

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7,  8],
        [ 9, 10, 11, 12, 13, 14, 15, 16, 17]])

In [59]:
example.view(3, 3, 2) # the product of the dimensions must equal the number of elements (18)

tensor([[[ 0,  1],
         [ 2,  3],
         [ 4,  5]],

        [[ 6,  7],
         [ 8,  9],
         [10, 11]],

        [[12, 13],
         [14, 15],
         [16, 17]]])

In PyTorch, `.view` is extremely efficient because a tensor's data is stored as a flat 1D sequence in memory, and attributes of the tensor (its shape and strides) give that sequence its dimensionality. Calling `.view` only changes these attributes; the underlying data is neither copied nor moved.
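One way to see that a view shares memory with the original tensor is to compare data pointers and strides, and then write through the view. This is a small sketch on the `example` tensor from above; the name `viewed` is an assumption.

```python
import torch

example = torch.arange(18)
viewed = example.view(3, 3, 2)

# both tensors point at the same memory; only shape and stride differ
print(example.data_ptr() == viewed.data_ptr())
print(example.shape, example.stride())
print(viewed.shape, viewed.stride())

# writing through the view is visible in the original tensor
viewed[0, 0, 0] = 100
print(example[0])
```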

In [65]:
emb.shape

torch.Size([16, 3, 2])

In [63]:
emb.view(16, 6)

tensor([[-0.2509,  0.6327, -0.2509,  0.6327, -0.2509,  0.6327],
        [-0.2509,  0.6327, -0.2509,  0.6327, -0.9763, -1.6983],
        [-0.2509,  0.6327, -0.9763, -1.6983, -0.7397,  1.1733],
        [-0.9763, -1.6983, -0.7397,  1.1733, -0.7397,  1.1733],
        [-0.7397,  1.1733, -0.7397,  1.1733,  0.6711, -0.6027],
        [-0.2509,  0.6327, -0.2509,  0.6327, -0.2509,  0.6327],
        [-0.2509,  0.6327, -0.2509,  0.6327, -0.5172, -0.9544],
        [-0.2509,  0.6327, -0.5172, -0.9544, -0.1264, -0.3153],
        [-0.5172, -0.9544, -0.1264, -0.3153,  0.8876, -1.7088],
        [-0.1264, -0.3153,  0.8876, -1.7088,  1.4134,  0.9600],
        [ 0.8876, -1.7088,  1.4134,  0.9600,  0.8876, -1.7088],
        [ 1.4134,  0.9600,  0.8876, -1.7088,  0.6711, -0.6027],
        [-0.2509,  0.6327, -0.2509,  0.6327, -0.2509,  0.6327],
        [-0.2509,  0.6327, -0.2509,  0.6327,  0.6711, -0.6027],
        [-0.2509,  0.6327,  0.6711, -0.6027,  1.4134,  0.9600],
        [ 0.6711, -0.6027,  1.4134,  0.9600,  0.6711, -0.6027]])
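The following sketch checks, on a random stand-in for `emb`, that the `.view` result matches the concatenation approach element for element, and that only the concatenation allocates new memory. The names `via_cat` and `via_view` are assumptions.

```python
import torch

emb = torch.randn((16, 3, 2))                       # stand-in for C[X]

via_cat = torch.cat(torch.unbind(emb, 1), 1)        # allocates a new tensor
via_view = emb.view(16, 6)                          # reuses emb's memory

print(torch.equal(via_cat, via_view))
print(via_view.data_ptr() == emb.data_ptr())
print(via_cat.data_ptr() == emb.data_ptr())
```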

So we can get the hidden states of the hidden layer by:

In [66]:
h = emb.view(16, 6) @ W1 + B1

In [67]:
h.shape

torch.Size([16, 100])

In [69]:
# we're still hardcoding the number of examples (16); to fix that we can do
h = emb.view(-1, 6) @ W1 + B1 # -1 tells PyTorch to infer that dimension from the number of elements
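The feature count 6 is still written by hand in `emb.view(-1, 6)`. One possible variant, sketched here on random stand-in tensors, reads the number of examples from `emb` and lets PyTorch infer the flattened feature size instead, so the line keeps working if `block_size` or the embedding size changes.

```python
import torch

emb = torch.randn((16, 3, 2))          # stand-in: 16 examples, block_size 3, 2-dim embeddings

a = emb.view(-1, 6)                    # number of examples inferred, feature count hardcoded
b = emb.view(emb.shape[0], -1)         # number of examples read from emb, feature count inferred

print(a.shape, b.shape)
print(torch.equal(a, b))

# the same line works unchanged for a different number of examples
bigger = torch.randn((32, 3, 2))
print(bigger.view(bigger.shape[0], -1).shape)
```

Only one dimension per call can be `-1`, so one of the two sizes has to be stated explicitly.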

Now we apply `tanh` (as in the paper) to get the 100-dimensional activations for all 16 examples.

In [70]:
h = torch.tanh(emb.view(-1, 6) @ W1 + B1)
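Two details of this line are easy to overlook: the bias `B1` of shape (100,) is broadcast across all 16 rows, and `tanh` squashes every activation into the range [-1, 1]. A short sketch with random stand-in tensors (the seed and the name `pre` are assumptions):

```python
import torch

torch.manual_seed(0)                   # arbitrary seed, only for reproducibility
emb = torch.randn((16, 3, 2))          # stand-in for C[X]
W1 = torch.randn((6, 100))
B1 = torch.randn(100)

pre = emb.view(-1, 6) @ W1             # (16, 100) pre-activations without bias
h = torch.tanh(pre + B1)               # B1 (100,) is broadcast across the 16 rows

print(h.shape)
print(h.min().item() >= -1.0, h.max().item() <= 1.0)
# the same bias vector was added to every row
print(torch.allclose((pre + B1)[3], pre[3] + B1))
```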

## Implementing The Output Layer

In [71]:
W2 = torch.randn((100, 27)) # input is 100 and output number of neurons is 27 for our vocab size
B2 = torch.randn(27)

In [72]:
logits = h @ W2 + B2

In [73]:
logits.shape

torch.Size([16, 27])

In [74]:
# getting prob dist from logits
counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True) # normalize each row so it sums to 1
probs.shape

torch.Size([16, 27])

In [75]:
probs[0]

tensor([1.5913e-06, 2.0015e-06, 1.7259e-09, 3.1433e-03, 1.4249e-12, 3.9630e-02,
        2.3779e-08, 1.7425e-07, 1.4293e-09, 9.1998e-01, 8.2286e-03, 7.5289e-09,
        2.4810e-07, 2.4319e-02, 2.4521e-11, 6.1648e-08, 2.4133e-03, 6.8097e-07,
        8.1947e-09, 1.5347e-10, 5.5814e-05, 5.8947e-06, 5.1442e-04, 1.6971e-03,
        6.1583e-07, 6.3879e-08, 1.1128e-05])

In [76]:
probs[0].sum()

tensor(1.)
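Exponentiating the logits and normalising each row is the softmax function. As a sanity check, this sketch compares the manual computation with `torch.softmax` on random stand-in logits (the seed and variable names are assumptions):

```python
import torch

torch.manual_seed(0)                           # arbitrary seed, only for reproducibility
logits = torch.randn((16, 27))                 # stand-in logits

counts = logits.exp()
probs_manual = counts / counts.sum(1, keepdim=True)
probs_builtin = torch.softmax(logits, dim=1)

print(torch.allclose(probs_manual, probs_builtin))
print(probs_manual.sum(1))                     # every row sums to 1 (up to float error)
```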

## Loss Function

In [79]:
# for each row i of probs, pick the entry in column Y[i]
probs[torch.arange(16), Y] # probs assigned by this NN to the correct next character of each example

tensor([3.9630e-02, 3.5144e-09, 1.3506e-04, 7.9603e-07, 5.3318e-15, 6.1648e-08,
        4.9938e-08, 1.9571e-05, 3.8193e-16, 3.0336e-09, 4.4803e-06, 2.1925e-05,
        2.0015e-06, 6.8455e-11, 5.8309e-08, 6.7171e-07])
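Indexing with two integer tensors pairs them up element by element, so `probs[torch.arange(16), Y]` reads entry `(i, Y[i])` for every row `i`. The sketch below shows this on a tiny hand-made `probs` and compares it with an explicit loop and with `gather`; all values here are made up for illustration.

```python
import torch

probs = torch.tensor([[0.1, 0.7, 0.2],
                      [0.3, 0.3, 0.4],
                      [0.6, 0.1, 0.3]])
Y = torch.tensor([1, 2, 0])                    # correct class for each row

picked = probs[torch.arange(3), Y]             # reads entries (0, 1), (1, 2), (2, 0)
by_loop = torch.stack([probs[i, Y[i]] for i in range(3)])
by_gather = probs.gather(1, Y.unsqueeze(1)).squeeze(1)

print(picked)
print(torch.equal(picked, by_loop), torch.equal(picked, by_gather))
```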

In [81]:
nlll = - probs[torch.arange(16), Y].log().mean() # negative log likelihood loss
nlll

tensor(16.7703)
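The manual softmax followed by the mean negative log likelihood computes the same quantity as PyTorch's `F.cross_entropy` applied directly to the logits. The sketch below checks this on random stand-in logits and targets (seed and names are assumptions), and also avoids hardcoding 16 by reading the number of examples from `logits`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                           # arbitrary seed, only for reproducibility
logits = torch.randn((16, 27))                 # stand-in logits
Y = torch.randint(0, 27, (16,))                # stand-in targets

counts = logits.exp()
probs = counts / counts.sum(1, keepdim=True)
nll_manual = -probs[torch.arange(logits.shape[0]), Y].log().mean()
nll_builtin = F.cross_entropy(logits, Y)

print(nll_manual.item(), nll_builtin.item())
print(torch.allclose(nll_manual, nll_builtin))
```

For reference, a model that assigned the uniform probability 1/27 to every character would score -log(1/27) ≈ 3.30, so the loss of 16.77 above reflects the untrained, randomly initialised weights.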