# Road to Generative AI - Part 2: Multi-Layer Perceptron

## Introduction

This notebook is the second part of a series exploring Generative AI. In the [first part](https://tdody.github.io//bigram-and-nn/), we introduced the concept and built simple text generators using a bigram model and a single-layer neural network. In this part, we build a more expressive model, a Multi-Layer Perceptron (MLP), to generate text.

The motivation for using an MLP is that n-gram models suffer from the curse of dimensionality: for a vocabulary of $V$ tokens there are $V^n$ possible n-grams, so the number of counts to store grows exponentially with $n$, and most of these n-grams never appear in the training data. In contrast, an MLP learns patterns in the data with a fixed set of parameters and can generate text without storing every possible n-gram.
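To make this growth concrete, the sketch below compares the size of a full count table with the parameter count of a small MLP as the context grows. The vocabulary size of 28 (27 characters plus one boundary token), the embedding dimension of 2, and the hidden size of 128 are the values used later in this notebook; the helper names are illustrative only.

```python
VOCAB_SIZE = 28  # assumption: 27 characters + 1 boundary token


def count_table_entries(context_size, vocab_size=VOCAB_SIZE):
    """Entries in a full count table: one row per context, one column per next token."""
    return vocab_size ** context_size * vocab_size


def mlp_parameters(context_size, vocab_size=VOCAB_SIZE, embedding_dim=2, hidden=128):
    """Parameters of an embedding + one hidden layer + output layer model."""
    embedding = vocab_size * embedding_dim
    hidden_layer = context_size * embedding_dim * hidden + hidden
    output_layer = hidden * vocab_size + vocab_size
    return embedding + hidden_layer + output_layer


for n in (1, 2, 3, 5, 8):
    print(f"context={n}: table={count_table_entries(n):,d}  mlp={mlp_parameters(n):,d}")
```

With a context of 3 characters, the table already needs 614,656 entries while the MLP has 4,564 parameters, and the MLP grows only linearly with the context size.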

In the paper "A Neural Probabilistic Language Model" by Yoshua Bengio et al., the authors proposed a neural network-based language model that can learn to predict the next word in a sentence. We are going to use a similar approach but our focus is on generating the next character in a sequence of characters.

## Dataset

We will use a dataset containing common bird names. Our source data can be found [here](https://www.kaggle.com/datasets/thepushkarp/common-bird-names).

In [10]:
DATASET_PATH = "./datasets/birds/birds.csv"

# Use a context manager so the file is closed after reading
with open(DATASET_PATH, "r") as f:
    birds = f.read().splitlines()

print("First 10 birds in the dataset:")
print(", ".join(birds[:10]))
print(f"There are {len(birds):,d} birds in the dataset.")

name_lengths = [len(bird) for bird in birds]
print(f"The shortest bird name has {min(name_lengths)} characters.")
print(f"The longest bird name has {max(name_lengths)} characters.")

First 10 birds in the dataset:
Abbott's babbler, Abbott's booby, Abbott's starling, Abbott's sunbird, Abd al-Kuri sparrow, Abdim's stork, Aberdare cisticola, Aberrant bush warbler, Abert's towhee, Abyssinian catbird
There are 10,976 birds in the dataset.
The shortest bird name has 3 characters.
The longest bird name has 35 characters.


In [87]:
from unidecode import unidecode

def clean_name(name):
    """Normalize a bird name: lowercase, ASCII letters, underscores as separators."""
    # Remove leading and trailing whitespace and convert to lowercase
    name = name.strip().lower()
    # Replace special characters (apostrophes, hyphens, ...) with a space
    name = ''.join(char if char.isalnum() or char.isspace() else ' ' for char in name)
    # Replace spaces with underscores
    name = name.replace(" ", "_")
    # Transliterate accented characters to their ASCII equivalents
    name = unidecode(name)
    return name

In [88]:
# clean a few random names from the dataset
import numpy as np
rdm_indexes = np.random.randint(0, len(birds), 10)

print("Cleaning process:")
for i in rdm_indexes:
    name = birds[i]
    cleaned_name = clean_name(name)
    print(f"Original: {name} -> Cleaned: {cleaned_name}")

Cleaning process:
Original: collared_brushturkey -> Cleaned: collared_brushturkey
Original: eurasian_coot -> Cleaned: eurasian_coot
Original: shoebill -> Cleaned: shoebill
Original: mountain_leaf_warbler -> Cleaned: mountain_leaf_warbler
Original: hood_mockingbird -> Cleaned: hood_mockingbird
Original: black_browed_triller -> Cleaned: black_browed_triller
Original: mascarene_parrot -> Cleaned: mascarene_parrot
Original: yellow_bellied_wattle_eye -> Cleaned: yellow_bellied_wattle_eye
Original: black_shouldered_cicadabird -> Cleaned: black_shouldered_cicadabird
Original: white_headed_vulture -> Cleaned: white_headed_vulture


In [89]:
# clean all names in the dataset
birds = list(map(clean_name, birds))

# create a mapping from tokens to indices
unique_tokens = set([c for w in birds for c in w])
SPECIAL_TOKEN = "."
index_to_token = {i: t for i, t in enumerate(unique_tokens, start=1)}
token_to_index = {v: k for k, v in index_to_token.items()}
index_to_token[0] = SPECIAL_TOKEN
token_to_index[SPECIAL_TOKEN] = 0

# log information about the tokenization
print(f"Number of unique tokens: {len(unique_tokens)}")
print(", ".join(sorted(unique_tokens)))
print(f"Token mapping: {index_to_token}")

Number of unique tokens: 27
_, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
Token mapping: {1: 'j', 2: 'p', 3: 'x', 4: '_', 5: 'd', 6: 'r', 7: 'q', 8: 'm', 9: 'o', 10: 'i', 11: 'z', 12: 'u', 13: 'h', 14: 'y', 15: 't', 16: 'l', 17: 'e', 18: 'a', 19: 'c', 20: 'k', 21: 'g', 22: 's', 23: 'b', 24: 'w', 25: 'v', 26: 'n', 27: 'f', 0: '.'}
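Because Python sets have no guaranteed iteration order, the mapping above can change from one run to the next. Below is a minimal sketch of a reproducible alternative, assuming it is acceptable to sort the tokens before numbering them. The `build_vocab`, `encode`, and `decode` helpers and the two sample names are illustrative and not part of the original notebook.

```python
SPECIAL_TOKEN = "."


def build_vocab(names):
    """Build token <-> index mappings with a deterministic (sorted) order."""
    tokens = sorted({c for name in names for c in name})
    index_to_token = {i: t for i, t in enumerate(tokens, start=1)}
    index_to_token[0] = SPECIAL_TOKEN  # index 0 is reserved for the boundary token
    token_to_index = {t: i for i, t in index_to_token.items()}
    return token_to_index, index_to_token


def encode(name, token_to_index):
    """Convert a string into a list of token indices."""
    return [token_to_index[c] for c in name]


def decode(indices, index_to_token):
    """Convert a list of token indices back into a string."""
    return "".join(index_to_token[i] for i in indices)


sample_names = ["abbott_s_booby", "shoebill"]  # hypothetical sample
token_to_index, index_to_token = build_vocab(sample_names)
encoded = encode("shoebill", token_to_index)
print(encoded, "->", decode(encoded, index_to_token))
```

Sorting costs nothing at this scale and makes printed tensors comparable across runs.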


## Model Architecture

The main limitation of the bigram model is its scalability. As we increase the context size, the number of unique n-grams grows exponentially, making it difficult to store and process them. To overcome this limitation, we will use a Multi-Layer Perceptron (MLP) model.

The figure below shows the architecture of the MLP model. Each character in the context is first mapped to a vector by an embedding layer $C$. The embeddings are concatenated and fed into a hidden layer $H$ with a $\tanh$ activation function. The output of the hidden layer is fed into the output layer $O$, whose softmax activation produces the probability distribution of the next character in the sequence.

<figure>
    <img src="./assets/MLP_architecture.png" width="500"/>
    <figcaption>Neural architecture</figcaption>
</figure>
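
In equation form, the forward pass described above can be sketched as follows. The symbols $t_1, t_2, t_3$ (the context tokens), $x$, $W_1$, $b_1$, $W_2$, and $b_2$ are notation introduced here for illustration; the weight and bias names mirror the variables used in the code cells of this notebook.

$$
\begin{aligned}
x &= \big[\, C[t_1] \;;\; C[t_2] \;;\; C[t_3] \,\big] \\
h &= \tanh(x W_1 + b_1) \\
P(\text{next token} \mid t_1, t_2, t_3) &= \mathrm{softmax}(h W_2 + b_2)
\end{aligned}
$$

Here $C[t]$ is the row of the embedding matrix for token $t$, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.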

In [90]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
%matplotlib inline

In the example below, we create a sample training set from the first 100 bird names and print the examples generated for the first three. The input ($X$) is the context containing 3 characters, and the target ($Y$) is the next character to predict. The model will learn to predict the next character based on the context.

For the first character, the context is empty, so we use a special token `.` to indicate the start of the sequence.


In [91]:
CONTEXT_SIZE = 3
X, Y = [], []

for i, bird in enumerate(birds[0:100]):
    if i<3:
        print(bird)
    context = [0] * CONTEXT_SIZE
    for ch in bird + SPECIAL_TOKEN:  # Add special token at the end
        ix = token_to_index[ch]
        X.append(context)
        Y.append(ix)
        if i < 3:
            # Use `j` here to avoid confusion with the outer loop variable `i`
            print(''.join(index_to_token[j] for j in context), '->', index_to_token[ix])
        # Update the context by shifting it and adding the new index
        context = context[1:] + [ix]


X = torch.tensor(X, dtype=torch.int64)
Y = torch.tensor(Y, dtype=torch.int64)

abbott_s_babbler
... -> a
..a -> b
.ab -> b
abb -> o
bbo -> t
bot -> t
ott -> _
tt_ -> s
t_s -> _
_s_ -> b
s_b -> a
_ba -> b
bab -> b
abb -> l
bbl -> e
ble -> r
ler -> .
abbott_s_booby
... -> a
..a -> b
.ab -> b
abb -> o
bbo -> t
bot -> t
ott -> _
tt_ -> s
t_s -> _
_s_ -> b
s_b -> o
_bo -> o
boo -> b
oob -> y
oby -> .
abbott_s_starling
... -> a
..a -> b
.ab -> b
abb -> o
bbo -> t
bot -> t
ott -> _
tt_ -> s
t_s -> _
_s_ -> s
s_s -> t
_st -> a
sta -> r
tar -> l
arl -> i
rli -> n
lin -> g
ing -> .


In [92]:
print("Dataset information:")
print("X shape:", X.shape)
print("Y shape:", Y.shape)

print("\nFirst 10 examples:")
print("X:", X[0:10])
print("Y:", Y[0:10])

Dataset information:
X shape: torch.Size([1915, 3])
Y shape: torch.Size([1915])

First 10 examples:
X: tensor([[ 0,  0,  0],
        [ 0,  0, 18],
        [ 0, 18, 23],
        [18, 23, 23],
        [23, 23,  9],
        [23,  9, 15],
        [ 9, 15, 15],
        [15, 15,  4],
        [15,  4, 22],
        [ 4, 22,  4]])
Y: tensor([18, 23, 23,  9, 15, 15,  4, 22,  4, 23])


Let's now focus on the embedding process. The embedding layer is a matrix that maps each token to a vector of fixed size. The size of the vector is called the embedding dimension. The embedding layer is initialized with random values and is trained during the training process. Below we create a random embedding matrix for our dataset.

In [104]:
n_token = len(unique_tokens)+1
EMBEDDING_DIM = 2

C = torch.randn((n_token, EMBEDDING_DIM), dtype=torch.float32) # shape (28, EMBEDDING_DIM)

When we want to retrieve the embedding for a specific token, we can use the token's index to look up the corresponding row in the embedding matrix. This allows us to convert tokens into their vector representations, which can then be used as input to the neural network.

In [105]:
token_to_embed = "t"
token_index = token_to_index[token_to_embed]
one_hot_encoded = F.one_hot(torch.tensor(token_index), num_classes=n_token).float() # shape (28,)

# multiply the one-hot encoded vector with the embedding matrix
one_hot_encoded @ C # (28,) @ (28, EMBEDDING_DIM) -> (EMBEDDING_DIM,)

tensor([-1.6547,  1.0788])

In [106]:
# alternatively, we can use PyTorch indexing to get the embeddings of every token in X at once
print("Shape of embeddings of X: ", C[X].shape) # (n_examples, context_size, embedding_dim)

Shape of embeddings of X:  torch.Size([1915, 3, 2])
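
The two lookups above should return the same vector: multiplying a one-hot vector by the embedding matrix selects one row, which is exactly what indexing does. A minimal check, assuming a freshly sampled embedding matrix and an arbitrary token index (both are stand-ins, not the notebook's actual values):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_token, embedding_dim = 28, 2
C = torch.randn((n_token, embedding_dim))  # stand-in embedding matrix

token_index = 15  # arbitrary index chosen for illustration
one_hot = F.one_hot(torch.tensor(token_index), num_classes=n_token).float()  # (28,)

via_matmul = one_hot @ C    # (28,) @ (28, 2) -> (2,)
via_index = C[token_index]  # row lookup -> (2,)
same = torch.allclose(via_matmul, via_index)
print("Same embedding:", same)

# Indexing also works on a whole batch of contexts at once
X_demo = torch.tensor([[0, 0, 15], [0, 15, 3]])  # (2 examples, 3 tokens)
batch = C[X_demo]                                # (2, 3, 2)
print(batch.shape)
```

Indexing avoids building the one-hot vectors and the matrix product.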


At this stage of the model, the embedding layer has transformed the input tokens into their vector representations. However, the shape of the embedded context, `(n_examples, context_size, embedding_dim)`, is not compatible with the input of the hidden layer, which expects one vector per example. We therefore flatten the last two dimensions so the input shape becomes `(n_examples, context_size * embedding_dim)`.

In [107]:
print("Shape of the flattened context:", C[X].view((-1, EMBEDDING_DIM * CONTEXT_SIZE)).shape) # flatten the context shape into a single vector for each example

Shape of the flattened context: torch.Size([1915, 6])
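
To see what the flattening does, the sketch below checks that `view` produces the same result as explicitly concatenating the embeddings of the three context positions. The tensors are random stand-ins with the same dimensions as in this notebook, not the actual data.

```python
import torch

torch.manual_seed(0)
n_token, embedding_dim, context_size = 28, 2, 3
C = torch.randn((n_token, embedding_dim))         # stand-in embedding matrix
X_demo = torch.randint(0, n_token, (5, context_size))  # 5 random contexts

emb = C[X_demo]  # (5, 3, 2)

# Option 1: reinterpret the last two dimensions as one
flat_view = emb.view(-1, context_size * embedding_dim)  # (5, 6)

# Option 2: split by context position and concatenate side by side
flat_cat = torch.cat(torch.unbind(emb, dim=1), dim=1)   # (5, 6)

identical = torch.equal(flat_view, flat_cat)
print(flat_view.shape, identical)
```

`view` does not copy the data, whereas `torch.cat` allocates a new tensor, which is one reason to prefer `view` here.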


With the right input shape, we can now build the rest of the MLP: a hidden layer with a $\tanh$ activation function, followed by an output layer whose softmax activation produces the probability distribution of the next character in the sequence.

In [108]:
LAYER_SIZE = 128
W1 = torch.randn((EMBEDDING_DIM * CONTEXT_SIZE, LAYER_SIZE), dtype=torch.float32) # shape (6, 128)
b1 = torch.randn((LAYER_SIZE,), dtype=torch.float32) # shape (128,)

In [109]:
# Hidden layer
# Note: the '+' relies on broadcasting, so the bias is added to each row of the matrix
# (n_examples, LAYER_SIZE) + (LAYER_SIZE,) -> (n_examples, LAYER_SIZE)
h = torch.tanh(C[X].view((-1, EMBEDDING_DIM * CONTEXT_SIZE)) @ W1 + b1) # shape (n_examples, LAYER_SIZE)

# Output layer weights and biases
W2 = torch.randn((LAYER_SIZE, n_token), dtype=torch.float32) # shape (128, 28)
b2 = torch.randn((n_token,), dtype=torch.float32) # shape (28,)

# Output layer
y = h @ W2 + b2
# Apply softmax to get the probability distribution of the next character
probs = F.softmax(y, dim=1) # shape (n_examples, n_token)

print("Shape of the output probabilities:", probs.shape) # (n_examples, n_token)

Shape of the output probabilities: torch.Size([1915, 28])


Now we can compute the loss function. The loss function measures how well the model predicts the next character in the sequence. We will use the cross-entropy loss function, which is commonly used for classification tasks. The cross-entropy loss function compares the predicted probability distribution with the true distribution and computes the loss.


In [110]:
loss = F.cross_entropy(y, Y) # takes the raw logits y; the softmax is applied internally
print("Loss:", loss.item())

Loss: 19.839637756347656
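
As a sanity check on what `F.cross_entropy` computes, the sketch below reproduces it by hand as the mean negative log-probability assigned to the correct token, and also evaluates the loss of a model that predicts a uniform distribution. The logits and targets are random stand-ins, not the notebook's data.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_examples, n_token = 8, 28
logits = torch.randn((n_examples, n_token))         # stand-in for y
targets = torch.randint(0, n_token, (n_examples,))  # stand-in for Y

# Manual version: softmax, pick the probability of the true token, average -log
probs = F.softmax(logits, dim=1)
manual = -probs[torch.arange(n_examples), targets].log().mean()
builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())

# Reference point: all-zero logits give a uniform distribution over the tokens
uniform_loss = F.cross_entropy(torch.zeros((n_examples, n_token)), targets)
print(uniform_loss.item(), math.log(n_token))
```

A uniform prediction over 28 tokens gives a loss of $\ln 28 \approx 3.33$, so an initial loss near 19.8 suggests that the randomly initialized weights produce confident but wrong predictions.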


We can now clean things up a bit and set up the training loop. At each step, the loop runs a forward pass over the full dataset, computes the loss, backpropagates the gradients, and updates the model parameters with plain gradient descent at a fixed learning rate of 0.01.

In [120]:
g = torch.Generator().manual_seed(1234)

C = torch.randn((n_token, EMBEDDING_DIM), dtype=torch.float32, generator=g) # shape (28, 2)
W1 = torch.randn((EMBEDDING_DIM * CONTEXT_SIZE, LAYER_SIZE), dtype=torch.float32, generator=g) # shape (6, 128)
b1 = torch.randn((LAYER_SIZE,), dtype=torch.float32, generator=g) # shape (128,)
W2 = torch.randn((LAYER_SIZE, n_token), dtype=torch.float32, generator=g) # shape (128, 28)
b2 = torch.randn((n_token,), dtype=torch.float32, generator=g) # shape (28,)

params = [C, W1, b1, W2, b2]

for p in params:
    p.requires_grad = True  # Set requires_grad to True to enable backpropagation

In [119]:
for i in range(1000):
    # Forward pass
    h = torch.tanh(C[X].view((-1, EMBEDDING_DIM * CONTEXT_SIZE)) @ W1 + b1)  # shape (n_examples, LAYER_SIZE)
    y = h @ W2 + b2  # shape (n_examples, n_token)
    loss = F.cross_entropy(y, Y)  # compute the cross-entropy loss
    if i % 100 == 0:
        print(f"Loss at {i}: {loss.item()}")  # Print the loss value for monitoring

    # Backward pass
    for p in params:
        p.grad = None  # Clear the gradients left over from the previous step
    loss.backward()  # Compute gradients

    # Update parameters using gradient descent
    for p in params:
        p.data -= 0.01 * p.grad  # Update parameters with a learning rate of 0.01

Loss at 0: 2.0080769062042236
Loss at 100: 2.0022177696228027
Loss at 200: 1.996519684791565
Loss at 300: 1.9909712076187134
Loss at 400: 1.9855639934539795
Loss at 500: 1.9802888631820679
Loss at 600: 1.9751393795013428
Loss at 700: 1.9701075553894043
Loss at 800: 1.9651888608932495
Loss at 900: 1.9603769779205322
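
The manual parameter update in the loop above is plain gradient descent, which PyTorch also provides as `torch.optim.SGD`. The sketch below runs both versions for a few steps and checks that they end up with the same parameters. It uses random stand-in data and a smaller hidden layer of 16 units for speed; `make_params` and `loss_fn` are illustrative helpers, not part of the original notebook.

```python
import torch
import torch.nn.functional as F


def make_params(seed):
    """Create C, W1, b1, W2, b2 with a fixed seed (hidden size 16 is an assumption)."""
    g = torch.Generator().manual_seed(seed)
    shapes = [(28, 2), (6, 16), (16,), (16, 28), (28,)]
    params = [torch.randn(shape, generator=g) for shape in shapes]
    for p in params:
        p.requires_grad = True
    return params


def loss_fn(params, X, Y):
    """Forward pass of the MLP followed by the cross-entropy loss."""
    C, W1, b1, W2, b2 = params
    h = torch.tanh(C[X].view(-1, 6) @ W1 + b1)
    return F.cross_entropy(h @ W2 + b2, Y)


g = torch.Generator().manual_seed(0)
X_demo = torch.randint(0, 28, (32, 3), generator=g)  # random stand-in contexts
Y_demo = torch.randint(0, 28, (32,), generator=g)    # random stand-in targets

# Version 1: manual gradient descent
manual = make_params(1234)
for _ in range(5):
    for p in manual:
        p.grad = None
    loss_fn(manual, X_demo, Y_demo).backward()
    with torch.no_grad():
        for p in manual:
            p -= 0.01 * p.grad

# Version 2: the same update through torch.optim.SGD
auto = make_params(1234)
optimizer = torch.optim.SGD(auto, lr=0.01)
for _ in range(5):
    optimizer.zero_grad()
    loss_fn(auto, X_demo, Y_demo).backward()
    optimizer.step()

match = all(torch.allclose(a, b) for a, b in zip(manual, auto))
print("Same parameters after 5 steps:", match)
```

Using an optimizer object makes it easy to try another update rule later by changing a single line.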


## References

- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.