# Road to Generative AI - Part 2: Multi-Layer Perceptron

## Introduction

The purpose of this notebook is to explore the capabilities of Generative AI. In the [first part](https://tdody.github.io//bigram-and-nn/) of this series, we introduced the concept of Generative AI and built a simple text generator using a bigram model and a single-layer neural network. In this part, we build a more complex model, a Multi-Layer Perceptron (MLP), to generate text.

The motivation for using an MLP is that n-gram models suffer from the curse of dimensionality. For a vocabulary of $V$ tokens there are $V^n$ possible n-grams, so the number of n-grams grows exponentially with $n$. Storing and processing all of them quickly becomes impractical. In contrast, an MLP learns the patterns in the data through a fixed set of parameters and can generate text without storing every possible n-gram.
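
To make this growth concrete, here is a quick sketch that counts the possible n-grams for a character-level model. The vocabulary size of 27 (26 lowercase letters plus one special token) is an assumption for illustration only; it is not the size of the vocabulary built later in this notebook.

```python
# Assumed vocabulary size for illustration: 26 lowercase letters + 1 special token.
VOCAB_SIZE = 27


def ngram_table_size(vocab_size, n):
    """Return the number of distinct n-grams over a vocabulary of the given size."""
    return vocab_size ** n


for n in range(1, 6):
    print(f"n={n}: {ngram_table_size(VOCAB_SIZE, n):,} possible n-grams")
```

Even with such a small vocabulary, a table indexed by 5-grams would already need more than 14 million entries, most of which would never be observed in a dataset of a few hundred names.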

In the paper "A Neural Probabilistic Language Model", Yoshua Bengio et al. proposed a neural network-based language model that learns to predict the next word in a sentence. We use a similar approach, but we predict the next character in a sequence of characters rather than the next word.
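
As a minimal sketch of what "predicting the next character" could look like as a supervised task, the snippet below slides a fixed-size context window over a name and emits (context, target) pairs. The helper `build_examples`, the context length `BLOCK_SIZE = 3`, and the use of the special token as both padding and end marker are assumptions made for illustration, not necessarily the setup used in the rest of this series.

```python
SPECIAL_TOKEN = "<.>"  # assumed here to act as both start padding and end-of-name marker
BLOCK_SIZE = 3         # assumed context length: how many previous tokens the model sees


def build_examples(name, block_size=BLOCK_SIZE):
    """Return a list of (context, target) pairs for a single name."""
    context = [SPECIAL_TOKEN] * block_size  # start with a fully padded context
    examples = []
    for token in list(name) + [SPECIAL_TOKEN]:
        examples.append((tuple(context), token))
        context = context[1:] + [token]  # slide the window by one token
    return examples


for context, target in build_examples("frodo"):
    print(" ".join(context), "->", target)
```

Each name of length $n$ yields $n + 1$ training examples, the last one teaching the model when to stop.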

## Dataset

Just like the first part, we will use the names of the characters in the Lord of the Rings series as our dataset. Our source data can be found [here](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data).

In [2]:
DATASET_PATH = "./datasets/lotr/lotr_characters_names.csv"

# Read one name per line; the file is assumed to be UTF-8 encoded since names contain accents
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    characters = f.read().splitlines()

print(characters[0:10])
print(f"There are {len(characters)} characters in the dataset.")

# Compute the name lengths once and reuse them for both statistics
name_lengths = [len(name) for name in characters]
print(f"The shortest character name has {min(name_lengths)} characters.")
print(f"The longest character name has {max(name_lengths)} characters.")

['Adanel', 'Boromir', 'Lagduf', 'Tarcil', 'Fire drake of Gondolin', 'Ar Adûnakhôr', 'Annael', 'Angrod', 'Angrim', 'Anárion']
There are 905 characters in the dataset.
The shortest character name has 3 characters.
The longest character name has 31 characters.


In [6]:
from unidecode import unidecode

def clean_name(name):
    """Normalize a raw character name to lowercase ASCII with underscores."""
    # Remove leading and trailing whitespace and convert to lowercase
    name = name.strip().lower()
    # Replace spaces with underscores
    name = name.replace(" ", "_")
    # Transliterate accented characters to their closest ASCII equivalent
    name = unidecode(name)
    return name

characters = list(map(clean_name, characters))

# Create a mapping from tokens to indices.
# Sorting the tokens keeps the mapping identical across runs (set iteration order is not guaranteed).
unique_tokens = sorted({c for w in characters for c in w})
SPECIAL_TOKEN = "<.>"
index_to_token = {i: t for i, t in enumerate(unique_tokens, start=1)}
token_to_index = {v: k for k, v in index_to_token.items()}
# Reserve index 0 for the special token
index_to_token[0] = SPECIAL_TOKEN
token_to_index[SPECIAL_TOKEN] = 0
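
To check that the two dictionaries are consistent, a round trip from a name to indices and back can be sketched as follows. The helpers `encode` and `decode` and the three sample names are hypothetical and exist only for this illustration; appending index 0 at the end of an encoded name is likewise an assumption about how the special token might be used.

```python
# Small stand-in for the cleaned dataset (hypothetical sample, already normalized).
names = ["adanel", "boromir", "fire_drake_of_gondolin"]

SPECIAL_TOKEN = "<.>"
unique_tokens = sorted({c for w in names for c in w})
index_to_token = {i: t for i, t in enumerate(unique_tokens, start=1)}
index_to_token[0] = SPECIAL_TOKEN
token_to_index = {t: i for i, t in index_to_token.items()}


def encode(name):
    """Map a cleaned name to a list of indices, terminated by the special token (index 0)."""
    return [token_to_index[c] for c in name] + [0]


def decode(indices):
    """Map indices back to a string, skipping the special token."""
    return "".join(index_to_token[i] for i in indices if i != 0)


encoded = encode("boromir")
print(encoded)
print(decode(encoded))
```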

## Model Architecture

The figure below shows the architecture of the MLP model. An embedding layer $C$ maps each character of the context to a vector. The embeddings are concatenated and fed into a hidden layer $H$ with a $\tanh$ activation function. The output of the hidden layer is fed into the output layer $O$ with a softmax activation function, which produces the probability distribution of the next character in the sequence.

<figure>
    <img src="./assets/MLP_architecture.png" width="500"/>
    <figcaption>Neural architecture</figcaption>
</figure>
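
The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in this series: the sizes (`VOCAB_SIZE`, `BLOCK_SIZE`, `EMBED_DIM`, `HIDDEN_DIM`), the random initialization, and the names `W1`, `b1`, `W2`, `b2` are all hypothetical choices made to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
VOCAB_SIZE = 27   # number of distinct tokens
BLOCK_SIZE = 3    # number of context characters fed to the model
EMBED_DIM = 2     # dimension of each character embedding
HIDDEN_DIM = 16   # number of units in the hidden layer

# Parameters: embedding table C, hidden layer H (W1, b1), output layer O (W2, b2).
C = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))
W1 = rng.normal(size=(BLOCK_SIZE * EMBED_DIM, HIDDEN_DIM)) * 0.1
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(size=(HIDDEN_DIM, VOCAB_SIZE)) * 0.1
b2 = np.zeros(VOCAB_SIZE)


def forward(context_indices):
    """Return next-token probabilities for an int array of shape (batch, BLOCK_SIZE)."""
    emb = C[context_indices]                      # (batch, BLOCK_SIZE, EMBED_DIM)
    x = emb.reshape(emb.shape[0], -1)             # concatenate the context embeddings
    h = np.tanh(x @ W1 + b1)                      # hidden layer with tanh activation
    logits = h @ W2 + b2                          # one score per token in the vocabulary
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)   # softmax: each row sums to 1


# Two example contexts given as token indices.
probs = forward(np.array([[0, 0, 0], [0, 0, 5]]))
print(probs.shape)
print(probs.sum(axis=1))
```

With untrained weights the predicted distribution is close to uniform; training adjusts `C`, `W1`, `b1`, `W2`, and `b2` so that the probability assigned to the observed next character increases.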

## References

- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3 (March 2003), 1137–1155.