# Road to Generative AI - Part 1

## Introduction

The purpose of this notebook is to explore the capabilities of Generative AI. In this first part, we focus on one of its most basic forms: generating short words one character at a time. We will use a series of model architectures, from simple bigram models to more complex RNNs and LSTMs, to generate new words from a dataset of existing words.

## Dataset

For this task, we will build a model that generates new names for the Lord of the Rings universe, using the names of existing characters in the series as our dataset. The source data can be found [here](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data).

In [37]:
DATASET_PATH = "./datasets/lotr/lotr_characters_names.csv"

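# Read one name per line; the context manager closes the file afterwards.
# Assumption: the file is UTF-8 encoded (the names contain accented characters).
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    characters = f.read().splitlines()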

print(characters[0:10])
print(f"There are {len(characters)} characters in the dataset.")

# Compute the name lengths once and reuse them for both statistics
name_lengths = [len(name) for name in characters]
print(f"The shortest character name has {min(name_lengths)} characters.")
print(f"The longest character name has {max(name_lengths)} characters.")

['Adanel', 'Boromir', 'Lagduf', 'Tarcil', 'Fire-drake of Gondolin', 'Ar-Adûnakhôr', 'Annael', 'Angrod', 'Angrim', 'Anárion']
There are 909 characters in the dataset.
The shortest character name has 2 characters.
The longest character name has 31 characters.


In [38]:
from unidecode import unidecode

def clean_name(name):
    # Remove leading and trailing whitespace and convert to lowercase
    name = name.strip().lower()
    # Replace spaces with underscores
    name = name.replace(" ", "_")
    # Transliterate accented characters to plain ASCII (e.g. "û" -> "u");
    # punctuation such as hyphens is kept as-is
    name = unidecode(name)
    return name

characters = list(map(clean_name, characters))

print(characters[0:10])

['adanel', 'boromir', 'lagduf', 'tarcil', 'fire-drake_of_gondolin', 'ar-adunakhor', 'annael', 'angrod', 'angrim', 'anarion']


## Bigram Model

One of the simplest generative models is the bigram model. In this model, the probability of each character depends only on the character immediately before it. We then use these probabilities to generate new names one character at a time. Let's start by building a bigram model for our dataset.

A language model learns the conditional probabilities $P(c_i|c_{i-1})$, where $c_i$ is the $i$-th character of a name. We can estimate these probabilities by counting how often each character follows each other character in the dataset. Then, at prediction time, we can generate new characters by sampling from these conditional probabilities.
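
One common way to turn such counts into probabilities is the relative-frequency estimate. The notebook does not state its formula explicitly, so this is an assumption. Here $\operatorname{count}(a, b)$ is the number of times character $b$ directly follows character $a$ in the dataset, and $c'$ ranges over all possible next characters:

$$
P(c_i \mid c_{i-1}) = \frac{\operatorname{count}(c_{i-1}, c_i)}{\sum_{c'} \operatorname{count}(c_{i-1}, c')}
$$

With this estimate, the probabilities of all characters that can follow a given character sum to 1.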

For the name "adanel", the model needs to understand that the first "a" is the first character and that the "l" is the last character. To do so, we are going to add a start token `<START>` at the beginning of each name and an end token `<END>` at the end of each name.

In [39]:
START_TOKEN = "<START>"
END_TOKEN = "<END>"

for w in characters[0:1]:
    w = [START_TOKEN] + list(w) + [END_TOKEN]
    for c1, c2 in zip(w, w[1:]):
        print(c1, c2)

<START> a
a d
d a
a n
n e
e l
l <END>


Now we build a dictionary that counts how often each bigram occurs in the dataset. These counts are the basis for the conditional probabilities that we will use to generate new names.

In [40]:
bigrams = {}

for w in characters:
    w = [START_TOKEN] + list(w) + [END_TOKEN]
    for c1, c2 in zip(w, w[1:]):
        bigram = (c1, c2)
        bigrams[bigram] = bigrams.get(bigram, 0) + 1

# Print the first 5 bigrams and their counts
for bigram in list(bigrams.keys())[0:5]:
    print(bigram, bigrams[bigram])

('<START>', 'a') 101
('a', 'd') 53
('d', 'a') 56
('a', 'n') 184
('n', 'e') 44
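
As a minimal sketch of how counts like these could be turned into conditional probabilities and then sampled from, the block below repeats the counting step on three names taken from the sample output and normalizes each count by the total for its first token. The names `toy_names`, `totals`, `probs`, and `sample_name` are hypothetical helpers introduced for illustration, and the `max_len` cap is an added safeguard rather than part of the notebook.

```python
import random
from collections import defaultdict

START_TOKEN = "<START>"
END_TOKEN = "<END>"

# Toy stand-in for the cleaned names list (three names from the sample output).
toy_names = ["adanel", "annael", "angrod"]

# Count bigrams, including the start and end tokens.
counts = {}
for name in toy_names:
    tokens = [START_TOKEN] + list(name) + [END_TOKEN]
    for c1, c2 in zip(tokens, tokens[1:]):
        counts[(c1, c2)] = counts.get((c1, c2), 0) + 1

# Total number of times each token appears as the first element of a bigram.
totals = defaultdict(int)
for (c1, _), n in counts.items():
    totals[c1] += n

# Relative-frequency estimate of P(c2 | c1).
probs = defaultdict(dict)
for (c1, c2), n in counts.items():
    probs[c1][c2] = n / totals[c1]


def sample_name(probs, rng, max_len=50):
    """Sample one name by walking the bigram table from START until END."""
    out = []
    current = START_TOKEN
    while len(out) < max_len:
        next_tokens = list(probs[current].keys())
        weights = list(probs[current].values())
        current = rng.choices(next_tokens, weights=weights, k=1)[0]
        if current == END_TOKEN:
            break
        out.append(current)
    return "".join(out)


rng = random.Random(0)  # fixed seed so repeated runs give the same samples
print(dict(probs[START_TOKEN]))
print([sample_name(probs, rng) for _ in range(3)])
```

Because all three toy names start with "a", the start token is followed by "a" with probability 1. On the full dataset the same steps would apply to the `bigrams` dictionary built above.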


In [41]:
print(f"There are {len(bigrams)} bigrams in the dataset.")
print("The most common bigrams are:")
sorted(bigrams.items(), key=lambda x: x[1], reverse=True)[0:5]

There are 465 bigrams in the dataset.
The most common bigrams are:


[(('a', 'r'), 220),
 (('i', 'n'), 211),
 (('r', '<END>'), 188),
 (('a', 'n'), 184),
 (('o', 'r'), 170)]

In [42]:
unique_tokens = {c for w in characters for c in w}
print(f"There are {len(unique_tokens)} unique tokens in the dataset.")
print(unique_tokens)

There are 30 unique tokens in the dataset.
{'-', 'm', 'b', 'p', 'y', 'n', 'a', '_', 'x', 'w', 'd', 'q', 'j', 'l', '.', 'g', 'i', 'f', 'u', 'c', 'o', 'v', 't', 'h', 'k', "'", 'z', 'r', 'e', 's'}


For efficiency, we are going to use a 2D PyTorch tensor instead of a dictionary. The first dimension represents the previous character, and the second dimension represents the current character. We start from a tensor of counts, where the value at index `[i, j]` is the number of times character `j` appears after character `i`; normalizing each row turns these counts into conditional probabilities.

In [11]:
import torch

# Create our bigram count matrix, initialized to zero
N = torch.zeros(len(unique_tokens), len(unique_tokens), dtype=torch.int32)
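
The following is a hedged sketch of how such a tensor could be filled and normalized, using three names from the sample output as a toy dataset. The names `vocab`, `stoi`, `counts`, and `probs` are hypothetical. Unlike the cell above, this sketch reserves two extra indices for the start and end tokens, which is an assumption rather than something the notebook shows.

```python
import torch

START_TOKEN = "<START>"
END_TOKEN = "<END>"

# Toy stand-in for the cleaned names list (three names from the sample output).
toy_names = ["adanel", "annael", "angrod"]

# Vocabulary: special tokens plus every character seen, in a fixed order.
vocab = [START_TOKEN, END_TOKEN] + sorted({c for name in toy_names for c in name})
stoi = {token: i for i, token in enumerate(vocab)}  # token -> row/column index

# Count matrix: counts[i, j] = number of times token j follows token i.
counts = torch.zeros((len(vocab), len(vocab)), dtype=torch.int32)
for name in toy_names:
    tokens = [START_TOKEN] + list(name) + [END_TOKEN]
    for c1, c2 in zip(tokens, tokens[1:]):
        counts[stoi[c1], stoi[c2]] += 1

# Row-normalize to get P(j | i). The clamp avoids dividing by zero for rows
# that never occur as a previous token (here, the END_TOKEN row).
row_sums = counts.sum(dim=1, keepdim=True).clamp(min=1)
probs = counts.float() / row_sums

print(counts[stoi[START_TOKEN], stoi["a"]].item())
print(probs[stoi["a"]])
```

Mapping tokens to integer indices up front keeps the counting loop to a single tensor update per bigram, and each row of `probs` can later be passed to a sampler such as `torch.multinomial`.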

## References

- [Lord of the Rings Data](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data)