# Namesformer

Inspired by Andrej Karpathy's [makemore](https://www.youtube.com/watch?v=PaCmpygFfXo&t=131s) lecture, which covers English name generation.

The code was written entirely with ChatGPT, with minimal corrections. My first query was:

```
I am preparing a lecture for my students on AI basics. They already know how to use attention in keras to create self-attention layers. What I want to explain them is how to make a simplest possible transformer architecture (with minimal amount of code).
 As a dataset I will use a csv with names:
    john
    peter
    mike
    ...
And the goal will be to generate more names that sound name-like.
Give me an implementation with keras trying to keep it as minimal as possible.
```

After that I had to ask for a couple of corrections, such as avoiding the Transformer layer, adding comments, and fixing a bug in token indexing. All were relatively easy to spot, and in less than an hour this notebook was generating plausible-sounding names.

I decided to replace the original dataset, since I found a list of Lithuanian names that is easy to extract from [vardai.vlkk.lt](https://vardai.vlkk.lt) using the following code snippet:

```python
import requests
from bs4 import BeautifulSoup

names = []
for key in ['a', 'b', 'c', 'c-2', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
            'm', 'n', 'o', 'p', 'r', 's', 's-2', 't', 'u', 'v', 'z', 'z-2']:
    url = f'https://vardai.vlkk.lt/sarasas/{key}/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', class_='names_list__links names_list__links--man')
    names += [name.text for name in links]
```
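The snippet above only collects the names in a list, while the notebook later reads them from *vardai.txt* through a `name` column. A minimal sketch of writing such a file, assuming one name per row under a `name` header; the helper `save_names` and the example file name are illustrative and not part of the original notebook:

```python
import csv

def save_names(names, path='vardai.txt'):
    """Write one name per row under a 'name' header so pd.read_csv can load it."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['name'])
        for name in names:
            # strip() is a precaution in case the link text carries whitespace
            writer.writerow([name.strip()])

# Two names from the dataset preview; in practice pass the scraped list.
save_names(['Ãbas', 'Žilvynas'], path='vardai_example.txt')
```

With the scraped list this would be called as `save_names(names)`.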

If you want to play with English names, download them from [here](https://github.com/karpathy/makemore/blob/master/names.txt) and use *names.txt* instead of *vardai.txt*. That file has no header row, so load it with `pd.read_csv('names.txt', names=['name'])`.

In [72]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras import Model, layers, optimizers
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

In [97]:
names = pd.read_csv('vardai.txt')['name'].values
names

array(['Ãbas', 'Ãbdijus', 'Abdònas', ..., 'Žilvynas', 'Žimantas',
       'Žydrunas'], dtype=object)

In [98]:
len(names)

3495

Let's add a space at the end of each name to mark where the name ends.

In [108]:
names += ' '

In [109]:
names[0]

'Ãbas '

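Note that this dataset is not simple, since it uses accent marks and capital letters. Let's intentionally keep the accents and see if the model can figure them out. The tokenizer below is created with `lower=True`, so capital letters are folded to lowercase before training.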

In [29]:
tokenizer = Tokenizer(char_level=True, lower=True)
tokenizer.fit_on_texts(names)

max_len = max([len(name) for name in names])
vocab_size = len(tokenizer.word_index) + 1

# Create input and output sequences
input_sequences = []
output_sequences = []

for name in names:
    for i in range(1, len(name)):
        input_seq = name[:i]
        output_seq = name[i]
        input_sequences.append(input_seq)
        output_sequences.append(output_seq)

X = pad_sequences(tokenizer.texts_to_sequences(input_sequences), maxlen=max_len, padding='post')
y = to_categorical(np.array(tokenizer.texts_to_sequences(output_sequences)) - 1, num_classes=vocab_size)

max_len, vocab_size, X.shape, y.shape

(16, 49, (28867, 16), (28867, 49))
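To make the loop above concrete, here is a small pure-Python sketch of how a single name is split into prefix and next-character pairs. The helper `make_training_pairs` and the example name are only illustrative; the notebook builds the same pairs inline before tokenizing and padding them.

```python
def make_training_pairs(name):
    """Return (prefix, next_char) pairs for one name that ends with a space."""
    return [(name[:i], name[i]) for i in range(1, len(name))]

# The trailing space is the end-of-name marker added earlier.
pairs = make_training_pairs('jonas ')
for prefix, target in pairs:
    print(repr(prefix), '->', repr(target))
```

Every prefix becomes one row of `X` after padding, and the matching next character becomes one row of `y`, so a name of length n (including the trailing space) contributes n - 1 training examples.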

Our transformer will be based on self-attention.
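As a rough illustration of what self-attention computes, here is a NumPy sketch for a single sequence. It assumes the simplest variant, where the same matrix serves as query, key and value with no learned projections and no scaling; as far as I understand, this is what `layers.Attention()` does by default when called with `[x, x]`, but treat that as an assumption. The function names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Unscaled dot-product self-attention for x of shape (seq_len, d_model)."""
    scores = x @ x.T             # (seq_len, seq_len) similarity between positions
    weights = softmax(scores)    # each row is a distribution over positions
    return weights @ x, weights  # every output row is a weighted mix of all rows of x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 positions, 8-dimensional embeddings
out, weights = self_attention(x)
print(out.shape, weights.sum(axis=-1))
```

The output has the same shape as the input, which is why the model can flatten it and feed it to a dense layer.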

In [30]:
class SimpleTransformer(Model):
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        # Create an Embedding layer for converting input tokens into vectors
        self.embedding = layers.Embedding(vocab_size, d_model)
        # Add positional encoding to the input embeddings
        self.pos_encoding = self.add_positional_encoding(max_len, d_model)
        # Create an Attention layer for self-attention mechanism
        self.attention = layers.Attention()
        # Create a Flatten layer to flatten the output tensor
        self.flatten = layers.Flatten()
        # Create a Dense layer for generating output probabilities
        self.dense = layers.Dense(vocab_size, activation='softmax')

    def call(self, inputs):
        # Pass the input through the Embedding layer
        x = self.embedding(inputs)
        # Scale the embeddings by the square root of the embedding dimension
        x *= tf.math.sqrt(tf.cast(self.embedding.output_dim, tf.float32))
        # Add the positional encoding to the input embeddings
        x += self.pos_encoding[:, :x.shape[1], :]
        # Apply self-attention using the Attention layer
        x = self.attention([x, x])
        # Flatten the output tensor
        x = self.flatten(x)
        # Pass the result through the Dense layer to generate output probabilities
        x = self.dense(x)
        return x

    def add_positional_encoding(self, max_len, d_model):
        # Initialize a positional encoding matrix with zeros
        pos_encoding = np.zeros((1, max_len, d_model))
        # Calculate the positional encoding values for each position and dimension.
        # Each sine/cosine pair shares one frequency; i is already the even
        # dimension index, so the exponent is i / d_model (d_model must be even).
        for pos in range(max_len):
            for i in range(0, d_model, 2):
                angle = pos / np.power(10000, i / d_model)
                pos_encoding[:, pos, i] = np.sin(angle)
                pos_encoding[:, pos, i + 1] = np.cos(angle)
        # Convert the numpy array to a TensorFlow tensor
        return tf.cast(pos_encoding, dtype=tf.float32)
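For reference, here is a vectorized NumPy sketch of the sinusoidal positional encoding as defined in the original Transformer paper, where even dimensions hold sines and odd dimensions hold cosines of the same angle. The function name is illustrative and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model // 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles)             # even dimensions
    encoding[:, 1::2] = np.cos(angles)             # odd dimensions
    return encoding

pe = sinusoidal_encoding(max_len=16, d_model=64)
print(pe.shape)
```

Position 0 maps to alternating zeros and ones, and each later position gets a distinct pattern that the model can use to tell positions apart.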

First we can train for a couple of epochs just to make sure that it works.

In [31]:
# Compile and train the model
model = SimpleTransformer(vocab_size, d_model=64, max_len=max_len)
model.compile(optimizer=optimizers.Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=2, batch_size=32)

Metal device set to: Apple M1
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x291cf7a00>

Now we can generate a name by repeatedly predicting the next letter.

In [32]:
# Generate new names
def generate_name(model, seed, maxlen):
    generated_name = seed
    for _ in range(maxlen):
        sequence = tokenizer.texts_to_sequences([generated_name])
        padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post')
        prediction = model.predict(padded_sequence)[0]
        next_char = tokenizer.index_word[np.argmax(prediction) + 1]
        if next_char == ' ':
            break
        generated_name += next_char
    return generated_name
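The `+ 1` in `tokenizer.index_word[np.argmax(prediction) + 1]` undoes the `- 1` that was applied when `y` was built. A tiny sketch of that round trip, using a made-up three-character vocabulary in place of the real tokenizer (the dictionaries and helper names are hypothetical):

```python
# Hypothetical toy vocabulary; like the Keras Tokenizer, it numbers tokens from 1
# so that 0 stays free for padding.
word_index = {'a': 1, 's': 2, ' ': 3}
index_word = {i: ch for ch, i in word_index.items()}

def char_to_class(ch):
    return word_index[ch] - 1   # class labels used for training start at 0

def class_to_char(cls):
    return index_word[cls + 1]  # undo the shift when decoding a prediction

# Encoding and then decoding gives back the original character
for ch in word_index:
    assert class_to_char(char_to_class(ch)) == ch
```

Forgetting either shift moves every prediction by one character, which is the kind of token indexing bug mentioned in the introduction.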

In [38]:
seed = 'R'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')

Generated name: Raũdvydas


Not bad! Note that this name is not in our names list.

In [39]:
'Raũdvydas ' in names

False

Let's train for longer.

In [43]:
model.fit(X, y, epochs=20, batch_size=32)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x29dbd5420>

In [44]:
seed = 'R'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')

Generated name: Rãdmantas


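If we want the model to be more creative, we can sample the next letter from the predicted distribution instead of always taking the most likely one, and add a temperature parameter to control how adventurous the sampling is.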

In [47]:
def generate_name(model, seed, maxlen, temperature=1.0):
    generated_name = seed
    for _ in range(maxlen):
        sequence = tokenizer.texts_to_sequences([generated_name])
        padded_sequence = pad_sequences(sequence, maxlen=max_len, padding='post')
        prediction = model.predict(padded_sequence)

        # Apply temperature to the probability distribution
        prediction = np.asarray(prediction).astype('float64')
        prediction = np.log(prediction) / temperature
        exp_preds = np.exp(prediction)
        prediction = exp_preds / np.sum(exp_preds)

        # Sample the next character from the adjusted probability distribution
        next_char_idx = np.argmax(np.random.multinomial(1, prediction.ravel(), 1))
        next_char = tokenizer.index_word[next_char_idx + 1]

        if next_char == ' ':
            break
        generated_name += next_char
    return generated_name
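To see what the temperature does on its own, here is a small pure-Python sketch that applies the same log, divide and renormalize steps to a made-up distribution. The helper `apply_temperature` and the probabilities are illustrative, and all probabilities are assumed to be strictly positive.

```python
import math

def apply_temperature(probs, temperature=1.0):
    """Rescale a distribution so that p_i becomes proportional to p_i ** (1 / T)."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.7, 0.2, 0.1]  # made-up next-character probabilities
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

A temperature below 1 sharpens the distribution towards the most likely letter, 1 leaves it unchanged, and values above 1 flatten it so that rarer letters are sampled more often.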

In [114]:
seed = 'R'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')
print(f'Name is new? {generated_name + " " not in names}')

Generated name: Rydẽnis
Name is new? True


In [115]:
seed = 'R'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')
print(f'Name is new? {generated_name + " " not in names}')

Generated name: Rìdmantas
Name is new? True


In [118]:
seed = 'T'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')
print(f'Name is new? {generated_name + " " not in names}')

Generated name: Tèlius
Name is new? True


In [120]:
seed = 'T'
generated_name = generate_name(model, seed, max_len)
print(f'Generated name: {generated_name}')
print(f'Name is new? {generated_name + " " not in names}')

Generated name: Taũgis
Name is new? True


Here we go, we have a Lithuanian name generator!