# Namesformer

Before we get into the lecture, you can play with the trained model here: [Namesformer Streamlit app](https://namesformer.streamlit.app/).

Inspired by Andrej Karpathy's lecture [makemore](https://www.youtube.com/watch?v=PaCmpygFfXo&t=131s), which covers English name generation.

The code was written almost entirely by ChatGPT, with minimal corrections. My first query was:

```
I am preparing a lecture for my students on AI basics. They already know how to use attention in PyTorch to create self-attention layers. What I want to explain them is how to make a simplest possible transformer architecture (with minimal amount of code).
 As a dataset I will use a csv with names:
    john
    peter
    mike
    ...
And the goal will be to generate more names that sound name-like.
Give me an implementation with PyTorch trying to keep it as minimal as possible.
```

After that I had to ask for a couple of corrections, such as avoiding the Transformer layer, adding comments, and fixing a bug in token indexing. All were relatively easy to spot, and in less than an hour this notebook was generating plausible-sounding names.

I decided to replace the original dataset, since I found a list of Lithuanian names that is easy to extract from [vardai.vlkk.lt](https://vardai.vlkk.lt) using the following code snippet:

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

In [2]:
import requests
from bs4 import BeautifulSoup

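names = []
# Page keys of the alphabetical name index on vardai.vlkk.lt
for key in ['a', 'b', 'c', 'c-2', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
            'm', 'n', 'o', 'p', 'r', 's', 's-2', 't', 'u', 'v', 'z', 'z-2']:
    url = f'https://vardai.vlkk.lt/sarasas/{key}/'
    response = requests.get(url)
    response.raise_for_status()  # Stop early if a page fails to load
    soup = BeautifulSoup(response.text, 'html.parser')
    # The '--man' modifier class selects the male names on the page
    links = soup.find_all('a', class_='names_list__links names_list__links--man')
    names += [name.text for name in links]

# One name per line, with a 'name' header so pandas can read the file as a CSV
np.savetxt('vardai.txt', names, fmt='%s', header='name', comments='', newline='\n')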

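If you want to play with English names, download them from [here](https://github.com/karpathy/makemore/blob/master/names.txt) and use *names.txt* instead of *vardai.txt*. That file has no header row, so either add a first line containing `name` or pass `names=['name']` to `pd.read_csv`.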

Let's add a space at the end of each name to mark where it ends. We will need a dictionary that encodes characters as integers and back, so let's wrap that logic in a class.

In [3]:
class NameDataset(Dataset):
    def __init__(self, csv_file):
        self.names = pd.read_csv(csv_file)['name'].values
        self.chars = sorted(list(set(''.join(self.names) + ' ')))  # Including a padding character
        self.char_to_int = {c: i for i, c in enumerate(self.chars)}
        self.int_to_char = {i: c for c, i in self.char_to_int.items()}
        self.vocab_size = len(self.chars)

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx] + ' '  # Adding padding character at the end
        encoded_name = [self.char_to_int[char] for char in name]
        return torch.tensor(encoded_name)

In [4]:
dataset = NameDataset('vardai.txt')
len(dataset)

3850

In [5]:
dataset[0]

tensor([ 1, 82, 24, 23, 40,  0])

In [6]:
[dataset.int_to_char[int(char)] for char in dataset[0]]

['A', '̃', 'b', 'a', 's', ' ']

Note that this dataset is not simple, since it uses accent marks (stored as separate combining characters, as the output above shows) and capital letters. Let's intentionally keep it like this and see if the model can figure it out. When you do it yourself, feel free to remove the accents and use only lowercase letters.
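
If you do want the simpler variant, here is a minimal sketch using only the standard library. The helper name `simplify_name` is made up for this example and is not part of the notebook. Be aware that it also flattens genuine Lithuanian letters such as `š` or `ė` into their base letters, which may or may not be what you want.

```python
import unicodedata

def simplify_name(name):
    """Lowercase a name and drop combining marks (accents, carons, ogoneks)."""
    # NFD splits letters such as 'š' into a base letter plus a combining mark
    decomposed = unicodedata.normalize('NFD', name)
    # Category 'Mn' (nonspacing mark) covers the combining marks
    stripped = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return stripped.lower()

print(simplify_name('A\u0303bas'))  # 'A' + combining tilde, as in the dataset -> abas
print(simplify_name('Žygis'))       # -> zygis
```

Applying this to every name before building the vocabulary should shrink it considerably, which makes the task easier for the model.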

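We need a way to construct padded batches. Note that the padding value 0 is also the index of the end-of-name space, so padded positions look like repeated end markers.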

In [7]:
# Custom collate function for padding
def pad_collate(batch):
    padded_seqs = pad_sequence(batch, batch_first=True, padding_value=0)
    input_seq = padded_seqs[:, :-1]
    target_seq = padded_seqs[:, 1:]
    return input_seq, target_seq

dataloader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=pad_collate)

Make sure you understand what this generates and why.

In [8]:
next(iter(dataloader))

(tensor([[ 1, 75, 26, 31, 40,  0,  0,  0,  0,  0,  0,  0,  0],
         [16, 23, 81, 42, 34, 31, 40,  0,  0,  0,  0,  0,  0],
         [17, 23, 82, 40, 35, 23, 36, 41, 23, 40,  0,  0,  0],
         [ 7, 31, 80, 36, 41, 23, 42, 41, 23, 40,  0,  0,  0],
         [21, 31, 80, 33, 31, 40,  0,  0,  0,  0,  0,  0,  0],
         [ 8, 23, 39, 37, 34, 26,  0,  0,  0,  0,  0,  0,  0],
         [ 5, 80, 34, 31, 36, 29, 23, 40,  0,  0,  0,  0,  0],
         [ 7, 37, 81, 41, 23, 39, 23, 40,  0,  0,  0,  0,  0],
         [13, 37, 26, 27, 80, 40, 41, 23, 40,  0,  0,  0,  0],
         [14, 37, 23, 30,  0,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 1, 34, 27, 33, 40,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 5, 80, 26, 31, 40,  0,  0,  0,  0,  0,  0,  0,  0],
         [ 4, 37, 80, 39, 23, 40,  0,  0,  0,  0,  0,  0,  0],
         [16, 23, 82, 41, 39, 31, 40,  0,  0,  0,  0,  0,  0],
         [ 5, 28, 29, 27, 36, 31, 32, 42, 40,  0,  0,  0,  0],
         [ 2, 23, 34, 41, 39, 23, 35, 31, 27, 82, 32, 4
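
To see what `pad_collate` does without the noise of real data, here is a toy sketch with two made-up encoded "names" (the integer values are arbitrary; only the trailing 0 mirrors the notebook's end-of-name space). The target is simply the input shifted left by one position, so at every position the model is asked to predict the next character.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "names" already encoded as integers; 0 stands for the end-of-name space
batch = [torch.tensor([5, 6, 7, 0]), torch.tensor([8, 9, 0])]

padded = pad_sequence(batch, batch_first=True, padding_value=0)
input_seq = padded[:, :-1]   # everything except the last position
target_seq = padded[:, 1:]   # the same sequences shifted left by one

print(padded)      # rows: [5, 6, 7, 0] and [8, 9, 0, 0]
print(input_seq)   # rows: [5, 6, 7]    and [8, 9, 0]
print(target_seq)  # rows: [6, 7, 0]    and [9, 0, 0]
```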

Our transformer will be based on self-attention.

In [9]:
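class MinimalTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, forward_expansion):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        # Learned positional encoding for sequences of up to 100 characters
        self.positional_encoding = nn.Parameter(torch.randn(1, 100, embed_size))
        # batch_first=True because our batches are shaped (batch, seq_len);
        # the default (False) would treat the batch dimension as the sequence
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=num_heads,
            dim_feedforward=forward_expansion * embed_size, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=1)
        self.output_layer = nn.Linear(embed_size, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: -inf above the diagonal, so position t cannot see later characters
        mask = torch.triu(torch.full((seq_len, seq_len), float('-inf'), device=x.device), diagonal=1)
        x = self.embed(x) + self.positional_encoding[:, :seq_len, :]
        x = self.transformer_encoder(x, mask=mask)
        x = self.output_layer(x)  # Logits of shape (batch, seq_len, vocab_size)
        return x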

# Training Loop
def train_model(model, dataloader, epochs=10):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters())

    for epoch in range(epochs):
        model.train()  # Ensure the model is in training mode
        total_loss = 0.0
        batch_count = 0

        for batch_idx, (input_seq, target_seq) in enumerate(dataloader):
            optimizer.zero_grad()
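            output = model(input_seq)  # (batch, seq_len, vocab_size)
            # CrossEntropyLoss expects the class dimension second: (batch, vocab_size, seq_len)
            loss = criterion(output.transpose(1, 2), target_seq)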
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            batch_count += 1

        average_loss = total_loss / batch_count
        print(f'Epoch {epoch+1}, Average Loss: {average_loss}')

model = MinimalTransformer(vocab_size=dataset.vocab_size, embed_size=128, num_heads=8, forward_expansion=4)
train_model(model, dataloader)

Epoch 1, Average Loss: 1.560861560923994
Epoch 2, Average Loss: 1.3083962241480174
Epoch 3, Average Loss: 1.275466939634528
Epoch 4, Average Loss: 1.2612562283011508
Epoch 5, Average Loss: 1.2426953862521275
Epoch 6, Average Loss: 1.2460837374048785
Epoch 7, Average Loss: 1.2364239564611892
Epoch 8, Average Loss: 1.2283017211709142
Epoch 9, Average Loss: 1.226946584941927
Epoch 10, Average Loss: 1.2349231972182093
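
One detail worth checking in any next-character model is causal masking: when predicting the character at position t+1, attention should only look at positions up to t, otherwise the model can peek at the answer during training. The sketch below is a standalone illustration with dummy scores, not code taken from the notebook; it shows how an upper-triangular mask of `-inf` values turns into zero attention weights after softmax.

```python
import torch

seq_len = 4
# -inf above the diagonal: position t may attend only to positions <= t
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

scores = torch.zeros(seq_len, seq_len)  # dummy attention scores (all equal)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
# Row 0 puts all weight on position 0; row 3 spreads 0.25 over positions 0..3
```

A mask of this shape can be passed as the `mask` argument of `nn.TransformerEncoder`.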


Now let's generate a name by predicting the next letter, one character at a time. The model returns logits, which we turn into probabilities with softmax and then use to sample a character from that distribution.

In [10]:
def sample(model, dataset, start_str='a', max_length=20):
    model.eval()  # Switch to evaluation mode
    with torch.no_grad():
        # Convert start string to tensor
        chars = [dataset.char_to_int[c] for c in start_str]
        input_seq = torch.tensor(chars).unsqueeze(0)  # Add batch dimension
        
        output_name = start_str
        for _ in range(max_length - len(start_str)):
            output = model(input_seq)
            
            # Get the last character from the output
            probabilities = torch.softmax(output[0, -1], dim=0)
            # Sample a character from the probability distribution
            next_char_idx = torch.multinomial(probabilities, 1).item()
            next_char = dataset.int_to_char[next_char_idx]
            
            if next_char == ' ':  # Assume ' ' is your end-of-sequence character
                break
            
            output_name += next_char
            # Update the input sequence for the next iteration
            input_seq = torch.cat([input_seq, torch.tensor([[next_char_idx]])], dim=1)
        
        return output_name

# After training your model, generate a name starting with a specific letter
for _ in range(10):
    generated_name = sample(model, dataset, start_str='R')
    print(generated_name)

Rãvelijovas
Rãgmenas
Retuvytinton
Raĩnas
Rĩtris
Ranolìas
Roanis
Ranautontas
Rõmicis
Ralmas
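
The sampling step inside `sample` can be looked at in isolation. In this small sketch the logits are made up and stand for three possible characters; drawing many samples shows that `torch.multinomial` picks each index with a frequency close to its softmax probability, which is why repeated calls give different names.

```python
import torch

torch.manual_seed(0)  # only to make the sketch repeatable

logits = torch.tensor([2.0, 1.0, 0.0])        # made-up scores for three characters
probabilities = torch.softmax(logits, dim=0)  # roughly [0.665, 0.245, 0.090]

# Draw many samples to see that the frequencies follow the probabilities
draws = torch.multinomial(probabilities, num_samples=10_000, replacement=True)
frequencies = torch.bincount(draws, minlength=3) / 10_000
print(probabilities)
print(frequencies)
```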


Not bad! Note that the last generated name is not in our names list.

In [11]:
generated_name

'Ralmas'

In [12]:
generated_name in names

False

Let's train for longer.

In [13]:
train_model(model, dataloader, epochs=200)

Epoch 1, Average Loss: 1.2292306038958967
Epoch 2, Average Loss: 1.2296753338545807
Epoch 3, Average Loss: 1.2113789906186505
Epoch 4, Average Loss: 1.2123224966782185
Epoch 5, Average Loss: 1.2134187822499551
Epoch 6, Average Loss: 1.2100671367211775
Epoch 7, Average Loss: 1.2046438445729657
Epoch 8, Average Loss: 1.216528780696806
Epoch 9, Average Loss: 1.2090119542169178
Epoch 10, Average Loss: 1.204182753385591
Epoch 11, Average Loss: 1.19694076096716
Epoch 12, Average Loss: 1.2097524177929586
Epoch 13, Average Loss: 1.2037021379825497
Epoch 14, Average Loss: 1.2041516210422043
Epoch 15, Average Loss: 1.2032628212093321
Epoch 16, Average Loss: 1.1915799916283158
Epoch 17, Average Loss: 1.191510959105058
Epoch 18, Average Loss: 1.204029481273052
Epoch 19, Average Loss: 1.1954617096372873
Epoch 20, Average Loss: 1.1981768662279302
Epoch 21, Average Loss: 1.2015780131678937
Epoch 22, Average Loss: 1.2011014198468737
Epoch 23, Average Loss: 1.195754015248669
Epoch 24, Average Loss: 1.1

Epoch 191, Average Loss: 1.1694683334058966
Epoch 192, Average Loss: 1.167220690526253
Epoch 193, Average Loss: 1.1743362151886807
Epoch 194, Average Loss: 1.1695056434505242
Epoch 195, Average Loss: 1.1659075292673977
Epoch 196, Average Loss: 1.180610185319727
Epoch 197, Average Loss: 1.1654542231362712
Epoch 198, Average Loss: 1.1696561141447588
Epoch 199, Average Loss: 1.1685029637715048
Epoch 200, Average Loss: 1.1642742590470747


In [14]:
for _ in range(10):
    generated_name = sample(model, dataset, start_str='R')
    print(generated_name)

Rãšmaus
Ralionijus
Raydòlonas
Rãvijus
Reonaldas
Rijuas
Ror
Raĩslãkas
Rìrmoldas
Rõviudas


If we want the model to be more creative, we can add temperature/creativity control.

**Question:** does temperature increase or decrease model creativity? What are its min/max values?

In [15]:
def sample(model, dataset, start_str='a', max_length=20, temperature=1.0):
    assert temperature > 0, "Temperature must be greater than 0"
    model.eval()  # Switch model to evaluation mode
    with torch.no_grad():
        # Convert start string to tensor
        chars = [dataset.char_to_int[c] for c in start_str]
        input_seq = torch.tensor(chars).unsqueeze(0)  # Add batch dimension
        
        output_name = start_str
        for _ in range(max_length - len(start_str)):
            output = model(input_seq)
            
            # Apply temperature scaling
            logits = output[0, -1] / temperature
            probabilities = torch.softmax(logits, dim=0)
            
            # Sample a character from the probability distribution
            next_char_idx = torch.multinomial(probabilities, 1).item()
            next_char = dataset.int_to_char[next_char_idx]
            
            if next_char == ' ':  # Assume ' ' is your end-of-sequence character
                break
            
            output_name += next_char
            # Update the input sequence for the next iteration
            input_seq = torch.cat([input_seq, torch.tensor([[next_char_idx]])], dim=1)
        
        return output_name

# Example usage with different temperatures
print('More confident:')
for _ in range(10):
    print(' ', sample(model, dataset, start_str='R', temperature=0.5))  # More confident

print('\nMore diverse/creative:')
for _ in range(10):
    print(' ', sample(model, dataset, start_str='R', temperature=1.5))  # More diverse

More confident:
  Reraras
  Raugìlas
  Raũtas
  Ravìmas
  Rìlijus
  Rãtas
  Rìlius
  Reris
  Rãgas
  Rìlijus

More diverse/creative:
  Rntemijus
  Rimtžvaus
  Romènis
  Romūdiutas
  Ruolienas
  Rẽdỹjuis
  Reapẽndas
  Rū̃drans
  Rivì
  Rámìk
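
To see why the two settings behave so differently, it helps to look at the probabilities themselves. The sketch below uses plain Python and made-up logits for three characters; `softmax_with_temperature` is a helper written for this example, mirroring the `logits / temperature` step in `sample`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three characters
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these logits the most likely character gets a probability of about 0.87 at temperature 0.5 but only about 0.56 at 1.5, so unlikely characters are sampled more often as the temperature rises.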


Here we go, we have a Lithuanian name generator!

Next we can save the model and, with some help from ChatGPT, build a simple [Streamlit](https://streamlit.io/) app (https://namesformer.streamlit.app/).

**TASK:** add female names to the dataset, retrain the model (or make a second one) and create your own Streamlit app (you do not need a names leaderboard, since that requires a database). Any improvements are welcome.

In [16]:
torch.save(model, 'namesformer_model.pt')
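
Saving the whole model object like this pickles it, so loading it in another script (for example a Streamlit app) requires the `MinimalTransformer` class to be importable there, and recent PyTorch releases also need `weights_only=False` passed to `torch.load`. A common alternative is to save only the parameters. The sketch below shows that pattern with a tiny stand-in model and a temporary file; the names `tiny_model` and `tiny_model_state.pt` are made up for the example.

```python
import os
import tempfile

import torch
import torch.nn as nn

tiny_model = nn.Linear(4, 2)  # stand-in for the trained model

path = os.path.join(tempfile.mkdtemp(), 'tiny_model_state.pt')
torch.save(tiny_model.state_dict(), path)  # save only the parameters

# To load, rebuild the architecture first, then fill in the saved weights
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to evaluation mode before generating
```

For the notebook's model the same idea would mean constructing `MinimalTransformer` with the original hyperparameters and then calling `load_state_dict` on it.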